Adding Unified Kernels

Ginkgo provides a unified kernel framework to facilitate adding new kernels to all of the non-Reference backends with a single kernel definition. These unified kernels are defined in common/unified/. The key to these kernels is the run_kernel_ family of functions defined by Ginkgo. Each unified kernel can launch one or more device/OpenMP kernels through the run_kernel_ variants.

There are three main types of run_kernel_ routines in Ginkgo: standard, solver, and reduction.

Standard kernels

Ginkgo has two types of standard unified kernels: 1D and 2D. The size parameter determines which kind of kernel is launched: a single size_type value launches a 1D kernel, while a gko::dim<2> parameter launches a 2D kernel.

The interface for calling a standard unified kernel within a Ginkgo kernel is

 run_kernel(std::shared_ptr<Executor> exec, // Executor
            KernelFunction fn, // lambda function defining kernel
            [size_type or dim<2>] size,
            KernelArgs&&... args // parameters passed on to the lambda function 
 );

For a 1D kernel, the first parameter of fn must be a flat index i. For the GPU backends, this is the thread ID as defined by thread::get_thread_id_flat. This index is not included in args in the call to run_kernel. So, for example, consider the simple 1D scale kernel (part of common/unified/matrix/csr_kernels.cpp). Its lambda function is

[] GKO_KERNEL(auto nnz, auto alpha, auto x) { x[nnz] *= alpha[0]; }

where nnz is the index variable, and alpha is the scaling factor for x. Each thread scales its entry in x by alpha. The full kernel launch code is

 run_kernel(
        exec,
        [] GKO_KERNEL(auto nnz, auto alpha, auto x) { x[nnz] *= alpha[0]; },
        x->get_num_stored_elements(), // size
        alpha->get_const_values(), // alpha -- first non-index arg of the lambda function
        x->get_values()); // x -- second non-index arg

For a 2D kernel, the first two parameters taken by fn must correspond to row and col. For the OpenMP backend, the lambda function is called inside nested loops: an outer loop over rows, which is parallelized with #pragma omp parallel for, and explicitly unrolled inner loop(s) over col. For the GPU backends, row and col are determined by

row = tidx / cols;
col = tidx % cols;

with tidx the flat thread ID. As with the 1D kernel, these two parameters are not included in the parameters passed in args, since they are part of Ginkgo’s kernel launch infrastructure.

Solver kernels

Solver kernels are like standard 2D kernels, but with additional support to handle the many Dense objects that may be passed as arguments. Often, many of the Dense parameters will either share the same stride, or have a stride of 1 (and can therefore be treated directly as a plain array of data). Ginkgo’s solver kernel framework takes advantage of this fact to simplify the definition of the kernel launch, without requiring a separate stride parameter for every Dense argument.

The interface for calling a solver kernel looks like

 run_kernel_solver(std::shared_ptr<Executor> exec,
                   KernelFunction fn, // lambda function defining kernel
                   dim<2> size,
                   size_type default_stride, // default stride for Dense args 
                   KernelArgs&&... args // parameters passed on to the lambda function
 );

The solver kernels work like 2D standard kernels, in that the first two parameters of the lambda function must be row and col, which are handled the same way as in the standard kernel version. The main difference for solver kernels is the addition of the default_stride parameter and the use of helper functions in the args list. These helper functions can be seen in common/unified/base/kernel_launch_solver.hpp.

For the purposes of using the unified solver kernel framework, it’s enough to know that we can mark Dense kernel arguments as having either stride 1 (a row_vector) or default_stride. For example, consider the step_2 kernel in the CG solver, which updates the solution (x) and residual (r) vectors:

 run_kernel_solver(
        exec,
        [] GKO_KERNEL(auto row, auto col, auto x, auto r, auto p, auto q,        
                      auto beta, auto rho, auto stop) {                          
            if (!stop[col].has_stopped()) {
                auto tmp = safe_divide(rho[col], beta[col]);                     
                x(row, col) += tmp * p(row, col);                                
                r(row, col) -= tmp * q(row, col);                                
            }                                                                    
        },                                                                       
        x->get_size(), // size for 2D kernel launch
        r->get_stride(), // default stride
        x, // first non-index arg
        default_stride(r), default_stride(p), default_stride(q),
        row_vector(beta), row_vector(rho), *stop_status);    

Note that of the Dense arguments, r, p, and q are marked as having the default stride, which is the stride of r, while beta and rho have stride 1 (beta and rho hold one scalar value per right-hand side being solved). x has no stride marker because, unlike the other Dense arguments, it is the solution vector: it is passed to the solver by the user rather than created by the solver. x could have a stride different from its number of columns if, for example, it is actually a submatrix of a larger matrix.

Reduction kernels

The final kind of unified kernel in Ginkgo is the reduction kernel. These kernels perform a computation, then do a reduction operation on the resulting temporary values computed by each thread.

Reduction kernels come in four flavors: 1D, 2D, row, and column. Their interfaces are a little more complicated than those of the standard or solver kernels:

/* 1D or 2D reduction kernel -- reduce to single value */
  
 run_kernel_reduction(std::shared_ptr<Executor> exec,
                      KernelFunction fn, // main lambda function defining kernel
                      ReductionOp op, // lambda function for reduction operation
                      FinalizeOp finalize, // lambda function for finalization
                      ValueType identity, // the identity element for the reduction
                      ValueType* result, // pointer to location for result
                      [size_type or dim<2>] size,
                      KernelArgs&&... args)

/* Row reduction kernel -- reduce across rows, produce a column */

 run_kernel_row_reduction(std::shared_ptr<Executor> exec,
                          KernelFunction fn, // main lambda function defining kernel
                          ReductionOp op, // lambda function for reduction operation
                          FinalizeOp finalize, // lambda function for finalization
                          ValueType identity, // the identity element for the reduction
                          ValueType* result, // pointer to location for result
                          size_type result_stride, // stride for result column 
                          dim<2> size, // size must be 2D
                          KernelArgs&&... args)

/* Column reduction kernel -- reduce across columns, produce a row */
 
 run_kernel_col_reduction(std::shared_ptr<Executor> exec,
                          KernelFunction fn, // main lambda function defining kernel
                          ReductionOp op, // lambda function for reduction operation
                          FinalizeOp finalize, // lambda function for finalization
                          ValueType identity, // the identity element for the reduction
                          ValueType* result, // pointer to location for result
                          dim<2> size, // size must be 2D
                          KernelArgs&&... args)

In Figure 1 we give a visual representation of the four types of reduction kernels.

Figure 1: Using the four types of unified reduction kernels in Ginkgo. (image: reduction-kernels-schematic.png)

As seen in the interfaces, all reduction kernels take three lambda functions rather than one. There is still a main KernelFunction defining the kernel; it specifies the computation done by each thread prior to the reduction. The reduction operation op is also defined via a lambda function. The finalize function can be used to perform one final operation on the reduced value before the result is stored.

Ginkgo provides helpful macros for two common reduction cases: GKO_KERNEL_REDUCE_SUM and GKO_KERNEL_REDUCE_MAX. These macros expand to define the reduction operation, finalization operation, and identity element for the kernel; in other words, the op, finalize, and identity parameters are replaced with a single macro when you call run_kernel_[row/col_]reduction. Usage then looks like the count_nonzeros_per_row kernel for a Dense matrix:

 run_kernel_row_reduction(
     exec,
     [] GKO_KERNEL(auto i, auto j, auto mtx) {
         return is_nonzero(mtx(i, j)) ? 1 : 0;
     },
     GKO_KERNEL_REDUCE_SUM(IndexType), // use macro to indicate a standard summation reduction
     result, // start of result column in memory
     1, // stride of result
     mtx->get_size(), // 2D size for kernel
     mtx // first (and only) non-index arg for fn
 );