# Adding Unified Kernels

Ginkgo provides a unified kernel framework to facilitate adding new kernels to all of the non-`Reference` backends with one kernel definition. 
These unified kernels are defined in `common/unified/`. The key to these kernels is the `run_kernel_` functions defined by Ginkgo. Each unified kernel can launch one or more
device/OpenMP kernels through the `run_kernel_` options.

There are three main types of `run_kernel_` routines in Ginkgo: standard, solver, and reduction.


## Standard kernels

Ginkgo has two types of standard unified kernels: 1D and 2D. The `size` parameter determines which kind of
kernel is launched, with a single `size_type` value launching a 1D kernel, while a `gko::dim<2>` parameter
will launch a 2D kernel. 

The interface for calling a standard unified kernel within a Ginkgo kernel is

```
 run_kernel(std::shared_ptr<Executor> exec, // Executor
            KernelFunction fn, // lambda function defining kernel
            [size_type or dim<2>] size,
            KernelArgs&&... args // parameters passed on to the lambda function 
 );
```

For a 1D kernel, the first parameter of `fn` must be a flat index `i`. For the GPU backends, this is the thread ID as defined by `thread::get_thread_id_flat`.
This index is *not* included in `args` in the call to `run_kernel`. So, for example, consider the simple 1D `scale` kernel (part of `common/unified/matrix/csr_kernels.cpp`).
Its lambda function is

```
[] GKO_KERNEL(auto nnz, auto alpha, auto x) { x[nnz] *= alpha[0]; }
```

where `nnz` is the index variable, and `alpha` is the scaling factor for `x`. Each thread scales its entry in `x` by `alpha`. The full kernel launch code is

```
 run_kernel(
        exec,
        [] GKO_KERNEL(auto nnz, auto alpha, auto x) { x[nnz] *= alpha[0]; },
        x->get_num_stored_elements(), // size
        alpha->get_const_values(), // alpha -- first non-index arg of the lambda function
        x->get_values()); // x -- second non-index arg
```

For a 2D kernel, the first two parameters taken by `fn` must correspond to `row` and `col`. For the OpenMP backend, the lambda function will be called inside loops: an outer loop on `rows`,
which is parallelized with `#pragma omp parallel`, and explicitly unrolled inner loop(s) for `col`. For the GPU backends, `row` and `col` are determined by 

```
row = tidx / cols;
col = tidx % cols;
```
with `tidx` the flat thread ID. As with the 1D kernel, these two parameters are *not* included in the parameters passed in `args`, since they are part of Ginkgo's kernel
launch infrastructure.

## Solver kernels

Solver kernels are like standard 2D kernels, but with additional support to handle the many `Dense` objects that may be passed as arguments. Often, many
of the `Dense` parameters will either have the same stride, or have a stride of 1 (and thus be able to be treated directly as a plain array of data).
Ginkgo's solver kernel framework allows us to take advantage of this fact and simplify the definition of the kernel launch, without needing to pass a stride
parameter for every `Dense` argument.

The interface for calling a solver kernel looks like

```
 run_kernel_solver(std::shared_ptr<Executor> exec,
                   KernelFunction fn, // lambda function defining kernel
                   dim<2> size,
                   size_type default_stride, // default stride for Dense args 
                   KernelArgs&&... args // parameters passed on to the lambda function
 );
```

The solver kernels work like 2D standard kernels, in that the first two parameters of the lambda function must be `row` and `col`, which are handled the same way
as in the standard kernel version. The main difference for solver kernels is the addition of the `default_stride` parameter and the use of helper functions in 
the `args` list. These helper functions can be seen in `common/unified/base/kernel_launch_solver.hpp`.

For the purposes of *using* the unified solver kernel framework, it's enough to know that we can mark `Dense` kernel arguments as having either stride 1 (a `row_vector`) 
or `default_stride`. For example, consider the `step_2` kernel in the CG solver, which updates the solution (`x`) and residual (`r`) vectors:

```
 run_kernel_solver(
        exec,
        [] GKO_KERNEL(auto row, auto col, auto x, auto r, auto p, auto q,        
                      auto beta, auto rho, auto stop) {                          
            if (!stop[col].has_stopped()) {
                auto tmp = safe_divide(rho[col], beta[col]);                     
                x(row, col) += tmp * p(row, col);                                
                r(row, col) -= tmp * q(row, col);                                
            }                                                                    
        },                                                                       
        x->get_size(), // size for 2D kernel launch
        r->get_stride(), // default stride
        x, // first non-index arg
        default_stride(r), default_stride(p), default_stride(q),
        row_vector(beta), row_vector(rho), *stop_status);    
```

Note that of the `Dense` arguments, `r`, `p`, and `q` are marked as having the default stride, which is the stride of `r`, while `beta` and `rho` have stride 1
(`beta` and `rho` are scalar values for each right hand side being solved). `x` does not have a stride marked because it is the solution vector. Unlike the other
`Dense` arguments, is passed to the solver by the user, rather than created by the solver. `x` could have a different stride than its number of columns if, for example,
it is actually a submatrix of a larger matrix.

## Reduction kernels

The final kind of unified kernel in Ginkgo is the reduction kernel. These kernels perform a computation,
then do a reduction operation on the resulting temporary values computed by each thread. 

Reduction kernels come in 4 flavors: 1D, 2D, row, and column.
Their interfaces are a little more complicated than the standard or solver kernels:

```
/* 1D or 2D reduction kernel -- reduce to single value */
  
 run_kernel_reduction(std::shared_ptr<Executor> exec,
                      KernelFunction fn, // main lambda function defining kernel
                      ReductionOp op, // lambda function for reduction operation
                      FinalizeOp finalize, // lambda function for finalization
                      ValueType identity, // the identity operator
                      ValueType* result, // pointer to location for result
                      [size_type or dim<2>] size,
                      KernelArgs&&... args)

/* Row reduction kernel -- reduce across rows, produce a column */

 run_kernel_row_reduction(std::shared_ptr<Executor> exec,
                          KernelFunction fn, // main lambda function defining kernel
                          ReductionOp op, // lambda function for reduction operation
                          FinalizeOp finalize, // lambda function for finalization
                          ValueType identity, // the identity operator
                          ValueType* result, // pointer to location for result
                          size_type result_stride, // stride for result column 
                          dim<2> size, // size must be 2D
                          KernelArgs&&... args)

/* Column reduction kernel -- reduce across columns, produce a row */
 
 run_kernel_col_reduction(std::shared_ptr<Executor> exec,
                          KernelFunction fn, // main lambda function defining kernel
                          ReductionOp op, // lambda function for reduction operation
                          FinalizeOp finalize, // lambda function for finalization
                          ValueType identity, // the identity operator
                          ValueType* result, // pointer to location for result
                          dim<2> size, // size must be 2D
                          KernelArgs&&... args)
```

In Figure 1 we give a visual representation of the four types of reduction kernels.

:::{figure} reduction-kernels-schematic.png
Using the four types of unified reduction kernels in Ginkgo.
:::

As seen in the interfaces, all reduction kernels use three lambda functions rather than one. There is still a main `KernelFunction` defining
the kernel; this specifies the computation done by each thread prior to the reduction. The reduction operation is also defined via 
lambda function. The `finalize` function can be used to perform one final operation prior to storing the result.

Ginkgo provides helpful macros for two common reduction cases: `GKO_KERNEL_REDUCE_SUM` and `GKO_KERNEL_REDUCE_MAX`. These macros expand to define
the reduction operation, finalization operation, and identity operator for the kernel; in other words, the `op`, `finalize`, and `identity` 
parameters are replaced with a single macro when you call `run_kernel_[row/col_]reduction`. The usage then looks like the `count_nonzeros_per_row` kernel
for a `Dense` matrix:

```
 run_kernel_row_reduction(
     exec,
     [] GKO_KERNEL(auto i, auto j, auto mtx) {
         return is_nonzero(mtx(i, j)) ? 1 : 0;
     },
     GKO_KERNEL_REDUCE_SUM(IndexType), // use macro to indicate a standard summation reduction
     result, // start of result column in memory
     1, // stride of result
     mtx->get_size(), // 2D size for kernel
     mtx // first (and only) non-index arg for fn
 );
```