|
| Executor (Executor &)=delete |
|
| Executor (Executor &&)=delete |
|
Executor & | operator= (Executor &)=delete |
|
Executor & | operator= (Executor &&)=delete |
|
virtual void | run (const Operation &op) const =0 |
| Runs the specified Operation using this Executor. More...
|
|
template<typename ClosureOmp , typename ClosureCuda , typename ClosureHip , typename ClosureDpcpp > |
void | run (const ClosureOmp &op_omp, const ClosureCuda &op_cuda, const ClosureHip &op_hip, const ClosureDpcpp &op_dpcpp) const |
| Runs one of the passed in functors, depending on the Executor type. More...
|
|
template<typename T > |
T * | alloc (size_type num_elems) const |
| Allocates memory in this Executor. More...
|
|
void | free (void *ptr) const noexcept |
| Frees memory previously allocated with Executor::alloc(). More...
|
|
template<typename T > |
void | copy_from (ptr_param< const Executor > src_exec, size_type num_elems, const T *src_ptr, T *dest_ptr) const |
| Copies data from another Executor. More...
|
|
template<typename T > |
void | copy (size_type num_elems, const T *src_ptr, T *dest_ptr) const |
| Copies data within this Executor. More...
|
|
template<typename T > |
T | copy_val_to_host (const T *ptr) const |
| Retrieves a single element at the given location from executor memory. More...
|
|
virtual std::shared_ptr< Executor > | get_master () noexcept=0 |
| Returns the master OmpExecutor of this Executor. More...
|
|
virtual std::shared_ptr< const Executor > | get_master () const noexcept=0 |
| Returns the master OmpExecutor of this Executor. More...
|
|
virtual void | synchronize () const =0 |
| Synchronize the operations launched on the executor with its master.
|
|
void | add_logger (std::shared_ptr< const log::Logger > logger) override |
|
void | remove_logger (const log::Logger *logger) override |
|
void | set_log_propagation_mode (log_propagation_mode mode) |
| Sets the logger event propagation mode for the executor. More...
|
|
bool | should_propagate_log () const |
| Returns true iff events occurring at an object created on this executor should be logged at propagating loggers attached to this executor, and there is at least one such propagating logger. More...
|
|
bool | memory_accessible (const std::shared_ptr< const Executor > &other) const |
| Verifies whether the executors share the same memory. More...
|
|
virtual scoped_device_id_guard | get_scoped_device_id_guard () const =0 |
|
void | add_logger (std::shared_ptr< const Logger > logger) override |
|
void | remove_logger (const Logger *logger) override |
|
void | remove_logger (ptr_param< const Logger > logger) |
|
const std::vector< std::shared_ptr< const Logger > > & | get_loggers () const override |
|
void | clear_loggers () override |
|
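The memory-management and dispatch members listed above can be combined as in the following sketch. It is illustrative only: the executor creation calls reuse the idiom of the snippets in the description below, the header name and the lambda bodies are assumptions, and, as with the other snippets on this page, the statements are shown outside of a function body.
#include <algorithm>
#include <ginkgo/ginkgo.hpp>
auto omp = gko::create<gko::OmpExecutor>();
auto cuda = gko::create<gko::CudaExecutor>(0, omp);
// Allocate 100 doubles on the host and on the device using alloc().
double *host_data = omp->alloc<double>(100);
double *device_data = cuda->alloc<double>(100);
std::fill_n(host_data, 100, 1.0); // fill the host buffer on the CPU
// Copy the host buffer to the device; the call matches copy_from() above.
cuda->copy_from(omp, 100, host_data, device_data);
// Read a single value back from device memory.
double first = cuda->copy_val_to_host(device_data); // first == 1.0
// Dispatch one of four functors depending on the executor type;
// the bodies here are placeholders for backend-specific kernels.
cuda->run([&] { /* OpenMP version */ }, [&] { /* CUDA version */ },
          [&] { /* HIP version */ }, [&] { /* DPC++ version */ });
// Wait for outstanding operations and release the buffers.
cuda->synchronize();
cuda->free(device_data);
omp->free(host_data);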
The first step in using the Ginkgo library consists of creating an executor.
Executors are used to specify the location for the data of linear algebra objects, and to determine where the operations will be executed. Ginkgo currently supports five different executor types:
- OmpExecutor specifies that the data should be stored and the associated operations executed on an OpenMP-supporting device (e.g. host CPU);
- CudaExecutor specifies that the data should be stored and the operations executed on an NVIDIA GPU accelerator;
- HipExecutor specifies that the data should be stored and the operations executed on either an NVIDIA or AMD GPU accelerator;
- DpcppExecutor specifies that the data should be stored and the operations executed on hardware supporting DPC++;
- ReferenceExecutor executes a non-optimized reference implementation, which can be used to debug the library.
The following code snippet demonstrates the simplest possible use of the Ginkgo library:
auto omp = gko::create<gko::OmpExecutor>();
auto A = gko::read_from_mtx<gko::matrix::Csr<float>>("A.mtx", omp);
First, we create an OMP executor, which will be used in the next line to specify where we want the data for the matrix A to be stored. The second line reads a matrix from the Matrix Market file 'A.mtx' and stores the data on the CPU in CSR format (gko::matrix::Csr is a Ginkgo matrix class which stores its data in CSR format). At this point, matrix A is bound to the CPU, and any routines called on it will be performed on the CPU. This approach is usually desired in sparse linear algebra, as the cost of individual operations is several orders of magnitude lower than the cost of copying the matrix to the GPU.
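To make the binding concrete, the following sketch applies A to a vector while both operands live on the OMP executor, so the sparse matrix-vector product runs on the CPU. The vector type, its creation call, and the apply() invocation are assumptions about the matrix interface and are not taken from this page.
// Sketch only: gko::matrix::Dense and apply() are assumed here.
auto b = gko::matrix::Dense<float>::create(omp, gko::dim<2>{A->get_size()[1], 1});
auto x = gko::matrix::Dense<float>::create(omp, gko::dim<2>{A->get_size()[0], 1});
// (fill b with the right-hand side values)
A->apply(b.get(), x.get()); // x = A * b, executed on the CPU where A resides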
If matrix A is going to be reused multiple times, it could be beneficial to copy it over to the accelerator, and perform the operations there, as demonstrated by the next code snippet:
auto cuda = gko::create<gko::CudaExecutor>(0, omp);
auto dA = gko::copy_to<gko::matrix::Csr<float>>(A.get(), cuda);
The first line of the snippet creates a new CUDA executor. Since there may be multiple NVIDIA GPUs present on the system, the first parameter instructs the library to use the first device (i.e. the one with device ID zero, as in the cudaSetDevice() routine from the CUDA runtime API). In addition, since GPUs are not stand-alone processors, it is required to pass a "master" OmpExecutor which will be used to schedule the requested CUDA kernels on the accelerator.
The second command creates a copy of the matrix A on the GPU. Notice the use of the get() method. As Ginkgo aims to provide automatic memory management of its objects, the result of calling gko::read_from_mtx() is a smart pointer (std::unique_ptr) to the created object. On the other hand, as the library will not hold a reference to A once the copy is completed, the input parameter for gko::copy_to() is a plain pointer. Thus, the get() method is used to convert from a std::unique_ptr to a plain pointer, as expected by gko::copy_to().
As a side note, the gko::copy_to routine is far more powerful than just copying data between different devices. It can also be used to convert data between different formats. For example, if the above code used gko::matrix::Ell as the template parameter, dA would be stored on the GPU, in ELLPACK format.
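Following the same idiom as the snippet above, such a format conversion might look like the line below (a sketch only; the element type simply mirrors the Csr example):
auto dA_ell = gko::copy_to<gko::matrix::Ell<float>>(A.get(), cuda); // stored on the GPU in ELLPACK format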
Finally, if all the processing of the matrix is supposed to be done on the GPU, and a CPU copy of the matrix is not required, we could have read the matrix to the GPU directly:
auto omp = gko::create<gko::OmpExecutor>();
auto cuda = gko::create<gko::CudaExecutor>(0, omp);
auto dA = gko::read_from_mtx<gko::matrix::Csr<float>>("A.mtx", cuda);
Notice that even though reading the matrix directly from a file to the accelerator is not supported, the library is designed to abstract away the intermediate step of reading the matrix into CPU memory. This is a general design approach taken by the library: if an operation is not supported by the device, the data is copied to the CPU, the operation is performed there, and the results are finally copied back to the device. This approach makes using the library more concise, as explicit copies are not required from the user. Nevertheless, this feature should be taken into account when considering the performance implications of such operations.