Ginkgo  Generated from pipelines/1589998975 branch based on develop. Ginkgo version 1.10.0
A numerical linear algebra library targeting many-core architectures
The mixed-spmv program

The mixed spmv example..

Table of contents
  1. Introduction
  2. The commented program
  1. Results
  2. The plain program

Introduction

This mixed spmv example should give the usage of Ginkgo mixed precision. This example is meant for you to understand how Ginkgo works with different precision of data. We encourage you to play with the code, change the parameters and see what is best suited for your purposes.

About the example

Each example has the following sections:

  1. Introduction:This gives an overview of the example and mentions any interesting aspects in the example that might help the reader.
  2. The commented program: This section is intended for you to understand the details of the example so that you can play with it and understand Ginkgo and its features better.
  3. Results: This section shows the results of the code when run. Though the results may not be completely the same, you can expect the behaviour to be similar.
  4. The plain program: This is the complete code without any comments to have an complete overview of the code.

The commented program

* @param value_dist distribution of array values
* @param engine a random engine
*
* @return ValueType
* /
template <typename ValueType, typename ValueDistribution, typename Engine>
typename std::enable_if<!gko::is_complex_s<ValueType>::value, ValueType>::type
get_rand_value(ValueDistribution&& value_dist, Engine&& gen)
{
return value_dist(gen);
}
/ **
* Specialization for complex types.
*
* @copydoc get_rand_value
* /
template <typename ValueType, typename ValueDistribution, typename Engine>
typename std::enable_if<gko::is_complex_s<ValueType>::value, ValueType>::type
get_rand_value(ValueDistribution&& value_dist, Engine&& gen)
{
return ValueType(value_dist(gen), value_dist(gen));
}
/ **
* timing the apply operation A->apply(b, x). It will runs 2 warmup and get
* average time among 10 times.
*
* @return seconds
* /
double timing(std::shared_ptr<const gko::Executor> exec,
std::shared_ptr<const gko::LinOp> A,
std::shared_ptr<const gko::LinOp> b,
std::shared_ptr<gko::LinOp> x)
{
int warmup = 2;
int rep = 10;
for (int i = 0; i < warmup; i++) {
A->apply(b, x);
}
double total_sec = 0;
for (int i = 0; i < rep; i++) {

always clone the x in each apply

auto xx = x->clone();

synchronize to make sure data is already on device

exec->synchronize();
auto start = std::chrono::steady_clock::now();
A->apply(b, xx);

synchronize to make sure the operation is done

exec->synchronize();
auto stop = std::chrono::steady_clock::now();

get the duration in seconds

std::chrono::duration<double> duration_time = stop - start;
total_sec += duration_time.count();
if (i + 1 == rep) {

copy the result back to x

x->copy_from(xx);
}
}
return total_sec / rep;
}
} // namespace
int main(int argc, char* argv[])
{

Use some shortcuts. In Ginkgo, vectors are seen as a gko::matrix::Dense with one column/one row. The advantage of this concept is that using multiple vectors is a now a natural extension of adding columns/rows are necessary.

using HighPrecision = double;
using RealValueType = gko::remove_complex<HighPrecision>;
using LowPrecision = float;
using IndexType = int;

The gko::matrix::Ell class is used here, but any other matrix class such as gko::matrix::Coo, gko::matrix::Hybrid, gko::matrix::Csr or gko::matrix::Sellp could also be used. Note. the behavior will depends GINKGO_MIXED_PRECISION flags and the actual implementation from different matrices.

Print the ginkgo version information.

std::cout << gko::version_info::get() << std::endl;
if (argc == 2 && (std::string(argv[1]) == "--help")) {
std::cerr << "Usage: " << argv[0] << " [executor] " << std::endl;
std::exit(-1);
}

Where do you want to run your operation?

The gko::Executor class is one of the cornerstones of Ginkgo. Currently, we have support for an gko::OmpExecutor, which uses OpenMP multi-threading in most of its kernels, a gko::ReferenceExecutor, a single threaded specialization of the OpenMP executor and a gko::CudaExecutor which runs the code on a NVIDIA GPU if available.

Note
With the help of C++, you see that you only ever need to change the executor and all the other functions/ routines within Ginkgo should automatically work and run on the executor with any other changes.
const auto executor_string = argc >= 2 ? argv[1] : "reference";
std::map<std::string, std::function<std::shared_ptr<gko::Executor>()>>
exec_map{
{"omp", [] { return gko::OmpExecutor::create(); }},
{"cuda",
[] {
}},
{"hip",
[] {
}},
{"dpcpp",
[] {
}},
{"reference", [] { return gko::ReferenceExecutor::create(); }}};

executor where Ginkgo will perform the computation

const auto exec = exec_map.at(executor_string)(); // throws if not valid

Preparing your data and transfer to the proper device.

Read the matrix using the read function and set the right hand side randomly.

Note
Ginkgo uses C++ smart pointers to automatically manage memory. To this end, we use our own object ownership transfer functions that under the hood call the required smart pointer functions to manage object ownership. gko::share and gko::give are the functions that you would need to use.

read the matrix into HighPrecision and LowPrecision.

auto hp_A = share(gko::read<hp_mtx>(std::ifstream("data/A.mtx"), exec));
auto lp_A = share(gko::read<lp_mtx>(std::ifstream("data/A.mtx"), exec));

Set the shortcut for each dimension

auto A_dim = hp_A->get_size();
auto b_dim = gko::dim<2>{A_dim[1], 1};
auto x_dim = gko::dim<2>{A_dim[0], b_dim[1]};
auto host_b = hp_vec::create(exec->get_master(), b_dim);

fill the b vector with some random data

std::default_random_engine rand_engine(32);
auto dist = std::uniform_real_distribution<RealValueType>(0.0, 1.0);
for (int i = 0; i < host_b->get_size()[0]; i++) {
host_b->at(i, 0) = get_rand_value<HighPrecision>(dist, rand_engine);
}

copy the data from host to device

auto hp_b = share(gko::clone(exec, host_b));
auto lp_b = share(lp_vec::create(exec));
lp_b->copy_from(hp_b);

create several result x vector in different precision

auto hp_x = share(hp_vec::create(exec, x_dim));
auto lp_x = share(lp_vec::create(exec, x_dim));
auto hplp_x = share(hp_x->clone());
auto lplp_x = share(hp_x->clone());
auto lphp_x = share(hp_x->clone());

Measure the time of apply

We measure the time among different combination of apply operation.

Hp * Hp -> Hp

auto hp_sec = timing(exec, hp_A, hp_b, hp_x);

Lp * Lp -> Lp

auto lp_sec = timing(exec, lp_A, lp_b, lp_x);

Hp * Lp -> Hp

auto hplp_sec = timing(exec, hp_A, lp_b, hplp_x);

Lp * Lp -> Hp

auto lplp_sec = timing(exec, lp_A, lp_b, lplp_x);

Lp * Hp -> Hp

auto lphp_sec = timing(exec, lp_A, hp_b, lphp_x);

To measure error of result. neg_one is an object that represent the number -1.0 which allows for a uniform interface when computing on any device. To compute the residual, all you need to do is call the add_scaled method, which in this case is an axpy and equivalent to the LAPACK axpy routine. Finally, you compute the euclidean 2-norm with the compute_norm2 function.

auto neg_one = gko::initialize<hp_vec>({-1.0}, exec);
auto hp_x_norm = gko::initialize<real_vec>({0.0}, exec->get_master());
auto lp_diff_norm = gko::initialize<real_vec>({0.0}, exec->get_master());
auto hplp_diff_norm = gko::initialize<real_vec>({0.0}, exec->get_master());
auto lplp_diff_norm = gko::initialize<real_vec>({0.0}, exec->get_master());
auto lphp_diff_norm = gko::initialize<real_vec>({0.0}, exec->get_master());
auto lp_diff = hp_x->clone();
auto hplp_diff = hp_x->clone();
auto lplp_diff = hp_x->clone();
auto lphp_diff = hp_x->clone();
hp_x->compute_norm2(hp_x_norm);
lp_diff->add_scaled(neg_one, lp_x);
lp_diff->compute_norm2(lp_diff_norm);
hplp_diff->add_scaled(neg_one, hplp_x);
hplp_diff->compute_norm2(hplp_diff_norm);
lplp_diff->add_scaled(neg_one, lplp_x);
lplp_diff->compute_norm2(lplp_diff_norm);
lphp_diff->add_scaled(neg_one, lphp_x);
lphp_diff->compute_norm2(lphp_diff_norm);
exec->synchronize();
std::cout.precision(10);
std::cout << std::scientific;
std::cout << "High Precision time(s): " << hp_sec << std::endl;
std::cout << "High Precision result norm: " << hp_x_norm->at(0)
<< std::endl;
std::cout << "Low Precision time(s): " << lp_sec << std::endl;
std::cout << "Low Precision relative error: "
<< lp_diff_norm->at(0) / hp_x_norm->at(0) << "\n";
std::cout << "Hp * Lp -> Hp time(s): " << hplp_sec << std::endl;
std::cout << "Hp * Lp -> Hp relative error: "
<< hplp_diff_norm->at(0) / hp_x_norm->at(0) << "\n";
std::cout << "Lp * Lp -> Hp time(s): " << lplp_sec << std::endl;
std::cout << "Lp * Lp -> Hp relative error: "
<< lplp_diff_norm->at(0) / hp_x_norm->at(0) << "\n";
std::cout << "Lp * Hp -> Hp time(s): " << lplp_sec << std::endl;
std::cout << "Lp * Hp -> Hp relative error: "
<< lphp_diff_norm->at(0) / hp_x_norm->at(0) << "\n";
}

Results

The following is the expected result (omp):

High Precision time(s): 2.0568800000e-05
High Precision result norm: 1.7725534898e+05
Low Precision time(s): 2.0955600000e-05
Low Precision relative error: 9.1052887738e-08
Hp * Lp -> Hp time(s): 2.1186100000e-05
Hp * Lp -> Hp relative error: 3.7799774251e-08
Lp * Lp -> Hp time(s): 2.0312300000e-05
Lp * Lp -> Hp relative error: 5.7910008031e-08
Lp * Hp -> Hp time(s): 2.0312300000e-05
Lp * Hp -> Hp relative error: 3.7173133506e-08

Comments about programming and debugging

The plain program

#include <ginkgo/ginkgo.hpp>
#include <fstream>
#include <iostream>
#include <map>
#include <string>
#include <chrono>
#include <random>
namespace {
template <typename ValueType, typename ValueDistribution, typename Engine>
typename std::enable_if<!gko::is_complex_s<ValueType>::value, ValueType>::type
get_rand_value(ValueDistribution&& value_dist, Engine&& gen)
{
return value_dist(gen);
}
template <typename ValueType, typename ValueDistribution, typename Engine>
typename std::enable_if<gko::is_complex_s<ValueType>::value, ValueType>::type
get_rand_value(ValueDistribution&& value_dist, Engine&& gen)
{
return ValueType(value_dist(gen), value_dist(gen));
}
double timing(std::shared_ptr<const gko::Executor> exec,
std::shared_ptr<const gko::LinOp> A,
std::shared_ptr<const gko::LinOp> b,
std::shared_ptr<gko::LinOp> x)
{
int warmup = 2;
int rep = 10;
for (int i = 0; i < warmup; i++) {
A->apply(b, x);
}
double total_sec = 0;
for (int i = 0; i < rep; i++) {
auto xx = x->clone();
exec->synchronize();
auto start = std::chrono::steady_clock::now();
A->apply(b, xx);
exec->synchronize();
auto stop = std::chrono::steady_clock::now();
std::chrono::duration<double> duration_time = stop - start;
total_sec += duration_time.count();
if (i + 1 == rep) {
x->copy_from(xx);
}
}
return total_sec / rep;
}
} // namespace
int main(int argc, char* argv[])
{
using HighPrecision = double;
using RealValueType = gko::remove_complex<HighPrecision>;
using LowPrecision = float;
using IndexType = int;
std::cout << gko::version_info::get() << std::endl;
if (argc == 2 && (std::string(argv[1]) == "--help")) {
std::cerr << "Usage: " << argv[0] << " [executor] " << std::endl;
std::exit(-1);
}
const auto executor_string = argc >= 2 ? argv[1] : "reference";
std::map<std::string, std::function<std::shared_ptr<gko::Executor>()>>
exec_map{
{"omp", [] { return gko::OmpExecutor::create(); }},
{"cuda",
[] {
}},
{"hip",
[] {
}},
{"dpcpp",
[] {
}},
{"reference", [] { return gko::ReferenceExecutor::create(); }}};
const auto exec = exec_map.at(executor_string)(); // throws if not valid
auto hp_A = share(gko::read<hp_mtx>(std::ifstream("data/A.mtx"), exec));
auto lp_A = share(gko::read<lp_mtx>(std::ifstream("data/A.mtx"), exec));
auto A_dim = hp_A->get_size();
auto b_dim = gko::dim<2>{A_dim[1], 1};
auto x_dim = gko::dim<2>{A_dim[0], b_dim[1]};
auto host_b = hp_vec::create(exec->get_master(), b_dim);
std::default_random_engine rand_engine(32);
auto dist = std::uniform_real_distribution<RealValueType>(0.0, 1.0);
for (int i = 0; i < host_b->get_size()[0]; i++) {
host_b->at(i, 0) = get_rand_value<HighPrecision>(dist, rand_engine);
}
auto hp_b = share(gko::clone(exec, host_b));
auto lp_b = share(lp_vec::create(exec));
lp_b->copy_from(hp_b);
auto hp_x = share(hp_vec::create(exec, x_dim));
auto lp_x = share(lp_vec::create(exec, x_dim));
auto hplp_x = share(hp_x->clone());
auto lplp_x = share(hp_x->clone());
auto lphp_x = share(hp_x->clone());
auto hp_sec = timing(exec, hp_A, hp_b, hp_x);
auto lp_sec = timing(exec, lp_A, lp_b, lp_x);
auto hplp_sec = timing(exec, hp_A, lp_b, hplp_x);
auto lplp_sec = timing(exec, lp_A, lp_b, lplp_x);
auto lphp_sec = timing(exec, lp_A, hp_b, lphp_x);
auto neg_one = gko::initialize<hp_vec>({-1.0}, exec);
auto hp_x_norm = gko::initialize<real_vec>({0.0}, exec->get_master());
auto lp_diff_norm = gko::initialize<real_vec>({0.0}, exec->get_master());
auto hplp_diff_norm = gko::initialize<real_vec>({0.0}, exec->get_master());
auto lplp_diff_norm = gko::initialize<real_vec>({0.0}, exec->get_master());
auto lphp_diff_norm = gko::initialize<real_vec>({0.0}, exec->get_master());
auto lp_diff = hp_x->clone();
auto hplp_diff = hp_x->clone();
auto lplp_diff = hp_x->clone();
auto lphp_diff = hp_x->clone();
hp_x->compute_norm2(hp_x_norm);
lp_diff->add_scaled(neg_one, lp_x);
lp_diff->compute_norm2(lp_diff_norm);
hplp_diff->add_scaled(neg_one, hplp_x);
hplp_diff->compute_norm2(hplp_diff_norm);
lplp_diff->add_scaled(neg_one, lplp_x);
lplp_diff->compute_norm2(lplp_diff_norm);
lphp_diff->add_scaled(neg_one, lphp_x);
lphp_diff->compute_norm2(lphp_diff_norm);
exec->synchronize();
std::cout.precision(10);
std::cout << std::scientific;
std::cout << "High Precision time(s): " << hp_sec << std::endl;
std::cout << "High Precision result norm: " << hp_x_norm->at(0)
<< std::endl;
std::cout << "Low Precision time(s): " << lp_sec << std::endl;
std::cout << "Low Precision relative error: "
<< lp_diff_norm->at(0) / hp_x_norm->at(0) << "\n";
std::cout << "Hp * Lp -> Hp time(s): " << hplp_sec << std::endl;
std::cout << "Hp * Lp -> Hp relative error: "
<< hplp_diff_norm->at(0) / hp_x_norm->at(0) << "\n";
std::cout << "Lp * Lp -> Hp time(s): " << lplp_sec << std::endl;
std::cout << "Lp * Lp -> Hp relative error: "
<< lplp_diff_norm->at(0) / hp_x_norm->at(0) << "\n";
std::cout << "Lp * Hp -> Hp time(s): " << lplp_sec << std::endl;
std::cout << "Lp * Hp -> Hp relative error: "
<< lphp_diff_norm->at(0) / hp_x_norm->at(0) << "\n";
}
gko::layout_type::array
The matrix should be written as dense matrix in column-major order.
gko::matrix::Dense
Dense is a matrix format which explicitly stores all values of the matrix.
Definition: dense_cache.hpp:19
gko::clone
detail::cloned_type< Pointer > clone(const Pointer &p)
Creates a unique clone of the object pointed to by p.
Definition: utils_helper.hpp:173
gko::HipExecutor::create
static std::shared_ptr< HipExecutor > create(int device_id, std::shared_ptr< Executor > master, bool device_reset, allocation_mode alloc_mode=default_hip_alloc_mode, CUstream_st *stream=nullptr)
Creates a new HipExecutor.
gko::version_info::get
static const version_info & get()
Returns an instance of version_info.
Definition: version.hpp:139
gko::dim< 2 >
gko::share
detail::shared_type< OwningPointer > share(OwningPointer &&p)
Marks the object pointed to by p as shared.
Definition: utils_helper.hpp:224
gko::CudaExecutor::create
static std::shared_ptr< CudaExecutor > create(int device_id, std::shared_ptr< Executor > master, bool device_reset, allocation_mode alloc_mode=default_cuda_alloc_mode, CUstream_st *stream=nullptr)
Creates a new CudaExecutor.
gko::OmpExecutor::create
static std::shared_ptr< OmpExecutor > create(std::shared_ptr< CpuAllocatorBase > alloc=std::make_shared< CpuAllocator >())
Creates a new OmpExecutor.
Definition: executor.hpp:1396
gko::matrix::Ell
ELL is a matrix format where stride with explicit zeros is used such that all rows have the same numb...
Definition: csr.hpp:31
gko::log::profile_event_category::operation
Kernel execution and data movement.
gko::remove_complex
typename detail::remove_complex_s< T >::type remove_complex
Obtain the type which removed the complex of complex/scalar type or the template parameter of class b...
Definition: math.hpp:260
gko::DpcppExecutor::create
static std::shared_ptr< DpcppExecutor > create(int device_id, std::shared_ptr< Executor > master, std::string device_type="all", dpcpp_queue_property property=dpcpp_queue_property::in_order)
Creates a new DpcppExecutor.