The distributed-solver program

The distributed solver example.

This example depends on simple-solver, three-pt-stencil-solver.

Table of contents
  1. Introduction
  2. The commented program
  3. Results
  4. The plain program

Introduction

This distributed solver example should help you understand the basics of using Ginkgo in a distributed setting. The example will solve a simple 1D Laplace equation where the system can be distributed row-wise to multiple processes. To run the solver with multiple processes, use mpirun -n NUM_PROCS ./distributed-solver [executor] [num_grid_points] [num_iterations].

If you are using GPU devices, please make sure that you run this example with at most as many processes as you have GPU devices available.

The commented program

Include files

This is the main ginkgo header file.

#include <ginkgo/ginkgo.hpp>

Add the C++ iostream header to output information to the console.

#include <iostream>

Add the STL map header for the executor selection

#include <map>

Add the string manipulation header to handle strings.

#include <string>
int main(int argc, char* argv[])
{

Initialize the MPI environment

Since this is an MPI program, we need to initialize and finalize MPI at the beginning and end of our program, respectively. This can be easily done with the following helper construct, which uses RAII to automate the initialization and finalization.

const gko::experimental::mpi::environment env(argc, argv);

Type Definitions

Define the needed types. In a parallel program we need to differentiate between global and local indices, thus we have two index types.

using GlobalIndexType = gko::int64;
using LocalIndexType = gko::int32;

The underlying value type.

using ValueType = double;

As vector type we use the following, which implements a subset of gko::matrix::Dense.

using dist_vec = gko::experimental::distributed::Vector<ValueType>;

As matrix type we simply use the following type, which can read distributed data and be applied to a distributed vector.

using dist_mtx =
gko::experimental::distributed::Matrix<ValueType, LocalIndexType,
GlobalIndexType>;

We still need a localized vector type to be used as scalars in the advanced apply operations.

using vec = gko::matrix::Dense<ValueType>;

The partition type describes how the rows of the matrices are distributed.

using part_type =
gko::experimental::distributed::Partition<LocalIndexType,
GlobalIndexType>;

We can use here the same solver type as you would use in a non-distributed program. Please note that not all solvers support distributed systems at the moment. We also define shorthand types for the Schwarz preconditioner and its local block-Jacobi solver.

using solver = gko::solver::Cg<ValueType>;
using schwarz = gko::experimental::distributed::preconditioner::Schwarz<
ValueType, LocalIndexType, GlobalIndexType>;
using bj = gko::preconditioner::Jacobi<ValueType, LocalIndexType>;

Create an MPI communicator and get the rank of the calling process.

const auto comm = gko::experimental::mpi::communicator(MPI_COMM_WORLD);
const auto rank = comm.rank();

User Input Handling

User input settings:

  • The executor, defaults to reference.
  • The number of grid points, defaults to 100.
  • The number of iterations, defaults to 1000.
if (argc == 2 && (std::string(argv[1]) == "--help")) {
if (rank == 0) {
std::cerr << "Usage: " << argv[0]
<< " [executor] [num_grid_points] [num_iterations] "
<< std::endl;
}
std::exit(-1);
}
const auto executor_string = argc >= 2 ? argv[1] : "reference";
const auto grid_dim =
static_cast<gko::size_type>(argc >= 3 ? std::atoi(argv[2]) : 100);
const auto num_iters =
static_cast<gko::size_type>(argc >= 4 ? std::atoi(argv[3]) : 1000);
const std::map<std::string,
std::function<std::shared_ptr<gko::Executor>(MPI_Comm)>>
executor_factory_mpi{
{"reference",
[](MPI_Comm) { return gko::ReferenceExecutor::create(); }},
{"omp", [](MPI_Comm) { return gko::OmpExecutor::create(); }},
{"cuda",
[](MPI_Comm comm) {
int device_id = gko::experimental::mpi::map_rank_to_device_id(
comm, gko::CudaExecutor::get_num_devices());
return gko::CudaExecutor::create(
device_id, gko::ReferenceExecutor::create());
}},
{"hip",
[](MPI_Comm comm) {
int device_id = gko::experimental::mpi::map_rank_to_device_id(
comm, gko::HipExecutor::get_num_devices());
return gko::HipExecutor::create(
device_id, gko::ReferenceExecutor::create());
}},
{"dpcpp", [](MPI_Comm comm) {
int device_id = 0;
if (gko::DpcppExecutor::get_num_devices("gpu") > 0) {
device_id = gko::experimental::mpi::map_rank_to_device_id(
comm, gko::DpcppExecutor::get_num_devices("gpu"));
} else if (gko::DpcppExecutor::get_num_devices("cpu") > 0) {
device_id = gko::experimental::mpi::map_rank_to_device_id(
comm, gko::DpcppExecutor::get_num_devices("cpu"));
} else {
throw std::runtime_error("No suitable DPC++ devices");
}
return gko::DpcppExecutor::create(
device_id, gko::ReferenceExecutor::create());
}}};
auto exec = executor_factory_mpi.at(executor_string)(MPI_COMM_WORLD);

Take timings.

comm.synchronize();
ValueType t_init = gko::experimental::mpi::get_walltime();

Creating the Distributed Matrix and Vectors

As a first step, we create a partition of the rows. The partition consists of ranges of consecutive rows which are assigned a part-id. These part-ids will be used for the distributed data structures to determine which rows will be stored locally. In this example each rank has (nearly) the same number of rows, so we can use the following specialized constructor. See gko::experimental::distributed::Partition for other modes of creating a partition.

const auto num_rows = grid_dim;
auto partition = gko::share(part_type::build_from_global_size_uniform(
exec->get_master(), comm.size(),
static_cast<GlobalIndexType>(num_rows)));
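
To make the row ownership concrete, here is a small stand-alone sketch (an illustration, not part of this example) that builds the same kind of uniform partition for 100 rows and 4 parts on a host executor and prints the range bounds. With these numbers it should print 0 25 50 75 100, i.e. rank r owns rows [25 * r, 25 * (r + 1)).

#include <ginkgo/ginkgo.hpp>
#include <iostream>

int main()
{
    using part_type =
        gko::experimental::distributed::Partition<gko::int32, gko::int64>;
    auto exec = gko::ReferenceExecutor::create();
    // Build one contiguous range of rows per part, as in the example above.
    auto part = part_type::build_from_global_size_uniform(exec, 4, 100);
    // get_range_bounds() returns num_ranges + 1 offsets.
    const auto* bounds = part->get_range_bounds();
    for (gko::size_type i = 0; i <= part->get_num_ranges(); ++i) {
        std::cout << bounds[i] << ' ';
    }
    std::cout << '\n';
}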

Assemble the matrix using a 3-pt stencil and fill the right-hand side with a sine value. The distributed matrix supports only constructing an empty matrix of zero size and filling in the values with gko::experimental::distributed::Matrix::read_distributed. Only the data that belongs to the rows owned by this rank will be assembled.

gko::matrix_data<ValueType, GlobalIndexType> A_data;
gko::matrix_data<ValueType, GlobalIndexType> b_data;
gko::matrix_data<ValueType, GlobalIndexType> x_data;
A_data.size = {num_rows, num_rows};
b_data.size = {num_rows, 1};
x_data.size = {num_rows, 1};
const auto range_start = partition->get_range_bounds()[rank];
const auto range_end = partition->get_range_bounds()[rank + 1];
for (int i = range_start; i < range_end; i++) {
if (i > 0) {
A_data.nonzeros.emplace_back(i, i - 1, -1);
}
A_data.nonzeros.emplace_back(i, i, 2);
if (i < grid_dim - 1) {
A_data.nonzeros.emplace_back(i, i + 1, -1);
}
b_data.nonzeros.emplace_back(i, 0, std::sin(i * 0.01));
x_data.nonzeros.emplace_back(i, 0, gko::zero<ValueType>());
}
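
For illustration, the data assembled above corresponds to the familiar 3-point stencil system with a sine right-hand side and a zero initial guess, where each rank only inserts the rows i of its own range [range_start, range_end):

A = \begin{pmatrix}
 2 & -1 &        &        \\
-1 &  2 & \ddots &        \\
   & \ddots & \ddots & -1 \\
   &        & -1     &  2
\end{pmatrix} \in \mathbb{R}^{n \times n},
\qquad
b_i = \sin(0.01\, i), \qquad x^{(0)}_i = 0, \qquad i = 0, \dots, n - 1,

with n = num_rows.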

Take timings.

comm.synchronize();
ValueType t_init_end = gko::experimental::mpi::get_walltime();

Read the matrix data; currently, this is only supported on CPU executors. This will also set up the communication pattern needed for the distributed matrix-vector multiplication.

auto A_host = gko::share(dist_mtx::create(exec->get_master(), comm));
auto x_host = dist_vec::create(exec->get_master(), comm);
auto b_host = dist_vec::create(exec->get_master(), comm);
A_host->read_distributed(A_data, partition);
b_host->read_distributed(b_data, partition);
x_host->read_distributed(x_data, partition);

After reading, the matrix and vectors can be moved to the chosen executor, since the distributed matrix also supports SpMV on devices.

auto A = gko::share(dist_mtx::create(exec, comm));
auto x = dist_vec::create(exec, comm);
auto b = dist_vec::create(exec, comm);
A->copy_from(A_host);
b->copy_from(b_host);
x->copy_from(x_host);

Take timings.

comm.synchronize();
ValueType t_read_setup_end = gko::experimental::mpi::get_walltime();

Solve the Distributed System

Generate the solver. This is the same as in the non-distributed case, except that we add a Schwarz preconditioner with a local block-diagonal (block-Jacobi) solver.

Set up the local block-diagonal solver factory.

auto local_solver = gko::share(bj::build().on(exec));
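
The local solver is configured through an ordinary Ginkgo factory, so it can be tuned exactly as in a non-distributed program. As a hedged illustration (not part of this example; the name tuned_local_solver is hypothetical), the block-Jacobi factory could be given an explicit maximum block size:

// Variation for illustration only: limit the block size of the local
// block-Jacobi solver. Any other LinOpFactory could be used here as well.
auto tuned_local_solver =
    gko::share(bj::build().with_max_block_size(16u).on(exec));

Such a factory can then be passed to with_local_solver below in the same way as local_solver.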

Set up the stopping criteria and the convergence logger.

const gko::remove_complex<ValueType> reduction_factor{1e-8};
std::shared_ptr<const gko::log::Convergence<ValueType>> logger =
gko::log::Convergence<ValueType>::create();
auto Ainv = solver::build()
.with_preconditioner(
schwarz::build().with_local_solver(local_solver))
.with_criteria(
gko::stop::Iteration::build().with_max_iters(num_iters),
gko::stop::ResidualNorm<ValueType>::build()
.with_reduction_factor(reduction_factor))
.on(exec)
->generate(A);

Add the logger to the generated solver to log the iteration count and residual norm.

Ainv->add_logger(logger);

Take timings.

comm.synchronize();
ValueType t_solver_generate_end = gko::experimental::mpi::get_walltime();

Apply the distributed solver; this is the same as in the non-distributed case.

Ainv->apply(b, x);

Take timings.

comm.synchronize();
ValueType t_end = gko::experimental::mpi::get_walltime();

Get the residual.

auto res_norm = gko::as<vec>(logger->get_residual_norm());
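
The logged residual norm is the one tracked by the stopping criterion. As a hedged cross-check (not part of the example), one could also compute the true residual norm ||b - A x|| explicitly, using the localized vector type vec for the scalars of the advanced apply and for the result; the names one, minus_one, res, and res_norm2 below are illustrative:

auto one = gko::initialize<vec>({1.0}, exec);
auto minus_one = gko::initialize<vec>({-1.0}, exec);
auto res = gko::clone(b);          // res = b
A->apply(minus_one, x, one, res);  // res = b - A * x
auto res_norm2 = vec::create(exec->get_master(), gko::dim<2>{1, 1});
res->compute_norm2(res_norm2);     // collective, yields the global norm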

Printing Results

Print the achieved residual norm and timings on rank 0.

if (comm.rank() == 0) {

clang-format off

std::cout << "\nNum rows in matrix: " << num_rows
<< "\nNum ranks: " << comm.size()
<< "\nFinal Res norm: " << *res_norm->get_const_values()
<< "\nIteration count: " << logger->get_num_iterations()
<< "\nInit time: " << t_init_end - t_init
<< "\nRead time: " << t_read_setup_end - t_init
<< "\nSolver generate time: " << t_solver_generate_end - t_read_setup_end
<< "\nSolver apply time: " << t_end - t_solver_generate_end
<< "\nTotal time: " << t_end - t_init
<< std::endl;

clang-format on

}
}

Results

This is the expected output for mpirun -n 4 ./distributed-solver:

Num rows in matrix: 100
Num ranks: 4
Final Res norm: 5.58392e-12
Iteration count: 7
Init time: 0.0663887
Read time: 0.0729806
Solver generate time: 7.6348e-05
Solver apply time: 0.0680783
Total time: 0.141351

The timings may vary depending on the machine.

The plain program

/*******************************<GINKGO LICENSE>******************************
Copyright (c) 2017-2023, the Ginkgo authors
All rights reserved.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
3. Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
******************************<GINKGO LICENSE>*******************************/
#include <ginkgo/ginkgo.hpp>
#include <iostream>
#include <map>
#include <string>
int main(int argc, char* argv[])
{
const gko::experimental::mpi::environment env(argc, argv);
using GlobalIndexType = gko::int64;
using LocalIndexType = gko::int32;
using ValueType = double;
using dist_mtx =
gko::experimental::distributed::Matrix<ValueType, LocalIndexType,
GlobalIndexType>;
using dist_vec = gko::experimental::distributed::Vector<ValueType>;
using vec = gko::matrix::Dense<ValueType>;
using part_type =
gko::experimental::distributed::Partition<LocalIndexType,
GlobalIndexType>;
using solver = gko::solver::Cg<ValueType>;
using schwarz = gko::experimental::distributed::preconditioner::Schwarz<
ValueType, LocalIndexType, GlobalIndexType>;
using bj = gko::preconditioner::Jacobi<ValueType, LocalIndexType>;
const auto comm = gko::experimental::mpi::communicator(MPI_COMM_WORLD);
const auto rank = comm.rank();
if (argc == 2 && (std::string(argv[1]) == "--help")) {
if (rank == 0) {
std::cerr << "Usage: " << argv[0]
<< " [executor] [num_grid_points] [num_iterations] "
<< std::endl;
}
std::exit(-1);
}
const auto executor_string = argc >= 2 ? argv[1] : "reference";
const auto grid_dim =
static_cast<gko::size_type>(argc >= 3 ? std::atoi(argv[2]) : 100);
const auto num_iters =
static_cast<gko::size_type>(argc >= 4 ? std::atoi(argv[3]) : 1000);
const std::map<std::string,
std::function<std::shared_ptr<gko::Executor>(MPI_Comm)>>
executor_factory_mpi{
{"reference",
[](MPI_Comm) { return gko::ReferenceExecutor::create(); }},
{"omp", [](MPI_Comm) { return gko::OmpExecutor::create(); }},
{"cuda",
[](MPI_Comm comm) {
int device_id = gko::experimental::mpi::map_rank_to_device_id(
comm, gko::CudaExecutor::get_num_devices());
return gko::CudaExecutor::create(
device_id, gko::ReferenceExecutor::create());
}},
{"hip",
[](MPI_Comm comm) {
int device_id = gko::experimental::mpi::map_rank_to_device_id(
comm, gko::HipExecutor::get_num_devices());
return gko::HipExecutor::create(
device_id, gko::ReferenceExecutor::create());
}},
{"dpcpp", [](MPI_Comm comm) {
int device_id = 0;
if (gko::DpcppExecutor::get_num_devices("gpu") > 0) {
device_id = gko::experimental::mpi::map_rank_to_device_id(
comm, gko::DpcppExecutor::get_num_devices("gpu"));
} else if (gko::DpcppExecutor::get_num_devices("cpu") > 0) {
device_id = gko::experimental::mpi::map_rank_to_device_id(
comm, gko::DpcppExecutor::get_num_devices("cpu"));
} else {
throw std::runtime_error("No suitable DPC++ devices");
}
return gko::DpcppExecutor::create(
device_id, gko::ReferenceExecutor::create());
}}};
auto exec = executor_factory_mpi.at(executor_string)(MPI_COMM_WORLD);
comm.synchronize();
ValueType t_init = gko::experimental::mpi::get_walltime();
const auto num_rows = grid_dim;
auto partition = gko::share(part_type::build_from_global_size_uniform(
exec->get_master(), comm.size(),
static_cast<GlobalIndexType>(num_rows)));
gko::matrix_data<ValueType, GlobalIndexType> A_data;
gko::matrix_data<ValueType, GlobalIndexType> b_data;
gko::matrix_data<ValueType, GlobalIndexType> x_data;
A_data.size = {num_rows, num_rows};
b_data.size = {num_rows, 1};
x_data.size = {num_rows, 1};
const auto range_start = partition->get_range_bounds()[rank];
const auto range_end = partition->get_range_bounds()[rank + 1];
for (int i = range_start; i < range_end; i++) {
if (i > 0) {
A_data.nonzeros.emplace_back(i, i - 1, -1);
}
A_data.nonzeros.emplace_back(i, i, 2);
if (i < grid_dim - 1) {
A_data.nonzeros.emplace_back(i, i + 1, -1);
}
b_data.nonzeros.emplace_back(i, 0, std::sin(i * 0.01));
x_data.nonzeros.emplace_back(i, 0, gko::zero<ValueType>());
}
comm.synchronize();
ValueType t_init_end = gko::experimental::mpi::get_walltime();
auto A_host = gko::share(dist_mtx::create(exec->get_master(), comm));
auto x_host = dist_vec::create(exec->get_master(), comm);
auto b_host = dist_vec::create(exec->get_master(), comm);
A_host->read_distributed(A_data, partition);
b_host->read_distributed(b_data, partition);
x_host->read_distributed(x_data, partition);
auto A = gko::share(dist_mtx::create(exec, comm));
auto x = dist_vec::create(exec, comm);
auto b = dist_vec::create(exec, comm);
A->copy_from(A_host);
b->copy_from(b_host);
x->copy_from(x_host);
comm.synchronize();
ValueType t_read_setup_end = gko::experimental::mpi::get_walltime();
auto local_solver = gko::share(bj::build().on(exec));
const gko::remove_complex<ValueType> reduction_factor{1e-8};
std::shared_ptr<const gko::log::Convergence<ValueType>> logger =
gko::log::Convergence<ValueType>::create();
auto Ainv = solver::build()
.with_preconditioner(
schwarz::build().with_local_solver(local_solver))
.with_criteria(
gko::stop::Iteration::build().with_max_iters(num_iters),
gko::stop::ResidualNorm<ValueType>::build()
.with_reduction_factor(reduction_factor))
.on(exec)
->generate(A);
Ainv->add_logger(logger);
comm.synchronize();
ValueType t_solver_generate_end = gko::experimental::mpi::get_walltime();
Ainv->apply(b, x);
comm.synchronize();
ValueType t_end = gko::experimental::mpi::get_walltime();
auto res_norm = gko::as<vec>(logger->get_residual_norm());
if (comm.rank() == 0) {
std::cout << "\nNum rows in matrix: " << num_rows
<< "\nNum ranks: " << comm.size()
<< "\nFinal Res norm: " << *res_norm->get_const_values()
<< "\nIteration count: " << logger->get_num_iterations()
<< "\nInit time: " << t_init_end - t_init
<< "\nRead time: " << t_read_setup_end - t_init
<< "\nSolver generate time: " << t_solver_generate_end - t_read_setup_end
<< "\nSolver apply time: " << t_end - t_solver_generate_end
<< "\nTotal time: " << t_end - t_init
<< std::endl;
}
}