ParSolGen

Why ParSolGen?

Automated Distributed Memory support

ParSolGen automatically generates a highly efficient parallel program from a description of a numerical algorithm. It handles communication between the nodes of a supercomputer and distributes and balances the load across them. No manual MPI programming is required.

Automated GPU support

ParSolGen is designed for modern heterogeneous HPC systems that combine CPUs and GPUs. Goopax (https://goopax.com/) enables automated offloading of work to various GPU-based accelerators. ParSolGen automatically handles load distribution and data transfer between the host and the GPUs.

No more complicated network programming

Communication between the nodes of a supercomputer and between the installed accelerators is handled automatically. No MPI knowledge is required. Generate parallel programs with ParSolGen and compile them with any mpicxx compiler.

Handling of complex information dependencies

ParSolGen supports various numerical algorithm domains, including dense and sparse linear algebra, and the generated programs deliver high performance.

Performance Tests

ParSolGen vs PETSc performance test. Four Google Cloud C2 high-performance virtual nodes were used. Each node has 16 Intel Xeon(R) vCPU cores (Intel Cascade Lake platform) running at 3100 MHz and 120 GB of RAM. ParSolGen was compared with PETSc v3.22.0. The test problem was the BiCG (biconjugate gradient) iterative solver applied to a sparse matrix stored in CSR format. The "Janna/Queen_4147" sparse matrix from the SuiteSparse Matrix Collection was used as test data. Both implementations ran 1000 iterations of the BiCG method. The total number of MPI processes (threads in the case of ParSolGen) varied from 1 to 32.
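For readers unfamiliar with the benchmark kernel, the following is a minimal serial sketch of unpreconditioned BiCG on a CSR matrix. It is not the ParSolGen or PETSc implementation. The function names (csr_matvec, csr_rmatvec, bicg) and the relative-tolerance early exit are assumptions made for illustration only; the test described above ran a fixed 1000 iterations.

```python
def csr_matvec(indptr, indices, data, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y


def csr_rmatvec(indptr, indices, data, x, n_cols):
    """Compute y = A.T @ x without building the transpose explicitly."""
    y = [0.0] * n_cols
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            y[indices[k]] += data[k] * x[i]
    return y


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def bicg(indptr, indices, data, b, max_iter=1000, rel_tol=1e-10):
    """Hypothetical helper: solve A x = b for a square CSR matrix with BiCG."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                 # residual b - A x for the start vector x = 0
    rt = list(b)                # shadow residual, chosen equal to r
    p, pt = list(r), list(rt)   # search directions
    rho = dot(rt, r)
    stop = rel_tol * dot(b, b) ** 0.5
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 <= stop or rho == 0.0:
            break               # converged, or breakdown (not handled further)
        q = csr_matvec(indptr, indices, data, p)
        qt = csr_rmatvec(indptr, indices, data, pt, n)
        denom = dot(pt, q)
        if denom == 0.0:
            break               # breakdown (not handled further)
        alpha = rho / denom
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        rt = [ri - alpha * qi for ri, qi in zip(rt, qt)]
        rho_new = dot(rt, r)
        beta = rho_new / rho
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        pt = [ri + beta * pi for ri, pi in zip(rt, pt)]
        rho = rho_new
    return x
```

Each iteration needs one product with A and one with its transpose, plus a few dot products. In a distributed setting those are the steps that require communication between processes.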

ParSolGen vs ScaLAPACK performance test. Four Google Cloud C2 high-performance virtual nodes were used. Each node has 16 Intel Xeon(R) vCPU cores (Intel Cascade Lake platform) running at 3100 MHz and 120 GB of RAM. ParSolGen was compared with ScaLAPACK v2.2.0. The test data was a randomly filled distributed dense matrix made up of 150x150 dense blocks, each storing 256x256 double-precision floating-point values. The total number of MPI processes (threads in the case of ParSolGen) varied from 1 to 64.
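The size of this test matrix follows from the block counts above. The short sketch below works through the arithmetic, assuming "150x150" means a square grid of 150 by 150 blocks and that a double occupies 8 bytes. The function name matrix_footprint is hypothetical.

```python
def matrix_footprint(blocks_per_side, block_side, bytes_per_value=8):
    """Return (matrix dimension, total bytes) for a square blocked dense matrix."""
    n = blocks_per_side * block_side            # rows (and columns) of the full matrix
    total_bytes = n * n * bytes_per_value       # dense storage, no padding
    return n, total_bytes


n, total_bytes = matrix_footprint(150, 256)
print(n)                      # 38400
print(total_bytes)            # 11796480000
print(total_bytes / 2**30)    # 10.986328125 (GiB)
```

Under these assumptions the full matrix is 38400x38400 and occupies roughly 11.8 GB, which is small compared with the combined memory of the four nodes.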

GPU performance test. Two Google Cloud G2 nodes were used. Each node has 16 Intel Xeon(R) vCPU cores (Intel Cascade Lake platform), 16 GB of RAM and an NVIDIA L4 accelerator card. The test was conducted on a heat distribution problem (a stencil-based algorithm).
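As an illustration of what a stencil-based heat distribution kernel looks like, here is a minimal serial sketch of one explicit time step on a 2D grid with a 5-point stencil and fixed boundary values. The dimensionality, the boundary treatment, the coefficient alpha and the name heat_step are assumptions; the source does not specify the exact formulation used in the test.

```python
def heat_step(u, alpha=0.25):
    """Advance a 2D temperature grid by one explicit step (5-point stencil).

    Boundary cells are kept fixed. alpha is the diffusion number
    (diffusivity * dt / dx**2); values up to 0.25 keep the explicit 2D scheme stable.
    """
    rows, cols = len(u), len(u[0])
    new = [row[:] for row in u]                 # copy so every update reads old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            neighbours = u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
            new[i][j] = u[i][j] + alpha * (neighbours - 4.0 * u[i][j])
    return new


# Hot top edge, cold interior: heat starts to flow into the centre cell.
grid = [[1.0, 1.0, 1.0],
        [0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0]]
grid = heat_step(grid)
```

Every cell update depends only on its four neighbours from the previous step, so the grid can be split into tiles that exchange only their edge rows and columns. This is why stencil codes map well onto GPUs and distributed memory.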