Performance

SINGLE-NODE CPU performance comparison

Comparison between LAMA, PETSc, and a plain MKL BLAS implementation of a CG solver running 1000 iterations
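Each CG iteration is built from the same few kernels that the benchmarked libraries dispatch to MKL BLAS (or, on the GPU, to cuSPARSE/cuBLAS): one matrix-vector product plus a handful of dot and axpy operations. A minimal sketch in pure Python (dense matrix for brevity; the helper names are illustrative, not library API):

```python
# Minimal conjugate-gradient sketch (dense matrix for brevity).
# SpMV, dot, and axpy are the kernels the benchmark exercises.

def spmv(A, x):                      # matrix-vector product (SpMV)
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(x, y):                       # inner product
    return sum(a * b for a, b in zip(x, y))

def axpy(alpha, x, y):               # returns alpha * x + y
    return [alpha * a + b for a, b in zip(x, y)]

def cg(A, b, iters=1000, tol=1e-12):
    x = [0.0] * len(b)
    r = b[:]                         # residual r = b - A x  (x = 0)
    p = r[:]
    rs = dot(r, r)
    for _ in range(iters):
        Ap = spmv(A, p)              # one SpMV per iteration
        alpha = rs / dot(p, Ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, Ap, r)
        rs_new = dot(r, r)
        if rs_new < tol:
            break
        p = axpy(rs_new / rs, p, r)  # p := r + beta * p
        rs = rs_new
    return x

# Small SPD system: A = [[4,1],[1,3]], b = [1,2]; exact x = (1/11, 7/11)
x = cg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

Because the SpMV dominates each iteration, runtime tracks the number of non-zeros of the matrix, as the results below show.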

System

  • Both libraries make use of Intel®’s high-performance MKL BLAS implementation

Results

  • Runtime is proportional to the number of non-zeros
  • Only the irregularly structured matrices inline_1 and audikw_1 show noticeably higher runtimes
  • This demonstrates that the design overhead of both LAMA and PETSc is negligible

In Summary

  • LAMA and PETSc perform similarly on CPUs

SINGLE-NODE GPU performance comparison

Comparison between LAMA and PETSc implementations of a CG solver running 1000 iterations

System

  • Nvidia® K40 (12 GB GDDR5)
  • CSR and ELL format

CSR format results

  • Runtime is proportional to the number of non-zeros
  • The irregular structure of inline_1 and audikw_1 leads to higher runtimes

ELL format results

  • ELL shows shorter runtimes in general
  • The exceptions are inline_1 and audikw_1, which exhibit nearly twice as many entries per row as the other matrices
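The reason ELL suffers on matrices with a high maximum entries-per-row is padding: ELL pads every row to the length of the longest one, whereas CSR stores only the non-zeros. A small illustrative sketch (the conversion helpers are hypothetical, not LAMA API) makes the overhead visible:

```python
# CSR vs. ELL storage for the same sparse matrix (illustrative sketch).
# ELL pads every row to the length of the longest row, so matrices with
# a few long rows (as with inline_1 / audikw_1) store and process many
# padding zeros.

def to_csr(dense):
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def to_ell(dense):
    rows = [[(j, v) for j, v in enumerate(row) if v != 0] for row in dense]
    width = max(len(r) for r in rows)          # padded row length
    values = [[v for _, v in r] + [0.0] * (width - len(r)) for r in rows]
    cols   = [[j for j, _ in r] + [0]   * (width - len(r)) for r in rows]
    return values, cols, width

A = [[4.0, 0.0, 0.0, 0.0],
     [0.0, 3.0, 0.0, 0.0],
     [1.0, 2.0, 5.0, 6.0],   # one long row forces width = 4
     [0.0, 0.0, 0.0, 1.0]]

csr_vals, _, _ = to_csr(A)
ell_vals, _, width = to_ell(A)
stored_csr = len(csr_vals)                     # 7 non-zeros stored
stored_ell = len(ell_vals) * width             # 4 rows * width 4 = 16 slots
```

Here CSR stores 7 values while ELL stores 16 slots; the regular row width is what makes ELL fast on GPUs, but the padding grows with the longest row.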

In Summary

  • For the CSR format
    • LAMA and PETSc perform similarly, with a slight overall advantage for LAMA
    • Both libraries rely on the cuSPARSE SpMV implementation, which dominates with about 80% of the overall runtime
    • LAMA calls cuBLAS routines for the axpy and dot operations, while PETSc uses implementations based on the Thrust library
  • For the ELL format
    • The runtime results are more sensitive to the actual sparse matrix structure than with CSR
    • LAMA uses a custom kernel that exploits the texture cache, which slightly increases performance in most cases
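To see why ELL SpMV performance depends on matrix structure, consider how the kernel traverses the padded arrays. A scalar sketch (not LAMA's actual kernel, whose GPU version assigns one thread per row, lays the arrays out column-major for coalesced loads, and reads x through the texture cache):

```python
# Sketch of ELL SpMV: y = A @ x on padded (values, cols, width) arrays.
# Every row runs a fixed-width loop, so padding zeros contribute nothing
# to the result but are still traversed -- the extra cost observed for
# inline_1 / audikw_1 with their long rows.

def ell_spmv(values, cols, width, x):
    y = []
    for r in range(len(values)):
        acc = 0.0
        for k in range(width):                # fixed trip count per row
            acc += values[r][k] * x[cols[r][k]]
        y.append(acc)
    return y

# Padded ELL form of [[2, 0], [1, 3]]: width 2, row 0 padded with a zero.
values = [[2.0, 0.0], [1.0, 3.0]]
cols   = [[0, 0],     [0, 1]]
y = ell_spmv(values, cols, 2, [1.0, 1.0])     # expect [2.0, 4.0]
```

The fixed per-row trip count is what makes the memory access pattern regular on a GPU, at the price of wasted work when row lengths vary widely.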

Case Study

SOFI3D is a seismic modelling code developed at the Geophysical Institute, KIT, Karlsruhe. The existing MPI version has been re-implemented with LAMA using an explicit matrix-vector formalism. While the MPI version was difficult to maintain, the developers can now focus on geophysical problems and no longer have to deal with implementation details and HPC issues. For a strong scaling benchmark, a 3D problem with 600 grid points in each dimension was selected. On the JURECA HPC system (Jülich Supercomputing Centre), this benchmark shows nearly the same performance for both versions on CPU nodes (2 x Intel Xeon E5-2680 v3 Haswell, 12 cores each @ 2.5 GHz). In contrast to the MPI version, the LAMA version also runs without modifications on GPU nodes (2 x NVIDIA Tesla K80), see Fig. 2.

LAMA White Paper

You can download our latest white paper on LAMA, covering its design, implementation, and performance, here.

LAMA in the Press - Publications

Brandes, Th., Schricker, E., Soddemann, Th.: The LAMA Approach for Writing Portable Applications on Heterogeneous Architectures. In: Projects and Products of Fraunhofer SCAI, Springer, 2017, http://www.springer.com/de/book/9783319624570

Süß, T., Döring, N., Gad, R., Nagel, L., Brinkmann, A., Feld, D., Schricker, E., Soddemann, T.: Impact of the Scheduling Strategy in Heterogeneous Systems That Provide Co-Scheduling. In: Proceedings of the 1st COSH Workshop on Co-Scheduling of HPC Applications, 2016, DOI: 10.14459/2016md1286954

Förster, M., Kraus, J.: Scalable parallel AMG on cc-NUMA machines with OpenMP. In: Computer Science - Research and Development, 2011, Volume 26, Issue 3-4, pp. 221-228, DOI: 10.1007/s00450-011-0159-z

Kraus, J., Förster, M.: Efficient AMG on Heterogeneous Systems. In: Facing the Multicore Challenge II, Lecture Notes in Computer Science, 2012, Volume 7174, pp. 133-146, DOI: 10.1007/978-3-642-30397-5_12

Kraus, J., Förster, M., Brandes, T., Soddemann, T.: Using LAMA for efficient AMG on hybrid clusters. In: Computer Science - Research and Development, 2013, Volume 28, Issue 2-3, pp. 211-220, DOI: 10.1007/s00450-012-0223-3