GROMACS Performance Optimisation on AMD GPUs

Szilárd Páll, PDC, & Andrey Alekseenko, KTH/SciLifeLab

GROMACS, a widely used molecular dynamics (MD) simulation engine, has seen significant performance gains on AMD GPU-based heterogeneous systems (such as Dardel at PDC and LUMI in Finland) thanks to optimisations to its SYCL backend. This builds on our previous work enabling GROMACS on AMD GPUs.

Packed math optimisations were undertaken to make better use of the single-precision floating-point capabilities of the AMD MI250X GPUs. Ideally, the compiler would generate packed instructions transparently, but manual optimisations were added in GROMACS 2024 so that performance does not depend on optimal code generation.
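To illustrate the idea behind packed math, the sketch below groups single-precision values into pairs so that each arithmetic step acts on two lanes at once. It is plain C++ written for illustration only: the names Float2, packedFma and dotPairs are hypothetical and are not taken from GROMACS, and whether a compiler maps such code to packed hardware instructions depends on the compiler and target.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical two-lane single-precision value (illustration only,
// not a GROMACS or SYCL type).
struct Float2
{
    float x;
    float y;
};

// Fused multiply-add applied to both lanes: returns a * b + c per lane.
// Expressing the work on pairs is what gives a compiler (or hand-written
// intrinsics) the chance to use one packed instruction for two values.
inline Float2 packedFma(Float2 a, Float2 b, Float2 c)
{
    return { std::fma(a.x, b.x, c.x), std::fma(a.y, b.y, c.y) };
}

// Dot product of two arrays, processed two elements per iteration.
// Assumes n is even to keep the sketch short.
inline float dotPairs(const float* a, const float* b, int n)
{
    assert(n % 2 == 0);
    Float2 acc{ 0.0f, 0.0f };
    for (int i = 0; i < n; i += 2)
    {
        acc = packedFma({ a[i], a[i + 1] }, { b[i], b[i + 1] }, acc);
    }
    return acc.x + acc.y; // combine the two lanes at the end
}
```

Writing the pairing out by hand, rather than relying on the compiler to find it, is the kind of manual step that makes performance less sensitive to code generation.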

A key focus area has been SYCL runtime performance. We identified that the way the SYCL runtimes handle task launches was hindering GROMACS’s highly latency-sensitive scheduling. To address this, we collaborated with the developers of the AdaptiveCpp and oneAPI DPC++ runtimes to optimise performance. We also actively participated in the design of the next SYCL standard to incorporate these findings.

GROMACS relies primarily on AdaptiveCpp for AMD GPUs, so improving its performance was crucial. By default, AdaptiveCpp uses a deferred task launch strategy: SYCL API calls are cached for later submission, which allows for runtime optimisation but introduces significant latency. This is detrimental to GROMACS’s latency-sensitive MD engine. Delays in submitting tasks lead to GPU starvation, particularly when CPU operations (like MPI calls) require CPU-GPU synchronisation.
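The difference between deferred and immediate launch can be shown with a toy model. The sketch below is not the AdaptiveCpp runtime and uses none of its API; ToyQueue and launchedBeforeSync are hypothetical names, and "launching" a task simply means calling it on the host. It only counts how many tasks have been launched by the time the host reaches a synchronisation point.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Toy model of a task queue (illustration only, not a SYCL queue).
class ToyQueue
{
public:
    explicit ToyQueue(bool launchImmediately) : launchImmediately_(launchImmediately) {}

    // Immediate mode: the task is launched as soon as it is submitted.
    // Deferred mode: the task is cached until flush() is called.
    void submit(std::function<void()> task)
    {
        if (launchImmediately_)
        {
            task();
            ++launched_;
        }
        else
        {
            pending_.push_back(std::move(task));
        }
    }

    // Launch every cached task, e.g. because the host needs the results.
    void flush()
    {
        for (auto& task : pending_)
        {
            task();
            ++launched_;
        }
        pending_.clear();
    }

    int launched() const { return launched_; }

private:
    bool                               launchImmediately_;
    std::vector<std::function<void()>> pending_;
    int                                launched_ = 0;
};

// Number of tasks already launched when the host reaches its sync point.
inline int launchedBeforeSync(bool launchImmediately, int numTasks)
{
    ToyQueue queue(launchImmediately);
    for (int i = 0; i < numTasks; ++i)
    {
        queue.submit([] {});
    }
    const int launchedEarly = queue.launched();
    queue.flush(); // the synchronisation point forces any cached work out
    assert(queue.launched() == numTasks);
    return launchedEarly;
}
```

In this model, launchedBeforeSync(false, 3) returns 0 while launchedBeforeSync(true, 3) returns 3: with deferred launch, no work has started by the time the host wants to synchronise, which is the starvation pattern described above.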

To enable low-latency submission, the AdaptiveCpp 23.10 release introduced an instant submission mode, which improves performance by up to 22% compared to the optimised cached mode. To benefit from this, users only need to recompile GROMACS with a recent AdaptiveCpp version that enables instant submission; there is no need to change the GROMACS code. This allows users of earlier GROMACS releases to gain performance improvements as well.

The graph shows the application performance (ns/day) and the corresponding iteration rate (ms/step) for the STMV benchmark running on different numbers of GCDs (Graphics Compute Dies) within a single node. The performance of GROMACS 2024.0 using AdaptiveCpp 23.10.0 with instant submission mode is compared to that of AMD’s GROMACS HIP fork.
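The two quantities on the graph are related through the MD integration time step. As a minimal sketch of that arithmetic, the hypothetical helper nsPerDay below converts a time per step into throughput; the time step is an input that must be supplied, since it is not stated here.

```cpp
#include <cassert>

// Convert wall-clock time per MD step (ms/step) into simulated time per
// day of wall-clock time (ns/day).
// timeStepFs: MD integration time step in femtoseconds (an assumed input).
// msPerStep:  measured wall-clock time per step in milliseconds.
inline double nsPerDay(double timeStepFs, double msPerStep)
{
    assert(timeStepFs > 0.0 && msPerStep > 0.0);
    const double msPerDay    = 24.0 * 60.0 * 60.0 * 1000.0; // 86,400,000 ms
    const double stepsPerDay = msPerDay / msPerStep;
    const double nsPerStep   = timeStepFs * 1.0e-6;          // 1 fs = 1e-6 ns
    return stepsPerDay * nsPerStep;
}
```

For example, with an assumed 2 fs time step, 1.728 ms/step corresponds to 100 ns/day; halving the time per step doubles the ns/day figure.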

Our long-term work on readying GROMACS for AMD GPUs using SYCL was presented at the 2024 Cray User Group Conference (see [1]). The paper provides a detailed analysis of node-level kernel and runtime performance, sharing best practices for using SYCL as a performance-portable GPU framework within the high-performance computing (HPC) community. Performance demonstrations are provided for Cray EX235a machines with MI250X accelerators, illustrating that portability can be achieved without sacrificing significant performance.

References

[1] A. Alekseenko, S. Páll, and E. Lindahl, “GROMACS on AMD GPU-Based HPC Platforms: Using SYCL for Performance and Portability.” arXiv, May 2, 2024. doi.org/10.48550/arXiv.2405.01420