
Preparing for GPU Computing on the Dardel and LUMI Systems

Peter Larsson, PDC/LUMI User Support Team

Did you know that you can get started with programming for AMD GPUs right now, by using HIP from AMD's ROCm software stack on NVIDIA GPUs?

Early next year, phase two of PDC’s new system, Dardel, and the European pre-exascale LUMI system will come online with their powerful new graphics processing units (GPUs) from AMD. The exact technical specifications of these GPUs are still under wraps, but they are expected to offer better performance than the GPUs available today. This will be a big shift in the Swedish computing landscape, as researchers have not previously had access to large GPU computing resources at the national level. Starting next year, most of the Swedish compute capacity will be GPU-based, with NVIDIA GPUs in the Alvis and Berzelius clusters (located at the Chalmers Centre for Computational Science and Engineering (C3SE) and the National Supercomputer Centre at Linköping University (NSC), respectively), plus AMD GPUs in Dardel and LUMI. This article is about what you can do now to start preparing to take advantage of the AMD GPUs when they come online next year.

If you are a researcher using software that others have written, you should first check the documentation for clues about GPU usage. Does the software support GPUs at all? If so, can you find anything about the way that software has been programmed to use GPUs? The keywords to look for are “CUDA”, “HIP”, “OpenCL”, “OpenMP offloading”, “OpenACC” and “SYCL”. A checklist is presented in the table below. The most important point is that CUDA, NVIDIA’s proprietary framework for developing applications for its own GPUs, will not be available on LUMI and Dardel. This means that any programs using it will need to be updated in order to work with the AMD GPUs. The same applies to many of the popular libraries for NVIDIA GPUs, such as cuBLAS, cuFFT, and cuDNN. Instead, for AMD GPUs, the corresponding language for GPU programming is called HIP. If you see HIP mentioned, that is great news, but unfortunately only a few software packages have been ported to HIP so far. If OpenCL is mentioned, the software should also work on AMD GPUs, although the performance may not be optimal. For high-level GPU programming, there are three popular models: OpenACC, OpenMP and SYCL. If the software uses OpenACC for GPU computing, then it might work eventually, but it will not work right now (this is discussed further later in this article). If the software already uses OpenMP offloading for GPU computing, then it should work well on LUMI and Dardel, as this is the recommended high-level way to write GPU code for these systems. SYCL is gaining popularity, as it is backed by Intel for their upcoming GPUs, and it is possible that SYCL will work well on AMD GPUs in the future, but right now the only option is an experimental project (hipSYCL at Heidelberg University).

Overview of GPU programming models and their support on different kinds of GPUs

Programming model   | Nvidia GPUs                          | AMD GPUs                                        | Intel GPUs
--------------------|--------------------------------------|-------------------------------------------------|--------------------------------------
CUDA                | yes                                  | no, but HIPIFY tools may help with conversion   | no
HIP                 | yes                                  | yes, best performance                           | ?
OpenCL              | yes, but likely lower performance    | yes, medium performance                         | yes
OpenMP (offloading) | yes                                  | yes                                             | yes
OpenACC             | yes (Cray/Nvidia compilers)          | yes (in the future through Clacc?)              | yes (in the future through Clacc?)
SYCL                | yes (Codeplay’s ComputeCpp product)  | experimental support (hipSYCL)                  | yes (Intel oneAPI)

If you are a software developer writing your own low-level GPU code, you should investigate the ROCm framework from AMD and the HIP language extension for C/C++. ROCm and HIP are intended to be a full replacement for the CUDA stack, with the added advantage that code written in HIP can be compiled for both AMD and NVIDIA GPUs! This allows you (in theory) to have a single codebase that supports both. Importantly, this means that you can start porting your code to HIP today and test it on NVIDIA GPUs, even if you do not have access to new AMD GPUs that support ROCm. This is the recommended approach, as supplies of new GPUs are extremely tight due to high consumer demand and supply chain problems resulting from the coronavirus pandemic. The HIP language is designed to be very similar to CUDA. In some cases, it is possible to automatically translate CUDA code to HIP code using the "hipify" tools in ROCm. These tools will even translate CUDA library calls to the corresponding AMD library calls but, in general, some code modifications are needed to get the best performance. HIP code can then be compiled either with Clang (which is included with ROCm) or Cray’s compiler (which only has experimental support for now). Courses in HIP are already starting to appear: there was one at CSC – IT Center for Science in Finland in February 2021 and another in Sweden in April 2021, jointly arranged by the EuroCC National Competence Centre Sweden (ENCCS) and CSC, with more to come.
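To give a feel for what this looks like in practice, here is a minimal sketch of a HIP vector-addition program. It assumes a ROCm installation with the hipcc compiler (or a CUDA toolkit on an NVIDIA system); apart from the hip* prefixes on the runtime calls, the syntax is essentially the same as CUDA, which is why porting is usually mechanical.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Element-wise vector addition; apart from the hip* runtime calls,
// this is the same kernel one would write in CUDA.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    float ha[1024], hb[1024], hc[1024];
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    hipMalloc(&da, bytes);                            // cf. cudaMalloc
    hipMalloc(&db, bytes);
    hipMalloc(&dc, bytes);
    hipMemcpy(da, ha, bytes, hipMemcpyHostToDevice);  // cf. cudaMemcpy
    hipMemcpy(db, hb, bytes, hipMemcpyHostToDevice);

    // The kernel launch uses the same triple-chevron syntax as CUDA.
    vector_add<<<(n + 255) / 256, 256>>>(da, db, dc, n);

    hipMemcpy(hc, dc, bytes, hipMemcpyDeviceToHost);
    printf("hc[0] = %.1f\n", hc[0]);  // 1.0 + 2.0
    hipFree(da); hipFree(db); hipFree(dc);
    return 0;
}
```

Built with hipcc, the same source compiles for an AMD GPU under ROCm or for an NVIDIA GPU via the CUDA backend.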

If you are programming for GPUs using a high-level approach such as OpenACC, OpenMP device offloading, or, more recently, SYCL, you need to investigate the support for AMD GPUs in your toolchain. OpenMP offloading is likely the best choice, as it is the only cross-platform framework supported by all three big GPU vendors (NVIDIA, AMD, and Intel). SYCL could be a good choice in the future when there is a backend for AMD GPUs. The most problematic is likely to be OpenACC. In practice, OpenACC has been heavily centred around NVIDIA products, which makes it unclear how well it will be supported on AMD GPUs. There is hope that the “Clacc” project, sponsored by the US Exascale Computing Project, will solve this problem. It aims to develop an OpenACC compiler for Clang that effectively translates OpenACC to OpenMP; this approach should allow OpenACC code to run on AMD GPUs in the future. Recently, Cray also announced that they will provide full support for OpenACC in their compilers for both C/C++ and Fortran, likely based on the Clacc project. Initially, there will be support at the OpenACC 2.7 level starting next year, and eventually at the OpenACC 3.0 level. This should help some major software packages (like VASP) use GPUs on LUMI and Dardel.

Many applications rely on subroutines from optimised numerical libraries, such as BLAS and FFTW, to get good performance on CPUs. Several of these libraries are also available for NVIDIA GPUs, which has helped the uptake of GPU computing. A similar ecosystem is currently under development for AMD GPUs, and many of the well-known GPU libraries have in fact already been ported (see the table below). It is a good idea to continue this approach with AMD GPUs and rely on libraries when you can. This way, you will automatically get most of the performance benefits of future AMD GPUs without changing your own code, as the libraries are likely to be updated to support the latest features. Generally, the ROCm libraries come in two flavours: a “hip-xxx” version, which can run on both AMD and NVIDIA hardware, and a “roc-xxx” version, which only runs on AMD GPUs. The HIP version is just a thin library of wrappers that call the best underlying library depending on which GPU the code is compiled for. For example, on an NVIDIA system, hipBLAS will call cuBLAS, but on an AMD system, hipBLAS will call rocBLAS instead.

Translation table from CUDA libraries to the corresponding HIP libraries

Description                                              | CUDA library     | HIP library | ROCm backend
---------------------------------------------------------|------------------|-------------|---------------
Basic linear algebra like matrix-matrix multiplication   | cuBLAS           | hipBLAS     | rocBLAS
Fast Fourier transforms                                  | cuFFT            | hipFFT      | rocFFT
Linear algebra (subset of LAPACK)                        | cuSOLVER         | hipSOLVER   | rocSOLVER
Basic linear algebra for sparse matrices                 | cuSPARSE         | hipSPARSE   | rocSPARSE
Parallel algorithms like scan and reduce                 | CUB              | hipCUB      | rocPRIM
Random number generation                                 | cuRAND           | hipRAND     | rocRAND
GPU-to-GPU communication                                 | NCCL (“Nickel”)  | n/a         | RCCL (“Rickle”)
Neural network operations                                | cuDNN            | hipDNN      | MIOpen

Finally, what software is already available for AMD GPUs? In general, the fields of deep learning and molecular dynamics have the most software packages that are either OpenCL-based or have early HIP ports. The deep learning frameworks TensorFlow and PyTorch can already run on AMD GPUs right now, and the performance looks promising. Several of the big molecular dynamics packages have support or are being ported: GROMACS, LAMMPS, NAMD, and Amber. In the materials science field, there is some support in CP2K and SIRIUS through the underlying DBCSR sparse matrix-matrix multiplication library, but it is not complete. There has been no official statement on VASP, but its GPU functionality is being moved from CUDA to OpenACC, which could work in the future when there is OpenACC support for AMD GPUs. In the weather and climate field, there are ongoing efforts to port the ICON modelling framework, and in the area of computational fluid dynamics, the Nek5000 code is also being prepared. To find AMD GPU-enabled software, it may be worth looking into projects associated with the US National Laboratories and the Exascale Computing Project (ECP), as they are betting heavily on AMD GPUs for their upcoming Frontier and El Capitan supercomputers.

If you have questions about preparing your research code for GPU computing, you are welcome to contact Peter Larsson ( ypetla@kth.se ).