Building for AMD GPUs

The AMD ROCm development platform

The AMD Radeon Open Compute (ROCm) platform is a software stack for developing and running programs on GPUs. ROCm supports several programming models, such as the Heterogeneous-computing Interface for Portability (HIP), offloading to GPU with OpenMP directives, and the SYCL programming model.

Programs on Dardel are installed using a specific Cray Programming Environment (CPE). The main version of the CPE on Dardel is currently 23.03, which can be loaded with

ml PDC/23.03

To load ROCm module version 5.0.2 and set the accelerator target to amd-gfx90a (AMD MI250X GPU), use

ml rocm/5.0.2
ml craype-accel-amd-gfx90a

There is also a more recent version of ROCm (5.3.3) available. The combination of ROCm 5.3.3 and CPE 23.03 is not officially supported by HPE, but it appears to work well and provides some additional functionality. It can be loaded with

ml rocm/5.3.3
ml craype-accel-amd-gfx90a
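
After loading the modules, it can be useful to confirm that the ROCm tools are actually visible in the current shell. The following is a minimal sketch and not part of the official instructions: the function name check_rocm_tools is hypothetical, and it assumes that the rocm module places hipcc and rocminfo on the PATH.

```shell
# Hypothetical helper: report where each ROCm tool is found, if at all.
check_rocm_tools() {
    for tool in hipcc rocminfo; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: $(command -v "$tool")"
        else
            echo "$tool: not found (is the rocm module loaded?)"
        fi
    done
}

check_rocm_tools
```

If a tool is reported as not found, reloading the rocm module in the same shell is the first thing to try.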

Programs can then be built with the different toolchains (Cray, GNU, AOCC) that are available in the different versions of the Cray Programming Environment; see Compilers and libraries.

For running programs as batch jobs on the GPU nodes, see job script example 6 on the Job script examples page.
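
For orientation, a batch script for one GPU node could look roughly like the sketch below. It is not a copy of job script example 6. It only reuses the allocation options and module commands shown on this page, and the file name gpu_job.sh is an assumption. The executable hello_world_gpu.x is the one built in Example 1 below.

```shell
# Write a sketch of a batch script to gpu_job.sh (hypothetical file name).
# The quoted 'EOF' keeps the placeholder <project name> literal.
cat > gpu_job.sh <<'EOF'
#!/bin/bash
# Replace <project name> with your own allocation
#SBATCH -A <project name>
# GPU partition, one node, 10 minutes (same options as the salloc examples)
#SBATCH -p gpu
#SBATCH -N 1
#SBATCH -t 0:10:00

# Same modules as used when building
ml PDC/23.03
ml rocm/5.0.2
ml craype-accel-amd-gfx90a

srun -n 1 ./hello_world_gpu.x
EOF

echo "wrote gpu_job.sh"
```

The script would then be submitted with sbatch gpu_job.sh once the project name has been filled in.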

Compiler and linker flags, environment variables

For executables built with the compilers of the Cray Compiler Environment (CCE), verbose runtime information can be enabled with the environment variable CRAY_ACC_DEBUG, which takes the values 1, 2, or 3. For the highest level of information, use

export CRAY_ACC_DEBUG=3
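
The variable can also be set for a single command instead of being exported for the whole shell session. The sketch below illustrates the difference. It uses sh -c as a stand-in for a CCE-built executable, which is an assumption made only so that the snippet runs anywhere.

```shell
# Start from a clean state so the comparison is meaningful.
unset CRAY_ACC_DEBUG

# Prefix form: the variable is visible only to this one command.
first=$(CRAY_ACC_DEBUG=2 sh -c 'echo "${CRAY_ACC_DEBUG:-unset}"')

# The next command no longer sees it.
second=$(sh -c 'echo "${CRAY_ACC_DEBUG:-unset}"')

echo "with prefix:    $first"
echo "without prefix: $second"

# On a GPU node the same pattern would be, for example:
#   CRAY_ACC_DEBUG=2 ./ex04.x
```

The prefix form is convenient when only one run should be verbose and later runs should stay quiet.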

Build and run examples

Example 1: Build and run a C++ code with offloading to GPU with HIP

In this example we build and test run a Hello World C++ code in which offloading to GPU is done with the Heterogeneous-computing Interface for Portability (HIP). The program is built with the AMD hipcc compiler.

# Download the source code
wget https://raw.githubusercontent.com/PDC-support/introduction-to-pdc/master/example/hello_world_gpu.cpp

# Load the ROCm module and set the accelerator target to amd-gfx90a (AMD MI250X GPU)
ml rocm/5.0.2
ml craype-accel-amd-gfx90a

# We use the AMD hipcc compiler. Check the full path of the command hipcc
which hipcc
# returns
/opt/rocm-5.0.2/bin/hipcc

# Compile the code on the login node
hipcc --offload-arch=gfx90a hello_world_gpu.cpp -o hello_world_gpu.x

# Test the code in an interactive session.
# First queue to get one GPU node reserved for 10 minutes
salloc -N 1 -t 0:10:00 -A <project name> -p gpu
# wait for a node.

# then run the program
srun -n 1 ./hello_world_gpu.x

# with program output to standard out
You can access GPU devices: 0-7
GPU 0: hello world
...

Example 2: Build and run a Fortran code with offloading to GPU with OpenMP

In this example we build and test run a Fortran program that calculates the dot product of two long vectors by means of offloading to GPU with OpenMP. The build is done within the PrgEnv-cray environment using the Cray Compiler Environment.

# Download the source code
wget https://github.com/ENCCS/openmp-gpu/raw/main/content/exercise/ex04/solution/ex04.F90

# Load the ROCm module and set the accelerator target to amd-gfx90a (AMD MI250X GPU)
ml rocm/5.0.2
ml craype-accel-amd-gfx90a

# Check which compiler the compiler wrapper is pointing to
ftn --version
# returns
Cray Fortran : Version 15.0.1

# Compile the code on the login node
ftn -fopenmp ex04.F90 -o ex04.x

# Test the code in an interactive session.
# First queue to get one GPU node reserved for 10 minutes
salloc -N 1 -t 0:10:00 -A <project name> -p gpu
# wait for a node.

# then run the program
srun -n 1 ./ex04.x

# with program output to standard out
The sum is:  1.25

# Alternatively, log in to the node with (for example)
ssh nid002792
# where nid002792 is one of the Dardel GPU nodes.

# Load the rocm module
ml rocm/5.0.2

# then run the program
./ex04.x

# with program output to standard out
The sum is:  1.25

# For CCE-built executables, enable verbose runtime information on
# the offloading to GPU with the environment variable
export CRAY_ACC_DEBUG=3

# When rerunning the program
./ex04.x

# a detailed listing of the data transfers between host memory and
# device memory is displayed
ACC: Version 5.0 of HIP already initialized, runtime version 50013601
ACC: Get Device 0
...
...
ACC: End transfer (to acc 0 bytes, to host 4 bytes)
ACC:
The sum is:  1.25
ACC: __tgt_unregister_lib
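
To compare how much information each CRAY_ACC_DEBUG level produces, the program can be run once per level with the output saved to separate files. The following is a minimal sketch: the function name run_with_acc_debug and the log file names are hypothetical, and both standard output and standard error are captured because the sketch does not assume which stream the ACC lines are written to.

```shell
# Hypothetical helper: run a program once per CRAY_ACC_DEBUG level
# and keep the output of each run in its own log file.
run_with_acc_debug() {
    prog=$1
    for level in 1 2 3; do
        log="acc_debug_${level}.log"
        # The variable is set for this single run only.
        CRAY_ACC_DEBUG=$level "$prog" >"$log" 2>&1
        echo "level $level: $(wc -l <"$log") lines in $log"
    done
}

# On a GPU node, with the rocm module loaded:
#   run_with_acc_debug ./ex04.x
```

The line counts give a quick impression of the verbosity of each level before opening the individual log files.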

References, general information

AMD ROCm Information Portal

ENCCS and AMD training material for ROCm

LUMI software development

LUMI training materials

Frontier user guide

AMD Instinct product line

AMD Instinct Wikipedia page

ENCCS general GPU programming course

Introductory videos from AMD

Introduction to HIP Programming

Introduction to AMD GPU Hardware

GPU Programming Concepts

GPU Programming Software

Porting CUDA to HIP

Heterogeneous-computing Interface for Portability (HIP)

PRACE training GPU Programming with HIP

AMD’s HIP Programming Guide

OpenMP

Michael Klemm, Intro to GPU Programming with the OpenMP API (2021-10-20)

AMD’s ROCm documentation, chapter OpenMP support

ENCCS and CSC, OpenMP for GPU offloading

SYCL

Codeplay’s introduction to SYCL (videos)

Introduction to SYCL

Topology Discovery and Queue Creation

SYCL Kernel Functions

Managing Data in SYCL

ENCCS workshop, Heterogeneous programming with SYCL

Aksel Alpay, Universität Heidelberg, SYCL Tutorial: An Introduction to hipSYCL (video)

hipSYCL blog, benchmarking hipSYCL with HeCBench on AMD hardware (2022-07-20)