Boosting AI/ML Research on Dardel

Xavier Aguilar, PDC

The hardware for the second phase of Dardel will arrive later this summer. There will be a new partition comprising 56 graphics processing unit (GPU) nodes, each of which will be configured with one 64-core AMD EPYC processor and four AMD Instinct MI250X GPUs. These GPUs can perform both vector and matrix operations. While typical HPC applications use vector operations (and hence express performance in those terms), machine learning (ML) code benefits from matrix operations. When using matrix operations, the peak performance of each card is 383 TFLOPS in half-precision floating-point format (FP16) and 95.7 TFLOPS in single-precision floating-point format (FP32). With four cards, each node therefore offers a theoretical peak of roughly 1.5 PFLOPS in FP16, which makes the new GPU partition on Dardel a platform that is highly suitable for ML workflows.
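As a rough illustration of what these figures add up to, the per-card peaks can be scaled to a whole node and to the full partition. This is a back-of-the-envelope sketch: the constant and function names are ours, and the results are theoretical peaks for matrix operations, not measured or sustained performance.

```python
# Figures quoted in the article (theoretical peak, matrix operations).
NODES = 56
GPUS_PER_NODE = 4
PEAK_TFLOPS_PER_CARD = {"FP16": 383.0, "FP32": 95.7}


def aggregate_peak(tflops_per_card, gpus_per_node=GPUS_PER_NODE, nodes=NODES):
    """Return (per-node, per-partition) theoretical peak in TFLOPS."""
    per_node = tflops_per_card * gpus_per_node
    per_partition = per_node * nodes
    return per_node, per_partition


for precision, per_card in PEAK_TFLOPS_PER_CARD.items():
    per_node, per_partition = aggregate_peak(per_card)
    print(
        f"{precision}: {per_node:.1f} TFLOPS per node, "
        f"{per_partition / 1000:.1f} PFLOPS for the partition"
    )
```

Real ML workloads will reach only a fraction of these numbers, since they also depend on memory bandwidth, the interconnect and how well the framework maps onto the hardware.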

PDC was provided with a few experimental nodes containing AMD Instinct MI100 cards, a predecessor of the MI250X. While the MI100s cannot be directly compared to the MI250Xs, they serve as a good testbed for trying out the AMD GPU software stack while waiting for the final hardware to arrive. In this case, we are using the nodes as a testbed for various ML frameworks and workloads. We have installed TensorFlow and PyTorch, the two most widely used frameworks for ML and Deep Learning (DL), and are testing their functionality as well as their performance, even though the performance observed on the MI100s will not directly relate to the performance that will be provided by the MI250Xs. We are currently testing native installations of the software. However, Singularity has already been deployed on Dardel, so using the GPU nodes for ML/DL will be even easier with the containerised solutions provided directly by AMD. Furthermore, libraries such as RCCL and frameworks such as Horovod will make it possible to use multiple GPUs and multiple nodes at the same time, thereby opening the door to developing and training larger AI models on Dardel.
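PyTorch builds for ROCm expose AMD GPUs through the same `torch.cuda` interface that is used for NVIDIA devices, so a basic functionality check of the kind described above might look like the following. This is a minimal sketch and not the test procedure used at PDC: the function name `describe_gpus` is hypothetical, and the code simply returns an empty list if PyTorch is not installed or no GPU is visible.

```python
import importlib.util


def describe_gpus():
    """Return the names of the GPUs visible to PyTorch (empty list if none)."""
    # Assumption: PyTorch may not be installed where this check is run.
    if importlib.util.find_spec("torch") is None:
        return []
    import torch

    # ROCm builds of PyTorch reuse the torch.cuda namespace for AMD GPUs.
    if not torch.cuda.is_available():
        return []
    return [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]


if __name__ == "__main__":
    names = describe_gpus()
    if names:
        for index, name in enumerate(names):
            print(f"GPU {index}: {name}")
    else:
        print("No GPU visible to PyTorch (or PyTorch is not installed).")
```

Because existing PyTorch scripts usually select their device through `torch.cuda`, many of them should run on the AMD nodes without source changes, although this needs to be verified for each workload.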