
New Major HPC System for PDC

Gert Svensson, PDC

As reported in the previous PDC Newsletter, PDC received a substantial grant from the Swedish National Infrastructure for Computing (SNIC) to install a new general-purpose high-performance computing (HPC) system for academic research. The process of procuring this system (which will replace PDC’s current flagship system, Beskow) is now well underway. The plan is that the new system will be installed in two steps: the first part of the system (phase one) is expected to be delivered early in 2021 and the second part (phase two) is planned to be in place in early 2022 at the latest.

Background

The new supercomputer system at PDC is intended for a wide range of academic research use. It will be able to execute highly parallel jobs using a large number of nodes, as well as jobs using a single node or a small number of nodes, in an efficient way. The new system will have a partition using only central processing units (CPUs), and another partition which will be equipped with graphics processing unit (GPU) accelerators or high-speed CPUs with some properties similar to accelerators. This second partition will also be suited to Artificial Intelligence/Machine Learning (AI/ML) workloads, especially in combination with HPC simulations. In addition, the system will include a fast Lustre storage subsystem.

The capacity of the CPU module of the new system will replace that of the following SNIC systems: Beskow and Tegner at PDC, Aurora at Lund University and Hebbe at Chalmers University of Technology. All of these systems will be retired from SNIC duty once the new system is in operation at PDC.

Sweden is part of the LUMI pre-exascale consortium, so researchers who plan to run their codes on the LUMI system (or other pre-exascale or exascale systems) will be able to use the new system at PDC as a stepping stone. Development and testing of new codes could partly be done at PDC, while the LUMI system would be available for extensive simulations that exceed the capabilities of the SNIC systems. Both computer systems are likely to have a substantial GPU partition. However, because the two systems are being procured on different time schedules, their type and manufacturer may differ.

The budget from SNIC for the new system at PDC is 129 million SEK in Total Cost of Ownership (TCO) over five years. This includes the purchase price, plus the costs for the installation, maintenance, power and cooling. In addition to that, SNIC will provide funding of 41 million SEK for system administration and running expenses for the computer hall for five years.
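The two funding figures above can be combined with a few lines of arithmetic. The following is only an illustrative sketch: the function names are hypothetical, and the even split across years is an assumption made for illustration, not a statement about how SNIC actually disburses the funds.

```python
# Figures taken from the paragraph above (million SEK, both over five years).
SYSTEM_TCO_MSEK = 129        # purchase, installation, maintenance, power, cooling
ADMIN_AND_HALL_MSEK = 41     # system administration and computer hall expenses
FUNDING_PERIOD_YEARS = 5


def total_funding_msek():
    """Total SNIC funding for the new system over the whole period."""
    return SYSTEM_TCO_MSEK + ADMIN_AND_HALL_MSEK


def average_funding_per_year_msek():
    """Average per year, ASSUMING an even split (illustration only)."""
    return total_funding_msek() / FUNDING_PERIOD_YEARS


if __name__ == "__main__":
    print("Total funding (MSEK):", total_funding_msek())
    print("Average per year (MSEK):", average_funding_per_year_msek())
```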

For many years the KTH Royal Institute of Technology and PDC have been involved in an extensive research collaboration with Scania. In the PDC part of the project, HPC simulations are used to improve the efficiency of vehicle designs. In 2017 the capacity of the Beskow system was extended significantly, thanks to the Scania collaboration. Future funding arising from this collaboration has the potential to increase the size of the new system at PDC substantially, which would mean that larger simulations could also be performed by academic SNIC researchers.

The Procurement Process

As mentioned previously, the procurement process is in full swing. At the time of writing, the initial invitation has been published and sent to a range of potential vendors. In the invitation, we have outlined the type of system we intend to purchase, together with some requirements that any company submitting a tender would need to satisfy.

Once we know which vendors are interested in bidding for the new system, we will provide those companies with detailed specifications and benchmarks (programs the vendors need to run on the system that they propose selling to us). The benchmarks represent a typical workload for the system and have been carefully selected with the help of a scientific reference group. As we mentioned earlier, the system will be installed in two phases, and the same company will be chosen to provide the hardware for both phases (so there is only one company to deal with for maintenance and troubleshooting).

Details of Phase One

The first phase of the new system will consist of a CPU module and a disk module. The CPU module should reach a total High-Performance Linpack (HPL) benchmark performance of at least 2 PFLOPS. The nodes will have Intel x86-64 compatible CPUs and will probably have two CPUs per node. There will be a range of nodes with the same architecture but with differing amounts of memory. Most of the nodes will have 256 GB of memory and will be known as thin nodes, but some nodes will have more memory as shown in the table.

Name of nodes   Memory   Number of nodes
large           512 GB   20
huge            1 TB     8
giant           2 TB     2
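To give a feel for how much memory the larger nodes add up to, the table can be totalled as follows. This is a rough sketch only: it assumes 1 TB = 1024 GB, the names used are hypothetical, and the thin nodes are left out because their number is not stated.

```python
# Node classes from the table: name -> (memory per node in GB, number of nodes).
# ASSUMPTION: 1 TB is taken as 1024 GB for this illustration.
LARGE_MEMORY_NODES = {
    "large": (512, 20),
    "huge": (1024, 8),
    "giant": (2048, 2),
}


def total_node_count(nodes):
    """Number of nodes across all listed classes."""
    return sum(count for _, count in nodes.values())


def total_memory_gb(nodes):
    """Aggregate memory in GB across all listed classes."""
    return sum(mem_gb * count for mem_gb, count in nodes.values())


if __name__ == "__main__":
    for name, (mem_gb, count) in LARGE_MEMORY_NODES.items():
        print(f"{name}: {count} nodes x {mem_gb} GB = {mem_gb * count} GB")
    print("Nodes in total:", total_node_count(LARGE_MEMORY_NODES))
    print("Memory in total (GB):", total_memory_gb(LARGE_MEMORY_NODES))
```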

Our market analysis indicates that the number of cores per CPU is increasing and we expect in the order of 70-100 cores per node.

The disk module will support a Lustre high-speed file system, similar to Klemming, but with a capacity of at least 7 PB. The metadata storage of the disk module will be based entirely on solid-state drives (SSDs), which will speed up metadata operations (such as creating or deleting files). Phase one of the system is expected to be delivered in the first quarter of 2021.

Details of Phase Two

Phase two of PDC’s new system will consist of a further module that is expected to be GPU-based (although it might be based on CPU technology that provides the same level of performance as GPUs). We are allowing for up to four GPUs per CPU. Each CPU should have 128 GB of memory for every GPU attached to it, and each GPU should have at least 32 GB of memory of its own. With the current budget, we expect to reach an HPL performance of at least 5 PFLOPS for this module. The phase two module should be delivered before the end of the first quarter of 2022.
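The memory requirements for phase two scale with the number of GPUs attached to each CPU. The sketch below works through that arithmetic for one to four GPUs per CPU. The function and key names are hypothetical, and the GPU figure is a lower bound, since the text only specifies a minimum of 32 GB per GPU.

```python
# Requirements stated in the text for the phase two module.
HOST_MEMORY_PER_GPU_GB = 128   # CPU memory required per attached GPU
MIN_GPU_MEMORY_GB = 32         # minimum memory on each GPU
MAX_GPUS_PER_CPU = 4           # upper limit allowed for in the procurement


def memory_per_cpu(gpus_per_cpu):
    """Return required host memory and minimum total GPU memory for one CPU."""
    if not 1 <= gpus_per_cpu <= MAX_GPUS_PER_CPU:
        raise ValueError(f"gpus_per_cpu must be between 1 and {MAX_GPUS_PER_CPU}")
    return {
        "host_gb": HOST_MEMORY_PER_GPU_GB * gpus_per_cpu,
        "min_gpu_gb": MIN_GPU_MEMORY_GB * gpus_per_cpu,
    }


if __name__ == "__main__":
    for n in range(1, MAX_GPUS_PER_CPU + 1):
        req = memory_per_cpu(n)
        print(f"{n} GPU(s) per CPU: host {req['host_gb']} GB, "
              f"GPU at least {req['min_gpu_gb']} GB")
```

With the maximum of four GPUs, a single CPU would therefore need 512 GB of host memory alongside at least 128 GB of GPU memory.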

Preparing for the Future

It is important to note that the number of cores per node in the CPU module of the new system will be significantly higher than for Beskow, Tegner and similar systems. Existing applications may therefore need to be changed to expose enough parallelism to run efficiently on the hardware of the first phase. To take full advantage of the high capacity of the accelerators in the second phase, more extensive changes to the software may be required.
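One way to see why more cores per node can call for code changes is Amdahl's law, which bounds the speed-up of a program whose parallel fraction is p when it runs on n cores: 1 / ((1 - p) + p / n). The sketch below is purely illustrative and is not part of PDC's procurement or benchmarks. The parallel fractions and the 32-core comparison point are assumptions chosen for the example, while 100 cores corresponds to the upper end of the range expected per node.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speed-up according to Amdahl's law."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # ASSUMPTIONS: 32 cores is an illustrative comparison point only;
    # the parallel fractions are made-up example values.
    for p in (0.90, 0.95, 0.99):
        for cores in (32, 100):
            s = amdahl_speedup(p, cores)
            print(f"parallel fraction {p:.2f}, {cores:3d} cores: "
                  f"speed-up at most {s:.1f}x")
```

The higher the core count, the more the remaining serial part of a code limits the gain, which is why reducing that serial part matters more on the new nodes.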

Researchers who have access to the source code might be able to undertake this potentially complex task themselves. In other cases, for example with commercial software, we recommend contacting the developers or vendor of the application. While this may sound intimidating, we expect that similar changes will be required to prepare for future exascale systems that will use comparable accelerator technologies. To make the transition easier, PDC plans to offer workshops and assistance to help researchers convert their codes. For further details of the workshops, please join the PDC announcements mailing list, watch our Events page or follow our Facebook page. You are also welcome to ask questions about this or other HPC-related research matters at the PDC cafes.