Further Dardel upgrades in 2023
Note that research groups will need to use recompiled application software when running jobs on Dardel nodes with the faster Slingshot 11 interconnect and the Strawberry software stack. Groups can start testing and running software on those nodes now; see below for a link to the instructions.
In recent months, Dardel has undergone a number of expansions and upgrades. The most significant is the installation of 468 CPU nodes and 56 graphics processing unit (GPU) nodes, which are now in operation and undergoing acceptance tests. The new nodes use an improved 200 Gbps interconnect called Slingshot 11. The nodes that were installed in the first phase of Dardel use an earlier version of the interconnect known as Slingshot 10, and the plan is to upgrade the entire first phase of Dardel to the faster Slingshot 11 interconnect as well.
Using the Slingshot 11 interconnect requires a new software stack, called Strawberry, which has been installed on Dardel. This means that all relevant application software will need to be recompiled before it can be used on the nodes with Slingshot 11. Any software on Dardel that was installed by PDC will, of course, be updated by PDC, but any software that was installed by a research group will need to be recompiled by the group. PDC has already compiled and tested the most commonly used software packages on the new nodes that use Slingshot 11. Seldom-used software and older versions of software will be installed on demand, if possible.
The rest of the Dardel system will be upgraded to Slingshot 11 in the first half of February. Some nodes may, however, be kept on the Slingshot 10 network for some time if that is required to keep all the software applications running. When the first phase of Dardel is being upgraded to use Slingshot 11, it should be possible for research groups to use the new Dardel nodes (which already use Slingshot 11) while PDC is updating the remaining parts. Consequently, there should not be any need for the entire system to be down at any time during this upgrade.
Instructions on how to compile and run software on the Slingshot 11 and Strawberry partition of the system (which, at the moment, is the part of the Dardel system with the recently installed CPUs and GPUs) can be found at "How to test software on the parts of Dardel that have Slingshot 11". Research groups can start doing this right away. The Slingshot 11 partition will be expanded as more of the Dardel nodes are converted to Slingshot 11. It is estimated that most of the nodes will be converted by mid-February.
Later this year, several other upgrades to Dardel will be required. Here are the most important ones.
- The water cooling in the older part of the system uses inhibitors to stop microbial or other growth in the cooling water. HPE recently discovered that the chemicals used as inhibitors are not permitted in the European Union, so the cooling water that contains these inhibitors must be replaced with a mix of water and glycol. This will require the older part of the Dardel system to be shut down for around a week.
- Due to the upgrades, the high-speed network is not balanced and requires some re-cabling and additions. This will involve shutting down the system for several days.
- The size of the Lustre file system will be expanded by 50%, and the metadata speed will also be increased by 50%. These changes will require some downtime. The software for the disk system will also be upgraded to a new version which includes many updates, such as optimised TCP/IP access. (For more details, see the "Dardel Fastest in Sweden" article from the previous PDC Newsletter.)
- The GPU software of Dardel is continuously being developed and improved by AMD and HPE. To take advantage of the latest improvements as they are released, new versions of the Dardel software stack will need to be installed on a regular basis.
PDC, together with HPE, is working hard to find ways to combine some of the above upgrades so as to minimise downtime and maximise access to Dardel. The risk that unforeseen problems may arise during the above upgrades is higher than usual, so please be aware that unscheduled downtime may occur at short notice during these periods. PDC will, of course, strive to provide as much information as possible, as early as possible, if the system or some parts of it need to be shut down temporarily due to these upgrades.