
Dardel is updated and the GPU partition is open

Published Apr 04, 2023

Dardel has just been updated in several ways. Most importantly, almost the entire system now uses a faster interconnect, Slingshot 11, which runs at 200 Gbit/s instead of the previous 100 Gbit/s. As a consequence, a new software stack with many improvements, called Strawberry, must be used. With these upgrades completed, the GPU partition of Dardel is now open. Unfortunately, the new 2 TB nodes are not yet available, but we hope they will be usable shortly.

To use the updated system

Log in to the dardel.pdc.kth.se login node in the usual way. Scania users should instead log in via dardel-scania.pdc.kth.se.

Load the module PDC/22.06 (ml PDC/22.06) and any other Dardel modules required for your work.
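
For example, a session right after logging in might look like the following sketch (which modules you load beyond PDC/22.06 depends on your work):

uan$ ml PDC/22.06      # load the new software stack
uan$ ml avail          # list the software made visible by the stack
uan$ ml                # show which modules are currently loaded
uan$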

To use the GPU nodes

Log in as above.

Follow the instructions at: www.pdc.kth.se/support/documents/software_development/development_gpu.html
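
As a starting point, a batch job on the GPU nodes could be submitted along the following lines. This is only a minimal sketch, with the allocation name, time limit and program name (my_gpu_program) as placeholders; see the page above for the full details.

uan$ cat gpu_job.sh
#!/bin/bash
#SBATCH --account=<your-allocation>   # replace with your compute allocation
#SBATCH --partition=gpu               # the new GPU partition
#SBATCH --nodes=1
#SBATCH --time=00:10:00
srun ./my_gpu_program                 # placeholder program name
uan$ sbatch gpu_job.sh
..
uan$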

Partitions

All the old partitions, like main, shared, memory and long, can be used, and there is a new partition, gpu, for accessing the GPU nodes.
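
To check the partitions and their current state, sinfo gives a quick summary:

uan$ sinfo -s
..
uan$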

Newer versions of CPE and ROCm on the GPUs

There is now a newer version of the Cray Programming Environment available on Dardel (version 22.12) and a new version of ROCm (5.3.3) available on the GPUs. This combination of software is not officially supported by HPE but seems to work quite well and contains some additional functionality. We recommend these versions to more advanced users.
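
If you want to try the newer combination, the versions should be selectable through the module system. A minimal sketch follows, assuming the modules are named cpe/22.12 and rocm/5.3.3; check ml avail for the exact names on Dardel.

uan$ ml avail cpe rocm   # check which CPE and ROCm versions are installed
uan$ ml cpe/22.12        # assumed module name for the newer CPE
uan$ ml rocm/5.3.3       # assumed module name for the newer ROCm
uan$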

Jobs left in the queues

All jobs that were waiting in the queue prior to the maintenance have been placed in the 'UserHold' state, that is, they are not eligible to start running until you release them.

To see your jobs currently in the queue:

uan$ squeue --me
  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
9876543      main interact username PD  0:00     1 (JobHeldUser)
uan$

or

uan$ squeue -u username
  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
9876543      main interact username PD  0:00     1 (JobHeldUser)
uan$

To release your held jobs (that is, to make them eligible to start):

uan$ scontrol release <job-id-list>
..
uan$
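
If many jobs are held, the pending job IDs from squeue can be fed straight into scontrol release. A minimal sketch: this targets all of your pending jobs, and scontrol simply reports any that were not actually held.

uan$ squeue --me -h -t PD -o "%i" | xargs scontrol release
uan$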

To remove a job from the queue:

uan$ scancel <job-id>
..
uan$

To remove all your jobs from the queue:

uan$ scancel -u <username>
..
uan$