Events:

2024-07-04 at 16:21 [dardel]
New job starts have resumed. A subset of jobs started earlier today experienced access issues under /cfs/klemming/projects/supr/; this should now have been remedied.
2024-07-04 at 15:00 [dardel]
A temporary block on new job starts is in place while we investigate a glitch that has affected a subset of all job starts.
2024-06-28 at 10:51 [dardel]
The login nodes login1 and login4 (Thinlinc) will be restarted next Wednesday, 2024-07-03, at 13:00 CEST (11:00 UTC). You will be logged out from the login nodes. The operation is expected to take less than 30 minutes. Running jobs will not be affected.
2024-06-14 at 12:46 [dardel]
Most commands, file accesses, et cetera got stuck roughly half an hour ago and remained in that state for roughly 20 minutes. The actual root cause still needs to be investigated.
2024-06-11 at 17:35 [dardel]
The Dardel GPU partition is now back in operation.

There might still be some issues, though; please report problems to support@pdc.kth.se.

2024-05-29 at 18:35 [dardel]
Due to a configuration mistake, jobs may have failed with an error reported by slurmstepd indicating a missing task prolog at the start of a job step, e.g. an srun invocation:

error: run_command: slurm task_prolog can not be executed (/etc/slurm/omnivector-task_prolog.sh) No such file or directory

Multi-step jobs may have failed partially completed; please inspect the job output before resubmitting. Failed jobs have to be resubmitted to the queue. We apologize for this inconvenience.
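As a rough sketch of one way to see how far a multi-step job got (the job ID below is a placeholder), Slurm's sacct command can list the state and exit code of each job step:

    sacct -j <jobid> --format=JobID,JobName,State,ExitCode

Steps shown as FAILED or CANCELLED, or with a non-zero exit code, indicate where the job stopped.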

2024-05-28 at 15:06 [dardel]
Dardel (CPU) is now back in operation; login and jobs are enabled. We are experiencing some problems with the GPU partition, so Dardel GPU is not yet back in production. We are working to resolve the issues as quickly as possible, and a new flashnews message will be sent out as soon as Dardel GPU is operational.

Please note that, as announced earlier, the upgraded system will have an upgraded software stack with, e.g., a newer version of the Cray programming environment (CPE 23.12), so applications may need to be recompiled or reconfigured.
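As a minimal sketch of what recompiling can look like (the module name and source file below are assumptions, not exact instructions for Dardel), one would typically load the desired CPE release and rebuild with the Cray compiler wrappers:

    module load cpe/23.12        # assumed name of the CPE 23.12 metamodule
    cc -O2 -o mycode mycode.c    # cc/CC/ftn are the Cray wrappers for C/C++/Fortran

Applications built with configure or CMake may also need to be reconfigured so that the new compiler and library paths are picked up.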

2024-05-27 at 11:00 [dardel]
The Dardel upgrade is taking a little longer than expected. We are now testing the system, and provided everything goes well, we should be able to open it to all users sometime tomorrow (Tuesday, May 28).
2024-05-15 at 19:31 [dardel]
The entity in charge of the Slurm (batch system) master daemon ran into issues a couple of hours ago and has been restarted. Operation should be back to normal.
2024-05-13 at 11:44 [dardel]
The move to the HPE Performance Cluster Manager (HPCM) software, the update of the Cray Operating System (COS) on all nodes, and the update of the Cray Programming Environment (CPE) software are expected to start on Monday, 2024-05-20, around 08:00 in the morning.

Expect the system to be unavailable throughout the week. (corrected: name of month above)

2024-04-20 at 18:21 [dardel]
Singularity and other container-based jobs can now be used again on the compute nodes.

We also aim to restart the login nodes this coming Tuesday, the 23rd of April, around 09:00. After the restart, containers will work on those too.

All newly started jobs will run on compute nodes where the Lustre file-system client has been updated.

2024-04-11 at 21:29 [dardel]
Earlier today, starting around 15:15 and for a couple of hours, some jobs may have been affected by picking up a void default PDC module. Jobs specifying an explicit PDC/version number should not have been affected.
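As a hedged sketch of how to avoid relying on the default (the version string below is a placeholder), a job script can load the PDC module with an explicit version instead of the bare name:

    module load PDC/<version>    # pins an explicit version
    # rather than
    module load PDC              # relies on the default, which was briefly void
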
2024-04-05 at 17:39 [dardel]
Containers using user namespaces are disabled until further notice. This means that, for example, Singularity will likely not work, nor will other programs such as Firefox.
2024-03-18 at 19:29 [dardel]
The dardel login node is having issues and is being rebooted again.

Until the issue is resolved or an update is available, we ask all users to refrain from actions other than submitting and checking jobs and editing plain files.

Please avoid spawning a new ssh session every other second, initiating massive file transfers, or starting I/O-intensive, multi-CPU/multi-task heavy pre/post-processing analysis of very large data sets, et cetera.

2024-03-18 at 14:34 [dardel]
The dardel login node is having issues and is being rebooted.
2024-03-14 at 16:30 [klemming]
The server-side upgrade of the Klemming and Scania file systems is now done. Job starts have been resumed. Please report anything out of the ordinary.
2024-03-13 at 15:24 [klemming]
Tomorrow, March 14th, starting at 10AM CET, we will upgrade the Klemming and Scania file systems to a version that should fix the server side bug. The file systems should stay available during the procedure, but a number of shorter freezes will occur. Already running jobs will be allowed to continue, but new job starts will be delayed during the procedure to minimize the risk of jobs being disturbed.
2024-03-11 at 11:25 [dardel]
The Lustre server side of /cfs/klemming/ was restarted shortly after 08:00 this morning.

Any job running and making use of /cfs/klemming/ between roughly 2024-03-10 20:00 and the restart this morning was likely affected, potentially becoming completely stuck.

As many compute nodes have been flagged as being in poor shape and are not running any jobs, we will take this opportunity to restart them with a bug fix (CAST-35315) aimed at the Lustre _client_ kernel bug.

The Lustre server-side bug remains.

Newly started jobs will run on compute nodes where the Lustre client kernel bug is fixed.

2024-03-10 at 22:07 [dardel]
Login nodes and compute nodes lost contact with important parts of the server side of /cfs/klemming/ roughly an hour ago.

The ability to log in and to access files is seriously affected, as is likely any running job needing access to Klemming.
2024-02-27 at 15:05 [dardel]
The file system issue has been identified, and we are awaiting a fix from the vendor. Jobs are slowly being started, but please be aware that there is a risk of further outages until the fix has been delivered and applied. We are really sorry for the inconvenience.
2024-02-24 at 20:07 [dardel]
Serious file system problems; job starts have been disabled again, and an investigation is ongoing.
2024-02-24 at 18:12 [dardel]
System maintenance is done, and Dardel has been running jobs for a few hours.
2024-02-14 at 18:00 [dardel]
Issues related to flapping network connectivity between file servers and compute clients have been addressed. Job starts resumed half an hour ago.

Please be aware that the forthcoming extensive update on 2024-02-19 and the internal bug in the Lustre file system both remain.

Important info can be found at issues/update.

2024-02-12 at 20:55 [dardel]
As issues continue (also involving flapping connectivity between file servers and clients), no jobs residing under /cfs/klemming will be allowed to enter the running state.

Please find more information on this, and on the forthcoming update starting 2024-02-19, at issues/update.

2024-02-12 at 18:30 [dardel]
Status of the ongoing serious issues regarding the Lustre client (/cfs/klemming), and of the forthcoming extensive upgrade starting 2024-02-19, can be found at issues/update.
2024-02-05 at 14:58 [dardel]
After the updates last week (starting Wednesday 2024-01-31), many applications have hit what seems to be an internal bug in the Lustre file-system client.

Typically this manifests itself through jobs not terminating or finishing properly. Nodes get stuck in the 'completing' state for long periods after a job finishes, and other jobs fail to start up properly on all nodes.

Several applications seem to be hit by the bug; however, 'vasp' applications seem particularly unfortunate.

Work to apply a workaround is ongoing.

2024-02-01 at 13:47 [dardel]
The system software update is finished, and jobs have been running for a while.

Please find the description of the updates.

Please be patient with new login sessions, as it takes time to populate the user's private module cache.

2024-01-24 at 16:16 [dardel]
The system will be unavailable due to a system software upgrade starting Wednesday 2024-01-31 at 10:00. The work is estimated to be finished within two days, with the system available again on Friday 2024-02-02. More information will follow at the beginning of next week.

Please see the announcement Dardel being updated starting on 31 January.

2024-01-16 at 12:13 [dardel]
POD and workload manager restart/restore complete.
2024-01-16 at 11:27 [dardel]
The POD hosting the Slurm workload manager master daemon experienced a failure roughly half an hour ago. A restart/restore is in progress.