Events:

2024-07-04 at 16:21 [dardel]
New job starts have resumed. A subset of jobs started earlier today experienced access issues under /cfs/klemming/projects/supr/; this should now have been remedied.
2024-07-04 at 15:00 [dardel]
A temporary block on new job starts is in place while we investigate a glitch that has affected a subset of all job starts.
2024-06-28 at 10:51 [dardel]
The login nodes login1 and login4 (Thinlinc) will be restarted next Wednesday, 2024-07-03, at 13:00 CEST (11:00 UTC). You will be logged out from the login nodes. The operation is expected to take less than 30 minutes. Running jobs will not be affected.
2024-06-14 at 12:46 [dardel]
Most commands, file accesses, et cetera got stuck roughly half an hour ago and remained in that state for roughly 20 minutes. The actual root cause still needs to be investigated.
2024-06-11 at 17:35 [dardel]
The Dardel GPU partition is now back in operation.

There might still be some issues, though; please report problems to support@pdc.kth.se.

2024-05-29 at 18:35 [dardel]
Due to a configuration mistake, jobs may have failed with an error reported by slurmstepd indicating a missing task prolog at the start of a job step, e.g. an srun invocation:

error: run_command: slurm task_prolog can not be executed (/etc/slurm/omnivector-task_prolog.sh) No such file or directory

Multi-step jobs may have failed partially completed; please inspect the job output before resubmitting. Failed jobs have to be resubmitted to the queue. We apologize for this inconvenience.
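As a rough sketch of one way to see how far a multi-step job got (the job ID below is a placeholder), Slurm's sacct command can list the state and exit code of each job step:

    sacct -j <jobid> --format=JobID,JobName,State,ExitCode

Steps shown as FAILED or CANCELLED, or with a non-zero exit code, indicate where the job stopped.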

2024-05-28 at 15:06 [dardel]
Dardel (CPU) is now back in operation; login and jobs are enabled. We are experiencing some problems with the GPU partition, so Dardel GPU is not yet back in production. We are working to resolve the issues as quickly as possible, and a new flashnews message will be sent out as soon as Dardel GPU is operational.

Please note that, as announced earlier, the upgraded system will have an upgraded software stack with, e.g., a newer version of the Cray programming environment (CPE 23.12), so applications may need to be recompiled or reconfigured.
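As a minimal sketch of what recompiling can look like (the module name and source file below are assumptions, not exact instructions for Dardel), one would typically load the desired CPE release and rebuild with the Cray compiler wrappers:

    module load cpe/23.12        # assumed name of the CPE 23.12 metamodule
    cc -O2 -o mycode mycode.c    # cc/CC/ftn are the Cray wrappers for C/C++/Fortran

Applications built with configure or CMake may also need to be reconfigured so that the new compiler and library paths are picked up.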

2024-05-27 at 11:00 [dardel]
The Dardel upgrade is taking a little longer than expected. We are now testing the system, and provided everything goes well, we should be able to open it to all users sometime tomorrow (Tuesday, May 28).
2024-05-15 at 19:31 [dardel]
The entity in charge of the Slurm (batch system) master daemon ran into issues a couple of hours ago and has been restarted. Operation should be back to normal.
2024-05-13 at 11:44 [dardel]
The move to the HPE Performance Cluster Manager (HPCM) software, the update of the Cray Operating System (COS) on all nodes, and the update of the Cray Programming Environment (CPE) software are expected to start on Monday, 2024-05-20, around 08:00 in the morning.

Expect the system to be unavailable throughout the week. (corrected: name of month above)

2024-04-20 at 18:21 [dardel]
Singularity and other container-based jobs can now be used again on the compute nodes.

We also aim to restart the login nodes this coming Tuesday, the 23rd of April, around 09:00. After the restart, containers will work on those too.

All newly started jobs will run on compute nodes where the Lustre file-system client has been updated.

2024-04-11 at 21:29 [dardel]
Earlier today, starting around 15:15 and for a couple of hours, some jobs may have been affected by picking up a void default PDC module. Jobs specifying an explicit PDC/version number should not have been affected.
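As a hedged sketch of how to avoid relying on the default (the version string below is a placeholder), a job script can load the PDC module with an explicit version instead of the bare name:

    module load PDC/<version>    # pins an explicit version
    # rather than
    module load PDC              # relies on the default, which was briefly void
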
2024-04-05 at 17:39 [dardel]
Containers using user namespaces are disabled until further notice. This means that, for example, Singularity will likely not work, nor will other programs such as Firefox.
2024-03-18 at 19:29 [dardel]
The dardel login node is having issues and is being rebooted again.

Until the issue is resolved or an update is available, we ask all users to refrain from actions other than submitting and checking jobs and editing plain files.

Please avoid spawning a new ssh session every other second, initiating massive file transfers, or starting I/O-intensive, multi-CPU/multi-task heavy pre/post-processing analysis of very large data sets, et cetera.

2024-03-18 at 14:34 [dardel]
The dardel login node is having issues and is being rebooted.
2024-03-14 at 16:30 [klemming]
The server-side upgrade of the Klemming and Scania file systems is now done. Job starts have been resumed. Please report anything out of the ordinary.
2024-03-13 at 15:24 [klemming]
Tomorrow, March 14th, starting at 10AM CET, we will upgrade the Klemming and Scania file systems to a version that should fix the server side bug. The file systems should stay available during the procedure, but a number of shorter freezes will occur. Already running jobs will be allowed to continue, but new job starts will be delayed during the procedure to minimize the risk of jobs being disturbed.
2024-03-11 at 11:25 [dardel]
The Lustre server side of /cfs/klemming/ was restarted shortly after 08:00 this morning.

Any job running and making use of /cfs/klemming/ between roughly 2024-03-10 20:00 and the restart this morning was likely affected, potentially becoming completely stuck.

As many compute nodes have been flagged as being in poor shape and are not running any jobs, we will take this opportunity to restart them with a bug fix (CAST-35315) aimed at the Lustre _client_ kernel bug.

The Lustre server-side bug remains.

Newly started jobs will run on compute nodes where the Lustre client kernel bug is fixed.

2024-03-10 at 22:07 [dardel]
Login nodes and compute nodes lost contact with important parts of the server side of /cfs/klemming/ roughly an hour ago.

The ability to log in and to access files is seriously affected, as is likely any running job needing access to Klemming.
2024-02-27 at 15:05 [dardel]
The file system issue has been identified, and we are awaiting a fix from the vendor. Jobs are slowly being started, but please be aware that there is a risk of further outages until the fix has been delivered and applied. We are really sorry for the inconvenience.
2024-02-24 at 20:07 [dardel]
Serious file system problems; job starts have been disabled again, and an investigation is ongoing.
2024-02-24 at 18:12 [dardel]
System maintenance is done, and Dardel has been running jobs for a few hours.
2024-02-14 at 18:00 [dardel]
Issues related to flapping network connectivity between file servers and compute clients have been addressed. Job starts resumed half an hour ago.

Please be aware that the forthcoming extensive update on 2024-02-19 and the internal bug in the Lustre file system both remain.

Important info can be found at issues/update.

2024-02-12 at 20:55 [dardel]
As issues continue (also involving flapping connectivity between file servers and clients), no jobs residing under /cfs/klemming will be allowed to enter the running state.

Please find more information on this, and on the forthcoming update starting 2024-02-19, at issues/update.

2024-02-12 at 18:30 [dardel]
Status of the ongoing serious issues regarding the Lustre client (/cfs/klemming), and of the forthcoming extensive upgrade starting 2024-02-19, can be found at issues/update.
2024-02-05 at 14:58 [dardel]
After the updates last week (starting Wednesday 2024-01-31), many applications have hit what seems to be an internal bug in the Lustre file-system client.

Typically this manifests itself through jobs not terminating or finishing properly. Nodes get stuck in the 'completing' state for long periods after a job finishes, and other jobs fail to start up properly on all nodes.

Several applications seem to be hit by the bug; however, 'vasp' applications seem particularly unfortunate.

Work to apply a workaround is ongoing.

2024-02-01 at 13:47 [dardel]
The system software update is finished, and jobs have been running for a while.

Please find the description of the updates.

Please be patient with new login sessions, as it takes time to populate the user's private module cache.

2024-01-24 at 16:16 [dardel]
The system will be unavailable due to a system software upgrade starting Wednesday 2024-01-31 at 10:00. The work is estimated to be finished within two days, with the system available again on Friday 2024-02-02. More information will follow at the beginning of next week.

Please see the announcement Dardel being updated starting on 31 January.

2024-01-16 at 12:13 [dardel]
POD and workload manager restart/restore complete.
2024-01-16 at 11:27 [dardel]
The POD hosting the Slurm workload manager master daemon experienced a failure roughly half an hour ago. A restart/restore is in progress.