2024-02-24 at 20:07 [dardel]
Serious file system problems; job starts have been disabled again and an investigation is ongoing.
2024-02-24 at 18:12 [dardel]
System maintenance is done; Dardel has been running jobs for a few hours.
2024-02-14 at 18:00 [dardel]
Issues related to flapping network connectivity between file-servers and compute clients have been addressed. Job starts resumed half an hour ago.

Please be aware that the extensive update starting 2024-02-19 is still scheduled, and that the internal bug in the Lustre file-system remains.

Important information can be found at issues/update.

2024-02-12 at 20:55 [dardel]
As issues continue (also involving flapping connectivity between file-servers and clients), no jobs residing under /cfs/klemming will be allowed to enter the running state.

More information on this, and on the forthcoming update starting 2024-02-19, can be found at issues/update.

2024-02-12 at 18:30 [dardel]
The status of the ongoing serious issues with the Lustre client (/cfs/klemming), and of the forthcoming extensive upgrade starting 2024-02-19, can be found at issues/update.
2024-02-05 at 14:58 [dardel]
After the updates last week (starting Wednesday 2024-01-31), many applications have hit what seems to be an internal bug in the Lustre file-system client.

Typically, this manifests itself as jobs not terminating properly: nodes get stuck in the 'completing' state for long periods after a job finishes. Other jobs fail to start up properly on all nodes.

Several applications seem to be hit by the bug, although 'vasp' applications appear to be affected more than others.

Work to apply a workaround is ongoing.

