2020-07-31 at 14:45 [klemming]
We have now identified a probable cause for the current "out of space" problems in Klemming, related to how the clients cache data during writes. It is triggered by a change of behavior in the new version, combined with work-arounds for old bugs and a quite full file system. We are currently implementing some configuration changes on Beskow that seem to solve the problem. All jobs starting from now will run on reconfigured nodes. If jobs still fail with "No space left on device", please report this to support.
2020-07-27 at 18:58 [klemming]
We currently have an issue with Klemming causing some I/O operations to fail with ENOSPC(28), "No space left on device". The errors occur from both Beskow and Tegner. Since there is space left on all the servers and no errors are reported in any of the logs, the investigation continues.
2020-07-25 at 10:52 [beskow]
Many blade controllers in one cabinet, c1-0, report errors. The cabinet is being drained of jobs, i.e., running jobs will finish, but new jobs will not get compute nodes in that cabinet.
2020-07-10 at 20:56
Maintenance of the Lustre file system /cfs/klemming/ and of Beskow is mostly complete. Klemming now runs Lustre 2.12, and Beskow has been updated to CLE7.UP02. Cray Programming Environment 20.06 has been added. Beskow and Tegner are open for access again.

As a few applications/jobs behaved unexpectedly after the upgrade, most jobs have been put in 'userhold'. To release your job, type "scontrol release jobid", where jobid is the number of your job. The hold is meant to spare you from having to keep track of a large number of crashed jobs.
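
If you have many held jobs, releasing them one at a time gets tedious. The following is only a sketch of how the release could be scripted, not an official PDC tool: the function name release_jobs and the DRY_RUN variable are made up for illustration, and only "scontrol release" itself comes from the text above.

```shell
#!/bin/sh
# Hypothetical helper (not provided by PDC): release each job ID given as
# an argument. With DRY_RUN=1 the commands are only printed, not executed.
release_jobs() {
    for jobid in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "scontrol release $jobid"
        else
            scontrol release "$jobid"
        fi
    done
}
```

For example, "DRY_RUN=1 release_jobs 1234567 1234568" prints the two commands without running them. One possible way to list your own held jobs is "squeue -u $USER -h -t PD -o '%i %r'", which prints the job ID and the pending reason of each pending job; we assume here that held jobs show the reason JobHeldUser, so please check the output before releasing anything.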

We will investigate which libraries/dependencies are not working satisfactorily.
