2020-10-27 at 18:44 [klemming]
Around 19:00 today we will perform a fail-over of the metadata server of Klemming, in the hope of releasing some hanging operations. File system access will pause while this is done. We expect the procedure to take a couple of minutes, but it might take a bit longer if there are any complications due to the current state of the file system.
2020-10-12 at 10:23 [klemming]
Two servers in Klemming have crashed again and have now been restarted. Job start is currently suspended while we investigate the cause and check the health of the system. Initial checks indicate similarities with previous crashes, and we have managed to collect more information this time. We hope to gradually let jobs start again during the day, while we continue with the investigation.
2020-10-06 at 22:04 [klemming]
Jobs have been starting on Tegner for a couple of hours, and gradually on Beskow for roughly an hour. As the root cause of the crashes is still unknown, the investigation continues.
2020-10-06 at 15:25 [klemming]
We have now started trying to bring Klemming back online. If this is successful, we will start jobs first on Tegner and later on Beskow. The root cause of the crashes is still unknown, and the investigation continues.
2020-10-06 at 09:05 [klemming]
During the night, around 01:20, the second server in the fail-over pair also crashed. This means that many operations to Klemming are now hanging. This affects both batch jobs and interactive use on Beskow and Tegner, and the batch queues on both systems have been stopped while the investigation continues.
2020-10-05 at 17:15 [klemming]
One server in Klemming crashed about 2 hours ago, for an as yet unknown reason. There seem to have been some complications during the fail-over, which might have caused I/O problems for some nodes/jobs.
2020-10-04 at 10:40 [beskow]
There were problems with nodes in cabinet c2-1 yesterday, 2020-10-03, starting at about 23:02. Jobs using nodes in cabinet c2-1 (nid01536 through nid01727) have likely been affected.
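To check whether a job used any node in the affected range, something along these lines may help. It is only a minimal sketch: the range nid01536 through nid01727 comes from the notice above, while the helper names and the example node list are hypothetical and need to be replaced with the node list of your own job.

```python
import re

# Affected range taken from the notice above: nid01536 through nid01727.
FIRST_AFFECTED = 1536
LAST_AFFECTED = 1727


def node_number(name):
    """Return the integer part of a node name such as 'nid01536'."""
    match = re.fullmatch(r"nid(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognized node name: {name!r}")
    return int(match.group(1))


def affected_nodes(node_names):
    """Return the node names that fall inside the affected range."""
    return [
        name for name in node_names
        if FIRST_AFFECTED <= node_number(name) <= LAST_AFFECTED
    ]


# Hypothetical node list for a single job; substitute your own.
job_nodes = ["nid01535", "nid01536", "nid01600", "nid01727", "nid01728"]
print(affected_nodes(job_nodes))
# prints ['nid01536', 'nid01600', 'nid01727']
```

An empty result means none of the listed nodes were in cabinet c2-1 according to the range given above.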
2020-09-29 at 22:28 [klemming]
The old disk system used for storing metadata in Klemming has physically been replaced with a new one, which is now in use.

On Beskow, PatchSet 05 has been applied. Cray Developer Toolkit/Programming Environment 20.09 has been added, including: cce/10.0.3 cdt/20.09 cray-fftw/ cray-hdf5/ cray-hdf5-parallel/ cray-jemalloc/ cray-libsci/20.09.1 cray-mpich/7.7.16 cray-mpich-abi/7.7.16 cray-netcdf/ cray-netcdf-hdf5parallel/ cray-openshmemx/9.1.2 cray-parallel-netcdf/ craype/2.7.1 cray-petsc/ cray-petsc-64/ cray-petsc-complex/ cray-petsc-complex-64/ cray-python/ cray-R/ cray-shmem/7.7.16 cray-stat/4.6.3(default) cray-trilinos/ gcc/10.1.0 papi/ perftools-base/20.09.0 pmi/5.0.17 pmi-lib/5.0.17 PrgEnv-cray/6.0.9 PrgEnv-gnu/6.0.9 PrgEnv-intel/6.0.9 valgrind4hpc/2.7.2(default)

Logins are enabled on Tegner and Beskow, and jobs will gradually start to execute.

