Events:

2020-12-17 at 11:24 [beskow]

Tomorrow, Friday 2020-12-18, at 08:00 one Beskow cabinet will be taken off-line while an Environmental Control assembly and a Blower Control assembly are replaced. The procedure is intended to be non-intrusive.

2020-12-02 at 15:07 [beskow]

Earlier today, nodes in cabinet c1-1 experienced what was likely a power loss. Manual inspection will take place in the late afternoon/early evening, but a tripped wall circuit breaker is the likely cause. Jobs executing on nodes nid0[1344-1535] earlier today have likely failed.
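To check whether one of your jobs ran on the affected cabinet, the compact range nid0[1344-1535] can be expanded into individual node names. The following is a minimal Python sketch, not PDC tooling: the helper names expand_node_range and affected_nodes are hypothetical, and it only handles the single prefix[first-last] form used in this notice.

```python
import re


def expand_node_range(expr):
    """Expand a compact range such as 'nid0[1344-1535]' into node names.

    Assumption: only the single 'prefix[first-last]' form is supported;
    general hostlist syntax (comma lists, several brackets) is not.
    """
    match = re.fullmatch(r"([A-Za-z0-9]*)\[(\d+)-(\d+)\]", expr)
    if match is None:
        raise ValueError(f"unsupported node expression: {expr!r}")
    prefix, first, last = match.groups()
    width = len(first)  # preserve zero padding of the bracketed number
    return [f"{prefix}{n:0{width}d}" for n in range(int(first), int(last) + 1)]


def affected_nodes(job_nodes, outage_expr="nid0[1344-1535]"):
    """Return the sorted subset of job_nodes that lies inside the outage range."""
    outage = set(expand_node_range(outage_expr))
    return sorted(set(job_nodes) & outage)


if __name__ == "__main__":
    nodes = expand_node_range("nid0[1344-1535]")
    print(len(nodes), nodes[0], nodes[-1])
    # The node list below is made up for illustration.
    print(affected_nodes(["nid01343", "nid01344", "nid01600"]))
```

If affected_nodes returns a non-empty list for the nodes your job was allocated, the job most likely needs to be resubmitted.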

2020-12-01 at 08:18 [beskow]

The motherboard on the SMW, System Management Workstation, will be replaced today, 2020-12-01, starting around 10:00. During the replacement no new jobs will be allowed to start. The replacement procedure is expected to take on the order of an hour or a few hours.

2020-11-27 at 10:03 [beskow]

Overnight the SMW, System Management Workstation, suffered a hardware failure and shut down. It has been restarted and has been running for close to an hour. For many jobs this was not noticeable; however, some jobs have gotten 'stuck.' This is possibly because reporting of, and reaction to, hardware events piled up while the SMW was unresponsive.

2020-11-23 at 23:33 [beskow]

So far no 'smoking gun' has been found. Over the past hour(s) several test-jobs have successfully been executed on the system. A fairly large number of jobs seem to have failed to execute, starting ~19:50, with a peak at ~20:10. If you have experienced anything out of the ordinary, please let us know. General job starts are enabled.

2020-11-23 at 21:16 [beskow]

The scheduler has been sluggish for roughly 70 minutes. Listing jobs in the queue, submitting new jobs &c hang for a long time or time out. Investigation in progress.

2020-10-27 at 18:44 [klemming]

Around 19:00 today we will do a fail-over operation on the metadata server of Klemming, to hopefully release some hanging operations. This means that there will be a pause in file system access at this time. We expect the procedure to take a couple of minutes, but it might take a bit longer if there are any complications due to the current state of the file system.

2020-10-12 at 10:23 [klemming]

Two servers in Klemming have recently crashed again and have now been restarted. Job start is currently suspended while we investigate the cause and check the health of the system. Initial checks indicate similarities with previous crashes, and we have managed to collect more information this time. Hopefully we can gradually start letting jobs start again during the day, while we continue with the investigation.

2020-10-06 at 22:04 [klemming]

Jobs have been starting on Tegner for a couple of hours, and gradually on Beskow for roughly an hour. As the root cause of the crashes is still unknown, investigation continues.

2020-10-06 at 15:25 [klemming]

We have now started trying to bring Klemming back on-line. If this is successful, we will start jobs first on Tegner and later on Beskow. The root cause of the crashes is still unknown; investigation continues.

2020-10-06 at 09:05 [klemming]

During the night, around 01:20, the second server in the fail-over pair also crashed. This means that many operations to Klemming are now hanging. This affects both batch jobs and interactive use on Beskow and Tegner, and the batch queues on both systems have been stopped while the investigation continues.

2020-10-05 at 17:15 [klemming]

One server in Klemming crashed about 2 hours ago, for an as yet unknown reason. There seem to have been some complications during the fail-over, which might have caused IO problems for some nodes/jobs.

2020-10-04 at 10:40 [beskow]

Problems occurred with nodes in cabinet c2-1 yesterday, 2020-10-03, starting about 23:02. Jobs with nodes in cabinet c2-1 (nid01536 through nid01727) have likely been affected.

2020-09-29 at 22:28 [klemming]

The old disk system used for storing metadata in Klemming has physically been replaced with a new one, which is now in use.

On Beskow PatchSet 05 has been applied. Cray Developer Toolkit/Programming Environment 20.09 has been added, including: cce/10.0.3 cdt/20.09 cray-fftw/3.3.8.8 cray-hdf5/1.12.0.0 cray-hdf5-parallel/1.12.0.0 cray-jemalloc/5.1.0.3 cray-libsci/20.09.1 cray-mpich/7.7.16 cray-mpich-abi/7.7.16 cray-netcdf/4.7.4.0 cray-netcdf-hdf5parallel/4.7.4.0 cray-openshmemx/9.1.2 cray-parallel-netcdf/1.12.1.0 craype/2.7.1 cray-petsc/3.13.3.0 cray-petsc-64/3.13.3.0 cray-petsc-complex/3.13.3.0 cray-petsc-complex-64/3.13.3.0 cray-python/3.8.5.0 cray-R/4.0.2.0 cray-shmem/7.7.16 cray-stat/4.6.3(default) cray-trilinos/12.18.1.1 gcc/10.1.0 papi/6.0.0.2 perftools-base/20.09.0 pmi/5.0.17 pmi-lib/5.0.17 PrgEnv-cray/6.0.9 PrgEnv-gnu/6.0.9 PrgEnv-intel/6.0.9 valgrind4hpc/2.7.2(default)

On Tegner and Beskow logins are enabled, and jobs will gradually start to execute.

2020-09-25 at 13:22

Reminder: service window starting Monday 2020-09-28 at 09:00 affecting Klemming, Tegner, and Beskow.

2020-09-22 at 16:34

Tegner now on-line also.

2020-09-22 at 14:25

Klemming on-line, and Beskow on-line. Tegner to follow.

2020-09-22 at 10:50

A full system stop/start of Beskow, Tegner, and Klemming will take place shortly, 11:00 2020-09-22. A small number of currently running jobs will be prematurely terminated.

2020-09-21 at 23:32

Unfortunately the gradual job start caused file-system accesses to lock up after a few jobs had started, so further job starts have been prevented. We will now call it a day. We are considering a full restart of all systems tomorrow.

2020-09-21 at 22:45

Very early this morning two of the Klemming servers crashed. The first one crashed due to very high IO load, and the second one due to a software bug while trying to handle the failover of the first server. During today many Beskow compute nodes have had problems recovering properly after the server restarts; most of them have now been rebooted on a rolling scheme. We will now gradually let new Beskow jobs start.

2020-09-20 at 01:10

In the early morning hours of Sunday, a PCI hardware error occurred inside the virtualization server used for several services run by PDC. Among these were several non-redundant license servers for several products. Restart of services was started Monday morning. As of Monday 13:30 all licenses should be available again; please contact PDC if you notice any that are still missing.

2020-09-21 at 08:17 [klemming]

Accesses to filesystem /cfs/klemming/ seem to experience problems and freeze. Aside from production job runs this also affects logins in many cases.

2020-09-20 at 17:01 [beskow]

No single root cause of the sluggishness was found. A very large number of jobs that failed/terminated immediately after being spawned did add substantially to the response time.

2020-09-20 at 12:42 [beskow]

The batch system/slurm seems to have issues, being sluggish/non-responsive. Investigation in progress.

2020-09-18 at 13:37 [klemming]

A service window is planned starting Monday 2020-09-28 at 09:00 where the Klemming file system will not be available. This implies that access to Beskow and Tegner also is affected. All systems are expected to be back Tuesday 2020-09-29, in the evening.

During the service window we will physically replace the disk system used for storing metadata in Klemming with a bigger and faster one, and then transfer the current metadata and reconfigure the file system to fully utilize the new metadata disk system. Most of this work will be done by the file system vendor. No files should be harmed during this procedure, but, as always, there is no backup of any data in Klemming, so please make sure that you also have a copy of your important data somewhere else.

We will also take this opportunity to install the latest patchsets of Beskow system software, add the latest Cray Developer Toolkit/Programming Environment versions, and make some Lustre configuration changes on Tegner.

2020-08-22 at 11:40 [beskow]

During a warmswap operation in preparation for a routine CPU replacement, cabinet c2-1 unexpectedly lost its HSN, high speed network. All jobs running with nodes in cabinet c2-1 (nodes nid01536 through nid01727) around 10:32 today have likely failed.

2020-07-31 at 14:45 [klemming]

We have now identified a probable cause for the current "out of space" problems in Klemming, related to how the clients cache data during writes. It is triggered by a change of behavior in the new version, combined with work-arounds for old bugs and a quite full file system. We are currently implementing some configuration changes on Beskow that seem to solve the problem. All jobs starting from now will run on reconfigured nodes. If jobs still fail with "No space left on device", please report this to support.

2020-07-27 at 18:58 [klemming]

We currently have an issue with Klemming causing some IO operations to fail with ENOSPC(28), "No space left on device". The errors occur both from Beskow and Tegner. Since there is space left on all the servers, and no errors reported in any of the logs, the investigation continues.
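To make such failures easy to recognise in a job's own output, a program can check for this specific error code when a write fails. The following is a minimal Python sketch under stated assumptions: the helper names is_no_space_error and write_with_report are hypothetical, and the code only illustrates how ENOSPC surfaces as an OSError, not how the underlying problem is fixed.

```python
import errno
import os


def is_no_space_error(exc):
    """Return True if exc is an OSError carrying ENOSPC (28 on Linux)."""
    return isinstance(exc, OSError) and exc.errno == errno.ENOSPC


def write_with_report(path, data):
    """Write bytes to path; report ENOSPC clearly, re-raise other errors.

    Returns True on success and False if the write failed with ENOSPC.
    """
    try:
        with open(path, "wb") as handle:
            handle.write(data)
            handle.flush()
            # Write errors can be deferred; fsync makes them surface here.
            os.fsync(handle.fileno())
        return True
    except OSError as exc:
        if is_no_space_error(exc):
            print(f"ENOSPC({exc.errno}) writing {path}: {os.strerror(exc.errno)}")
            return False
        raise
```

A message of this form, together with the job id and the time of the failure, is the kind of detail that is useful to include when reporting the problem.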

2020-07-25 at 10:52 [beskow]

Many blade controllers in one cabinet, c1-0, report errors. The cabinet is being drained of jobs, i.e., running jobs will finish, new jobs will not get compute nodes in that cabinet.

2020-07-10 at 20:56

Maintenance of the Lustre file system /cfs/klemming/ and of Beskow is mostly complete. Klemming now runs Lustre 2.12, and Beskow has been updated to CLE7.UP02. Cray Programming Environment 20.06 has been added. Beskow and Tegner are open for access again.

As a few applications/jobs behaved unexpectedly after the upgrade, most jobs are in 'userhold.' To release your job, type "scontrol release jobid" where jobid is the number of your job. The hold is there to avoid a large number of crashed jobs for you to keep track of.
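If you have many held jobs, the release commands can be generated from a job listing rather than typed one by one. The following is a minimal Python sketch under stated assumptions: it expects input in the form produced by squeue -h -u $USER -o "%i %r" (job id, then pending reason), it assumes that jobs in 'userhold' show the reason JobHeldUser, and the helper name release_commands and the sample job ids are made up. It only builds command strings and does not run them.

```python
def release_commands(squeue_output, held_reason="JobHeldUser"):
    """Build 'scontrol release <jobid>' commands for user-held jobs.

    squeue_output: text with one '<jobid> <reason>' pair per line.
    held_reason: reason string treated as a user hold (an assumption).
    """
    commands = []
    for line in squeue_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        job_id, reason = fields
        if reason == held_reason:
            commands.append(f"scontrol release {job_id}")
    return commands


if __name__ == "__main__":
    # Sample listing with invented job ids, for illustration only.
    sample = "1234567 JobHeldUser\n1234568 Priority\n1234569 JobHeldUser\n"
    for command in release_commands(sample):
        print(command)
```

Releasing a few jobs first and checking that they run correctly before releasing the rest keeps the number of potentially crashed jobs small.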

We will investigate which libraries/dependencies are not working satisfactorily.

2020-07-01 at 01:32 [beskow]

Follow-up to the previous flash (2020-06-30 at 23:40 [beskow], "Batch system commands got unresponsive .."):

A number of accounts have been blocked from further logins, and jobs have been cancelled en masse. Examples of the behaviour that led to this:

x making repeated ssh accesses every second for hours.

x submitting what appear to be self-submitting jobs that fail all the time.

x submitting thousands of jobs that run on fewer than a handful of nodes for very short amounts of time.

x &c

The system is now responsive again.

Please note: anyone can make newbie mistakes. Possibly some more experienced users/jobs were blocked/got cancelled as well, caught by friendly fire.

2020-06-30 at 23:40 [beskow]

Batch system commands became unresponsive roughly one and a half hours ago (at a quarter past ten). Investigation in progress.

2020-06-26 at 15:47

Maintenance of the Lustre file-system /cfs/klemming, and of Beskow, is planned to start on July 6. Beskow will be updated to CLE7.UP02 and Klemming will be updated to Lustre 2.12. Bringing a new Klemming meta-data-server on-line is under consideration. The updates are expected to take at least 3 full days, and migrating to a new meta-data-server one to two days. As /cfs/klemming/ will be off-line, Tegner is also affected.

2020-05-25 at 12:13 [beskow]

One of the two wall circuit breakers of cabinet c4-0 tripped yesterday evening, causing many compute nodes and a login node in that cabinet to shut down during the power dip. The breaker has been repaired and is engaged again. The cabinet will be fully/partially drained of jobs for the time being. It is likely that jobs on nodes outside the cabinet were also affected, e.g. by file-system accesses being locked while the HPS network was quiesced during fail-over(s).

2020-05-24 at 23:05 [beskow]

One cabinet out of eleven in Beskow seemingly experienced a HW fault and went down a couple of hours ago, at roughly a quarter to 8 PM. Any jobs using compute nodes within that cabinet have likely failed.

2020-04-28 at 13:26 [tegner]

Tegner is up and running again. The cfs (/klemming) slowness problems have seemingly been resolved. Please report any problems to support@pdc.kth.se.

2020-04-24 at 10:47 [tegner]

Tegner will have a service stop on 2020-04-28 from 09:00 in an attempt to fix the cfs (/klemming) problem.

2020-03-04 at 14:51 [beskow]

The cabinet PDU assy has been replaced; the system is up and running jobs again.

2020-02-20 at 15:31 [beskow]

The system will be taken off-line Wednesday 2020-03-04 at 09:00 for hardware maintenance. We will replace a cabinet PDU assy (Power Distribution Unit). We expect to have the system back the same day.

2020-01-28 at 11:29

Network problems (routing not working as configured). Disturbances of all network (IP) services like DNS, AFS etc. We are working on isolating the failing part, probably routing software.

2020-01-23 at 10:22 [beskow]

Overnight thousands of jobs were submitted that failed within seconds, each then writing core-dumps to the file-system. This has on occasion caused time-outs when communicating with the slurmctld (scheduler master daemon), and other jobs have been affected.

2020-01-17 at 18:30

We are currently experiencing problems with one /afs/ file-server. If your home directory, your files, or applications you use are located on that file-server, your access will likely freeze or time out.

2020-01-02 at 12:48 [beskow]

The Beskow login node got OOM'ed (out of memory) and is about to be restarted.
