2019-12-13 at 11:46 [beskow]
In addition to the blower control assembly and the environment distribution board that were replaced slightly over a week ago, we also need to replace a coil control assembly. This will take place this coming Tuesday, December 17. The system will go off-line shortly after 09:00.
2019-12-12 at 00:45 [beskow]
Login nodes have been restarted, and so far more than 90% of compute nodes have also been restarted and are now running jobs again. Please report any unexpected behavior.
2019-12-10 at 21:58 [tegner]
Even if Klemming comes on-line again during the evening, Tegner will stay off-line pending the planned down-time for system upgrades on Wednesday the 11th of December. Our apologies for any inconvenience this may cause.
2019-12-11 at 14:55 [beskow]
The issues on Beskow with the file-system Klemming have unfortunately not yet been eliminated. We will let running jobs keep running, and initiate a rolling restart of all compute nodes and login nodes. This is planned to start later tonight, and will continue for 36 to 48 hours. Jobs will only start on compute nodes that have been rebooted.
2019-12-10 at 22:15 [klemming]
The servers in Klemming have been restarted one by one, and the file-system is back on-line. Beskow compute nodes in an uncertain state with respect to Klemming have been restarted. Jobs that were running have been kept running, and new jobs are allowed to start again.
2019-12-10 at 08:00 [klemming]
Overnight, problems seem to have developed between the file-system /cfs/klemming and the systems Tegner and Beskow. Investigation is in progress.
2019-12-06 at 17:23 [tegner]
Tegner will have a service window on Wednesday 2019-12-11 starting 07:00 CET with an expected duration of 10 hours. The login and transfer nodes will also be restarted during this time.
2019-12-04 at 18:12 [beskow]
Maintenance completed. The system is on-line and has been running jobs for a while.
2019-12-02 at 13:42 [beskow]
This coming Wednesday, December 4, the system will be stopped for urgent hardware maintenance to replace a blower control assembly and an environment distribution board. Maintenance will start at 09:00 in the morning. These components are suspected to be behind the problems over the weekend. The system will be off-line during maintenance.
2019-12-01 at 21:25 [beskow]
The system has been running jobs again for a while. The exact cause has not been pinpointed with complete certainty, but very likely a piece of hardware that, simply put, monitors and controls environmental parameters (temperature, air-flow, etc.) needs to be replaced.
2019-12-01 at 10:00 [beskow]
The system became unresponsive overnight. An investigation has started.
2019-11-19 at 08:46 [beskow]
Overnight there have been issues connecting to and using the Slurm scheduler/controller. Investigation is in progress.