Events:

2021-12-20 at 17:39 [beskow]
To mitigate an excessive number of notification mails about batch job status changes, all notification mails were suppressed between ~17:27 and ~17:37 while thousands of jobs were being removed (i.e., having their status changed).

Mail on job status changes is now enabled again.

Please take care. It is doubtful whether requesting 5000 mails per hour about job status changes, sent to your home site, makes sense.

2021-11-05 at 19:27 [beskow]
User processes creating a large number of large files in /tmp/ (/tmp/ uses primary memory), in combination with large user processes, caused the Beskow login node to crash with OOM (out of memory). It is being rebooted.
2021-09-29 at 18:06 [beskow]
Less than an hour ago cabinet c3-0 lost half of its power feed when a fuse tripped. Many nodes in that cabinet, ~70, shut down during the power dip. Several running jobs were affected.
2021-08-14 at 12:05
A change has been made to an internal network config to address the issue with network glitches. Please report if you still experience what might be unexpected network behaviour.
2021-08-14 at 10:00
Symptoms of network glitches are being experienced and investigated. Typical failures: cannot check out a license, problems mounting a home directory, ...
2021-07-09 at 19:28 [tegner]
Tegner is back on-line and has been running jobs for about half an hour.
2021-07-09 at 19:18 [beskow]
Beskow is back on-line and has been running jobs for ~25 minutes.
2021-07-09 at 16:56 [klemming]
Klemming has now been fully restarted. Some hardware problems showed up that had to be fixed first, which delayed the restart somewhat. We are now working on getting Beskow and Tegner back on-line again.
2021-07-08 at 23:00 [klemming]
The problems have re-appeared over the past hour or so and have gotten worse. We plan to do a full system stop/start of Klemming, Tegner, and Beskow during the day tomorrow (2021-07-09). No new jobs will start until the restart is completed. Forthcoming updates will be posted on www.pdc.kth.se only, not sent by mail.
2021-07-08 at 19:55 [klemming]
Klemming has had intermittent problems since around 17:20 today and two of the servers did eventually crash, causing parts of the file system to be inaccessible for a while. They have now been restarted and the file system should be fully back again. Please report anything still out of the ordinary.
2021-06-12 at 20:56 [beskow]
While we investigate whether a CVE (Common Vulnerabilities and Exposures) entry potentially affects us, certain services have been shut down. One immediate effect is that your standard login (say through ssh) might take a couple of minutes before giving a prompt back, rather than responding immediately as usual.
2021-05-12 at 21:15
Updates are now finished on Tegner as well; jobs have been running for about half an hour.
2021-05-12 at 14:14
Updates finished on Klemming and Beskow, jobs running again on Beskow. Tegner to follow.
2021-05-11 at 08:30
Reminder: System software updates on Klemming, Tegner and Beskow will take place tomorrow, Wednesday 2021-05-12, starting 08:00. The work is anticipated to be finished tomorrow. No jobs will be running during the updates.
2021-05-05 at 18:20
System software updates on Klemming, Tegner and Beskow will take place this coming Wednesday, 2021-05-12, starting 08:00. The work is anticipated to be finished during that day. No jobs will be running during the updates.
2021-04-21 at 14:18
All license servers for ordinary production use should be back in operation. Please report unexpected behaviour.
2021-04-21 at 12:59
First ~half of license servers are back on-line.
2021-04-21 at 08:30
A server hosting virtual machines, among them a couple of license servers, seems to be down. Investigation in progress.
2021-04-09 at 16:30 [beskow]
Cray Linux Environment CLE 7.0.UP02.PS12 applied and Cray Development Toolkit CDT-21.02-01 installed. The system has been running jobs for a while.
2021-04-05 at 13:00 [beskow]
The announced patch set / software update work on Beskow is planned to start around 09:00 on Wednesday, 2021-04-07. Jobs won't be allowed to execute past that time. Logins will be disabled.
2021-03-29 at 15:58 [beskow]
Machine room work planned in preparation for our new system, Dardel, April 7-9. Additional water cooling pipes will be installed during April 8.

Beskow is planned to be shut down on April 7. Minor patch set / software updates will be applied. Water cooling to Beskow will be disabled during parts of April 8. The work is intended to be completed during April 9, after which Beskow will resume operation.

2021-04-06 at 12:35
One of our three Kerberos servers was down from 2021-04-03 16:50 to 2021-04-06 17:00. Operations should automatically fail over to the other servers, but some extra delays might be noticed. A replacement server is now in place; inform us if you see any further problems.
2021-01-14 at 08:25 [tegner]
A serious bug has been discovered 2021-01-14 in OpenAFS: All 1.8.x versions except the new 1.8.7 released 2021-01-15 will not be able to (re)start after 14 Jan 2021 08:25:36 AM UTC (Unix epoch time 0x60000000), which has passed. As we need to restart servers and Tegner login and compute nodes with the fixed version, small interruptions might occur in the near future. We think this can be done without interrupting running jobs. If you use OpenAFS on your own computer, you might need to upgrade as well. The Auristor AFS product is not affected.
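As a quick sanity check of the timestamp quoted above, the hexadecimal epoch value can be converted to a UTC date with a few lines of Python. This is only an illustrative sketch; the variable names are our own and nothing here is taken from OpenAFS itself.

```python
from datetime import datetime, timezone

# Epoch value quoted in the notice (seconds since 1970-01-01 00:00:00 UTC).
CUTOFF_EPOCH = 0x60000000

# Convert to a timezone-aware UTC datetime.
cutoff = datetime.fromtimestamp(CUTOFF_EPOCH, tz=timezone.utc)

print(CUTOFF_EPOCH)        # 1610612736
print(cutoff.isoformat())  # 2021-01-14T08:25:36+00:00
```

The result matches the 14 Jan 2021 08:25:36 UTC cutoff stated in the notice.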