Events:

2018-11-21 at 09:34 [beskow]

Overnight, roughly between Nov 20 21:16 and Nov 21 02:40, the system experienced problems such as timeouts (to internal file systems, its scheduler, etc.). No obvious culprit was found.

2018-11-13 at 11:15 [klemming]

The meta data server of Klemming failed over to the standby around midnight. The reason is still unknown. Please report anything out of the ordinary.

2018-11-13 at 10:37 [tegner]

The primary Tegner login node was down and has been restarted. Please report any further unexpected behaviour from it.

2018-11-07 at 12:14 [tegner]

The Tegner login node tegner-login-1 has had login problems that should now be resolved. Our apologies for any inconvenience this may have caused.

2018-10-16 at 12:46 [beskow]

Preventive maintenance completed. The system has been running jobs for a while, and login has just been enabled again.

2018-10-15 at 09:43 [beskow]

Preventive maintenance will take place tomorrow, Tuesday (2018-10-16), at 08:30. The system will be unavailable during parts of the day. As usual, jobs that would run past 08:30 will be deferred until after the maintenance.

2018-09-24 at 12:00

The email infrastructure that SNIC uses to filter spam has been under a DoS attack. This might have delayed your email to support. If you are unsure whether your email has reached support and you have not received an automatic confirmation from the RT system within the hour, you can also contact our support by phone. For details see the contact page. The email filter is currently strengthened to cope with the additional pressure. The curious can look at statistics here: http://stats.sunet.se/mailfilter/index.html

2018-08-29 at 11:29 [klemming]

The part of Klemming that handles meta data had a problem this morning, causing a pause in these types of file system requests. The reason was probably that the server became overloaded. Why the outage lasted so long is still unclear. Please make sure not to run applications that perform many meta data operations at too large a scale, or too many of them at the same time. While Klemming is quite big in size, its meta data performance is not very high.

2018-08-29 at 10:16 [tegner]

Klemming is back again and the queues on Tegner have been activated again.

2018-08-29 at 09:53 [tegner]

The Tegner queue has been temporarily paused while we investigate issues connecting to the Klemming file system.

2018-08-04 at 13:26 [beskow]

The system has been restarted and has been running jobs for a while. The suspect component (a cabinet controller board) has not been replaced.

2018-08-04 at 10:02 [beskow]

The primary cabinet c0-0 seems to have shut down and is not reachable. This resembles the failure of 2018-06-08 and likely requires manual intervention (i.e. expect a wait for transit time to the site).

2018-08-01 at 10:45 [tegner]

Primary login node now available again. Exact reason for crash unclear.

2018-08-01 at 10:06 [tegner]

The primary login node of Tegner is currently down. Investigation in progress.

2018-06-19 at 15:47 [tegner]

The PDC cluster Tegner had planned maintenance today, and has been up and running again for about two hours. We are sorry that this information did not reach everyone. There were some issues with the mail handling.

2018-06-18 at 09:18 [tegner]

Tegner will have a service window on Tue 2018-06-19, primarily for system software updates. No jobs should be affected by this (except that no jobs will run during the service window).

2018-06-12 at 04:39 [klemming]

Beskow job starts have now been re-enabled, but the investigation of the root cause continues.

2018-06-12 at 03:37 [klemming]

Klemming has had problems with access to some files during the day, probably due to some locking gone wrong. Unfortunately, during the evening, these problems suddenly caused most of the threads on the meta data server to hang, making the whole file system very sluggish. We seem to have been able to untie most of the knots now, but the root cause is still unknown, so the investigation continues.

2018-06-11 at 22:28 [beskow]

As file operations towards /cfs/klemming are currently extremely sluggish, no new Beskow jobs will be started until further notice. It is not clear where the problem resides.

2018-06-08 at 12:32 [beskow]

The system has been restarted and has been running jobs again for a while.

2018-06-08 at 08:09 [beskow]

Tentatively, the primary cabinet seems to have just experienced a power loss. As today is graduation day in most schools, response times might be somewhat longer.

2018-06-06 at 17:31

Some resolvers (name-service lookup) seem to be in poor shape, with a big impact on e.g. the time to log in to different systems. The problem seems to have started this morning. A workaround is now in place on several systems, but the underlying issue is not yet addressed. You should now be able to log in more smoothly, though.

2018-05-25 at 12:59 [beskow]

UPS maintenance completed. Jobs running again.

2018-05-24 at 14:40 [beskow]

The UPS that supplies the Cray with power is up and running but is indicating an internal hardware problem. UPS maintenance will start tomorrow, Friday, at 12:00 MEST(*). To minimize the impact if we should lose power during maintenance, the system will be idle, which means that jobs that would run across the 12:00 boundary will start _after_ the maintenance window. This is a good opportunity to run short and wide jobs.

Sorry for the inconvenience,
PDC-Staff.

(*) Or sooner if system is idle sooner.

2018-05-18 at 17:55 [klemming]

Some more information about the recent downtime of Klemming. While we are still waiting for the full analysis from the vendor, the cause of the file system problems seems to be that one of the four disk systems lost access to some of its disks, probably due to a software bug that it could not recover from. When the disk system was finally up again, it needed to check all its disks to find possible RAID parity inconsistencies and data corruption, which took about a day due to the large number of disks involved. After that the file system was scanned to identify all files that had been affected by the corruption found on disk, which took more than a day due to the large number of files involved. Out of the 113 million files on that particular disk system, 13 were found to be affected, most of them created by the batch jobs running when the problem occurred. The owners of those files have been notified.

2018-05-18 at 14:07 [klemming]

The Klemming file system is now up again after the problems with one of its disk systems. We have identified a number of files that have lost data, luckily not too many, and the affected users will be informed. Beskow is now running batch jobs again. Please report if you find anything out of the ordinary.

2018-05-18 at 00:36 [klemming]

Klemming repairs are coming along, and Beskow is powered up with Klemming mounted. However, login is blocked and jobs are kept idle awaiting an assessment of which files may have been garbled. This will hopefully be finished by tomorrow morning, May 18.

2018-05-17 at 01:33 [beskow]

To give the ongoing Klemming repairs some slack, and hopefully also to reduce confusion about what is and is not doable on Beskow, the system will be shut down and powered off for the time being.

2018-05-16 at 16:03 [klemming]

Unfortunately, the check is taking more time than expected and is finding potential problems that will need further investigation. The file system will probably not be back until very late today, at the earliest.

2018-05-16 at 07:52 [klemming]

The disk system is now working better again, but to minimize the risk of losing data from the file system, we will let the disk system finish its internal verification of all the data on the disks. This unfortunately means that Klemming will probably not be available again until after lunch today.

2018-05-16 at 00:01 [klemming]

The investigation and repair work on the Klemming file system issue is still ongoing. No batch jobs will be started, login sessions still have issues, file transfers might fail, etc.

2018-05-15 at 11:30 [klemming]

We have some problems with the Klemming file system right now. Investigation in progress.

2018-05-14 at 15:31 [tegner]

The Tegner service window is now concluded and jobs are running again. Unfortunately, due to unforeseen consequences of how Klemming was mounted on the previously installed system image, several jobs with their working directory in Klemming failed immediately upon job start. Our apologies for that. We have now implemented a mitigation for this issue to hopefully prevent future jobs from being affected.

2018-05-14 at 09:00 [tegner]

At 09:00 on Monday May 14th 2018 Tegner will have a service window for system software updates. The update is expected to take several hours. During the service window access to login-nodes and transfer nodes will be enabled but they may be restarted with very little prior notice.

2018-04-18 at 15:07 [beskow]

Preventive maintenance completed. The system has been running jobs for an hour, and login has just been enabled again.

2018-04-11 at 14:52 [beskow]

Preventive maintenance will take place this coming Wednesday, 2018-04-18, at 08:00. The system will be unavailable during parts of that day.

2018-03-25 at 13:53 [beskow]

Excessive use of one Beskow batch front-end got it stuck, and it is being restarted. All running jobs spawned from it were lost.

2018-03-19 at 10:26

Because of an urgent security issue, the backends running the Slurm databases had to be updated. This should not cause any problems on the systems, but please report to support if you get any results from Slurm (the "s-commands") that seem wrong. For Beskow, see also the previous note on loading the correct Slurm version.

2018-03-19 at 08:43 [beskow]

Over the weekend Slurm, the Beskow batch system, was upgraded in several steps on the fly. If you type 'squeue --version' and it displays neither 17.02.10 (preferred) nor 16.05.10-2, you are advised to log out and log in again to get the new default.
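
The version check described in this notice could be scripted roughly as follows. This is only a minimal sketch: it assumes that 'squeue --version' prints a single line of the form "slurm 17.02.10", and the helper names parse_slurm_version, needs_relogin and check_current_session are hypothetical, chosen for illustration.

```python
import subprocess

# Versions named in the notice; 17.02.10 is the preferred one.
ACCEPTED_VERSIONS = {"17.02.10", "16.05.10-2"}


def parse_slurm_version(output: str) -> str:
    """Return the last token of output such as 'slurm 17.02.10' (assumed format)."""
    tokens = output.split()
    if not tokens:
        raise ValueError("empty version output")
    return tokens[-1]


def needs_relogin(output: str) -> bool:
    """True if the reported version is neither of the accepted ones."""
    return parse_slurm_version(output) not in ACCEPTED_VERSIONS


def check_current_session() -> bool:
    """Run 'squeue --version' in this session; requires squeue on PATH."""
    result = subprocess.run(
        ["squeue", "--version"], capture_output=True, text=True, check=True
    )
    return needs_relogin(result.stdout)
```

Calling check_current_session() on a login node would return True when a fresh login is advisable; the two pure helpers can be tried on any machine by passing in a sample output string.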

2018-03-16 at 12:38 [klemming]

The meta data server of Klemming just crashed and failed over to the second one. Reason still unknown. Everything should now be back to normal though. Please report anything still out of the ordinary.

2018-02-21 at 19:25 [beskow]

A voltage failure on one (out of 22) Lustre router nodes at ~18:58 tonight potentially affected a limited set of jobs during fail-over.

2018-01-11 at 15:26 [tegner]

Tegner login server (tegner-login-2) crashed at about 11:37 CET and was restarted.

2018-01-10 at 19:49 [beskow]

Since ~18:54 this afternoon there have been intermittent dropouts when communicating with the Slurm master controller node. A typical batch command eventually times out or takes a long time. Investigation is in progress, but nothing conclusive has been found.

2018-01-04 at 12:26 [klemming]

About an hour ago the meta data server for Klemming performed an unplanned fail-over to the secondary server. The exact reason is not yet clear, but some indications point to a file system bug. Please report anything still out of the ordinary.
