Events:
- 2019-12-17 at 22:45
[tegner]
- NOTICE: Tegner still has performance problems with certain file operations on /cfs/klemming. Jobs accessing many files may suffer from poor performance. We are working on finding a solution.
Jobs are running again.
- 2019-12-17 at 16:55
[klemming]
- Klemming is currently down due to crashed servers. Investigation in progress.
- 2019-12-17 at 14:41
[tegner]
- Tegner is experiencing hangs on the /cfs/klemming file system, which in some cases make it very slow.
Job start has temporarily been stopped.
Running jobs will continue to run.
- 2019-12-17 at 12:42
[beskow]
- A coil control assembly has been replaced. The system is running jobs, and login is enabled.
- 2019-12-13 at 11:46
[beskow]
- In addition to the blower control assembly and the environment distribution board that were replaced slightly over a week ago, we also need to replace a coil control assembly. This will take place this coming Tuesday. The system will go off-line shortly after 09:00, December 17.
- 2019-12-12 at 00:45
[beskow]
- Login nodes have been restarted, and so far more than 90% of compute nodes have also been restarted and are now running jobs again.
Please report unexpected behavior.
- 2019-12-10 at 21:58
[tegner]
- Even if Klemming comes on-line again during the evening, Tegner will stay off-line pending the planned down-time for system upgrades on Wednesday the 11th of December. Our apologies for any inconvenience this may cause.
- 2019-12-11 at 14:55
[beskow]
- The issues on Beskow with the file system Klemming have unfortunately not yet been eliminated.
We will let running jobs keep running, and initiate a rolling restart of all compute nodes and login nodes. This is planned to start later tonight, and will continue for 36 to 48 hours.
Jobs will only start on compute nodes that have been rebooted.
- 2019-12-10 at 22:15
[klemming]
- The servers in Klemming have been restarted one by one, and the file system is back on-line. Beskow compute nodes in an uncertain state with respect to Klemming have been restarted. Running jobs have been left running. Jobs are allowed to start again.
- 2019-12-10 at 08:00
[klemming]
- Overnight, problems seem to have developed between the file system /cfs/klemming and the systems Tegner and Beskow. Investigation in progress.
- 2019-12-06 at 17:23
[tegner]
- Tegner will have a service window on Wednesday 2019-12-11 starting 07:00 CET with an expected duration of 10 hours. The login and transfer nodes will also be restarted during this time.
- 2019-12-04 at 18:12
[beskow]
- Maintenance completed. The system is on-line and has been running jobs for a while.
- 2019-12-02 at 13:42
[beskow]
- This coming Wednesday, December 4, the system will be stopped for urgent hardware maintenance to replace a blower control assembly and an environment distribution board. This will start at 09:00 in the morning. These components are suspected to be behind the problems over the weekend. The system will be off-line during maintenance.
- 2019-12-01 at 21:25
[beskow]
- The system has been running jobs again for a while. The exact reason has not been pinpointed with complete certainty, but very likely a piece of hardware that, simply put, monitors and controls environmental parameters (temperature, air flow, and so on) needs to be replaced.
- 2019-12-01 at 10:00
[beskow]
- The system became unresponsive overnight. Investigation started.
- 2019-11-19 at 08:46
[beskow]
- Overnight there have been issues connecting to and using the Slurm scheduler/controller. Investigation is in progress.
- 2019-11-13 at 11:21
[tegner]
- There is a problem with the Lustre filesystem Klemming, which may affect jobs on Tegner (as well as on Beskow, of course, but Beskow is down for service today).
Job start has been stopped on Tegner until we know more.
We hope that running jobs will continue to run without problems.
Job queueing is still open.
- 2019-11-13 at 16:22
[beskow]
- Due to some unexpected circumstances, the scope of the upgrades on Beskow today had to be reduced. Consequently, another service window will be scheduled before the end of 2019 to complete the upgrades. The system has been running jobs for roughly half an hour and login has been enabled.
- 2019-11-06 at 13:00
[beskow]
- This coming Wednesday, 2019-11-13, Beskow will be taken off-line around 09:00 AM for applying patches, minor updates, bug fixes, and programming environment updates. Some hardware will also be replaced. We expect to have the system on-line again in the afternoon.
- 2019-10-31 at 16:34
- One of the AFS servers is currently having problems, causing many operations in AFS to hang. Investigation in progress.
- 2019-10-15 at 10:15
[klemming]
- This morning, around 9:20 AM, one of the file servers in Klemming crashed. Initial debugging suggests hardware problems. Due to this, Klemming is currently running with reduced performance.
- 2019-09-10 at 08:53
[klemming]
- A failover/recovery was made for the active Klemming meta data server around 3 AM this morning. Please report anything out of the usual.
- 2019-09-04 at 11:13
[klemming]
- The Klemming file system has some problems at the moment due to some servers being overloaded. The root cause of the load is still unknown. To try to remedy the immediate problems, we will restart the involved servers now, which will temporarily cause some (more) pauses in the file system accesses.
- 2019-09-01 at 15:18
[beskow]
- The Beskow login node ran out of memory (OOM) around 13:00 today, but has now been restarted. Take care not to spawn parallel workloads on the shared login node.
- 2019-08-15 at 23:07
[beskow]
- Most of the intended updates are finished, and jobs are running again. The installation of the latest programming environment (development toolkits) has been postponed to a later date. The first few jobs failed immediately upon start with the symptom 'srun: command not found.' The root cause of this is not yet known, but a workaround is in place. Logins will soon be re-enabled.
- 2019-08-12 at 14:36
[beskow]
- On Thursday, 2019-08-15, Beskow will be taken off-line around 09:00 AM for applying patches, minor updates, bug fixes, and programming environment updates. The changes are not nearly as extensive as with the previous upgrade, but we still anticipate that it may take a full workday.
- 2019-08-02 at 16:18
[klemming]
- The primary meta data server for the Klemming file system crashed a while ago, possibly due to being overloaded. We are now running on the secondary server. Please report anything out of the usual.
- 2019-07-25 at 22:48
[beskow]
- The Beskow login node ran out of memory (OOM) earlier tonight, around 21:00, and has now been restarted.
- 2019-07-09 at 13:10
- The reconfiguration of the Scania file system should now be complete. Please report anything out of the ordinary.
- 2019-07-09 at 11:54
- One server in the Scania file system has "crashed" during the night and the vendor has identified a configuration issue. We will need to restart at least parts of the file system now to prevent further problems. This is expected to only result in temporary pauses in the file system access.
- 2019-07-04 at 01:47
[beskow]
- Beskow has been available for a couple of hours. Please read the 'message of today' when you log in.
- 2019-07-03 at 21:36
[tegner]
- Tegner is now available for use, including fully functioning access to Klemming.
- 2019-07-02 at 17:52
- Work on Klemming (file-by-file inspection) is approaching the point where we will 'soon' be able to let the remaining inspection run at the same time as jobs using Klemming on Tegner.
We think that we will be in a position to enable jobs on Beskow as well not long after that.
- 2019-07-01 at 16:18
- Work on Beskow and Klemming is continuing. So far today we have tested the emergency cooling capacity and carried out file-by-file inspection and modification of all files in /cfs/klemming/ to allow future quota enforcement. Initial tests of applications have just started. New information will follow tomorrow.
- 2019-06-24 at 13:10
[beskow]
- As you are certainly aware, Beskow is undergoing maintenance. The current service window serves to update the operating system on Beskow from Cray OS 5 to Cray OS 7, and to upgrade the Lustre file storage system.
With this upgrade some necessary changes will be implemented, leading to an adjustment in how software is used on the cluster.
1. The upgraded Beskow cluster will use native SLURM, which means that “aprun” will no longer be available and you will need to use “srun” instead.
More information is available at https://www.pdc.kth.se/support/documents/run_jobs/job_scripts_beskow.html
and at https://www.pdc.kth.se/support/documents/run_jobs/run_interactively.html#running-on-beskow
Please change your submit scripts accordingly before submitting them to the upgraded cluster.
2. The “module” command will by default load the latest version into your environment if no version is given. More information is available at https://www.pdc.kth.se/support/documents/run_jobs/job_scheduling.html#using-modules
However, we recommend that every “module load” command states which version of the software to use. For example, type “module load gromacs/2019.2” instead of “module load gromacs”. This keeps your jobs consistent when the default module changes.
3. As Beskow is upgraded from Cray OS 5 to Cray OS 7, all software available on the cluster needs to be recompiled. We are therefore taking this opportunity to purge unused software packages.
Initially we will install recent versions of the most popular software packages. If you are using less common software or particular versions of software packages, please send an email to PDC support stating the software and version, so we can add it to our installation list.
More information about what software packages are being installed is found at https://www.pdc.kth.se/software
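The two script changes described in points 1 and 2 can be checked mechanically before resubmitting. The following is a minimal sketch in Python, not a PDC-provided tool: the function name `check_job_script` and the messages it produces are hypothetical, and it only looks for the two patterns named above (use of “aprun”, and “module load” without a version).

```python
import re

# Hypothetical helper, not a PDC tool. It flags the two patterns that the
# notice above asks users to change in their submit scripts.
APRUN = re.compile(r"\baprun\b")
MODULE_LOAD = re.compile(r"^\s*module\s+load\s+(.+)$")


def check_job_script(text):
    """Return a list of (line_number, message) for lines needing changes."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Ignore comments, including the shebang and #SBATCH directives.
        code = line.split("#", 1)[0]
        if APRUN.search(code):
            findings.append((number, "replace 'aprun' with 'srun'"))
        match = MODULE_LOAD.match(code)
        if match:
            for name in match.group(1).split():
                # Assumption: a pinned module is written as name/version.
                if "/" not in name:
                    findings.append(
                        (number, f"pin a version for module '{name}'")
                    )
    return findings


example = """#!/bin/bash
#SBATCH -N 2
module load gromacs
module load gromacs/2019.2
aprun -n 64 ./my_program
"""

for number, message in check_job_script(example):
    print(f"line {number}: {message}")
```

The check is deliberately simple: it reports candidates for manual review and does not rewrite the script, since the right “srun” options depend on the job.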
In case of any questions, please do not hesitate to contact PDC support. For contact information see https://www.pdc.kth.se/support/documents/contact/contact_support.html
Best regards
PDC Support
- 2019-06-05 at 13:21
- Tegner, Beskow and the Lustre file system Klemming will be taken off-line on Monday morning, June 17. We will bring Beskow to CLE7/SLES15 and Klemming to Lustre 2.10. The upgrades are quite extensive and we have set aside two weeks for them to complete, i.e., through the rest of June.
- 2019-06-30 at 16:20
- The upgrades of Beskow to CLE7/SLES15 and Klemming to Lustre 2.10 are still in progress. Most of the work is finished and in operation: the base operating system is running, system software is installed, hardware microcode is updated, and so on.
We are currently doing configuration changes, tuning, optimization, and testing. While we are doing that you cannot log in to the system.
The next update will be posted to www.pdc.kth.se tomorrow, Monday, July 1 (i.e. no mail will be sent out).
- 2019-06-11 at 12:40
[tegner]
- The Singularity version on Tegner has been updated to address security issues. As a consequence, the way Singularity stores settings files for running containers has changed. Please contact support if you experience any problems.
- 2019-06-04 at 10:41
[klemming]
- About half an hour ago the Klemming file system had some problems creating new files in certain cases. This was due to the file system not properly handling one of its disks getting too full. We urge users to remember to remove inactive data from Klemming as soon as it is not needed anymore for their computations, to reduce the risk of more similar problems in the future.
- 2019-04-30 at 17:10
[klemming]
- The problems with the meta data servers of Klemming earlier today are believed to have been caused by jobs overloading the file system with too high a rate of meta data operations. Please make sure your jobs behave well in this respect, especially before scaling up to many nodes.
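The notice does not prescribe a particular remedy. One common way to lower the rate of file creations, shown here only as a sketch under that assumption, is to collect many small results into a single archive instead of writing one file per result. The function name `pack_results` and the member names are hypothetical.

```python
import io
import os
import tarfile
import tempfile


def pack_results(results, archive_path):
    """Write many small in-memory results into a single tar archive.

    `results` maps a member name to its content as bytes. Only one file
    (the archive) is created on the file system, however many results
    there are.
    """
    with tarfile.open(archive_path, "w") as archive:
        for name, payload in sorted(results.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return archive_path


if __name__ == "__main__":
    # Illustration with made-up data, written to a temporary directory.
    results = {f"step_{i:04d}.txt": f"value {i}\n".encode() for i in range(1000)}
    with tempfile.TemporaryDirectory() as workdir:
        path = pack_results(results, os.path.join(workdir, "results.tar"))
        with tarfile.open(path) as archive:
            print(len(archive.getnames()), "members in one file")
```

Whether this fits depends on the application. The point is only that the number of files a job creates, opens and stats matters to the meta data servers as well as the volume of data.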
- 2019-04-30 at 15:33
[tegner]
- The queue on Tegner has now been started again.
- 2019-04-30 at 14:02
[tegner]
- Due to issues with the shared file system Klemming, job starts on Tegner have been temporarily paused while diagnostics are in progress. Our apologies for any inconvenience caused. More information will follow.
- 2019-04-30 at 13:50
[klemming]
- The meta data server for the Klemming file system just failed and is in the process of starting on the standby server. The reason is not yet clear. Investigation in progress.
- 2019-04-26 at 12:10
[klemming]
- At 10AM on Monday, April 29th, we will do some maintenance on one of the servers in the Klemming file system. To do this, we will have to move those resources over to their secondary server, which will cause some of the I/O operations in Klemming to hang for some minutes when failing over, and then again later when moving back. We do not expect running jobs to be negatively affected, except for a possible increase in running time. That said, there is a risk that something goes wrong during the procedure, so new jobs will not be allowed to start during that time. The service is expected to take less than an hour.
- 2019-04-11 at 03:00
[tegner]
- The Tegner cluster is now available again.
- 2019-04-10 at 20:19
[tegner]
- Due to unforeseen issues with the interconnect and a newer version of a file system client, we need to roll back parts of the upgrade, which causes further delays. We hope Tegner will be back late this evening.
- 2019-04-10 at 09:27
[tegner]
- The Tegner service window has been extended until 2019-04-10 15:00 CEST. Our apologies for any inconvenience this may cause.
- 2019-04-10 at 06:00
[tegner]
- Tegner will have a service window on 2019-04-10 starting 06:00 CEST with an expected duration of 3 hours. The login and transfer nodes will also be restarted during this time.
- 2019-03-21 at 14:45
[beskow]
- Tomorrow, Friday March 22, around 10:00, a central RAID controller will be reset. Ideally this will have no impact on services.
- 2019-03-21 at 00:17
[beskow]
- The system has been fully restarted and is running jobs again. Logins are re-enabled. Jobs that were running during the stop failed.
- 2019-03-20 at 22:24
[beskow]
- The system became unresponsive roughly an hour ago. Investigation in progress.
- 2019-02-20 at 21:55
[beskow]
- Preventive hardware maintenance is completed and login is enabled again. When jobs were allowed to start, there was initially an unexpectedly high rate of job failures. Investigation has so far not given any conclusive reason why.
- 2019-02-13 at 13:01
[beskow]
- This coming Wednesday, February 20, the Beskow system will be taken off-line in the morning for preventive system hardware maintenance. The work will take all day.
- 2019-02-07 at 13:29
[klemming]
- The metadata server for Klemming crashed during the night and failed over to the standby. This time it was probably due to a Lustre program error. Running jobs probably experienced pauses of up to 15 minutes in some file operations, until the file system recovered fully. Please report anything still out of the ordinary.