Events:

2022-12-27 at 20:19 [dardel]
The ability for a user to log in with ssh from a login node to compute nodes belonging to one of their active node allocations should now be back in operation.
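
A quick way to check which nodes one of your jobs holds, and to log in to one of them, is something like the following (the job id and node name are illustrative, not real values):

    # List your running jobs and the nodes they occupy.
    squeue -u "$USER" --states=RUNNING --format="%.10i %.12P %N"

    # Expand the compact node list of a given job into individual host names.
    scontrol show hostnames "$(squeue -j 123456 --noheader --format=%N)"

    # Log in to one of the listed nodes.
    ssh nid001234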

(Please see the "Dardel expansion starting" page for background and a description of recent, current, and upcoming activities.)

2022-12-27 at 10:43 [dardel]
Logging in with ssh from a login node to compute nodes one has been assigned is currently not working in a fairly large number of situations.

Ordinary batch job runs, and for example interactive srun sessions, should not be affected.
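
For reference, an interactive session of this kind can be requested directly from the scheduler; a minimal sketch, where the partition and project names are illustrative:

    # Request one node for 30 minutes and open an interactive shell on it.
    # The -p (partition) and -A (project) values are placeholders.
    srun -N 1 -t 00:30:00 -p main -A my-project --pty bash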

2022-12-13 at 20:43 [dardel]
Compute and login nodes rebooted; jobs have been running for a while. Login has just been re-enabled.

There have been many internal changes to the system. If you think your application performs or executes less well than before, please let us know.

2022-12-13 at 12:23 [dardel]
Work on system discovery of the new cabinet with new compute nodes, and on the high speed network layout changes, is coming to an end.

Once finished, compute and login nodes will be rebooted, and test jobs will be run.

If all goes well, we aim to have ordinary jobs running by this evening.

2022-12-01 at 18:35 [dardel]
High speed network/switch work completed. Jobs have been running for a while, and logins have been re-enabled.

Please be aware of the longer system work planned for the coming week.

2022-12-01 at 10:12 [dardel]
During the high speed network/switch work, all Lustre file systems had to be stopped. All jobs that were running at the time were affected, and no logins were possible. Sorry for the inconvenience. Logins will remain disabled until further notice.

2022-11-30 at 16:24 [dardel]
Job starts are still disabled; work on the high speed network / switches is still ongoing. Note: there is expansion work ahead in the coming week, starting December 06, as announced earlier.

2022-11-29 at 23:11 [dardel]
Correction to the previous message:

Please replace "... tomorrow Wednesday Sep 30 the earliest." with "... tomorrow Wednesday Nov 30 at the earliest." Sorry for the extra noise.

2022-11-29 at 20:51 [dardel]
For roughly the past 20 minutes, job starts have been failing with errors similar to "Socket timed out on send/recv operation".

No new jobs will be allowed to start until further notice, i.e. expect no further information until tomorrow, Wednesday Sep 30, at the earliest. (This should read Nov 30; see the correction above.)

2022-11-29 at 14:15 [dardel]
Next week, a new cabinet of compute nodes will be added to the system. The high speed network will also get a new layout to support the expansion.

No jobs will be allowed to run beyond Tuesday, December 06, at 08:00 in the morning.

Work is intended to be completed within a week.
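
For reference, whether a queued job can run before such a deadline is governed by its time limit; a minimal sketch of inspecting the blocking reservation and submitting a job short enough to fit (the script name is illustrative):

    # Show scheduler reservations, including maintenance blocks.
    scontrol show reservation

    # Submit with a time limit that ends before the reservation starts.
    sbatch --time=12:00:00 job_script.sh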

2022-11-17 at 14:19 [dardel]
Internal services running slurm will get minor configuration changes starting at 15:00 today, Thursday November 17.

During the updates, interaction with, for example, the slurm master daemon could become sluggish. In the worst case, commands such as 'sbatch' or 'squeue' might fail. Jobs executing or waiting in the queue are not expected to be affected.
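
Should a submission fail during such a window, a simple retry loop is usually enough; a minimal sketch, with an illustrative script name:

    # Retry sbatch a few times if the controller is temporarily unresponsive.
    for attempt in 1 2 3 4 5; do
        sbatch job_script.sh && break
        echo "sbatch failed (attempt $attempt), retrying in 60 s" >&2
        sleep 60
    done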

The changes are expected to be finished within two hours.

2022-11-08 at 22:28 [dardel]
The network methods used by all Lustre file systems, i.e. /cfs/klemming/, have been changed.

Most parts of the system have been modified and rebooted: disk servers, compute nodes, and everything in between.

Jobs have been running for about half an hour, and login has been re-enabled. Please report any odd behaviour.

2022-10-31 at 11:07 [dardel]
On the coming Monday, 2022-11-07, the network methods used by all Lustre file systems, i.e. /cfs/klemming/, will be changed.

No jobs will be allowed to run beyond Monday, November 07, at 08:00.

Ideally the changes will be completed within one to two days.

2022-10-11 at 10:42 [dardel]
Connections to, for example, external license servers should now be operational again. Job starts have resumed.

2022-10-11 at 09:20 [dardel]
The changes made yesterday have had an impact: for example, compute nodes have not been able to connect to external license servers.

New job starts are blocked until this is fixed. Information on when it is fixed will not be sent by mail, only posted at www.pdc.kth.se and in the motd.
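
Once connectivity is restored, a plain TCP probe from a compute node is one way to verify reachability; a hedged sketch, where the license server host/port and the -p/-A values are illustrative:

    # 5-second TCP connect test using bash's built-in /dev/tcp.
    srun -N 1 -p main -A my-project bash -c 'timeout 5 bash -c "</dev/tcp/license.example.com/27000" && echo reachable || echo unreachable'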

2022-10-11 at 00:48 [dardel]
The topology change to the HSN (high speed network) has been made, and the network should now be aware of the forthcoming production topology.

Jobs have been running for a couple of hours.

Login re-enabled.

Please report anomalies.

2022-10-03 at 11:52 [dardel]
On the coming Monday, 2022-10-10, the HSN (high speed network) will be reconfigured to make it aware of the new topology (cables/bundles/sizes/switches) that will be introduced during that week.

No jobs will be allowed to run beyond October 10 at 08:00.

Ideally the changes will be completed within one to two days.

2022-09-27 at 17:43 [dardel]
During ongoing upgrades, DNS lookups (name services) stopped working; repair is in progress.

2022-09-26 at 19:27 [dardel]
The new cooling water circuit is in operation. Fail-over to the backup cooling circuit is also in operation. Jobs have been running again for half an hour.

2022-09-23 at 13:46 [dardel]
The reboot of compute nodes is completed, and jobs are running again. Please note the planned work forthcoming on Monday 2022-09-26.

2022-09-23 at 11:29 [dardel]
We will shortly start to reboot all compute nodes. Some have had jobs in a 'completing' state since yesterday or later, and some show jobs as 'running'. These nodes will also be rebooted.
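
For reference, jobs stuck in the 'completing' (CG) state, and the nodes they still occupy, can be listed like this:

    # Job id, user, node count, and node list of all completing jobs.
    squeue --states=COMPLETING --format="%.10i %.9u %.6D %N"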

2022-09-22 at 18:00 [dardel]
During updates of internal server nodes, many compute nodes started to freeze when accessing local files roughly 90 minutes ago. No new jobs will be allowed to start until further notice.

2022-09-19 at 08:42 [dardel]
Correction: the switch of cooling water will take place on the coming Monday, 2022-09-26 (September 26th), at 10:00, and jobs are correspondingly prevented from running beyond 08:00 that morning.
2022-09-19 at 08:10 [dardel]
A switch of the cooling water circuit for Dardel will take place, starting Monday 2022-09-26 at 10:00. A scheduler reservation is in place preventing jobs from running beyond 2022-09-26 08:00. (The date has been corrected.)

2022-09-13 at 13:11 [dardel]
Name resolution, e.g. when referring to domains and/or hostnames outside dardel, should have been performing as expected for roughly an hour.

2022-09-12 at 22:14 [dardel]
After today's upgrades of kubernetes, which hosts internal services of dardel, name resolution started to malfunction: when referring to domain names and/or host names outside dardel, it fails to resolve them to numerical addresses.
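
A quick way to tell resolver problems apart from other network issues (the hostname is illustrative):

    # Resolve via the system resolver (NSS); failure here matches the
    # symptom described above.
    getent hosts www.kth.se

    # Query DNS directly, for comparison, if the tool is installed.
    nslookup www.kth.se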

2022-09-09 at 17:27 [dardel]
The login node(s) have been up again for half an hour. Running jobs will be left as they are. A rolling reboot of all compute nodes will be made over the forthcoming ~48 hours. Jobs will only be allowed to start on freshly rebooted compute nodes.

2022-09-09 at 16:19 [dardel]
A few minutes ago, the login node(s) started to experience freezes. Since early this morning, an investigation has been ongoing into, among other things, why several jobs do not get out of the 'Completing' state properly, along with other less obvious hiccups.

2022-09-07 at 09:26 [dardel]
A rolling reboot of all compute nodes will start to take place, in order to incorporate a 'missing' K8s worker node into the pool of worker nodes and thereby increase redundancy. The reboots are intended to take place on idle compute nodes, in between user jobs. The aim is to have the bulk of all compute nodes done within 36 hours.

2022-09-02 at 16:10 [dardel]
Work on restarting an interconnect switch is finished, and job starts have been enabled again for half an hour.

Work in preparation for forthcoming hardware/software updates is ongoing, but the intention is that it should not conflict with, i.e. disturb, ordinary day-to-day use unless otherwise announced in advance.

2022-09-01 at 14:00 [dardel]
We currently have a problem with an interconnect switch that needs to be restarted. To avoid potential impact on running jobs, a scheduler block is active starting tomorrow, Friday 2022-09-02, at 09:30.
2022-08-29 at 23:45 [dardel]
The first step in the preparation for hardware/software updates is completed.

Internal PODs have been updated and restarted. Services such as slurm are in operation again. Job submission (sbatch), job starts (srun), etc. are expected to have been back in operation for roughly an hour.

Please report anomalies and/or odd behaviour.

2022-08-27 at 13:55 [dardel]
In preparation for future hardware/software updates, an update of internal PODs running services such as slurmctld is planned to take place on the coming Monday, 2022-08-29, starting at 10:30.

During the updates, interaction with, for example, the slurm master daemon will be blocked; commands such as 'sbatch' or 'squeue' will time out (fail). The updates are expected to be finished during that day.

There is a blocking reservation in place to prevent jobs from getting into a running state, should their execution need to overlap with the updates.

2022-07-17 at 14:22 [beskow]
Full system restart completed.
2022-07-13 at 11:37 [beskow]
Central internal services of beskow experienced hiccups early this morning (e.g. services providing internal boot images). Limited recovery work can at the earliest start in the coming week (i.e. July 18th).
2022-07-06 at 14:30 [dardel]
Many operations/accesses of files in the Klemming file system are sluggish, with high latency. An investigation is in progress. As of writing, there are no indications that this is related to the maintenance work made yesterday.
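
A crude way to gauge the latency yourself, while the investigation is ongoing (the directory path is illustrative):

    # Time a directory listing; metadata-heavy operations are where the
    # sluggishness shows.
    time ls -l /cfs/klemming/home/u/username >/dev/null

    # Lustre-level overview of the file system's targets and usage.
    lfs df -h /cfs/klemming
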
2022-07-05 at 15:57 [dardel]
Klemming metadata server replaced and in operation. Maintenance tasks completed. Job starts resumed.
2022-07-04 at 13:23 [dardel]
Tomorrow, Tuesday July 5, starting at 10:00, maintenance work will be performed on the Klemming file system. A faulty metadata server will be replaced, and some smaller maintenance tasks will be carried out as well. No downtime is expected, but minor disturbances might occur. New job starts will be blocked until the maintenance is done, but running jobs will be allowed to continue until finished.
2022-06-23 at 22:20 [dardel]
Software maintenance is finished. Firmware upgrades of the klemming file system hardware are completed. In addition, the slingshot (high speed network) software was updated from 1.3.1 to 1.7.2, aimed at improving reliability. The Cray programming environment cpe/22.04 is also available on compute/login nodes, though not set as the default. Jobs have gradually been started over the past 6-7 hours, and login is now about to be re-enabled.
2022-06-13 at 14:56 [dardel]
The system will be unavailable during system software maintenance and upgrades, planned to start Monday 2022-06-20 at 09:00 and intended to be finished within that work week, i.e. by Thursday 2022-06-23 at the latest.
2022-05-31 at 17:39 [dardel]
Compute cabinets have been powered up, compute nodes have been booted, and jobs have been running for a couple of minutes.
2022-05-31 at 11:02 [dardel]
During the planned replacement of side stream filters, a large number of compute nodes in the system shut down. Whether related or not, the filter work will continue until completed, and after that, restart work on large parts of the system will begin. Job starts are blocked until further notice.
2022-05-04 at 16:01 [dardel]
The required operations have been made, the investigation is finished, login has been re-enabled, and job starts are enabled again. The root cause (likely certificate related) remains, but the intention is that it can be addressed without interfering with access or running jobs.
2022-05-03 at 19:25 [dardel]
A fortnight ago, on 2022-04-21, certificates used by internal kubernetes services were updated. These are services required to operate the system. Over the past days, basic operations and services have been performing less than optimally, some not at all. One assumption is that more certificates need to be renewed. No logins will be allowed while this is investigated. Running jobs will continue until finished.
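
For reference, certificate expiry can be inspected with openssl; a hedged sketch, where the file path and endpoint are illustrative and not the actual dardel configuration:

    # Expiry date of a certificate file.
    openssl x509 -enddate -noout -in /etc/kubernetes/pki/apiserver.crt

    # Expiry of the certificate served by a live endpoint.
    echo | openssl s_client -connect k8s.example.internal:6443 2>/dev/null | openssl x509 -enddate -noout
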
2022-04-28 at 13:34 [dardel]
Late last evening, 2022-04-27, an internal server within dardel gradually became unresponsive. An overlooked impact was that jobs needing licenses from external license servers could not reach those servers during its absence, so license checkouts failed. The server in question has been restarted over the past couple of hours. License checkouts should now work as usual.
2022-04-22 at 17:42 [dardel]
Updates of the certificates used by internal kubernetes services have been completed, and jobs are running again. The system is expected to perform as it did prior to the updates; please let us know if you experience something else.
2022-04-21 at 19:30 [dardel]
The certificate update work is finished for today and will continue tomorrow, 2022-04-22, starting at 09:30. Certain operations today have had an impact on some of the running jobs. To avoid uncertainties, a scheduler block is placed at 09:30 tomorrow. Only jobs that can finish before that time will be allowed to start running.
2022-04-21 at 10:18 [dardel]
During the initial updates of certificates, the internal name services went out of order. Starts of queued jobs have been blocked. Running jobs are likely affected in many cases.
2022-04-14 at 12:31 [dardel]
Updates of the certificates used by internal kubernetes services will start in the morning of Thursday 2022-04-21 and continue throughout Friday 2022-04-22. During activation, many services will be restarted; for example, the pod running slurmctld (the scheduler master daemon) will temporarily be non-responsive.
2022-03-28 at 12:38 [dardel]
Around 11:47 today, a network reconfiguration caused several jobs to experience MPI-related timeouts/hangs/failures. The issue should have been rectified roughly five minutes later.
2022-03-21 at 21:22
Due to external circumstances beyond our control, the planned test of safety and backup power tomorrow, 2022-03-22, is cancelled. The scheduler blocks for the test have been removed.
2022-03-17 at 14:35
On Tuesday, 2022-03-22 at 10:00, we will simulate a power outage to test our safety and backup power machinery, by shutting down the power feed to our building. The Dardel and beskow schedulers have reservations in place so that the systems have no jobs in a running state during the test.
2022-03-17 at 00:20 [dardel]
As was reported (yesterday, 2022-03-16, around 18:54), access to dardel was no longer possible. This should now have been remedied. Network equipment between dardel and the outside world has been replaced. Any user jobs requiring access to external services were affected during that time.
2022-03-16 at 20:11 [dardel]
Dardel is not reachable from the network. Investigating.
2022-03-14 at 12:50 [dardel]
Yesterday, 2022-03-13, an HSN restart was performed, and jobs have been allowed to run on gradually larger sections of dardel overnight. User logins are now re-enabled.

From the perspective of a user/job, no changes have intentionally been made. The execution environment should be the same as before.

Should the HSN fail on us again, there are several paths forward to choose among, ranging from repeating a plain restart to requesting software upgrades, the latter requiring downtime on the order of days.

2022-03-12 at 19:39 [dardel]
It seems that the HSN network and/or the Lustre file system have issues. Among the symptoms are frozen file system accesses. All new logins are blocked, and no new jobs are allowed to start, until further notice.
2022-03-12 at 16:45 [dardel]
Setup/reconfiguration of the HSN fabric manager is finished. All HSN switches have been restarted. Jobs have been running for a while, and login is about to be re-enabled. From the perspective of a user job, no changes have intentionally been made; the execution environment should be the same as before.
2022-03-11 at 20:44 [dardel]
The setup/reconfiguration of the HSN fabric manager is still ongoing. The HSN (high speed network) needs to be reliably up and running for a well-functioning system. The next progress update will be sent on Monday, or earlier.
2022-03-10 at 20:56 [dardel]
The setup/reconfiguration for new and future nodes is taking slightly more than three days. We think there is a good chance of having the system online and running jobs at some point during the day tomorrow.
2022-03-02 at 17:01
A network switch between Dardel and, for example, external license servers started to fail intermittently yesterday, 2022-03-01, and gave up entirely earlier this afternoon. An alternative switch should now be in operation. License check-outs should be operational again.
2022-02-25 at 13:00 [dardel]
On the morning of Tuesday, March 8, Dardel will be shut down to add more nodes and to prepare for adding additional nodes later this spring. HPE estimates the downtime to be one to three days. The SLURM queues will be drained in advance.
2022-02-24 at 20:57 [dardel]
Four HSN (slingshot) switches were reset, and traffic between nodes and all file servers has resumed. Jobs accessing files during the drop-out were likely affected.
2022-02-24 at 19:04 [dardel]
Since roughly an hour ago, there have been issues reaching the file system servers from compute/login nodes: accesses freeze. One typical effect is that logins fail, as home directory accesses freeze.
2022-02-18 at 14:15 [beskow]
Remote access to beskow.pdc.kth.se (beskow-login2.pdc.kth.se) has been disabled for most users in active allocations, and disabled for users of retired allocations.
2022-01-09 at 14:57 [beskow]
A file system self-test for errors in /cfs/klemming indicated a failure. This is no longer the case, and jobs resumed running a couple of hours ago.
2022-01-08 at 23:03 [beskow]
For the past couple of hours, compute nodes have not been passing self-tests after job termination. Such nodes are automatically set aside and cannot serve further jobs. Fewer and fewer batch nodes will be available for jobs until further notice.
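
For reference, nodes that have been set aside, together with the recorded reasons, can be listed with:

    # List down/drained nodes and the reason each was set aside.
    sinfo -R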