Events:

2022-05-04 at 16:01 [dardel]
The required operations have been made, the investigation is finished, logins have been re-enabled, and job starts are enabled again. The root cause (likely certificate related) remains, but the intention is that it can be addressed without interfering with access or running jobs.
2022-05-03 at 19:25 [dardel]
A fortnight ago, on 2022-04-21, certificates used by internal kubernetes services were updated. These services are required to operate the system. Over the past few days, basic operations and services have been performing less than optimally, some not at all. One assumption is that more certificates need to be renewed. No logins will be allowed while this is investigated. Running jobs will continue until they finish.
2022-04-28 at 13:34 [dardel]
Late yesterday evening, 2022-04-27, an internal server within dardel gradually became unresponsive. An overlooked impact was that jobs needing licenses from external license servers could not reach those servers while it was down, so license checkouts failed. The server in question was restarted during the past couple of hours. License checkouts should now work as usual.
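
For users who want to check for themselves that an external license server is reachable again from a login node, a minimal probe along the following lines can help. This is a generic sketch: the host name and port below are hypothetical placeholders, not PDC's actual license servers (FlexLM-style servers often listen on port 27000, but check your application's license settings).

#!/usr/bin/env python3
# Minimal TCP reachability probe for an external license server.
# The host and port are hypothetical placeholders; substitute the
# values your application's license checkout actually uses.
import socket
import sys

LICENSE_HOST = "license.example.com"  # hypothetical license server
LICENSE_PORT = 27000                  # common FlexLM default; adjust as needed

def reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError as exc:
        print("cannot reach %s:%d: %s" % (host, port, exc), file=sys.stderr)
        return False

if __name__ == "__main__":
    sys.exit(0 if reachable(LICENSE_HOST, LICENSE_PORT) else 1)
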
2022-04-22 at 17:42 [dardel]
Updates of the certificates used by internal kubernetes services have been completed, and jobs are running again. The system is expected to perform as it did prior to the updates; please let us know if you experience otherwise.
2022-04-21 at 19:30 [dardel]
The certificate update work has finished for today and will continue tomorrow, 2022-04-22, starting at 09:30. Certain operations today have had an impact on some of the running jobs. To avoid uncertainties, a scheduler block has been placed at 09:30 tomorrow; only jobs that can finish before that time will be allowed to start.
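
In practice, such a block means the scheduler will only start a queued job if its requested walltime fits before the block begins. A minimal sketch of that check (a hypothetical illustration, not PDC's scheduler code):

from datetime import datetime, timedelta

# The scheduler block placed for 2022-04-22 at 09:30.
BLOCK_START = datetime(2022, 4, 22, 9, 30)

def may_start(now, walltime):
    """True if a job starting now would finish before the block begins."""
    return now + walltime <= BLOCK_START

# A 12-hour job submitted the evening before may still start ...
print(may_start(datetime(2022, 4, 21, 20, 0), timedelta(hours=12)))  # True
# ... while a 24-hour job must wait until after the block.
print(may_start(datetime(2022, 4, 21, 20, 0), timedelta(hours=24)))  # False
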
2022-04-21 at 10:18 [dardel]
During the initial certificate updates, the internal name services stopped working. Starts of queued jobs have been blocked. Running jobs are likely affected in many cases.
2022-04-14 at 12:31 [dardel]
Updates of the certificates used by internal kubernetes services will start on the morning of Thursday 2022-04-21 and continue throughout Friday 2022-04-22. During activation, many services will be restarted; for example, the pod running slurmctld (the scheduler master daemon) will temporarily be unresponsive.
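
As a general illustration of how the remaining lifetime of a certificate served by an endpoint can be checked (not a description of PDC's internal procedure; the host and port below are hypothetical placeholders, and internal kubernetes services would typically also require trusting the cluster's own CA):

import socket
import ssl
import time

HOST = "k8s.example.com"  # hypothetical endpoint
PORT = 443                # hypothetical port

def cert_days_left(host, port, timeout=5.0):
    """Return the number of days until host:port's TLS certificate expires."""
    ctx = ssl.create_default_context()  # verifies against the system trust store
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires - time.time()) / 86400.0

if __name__ == "__main__":
    print("%s:%d expires in %.1f days" % (HOST, PORT, cert_days_left(HOST, PORT)))
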
2022-03-28 at 12:38 [dardel]
Around 11:47 today, a network reconfiguration caused several jobs to experience MPI-related timeouts, hangs, and failures. The issue should have been rectified roughly five minutes later.
2022-03-21 at 21:22
Due to external circumstances beyond our control, the test of the safety and backup power planned for tomorrow, 2022-03-22, is cancelled. The scheduler blocks covering the test have been removed.
2022-03-17 at 14:35
On Tuesday, 2022-03-22 at 10:00, we will simulate a power outage to test our safety and backup power machinery by shutting down the power feed to our building. Reservations have been made in the Dardel and Beskow schedulers so that neither system has jobs in a running state during the test.
2022-03-17 at 00:20 [dardel]
As was reported yesterday, 2022-03-16, around 18:54, access to dardel was no longer possible. This should now have been remedied; the network equipment between dardel and the outside world has been replaced. Any user jobs requiring access to external services will have been affected.
2022-03-16 at 20:11 [dardel]
Dardel not reachable from the network. Investigating.
2022-03-14 at 12:50 [dardel]
Yesterday, 2022-03-13, an HSN restart was performed, and jobs have been allowed to run on gradually larger sections of dardel overnight. User logins are now re-enabled.

From the perspective of a user or job, no changes have intentionally been made; the execution environment should be the same as before.

Should the HSN fail on us again, there are several paths forward to choose among, ranging from repeating a plain restart to requesting software upgrades that would require downtime on the order of days.

2022-03-12 at 19:39 [dardel]
It seems that the HSN and/or the lustre file system have issues; among the symptoms are frozen file-system accesses. All new logins are blocked and no new jobs are allowed to start until further notice.
2022-03-12 at 16:45 [dardel]
The setup/reconfiguration of the HSN fabric manager has finished and all HSN switches have been restarted. Jobs have been running for a while, and logins are about to be re-enabled. From the perspective of a user job, no changes have intentionally been made; the execution environment should be the same as before.
2022-03-11 at 20:44 [dardel]
The setup/reconfiguration of the HSN fabric manager is still ongoing. The HSN (high speed network) needs to be reliably up and running for a well-functioning system. The next progress update will be sent on Monday, or earlier.
2022-03-10 at 20:56 [dardel]
The setup/reconfiguration for new and future nodes is taking slightly more than three days. We think there is a good chance of having the system online and running jobs at some point tomorrow.
2022-03-02 at 17:01
A network switch between Dardel and, e.g., external license servers started experiencing intermittent failures yesterday, 2022-03-01, and finally gave up earlier this afternoon. An alternative switch should now be in operation, and license checkouts should be working again.
2022-02-25 at 13:00 [dardel]
dardel: On the morning of Tuesday, March 8, Dardel will be shut down to add more nodes and to prepare for adding further nodes later this spring. HPE estimates the downtime at one to three days. SLURM queues will be drained in advance.
2022-02-24 at 20:57 [dardel]
dardel: four HSN (slingshot) switches were reset, and traffic between nodes and all file servers has resumed. Jobs accessing files during the drop-out were likely affected.
2022-02-24 at 19:04 [dardel]
dardel: for roughly an hour there have been issues reaching file-system servers from compute/login nodes; accesses freeze. One typical effect is that logins fail, as home-directory accesses freeze.
2022-02-18 at 14:15 [beskow]
Remote access to beskow.pdc.kth.se (beskow-login2.pdc.kth.se) has been disabled for most users in active allocations, and for users of retired allocations.
2022-01-09 at 14:57 [beskow]
A file-system self-test for errors in /cfs/klemming indicated a failure. This is no longer the case, and jobs resumed starting a couple of hours ago.
2022-01-08 at 23:03 [beskow]
For the past couple of hours, compute nodes have not been passing the self-test after job termination. Such nodes are automatically set aside and cannot serve further jobs, so fewer and fewer batch nodes will be available until further notice.