Events:

2009-12-17 at 11:27 [xxx (Hebb)]
The update of Hebb is now finished and the queue is turned on again.
2009-12-16 at 23:30
All queues are now running again.
2009-12-16 at 15:11
All queues are temporarily halted due to a crashed file server. They will be started again once the file server is back on-line. We apologise for the inconvenience.
2009-12-14 at 18:59 [xxx (Hebb)]
Hebb will be unavailable for a system upgrade starting at 09:00 on Thursday, the 17th. The downtime is expected to be 1-2 hours and will affect the whole system, though some parts might become available again earlier than others.
2009-11-30 at 18:03 [xxx (Ekman)]
Node allocation has started again. However, one file server for the scratch filesystem, /cfs/ekman, is currently running without redundancy. This means a greatly increased risk of losing files on that server until the rebuild process is finished. It is expected to finish by Thursday.
2009-11-30 at 10:50 [xxx (Ekman)]
/cfs/ is full. Node allocation is paused until the situation is cleared.
2009-11-27 at 23:13 [xxx (Ellen)]
Key has been on-line again since early afternoon.
2009-11-27 at 23:00
Earlier this evening, 2 of the 3 AFS servers that were partially disconnected yesterday restarted unexpectedly while serving files. One file server mostly contains applications, while the other mostly contains home volumes. Both servers have been back on-line for a couple of hours.
2009-11-27 at 11:04 [xxx (Ellen)]
Currently Key is not accessible for login. We're investigating the cause of the problem.
2009-11-26 at 14:17
During work in the machine room, some AFS file servers seem to have lost some of their network redundancy. During fail-over this has sometimes caused sluggish performance and, on rare occasions, failures (connection timed out).
2009-11-20 at 10:40
The AFS server has now been salvaged; batch operations are gradually resuming.
2009-11-20 at 06:05
We seem to have problems with at least one AFS server. Home directories, applications, and batch-job processing are affected.
2009-11-11 at 12:01 [xxx (Ferlin)]
Batch processing has resumed. No jobs were removed. Unfortunately the new login node has already experienced the same deadlock as the previous one and had to be restarted. We have no clues so far as to the cause. Expect a somewhat bumpy road ahead.
2009-11-10 at 22:41 [xxx (Ferlin)]
The login node of Ferlin has gone bad. We will try to recover it tomorrow. In the meantime a new login node has been set up; you should reach it through the same name/method (ferlin.pdc.kth.se). Until we have managed to salvage the old node, or written it off as junk, no new jobs will start. Possibly all waiting and held jobs will be removed; this will be decided some time tomorrow, 2009-11-11.
2009-11-04 at 16:21 [xxx (lenngren)]
Due to the security exploit CVE-2009-3547, all logins have been disabled on the "Lenngren" system. Running jobs will be able to finish, though. As an alternative for file access (AFS), use the "Ferlin" system. If you have any problems logging in there, contact our support. We regret the problems that this may cause you. For PDC-staff, Harald.
2009-11-03 at 17:31
The AFS server is up again. This is what happened: after replacing a faulty redundant HD, the device driver for accessing the RAID card panicked the kernel and the machine crashed. So much for investing in "hot-swappable" devices. We don't expect any data loss, but your jobs using that AFS server (named trevally) might have crashed as well if they tried to use it during the downtime. For PDC-staff, Harald.
2009-11-03 at 17:22
One AFS server is down and all jobs writing to AFS volumes located on this server will be affected. Manual access to volumes on this server is also not possible.
2009-11-01 at 18:31 [xxx (Ferlin)]
The login node of Ferlin has gone catatonic and is being restarted.
2009-10-09 at 19:22 [xxx (Ekman)]
User jobs are now starting on Ekman again. The delay was due to problems activating the new server when restarting the scratch filesystem after the upgrade. The size of the filesystem has also been increased by ~20 TB to a total of ~80 TB.
2009-09-29 at 16:12
Relayed, informational, for CSC users: Maintenance work on some CSC servers will be performed on Saturday 10 October starting at 10 am. Most UNIX computers at CSC will be heavily affected during this time. Services like email and www will also be affected.
2009-09-29 at 12:00 [xxx (Ekman)]
Between Thursday the 8th of October at 09:00 and Friday the 9th of October at 17:00, Ekman will be inaccessible due to a cooling acceptance test. We are sorry for any inconvenience that this might cause.
2009-09-29 at 11:56 [xxx (Ferlin)]
Between Thursday the 8th of October at 09:00 and Friday the 9th of October at 17:00, Ferlin will be inaccessible due to a cooling acceptance test. We are sorry for any inconvenience that this might cause. PDC Staff
2009-09-18 at 11:28 [xxx (Ekman)]
The scratch filesystem on Ekman is now back again after one fileserver crashed. Some tuning was also done to hopefully reduce the risk of another crash before the upcoming server upgrade of the filesystem.
2009-09-18 at 08:02 [xxx (Ekman)]
The Ekman cluster file system seems to be unavailable.
2009-09-15 at 16:00 [xxx (lenngren)]
Juliana is now back in production.
2009-09-15 at 15:58 [xxx (lenngren)]
Lenngren and Betti are now back in production.
2009-09-15 at 11:09 [xxx (Ekman)]
The fileserver has now been restarted and the queue enabled again.
2009-09-15 at 10:35 [xxx (Ekman)]
Queue on Ekman currently paused due to problems with one of the fileservers for the cluster filesystem. It is expected to be back shortly.
2009-09-14 at 17:22 [xxx (Ellen)]
Key is back after a system upgrade and subsequent reboot.
2009-09-11 at 23:28 [xxx (lucidor)]
The queue on Lucidor is now running again and login to the login-node is enabled.
2009-09-11 at 14:16
CVE-2009-2698 continued: we will now turn our attention to the systems Lucidor, Key, and Lenngren, in that order.
2009-09-10 at 07:54 [xxx (Ekman)]
As announced on the user-list yesterday, ekman is available again. Please use ekman-tmp.pdc.kth.se to log in for the time being.
2009-09-09 at 18:05 [xxx (Ferlin)]
All of Ferlin is gradually being patched. If you log out and log in to ferlin.pdc.kth.se, you should end up on a new login node. If you are denied access, your name service has not yet caught up with the change, and you could try again a little later. All compute nodes will from now on run CentOS 5.3. This is not expected to cause any change in application behaviour.
2009-09-09 at 11:23
Due to CVE-2009-2698 we need to upgrade most systems. Systems will gradually become available again, the first of them hopefully during today (2009-09-09).
2009-09-09 at 08:10 [xxx (Ekman)]
/cfs/ekmanscratch: we have issues with one of the disk servers not responding. Accesses to /cfs/ekmanscratch get stuck.
2009-08-18 at 07:28
There are quite a few log entries for systems having problems writing files to /afs/ between 03:00 and ~05:00 this morning. Your job might have experienced problems. Network timeouts and/or server timeouts are among the possible causes.
2009-08-14 at 15:50
Login should now be enabled again on previously disabled systems.
2009-08-14 at 13:40
We have disabled most new login sessions, as there are several reports of a severe security vulnerability in the Linux kernel.
2009-08-06 at 14:05 [xxx (Ekman)]
The cluster-wide file system (/cfs/ekmanscratch) has been restarted. Please report if you experience anomalies.
2009-08-06 at 09:16 [xxx (Ekman)]
Operations on the cluster-wide file system (/cfs/ekmanscratch) are getting stuck. All nodes are affected. Troubleshooting has been initiated.
2009-07-27 at 13:58 [xxx (Ferlin)]
Due to repeated failures of the login node of the Ferlin cluster, it has been decided to replace the entire node. This means a slightly longer queue stop until all relevant files have been moved from the old node to the new one.
2009-07-20 at 13:38 [xxx (Ferlin)]
The login node for the Ferlin cluster (ferlin.pdc.kth.se) will be taken down for hardware replacement on Tuesday at 12:00, and possibly also on Wednesday at 12:00. The downtime is expected to be about 15 minutes. During this time the node will not be accessible and no new jobs will be started on the cluster. There will be no impact on running or queued jobs.
2009-07-20 at 12:24 [xxx (Ekman)]
Ekman will have several service windows over the coming two weeks in order to perform system-wide hardware replacements. More information will be sent to the users of Ekman through vagnekman-users@snic.vr.se.
2009-07-16 at 16:49 [xxx (Ferlin)]
The login node of Ferlin crashed with a kernel panic for an unknown reason. It is rebooting now.
2009-07-08 at 12:32 [xxx (Ferlin)]
The login node is now up again and the queue has been resumed. The crash was probably due to problems with the AFS cache filesystem this time. Please report anything out of the ordinary.
2009-07-08 at 11:55 [xxx (Ferlin)]
Once again, the login node of Ferlin is being rebooted due to careless usage.
2009-07-06 at 18:13 [xxx (Ellen)]
Key has run out of memory and has consequently been restarted.
2009-07-04 at 23:38 [xxx (Ferlin)]
The login-node (ferlin.pdc.kth.se) was restarted as it crashed due to excess memory use.
2009-07-04 at 21:56 [xxx (Ferlin)]
The login node of Ferlin is currently unavailable.
2009-07-02 at 11:10 [xxx (Ferlin)]
The login node of Ferlin, ferlin.pdc.kth.se, is no longer responding and is therefore being rebooted.
2009-06-23 at 11:37 [xxx (Ferlin)]
The login node of Ferlin (aka ferlin.pdc.kth.se) got stuck and has been restarted.
2009-06-22 at 11:10 [xxx (Ekman)]
The cooling system of Ekman failed over the weekend, leading to the failure of several jobs, continued lowered stability of the entire system, and a total outage of minor parts. Updates will follow.
2009-06-15 at 10:00
AFS file server kelp.pdc.kth.se is back up. Reason for the stop: the RAM suffered an uncorrectable memory error, and as a result the operating system halted. The RAM has been replaced with spare RAM and the file server is back in production. At the same time, the AFS server software has been upgraded. We have _not_ seen any error messages that indicate data loss, but if you find anything missing, please let us know.
2009-06-15 at 08:49
One AFS server is reporting hardware problems. Your volumes residing on that file server will not respond ('Connection timed out' would be a typical symptom).
2009-06-10 at 12:24 [xxx (Hebb)]
Hebb is now back in production after being restarted and passing diagnostics. The broken CPU-card has now been replaced. Please report anything out of the ordinary.
2009-06-10 at 10:32 [xxx (Hebb)]
Hebb is currently down due to some problems during maintenance.
2009-05-28 at 10:39 [xxx (lenngren)]
The login node (lise.pdc.kth.se) has been salvaged and is online again.
2009-05-28 at 10:29 [xxx (lenngren)]
Hardware issues on the login node lise.pdc.kth.se. Investigation in progress.
2009-05-15 at 23:45 [xxx (Ferlin)]
The test on ferlin is now finished and the scheduling has returned to normal.
2009-05-15 at 13:42 [xxx (HSM)]
The HSM system is now fully functional again.
2009-05-14 at 16:56 [xxx (Ferlin)]
Ferlin is currently reserved for a performance test taking place tomorrow. Our apologies for the short notice.
2009-05-14 at 13:11 [xxx (HSM)]
A technician will get here tomorrow before noon and will probably fix the problem. The HSM will be unavailable until the problem is fixed.
2009-05-14 at 11:28 [xxx (HSM)]
The HSM is currently off-line due to problems with the tape library. Awaiting support from vendor.
2009-04-29 at 03:34 [xxx (lenngren)]
The scheduler node on Lenngren/Lise got a kernel panic yesterday evening and froze. It has now been restarted.
2009-04-27 at 11:10 [xxx (Ferlin)]
Removing excess processes that are overusing the interactive nodes and the login node.
2009-04-17 at 17:43 [xxx (Ferlin)]
4 racks of Ferlin (nodes starting with a08,a09,a10 and a11) are switched off for network testing during Monday, April 27th 2009. Testing will take at most one working day. Easy will not schedule jobs onto the affected nodes.
2009-04-06 at 14:32 [xxx (Ferlin)]
As /scratch on the login node (ferlin.pdc.kth.se) of the Ferlin cluster is full, it will be selectively emptied, with immediate effect. Files belonging to the users that consume the most space will be deleted first. When using resources on login nodes, please keep in mind that they are shared with other users.
2009-03-27 at 13:30 [xxx (Ferlin)]
The interactive node a11c11n16 was rebooted due to running out of memory.
2009-03-17 at 13:14
The KTH telephone system (voice) including the PDC support number (790 7800) has major reachability problems. When calling PDC from inside or outside KTH you may get random error messages or just "nothing". This has been reported to KTH telephone services. Unfortunately, we do not have a date/time estimate when this will be fixed. PDC is still reachable through email.
2009-03-01 at 15:09 [xxx (Ferlin)]
The login node (ferlin.pdc.kth.se) of Ferlin has now been reset. It has been unreachable for large parts of today.
2009-03-01 at 15:06 [xxx (Ferlin)]
The login node (ferlin) of the Ferlin cluster is not responding.
2009-02-23 at 13:48
Because of a memory fault in the email server, mail service to addresses ending @pdc.kth.se was delayed from Feb 22 04:41:57 to Feb 23 09:18:18. Email service is now working properly again.
2009-02-13 at 07:55
Forwarded/Informational: Maintenance work on some CSC servers will be performed on Saturday 28 February starting at 10 am. Most UNIX computers at CSC will be heavily affected during this time. Services like email and www will also be affected.
2009-02-11 at 14:53
Power is now restored, and the systems that had to be shut down are now gradually being restarted.
2009-02-11 at 13:49
There is currently no electricity in the building where PDC is located. An investigation into the cause is in progress.
2009-02-04 [xxx (lucidor)]
The Myricom switch is now in operation again. Running parallel jobs over the interconnect on Lucidor is again possible!
2009-02-03 at 10:23 [xxx (lucidor)]
The Lucidor interconnect will be up shortly, hopefully today. The somewhat delayed start is due to an update of the MX stack and the MPI implementations.
2009-01-28 at 13:55 [xxx (lucidor)]
The replacement for Lucidor's broken Myricom switch has finally arrived today from Köln! We are checking its status and hope that we will have Lucidor accepting more than single node jobs before this week is over. As always - stay tuned for more flash news! Cheers, PDC Support
2009-01-13 at 16:02 [xxx (lucidor)]
The Myricom switch on Lucidor is still not available. A new one has been ordered, but we do not yet have an estimate of when it will arrive. However, meanwhile you may now run single-node jobs on Lucidor! So you can submit jobs that do not, for instance, use MPI and communicate over the high-speed Myricom interconnect.