Events:

2005-12-31 at 00:09
Network syrup: bit errors on the network might have caused problems with network-sensitive operations (e.g. file accesses on shared filesystems). Please report anomalies.
2005-12-23 at 13:00
The PDC helpdesk shuts down for the holidays. We restart January 2nd, 2006. Happy holidays!
2005-12-21 at 14:24 [xxx (HSM)]
The HSM system is now open again. As always, please report anything out of the ordinary.
2005-12-19 at 16:59
ftp to ftp.pdc.kth.se re-enabled.
2005-12-16 at 10:31 [xxx (lenngren)]
Job submission is now enabled.
2005-12-16 at 10:03 [xxx (lenngren)]
Login to Lenngren works, but job submission is not yet possible. It will be fixed within a couple of hours.
2005-12-15 at 21:30
Lenngren: Juliana and Lise batch-lines are open again. Jobs submitted prior to the shutdown are now executing, with some exceptions. Please report anomalies.
2005-12-13 at 11:55
It has been found that a few computer accounts at the KTH School of Computer Science and Communication, including PDC, have been misused. Measures have been taken during the weekend to investigate and control the potential damage. Systems are now gradually coming back to normal operation and login will be enabled again. Sorry for the inconvenience this incident is causing.
2005-12-12 at 09:46
Sorry, but the login to PDC systems will be disabled for a few more hours.
2005-12-09 at 20:22
User login to all systems will be disabled during the weekend due to urgent maintenance of the Kerberos database. Sorry for the short notice and the inconvenience this might cause.
2005-12-08 at 17:55
The ftp server was restarted due to really slow response times.
2005-12-01 at 17:21 [xxx (strindberg)]
/gpfs/scratch/ operational again. Please report anomalies.
2005-12-01 at 16:00 [xxx (strindberg)]
Node serving /gpfs/scratch/ got a fault. Recovery in progress.
2005-11-28 at 10:00 [xxx (lenngren)]
All nodes reserved for system work from 2005-11-28 10:00:00. Ethernet switch replacement (all 15), mass reboot and computer room a/c test. Duration 2-6h.
2005-11-21 at 01:15 [xxx (SBC / CBR)]
The SBC cluster queue will be paused until the present problems with the AFS server cysteine.pdc.kth.se have been resolved. We apologise for any inconvenience this may cause.
2005-11-18 at 18:01 [xxx (SBC / CBR)]
The SBC afs server cysteine.pdc.kth.se, hosting most SBC users' home volumes, has crashed again and is currently being updated and restarted.
2005-11-17 at 17:22 [xxx (SBC / CBR)]
The SBC afs server cysteine.pdc.kth.se, hosting most SBC users' home volumes, has crashed and is currently being restarted.
2005-11-05 at 10:00
NADA reports work on their servers. If you have your $HOME in /afs/nada.kth.se/... you may be affected. PDC-only users are not affected.
2005-10-26 at 18:19 [xxx (lucidor)]
Log in node of Lucidor (blumino.pdc.kth.se) is being rebooted to clear a broken system cache.
2005-10-26 at 17:58 [xxx (strindberg)]
Due to excessive user behaviour on the log in node of kallsup, all running jobs were lost.
2005-10-22 at 10:00
NADA reports work on their servers. If you have your $HOME in /afs/nada.kth.se/... you may be affected. PDC-only users are not affected.
2005-10-15 at 13:21 [xxx (SBC / CBR)]
The previously announced service of SBC AFS-servers is now completed.
2005-10-15 at 13:00 [xxx (SBC / CBR)]
The SBC afs server cysteine.pdc.kth.se will be restarted in order to fix a critical server-local filesystem problem. The SBC queue will be paused until the server is back on-line, which is expected to be within 2 hours. Please note that most SBC home-directories will be unavailable during the restart.
2005-10-13 at 12:42 [xxx (SBC / CBR)]
Due to problems with at least two AFS servers, the SBC cluster queue is temporarily stopped. Additionally, a service window has been scheduled for 22:00 this evening, during which most of SBC's AFS volumes are likely to be unavailable.
2005-10-12 at 13:00
The tsm (backup) system is back online with reduced capacity. (The previous message regarding this matter was dated poorly.)
2005-10-12 at 12:00
The tsm system is offline due to a faulty raid disk system.
2005-10-10 at 19:25 [xxx (lenngren)]
The log-in node lise.pdc.kth.se of the Lenngren cluster will be rebooted at 19:30 to clear a system cache.
2005-10-08 at 11:00 [xxx (SBC / CBR)]
The SBC afs server aspartate.pdc.kth.se will be restarted in order to fix a problem which, among other things, affects backups. The SBC queue will be paused until the server is back on-line, which is expected to be within 2 hours.
2005-09-27 at 10:09
Systems lenngren and lucidor have new versions of compilers and mathematical libraries installed. The previous versions are available with the short name "previous". See the output from "module avail" and "module whatis" for more information.
2005-09-16 at 14:00
The fileserver houting's OS crashed when one of the network ports it was attached to broke. The fileserver has been restarted.
2005-09-16 at 11:00 [xxx (lenngren)]
Whole cluster will be down for upgrade of 3Com ethernet switch firmware and AFS client software. If time permits, upgrade of infiniband switch firmware, too. Duration: Whole day (approx. to 18:00).
2005-09-06 at 06:33 [xxx (lucidor)]
The log-in node of Lucidor (blumino.pdc.kth.se) has now been rebooted for maintenance.
2005-09-01 at 16:38 [xxx (lenngren)]
Another gigabit ethernet switch died (total fan failure) Thursday evening. This time the one for rack #8. Until Dell sends a replacement, the nodes are connected with some temporary switches from various sources. The nodes are held in "interactive" mode because network performance may vary. Have a look at spusage or our web for node status.
2005-08-25 at 10:10 [xxx (lucidor)]
The login node was rebooted due to AFS client problems.
2005-08-25 at 09:17 [xxx (HSM)]
The HSM system will be brought down on Monday for hardware upgrades. It will remain unavailable until Thuesday afternoon.
2005-08-24 at 13:40 [xxx (lucidor)]
Lucidor - high load on log-in (blumino) - log in rebooted.
2005-08-23 at 16:30 [xxx (lenngren)]
Login node lise was rebooted to fix some AFS problems (the new version is hoped to fix this particular issue for good).
2005-08-18 at 09:46 [xxx (lucidor)]
The login node blumino had to be rebooted due to an acute case of afs trouble.
2005-08-12 at 11:01
The PDC Summer School will be held August 15-26 and will have higher priority on some of the systems at PDC. Mostly daytime and mainly on Lucidor.
2005-08-01 at 15:28 [xxx (lenngren)]
An ethernet switch has died, taking rack 9 with it.
2005-07-06 at 12:41 [xxx (lenngren)]
AFS on the Lenngren login node is now back up again.
2005-07-06 at 10:24 [xxx (lenngren)]
There are currently AFS client problems on the Lenngren login node lise.pdc.kth.se. Investigations are in progress.
2005-06-29 at 13:45 [xxx (lenngren)]
Allocation has resumed.
2005-06-29 at 13:33 [xxx (lenngren)]
We had a switch failure in rack 2; allocation has been paused while the rack comes up again.
2005-06-22 at 13:05 [xxx (SBC / CBR)]
SBC scheduling is back on track.
2005-06-22 at 09:43 [xxx (SBC / CBR)]
The scheduler-host of the sbc-cluster has broken hardware (disk). One can still submit/remove jobs, but transactions will not be processed until replacement/repair/restart is completed.
2005-06-22 at 01:03 [xxx (SBC / CBR)]
The SBC afs server aspartate.pdc.kth.se is currently being restarted after failure. The SBC queue will be paused until the server is back on-line, which is expected to be within 1 hour.
2005-06-07 at 12:00 [xxx (strindberg)]
Nighthawk: All of /gpfs/projects/ should be back on-line.
2005-06-07 at 11:44 [xxx (lenngren)]
A critical server node was accidentally rebooted. This _may_ affect running jobs using the infiniband network.
2005-06-05 at 19:41 [xxx (strindberg)]
Nighthawk - node serving /gpfs/projects/ has hardware problems. Contents residing on that node will give I/O errors.
2005-06-01 at 17:10
General: resuming node allocation.
2005-06-01 at 11:20
General: Kerberos problems reappeared. Allocations stopped.
2005-05-31 at 18:30
General: allocation resumed (again). Please report anomalies to pdc-staff@pdc.kth.se!
2005-05-31 at 15:30
General: ticket forwarding problems remain. All allocation paused.
2005-05-31 at 15:15
General: allocation resumed.
2005-05-31 at 12:33
General: ticket/kerberos problems. All schedulers paused.
2005-05-19 at 09:43
Forwarded/Informational - nada users: Maintenance work on many of the servers at Nada will be performed on 28 May starting at 10:00 am. Most UNIX computers at Nada will be heavily affected during this time. Services, like E-mail and WWW, will also be affected.
2005-05-11 at 09:22 [xxx (SBC / CBR)]
SBC queue restarted, AFS client updated.
2005-05-10 at 22:55 [xxx (SBC / CBR)]
Queue temporarily paused due to faulty AFS client on newly installed nodes.
2005-05-04 at 22:28
The PDC helpdesk will be closed for the holiday and opens again on Monday 9 a.m.
2005-05-04 at 14:30 [xxx (lucidor)]
Lucidor class A-nodes got 64b-aware afs-clients. Please report any unexpected anomalies.
2005-05-04 at 10:50
General: one of the lines in/out of pdc, through kth-ea4, is now disabled as kth-ea4 seems unreliable. Depending on your location, traffic should now flow without loss or not at all. Users external to kth should now have full connectivity.
2005-05-01 at 17:30 [xxx (lucidor)]
Info: The 2 interactive nodes will be rebooted 21:00 to activate new software. Log in has moved. Please report anomalies.
2005-04-30 at 16:08
Dates of previous three items corrected.
2005-04-30 at 16:00 [xxx (strindberg)]
Nighthawk: one node dead. /gpfs/scratch/ is operational but reading old files (with data on that node) will give I/O error for those pieces of data.
2005-04-28 at 17:40
Informational - if your home directory is in the afs-cell nada.kth.se: there seem to be problems accessing files in the nada-cell right now.
2005-04-28 at 17:38
General: there was a medium short network outage a while ago. Things went sluggish or not at all; most running jobs still execute as normal.
2005-04-21 at 15:51 [xxx (lucidor)]
Log in Lucidor/Blumino re-enabled.
2005-04-21 at 12:55 [xxx (lucidor)]
Login to Lucidor/Blumino disabled until further notice.
2005-04-20 at 10:00
Router software upgrade. Because we would rather be safe than sorry with your jobs, allocation will be stopped for the planned 4-hour period. Short interruptions of access to storage and login can happen during the period. Scheduling will start again as soon as we are done.
2005-04-04 at 15:06 [xxx (HSM)]
The HSM system is now fully functional again.
2005-04-04
Beppe/Swegrid/SEs will be offline Wednesday April 6th while adding to the disk sub-systems.
2005-04-04 at 12:00 [xxx (HSM)]
The HSM system is unavailable. Fault search in progress.
2005-04-04 at 09:37
Licence server is back up
2005-04-04 at 09:29
Yet another problem with the license server. Investigation in progress.
2005-04-01 at 11:10
Licence server is back up
2005-04-01 at 11:10
Licence server is misbehaving again. Investigation is in progress.
2005-03-31 at 12:02
License server is back up (FORESYS license is still down)
2005-03-31 at 11:27
License server is having trouble. Investigation is in progress.
2005-03-29 at 16:16 [xxx (HSM)]
Tape library (robot arm) working again, but status unknown. IBM will service tomorrow. Service time unknown.
2005-03-29 at 16:16 [xxx (HSM)]
Tape library (robot arm) unavailable, files from tape storage cannot be fetched. Time to fix not yet known.
2005-03-24 at 17:13
The license server went up and down for a bit. It should hopefully be more stable now.
2005-03-23 at 10:08
General: network maintenance work initiated. You might experience short (~seconds) dropouts in connectivity. Work should be completed by 14:00. As a safety precaution no production jobs will be started until the work is done.
2005-03-19 at 00:36
General: the main router had a stuck card causing most services to get stuck. Your job was most certainly hit.
2005-03-18 at 17:22
Today's third afs crash has been recovered. A quick fix has been applied. More thorough fixes and further preventive measures have been identified and planned. The paused queues will be restarted after a suitable quarantine period.
2005-03-18 at 14:37
The afs server has crashed once more. Deeper investigation to follow.
2005-03-18 at 12:10
The reason for the afs crash has been pin-pointed and all paused queues are back on again.
2005-03-18 at 10:34
Major afs hiccup. Investigation in progress, but the worst seems to have passed.
2005-03-10 at 16:39 [xxx (SBC / CBR)]
The SBC AFS fileserver cysteine.pdc.kth.se will be restarted at 20:00 today (2005-03-10). The restart is expected to be unnoticeable, but may (depending on the severity of the fault causing the restart) cause a downtime of approximately 45 minutes. Please note that most SBC home directories reside on cysteine. The reason for the restart is that the volserver, which handles, among other things, backups, is unresponsive, thus effectively stopping backups of the volumes on cysteine until fixed.
2005-03-06 at 17:31
The fileserver gills is being restarted.
2005-02-14 at 12:25 [xxx (lucidor)]
Blumino.pdc.kth.se (log in node of lucidor) will be maintenance rebooted at 15:00 today.
2005-02-07 at 15:30
ftp.pdc.kth.se and some related services will be unavailable for approximately 30 minutes, due to maintenance.
2005-02-04 at 11:40 [xxx (HSM)]
HSM system is back up.
2005-02-03 at 16:15 [xxx (HSM)]
The HSM system did not go up at 15:00; possibly a hardware fault. A hardware support technician will arrive tomorrow, 2005-02-04, and start examining the interior parts.
2005-02-01 at 16:00
Informational, forward from nada: Maintenance work on many of the servers at Nada will be done on 12 February starting at 10:00 am. Most UNIX computers at Nada will be heavily affected during this time. Services, like E-mail and WWW, will also be affected.
2005-02-03 at 10:00 [xxx (HSM)]
The HSM system will be brought down for hardware maintenance. It should be up again by 15:00.
2005-01-31 at 14:25
afs-services on gills are to be restarted shortly.
2005-01-31 at 11:45
Afs-server gills is having problems. Investigation in progress.
2005-01-19 at 15:38
We have continuing AFS problems which cause some volumes to drop out from time to time.
2005-01-18 at 21:16 [xxx (SBC / CBR)]
The fileserver alanine is now back in service (with updated software).
2005-01-18 at 19:30 [xxx (SBC / CBR)]
The fileserver alanine has possibly hung. Investigation in progress.
2005-01-18 at 11:34 [xxx (SBC / CBR)]
The fileserver on alanine.pdc.kth.se was hung. It has been restarted and AFS for SBC should be available again now.
2005-01-13 at 19:00 [xxx (strindberg)]
Nighthawk: the log in node will be maintenance rebooted tomorrow, 2005-01-14, at 15:00.
2005-01-13 at 13:07 [xxx (SBC / CBR)]
Fileserver alanine also had to be restarted. Scheduler should be running again.
2005-01-13 at 10:29
Problems with an AFS server caused some AFS volumes to be unavailable. This should now be fixed.
2005-01-10 at 18:00 [xxx (lucidor)]
The myricom network was 'reset' on all compute nodes. Please report any myricom related problems.
2005-01-05 at 19:55 [xxx (SBC / CBR)]
The SBC AFS fileserver aspartate.pdc.kth.se will be restarted at 15:00 2005-01-06. The restart is expected to be unnoticeable, but may (depending on the severity of the fault causing the restart) cause a downtime of approximately 45 minutes. The reason for the restart is that the volserver, which handles, among other things, backups, is unresponsive, thus effectively stopping backups of the volumes on aspartate until fixed.