Events:

1996-12-27 at 08:00
Finally, it's time for an upgrade! The Kallsup (Cray) machine will be upgraded to 32 model 'se' CPUs.
The upgrade will begin 1997-01-02 and is expected to require two days of downtime.
Since the HSM system runs on Kallsup, it will also be unavailable during the upgrade.
1996-12-20 at 22:30
Hsm is also back.
1996-12-20 at 21:30
Kallsup got back running an hour ago. Verifying that hsm is back.
1996-12-20 at 20:00
Kallsup panicked; reboot in progress. This also impacts hsm.
1996-12-15 at 02:45
Shark had a broken power supply, but should be back up any minute.
1996-12-15 at 02:15
Fileserver shark is currently down.
1996-12-12 at 23:25
Scheduler shaky. Investigations are going on.
1996-12-12 at 00:25
Scheduler restarted. No jobs seem to have been lost.
1996-12-11 at 15:26
Scheduler in better condition. No jobs seem to have been lost.
1996-12-11 at 15:08
Scheduler got unstable again. Still keeping an eye on it. No jobs seem to have been lost though.
1996-12-11 at 14:20
Scheduler is temporarily out of service. We are looking into it.
1996-12-11 at 13:07 [xxx (strindberg)]
Problems resolved. Go back to using strindberg again.
1996-12-11 at 09:53 [xxx (strindberg)]
Problems logging in to strindberg for some people. Investigating. Please try to use syk-0604.pdc.kth.se temporarily.
1996-12-10 at 10:37
Kallsup up again.
1996-12-10 at 09:27
Kallsup is hung. Investigations under way.
1996-12-07 at 23:12
Kallsup has crashed but will hopefully be back up soon. This might have something to do with the latest upgrade.
1996-12-06 at 13:30
The Kallsup system is back in production.
1996-12-06 at 10:00
Disk failure on the Kallsup Cray computer. The system will be partially unavailable until the problems have been resolved. This also impacts the hsm system.
1996-12-03 at 16:00 [xxx (strindberg)]
Strindberg: Modified one T node to become a W node. Now got 88T, 10W and 2Z in batch pool.
1996-12-02 at 13:00
Log in node (syk-0606) rebooted.
1996-11-26 at 11:00
Log in node (syk-0606) rebooted.
1996-11-13 at 12:00
Info: Since we had restrictions on what jobs to run this weekend, some long jobs were left in line. We have started, and will start, a few of them out of queue order. I.e., don't be surprised if there are weekend jobs running.
1996-11-11 at 16:00 [xxx (strindberg)]
One of the new routers went bad. Strindberg ought to be back to normal within one hour.
1996-11-07 at 15:00 [xxx (strindberg)]
Gaussian/Strindberg, slight changes to the linda parts. Please let us know if you experience anything odd.
1996-11-04 at 12:00
The coming weekend, starting Saturday 1996-11-09 at 1000 and ending Monday 1996-11-11 at 0600, we will a) upgrade the AFS volume location servers and b) install new routers. Since this will seriously impact the production environment, we will perform special scheduling starting Friday 1996-11-08 at 0800. We aim to keep the machine busy, though not necessarily by starting jobs in the ordinary queue order.
1996-11-02 at 12:00 [xxx (strindberg)]
Problems accessing parts of strindberg:/pfs/.
1996-11-01 at 14:00
Network is back to normal, node allocation enabled again.
1996-11-01 at 13:14
Still transient network problems. EASY Scheduler Allocation turned off.
1996-11-01 at 12:25 [xxx (strindberg)]
Network Problems. strindberg.pdc.kth.se unavailable.
1996-10-27 at 23:15
Job manager restarted. Allocation enabled again.
1996-10-27 at 12:00
Status: We are waiting for running jobs (76 nodes busy) to complete before restarting the job-manager. In the worst case this will not happen earlier than 04:27 tomorrow morning, 1996-10-28.
1996-10-27 at 08:00
We will wait for most of the running jobs to complete before continuing EASY. This is to prevent `early death' of jobs. Also, since the job-manager is gone, you might have problems running interactive jobs. Summary: running jobs keep on running, but we can't start new ones without destroying the running jobs.
1996-10-27 at 01:22
Easy currently stopped due to switch problems.
1996-10-24 at 17:30 [xxx (strindberg)]
One interface on Strindberg, the log in one, went bad. It has been disabled. The name Strindberg has been moved to another interface. It might take some time until the new name alias has propagated globally.
1996-10-21 at 18:00 [xxx (strindberg)]
Our SP Strindberg now provides 100 nodes for batch jobs! Another two nodes are available around the clock for interactive use.
1996-10-18 at 12:00
A failing fileserver is causing problems with some functions, and some user data will be disabled for a short period of time.
1996-10-16 at 16:00
There is a new frame in the system. We have 110 nodes. All PSSP software upgraded to PTF set 18++. All AIX (unix of SP) software upgraded. Now also using AFS3.4a 4.39++. We will keep the new frame in interactive mode overnight.

Please report any errors you think are related to these changes.

Thank you for your patience.

1996-10-16 at 12:50
Upgrade of the SP system still in progress.
1996-10-16 at 09:00
Incorporation of the new frame into the SP system and installation of system software upgrades will require a reboot of the SP system.
1996-10-15 at 10:30
Scheduling let loose again.
1996-10-15 at 08:50
External file server problems. Scheduler temporarily stopped to avoid the risk of losing jobs.
1996-10-14 at 19:40
Kallsup and hsm: Back up. The crash was probably due to temporary communication problems between mainframe and IO-processor(s).
1996-10-14 at 16:30
Kallsup and hsm: probably scsi or disk problems. Kallsup and hsm unavailable during investigation.
1996-10-14 at 07:00 [xxx (strindberg)]
A stuck lockfile blocked node allocation on strindberg last night.
1996-10-09 at 18:45 [xxx (strindberg)]
System software upgrades will be installed on the main part of strindberg Wednesday 1996-10-16. The system will be unavailable for three hours starting 09.00.
1996-10-07 at 12:45 [xxx (strindberg)]
Strindberg batch-queue re-activated.
1996-10-07 at 11:00 [xxx (strindberg)]
Strindberg batch-queue on hold until new frame is brought into the machine.
1996-10-07 at 09:00 [xxx (strindberg)]
Strindberg log in nodes rebooted due to new hardware.
1996-10-02 at 17:40
Network: Parts of the KTHLAN (network) to be reconfigured from 2000 onwards. This might cause fluctuations in network performance, especially for users within KTHLAN.
1996-10-02 at 17:30
Log in: For about half an hour, users had problems authenticating themselves for AFS use.
1996-09-28 at 13:00
AFS: one file-server restarted.
1996-09-27 at 00:30 [xxx (strindberg)]
Strindberg: Operation back to normal. Batch recovered, user-data recovered.
1996-09-26 at 19:00 [xxx (strindberg)]
Strindberg: The disk containing batch-queue data and user home directories broke. We will restore from backup. This might cause loss of the batch-queue, today's accounting, and a few other things. Stay tuned.
1996-09-25 at 09:00 - 12:00
The CM200, mowitz and fredman (login.pdc.kth.se) will be down for maintenance.
1996-09-14 at 21:30 [xxx (strindberg)]
Log in node Strindberg (syk-0604) is available again.
1996-09-14 at 20:00 [xxx (strindberg)]
One fileserver went bad earlier today. This is now fixed. The log in node (syk-0604/strindberg) will be rebooted during the next hour. Please use syk-0101.pdc.kth.se until then.
1996-09-03 at 09:00
Machine hosting batch-system had problems last night causing delayed scheduling.
1996-09-02 at 16:00
We will replace power supplies on ten thin nodes Wednesday 1996-09-04, starting at 0900.
1996-08-28 at 12:00
Fujitsu FX "selma" down this afternoon. Upgrade to a 2-processor machine. There might be stops on short notice on Thursday due to OS tuning.
1996-08-27 at 13:00
Increased number of interactive nodes due to PDC Summer School in High Performance Computing.
1996-08-22 at 13:00
Slightly increased number of interactive nodes due to PDC Summer School in High Performance Computing.
1996-07-31 at 13:00
Network fluctuations will cause careful and slow return from interactive to batch.
1996-07-29 at 11:30
The fileserver is back.
1996-07-29 at 09:00
One fileserver is out of order. Some users might have problems accessing their files.
1996-07-25 at 10:00
SP job manager died. You may have experienced problems during the past 10 hours.
1996-07-18 at 15:00 [xxx (strindberg)]
Reboot of syk-0604/strindberg.
1996-07-10 at 19:20
SP Scheduler operational
1996-07-10 at 18:00
SP Scheduler problems. The scheduler fails to release nodes. Recovery in progress.
1996-07-09 at 14:00
If you are using the CM-200 Bellman, read about mowitz maintenance.
1996-07-03 at 19:45
Scheduler running on a reduced number of nodes.
1996-07-03 at 18:20
Switch problems. All nodes to be rebooted. Scheduler stopped and will be resumed later tonight.
1996-06-29 at 12:30
Scheduling resumed. Several jobs lost.
1996-06-29 at 09:00
Network problems, job scheduling paused.
1996-06-27 at 14:19 [xxx (strindberg)]
Check usage page for current status of Strindberg.
1996-06-17 at 14:00 [xxx (strindberg)]
More than half of Strindberg running. Some nodes are still unavailable due to software problems. Check usage page for current status.

When reporting software problems, please include the node on which the problem occurred. Thank you.

1996-06-15 at 00:11
We believe we have a reasonably sound machine now. EASY started. Please report any unusual problems.
1996-06-14 at 22:13
We are still at it. Have just received software efixes from the US and at least on one frame it seems to help a lot.
1996-06-14 at 18:18 [xxx (strindberg)]
Software upgrade still in progress... Strindberg unavailable. The prognosis is less educated this time. Give us 2 more hours.
1996-06-14 at 15:15 [xxx (strindberg)]
Software upgrade in progress... Strindberg still unavailable. An educated guess is that it should come up around 18.00 hours today, just in time for the weekend jobs.
1996-06-14 at 10:15 [xxx (strindberg)]
Software upgrade needs tuning. Strindberg still unavailable. An optimistic guess is that it should come up around 15.00 hours today.
1996-06-12 at 17:45
Info about upgrade to be found in news section.
1996-06-04 at 22:00
Scheduler has moved to a new location. The move caused a delay of roughly an hour, pushing a number of night jobs into next night slot.
1996-06-04 at 10:53
Scheduler started. However, we still have some problems, and strange messages could appear when jobs are released. Let us know if so, but do not worry.
1996-06-04 at 10:26
Problems in AIX 4.1.4 partition. Scheduler temporarily stopped.
1996-05-31 at 14:08 [xxx (strindberg)]
Monday June 10 we will start upgrading the AIX 3.2.5 partition of Strindberg to AIX 4.1.4 and PSSP 2.1. Initially this only affects the AIX 3.2.5 part of the machine, but later in week 24 the whole machine will be UNAVAILABLE. We expect to be back in normal production by 1200 hours June 14.
1996-05-30 at 22:22
Problem fixed.
1996-05-30 at 21:51
Obviously something is wrong with the machine. Investigation under way.
1996-05-28 at 19:30 [xxx (strindberg)]
Both partitions of Strindberg up and running after all day testing of computer room facilities.
1996-05-28 at 17:30 [xxx (strindberg)]
You can submit jobs to the larger partition of Strindberg (log in syk-0101/syk-0604.)
1996-05-28 at 16:00
We are about to start again.
1996-05-28 at 09:30
All pdc machines down. Servers following soon.
1996-05-28 at 09:00
The process of taking down all machines is starting.
1996-05-21 at 11:00 [xxx (strindberg)]
Network problems limiting availability of the old parts of the machine (syk-0201.pdc.kth.se/strindberg.pdc.kth.se.)
1996-05-20 at 19:25
Login problems resolved.
1996-05-20 at 17:50
Login nodes in AIX 4.1 partition (syk-0101 and syk-0604) are at present not available for unknown reasons. We are working on it...
1996-05-15 at 17:12
On Tuesday May 28, at 0900 we will start doing functional tests of various installations in the computer room. Since this has the potential to be disruptive, all hardware will be UNAVAILABLE for users until 96-05-28 at 1800. This work is necessary in order to finally fully test the new computer room facilities that are supposed to help us operate without interruptions;)
1996-05-14 at 15:32
Starting on Tuesday May 21, we will start to install new network equipment. Depending on how well the installations go, it may or may not affect the connectivity to PDC machines between May 21 and May 24.
1996-05-10 at 16:15
We will reboot the log in nodes syk-0101 and syk-0604 sometime between 1715 and 1800.
1996-05-06 at 17:40 [xxx (strindberg)]
Both partitions of Strindberg are up and running. There remain problems with one file-server. Some users might experience problems.
1996-05-06 at 17:00
The AIX 3.2 partition is up and running.
1996-05-06 at 16:00
The AIX 4.1 partition is up and running.
1996-05-06 at 13:00
We are in the process of restarting systems.
1996-05-02 at 15:09
On Monday May 6, starting at 09.00, all machines will be unavailable due to electrical installation work in the computer room.
1996-04-28 at 14:30
To log in on the aix 4.1.4 partition please use the name syk-0604.pdc.kth.se.
1996-04-18 at 18:45
Some queued jobs in the 3.2.5 partition were lost due to a failing job-manager.
1996-04-16 at 17:00
We are stressing the batch-system, which causes nodes to appear to be down.
1996-04-12 at 17:30
Control workstation restarted - batch-lines enabled.
1996-04-12 at 16:30
Draining batch-lines for restart of control-work-station.
1996-04-03 at 13:00 [xxx (strindberg)]
You might like to see the page with Strindberg current usage. Please note it's experimental.
1996-03-30 at 15:30
All systems running.
1996-03-30 at 10:00
We are about to start things again. Apologies for taking even the web page down.
1996-03-29 at 08:00
Systems down. Testing of computer-room environmental equipment to start.
1996-03-28 at 15:30
During Monday, April 1st, the aix 414 partition will be opened up for any pdc user. Further info to be found in the news page. Users in Sweden: Please note that we change to daylight savings time this weekend. The batch-line limits, using GMT, will not.
1996-03-22 at 15:32 [xxx (strindberg)]
Next Friday, 96-03-29, at 0800 we will start doing functional tests of various installations in the computer room that houses the SP2 Strindberg. Since this has the potential to be disruptive to the SP2, it will be UNAVAILABLE for users until 96-03-30 at 1700. We apologize for the down time, but it is necessary in order to fully test the new computer room facilities that are supposed to help us operate without interruptions.
1996-03-01 at 11:00
Non-regular file-server activities may have caused problems for some users trying to reach their data during the past night.
1996-03-01 at 03:00
Node syk-0112 down (hardware failure). This impacts currently running jobs and piofs (Parallel I/O File System). Expect at least 36 hours outage since there are no available spare parts in Sweden.
1996-02-27 at 16:20
Power cycle test finished.
1996-02-27 at 15:30
We will perform a power cycle test now. All AFS servers will be halted as a safety precaution.
1996-02-27 at 12:30
SP systems back.
1996-02-27 at 10:00
All systems unavailable due to power failure.
1996-02-26 at 10:00
We will start using UPS on all production machines tomorrow, 1996-02-27 at 0800. Batch lines will be held during installation though it is supposed to be invisible.
1996-02-25 at 21:00
Strindberg is now available again, with vital system data back on SCSI disks.
1996-02-25 at 16:00
We will remove SSA from production-related machines. System will be unavailable for some time.
1996-02-25 at 15:30
Control workstation hardware problems. SSA driver/adapters cause the machine to crash. This also caused the fddi-ring of the control workstation to go down(!)
1996-02-25 at 13:40
Struve down and out. Fault analysis started.
1996-02-24 at 21:00
Pending disk crash on a major node in the new partition. Disk replacement and data recovery in progress
1996-02-21 at 17:56
EASY scheduler ON. Jobs running MPI or MPL might have crashed due to absent job-manager since CWS reboot.
1996-02-21 at 17:50
Ripples of the previous CWS crash. EASY scheduler is temporarely stopped.
1996-02-21 at 16:20
SSA adapter card changed on control workstation.
1996-02-21 at 14:20
Jobs held until fault analysis is finished. It seems that making a backup of the control workstation's SSA disks causes it to crash.
1996-02-21 at 13:40
Control Workstation crash. Jobs may be lost.
1996-02-19 at 16:04 [xxx (strindberg)]
Hardware upgrade of SP2 Strindberg is done. It now contains 96 PEs, processing elements, of which 32 run AIX 3.2.5. Those 32 nodes will be opened for public access at 1700 hours, February 19th. Please use any of the names {syk-0201,struve,strindberg}.pdc.kth.se when you log in!

What used to be Thin Nodes (T1) have been upgraded to Thin Nodes 2 (T2). The T2 nodes have twice the bandwidth to memory compared to T1 nodes. We will announce when the remaining 64 nodes, running AIX 4.1.4 and PSSP 2.1, will be open for public use.

1996-02-14 at 21:08
CM Bellman and FEs fredman and mowitz back up.
1996-02-14 at 15:06
Now most of PDC's file servers are up. SP-2 hardware has been upgraded. Continuing with the software upgrade as planned. CM Bellman is also up, except that there is no timesharing running yet.
1996-02-11 at 23:17
Most of PDC's machines have now been powered off. The remaining ones will come down just before 04.00 on Monday, February 12. At 04.00 we turn off power and expect to be back again on Wednesday. Stay tuned on our web page to see how things progress.
1996-02-02 at 23:42 [xxx (strindberg)]
Upgrade of SP-2 Strindberg will start on February 9 at 17.00 hours and is expected to be done by Monday, February 19 at 17.00 hours. During this time the SP-2 will not be available.

The upgrade includes more and faster nodes, a switch cabinet, larger and faster disks for all nodes, as well as installing a UPS in the PDC dungeon to be less sensitive to power outages. To top it all, when all the hardware has been upgraded we will also install AIX 4.1.4 and PSSP 2.1 on 64 of the nodes while leaving the old software on 32 thin nodes, which will have changed from T1 to T2 CPUs. For more information see.

Installing the UPS means shutting off the power, and consequently all user files will be unavailable during February 12 and 13. WWW will still be working though.

As our work progresses we will add information to this page and in the relevant cases more detailed information in.

1996-02-02 at 21:29
Frame 4 (wide nodes) available again.
1996-02-01 at 18:15
Frame 4 (wide nodes) is not available due to communication problems. The problems will not be solved until tomorrow 1996-02-02.
1996-01-19 at 18:00
During the day a large disk of a file-server started to have problems. We are moving large amounts of data to new disks. Usually this is invisible to users, but not today. You might experience slow or, as a worst case, lacking access to files.
1996-01-19 at 13:00
All of the machine will be unavailable Wednesday 960124 starting at 08.00 and ending at 17.00 at the very latest.
1996-01-09 at 20:09
Wednesday 960117, 08.00--13.00: a number of AFS servers will be moved. This means that home directories will be unavailable and consequently no SP jobs will be run during this time.
1996-01-09 at 13:00
All wide nodes, syk-49..63, will be unavailable between 0800 and 1200 tomorrow, 1996-01-10. This is due to network service.
1996-01-03 at 13:57
AFS on struve reached an acute stage of disease - struve.pdc.kth.se to be rebooted.
1996-01-02 at 15:00
About to reboot all nodes. Will take a couple of hours.