Interruption of HPC service (solved)
Owing to problems with the file systems /home/hpc and /home/vault, there is an interruption of all HPC services.
Typical symptoms are hanging batch jobs, hanging open shells or hanging ssh connection attempts (even if you do not have data on the two file systems).
The outage started on Tuesday, June 15, at 18:45; the two file systems have been unavailable since 19:45.
The HPC Group of RRZE and IBM are working on the issue.
At the moment, we cannot estimate when IBM will have finally fixed the problems.
UPDATE June 16, 15:00: the file systems seem to be temporarily back, but there are still some problems; batch processing has not been resumed yet. Please be patient.
UPDATE June 16, 17:00: batch processing was resumed on Woody.
UPDATE June 16, 17:45: batch processing was resumed on TinyBlue, TinyGPU and snode3xx (Townsend).
UPDATE June 16, 18:00: regular HPC service has been restored on most systems, with the following exceptions: the remaining nodes of the Transtec cluster (snode1xx and snode2xx) will follow in the next few days, as there are still long-running jobs which prevent an automated reboot. The nodes in the Testcluster will also follow with low priority.
UPDATE June 17, 9:00: all snode2xx are available again; snode141-snode164 are available again; snode101-snode140 will follow in the next 8 hours.
Please resubmit any job which was running during the failure period (i.e. between June 15, 18:00 and June 16, 15:00) and shows strange behavior.
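To decide whether a job was running during the failure period, it may help to check whether its run time overlaps the window given above. The following is only a minimal sketch in Python, not an official RRZE tool: the function names are hypothetical, the year is not stated in this notice and therefore has to be passed in, and the job start and end times are assumed to have been looked up by you beforehand (for example from your job output files).

```python
from datetime import datetime


def failure_window(year):
    """Return (start, end) of the failure period named in the notice.

    The notice gives no year, so `year` is an assumption supplied by the caller.
    """
    return datetime(year, 6, 15, 18, 0), datetime(year, 6, 16, 15, 0)


def ran_during_failure(job_start, job_end, year):
    """Return True if the job's run time overlaps the failure period."""
    win_start, win_end = failure_window(year)
    # Two closed intervals overlap if each one starts before the other ends.
    return job_start <= win_end and job_end >= win_start
```

A job for which `ran_during_failure` returns `True` is only a candidate for resubmission; whether it actually needs to be resubmitted still depends on whether it shows strange behavior.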
For any remaining issues, please send an email to hpc-support@rrze.uni-erlangen.de.