The users' HOME directories were temporarily unavailable on both the login and the batch nodes. As a consequence, existing sessions trying to access these directories would hang, and new connections to the cluster could not be established. Batch jobs that started or were running during this time frame may have failed as a result. If your job crashed, please check its output; if you find error messages related to files or directories underneath /home, please resubmit the affected jobs. Some of the batch nodes were automatically disabled as a result and have just been put back into operation. We apologize for the inconvenience this has caused.
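To spot jobs that were affected, a quick scan of the job output files for /home-related error messages can help. The following is a minimal sketch, assuming the default slurm-<jobid>.out output file naming and that failed accesses appear as standard I/O error messages mentioning a path under /home; adapt the patterns and paths to your own job scripts before resubmitting with sbatch.

    from pathlib import Path
    import re

    # Error phrases that typically indicate a failed access to the home filesystem
    # (assumed patterns; extend as needed for your applications).
    ERROR_PATTERN = re.compile(
        r"/home/\S+.*(No such file or directory|Input/output error|"
        r"Permission denied|Stale file handle)",
        re.IGNORECASE,
    )

    def affected_jobs(directory: str = ".") -> list[str]:
        """Return job IDs whose slurm-<jobid>.out files mention a /home error."""
        hits = []
        for out_file in sorted(Path(directory).glob("slurm-*.out")):
            text = out_file.read_text(errors="replace")
            if ERROR_PATTERN.search(text):
                hits.append(out_file.stem.removeprefix("slurm-"))  # slurm-<jobid>.out -> <jobid>
        return hits

    if __name__ == "__main__":
        for job_id in affected_jobs():
            print(f"Job {job_id} hit a /home error; resubmit its batch script with sbatch.")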
The underlying problem has turned out to be more persistent than expected and we are consulting with the vendor for a fix. As a result, the majority of batch nodes remain out of operation until we can guarantee stable access to the filesystem. We are working to remedy the situation as soon as possible.
The problem has been solved and the cluster is back in operation.
The issue has reoccurred and currently persists. We are working on a solution.
The issues have been resolved.
Our GPFS global filesystem needs to be updated; as a result, the entire CLAIX HPC system will be unavailable during the maintenance. Please note the following:
- User access to the HPC system through the login nodes, the HPC JupyterHub, or any other connection will not be possible during the maintenance.
- No Slurm jobs or filesystem-dependent tasks will be able to run during the maintenance.
- Before the maintenance, Slurm will only start jobs that are guaranteed to finish before the maintenance begins (illustrated in the sketch after this list); any running jobs must finish by then or may be terminated.
- Nodes may therefore remain idle in the run-up to the maintenance, as Slurm drains them of user jobs.
- Waiting times before and after the maintenance may be longer than usual, as nodes are drained beforehand and the queue of waiting jobs grows afterwards.
- Files in your personal and project directories will not be available during the maintenance.
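The pre-maintenance scheduling works purely on the requested walltime: a job can only start if its full time limit fits into the window remaining before the maintenance, so reducing a job's --time request may allow it to run beforehand. Below is a minimal illustrative sketch of that check; the maintenance start time used here is a placeholder, not the official date.

    from datetime import datetime, timedelta

    # Placeholder maintenance start time for illustration only; use the date
    # given in the official announcement.
    MAINTENANCE_START = datetime(2025, 1, 15, 8, 0)

    def fits_before_maintenance(requested_walltime: timedelta,
                                now: datetime | None = None) -> bool:
        """A job can only start if its *requested* walltime (not its expected
        runtime) ends before the maintenance begins."""
        now = now or datetime.now()
        return now + requested_walltime <= MAINTENANCE_START

    if __name__ == "__main__":
        # A 4-hour job submitted 6 hours before the maintenance can still start;
        # a 12-hour job stays pending until after the maintenance.
        six_hours_before = MAINTENANCE_START - timedelta(hours=6)
        print(fits_before_maintenance(timedelta(hours=4), now=six_hours_before))   # True
        print(fits_before_maintenance(timedelta(hours=12), now=six_hours_before))  # False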
Unfortunately, the maintenance work has to be extended. We hope to be done as soon as possible. We apologize for the inconvenience.
Unfortunately, we must postpone returning the HPC system to normal operation until Wednesday. We apologize for the delay.
As part of the maintenance, a pending system update addressing security issues is being applied as well. Due to the large number of nodes, however, this update still requires some time. The cluster will be made available as soon as possible; unfortunately, we cannot give an exact estimate of when the updates will be finished.
All updates should be completed later this evening. We aim to have the cluster available tomorrow by 10:00 a.m.: the frontend nodes should become available earlier, followed by the batch service, which is expected to resume by 11:00 a.m. We apologize once again for the unforeseen inconvenience.
The updates are not yet completed and require additional time; we expect to be finished this afternoon. The frontends are already available again.
The global maintenance tasks have been completed, and we are now putting the cluster back into operation. However, several nodes will temporarily remain in maintenance due to issues that have not yet been resolved.
Operation of most of the nodes has been restored. The remaining few nodes will follow shortly.
The CLAIX-2023 copy nodes copy23-1 and copy23-2 will be reinstalled with Rocky Linux 9. During the reinstallation, the nodes will not be available.