The users' HOME directories were temporarily unavailable on both the login and the batch nodes. As a consequence, existing sessions trying to access these directories would hang, and new connections to the cluster could not be established. Batch jobs that started or were running during this time frame may have failed as a result. If your job crashed, please check its output; if you find error messages related to files or directories underneath /home, please resubmit the affected jobs. Some of the batch nodes were automatically disabled as a result and have just been put back into operation. We apologize for the inconvenience this has caused.
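To spot jobs that were affected, a quick scan of the job output files for /home-related error messages can help. The following is a minimal sketch, assuming the default slurm-<jobid>.out output file naming and that failed accesses appear as standard I/O error messages mentioning a path under /home; adapt the patterns and paths to your own job scripts before resubmitting with sbatch.

    from pathlib import Path
    import re

    # Error phrases that typically indicate a failed access to the home filesystem
    # (assumed patterns; extend as needed for your applications).
    ERROR_PATTERN = re.compile(
        r"/home/\S+.*(No such file or directory|Input/output error|"
        r"Permission denied|Stale file handle)",
        re.IGNORECASE,
    )

    def affected_jobs(directory: str = ".") -> list[str]:
        """Return job IDs whose slurm-<jobid>.out files mention a /home error."""
        hits = []
        for out_file in sorted(Path(directory).glob("slurm-*.out")):
            text = out_file.read_text(errors="replace")
            if ERROR_PATTERN.search(text):
                hits.append(out_file.stem.removeprefix("slurm-"))  # slurm-<jobid>.out -> <jobid>
        return hits

    if __name__ == "__main__":
        for job_id in affected_jobs():
            print(f"Job {job_id} hit a /home error; resubmit its batch script with sbatch.")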
The underlying problem has turned out to be more persistent than expected and we are consulting with the vendor for a fix. As a result, the majority of batch nodes remain out of operation until we can guarantee stable access to the filesystem. We are working to remedy the situation as soon as possible.
The problem has been solved and the cluster is back in operation.
The issue has reoccurred and currently persists. We are working on a solution.
The issues have been resolved.
Our GPFS global filesystem needs to be updated; as a result, the entire CLAIX HPC system will be unavailable during the maintenance. Please note the following:
- User access to the HPC system through the login nodes, the HPC JupyterHub, or any other connection will not be possible during the maintenance.
- No Slurm jobs or filesystem-dependent tasks will be able to run during the maintenance.
- Before the maintenance, Slurm will only start jobs that are guaranteed to finish before the maintenance begins (illustrated in the sketch after this list); any running jobs must finish by then or may be terminated.
- Nodes may therefore remain idle in the run-up to the maintenance, as Slurm drains them of user jobs.
- Waiting times before and after the maintenance may be longer than usual, as nodes are drained beforehand and the queue of waiting jobs grows afterwards.
- Files in your personal and project directories will not be available during the maintenance.
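The pre-maintenance scheduling works purely on the requested walltime: a job can only start if its full time limit fits into the window remaining before the maintenance, so reducing a job's --time request may allow it to run beforehand. Below is a minimal illustrative sketch of that check; the maintenance start time used here is a placeholder, not the official date.

    from datetime import datetime, timedelta

    # Placeholder maintenance start time for illustration only; use the date
    # given in the official announcement.
    MAINTENANCE_START = datetime(2025, 1, 15, 8, 0)

    def fits_before_maintenance(requested_walltime: timedelta,
                                now: datetime | None = None) -> bool:
        """A job can only start if its *requested* walltime (not its expected
        runtime) ends before the maintenance begins."""
        now = now or datetime.now()
        return now + requested_walltime <= MAINTENANCE_START

    if __name__ == "__main__":
        # A 4-hour job submitted 6 hours before the maintenance can still start;
        # a 12-hour job stays pending until after the maintenance.
        six_hours_before = MAINTENANCE_START - timedelta(hours=6)
        print(fits_before_maintenance(timedelta(hours=4), now=six_hours_before))   # True
        print(fits_before_maintenance(timedelta(hours=12), now=six_hours_before))  # False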
Unfortunately, the maintenance work has to be extended. We hope to be done as soon as possible. We apologize for the inconvenience.
Unfortunately, we must postpone returning the HPC system to normal operation until Wednesday. We apologize for the delay.
As part of the maintenance, a pending system update addressing security issues is being applied as well. Due to the large number of nodes, however, this update still requires some time. The cluster will be made available as soon as possible; unfortunately, we cannot give an exact estimate of when the updates will be finished.
All updates should be completed later this evening. We aim to have the cluster available tomorrow by 10:00 a.m.: the frontend nodes should become available earlier, followed by the batch service, which is expected to resume by 11:00 a.m. We apologize once again for the unforeseen inconvenience.
The updates are not yet completed and require additional time; we expect to be finished this afternoon. The frontends are already available again.
The global maintenance tasks have been completed, and we are now putting the cluster back into operation. However, several nodes will temporarily remain in maintenance due to issues that have not yet been resolved.
Operation of most of the nodes has been restored. The remaining few nodes will follow shortly.
The CLAIX-2023 copy nodes copy23-1 and copy23-2 will be reinstalled with Rocky Linux 9. During the reinstallation, the nodes will not be available.