Compute Cluster

More information about this service is available in our documentation portal.

Batch System Controller Unavailable

Partial outage
Wednesday, 10.12.2025 16:30 - Wednesday, 10.12.2025 18:40

We are experiencing issues with the Slurm batch controller, which cause all Slurm commands (sbatch, squeue, etc.) to time out. Our team is investigating the issue and working on a quick solution. Until then, job submission is not possible. Running jobs may continue as usual but can exit with failures if their connection to the controller fails. Results might still be valid.

10.12.2025 17:00
Updates
We have isolated the problem and are still working on a solution. For the time being, job submission is disabled to keep the controller in a stable state for all running jobs.
10.12.2025 17:45
The batch system is fully operational again. Please review all of your jobs that ran during the time frame above and resubmit them in case of failures. We apologize for the inconvenience.
10.12.2025 19:16
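
To review jobs that ran during the outage window above, Slurm's accounting tool `sacct` can be queried for that time range. The following is only a minimal sketch, not an official procedure: the helper names, the list of states treated as suspicious, and the assumption that accounting data is available for the window are illustrative choices. By default `sacct` reports only your own jobs.

```python
import subprocess

# Outage window taken from the notice above (cluster local time).
OUTAGE_START = "2025-12-10T16:30"
OUTAGE_END = "2025-12-10T18:40"

# Assumption: states worth a manual review; adjust to your workflow.
SUSPECT_STATES = ("FAILED", "NODE_FAIL", "TIMEOUT", "CANCELLED")


def build_sacct_command(start, end):
    """Return an sacct argument list for jobs active between start and end."""
    return [
        "sacct",
        "--starttime", start,
        "--endtime", end,
        "--allocations",   # one line per job, without job steps
        "--parsable2",     # pipe-separated fields, no trailing delimiter
        "--noheader",
        "--format", "JobID,JobName,State,ExitCode",
    ]


def parse_suspect_jobs(sacct_output):
    """Return the jobs whose state suggests they should be reviewed."""
    suspects = []
    for line in sacct_output.splitlines():
        if not line.strip():
            continue
        # Job names may contain "|", so split the ID from the left
        # and the state and exit code from the right.
        job_id, rest = line.split("|", 1)
        name, state, exit_code = rest.rsplit("|", 2)
        # sacct can report e.g. "CANCELLED by 1234"; compare the first word.
        first_word = state.split()[0] if state.strip() else ""
        if first_word in SUSPECT_STATES:
            suspects.append(
                {"job_id": job_id, "name": name,
                 "state": state, "exit_code": exit_code}
            )
    return suspects


def main():
    """Query sacct for the outage window and print jobs to review."""
    result = subprocess.run(
        build_sacct_command(OUTAGE_START, OUTAGE_END),
        capture_output=True, text=True, check=True,
    )
    for job in parse_suspect_jobs(result.stdout):
        print(job["job_id"], job["name"], job["state"], job["exit_code"])
```

Calling `main()` on a login node lists the candidate jobs. A job reported as completed may still have been affected, so checking its output files remains advisable.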

Maintenance of login23-g-1 due to GPU errors

Partial maintenance
Wednesday, 10.12.2025 08:00 - Wednesday, 10.12.2025 15:25

Due to repeated issues with the GPUs on the GPU dialog node login23-g-1, the node will undergo short-notice maintenance on 2025-12-10 and cannot be used until further notice.
A technician will perform an on-site hardware diagnosis and component modifications which require the node to be shut down.

Please consider using an interactive GPU job or JupyterHub with the interactive node n23i0001 for short GPU computations.

01.12.2025 09:50
Updates
The maintenance tasks are finished for today.
10.12.2025 15:30

Slurm downtime

Partial maintenance
Monday, 01.12.2025 07:00 - Monday, 01.12.2025 09:09

On Monday morning we have to restart our Slurm database. Slurm commands will be unavailable during the planned downtime.
Job submission will not be possible.

28.11.2025 13:12
Updates
The work is complete. Any jobs that failed to submit should be resubmitted.
01.12.2025 09:10

Slurm emergency downtime

Partial maintenance
Friday, 28.11.2025 09:00 - Friday, 28.11.2025 12:07

Due to unforeseen circumstances we are forced to fully stop and restart the Slurm batch system infrastructure.
During this short downtime, submitting new jobs will not be possible, Slurm commands will be unavailable, and other Slurm-related tasks cannot be performed.
Already running jobs should be able to continue and finish without issue.

28.11.2025 09:41
Updates
The updates completed without problems on the server side. Please report any issues you encounter AFTER the downtime via a ticket.
28.11.2025 12:07