
Compute Cluster - Emergency Shutdown of CLAIX-2023 Due to Failed Cooling

Monday 02.12.2024 15:15 - Wednesday 04.12.2024 06:00

CLAIX-2023 was shut down in an emergency to prevent damage to the hardware. Due to severe power issues, the cooling facilities failed and could not provide sufficient heat dissipation. The cluster will be operational again once the underlying issues have been resolved.

Technical Explanation

All MPI and GPU nodes are liquid-cooled. Both CDUs (Cooling Distribution Units) failed, so heat could no longer be dissipated.

Mon 02.12.2024 15:20

Updates

Both CDUs are active again and cooling has been restored. For now, the cluster will be booted for damage analysis only. The batch service remains suspended until all issues are resolved and all power security checks have passed.

Tue 03.12.2024 10:34

The cooling system is now fully operational again. Additionally, we have implemented further measures to improve stability in the future. The queues were reopened last night; however, we are still investigating specific nodes in detail with regard to their cooling and performance. Once these investigations are complete, the affected nodes will be made available through the batch system again.

Wed 04.12.2024 10:41