CLAIX-2023 underwent an emergency shutdown to prevent damage to the hardware. Due to severe power issues, the cooling facilities failed and could not provide sufficient heat dissipation. The cluster will be operational again once the underlying issues have been resolved.
All MPI and GPU nodes are liquid-cooled. Both CDUs (Cooling Distribution Units) failed, so heat dissipation was no longer possible.
Both CDUs are active again and cooling has been restored. The cluster will be booted again for damage analysis only. The batch service remains suspended until all issues are resolved and all power security checks have passed.
The cooling system is now fully operational again. Additionally, we have implemented further measures to enhance stability in the future. The queues were reopened last night; however, we are currently conducting a detailed investigation into some specific nodes regarding their cooling and performance. Once these investigations are complete, the affected nodes will be made available through the batch system again.