We are migrating the Slurm controller to a new host. Short timeouts may occur; we will try to minimize them as much as possible.
The first attempt was not successful; we are back on the old master. We are analyzing the problems that occurred and will try again later.
We are making another attempt.
There may currently be login problems with various login nodes. We are working on a solution.
The observed issues affect the batch service as well. Consequently many batch jobs may have failed.
The observed problems can be traced back to power issues. We cannot rule out that further systems may temporarily have to be shut down in a controlled manner as part of the problem resolution. However, we hope the issues can be resolved without any additional measures.
The cluster can be accessed again; see ticket https://maintenance.itc.rwth-aachen.de/ticket/status/messages/14/show_ticket/9521 for further details. Several nodes, however, are still unavailable due to the consequences of the aforementioned issues. We are currently working on resolving them.
Due to InfiniBand problems that are still being analyzed, many nodes, including the whole GPU cluster, are not reachable at the moment. We are working together with the manufacturer to solve the problems.
The problem has been fixed.
Zurzeit ist der Single Sign-On und die Multifaktorauthentifizierung sporadisch gestört. Wir arbeiten bereits an einer Lösung und bitten um Geduld. ---english--- At the moment, single sign-on and multi-factor authentication are sporadically disrupted. We are already working on a solution and ask for your patience.
Dear users, on Sunday, 8.12.2024, at 06:53 AM the home filesystems of most nodes went offline. This may have crashed some jobs, and no new jobs can start during the downtime. We are actively working on the issue.
Most nodes are coming back online. Apologies for the trouble. We expect most nodes to be usable by 18:00.
Due to a problem in our network, some nodes lost their connection to the $HOME and $WORK file system. This included the login23-1 and login23-2 nodes. The issue has been resolved now.
CLAIX-2023 was shut down as an emergency measure to prevent damage to the hardware. Due to severe power issues, the cooling facilities failed and could not provide sufficient heat dissipation. The cluster will be operational again once the underlying issues have been resolved.
All MPI and GPU nodes are liquid cooled. Both CDUs (Cooling Distribution Units) failed, so heat dissipation was no longer possible.
Both CDUs are active again and cooling has been restored. The cluster will be booted again for damage analysis only. The batch service remains suspended until all issues are resolved and all power safety checks have passed.
The cooling system is now fully operational again. Additionally, we have implemented further measures to enhance stability in the future. The queues were reopened last night; however, we are currently conducting a detailed investigation into some specific nodes regarding their cooling and performance. Once these investigations are complete, the affected nodes will be made available through the batch system again.
We are putting a new file server for the home and work directories into operation. For this purpose, we will carry out a system maintenance in order to complete the final synchronisation of all data over the weekend.
The maintenance needs to be extended.
Due to some issues preventing a normal batch service, the maintenance had to be extended.
Due to ongoing external issues in the RWTH Aachen network, access to and usability of the Compute Cluster are limited at the moment. The responsible network department is currently working on a solution. -- Wegen anhaltender externer Störungen im RWTH-Netzwerk ist der Cluster nur eingeschränkt erreichbar und funktionsfähig. Die zuständige Netzwerkfachabteilung arbeitet bereits an einer Lösung der Probleme.
The issues have not yet been resolved and may persist throughout tomorrow as well.
The issues have been resolved.
At 17:00, there was a brief interruption of the power supply in the Aachen area. Power is available again; however, most of the compute nodes went down as a result. It is currently unclear when the service can be resumed. At the moment, critical services are receiving special attention and are being restored where required. Um 17:00 Uhr hat es einen kurzzeitigen Stromausfall im Raum Aachen gegeben. Die Stromversorgung besteht wieder, jedoch ist die Mehrzahl der Compute-Knoten infolgedessen ausgefallen. Es ist unklar, wann der Betrieb wieder aufgenommen werden kann. Es wird momentan daran gearbeitet, kritische Dienste zu sichern und wiederherzustellen.
After restoring critical operational infrastructure services, the HPC service has been resumed. However, a large portion of the GPU nodes is unavailable due to the impact of the blackout. We are working on resolving the problems but cannot yet predict when and whether these nodes will be available again. Nachdem die kritische Infrastruktur zum Betrieb der Systeme wiederhergestellt werden konnte, wurde der HPC-Cluster wieder bereitgestellt und freigegeben. Allerdings sind durch die Auswirkungen des Stromausfalls eine größere Zahl GPU-Knoten nicht mehr verfügbar. Wir arbeiten an der Behebung der Probleme, können allerdings noch keine Prognose geben, wann und ob die Systeme wieder verfügbar sein werden.
Der Großteil der ML Systeme (GPUs) konnten heute wieder hochgefahren und in den Batchbetrieb übergeben werden. The majority of the ML systems (GPUs) were restarted today and are back in batch operation.
Our Slurm workload manager crashed for an unknown reason. Functionality was restored quickly. Further investigations are ongoing.
Currently, a GPU of the GPU login node login23-g-1 shows an issue. The node is unavailable until the issue is resolved.
The issues have been resolved.
It is currently not possible to log in to the login23-* frontends. There is a problem with two-factor authentication.
We are performing a new Top500 run on the ML partition of CLAIX23. The GPU nodes will not be available during that run. Other nodes and login23-g-1 might also be unavailable: i23m[0027-0030],r23m[0023-0026,0095-0098,0171-0174],n23i0001
Aufgrund von dem abgelaufenen Zertifikat für idm.rwth-aachen.de können keine IdM-Anwendungen und die Anwendungen, die über RWTH Single Sign-On angebunden sind, aufgerufen werden. - Beim Aufrufen von IdM-Anwendungen wird eine Meldung zur unsicheren Verbindung angezeigt. - Beim Aufrufen von Anwendungen mit dem Zugang über RWTH Single Sign-On wird eine Meldung zu fehlenden Berechtigungen angezeigt. Wir arbeiten mit Hochdruck an der Lösung des Problems. --- English --- Due to the expired certificate for idm.rwth-aachen.de, no IdM applications and the applications that use RWTH Single Sign-On can be accessed. We are working on a solution. - An insecure connection message is displayed when calling up IdM applications. - When calling up applications with access via RWTH Single Sign-On, a message about missing authorisations is displayed.
Das Zertifikat wurde aktualisiert und die Anwendungen können wieder aufgerufen werden. Bitte löschen Sie den Browsercache, bevor Sie die Seiten wieder aufrufen. /// The certificate has been updated and the applications can be accessed again. Please delete the browser cache before accessing the pages again.
Unfortunately, during the stated period there was a disruption of the RegApp, so that it was not possible to log in to the cluster frontends. Already established connections were not affected. The problem has been resolved.
copy23-2 data transfer system will be unavailable for maintenance.
The maintenance is completed.
The firmware of the InfiniBand Gateways will be updated. The firmware update will be performed in background and should not cause any interruption of service.
The updates are completed.
Aufgrund einer Störung des DNS liefern die Nameserver verschiedener Provider aktuell keine IP-Adresse für Hosts unter *.rwth-aachen.de zurück. Als Workaround können Sie alternative DNS-Server in Ihren Verbindungseinstellungen hinterlegen, wie z.B. die Level3-Nameserver (4.2.2.2 und 4.2.2.1) oder von Comodo (8.26.56.26 und 8.20.247.20). Ggf ist es auch möglich den VPN-Server der RWTH zu erreichen, dann nutzen Sie bitte VPN. // Due to a DNS disruption, the name servers of various providers are currently not returning an IP address for hosts under *.rwth-aachen.de. As a workaround, you can configure alternative DNS servers in your connection settings, e.g. the Level3 name servers (4.2.2.2 and 4.2.2.1) or Comodo (8.26.56.26 and 8.20.247.20). It may also be possible to reach the RWTH VPN server; in that case, please use VPN.
Anleitungen zur Konfiguration eines alternativen DNS-Server unter Windows finden Sie über die folgenden Links: https://www.ionos.de/digitalguide/server/konfiguration/windows-11-dns-aendern/ https://www.netzwelt.de/galerie/25894-dns-einstellungen-windows-10-11-aendern.html Als Alternative können Sie auch VPN nutzen. Wenn Sie den VPN-Server nicht erreichen, können Sie nach der folgenden Anleitung die Host-Datei unter Windows anpassen. Dadurch kann der Server vpn.rwth-aachen.de erreicht werden. Dazu muss der folgenden Eintrag hinzugefügt werden: 134.130.5.231 vpn.rwth-aachen.de https://www.windows-faq.de/2022/10/04/windows-11-hosts-datei-bearbeiten/ // Instructions for configuring an alternative DNS server under Windows can be found via the following links: https://www.ionos.de/digitalguide/server/konfiguration/windows-11-dns-aendern/ https://www.netzwelt.de/galerie/25894-dns-einstellungen-windows-10-11-aendern.html You can also use VPN as an alternative. If you cannot reach the VPN server, you can adjust the host file under Windows according to the following instructions. This will allow you to reach the server vpn.rwth-aachen.de. To do this, the following entry must be added: 134.130.5.231 vpn.rwth-aachen.de https://www.windows-faq.de/2022/10/04/windows-11-hosts-datei-bearbeiten/
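As a quick way to check whether the workaround helps, the following minimal sketch (for illustration only; it assumes the third-party Python package dnspython is installed) queries one of the alternative name servers mentioned above directly:

import dns.resolver  # third-party package "dnspython" (assumed to be installed)

# Build a resolver that ignores the system configuration and asks the
# Level3 name servers from the notice above instead.
resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["4.2.2.2", "4.2.2.1"]

# If the alternative servers work, this prints the address of the VPN host;
# 134.130.5.231 is the address given in the hosts-file entry above.
for record in resolver.resolve("vpn.rwth-aachen.de", "A"):
    print(record.address)

If this prints an address while your normal resolver fails, switching the DNS servers in your connection settings (or adding the hosts-file entry) should restore access.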
Die Host der RWTH Aachen sind nun wieder auch von ausserhalb des RWTH Netzwerkes erreichbar. // The hosts of RWTH Aachen University can now be reached again from outside the RWTH network.
Auch nach der Störungsbehebung am 25.8. um 21 Uhr kann es bei einzelnen Nutzer*innen zu Problemen gekommen sein. Am 26.8. um 9 Uhr wurden alle Nacharbeiten abgeschlossen, sodass es zu keinen weiteren Problemen kommen sollte. // Individual users may have experienced problems even after the fault was rectified on 25 August at 9 pm. On 26.8. at 9 a.m. all follow-up work was completed, so there should be no further problems.
Many nodes suffered an issue after our updates on the 19.08.2024, resulting in jobs failing on the CPU partitions. If your job failed to start or failed on startup, please consider requeuing it if necessary. This list of jobs was identified as possibly being affected by the issue: 48399558,48468084,48470374,48470676,48473716,48473739,48473807,48473831, 48475599,48475607_0,48475607_1,48475607_2,48475607_3,48475607_4,48475607_5, 48475607_6,48475607_7,48475607_8,48475607_9,48475607_10,48475607_11,48475607_12, 48475607_13,48475607_14,48475607_15,48475607_16,48475607_17,48475607_18,48475607_19, 48476753,48482255,48485168,48486404,48488874_5,48488874_6,48488874_7,48488874_8, 48488874_9,48488874_10,48488874_11,48488875_9,48488875_10,48488875_11,48489133_1, 48489133_2,48489133_3,48489133_4,48489133_5,48489133_6,48489133_7,48489133_8,48489133_9, 48489133_10,48489154_0,48489154_1,48489154_2,48489154_3,48489154_4,48489154_5,48489154_6,48489154_7, 48489154_8,48489154_9,48489154_10,48489154_11,48489154_12,48489154_13,48489154_14,48489154_15, 48489154_16,48489154_17,48489154_18,48489154_19,48489154_20,48489154_21,48489154_22,48489154_23, 48489154_24,48489154_25,48489154_26,48489154_27,48489154_28,48489154_29,48489154_30,48489154_31, 48489154_32,48489154_33,48489154_34,48489154_35,48489154_36,48489154_37,48489154_38,48489154_39, 48489154_40,48489154_41,48489154_42,48489154_43,48489154_44,48489154_45,48489154_46,48489154_47, 48489154_100,48489154_101,48489154_102,48489154_103,48489154_104,48489154_105,48489154_106,48489154_107, 48489154_108,48489154_109,48489154_110,48489154_111,48489154_112,48489154_113,48489154_114,48489154_115, 48489154_116,48489154_117,48489154_118,48489154_119,48489154_120,48489154_121,48489154_122,48489154_123, 48489154_124,48489154_125,48489154_126,48489154_127,48489154_128,48489154_129,48489154_130,48489154_131, 48489154_132,48489154_133,48489154_134,48489154_135,48489154_136,48489154_137,48489154_138,48489154_139, 48489154_140,48489154_141,48489154_142,48489154_143,48489154_144,48489154_145,48489154_146,48489154_147, 48489154_148,48489154_149,48489154_150,48489154_151,48489154_152,48489154_153,48489154_154,48489154_155, 48489154_156,48489154_157,48489154_158,48489154_159,48489154_160,48489154_161,48489154_162,48489154_163, 48489154_164,48489154_165,48489154_166,48489154_167,48489154_168,48489154_169,48489154_170,48489154_171, 48489154_172,48489154_173,48489154_174,48489154_175,48489154_176,48489154_177,48489154_178,48489154_179, 48489154_180,48489154_181,48489154_182,48489154_183,48489154_184,48489154_185,48489154_186,48489154_187, 48489154_188,48489154_189,48489154_190,48489154_191,48489154_192,48489154_193,48489154_194,48489154_195, 48489618_1,48489618_2,48489618_3,48489618_4,48489618_5,48489618_6,48489618_7,48489618_8,48489618_9,48489618_10, 48489776,48489806_6,48489806_55,48489806_69,48489806_98,48489842,48489843,48489844,48489845,48489882_1,48489882_2, 48489882_3,48489882_4,48489882_5,48489882_6,48489882_7,48489882_8,48489882_9,48489882_10,48494481,48494490,48494752, 48494753,48494754,48494755,48494756,48494757,48494758,48494759,48494760
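If several of your jobs are affected, a small sketch like the following (illustrative only; the job IDs shown are placeholders, replace them with the IDs from the list above that belong to you) can requeue them via Slurm's scontrol:

import subprocess

# Placeholder job IDs; use the entries from the list above that belong to you.
affected_jobs = ["48399558", "48468084"]

for job_id in affected_jobs:
    # "scontrol requeue <jobid>" puts a failed batch job back into the pending queue.
    result = subprocess.run(["scontrol", "requeue", job_id],
                            capture_output=True, text=True)
    if result.returncode != 0:
        print(f"Requeue of job {job_id} failed: {result.stderr.strip()}")

Requeueing only works for jobs you own that Slurm still has records of; otherwise, simply resubmit the job script.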
Due to updates to our compute nodes, the HPC system will be unavailable for maintenance. The login nodes will be available at noon without interruptions, but the batch queue for jobs won't be usable during the maintenance work. As soon as the maintenance work has been completed, batch operation will be enabled again. These jobs should be requeued if necessary: 48271714,48271729,48271731,48463405,48463406,48463407,48466930, 48466932,48468086,48468087,48468088,48468089,48468090,48468091, 48468104,48468105,48468108,48468622,48469133,48469262,48469404, 48469708,48469734,48469740,48469754,48469929,48470011,48470017, 48470032,48470042,48470045,48474641,48474666,48475362,48489829, 48489831,48489833_2,48489838
Please migrate your notebooks to work with the newer c23 GPU Profiles! -- The migration of the GPU Profiles to Claix 2023 and the new c23g nodes has made the old Python packages use non-optimal settings on the new GPUs. Redeployment of these old profiles is necessary and will take some time.
Since the cluster maintenance, random MPI job crashes are observed. We are currently investigating the issue and are working on a solution.
We have identified the issue and are currently testing workarounds with the affected users.
After successful tests with affected users, we have rolled out a workaround that automatically prevents this issue for our IntelMPI installations. We advise users to remove any custom workarounds from their job scripts to ensure compatibility with future changes.
The c23i partition is DOWN due to unforeseen behaviour of our monitoring system, which automatically sets the only node in the partition to DOWN. A solution is not yet known and will be investigated. The HPC JupyterHub will not be able to use the partition until this is resolved.
Due to a security vulnerability in the Linux kernel, user namespaces are temporarily deactivated. Once the kernel has been updated, user namespaces can be used again.
User namespaces are available again.
The quota system on HPCWORK may not work correctly. A "Disk quota exceeded" error may occur when trying to create files, even though the r_quota command reports that enough quota should be available. The supplier of the filesystem has been informed and is working on a solution.
File quotas for all hpcwork directories were increased to one million.
During the maintenance, $HPCWORK will be reconfigured so that the CLAIX23 nodes access it via RDMA over InfiniBand instead of over Ethernet. At the same time, the kernel will be updated. After the kernel update, the previously deactivated user namespaces will be re-activated.
The maintenance had to be extended for final filesystem tasks
Due to unforeseen problems, the maintenance has to be extended to tomorrow, 16.07.2024, 18:00. We do not expect the manufacturer of the filesystem to need that long and expect to open the cluster again earlier.
The maintenance has been completed successfully. Once again, sorry for the long delay.
The HPC JupyterHub is down after a failed update to 5.0.0 and will stay down until the update work is complete. Update: The HPC JupyterHub could not be updated to 5.0.0 and remains at version 4.1.5.
The FastX web servers on login18-x-1 and login18-x-2 have been stopped, i.e. the addresses https://login18-x-1.hpc.itc.rwth-aachen.de:3300 and https://login18-x-2.hpc.itc.rwth-aachen.de:3300 are not available anymore. Please use login23-x-1 or login23-x-2 instead.
login18-x-1 and login18-x-2 have been decommissioned.
Due to maintenance work on the water cooling system, Claix23 must be empty during the specified period. As soon as the maintenance work has been completed, batch operation will be enabled again. The dialog systems are not affected by the maintenance work.
Additionally, between 10 and 11 o'clock, there will be a maintenance of the RegApp. During this time, new logins will not be possible, existing connections will not be disturbed.
Because Rocky Linux 8.9 has reached its end of life (EOL), the MPI nodes of CLAIX23 must be upgraded to Rocky 8.10. The upgrade is performed in the background during production to minimize cluster downtime. However, during the upgrade, selected free nodes will be taken out of service and will not be available for job submission until their upgrade is completed. Please keep in mind that the installed library versions will likely change with the update; thus, performance and application behaviour may vary compared to earlier runs.
Starting now, all new jobs will be scheduled to Rocky 8.10 nodes. The remaining nodes that still need to be updated are unavailable for job submission. These nodes will be upgraded as soon as possible after their current jobs complete.
The update of the frontend and batch nodes is completed. The remaining nodes (i.e. integrated hosting and service nodes) will be updated during the cluster maintenance scheduled for 2024-06-26.
The dialog nodes (i.e. login23-1/2/3/4, login23-x-1/2) will be updated to Rocky 8.10 today within the weekly reboot. The upgrade of copy23-1/2 will follow.
The copy frontend nodes (copy23-1, copy23-2) will be updated to Rocky Linux 8.10 during the cluster maintenance on 2024-06-26.
The update of the remaining frontend nodes is completed.
Due to technical problems, it is not possible to create, change, or delete HPC accounts or projects. We are working on the issue.
The issue has been resolved.
During this period no RWTH-S, THESIS, LECTURE or WestAI projects can be granted. We apologize for the inconvenience.
Due to maintenance of the RegApp Identity Provider, it is not possible to establish new connections to the cluster during the specified period. Existing connections and batch operation are not affected by the maintenance.
The RegApp is moved to a new server with a new operating system.
(German version below) Due to an open security issue we are required to disable the feature of so-called user namespaces on the cluster. This feature is mainly used by containerization software and affects the way apptainer containers will behave. The changes are effective immediately. Most users should not experience any interruptions. If you experience any problems, please contact us as usual via servicedesk@itc.rwth-aachen.de with a precise description of the features you are using. We will reactivate user namespaces as soon as we can install the necessary fixes for the aforementioned vulnerability. --- Aufgrund eines ausstehenden Sicherheitsproblems müssen wir sogenannte Usernamespaces auf dem Cluster vorübergehend deaktivieren. Dieses Feature wird hauptsächlich von Containervirtualisierungssoftware wie Apptainer genutzt, und die Abschaltung hat einen Einfluss darauf, wie diese Container intern aufgesetzt werden. Die meisten Nutzer sollten von diesen Änderungen nicht direkt betroffen sein und nahtlos weiterarbeiten können. Sollten Sie dennoch Probleme entdecken, kontaktieren Sie uns bitte via servicedesk@itc.rwth-aachen.de und schildern Sie uns, wie konkret Sie Ihre Container starten. Sobald wir einen Patch für die Sicherheitslücke einspielen können, werden wir User Namespaces wieder aktivieren.
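If you are unsure whether your Apptainer workflow is affected, the following minimal sketch (an assumption on our side: it only checks the common max_user_namespaces sysctl, which is one way, but not the only way, such a restriction can be enforced) reports whether unprivileged user namespaces currently appear to be enabled on a node:

from pathlib import Path

def user_namespaces_enabled() -> bool:
    # A value of 0 in this sysctl means no new user namespaces can be created.
    sysctl = Path("/proc/sys/user/max_user_namespaces")
    try:
        return int(sysctl.read_text()) > 0
    except (FileNotFoundError, ValueError):
        return False  # file missing or unreadable; assume the feature is unavailable

if __name__ == "__main__":
    state = "enabled" if user_namespaces_enabled() else "disabled"
    print(f"Unprivileged user namespaces appear to be {state} on this node.")

If the check reports "disabled", containers that rely on user namespaces may fail to start or behave differently, as described above.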
A kernel update addressing the issue has been released upstream and will be available on the compute cluster soon. Once it is installed, user namespaces can be enabled again.
We are planning to re-enable user namespaces on April 29th after some final adjustments.
We are currently observing recurring performance degradation on HPCWORK directories, which might be partly worsened by the ongoing migration process leading up to the filesystem migration on April 17th. The problems cannot be traced back to a single cause but are being actively investigated.
Due to technical problems, we will have to postpone the maintenance (and the final lustre migration step) to 23.04.2024 07:00.
The whole cluster needs to be updated with a new kernel so that user namespaces can be re-enabled (cf. https://maintenance.itc.rwth-aachen.de/ticket/status/messages/14/show_ticket/8929). At the same time, the InfiniBand stack will be updated for better performance and stability. During this maintenance, the dialog systems and the batch system will not be available. The dialog systems are expected to be reopened in the early morning. We do not expect the maintenance to last the whole day and hope to open the cluster earlier.
Due to technical problems, we will have to postpone the maintenance to 23.04.2024 07:00.
Unfortunately, unplanned complications have arisen during maintenance, so that maintenance will have to be extended until midday tomorrow. We will endeavor to complete the work by then. We apologize for any inconvenience this may cause.
In the last weeks, we started migrating all HPCWORK data to a new filesystem. During this maintenance, we will perform the final migration step. HPCWORK will not be available during this maintenance.
Due to technical problems, we will have to postpone the maintenance (and the final lustre migration step) to 23.04.2024 07:00.
During the Claix HPC system maintenance, the HPC JupyterHub will be updated to a newer version. This will improve Claix 2023 support and include mandatory security updates. The whole cluster also needs to be updated with a new kernel.
The migration was successfully completed.
During the stated time Claix-2023 will not be available due to a benchmark run for the Top500 list[1]. Batch jobs which cannot finish before the start of this downtime or which are scheduled during this time period will be kept in queue and started after the cluster resumes operation. [1] https://www.top500.org
The nodes are now available again.
There are currently longer waiting times in the ML partition as the final steps of the acceptance process are still being carried out.
The waiting times should be shorter now.
+++ German version below +++ The RegApp will be updated on 2024-04-03. During the update window, the service will be unavailable for short time intervals. Active sessions should not be affected. +++ English version above +++ Am 03.04.2024 wird die RegApp aktualisiert. Während des Updatefensters kann der Dienst für kurze Zeit unterbrochen sein. Aktive Sitzungen sollten nicht betroffen sein.
There are currently problems when submitting jobs. We are working on fixing the problems and apologize for the inconvenience.
The problem is solved now.
(German version below) Due to an open security issue we are required to disable the feature of so-called user namespaces on the cluster. This feature is mainly used by containerization software and affects the way apptainer containers will behave. The changes are effective immediately. Most users should not experience any interruptions. If you experience any problems, please contact us as usual via servicedesk@itc.rwth-aachen.de with a precise description of the features you are using. We will reactivate user namespaces as soon as we can install the necessary fixes for the aforementioned vulnerability. Update: We have installed a bugfix release for the affected software component and enabled user namespaces again. --- Aufgrund eines ausstehenden Sicherheitsproblems müssen wir sogenannte User Namespaces auf dem Cluster vorübergehend deaktivieren. Dieses Feature wird hauptsächlich von Containervirtualisierungssoftware wie Apptainer genutzt und die Abschaltung hat einen Einfluss darauf, wie diese Container intern aufgesetzt werden. Die meisten Nutzer sollten von diesen Änderungen nicht direkt betroffen sein und nahtlos weiterarbeiten können. Sollten Sie dennoch Probleme entdecken, kontaktieren Sie uns bitte via servicedesk@itc.rwth-aachen.de und schildern Sie uns, wie konkret Sie Ihre Container starten. Sobald wir einen Patch für die Sicherheitslücke einspielen können, werden wir User Namespaces wieder aktivieren. Update: Wir haben einen Bugfix für die betroffene Softwarekomponente installiert und User Namespaces wieder aktiviert.
Zurzeit werden keine Daten auf /hpcwork angezeigt. Die Fachabteilung ist informiert und arbeitet an der Lösung. ---english--- At the moment, no data are shown on /hpcwork. We are working on a solution of the problem.
Die Störung wurde behoben. // The problem has been solved.
Both CLAIX18 copy nodes will be rebooted on Monday, January 29th, 6:00 am (CET) due to a scheduled kernel upgrade. The systems will be temporarily unavailable and cannot be used until the kernel update is finished.
Due to network problems, there may have been issues using the cluster during the specified period.
For the login to login18-4.hpc.itc.rwth-aachen.de it is again mandatory to use two-factor authentication. For details see https://help.itc.rwth-aachen.de/service/rhr4fjjutttf/article/475152f6390f448fa0904d02280d292d/
Momentan kann keine Verbindung zum Windows-Cluster hergestellt werden. Die Kollegen sind informiert und arbeiten an der Behebung des Problems. -- english -- At the moment it is not possible to connect to the windows cluster. We are working on a solution of the problem.
--English Version Below-- Die Störung konnte behoben werden. Eine Verbindung mit dem Windows-Cluster ist wieder möglich. --English Version-- The error has been resolved. You can connect to the Windows cluster again.
The DNS entry for jupyterhub.hpc.itc.rwth-aachen.de is temporarily out of service for 20 minutes. Problems accessing the HPC JupyterHub might arise from this failure. Please wait until the system comes back online.
Der DGX-2-Knoten nd20-02 wird voraussichtlich Montag, den 27.11. und Dienstag, den 28.11. ganztägig nicht zur Verfügung stehen. Grund hierfür ist das Betriebssystemupdate auf Rocky 8. -- The DGX-2 node nd20-02 will not be available on Monday (27.11.) and Tuesday (28.11.) for the whole day. We will be updating the operating system to Rocky 8 in the specified time
The node needs to be reinstalled and cannot be used until further notice.
The update of the system was successful.
Due to maintenance work, the creation of HPC accounts is delayed. Password changes are not possible.
login18-x-2 is defective and is therefore currently unavailable.
The system is OK again.
The complete cluster will not be available from 8 am to 12 noon due to system maintenance. During the maintenance, the HPC cluster will be upgraded to Rocky 8.9.
Due to technical problems, we have to postpone the maintenance to Monday next week.
Due to technical problems, we have to prolong the maintenance.
The maintenance was completed successfully.
Currently, some users receive an error message after logging into the RegApp application. We are already working on a solution. --- Aktuell kommt es bei einigen Nutzern nach dem Login in die Regapp zu einer Fehlermeldung. Wir arbeiten bereits an einer Lösung.
Am 17.10 finden Wartungsarbeiten an der Klimaanlage der Maschinenhalle statt. Aus diesem Grund muss der Batchbetrieb im angegeben Zeitraum angehalten werden und der Cluster leer laufen. Nach den Wartungsarbeiten wird der Batchbetrieb automatisch wieder gestartet. --- Maintenance work on the air conditioning system of the machine hall will take place on 17.10. For this reason, batch operation must be stopped in the specified period and the cluster must run empty. After the maintenance work, batch operation will be restarted automatically.
The maintenance is completed. Jobs are scheduled and executed again. -- Die Wartung ist abgeschlossen. Jobs werden wieder gescheduled und ausgeführt.
Due to a network maintenance in the IT Center building SW23, the HPC Service will be temporarily suspended. During the maintenance, the cluster (including all frontend nodes) will not be available. -- Wegen Wartungsarbeiten am Netzwerk im IT-Center SW23 wird der HPC-Betrieb vorübergehend unterbrochen. Während der Wartung ist der Cluster (alle Frontendknoten einbegriffen) nicht erreichbar.
The network maintenance is completed. Until all services of the cluster are restored, the HPC service will remain suspended.
The cluster is reachable again.
Lustre18 will be temporarily shut down during the maintenance. The frontend nodes will have to be rebooted. -- Lustre18 wird während der Wartung temporär gestoppt. Die Frontendknoten werden erforderlicherweise neu-gestartet.
Aktuell laesst sich auf den HPC-Dialogsystemen das Programm gnome-terminal nicht direkt starten. Wir versuchen aktuell noch herauszufinden, was das Problem ist. Bitte nutzen Sie ersatzweise ein anderes Terminal-Programm wie xterm, mate-terminal oder xfce-terminal. Evtl. ist gnome-terminal auch als Default-Terminal-Applikation in ihrer Desktop-Umgebung eingestellt. In diesem Fall passiert nichts, wenn Sie auf das Terminal-Icon druecken. Sie muessten dann ebenfalls ein anderes Terminal-Programm als Default-Applikation konfigurieren: Currently the program gnome-terminal cannot be started directly on the HPC dialog systems. We are still trying to find out what the problem is. Please use another terminal program like xterm, mate-terminal or xfce-terminal instead. Maybe gnome-terminal is also set as default terminal application in your desktop environment. In this case nothing happens when you press the terminal icon. You would have to configure another terminal program as default application as well: MATE: System - Preferences - Preferred Applications - System - Terminal Emulator XFCE: Applications - Settings - Default Applications - Utilities - Terminal Emulator
One of the DGX-2 systems (nd20-01) will be temporarily unavailable due to a scheduled maintenance. We will be updating the system to Rocky Linux 8.8. Eines der DGX-2-Systeme (nd20-01) wird aufgrund geplanter Wartungsarbeiten vorübergehend nicht verfügbar sein. Wir werden das System auf Rocky 8.8 aktualisieren Update: Due to unforeseen problems, the maintenance has to be extended until Monday. We apologize for the inconvenience. Aufgrund unvorhergesehener Probleme müssen die Wartungsarbeiten bis Montag fortgesetzt werden. Wir bitten die Unannehmlichkeiten zu entschuldigen.
The two systems copy18-1 and copy18-2 will be rebooted for maintenance reasons.
Login - Node login18-2 steht am Dienstag 26.09 von 7 Uhr bis 15 Uhr nicht zur Verfügung. Es werden Arbeiten zur Verbesserung der Netzwerk-Stabilität durchgeführt. Login - Node login18-2 will not be available on Tuesday 26.09. from 7 a.m. to 3 p.m. Work is being carried out to improve network stability.
HPC services may be disrupted currently, e.g. it may not be possible to login to our dialog nodes, to start JupyterLab notebooks or to submit batch jobs. We are working on fixing the issue.
The problems are solved.
Login - Node login18-4 steht am Mittwoch 30.08 von 6 Uhr bis 15 Uhr nicht zur Verfügung. Es werden Arbeiten zur Verbesserung der Netzwerk-Stabilität durchgeführt. Login - Node login18-4 will not be available on Wednesday 30.08. from 6 a.m. to 3 p.m. Work is being carried out to improve network stability.
Login - Nodes: login18-2, login18-g-2, login18-3 stehen am Dienstag 29.08 von 6 Uhr bis 15 Uhr nicht zur Verfügung. Es werden Arbeiten zur Verbesserung der Netzwerk-Stabilität durchgeführt. Login - Nodes: login18-2, login18-g-2, login18-3 will not be available on Tuesday 29.08. from 6 a.m. to 3 p.m. Work is being carried out to improve network stability.
Login - Nodes: login18-x-1, login18-g-1, login18-2 stehen am Montag 28.08 von 6 Uhr bis 14 Uhr nicht zur Verfügung. Es werden Arbeiten zur Verbesserung der Netzwerk-Stabilität durchgeführt. Login - Nodes: login18-x-1, login18-g-1, login18-2 will not be available on Monday 28.08. from 6 a.m. to 2 p.m. Work is being carried out to improve network stability.
We currently strongly recommend using IntelMPI instead of OpenMPI, because OpenMPI jobs crash non-deterministically or remain in a "completing" state and do not complete successfully.
We have identified the root of the issue and are currently working on reverting the batch nodes to a working configuration. This might lead to slightly prolonged waiting times for new jobs. We will update this incident message as soon as all batch nodes are finished with the procedure.
The affected batch nodes are fully operational again.
The JARDS online submission system for filing applications for RWTH computing projects will be unavailable on 25.07.2023 between 7:00 and 17:00.
During the maintenance, the current operating system Rocky Linux 8.7 will be updated to Rocky Linux 8.8. The frontends will also be updated, so you will not be able to log in to the cluster or access your data. There is one exception, however: the MFA test machine login18-4 will remain reachable, but you can only log in there with a second factor [1]. At times, $HPCWORK will not be reachable there either, since the Lustre filesystem will also undergo maintenance. We do not expect that you will have to recompile your software or change your job scripts. Your jobs should therefore start normally after the maintenance ends.
The Windows dialog systems (cluster-win.rz.rwth-aachen.de) will not be available due to a necessary relocation of the server hardware.
Several users are currently experiencing difficulties logging in to the login18-x-1 frontend. We are investigating the problem. For the meantime, please use login18-x-2 instead.
Due to network problems the login to login18-x-1 has been deactivated until further notice.
The error has been fixed.
Login - Nodes: login18-x-1, login18-g-1, login18-2 stehen am Montag 22.05 von 8 Uhr bis 14 Uhr nicht zur Verfügung. Es werden Arbeiten zur Verbesserung der Netzwerk-Stabilität durchgeführt. Login - Nodes: login18-x-1, login18-g-1, login18-2 will not be available on Monday 22.05. from 8 a.m. to 2 p.m. Work is being carried out to improve network stability.
Data Transfer Node copy18-2 steht von Dienstag 16.5. 9:45 Uhr bis Mittwoch 17.5. 15:00 Uhr nicht zur Verfügung. Es werden Arbeiten zur Verbesserung der Netzwerk-Redundanz durchgeführt. --- Data Transfer Node copy18-2 will not be available from Tuesday 16.5. 9:45 a.m. to Wednesday 17.5. 3:00 p.m. Work will be done to improve network redundancy.
There will be a maintenance of our ACI tenant, which results in a network interruption for all our VMs, including LDAP, Kerberos, cvmfs, etc. Thus, it is not possible to log in during the maintenance, and we expect that already logged-in users could also face problems. Regarding running jobs (if there are any), we do not know exactly how they will be affected, but we hope that they can run through as expected.
We will have to postpone the maintenance. A new timeslot still needs to be found, so please treat the changed time as preliminary.
The maintenance will take place on Monday 15.05.2023 from 13:00 to 14:00
-- english version below -- In der angegebenen Zeit findet die Umstellung des Cluster von CentOS 7 auf Rocky 8 statt. Dabei werden von Woche zu Woche weitere Systeme mit Rocky 8 neuinstalliert und im Batchbetrieb zur Verfügung gestellt. Durch diese Umstellung kann es zu höheren Wartezeiten im Batchbetrieb kommen. Mehr Informationen finden Sie auf der folgenden Seite: https://help.itc.rwth-aachen.de/service/rhr4fjjutttf/article/c3735af4173543b9b14a3f645a553e8a/ --- In the given time the changeover of the cluster from CentOS 7 to Rocky 8 takes place. From week to week more systems will be reinstalled with Rocky 8 and made available in batch mode. Due to this changeover, there may be longer waiting times in batch mode. More information can be found on the following page: https://help.itc.rwth-aachen.de/service/rhr4fjjutttf/article/c3735af4173543b9b14a3f645a553e8a/
Incident JARDS online submission system
Due to an unplanned restart of many virtual machines, access was disrupted.
Data Transfer Node copy18-2 steht am Donnerstag 13.04. zwischen 10:00 Uhr und 15:00 Uhr nicht zur Verfügung. Es werden Arbeiten zur Verbesserung der Netzwerk-Redundanz durchgeführt. --- Data Transfer Node copy18-2 will not be available on Thursday 13.04. between 10:00 and 15:00. Work is being carried out to improve network redundancy.
In this maintenance we will switch the operating system from CentOS 7 to Rocky 8 on the following dialog systems: login18-2.hpc.itc.rwth-aachen.de login18-3.hpc.itc.rwth-aachen.de login18-x-2.hpc.itc.rwth-aachen.de login18-g-2.hpc.itc.rwth-aachen.de copy18-2.hpc.itc.rwth-aachen.de More background information concerning this change will be provided on the rz-cluster mailing list.
The system will not be available during the maintenance period. Please use copy18-1 instead.
The dialog systems copy18-2, login18-2, login18-3 and login18-g-2 will not be available during the maintenance period. Please use one of the other dialog systems instead.
During the maintenance, no new passwords can be set for the HPC service. New HPC accounts will only actually be created once the maintenance has ended.
Unfortunately, the maintenance has to be extended.
The maintenance is completed.
Access to the home directories of the RWTH Compute Cluster from Windows clients is currently not possible due to maintenance.
Access is working again.
To improve the backup capability of the HOME file system, we need to restructure the home directories. During this work, login to the frontend nodes is disabled, no Slurm jobs are executed, and you cannot check the status of your jobs.
The system maintenance is finished.
Currently, starting a web desktop session on http://login18-x-1.hpc.itc.rwth-aachen.de:3000/ may fail. Please log in on http://login18-x-2.hpc.itc.rwth-aachen.de:3000/ instead, or use the FastX desktop client; see https://help.itc.rwth-aachen.de/service/rhr4fjjutttf/article/25f576374f984c888bb2a01487fef193/
JARDS online submission system for NHR projects (NHR large, NHR normal, Prep) will not be available on 09.02.23 between 8 and 9 o'clock.
JARDS online submission system for NHR projects (NHR large, NHR normal, Prep) will not be available on 06.02.23 between 7 and 9 o'clock.
Please note that the JARDS online submission system for filing applications for RWTH computing projects (rwth small, rwth thesis and rwth lecture) will not be available on 30.01.23 between 7 and 9 o'clock.
Please note that the JARDS online submission system for filing applications for NHR projects (NHR large, NHR normal, Prep) and RWTH computing projects (rwth small, rwth thesis and rwth lecture) will not be available on 26.01.23 between 7 and 8 o'clock.