Rechner-Cluster
More information about this service can be found in our documentation portal.
Slurm emergency downtime
Due to unforeseen circumstances we are forced to fully stop and restart the Slurm batch system infrastructure.
During this short downtime, submitting new jobs won't be possible, Slurm commands will not be available, and other Slurm-related tasks cannot be performed.
Already running jobs should be able to continue and finish without issue.
RegApp migration behind the NHR proxy
The NHR RegApp will be placed in front of the Aachen RegApp as a federated proxy.
During the maintenance, no new logins to the cluster, Perfmon, or JupyterHub will be possible.
Users are advised to secure a session beforehand.
Global outage HPC System
A failure in several interconnected systems is causing a global outage of the HPC system.
We are working on a solution.
RegApp Maintenance
The RegApp service will be updated to a new version.
During the maintenance, no new logins to the cluster, perfmon, and JupyterHub will be possible.
It is recommended that users acquire a session beforehand.
Cluster malfunction
There are currently problems logging in to the cluster. Logins may take significantly longer than usual and may be aborted. We are working on a solution.
HPC JupyterHub down due to updates
The HPC JupyterHub is currently down to fix issues with internal updates.
Bug with jobs landing on _low partitions
We currently have issues with jobs using the default account landing on _low partitions.
Full system maintenance.
Due to various necessary maintenance works, the entire CLAIX HPC System will be unavailable.
The initial phase of the maintenance should last until 12:00, after which the filesystems and login nodes should be available again.
Jobs will not run until the full maintenance works are completed.
We aim to also upgrade the Slurm scheduler during the downtime.
Please note the following:
- User access to the HPC system through login nodes, HPC JupyterHub or any other connections will not be possible during the initial part of the maintenance.
- No Slurm jobs will be able to run during the entire maintenance.
- Before the maintenance, Slurm will only start jobs that are guaranteed to finish before the start of the maintenance; any running jobs must finish by then or might be terminated.
- Nodes might therefore remain empty leading up to the maintenance, as Slurm clears them of user jobs.
- Waiting times before and after the maintenance might be higher than usual, as nodes are emptied beforehand and the queue of waiting jobs grows afterwards.
- Files in your personal or project directories will not be available during the initial part of the maintenance.
RegApp Maintenance
The RegApp service will be temporarily unavailable due to OS updates and to prepare for upcoming changes.
Login to the HPC frontends and to perfmon.hpc.itc.rwth-aachen.de will be unavailable.
It is recommended to log in to the services beforehand.
Temporary Suspension of GPU Batch Service
Due to urgent security updates, the batch service will be temporarily suspended for all GPU nodes. After the updates are deployed, the service will be resumed. Running jobs are not affected and will continue. The dialog node login23-g-1 is not affected.
Once the updates are automatically installed upon job finalization, the nodes will be available again at short notice.
HOME filesystem unavailable
The users' HOME directories were temporarily unavailable on both the login and the batch nodes. As a consequence, existing sessions trying to access these directories would hang, and new connections to the cluster could not be established. Batch jobs that started or ran during this time frame may have failed as a result. If your job has crashed, please check its output; in case of error messages related to files or directories underneath /home, please resubmit the affected jobs. Some of the batch nodes were automatically disabled as a result and have now been put back into operation.
We apologize for the inconveniences this has caused.
Full global maintenance of the HPC CLAIX Systems
Our GPFS global filesystem needs to be updated and will cause the entire CLAIX HPC System to be unavailable.
Please note the following:
- User access to the HPC system through login nodes, HPC JupyterHub or any other connections will not be possible during the maintenance.
- No Slurm jobs or filesystem-dependent tasks will be able to run during the maintenance.
- Before the maintenance, Slurm will only start jobs that are guaranteed to finish before the start of the maintenance; any running jobs must finish by then or might be terminated.
- Nodes might therefore remain empty leading up to the maintenance, as Slurm clears them of user jobs.
- Waiting times before and after the maintenance might be higher than usual, as nodes are emptied beforehand and the queue of waiting jobs grows afterwards.
- Files in your personal or project directories will not be available during the maintenance.
Migration to Rocky Linux 9
The CLAIX-2023 copy nodes copy23-1 and copy23-2 will be reinstalled with Rocky Linux 9. During the reinstallation, the nodes will not be available.
NFS malfunction of the GPFS servers
All nodes are currently draining due to a malfunction. We are working on it with the vendor.
$HOME and $WORK filesystems unavailable again
Due to issues with the underlying filesystem servers for $HOME and $WORK, the batch nodes are currently unavailable, and access to $HOME and $WORK on the login nodes is not possible.
$HOME and $WORK unavailable on login23-g-1
Due to issues with the GPFS filesystem $HOME and $WORK are not available on login23-g-1.
The issue has been resolved.
login23-1 unreachable
For troubleshooting and fault analysis of the frontend node, the dialog node login23-1 is temporarily unavailable.
Reboot of CLAIX-2023 copy nodes
Due to a pending kernel update, the CLAIX-2023 copy nodes copy23-1 and copy23-2 will be rebooted on Monday, 2025-08-25.
GPU login node unavailable due to maintenance
Due to mandatory maintenance work, the GPU login node n23-g-1 will not be available during the maintenance.
During the maintenance, the node's firmware will be updated, which takes several hours to complete.
$HOME and $WORK filesystems unavailable again
Due to issues with the underlying filesystem servers for $HOME and $WORK, the majority of batch nodes are currently unavailable, and access to $HOME and $WORK on the login nodes is not possible.
File System Issues
Due to issues with the shared filesystems, all services with respect to the HPC cluster are substantially impacted. The issues are under current investigation, and we are working on a solution.
$HOME and $WORK filesystems unavailable
Due to issues with the GPFS file system, the $HOME and $WORK directories may currently be unavailable on the login nodes. Additionally, some compute nodes are temporarily not available due to these issues. Our team is working to resolve the problem as quickly as possible.
$HOME and $WORK filesystems unavailable, causing jobs to fail
Due to issues with the underlying filesystem servers (GPFS) for $HOME and $WORK, some batch nodes are currently unavailable and the login nodes might be unstable. This may have caused running jobs to fail.
We are working to resolve the problem as quickly as possible and apologize for any inconvenience caused.
EDIT: Systems seem to be back online and operating normally.
Emergency Shutdown of CLAIX-2023
Due to cooling issues, the CLAIX-2023 cluster has been shut down. Accessing the cluster is currently not possible.
Please note that running jobs may have been terminated due to the shutdown.
We are actively working on identifying the root cause and resolving the issue as quickly as possible.
We apologize for any inconvenience this may cause and thank you for your understanding.
Migration of login23-3 and login23-x-1 to Rocky 9
The remaining Rocky 8 login nodes login23-3 and login23-x-1 will be upgraded to Rocky 9. These nodes will be unavailable during the upgrade process. Please use the other available login nodes in the meantime. The copy nodes are unaffected and will remain Rocky 8 nodes until further notice.
Login failing
Logging in via RWTH Single Sign-On is currently not possible.
Existing sessions are not affected.
We are working on resolving the problem.
p7zip replaced with 7zip
Due to security concerns, the previously installed file archiver "p7zip" (a fork of "7zip") was replaced by the original program.
p7zip was forked from 7zip two decades ago because 7zip initially lacked Unix and Linux support. Unfortunately, its development lags severely behind the upstream code, so issues are only partially fixed, if at all. Since upstream 7zip also contains several functional and performance improvements in addition to the bug fixes, overall performance should have improved for users as well. Usage should not change aside from minor differences due to the diverging development history.
$HOME and $WORK filesystems unavailable again
Due to issues with the underlying filesystem servers for $HOME and $WORK, all batch nodes are currently unavailable and the login nodes are not usable. This may have caused running jobs to fail, and no new jobs can start during the downtime.
We are working to resolve the problem as quickly as possible and apologize for any inconvenience caused.
$HOME and $WORK filesystems unavailable
Due to issues with the underlying filesystem servers for $HOME and $WORK, all batch nodes are currently unavailable and the login nodes are not usable. This may have caused running jobs to fail, and no new jobs can start during the downtime.
We are working to resolve the problem as quickly as possible and apologize for any inconvenience caused.
Firmware Upgrade of Frontend Nodes
The firmware of login23-3 and login23-4 must be upgraded. During the upgrade, the nodes will not be available. Please use the other dialog nodes in the meantime.
Old ticket without title
During the specified period, the RWTH Single Sign-On was intermittently disrupted. After entering the credentials, the screen loads and then an Internal Server Error appears.
This affects all services that use the Single Sign-On login.
User namespaces on all Rocky 9 systems deactivated
Due to an open security issue we have deactivated user namespaces on all Rocky 9 systems.
This feature is mainly used by containerization software and affects the way apptainer containers will behave.
Most users should not experience any interruptions. If you experience any problems, please contact us as usual via servicedesk@itc.rwth-aachen.de with a precise description of the features you are using.
Usability of Desktop Environments on CLAIX-2023 Frontend Nodes Restricted due to Security Issues
Due to high-risk security issues within some components, the affected packages had to be removed from the cluster to keep it operational until the security issues can be fixed. Unfortunately, the removal breaks the desktop environments XFCE and MATE due to tight dependencies on the removed packages. Consequently, the file managers "Thunar" and "Caja" as well as some other utilities cannot be used at the moment when using a GUI login.
If a GUI login is mandatory, please use the IceWM environment instead. However, some of the GUI applications might not work as expected (see above).
The text-based/terminal usage of the frontend nodes is not affected by this temporary change.
Login via RWTH Single Sign-On disrupted
The RWTH Single Sign-On is currently experiencing disruptions. After entering the credentials, the screen loads and then an Internal Server Error appears.
This affects all services that use the Single Sign-On login.
The responsible department has been informed and is working on a fix.
HPC JupyterHub down for maintenance
The HPC JupyterHub will be down for maintenance between 9:00 and 11:00.
Upgrade of Login Nodes
On Monday, June 23rd, the CLAIX-2023 Rocky 9 frontend nodes login23-1, login23-2 and login23-4 will be upgraded to Rocky Linux 9.6.
During the upgrade, these nodes will temporarily not be available. Please use the Rocky 8 frontend nodes in the meantime.
Job submission to some new Slurm projects/accounts fails
Our second Slurm controller machine has lost memory modules and is currently pending maintenance.
This machine is responsible for keeping our Slurm DB in a current state.
Without it, submission to some new Slurm projects/accounts will fail.
Please write a ticket if the problem persists.
login23-2 currently not available
login23-2 is currently not available. Please use one of the other dialog systems.
System maintenance for the hpcwork filesystem
During the maintenance, the dialog systems will continue to be available most of the time, but without access to the hpcwork directories.
Batch jobs will not be running.
HPC password change unavailable
Due to issues with the new RegApp version, changing the HPC password may fail.
RegApp Maintenance
The RegApp software will be updated to the newest version. During this time, login to the frontends and perfmon may be unavailable. It is recommended to log in beforehand.
cp command of hpcwork files may fail on Rocky Linux 9 systems
On Rocky Linux 9 systems (especially login23-1 or login23-4) the cp command of hpcwork files may fail with the error "No data available". Current workarounds: either use "cp --reflink=never ..." to copy files or run the cp command on one of the Rocky Linux 8 nodes, e.g. copy23-1 or copy23-2.
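For illustration, a copy that fails this way on a Rocky 9 node can usually be retried with reflink cloning disabled (the file paths below are placeholders, not actual directories):

  # retry the copy without reflink cloning (GNU coreutils cp)
  cp --reflink=never /hpcwork/ab123456/results.dat /home/ab123456/results.dat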
Routing via new XWiN routers
During this period, routing will be migrated from the previous XWiN routers (Nexus 7700) to the new XWiN routers (Catalyst 9600). These routers are essential for the network connectivity of RWTH. This changeover also requires migrating the DFN uplinks, which are redundantly connected to Frankfurt and Hannover, as well as the RWTH firewall to the new systems.
Within the maintenance window, there will be full or partial outages of the external connectivity. All RWTH services (e.g. VPN, email, RWTHonline, RWTHmoodle) will be unavailable during this period. Services within the RWTH network may also be temporarily unreachable due to limited DNS functionality.
Partial Migration of CLAIX-2023 to Rocky Linux 9
During the maintenance, we will migrate half of the CLAIX-2023 nodes (ML & HPC) and the entire devel partition to Rocky Linux 9. Additionally, the dialog systems login23-1, login23-x-1, and login23-x-2 will be migrated.
Currently no home and work directories on login23-2
No home and work directories are currently available on login23-2. Please use one of the other dialog systems.
Swapping MFA backend of RegApp
We will change the MFA backend of the RegApp. Therefore, new logins will not be possible during the maintenance.
Swapping MFA backend of RegApp
We will change the MFA backend of the RegApp. Therefore, new logins will not be possible during the maintenance.
Cooling disrupted, emergency shutdown of CLAIX23
We have a problem with the external cooling and therefore have to perform an emergency shutdown of CLAIX23.
HPC JupyterHub maintenance
The HPC JupyterHub will be down for maintenance to update some internal software packages.
Access to hpcwork directories hanging
Currently the access to the hpcwork directories may hang due to problems with the file servers. The problem is being worked on.
HPCJupyterHub Profile Installation Unavailable
The installation of HPCJupyterHub Profiles is currently unavailable due to issues with our HPC container system.
A downtime of at least two weeks is expected. We apologize for the inconvenience.
Update: New Profiles can be installed now, after Apptainer changes.
System Upgrade of Login Node
The CLAIX-2023 login node login23-4 will be upgraded to Rocky Linux 9.5 to assist the migration to Rocky Linux 9. During the version upgrade, the login node will not be available.
SSH Command Key Approval Currently Not Possible
Due to a bug in the RegApp, SSH command keys cannot be approved at the moment. We are working on a solution.
Update 16.04 14:35:
The cause of the bug has been identified and fixed.
[RegApp] hotfix deployment
The bug affecting SSH command key approval has been identified.
A hotfix will be deployed momentarily; two-factor logins on the HPC will be temporarily unavailable.
Maintenance of the RegApp application
No new logins are possible during the maintenance work. Users who are already logged in will not be disturbed and the maintenance will not affect the rest of the cluster. All frontends are still available, as is the batch system with its computing nodes.
Bad Filesystem Performance
At the moment, we are observing issues with the file systems that impact performance. File access may consequently be severely delayed. The issue is currently under investigation.
Global Maintenance of CLAIX-2023
Due to maintenance tasks on a cluster-wide scale that cannot be performed online, the whole batch service will be suspended for the maintenance.
The RegApp is also affected by the maintenance.
Jupyterhub Temporarily Unavailable Due To Kernel Update
Due to a scheduled kernel update, the JupyterHub node is temporarily unavailable. The node will be available again as soon as the update is completed.
Brief SSO malfunction
This evening, between roughly 20:25 and 21:00, the SSO service suffered a performance dip and was examined by the responsible department during that time. The problem has been resolved, and fast logins without waiting times are possible again.
Login nodes unavailable due to kernel update
Due to a scheduled kernel update, the login nodes, i.e., login23-*, are temporarily unavailable during the update. The nodes will be available again as soon as the update is completed.
Some CLAIX23 Nodes Unavailable Due To Update Issues
During a kernel update, issues occurred which led to delays in re-deploying the nodes to the batch service. Consequently, a large number of compute nodes are unavailable at the moment (reserved and/or down). We are currently working on handling the respective nodes and will release them as fast as possible.
HPC JupyterHub unavailable
The HPC JupyterHub is currently unavailable due to unforeseen errors in the filesystems.
A solution is being worked on.
Node unavailable due to kernel update
The kernel of the CLAIX23 copy node claix23-2 has to be updated. The node will be available again as soon as the update is finished.
RegApp malfunction
It is currently not possible to change the HPC password in the RegApp.
Creating new accounts does not work either.
RegApp Maintenance
A short maintenance of the RegApp will take place during the specified period. No logins are possible during this time.
Update:
The RegApp has been reachable again since 9:00, but communication with the second factor is disrupted. Accordingly, 2FA login to the HPC does not work yet.
Update 10:10
The malfunction has been resolved; 2FA login to the HPC works again.
RegApp Maintenance
A short maintenance of the RegApp will take place during the specified period. No login to the RegApp is possible during this time.
Slurm hiccup possible
We are migrating the Slurm controller to a new host. Short timeouts may occur; we will try to minimize them as much as possible.
Issues regarding Availability
There may currently be login problems with various login nodes. We are working on a solution.
Infiniband issues leading to not reachable nodes
Due to InfiniBand problems that still need to be analyzed, many nodes, including the whole GPU cluster, are not reachable at the moment. We are working together with the manufacturer to solve the problems.
Single Sign-On and MFA disrupted
Single Sign-On and multi-factor authentication are currently sporadically disrupted. We are already working on a solution and ask for your patience.
Nodes drained due to filesystem issues
Dear users, on Sunday, 2024-12-08, at 06:53 AM the home filesystems of most nodes went offline.
This may have crashed some jobs, and no new jobs can start during the downtime.
We are actively working on the issue.
Lost filesystem connection to $HOME and $WORK
Due to a problem in our network, some nodes lost their connection to the $HOME and $WORK file systems. This included the login23-1 and login23-2 nodes. The issue has now been resolved.
Emergency Shutdown of CLAIX-2023 Due to Failed Cooling
CLAIX-2023 was shut down as an emergency measure to prevent damage to the hardware. Due to severe power issues, the cooling facilities failed and could not provide sufficient heat dissipation.
The cluster will be operational again once the underlying issues have been resolved.
New file server for home and work directories
We are putting a new file server for the home and work directories into operation. For this purpose, we will carry out a system maintenance in order to perform the final synchronisation of all data over the weekend.
Limited Usability of CLAIX-2023
Due to ongoing external disruptions in the RWTH network, the cluster is only partially reachable and functional. The responsible network department is already working on a solution.
Power disruption
At 17:00 there was a brief power outage in the Aachen area. Power has since been restored, but the majority of the compute nodes failed as a result. It is unclear when operation can be resumed.
We are currently working on securing and restoring critical services.
Scheduler Hiccup
Our Slurm workload manager crashed for an unknown reason. Functionality was restored shortly afterwards. Further investigations are ongoing.
GPU Malfunction on GPU Login Node
Currently, a GPU of the GPU login node login23-g-1 shows an issue. The node is unavailable until the issue is resolved.
Login malfunction
It is currently not possible to log in to the login23-* frontends. There is a problem with two-factor authentication.
Top500 run for GPU nodes
We are doing a new Top500 run for the ML partition of CLAIX23.
The GPU nodes will not be available during that run.
The following nodes, as well as login23-g-1, might also be unavailable:
i23m[0027-0030],r23m[0023-0026,0095-0098,0171-0174],n23i0001
Certificate expired
Due to the expired certificate for idm.rwth-aachen.de, IdM applications and applications connected via RWTH Single Sign-On cannot be accessed.
- When accessing IdM applications, a warning about an insecure connection is shown.
- When accessing applications via RWTH Single Sign-On, a message about missing permissions is shown.
We are working hard on resolving the problem.
RegApp malfunction -> no login to the cluster possible
Unfortunately, the RegApp was disrupted during the specified period, so logging in to the cluster frontends was not possible. Existing connections were not affected. The problem has been resolved.
Old ticket without title
copy23-2 data transfer system will be unavailable for maintenance.
Firmware Update of InfiniBand Gateways
The firmware of the InfiniBand Gateways will be updated. The firmware update will be performed in background and should not cause any interruption of service.
Hosts of RWTH Aachen partially unreachable from other providers' networks
Due to a DNS malfunction, the name servers of various providers currently do not return an IP address for hosts under *.rwth-aachen.de.
As a workaround, you can configure alternative DNS servers in your connection settings, e.g. the Level3 name servers (4.2.2.2 and 4.2.2.1) or those of Comodo (8.26.56.26 and 8.20.247.20). It may also be possible to reach the RWTH VPN server; in that case, please use VPN.
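As an illustration only (assuming a Linux client whose /etc/resolv.conf is not managed by a tool such as NetworkManager or systemd-resolved), the alternative name servers could be entered like this:

  # /etc/resolv.conf
  nameserver 4.2.2.2
  nameserver 4.2.2.1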
MPI/CPU Jobs Failed to start overnight
Many nodes suffered an issue after our updates on 2024-08-19, resulting in jobs failing on the CPU partitions.
If your job failed to start or failed on startup, please consider requeuing it if necessary (see the sketch after the job list below). The following jobs were identified as possibly being affected by the issue:
48399558,48468084,48470374,48470676,48473716,48473739,48473807,48473831,
48475599,48475607_0,48475607_1,48475607_2,48475607_3,48475607_4,48475607_5,
48475607_6,48475607_7,48475607_8,48475607_9,48475607_10,48475607_11,48475607_12,
48475607_13,48475607_14,48475607_15,48475607_16,48475607_17,48475607_18,48475607_19,
48476753,48482255,48485168,48486404,48488874_5,48488874_6,48488874_7,48488874_8,
48488874_9,48488874_10,48488874_11,48488875_9,48488875_10,48488875_11,48489133_1,
48489133_2,48489133_3,48489133_4,48489133_5,48489133_6,48489133_7,48489133_8,48489133_9,
48489133_10,48489154_0,48489154_1,48489154_2,48489154_3,48489154_4,48489154_5,48489154_6,48489154_7,
48489154_8,48489154_9,48489154_10,48489154_11,48489154_12,48489154_13,48489154_14,48489154_15,
48489154_16,48489154_17,48489154_18,48489154_19,48489154_20,48489154_21,48489154_22,48489154_23,
48489154_24,48489154_25,48489154_26,48489154_27,48489154_28,48489154_29,48489154_30,48489154_31,
48489154_32,48489154_33,48489154_34,48489154_35,48489154_36,48489154_37,48489154_38,48489154_39,
48489154_40,48489154_41,48489154_42,48489154_43,48489154_44,48489154_45,48489154_46,48489154_47,
48489154_100,48489154_101,48489154_102,48489154_103,48489154_104,48489154_105,48489154_106,48489154_107,
48489154_108,48489154_109,48489154_110,48489154_111,48489154_112,48489154_113,48489154_114,48489154_115,
48489154_116,48489154_117,48489154_118,48489154_119,48489154_120,48489154_121,48489154_122,48489154_123,
48489154_124,48489154_125,48489154_126,48489154_127,48489154_128,48489154_129,48489154_130,48489154_131,
48489154_132,48489154_133,48489154_134,48489154_135,48489154_136,48489154_137,48489154_138,48489154_139,
48489154_140,48489154_141,48489154_142,48489154_143,48489154_144,48489154_145,48489154_146,48489154_147,
48489154_148,48489154_149,48489154_150,48489154_151,48489154_152,48489154_153,48489154_154,48489154_155,
48489154_156,48489154_157,48489154_158,48489154_159,48489154_160,48489154_161,48489154_162,48489154_163,
48489154_164,48489154_165,48489154_166,48489154_167,48489154_168,48489154_169,48489154_170,48489154_171,
48489154_172,48489154_173,48489154_174,48489154_175,48489154_176,48489154_177,48489154_178,48489154_179,
48489154_180,48489154_181,48489154_182,48489154_183,48489154_184,48489154_185,48489154_186,48489154_187,
48489154_188,48489154_189,48489154_190,48489154_191,48489154_192,48489154_193,48489154_194,48489154_195,
48489618_1,48489618_2,48489618_3,48489618_4,48489618_5,48489618_6,48489618_7,48489618_8,48489618_9,48489618_10,
48489776,48489806_6,48489806_55,48489806_69,48489806_98,48489842,48489843,48489844,48489845,48489882_1,48489882_2,
48489882_3,48489882_4,48489882_5,48489882_6,48489882_7,48489882_8,48489882_9,48489882_10,48494481,48494490,48494752,
48494753,48494754,48494755,48494756,48494757,48494758,48494759,48494760
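As a rough sketch of how an affected job could be resubmitted (the job script name is a placeholder, and whether scontrol requeue applies depends on the job's state and on it having been submitted as requeueable):

  # simplest option: resubmit the original batch script
  sbatch my_jobscript.sh
  # alternatively, requeue a job that Slurm still knows about by its ID
  scontrol requeue 48399558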
Maintenance
Due to updates to our compute nodes, the HPC system will be unavailable for maintenance.
The login nodes will be available again at noon without interruptions, but the batch queue for jobs won't be usable during the maintenance work.
As soon as the maintenance work has been completed, batch operation will be enabled again.
These jobs should be requeued if necessary:
48271714,48271729,48271731,48463405,48463406,48463407,48466930,
48466932,48468086,48468087,48468088,48468089,48468090,48468091,
48468104,48468105,48468108,48468622,48469133,48469262,48469404,
48469708,48469734,48469740,48469754,48469929,48470011,48470017,
48470032,48470042,48470045,48474641,48474666,48475362,48489829,
48489831,48489833_2,48489838
Old HPCJupyterHub GPU profiles might run slower on the new c23g nodes.
Please migrate your notebooks to work with the newer c23 GPU profiles!
With the migration of the GPU profiles to CLAIX-2023 and the new c23g nodes, the old Python packages use non-optimal settings on the new GPUs.
Redeployment of these old profiles is necessary and will take some time.
MPI jobs may crash
Since the cluster maintenance, random MPI job crashes are observed. We are currently investigating the issue and are working on a solution.
c23i Partition is DOWN for the HPC JupyterHub
The c23i partition is down due to an unforeseen interaction with our monitoring system, which automatically downs the only node in the partition.
A solution is not yet known and will be investigated.
The HPC JupyterHub will not be able to use it until it is resolved.
Temporary Deactivation of User Namespaces
Due to a security vulnerability in the Linux Kernel, user namespaces are temporarily deactivated. Upon the kernel update, user namespaces can be used again.
Quotas on HPCWORK may not work correctly
The quota system on HPCWORK may not work correctly. There may be an error "Disk quota exceeded" if trying to create files although the r_quota command reports that enough quota should be available. The supplier of the filesystem has been informed and is working on a solution.
Reconfiguration of File Systems and Kernel Update
During the maintenance, $HPCWORK will be reconfigured such that it can be accessed from the CLAIX23 nodes via RDMA over InfiniBand instead of Ethernet. At the same time, the kernel will be updated. After the kernel update, the previously deactivated user namespaces will be reactivated.
HPCJupyterHub down due to update to 5.0.0
The HPCJupyterHub is down after a failed update to 5.0.0 and will stay down until the update is complete.
Update: The HPCJupyterHub could not be updated to 5.0.0 and remains at 4.1.5.
FastX web servers on login18-x-1 and login18-x-2 stopped
The FastX web servers on login18-x-1 and login18-x-2 have been stopped, i.e. the addresses https://login18-x-1.hpc.itc.rwth-aachen.de:3300 and https://login18-x-2.hpc.itc.rwth-aachen.de:3300 are not available anymore. Please use login23-x-1 or login23-x-2 instead.
Maintenance
Due to maintenance work on the water cooling system, Claix23 must be drained of jobs during the specified period. As soon as the maintenance work has been completed, batch operation will be enabled again. The dialog systems are not affected by the maintenance work.
Upgrade to Rocky Linux 8.10
Since Rocky 8.9 has reached its EOL, the MPI nodes of CLAIX23 must be upgraded to Rocky 8.10. The upgrade is performed in the background during production to minimize the downtime of the cluster. However, during the upgrade, free nodes will be removed selectively and will not be available for job submission until the upgrade is completed.
Please keep in mind that during the update, the installed library versions will likely change. Thus, performance and application behaviour may vary compared to earlier runs.
Update of Frontend Nodes
The dialog nodes (i.e. login23-1/2/3/4, login23-x-1/2) will be updated to Rocky 8.10 today within the weekly reboot. The upgrade of copy23-1/2 will follow.
Old ticket without title
Due to technical problems, it is not possible to create/change/delete HPC accounts or projects. We are working on that issue.
Error on user/project management
Due to technical problems, it is not possible to create/change/delete HPC accounts or projects. We are working on that issue.
Project management
During this period no RWTH-S, THESIS, LECTURE or WestAI projects can be granted. We apologize for the inconvenience.
RegApp Maintenance
Due to maintenance of the RegApp Identity Provider, it is not possible to establish new connections to the cluster during the specified period. Existing connections and batch operation are not affected by the maintenance.
Deactivation of User Namespaces
Due to an outstanding security issue, we have to temporarily deactivate so-called user namespaces on the cluster. This feature is mainly used by container virtualization software such as Apptainer, and the deactivation affects how these containers are set up internally. Most users should not be directly affected by these changes and should be able to continue working seamlessly. Should you nevertheless encounter problems, please contact us via servicedesk@itc.rwth-aachen.de and describe exactly how you start your containers. As soon as we can deploy a patch for the vulnerability, we will re-enable user namespaces.
Performance Problems on HPCWORK
We are currently registering recurring performance degradations on the HPCWORK directories, which might be partly worsened by the ongoing migration process leading up to the filesystem migration on April 17th. The problems cannot be traced back to a single cause but are being actively investigated.
HPC JupyterHub update
During the Claix HPC System Maintenance, the HPC JupyterHub will be updated to a newer version.
This will improve Claix 2023 support and include mandatory security updates.
The whole cluster also needs to be updated with a new kernel.
Migration from lustre18 to lustre22
In the last weeks, we started migrating all HPCWORK data to a new filesystem. In this Maintenance we will do the final migration step. HPCWORK will not be available during this maintenance.
System Maintenance
The whole cluster needs to be updated with a new kernel so that user namespaces can be re-enabled; please compare https://maintenance.itc.rwth-aachen.de/ticket/status/messages/14/show_ticket/8929
Simultaneously, the InfiniBand stack will be updated for better performance and stability.
During this maintenance, the dialog systems and the batch system will not be available. The dialog systems are expected to be reopened in the early morning.
We do not believe that the maintenance will last the whole day but expect the cluster to open earlier.
Top500 - Benchmark
During the stated time, Claix-2023 will not be available due to a benchmark run for the Top500 list [1]. Batch jobs which cannot finish before the start of this downtime or which are scheduled during this time period will be kept in the queue and started after the cluster resumes operation.
[1] https://www.top500.org
Longer waiting times in the ML partition
There are currently longer waiting times in the ML partition as the final steps of the acceptance process are still being carried out.
RegApp Service Update
On 03.04.2024 the RegApp will be updated. The service may be briefly interrupted during the update window. Active sessions should not be affected.
Problems with submitting jobs
There are currently problems when submitting jobs. We are working on fixing the problems and apologize for the inconvenience.
Deactivation of User Namespaces
Due to an outstanding security issue, we have to temporarily deactivate so-called user namespaces on the cluster. This feature is mainly used by container virtualization software such as Apptainer, and the deactivation affects how these containers are set up internally. Most users should not be directly affected by these changes and should be able to continue working seamlessly. Should you nevertheless encounter problems, please contact us via servicedesk@itc.rwth-aachen.de and describe exactly how you start your containers. As soon as we can deploy a patch for the vulnerability, we will re-enable user namespaces.
Update:
We have installed a bugfix for the affected software component and re-enabled user namespaces.
Verzeichnis "hpcwork" ist leer
Zurzeit werden keine Daten auf /hpcwork angezeigt.
Die Fachabteilung ist informiert und arbeitet an der Lösung.
Scheduled Reboot of CLAIX18 Copy Nodes
Both CLAIX18 copy nodes will be rebooted on Monday, January 29th, 6:00 am (CET) due to a scheduled kernel upgrade. The systems will be temporarily unavailable and cannot be used until the kernel update is finished.
Network problems
Due to network problems, there may have been issues when using the cluster during the specified period.
Connection to the Windows cluster not possible
Currently, no connection to the Windows cluster can be established.
The colleagues in charge have been informed and are working on resolving the problem.
jupyterhub.hpc.itc.rwth-aachen.de DNS temporarily out of service
The jupyterhub.hpc.itc.rwth-aachen.de DNS entry is temporarily out of service for 20 minutes. Problems accessing the HPC JupyterHub might arise from this failure. Please wait until the system comes back online.