We are migrating the Slurm controller to a new host. Short timeouts may occur; we will try to minimize them as much as possible.
The first attempt was not successful; we are back on the old master. We are analyzing the problems that occurred and will try again later.
We are making another attempt.
There may currently be login problems with various login nodes. We are working on a solution.
The observed issues affect the batch service as well. Consequently many batch jobs may have failed.
The observed problems can be traced back to power issues. We cannot rule out that further systems may temporarily have to be shut down in a controlled manner as part of the problem resolution. However, we hope the issues can be resolved without any additional measures.
The cluster can be accessed again; see ticket https://maintenance.itc.rwth-aachen.de/ticket/status/messages/14/show_ticket/9521 for further details. Several nodes, however, are still unavailable due to the consequences of the aforementioned issues. We are currently working on resolving them.
Due to InfiniBand problems that are still being analyzed, many nodes, including the whole GPU cluster, are not reachable at the moment. We are working together with the manufacturer to solve the problems.
The problem has been fixed.
Zurzeit ist der Single Sign-On und die Multifaktorauthentifizierung sporadisch gestört. Wir arbeiten bereits an einer Lösung und bitten um Geduld. ---english--- At the moment, single sign-on and multi-factor authentication are sporadically disrupted. We are already working on a solution and ask for your patience.
Dear users, on Sunday, 8.12.2024, at 06:53 AM the home filesystems of most nodes went offline. This may have crashed some jobs, and no new jobs can start during the downtime. We are actively working on the issue.
Most nodes are coming back online. Apologies for the trouble. We expect most nodes to be usable by 18:00.
Due to a problem in our network, some nodes lost their connection to the $HOME and $WORK file system. This included the login23-1 and login23-2 nodes. The issue has been resolved now.
CLAIX-2023 was shut down as an emergency measure to prevent damage to the hardware. Due to severe power issues, the cooling facilities failed and could not provide sufficient heat dissipation. The cluster will be operational again once the underlying issues have been resolved.
All MPI and GPU nodes are liquid cooled. Both CDUs (Cooling Distribution Units) failed, so heat dissipation was no longer possible.
Both CDUs are active again and cooling has been restored. The cluster will be booted again for damage analysis only. The batch service remains suspended until all issues are resolved and all power safety checks have passed.
The cooling system is now fully operational again. Additionally, we have implemented further measures to enhance stability in the future. The queues were reopened last night; however, we are currently conducting a detailed investigation into some specific nodes regarding their cooling and performance. Once these investigations are complete, the affected nodes will be made available through the batch system again.
We are putting a new file server for the home and work directories into operation. For this purpose, we will carry out a system maintenance in order to complete the final synchronisation of all data over the weekend.
The maintenance needs to be extended.
Due to some issues preventing a normal batch service, the maintenance had to be extended.
Due to ongoing external issues in the RWTH Aachen network, access to and usability of the Compute Cluster are limited at the moment. The responsible network department is currently working on a solution. -- Wegen anhaltender externer Störungen im RWTH-Netzwerk ist der Cluster nur eingeschränkt erreichbar und funktionsfähig. Die zuständige Netzwerkfachabteilung arbeitet bereits an einer Lösung der Probleme.
The issues have not yet been resolved and may persist throughout tomorrow as well.
The issues have been resolved.
At 17:00, there was a brief interruption of the power supply in the Aachen area. Power is available again; however, most of the compute nodes went down as a result. It is currently unclear when the service can be resumed. At the moment, critical services are receiving special attention and are being restored where required. Um 17:00 Uhr hat es einen kurzzeitigen Stromausfall im Raum Aachen gegeben. Die Stromversorgung besteht wieder, jedoch ist die Mehrzahl der Compute-Knoten infolgedessen ausgefallen. Es ist unklar, wann der Betrieb wieder aufgenommen werden kann. Es wird momentan daran gearbeitet, kritische Dienste zu sichern und wiederherzustellen.
After restoring critical operational infrastructure services, the HPC service has been resumed. However, a large portion of the GPU nodes is unavailable due to the impact of the blackout. We are working on resolving the problems but cannot yet predict when and whether these nodes will be available again. Nachdem die kritische Infrastruktur zum Betrieb der Systeme wiederhergestellt werden konnte, wurde der HPC-Cluster wieder bereitgestellt und freigegeben. Allerdings sind durch die Auswirkungen des Stromausfalls eine größere Zahl GPU-Knoten nicht mehr verfügbar. Wir arbeiten an der Behebung der Probleme, können allerdings noch keine Prognose geben, wann und ob die Systeme wieder verfügbar sein werden.
Der Großteil der ML Systeme (GPUs) konnten heute wieder hochgefahren und in den Batchbetrieb übergeben werden. The majority of the ML systems (GPUs) were restarted today and are back in batch operation.
Our Slurm workload manager crashed for an unknown reason. Functionality was restored quickly. Further investigations are ongoing.
Currently, a GPU of the GPU login node login23-g-1 shows an issue. The node is unavailable until the issue is resolved.
The issues have been resolved.
It is currently not possible to log in to the login23-* frontends. There is a problem with two-factor authentication.
We are performing a new Top500 run on the ML partition of CLAIX23. The GPU nodes will not be available during that run. Other nodes and login23-g-1 might also be unavailable: i23m[0027-0030],r23m[0023-0026,0095-0098,0171-0174],n23i0001
Aufgrund von dem abgelaufenen Zertifikat für idm.rwth-aachen.de können keine IdM-Anwendungen und die Anwendungen, die über RWTH Single Sign-On angebunden sind, aufgerufen werden. - Beim Aufrufen von IdM-Anwendungen wird eine Meldung zur unsicheren Verbindung angezeigt. - Beim Aufrufen von Anwendungen mit dem Zugang über RWTH Single Sign-On wird eine Meldung zu fehlenden Berechtigungen angezeigt. Wir arbeiten mit Hochdruck an der Lösung des Problems. --- English --- Due to the expired certificate for idm.rwth-aachen.de, no IdM applications and the applications that use RWTH Single Sign-On can be accessed. We are working on a solution. - An insecure connection message is displayed when calling up IdM applications. - When calling up applications with access via RWTH Single Sign-On, a message about missing authorisations is displayed.
Das Zertifikat wurde aktualisiert und die Anwendungen können wieder aufgerufen werden. Bitte löschen Sie den Browsercache, bevor Sie die Seiten wieder aufrufen. /// The certificate has been updated and the applications can be accessed again. Please delete the browser cache before accessing the pages again.
Unfortunately, during the stated period there was a disruption of the RegApp, so that it was not possible to log in to the cluster frontends. Already established connections were not affected. The problem has been resolved.
copy23-2 data transfer system will be unavailable for maintenance.
The maintenance is completed.
The firmware of the InfiniBand Gateways will be updated. The firmware update will be performed in background and should not cause any interruption of service.
The updates are completed.
Aufgrund einer Störung des DNS liefern die Nameserver verschiedener Provider aktuell keine IP-Adresse für Hosts unter *.rwth-aachen.de zurück. Als Workaround können Sie alternative DNS-Server in Ihren Verbindungseinstellungen hinterlegen, wie z.B. die Level3-Nameserver (4.2.2.2 und 4.2.2.1) oder von Comodo (8.26.56.26 und 8.20.247.20). Ggf ist es auch möglich den VPN-Server der RWTH zu erreichen, dann nutzen Sie bitte VPN. // Due to a DNS disruption, the name servers of various providers are currently not returning an IP address for hosts under *.rwth-aachen.de. As a workaround, you can configure alternative DNS servers in your connection settings, e.g. the Level3 name servers (4.2.2.2 and 4.2.2.1) or Comodo (8.26.56.26 and 8.20.247.20). It may also be possible to reach the RWTH VPN server; in that case, please use VPN.
Anleitungen zur Konfiguration eines alternativen DNS-Server unter Windows finden Sie über die folgenden Links: https://www.ionos.de/digitalguide/server/konfiguration/windows-11-dns-aendern/ https://www.netzwelt.de/galerie/25894-dns-einstellungen-windows-10-11-aendern.html Als Alternative können Sie auch VPN nutzen. Wenn Sie den VPN-Server nicht erreichen, können Sie nach der folgenden Anleitung die Host-Datei unter Windows anpassen. Dadurch kann der Server vpn.rwth-aachen.de erreicht werden. Dazu muss der folgenden Eintrag hinzugefügt werden: 134.130.5.231 vpn.rwth-aachen.de https://www.windows-faq.de/2022/10/04/windows-11-hosts-datei-bearbeiten/ // Instructions for configuring an alternative DNS server under Windows can be found via the following links: https://www.ionos.de/digitalguide/server/konfiguration/windows-11-dns-aendern/ https://www.netzwelt.de/galerie/25894-dns-einstellungen-windows-10-11-aendern.html You can also use VPN as an alternative. If you cannot reach the VPN server, you can adjust the host file under Windows according to the following instructions. This will allow you to reach the server vpn.rwth-aachen.de. To do this, the following entry must be added: 134.130.5.231 vpn.rwth-aachen.de https://www.windows-faq.de/2022/10/04/windows-11-hosts-datei-bearbeiten/
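As a quick way to check whether the workaround helps, the following minimal sketch (for illustration only; it assumes the third-party Python package dnspython is installed) queries one of the alternative name servers mentioned above directly:

import dns.resolver  # third-party package "dnspython" (assumed to be installed)

# Build a resolver that ignores the system configuration and asks the
# Level3 name servers from the notice above instead.
resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["4.2.2.2", "4.2.2.1"]

# If the alternative servers work, this prints the address of the VPN host;
# 134.130.5.231 is the address given in the hosts-file entry above.
for record in resolver.resolve("vpn.rwth-aachen.de", "A"):
    print(record.address)

If this prints an address while your normal resolver fails, switching the DNS servers in your connection settings (or adding the hosts-file entry) should restore access.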
Die Host der RWTH Aachen sind nun wieder auch von ausserhalb des RWTH Netzwerkes erreichbar. // The hosts of RWTH Aachen University can now be reached again from outside the RWTH network.
Auch nach der Störungsbehebung am 25.8. um 21 Uhr kann es bei einzelnen Nutzer*innen zu Problemen gekommen sein. Am 26.8. um 9 Uhr wurden alle Nacharbeiten abgeschlossen, sodass es zu keinen weiteren Problemen kommen sollte. // Individual users may have experienced problems even after the fault was rectified on 25 August at 9 pm. On 26.8. at 9 a.m. all follow-up work was completed, so there should be no further problems.
Many nodes suffered an issue after our updates on the 19.08.2024, resulting in jobs failing on the CPU partitions. If your job failed to start or failed on startup, please consider requeuing it if necessary. This list of jobs was identified as possibly being affected by the issue: 48399558,48468084,48470374,48470676,48473716,48473739,48473807,48473831, 48475599,48475607_0,48475607_1,48475607_2,48475607_3,48475607_4,48475607_5, 48475607_6,48475607_7,48475607_8,48475607_9,48475607_10,48475607_11,48475607_12, 48475607_13,48475607_14,48475607_15,48475607_16,48475607_17,48475607_18,48475607_19, 48476753,48482255,48485168,48486404,48488874_5,48488874_6,48488874_7,48488874_8, 48488874_9,48488874_10,48488874_11,48488875_9,48488875_10,48488875_11,48489133_1, 48489133_2,48489133_3,48489133_4,48489133_5,48489133_6,48489133_7,48489133_8,48489133_9, 48489133_10,48489154_0,48489154_1,48489154_2,48489154_3,48489154_4,48489154_5,48489154_6,48489154_7, 48489154_8,48489154_9,48489154_10,48489154_11,48489154_12,48489154_13,48489154_14,48489154_15, 48489154_16,48489154_17,48489154_18,48489154_19,48489154_20,48489154_21,48489154_22,48489154_23, 48489154_24,48489154_25,48489154_26,48489154_27,48489154_28,48489154_29,48489154_30,48489154_31, 48489154_32,48489154_33,48489154_34,48489154_35,48489154_36,48489154_37,48489154_38,48489154_39, 48489154_40,48489154_41,48489154_42,48489154_43,48489154_44,48489154_45,48489154_46,48489154_47, 48489154_100,48489154_101,48489154_102,48489154_103,48489154_104,48489154_105,48489154_106,48489154_107, 48489154_108,48489154_109,48489154_110,48489154_111,48489154_112,48489154_113,48489154_114,48489154_115, 48489154_116,48489154_117,48489154_118,48489154_119,48489154_120,48489154_121,48489154_122,48489154_123, 48489154_124,48489154_125,48489154_126,48489154_127,48489154_128,48489154_129,48489154_130,48489154_131, 48489154_132,48489154_133,48489154_134,48489154_135,48489154_136,48489154_137,48489154_138,48489154_139, 48489154_140,48489154_141,48489154_142,48489154_143,48489154_144,48489154_145,48489154_146,48489154_147, 48489154_148,48489154_149,48489154_150,48489154_151,48489154_152,48489154_153,48489154_154,48489154_155, 48489154_156,48489154_157,48489154_158,48489154_159,48489154_160,48489154_161,48489154_162,48489154_163, 48489154_164,48489154_165,48489154_166,48489154_167,48489154_168,48489154_169,48489154_170,48489154_171, 48489154_172,48489154_173,48489154_174,48489154_175,48489154_176,48489154_177,48489154_178,48489154_179, 48489154_180,48489154_181,48489154_182,48489154_183,48489154_184,48489154_185,48489154_186,48489154_187, 48489154_188,48489154_189,48489154_190,48489154_191,48489154_192,48489154_193,48489154_194,48489154_195, 48489618_1,48489618_2,48489618_3,48489618_4,48489618_5,48489618_6,48489618_7,48489618_8,48489618_9,48489618_10, 48489776,48489806_6,48489806_55,48489806_69,48489806_98,48489842,48489843,48489844,48489845,48489882_1,48489882_2, 48489882_3,48489882_4,48489882_5,48489882_6,48489882_7,48489882_8,48489882_9,48489882_10,48494481,48494490,48494752, 48494753,48494754,48494755,48494756,48494757,48494758,48494759,48494760
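If several of your jobs are affected, a small sketch like the following (illustrative only; the job IDs shown are placeholders, replace them with the IDs from the list above that belong to you) can requeue them via Slurm's scontrol:

import subprocess

# Placeholder job IDs; use the entries from the list above that belong to you.
affected_jobs = ["48399558", "48468084"]

for job_id in affected_jobs:
    # "scontrol requeue <jobid>" puts a failed batch job back into the pending queue.
    result = subprocess.run(["scontrol", "requeue", job_id],
                            capture_output=True, text=True)
    if result.returncode != 0:
        print(f"Requeue of job {job_id} failed: {result.stderr.strip()}")

Requeueing only works for jobs you own that Slurm still has records of; otherwise, simply resubmit the job script.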
Due to updates to our compute nodes, the HPC system will be unavailable for maintenance. The login nodes will be available at noon without interruptions, but the batch queue for jobs won't be usable during the maintenance work. As soon as the maintenance work has been completed, batch operation will be enabled again. These jobs should be requeued if necessary: 48271714,48271729,48271731,48463405,48463406,48463407,48466930, 48466932,48468086,48468087,48468088,48468089,48468090,48468091, 48468104,48468105,48468108,48468622,48469133,48469262,48469404, 48469708,48469734,48469740,48469754,48469929,48470011,48470017, 48470032,48470042,48470045,48474641,48474666,48475362,48489829, 48489831,48489833_2,48489838
Please migrate your notebooks to work with the newer c23 GPU Profiles! -- The migration of the GPU Profiles to Claix 2023 and the new c23g nodes has made the old Python packages use non-optimal settings on the new GPUs. Redeployment of these old profiles is necessary and will take some time.
Since the cluster maintenance, random MPI job crashes are observed. We are currently investigating the issue and are working on a solution.
We have identified the issue and are currently testing workarounds with the affected users.
After successful tests with affected users, we have rolled out a workaround that automatically prevents this issue for our IntelMPI installations. We advise users to remove any custom workarounds from their job scripts to ensure compatibility with future changes.
The c23i partition is DOWN due to unforeseen behaviour of our monitoring system, which automatically sets the only node in the partition to DOWN. A solution is not yet known and will be investigated. The HPC JupyterHub will not be able to use the partition until this is resolved.
Due to a security vulnerability in the Linux kernel, user namespaces are temporarily deactivated. Once the kernel has been updated, user namespaces can be used again.
User namespaces are available again.
The quota system on HPCWORK may not work correctly. A "Disk quota exceeded" error may occur when trying to create files, even though the r_quota command reports that enough quota should be available. The supplier of the filesystem has been informed and is working on a solution.
File quotas for all hpcwork directories were increased to one million.
During the maintenance, $HPCWORK will be reconfigured so that the CLAIX23 nodes access it via RDMA over InfiniBand instead of over Ethernet. At the same time, the kernel will be updated. After the kernel update, the previously deactivated user namespaces will be re-activated.
The maintenance had to be extended for final filesystem tasks
Due to unforeseen problems, the maintenance has to be extended to tomorrow, 16.07.2024, 18:00. We do not expect the manufacturer of the filesystem to need that long and expect to open the cluster again earlier.
The maintenance has been completed successfully. Once again, sorry for the long delay.
The HPC JupyterHub is down after a failed update to 5.0.0 and will stay down until the update work is complete. Update: The HPC JupyterHub could not be updated to 5.0.0 and remains at version 4.1.5.
The FastX web servers on login18-x-1 and login18-x-2 have been stopped, i.e. the addresses https://login18-x-1.hpc.itc.rwth-aachen.de:3300 and https://login18-x-2.hpc.itc.rwth-aachen.de:3300 are not available anymore. Please use login23-x-1 or login23-x-2 instead.
login18-x-1 and login18-x-2 have been decommissioned.
Due to maintenance work on the water cooling system, Claix23 must be empty during the specified period. As soon as the maintenance work has been completed, batch operation will be enabled again. The dialog systems are not affected by the maintenance work.
Additionally, between 10 and 11 o'clock, there will be a maintenance of the RegApp. During this time, new logins will not be possible, existing connections will not be disturbed.
Because Rocky Linux 8.9 has reached its end of life (EOL), the MPI nodes of CLAIX23 must be upgraded to Rocky 8.10. The upgrade is performed in the background during production to minimize cluster downtime. However, during the upgrade, selected free nodes will be taken out of service and will not be available for job submission until their upgrade is completed. Please keep in mind that the installed library versions will likely change with the update; thus, performance and application behaviour may vary compared to earlier runs.
Starting now, all new jobs will be scheduled to Rocky 8.10 nodes. The remaining nodes that still need to be updated are unavailable for job submission. These nodes will be upgraded as soon as possible after their current jobs complete.
The update of the frontend and batch nodes is completed. The remaining nodes (i.e. integrated hosting and service nodes) will be updated during the cluster maintenance scheduled for 2024-06-26.
The dialog nodes (i.e. login23-1/2/3/4, login23-x-1/2) will be updated to Rocky 8.10 today within the weekly reboot. The upgrade of copy23-1/2 will follow.
The copy frontend nodes (copy23-1, copy23-2) will be updated to Rocky Linux 8.10 during the cluster maintenance on 2024-06-26.
The update of the remaining frontend nodes is completed.
Due to technical problems, it is not possible to create, change, or delete HPC accounts or projects. We are working on the issue.
The issue has been resolved.
During this period no RWTH-S, THESIS, LECTURE or WestAI projects can be granted. We apologize for the inconvenience.
Due to maintenance of the RegApp Identity Provider, it is not possible to establish new connections to the cluster during the specified period. Existing connections and batch operation are not affected by the maintenance.
The RegApp is moved to a new server with a new operating system.
(German version below) Due to an open security issue we are required to disable the feature of so-called user namespaces on the cluster. This feature is mainly used by containerization software and affects the way apptainer containers will behave. The changes are effective immediately. Most users should not experience any interruptions. If you experience any problems, please contact us as usual via servicedesk@itc.rwth-aachen.de with a precise description of the features you are using. We will reactivate user namespaces as soon as we can install the necessary fixes for the aforementioned vulnerability. --- Aufgrund eines ausstehenden Sicherheitsproblems müssen wir sogenannte Usernamespaces auf dem Cluster vorübergehend deaktivieren. Dieses Feature wird hauptsächlich von Containervirtualisierungssoftware wie Apptainer genutzt, und die Abschaltung hat einen Einfluss darauf, wie diese Container intern aufgesetzt werden. Die meisten Nutzer sollten von diesen Änderungen nicht direkt betroffen sein und nahtlos weiterarbeiten können. Sollten Sie dennoch Probleme entdecken, kontaktieren Sie uns bitte via servicedesk@itc.rwth-aachen.de und schildern Sie uns, wie konkret Sie Ihre Container starten. Sobald wir einen Patch für die Sicherheitslücke einspielen können, werden wir User Namespaces wieder aktivieren.
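If you are unsure whether your Apptainer workflow is affected, the following minimal sketch (an assumption on our side: it only checks the common max_user_namespaces sysctl, which is one way, but not the only way, such a restriction can be enforced) reports whether unprivileged user namespaces currently appear to be enabled on a node:

from pathlib import Path

def user_namespaces_enabled() -> bool:
    # A value of 0 in this sysctl means no new user namespaces can be created.
    sysctl = Path("/proc/sys/user/max_user_namespaces")
    try:
        return int(sysctl.read_text()) > 0
    except (FileNotFoundError, ValueError):
        return False  # file missing or unreadable; assume the feature is unavailable

if __name__ == "__main__":
    state = "enabled" if user_namespaces_enabled() else "disabled"
    print(f"Unprivileged user namespaces appear to be {state} on this node.")

If the check reports "disabled", containers that rely on user namespaces may fail to start or behave differently, as described above.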
A kernel update addressing the issue has been released upstream and will be available on the compute cluster soon. Once it is installed, user namespaces can be enabled again.
We are planning to re-enable user namespaces on April 29th after some final adjustments.
We are currently observing recurring performance degradation on HPCWORK directories, which might be partly worsened by the ongoing migration process leading up to the filesystem migration on April 17th. The problems cannot be traced back to a single cause but are being actively investigated.
Due to technical problems, we will have to postpone the maintenance (and the final lustre migration step) to 23.04.2024 07:00.
The whole cluster needs to be updated with a new kernel so that user namespaces can be re-enabled (cf. https://maintenance.itc.rwth-aachen.de/ticket/status/messages/14/show_ticket/8929). At the same time, the InfiniBand stack will be updated for better performance and stability. During this maintenance, the dialog systems and the batch system will not be available. The dialog systems are expected to be reopened in the early morning. We do not expect the maintenance to last the whole day and hope to open the cluster earlier.
Due to technical problems, we will have to postpone the maintenance to 23.04.2024 07:00.
Unfortunately, unplanned complications have arisen during maintenance, so that maintenance will have to be extended until midday tomorrow. We will endeavor to complete the work by then. We apologize for any inconvenience this may cause.
In the last weeks, we started migrating all HPCWORK data to a new filesystem. During this maintenance, we will perform the final migration step. HPCWORK will not be available during this maintenance.
Due to technical problems, we will have to postpone the maintenance (and the final lustre migration step) to 23.04.2024 07:00.
During the Claix HPC system maintenance, the HPC JupyterHub will be updated to a newer version. This will improve Claix 2023 support and include mandatory security updates. The whole cluster also needs to be updated with a new kernel.
The migration was successfully completed.
During the stated time Claix-2023 will not be available due to a benchmark run for the Top500 list[1]. Batch jobs which cannot finish before the start of this downtime or which are scheduled during this time period will be kept in queue and started after the cluster resumes operation. [1] https://www.top500.org
The nodes are now available again.
There are currently longer waiting times in the ML partition as the final steps of the acceptance process are still being carried out.
The waiting times should be shorter now.
+++ German version below +++ The RegApp will be updated on 2024-04-03. During the update window, the service will be unavailable for short time intervals. Active sessions should not be affected. +++ English version above +++ Am 03.04.2024 wird die RegApp aktualisiert. Während des Updatefensters kann der Dienst für kurze Zeit unterbrochen sein. Aktive Sitzungen sollten nicht betroffen sein.
There are currently problems when submitting jobs. We are working on fixing the problems and apologize for the inconvenience.
The problem is solved now.
(German version below) Due to an open security issue we are required to disable the feature of so-called user namespaces on the cluster. This feature is mainly used by containerization software and affects the way apptainer containers will behave. The changes are effective immediately. Most users should not experience any interruptions. If you experience any problems, please contact us as usual via servicedesk@itc.rwth-aachen.de with a precise description of the features you are using. We will reactivate user namespaces as soon as we can install the necessary fixes for the aforementioned vulnerability. Update: We have installed a bugfix release for the affected software component and enabled user namespaces again. --- Aufgrund eines ausstehenden Sicherheitsproblems müssen wir sogenannte User Namespaces auf dem Cluster vorübergehend deaktivieren. Dieses Feature wird hauptsächlich von Containervirtualisierungssoftware wie Apptainer genutzt und die Abschaltung hat einen Einfluss darauf, wie diese Container intern aufgesetzt werden. Die meisten Nutzer sollten von diesen Änderungen nicht direkt betroffen sein und nahtlos weiterarbeiten können. Sollten Sie dennoch Probleme entdecken, kontaktieren Sie uns bitte via servicedesk@itc.rwth-aachen.de und schildern Sie uns, wie konkret Sie Ihre Container starten. Sobald wir einen Patch für die Sicherheitslücke einspielen können, werden wir User Namespaces wieder aktivieren. Update: Wir haben einen Bugfix für die betroffene Softwarekomponente installiert und User Namespaces wieder aktiviert.
Zurzeit werden keine Daten auf /hpcwork angezeigt. Die Fachabteilung ist informiert und arbeitet an der Lösung. ---english--- At the moment, no data are shown on /hpcwork. We are working on a solution of the problem.
Die Störung wurde behoben. // The problem has been solved.
Both CLAIX18 copy nodes will be rebooted on Monday, January 29th, 6:00 am (CET) due to a scheduled kernel upgrade. The systems will be temporarily unavailable and cannot be used until the kernel update is finished.
Due to network problems, there may have been issues using the cluster during the specified period.
For the login to login18-4.hpc.itc.rwth-aachen.de it is again mandatory to use two-factor authentication. For details see https://help.itc.rwth-aachen.de/service/rhr4fjjutttf/article/475152f6390f448fa0904d02280d292d/
Momentan kann keine Verbindung zum Windows-Cluster hergestellt werden. Die Kollegen sind informiert und arbeiten an der Behebung des Problems. -- english -- At the moment it is not possible to connect to the windows cluster. We are working on a solution of the problem.
--English Version Below-- Die Störung konnte behoben werden. Eine Verbindung mit dem Windows-Cluster ist wieder möglich. --English Version-- The error has been resolved. You can connect to the Windows cluster again.
The DNS entry for jupyterhub.hpc.itc.rwth-aachen.de is temporarily out of service for 20 minutes. Problems accessing the HPC JupyterHub might arise from this failure. Please wait until the system comes back online.
Der DGX-2-Knoten nd20-02 wird voraussichtlich Montag, den 27.11. und Dienstag, den 28.11. ganztägig nicht zur Verfügung stehen. Grund hierfür ist das Betriebssystemupdate auf Rocky 8. -- The DGX-2 node nd20-02 will not be available on Monday (27.11.) and Tuesday (28.11.) for the whole day. We will be updating the operating system to Rocky 8 in the specified time
The node needs to be reinstalled and cannot be used until further notice.
The update of the system was successful.
Due to maintenance work, the creation of HPC accounts is delayed. Password changes are not possible.
login18-x-2 is defective and is therefore currently unavailable.
The system is OK again.
The complete cluster will not be available from 8 am to 12 noon due to system maintenance. During the maintenance, the HPC cluster will be upgraded to Rocky 8.9.
Due to technical problems, we have to postpone the maintenance to Monday next week.
Due to technical problems, we have to prolong the maintenance.
The maintenance was completed successfully.
Currently, some users receive an error message after logging into the RegApp application. We are already working on a solution. --- Aktuell kommt es bei einigen Nutzern nach dem Login in die Regapp zu einer Fehlermeldung. Wir arbeiten bereits an einer Lösung.
Am 17.10 finden Wartungsarbeiten an der Klimaanlage der Maschinenhalle statt. Aus diesem Grund muss der Batchbetrieb im angegeben Zeitraum angehalten werden und der Cluster leer laufen. Nach den Wartungsarbeiten wird der Batchbetrieb automatisch wieder gestartet. --- Maintenance work on the air conditioning system of the machine hall will take place on 17.10. For this reason, batch operation must be stopped in the specified period and the cluster must run empty. After the maintenance work, batch operation will be restarted automatically.
The maintenance is completed. Jobs are scheduled and executed again. -- Die Wartung ist abgeschlossen. Jobs werden wieder gescheduled und ausgeführt.
Due to a network maintenance in the IT Center building SW23, the HPC Service will be temporarily suspended. During the maintenance, the cluster (including all frontend nodes) will not be available. -- Wegen Wartungsarbeiten am Netzwerk im IT-Center SW23 wird der HPC-Betrieb vorübergehend unterbrochen. Während der Wartung ist der Cluster (alle Frontendknoten einbegriffen) nicht erreichbar.
The network maintenance is completed. Until all services of the cluster are restored, the HPC service will remain suspended.
The cluster is reachable again.
Lustre18 will be temporarily shut down during the maintenance. The frontend nodes will have to be rebooted. -- Lustre18 wird während der Wartung temporär gestoppt. Die Frontendknoten werden erforderlicherweise neu-gestartet.
Aktuell laesst sich auf den HPC-Dialogsystemen das Programm gnome-terminal nicht direkt starten. Wir versuchen aktuell noch herauszufinden, was das Problem ist. Bitte nutzen Sie ersatzweise ein anderes Terminal-Programm wie xterm, mate-terminal oder xfce-terminal. Evtl. ist gnome-terminal auch als Default-Terminal-Applikation in ihrer Desktop-Umgebung eingestellt. In diesem Fall passiert nichts, wenn Sie auf das Terminal-Icon druecken. Sie muessten dann ebenfalls ein anderes Terminal-Programm als Default-Applikation konfigurieren: Currently the program gnome-terminal cannot be started directly on the HPC dialog systems. We are still trying to find out what the problem is. Please use another terminal program like xterm, mate-terminal or xfce-terminal instead. Maybe gnome-terminal is also set as default terminal application in your desktop environment. In this case nothing happens when you press the terminal icon. You would have to configure another terminal program as default application as well: MATE: System - Preferences - Preferred Applications - System - Terminal Emulator XFCE: Applications - Settings - Default Applications - Utilities - Terminal Emulator
One of the DGX-2 systems (nd20-01) will be temporarily unavailable due to a scheduled maintenance. We will be updating the system to Rocky Linux 8.8. Eines der DGX-2-Systeme (nd20-01) wird aufgrund geplanter Wartungsarbeiten vorübergehend nicht verfügbar sein. Wir werden das System auf Rocky 8.8 aktualisieren Update: Due to unforeseen problems, the maintenance has to be extended until Monday. We apologize for the inconvenience. Aufgrund unvorhergesehener Probleme müssen die Wartungsarbeiten bis Montag fortgesetzt werden. Wir bitten die Unannehmlichkeiten zu entschuldigen.
The two systems copy18-1 and copy18-2 will be rebooted for maintenance reasons.
Login - Node login18-2 steht am Dienstag 26.09 von 7 Uhr bis 15 Uhr nicht zur Verfügung. Es werden Arbeiten zur Verbesserung der Netzwerk-Stabilität durchgeführt. Login - Node login18-2 will not be available on Tuesday 26.09. from 7 a.m. to 3 p.m. Work is being carried out to improve network stability.
HPC services may be disrupted currently, e.g. it may not be possible to login to our dialog nodes, to start JupyterLab notebooks or to submit batch jobs. We are working on fixing the issue.
The problems are solved.
Login - Node login18-4 steht am Mittwoch 30.08 von 6 Uhr bis 15 Uhr nicht zur Verfügung. Es werden Arbeiten zur Verbesserung der Netzwerk-Stabilität durchgeführt. Login - Node login18-4 will not be available on Wednesday 30.08. from 6 a.m. to 3 p.m. Work is being carried out to improve network stability.
Login - Nodes: login18-2, login18-g-2, login18-3 stehen am Dienstag 29.08 von 6 Uhr bis 15 Uhr nicht zur Verfügung. Es werden Arbeiten zur Verbesserung der Netzwerk-Stabilität durchgeführt. Login - Nodes: login18-2, login18-g-2, login18-3 will not be available on Tuesday 29.08. from 6 a.m. to 3 p.m. Work is being carried out to improve network stability.
Login - Nodes: login18-x-1, login18-g-1, login18-2 stehen am Montag 28.08 von 6 Uhr bis 14 Uhr nicht zur Verfügung. Es werden Arbeiten zur Verbesserung der Netzwerk-Stabilität durchgeführt. Login - Nodes: login18-x-1, login18-g-1, login18-2 will not be available on Monday 28.08. from 6 a.m. to 2 p.m. Work is being carried out to improve network stability.
We currently strongly recommend using IntelMPI instead of OpenMPI, because OpenMPI jobs crash non-deterministically or remain in a "completing" state and do not complete successfully.
We have identified the root of the issue and are currently working on reverting the batch nodes to a working configuration. This might lead to slightly prolonged waiting times for new jobs. We will update this incident message as soon as all batch nodes are finished with the procedure.
The affected batch nodes are fully operational again.
The JARDS online submission system for filing applications for RWTH computing projects will be unavailable on 25.07.2023 between 7:00 and 17:00.
During the maintenance, the current operating system Rocky Linux 8.7 will be updated to Rocky Linux 8.8. The frontends will also be updated, so you will not be able to log in to the cluster or access your data. There is one exception, however: the MFA test machine login18-4 will remain reachable, but you can only log in there with a second factor [1]. At times, $HPCWORK will not be reachable there either, since the Lustre filesystem will also undergo maintenance. We do not expect that you will have to recompile your software or change your job scripts. Your jobs should therefore start normally after the maintenance ends.
The Windows dialog systems (cluster-win.rz.rwth-aachen.de) will not be available due to a necessary relocation of the server hardware.
Several users are currently experiencing difficulties logging in to the login18-x-1 frontend. We are investigating the problem. For the meantime, please use login18-x-2 instead.
Due to network problems the login to login18-x-1 has been deactivated until further notice.
The error has been fixed.
Login - Nodes: login18-x-1, login18-g-1, login18-2 stehen am Montag 22.05 von 8 Uhr bis 14 Uhr nicht zur Verfügung. Es werden Arbeiten zur Verbesserung der Netzwerk-Stabilität durchgeführt. Login - Nodes: login18-x-1, login18-g-1, login18-2 will not be available on Monday 22.05. from 8 a.m. to 2 p.m. Work is being carried out to improve network stability.
Data Transfer Node copy18-2 steht von Dienstag 16.5. 9:45 Uhr bis Mittwoch 17.5. 15:00 Uhr nicht zur Verfügung. Es werden Arbeiten zur Verbesserung der Netzwerk-Redundanz durchgeführt. --- Data Transfer Node copy18-2 will not be available from Tuesday 16.5. 9:45 a.m. to Wednesday 17.5. 3:00 p.m. Work will be done to improve network redundancy.
There will be a maintenance of our ACI tenant, which results in a network interruption for all our VMs, including LDAP, Kerberos, cvmfs, etc. Thus, it is not possible to log in during the maintenance, and we expect that already logged-in users could also face problems. Regarding running jobs (if there are any), we do not know exactly how they will be affected, but we hope that they can run through as expected.
We will have to postpone the maintenance. A new timeslot still needs to be found, so please treat the changed time as preliminary.
The maintenance will take place on Monday 15.05.2023 from 13:00 to 14:00
-- english version below -- In der angegebenen Zeit findet die Umstellung des Cluster von CentOS 7 auf Rocky 8 statt. Dabei werden von Woche zu Woche weitere Systeme mit Rocky 8 neuinstalliert und im Batchbetrieb zur Verfügung gestellt. Durch diese Umstellung kann es zu höheren Wartezeiten im Batchbetrieb kommen. Mehr Informationen finden Sie auf der folgenden Seite: https://help.itc.rwth-aachen.de/service/rhr4fjjutttf/article/c3735af4173543b9b14a3f645a553e8a/ --- In the given time the changeover of the cluster from CentOS 7 to Rocky 8 takes place. From week to week more systems will be reinstalled with Rocky 8 and made available in batch mode. Due to this changeover, there may be longer waiting times in batch mode. More information can be found on the following page: https://help.itc.rwth-aachen.de/service/rhr4fjjutttf/article/c3735af4173543b9b14a3f645a553e8a/
Incident JARDS online submission system
Due to an unplanned restart of many virtual machines, access was disrupted.
Data Transfer Node copy18-2 steht am Donnerstag 13.04. zwischen 10:00 Uhr und 15:00 Uhr nicht zur Verfügung. Es werden Arbeiten zur Verbesserung der Netzwerk-Redundanz durchgeführt. --- Data Transfer Node copy18-2 will not be available on Thursday 13.04. between 10:00 and 15:00. Work is being carried out to improve network redundancy.
In this maintenance we will switch the operating system from CentOS 7 to Rocky 8 on the following dialog systems: login18-2.hpc.itc.rwth-aachen.de login18-3.hpc.itc.rwth-aachen.de login18-x-2.hpc.itc.rwth-aachen.de login18-g-2.hpc.itc.rwth-aachen.de copy18-2.hpc.itc.rwth-aachen.de More background information concerning this change will be provided on the rz-cluster mailing list.
The system will not be available during the maintenance period. Please use copy18-1 instead.
The dialog systems copy18-2, login18-2, login18-3 and login18-g-2 will not be available during the maintenance period. Please use one of the other dialog systems instead.
During the maintenance, no new passwords can be set for the HPC service. New HPC accounts will only actually be created once the maintenance has ended.
Unfortunately, the maintenance has to be extended.
The maintenance is completed.
Access to the home directories of the RWTH Compute Cluster from Windows clients is currently not possible due to maintenance.
Access is working again.
To improve the backup capability of the HOME file system, we need to restructure the home directories. During this work, login to the frontend nodes is disabled, no Slurm jobs are executed, and you cannot check the status of your jobs.
The system maintenance is finished.
Currently, starting a web desktop session on http://login18-x-1.hpc.itc.rwth-aachen.de:3000/ may fail. Please log in on http://login18-x-2.hpc.itc.rwth-aachen.de:3000/ instead, or use the FastX desktop client; see https://help.itc.rwth-aachen.de/service/rhr4fjjutttf/article/25f576374f984c888bb2a01487fef193/
JARDS online submission system for NHR projects (NHR large, NHR normal, Prep) will not be available on 09.02.23 between 8 and 9 o'clock.
JARDS online submission system for NHR projects (NHR large, NHR normal, Prep) will not be available on 06.02.23 between 7 and 9 o'clock.
Please note that the JARDS online submission system for filing applications for RWTH computing projects (rwth small, rwth thesis and rwth lecture) will not be available on 30.01.23 between 7 and 9 o'clock.
Please note that the JARDS online submission system for filing applications for NHR projects (NHR large, NHR normal, Prep) and RWTH computing projects (rwth small, rwth thesis and rwth lecture) will not be available on 26.01.23 between 7 and 8 o'clock.