Posts

PACE Maintenance Period Aug 06-09 2024

[Update 07/31/24 02:23pm]

WHEN IS IT HAPPENING?

PACE’s next Maintenance Period starts at 6:00 AM on Tuesday, August 6th (08/06/2024) and is tentatively scheduled to conclude by 11:59 PM on Friday, August 9th (08/09/2024). An extra day is needed to accommodate the additional testing required because both RHEL7 and RHEL9 versions of our systems are in service as we migrate to the new operating system. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard, along with their associated RHEL9 environments) as soon as maintenance work and testing are completed. We plan to focus on the largest portion of each system first, to ensure that access to data and compute capabilities is restored as soon as possible.

Also, we have CANCELED the November maintenance period for 2024 and do NOT plan to have another maintenance outage until early 2025.

WHAT DO YOU NEED TO DO?   

As usual, jobs whose resource requests would overlap the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable, including Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime. CEDAR storage will not be affected.

For Phoenix, we are migrating 427 nodes (~30% of the ~1400 total nodes on Phoenix) from RHEL7 to RHEL9 in August. The new RHEL9 nodes will not be available immediately after the Maintenance Period is completed but will come online the following week (August 12th – 16th). After this migration, about 50% of the Phoenix cluster will be on RHEL9, including all but 20 GPU nodes. Given this, we strongly encourage Phoenix users who have not yet migrated their workflows to RHEL9 to do so as soon as possible.
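
If you have not yet validated your workflow on RHEL9, a short test job is a low-risk way to confirm that your modules and locally built software still work. The sketch below is illustrative only: it assumes jobs submitted from the Phoenix RHEL9 login environment land on RHEL9 compute nodes, and the module name, resource requests, and any required account or QOS flags are placeholders.

    #!/bin/bash
    #SBATCH -J rhel9-test              # short job to validate a workflow on RHEL9
    #SBATCH -N 1 --ntasks-per-node=4
    #SBATCH -t 00:10:00

    # Confirm the node's OS release; expect a RHEL 9.x version string.
    grep -E '^(NAME|VERSION)=' /etc/os-release

    # Reload your module stack; module names and versions can differ between
    # the RHEL7 and RHEL9 software trees.
    module purge
    module load gcc                    # illustrative module name

    # Run a short, representative step of your workflow here.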

WHAT IS HAPPENING?   

ITEMS REQUIRING USER ACTION: 

  • [Phoenix and Hive] Continue migrating nodes to the RHEL 9 operating system
    • Migrate 427 nodes to RHEL9 in Phoenix 
    • Migrate 100 nodes to RHEL9 in Hive 
  • [Phoenix, Hive, Firebird, ICE] GPU nodes will receive new versions of the NVIDIA drivers, which *may* impact locally built tools using CUDA (see the sketch following this list). 
  • [Phoenix] H100 GPU users on Phoenix should use the RHEL9 login node to avoid module environment issues.
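
For users who maintain locally compiled CUDA tools, the commands below are one way to check the driver and toolkit versions after the maintenance and rebuild if needed. This is a minimal sketch: the module names and build commands are illustrative and will vary by project.

    # On a GPU node (inside an interactive or batch job):
    nvidia-smi --query-gpu=name,driver_version --format=csv    # driver version after the update
    nvcc --version                                             # CUDA toolkit used for your build

    # If a locally built tool fails to load (e.g., driver/runtime mismatch errors),
    # rebuild it against a CUDA toolkit compatible with the new driver.
    module avail cuda      # list available CUDA modules (names are site-specific)
    module load cuda       # illustrative; load a specific version in practice
    make clean && make     # or your project's usual build steps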

ITEMS NOT REQUIRING USER ACTION: 

  • [all] Databank cooling loop work, which will require shutdown of all systems 
  • [all] Upgrade to RHEL 9.4 from 9.3 on all RHEL9 nodes – should not impact user-installed software 
  • [all] Research and Enterprise Hall Ethernet switch code upgrade 
  • [all] Upgrade PACE welcome emails 
  • [all] Upgrade Slurm scheduler nodes to RHEL9 
  • [CEDAR] Adding SSSD and IDmap configurations to RHEL7 nodes to allow correct group access across PACE resources 
  • [Phoenix] Updates to Lustre storage to improve stability  
    • File consistency checks across all metadata servers, appliance firmware updates, external metadata server replacement on project storage 
  • [Phoenix] Install additional InfiniBand interfaces to HGX servers 
  • [Phoenix] Migrate OOD Phoenix RHEL9 apps 
  • [Phoenix, Hive] Enable Apptainer self-service 
  • [Phoenix, Hive, ICE] Upgrade Phoenix/Hive/ICE subnet managers to RHEL9 
  • [Hive] Upgrade Hive storage for new disk replacement to take effect 
  • [ICE] Updates to Lustre scratch storage to improve stability 
    • File consistency checks and appliance firmware updates 
  • [ICE] Retire ICE enabling rules for ECE 
  • [ICE] Migrate ondemand-ice server to RHEL9 

WHY IS IT HAPPENING?

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED?  

All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?   

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you,

-The PACE Team 

[Update 07/15/24 03:36pm]

WHEN IS IT HAPPENING?  

PACE’s next Maintenance Period starts at 6:00 AM on Tuesday, August 6th, 08/06/2024, and is tentatively scheduled to conclude by 11:59 PM on Friday, August 9th, 08/09/2024. The additional day is needed to accommodate the extra testing required because both RHEL7 and RHEL9 versions of our systems are in service as we migrate to the new operating system. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard, along with their associated RHEL9 environments) as soon as maintenance work and testing are completed. We plan to focus on the largest portion of each system first, to ensure that access to data and compute capabilities is restored as soon as possible.  
 
Additionally, we have cancelled the November maintenance period for 2024 and do not plan to have another maintenance outage until early 2025.

WHAT DO YOU NEED TO DO?   

As usual, jobs whose resource requests would overlap the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable, including Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime. CEDAR storage will not be affected. 

WHAT IS HAPPENING?   

ITEMS REQUIRING USER ACTION: 

  • [Phoenix and Hive] Continue migrating nodes to the RHEL 9.3 operating system.  

ITEMS NOT REQUIRING USER ACTION: 

  • [all] Databank cooling loop work, which will require shutdown of all systems 
  • [CEDAR] Adding SSSD and IDmap configurations to allow correct group access across PACE resources 
  • [Phoenix] Updates to Lustre storage to improve stability  
    • File consistency checks across all metadata servers, appliance firmware updates, external metadata server replacement on /storage/coda1 

WHY IS IT HAPPENING?  

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.  

WHO IS AFFECTED?  

All users across all PACE clusters.  

WHO SHOULD YOU CONTACT FOR QUESTIONS?   

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.  

Thank you,  

-The PACE Team 

Phoenix project storage outage

[Update 7/9/24 12:00 PM]

Phoenix project storage has been repaired, and the scheduler has resumed. All Phoenix services are now functioning.

We have updated a parameter to throttle the number of operations on the metadata servers to improve stability.

Please contact us at pace-support@oit.gatech.edu if you encounter any remaining issues.

[Original Post 7/8/24 4:40 PM]

Summary: Phoenix project storage is currently inaccessible. We have paused the Phoenix scheduler, so no new jobs will start.

Details: Phoenix Lustre project storage has been slow and intermittently unresponsive throughout the day today. The PACE team identified a few user jobs causing a high workload on the storage system, but the load remained high on one metadata server, which eventually stopped responding. Our storage vendor recommended a failover to a different metadata server as part of a repair, but the failover left the system fully unresponsive. PACE and our storage vendor continue to work on restoring full access to project storage.

Impact: The Phoenix scheduler has been paused to prevent new jobs from hanging, so no new jobs can start. Currently-running jobs may not make progress and should be cancelled if stuck. Home and scratch directories remain accessible, but an ls of the full home directory may hang due to the symbolic link to project storage.
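
If you are unsure whether a running job is stuck, checking its output files and the queue is usually enough. A minimal sketch (the job ID and script name are placeholders):

    # List your running jobs on Phoenix
    squeue -u $USER -t RUNNING

    # If a job has stopped making progress (e.g., its output files are no longer
    # growing), cancel it; replace 1234567 with the job ID reported by squeue.
    scancel 1234567

    # Resubmit after project storage is restored.
    sbatch my_job.sbatch      # placeholder script name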

Thank you for your patience as we work to restore Phoenix project storage. Please contact us at pace-support@oit.gatech.edu with any questions. You may visit https://status.gatech.edu/ for additional updates.

IDEaS storage Maintenance

WHAT’S HAPPENING?

One of the IDEaS IntelliFlash controller cards needs to be reseated. Before reseating the card, we will fail over all resources to controller B, shut down controller A, pull the whole enclosure out, and reseat the card. The activity takes about 2 hours to complete. 

WHEN IS IT HAPPENING?

Monday, July 8th, 2024, starting at 9 AM EDT.

WHY IS IT HAPPENING?

We are working with the vendor to resolve an issue discovered while debugging the controllers and to restore the system to a healthy status.

WHO IS AFFECTED?

Users of the IDEaS storage system will notice decreased performance since all services will be switched over to a single controller. It is possible that access will be interrupted while the switch happens. 

WHAT DO YOU NEED TO DO?

During the maintenance, data access should be preserved, and we do not expect downtime. However, there have been cases in the past where storage has become inaccessible. If storage does become unavailable during the work, jobs accessing the IDEaS storage may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage can be accessed.

WHO SHOULD YOU CONTACT FOR QUESTIONS?

For any questions, please contact PACE at pace-support@oit.gatech.edu.

[OUTAGE] Phoenix Project Storage

[Update 06/20/2024 04:58pm]

Dear Phoenix Users,

Summary: The Phoenix cluster is back online. The scheduler has been unpaused, jobs that were placed on hold have resumed, and the file system is ready for use.

Details: All the appliance components for Phoenix project storage were restarted, and file system consistency was confirmed. We’ll continue to monitor it and run additional consistency checks over the next few days.

Impact: If you were running jobs on Phoenix and using project storage, please verify that your jobs have not run into any issues. We will be issuing refunds for all impacted jobs, so please reach out to pace-support@oit.gatech.edu if you have encountered any issues.

Thank you for your patience,

-The PACE Team

[Update 06/20/2024 01:36 pm]

Summary: The metadata servers for Phoenix project storage (/storage/coda1) are currently down due to degraded performance.

Details: During additional testing with the storage vendor as part of investigation of the performance issues from this morning, it was necessary to bring the storage fully offline, rather than resuming service.

Impact: We have paused the scheduler for now, so you will not be able to start jobs on Phoenix. We will release the scheduler once we have verified that project storage is stable. Access to project storage (/storage/coda1) is currently interrupted; however, scratch storage (/storage/scratch1) is not affected. If you were running jobs on Phoenix and using project storage, please verify that your jobs have not run into any issues. We will be issuing refunds for all impacted jobs as usual.

Only project storage on Phoenix is affected – storage on Hive, ICE, Buzzard, and Firebird works without issues.

Thank you for your patience as we work with our storage vendor to resolve this outage. We will continue to provide updates as work continues.

Please contact us at pace-support@oit.gatech.edu with any questions.

Degraded Phoenix Project Storage Performance

Summary: The metadata servers for Phoenix project storage (/storage/coda1) restarted on their own, with one of them not responding, leading to degraded performance on the project storage file system.

Details: We have restarted the servers in order to restore access. Testing performance of the file system is ongoing. We will continue to monitor performance and work with the vendor to find the cause.

Impact: We have paused the scheduler for now, so you will not be able to start jobs on Phoenix. We will release the scheduler once we have verified that storage is stable. Access to project storage (/storage/coda1) might have been interrupted for some users. If you are running jobs on Phoenix and using project storage, please verify that your jobs have not run into any issues. Only storage on Phoenix should be affected; storage on Hive, ICE, Buzzard, and Firebird works without issues.

IDEaS Storage Outage Resolved

Summary: PACE’s IDEaS storage was unreachable early this morning. Access was restored at approximately 9:00 AM.

Details: One controller on the IDEaS IntelliFlash storage became unresponsive, and the resource could not switch to the redundant controller. Rebooting both controllers restored access. PACE is working with our storage vendor to identify the cause.

Impact: IDEaS storage could not be reached from PACE or from external mounts during the outage. Any jobs on Phoenix or Hive using IDEaS storage would have failed. If you had a job on Phoenix using IDEaS storage that failed, please email pace-support@oit.gatech.edu to request a refund.

Thank you for your patience as we resolved the issue this morning. Please contact us at pace-support@oit.gatech.edu with any questions.

Hive Storage Maintenance

WHAT’S HAPPENING?

One of the storage controllers in use for Hive requires a hard drive replacement to restore the high availability of the device. The activity takes about 2 hours to complete. 

WHEN IS IT HAPPENING?

Tuesday, June 11th, 2024, starting at 10 AM EDT.

WHY IS IT HAPPENING?

The failed drive limits the high availability of the controller.

WHO IS AFFECTED?

Users of the Hive storage system will notice decreased performance since all services will be switched over to a single controller. It is possible that access will be interrupted while the switch happens. 

WHAT DO YOU NEED TO DO?

During the hard drive replacement for the Hive cluster, one of the controllers will be shut down, and the redundant controller will take all the traffic. Data access should be preserved, and we do not expect downtime, but there have been cases in the past where storage has become inaccessible. If storage does become unavailable during the replacement, your job may fail or run without making progress. If you have such a job, please cancel it and resubmit it once storage can be accessed.

WHO SHOULD YOU CONTACT FOR QUESTIONS?

For any questions, please contact PACE at pace-support@oit.gatech.edu.

Firebird scheduler outage resolved

Summary: A configuration issue with the Firebird scheduler caused Firebird jobs to fail over the weekend and this morning because storage was not accessible on compute nodes. The issue was resolved by 2:00 PM today.

Details: Changes to the Firebird scheduler configuration were made during last week’s maintenance period (May 7-9) in order to facilitate future updates to Firebird. A repair was made on Friday, after which jobs were running successfully. Over the weekend, a different issue occurred, and jobs were launched on compute nodes without the proper storage being mounted. We have fully reverted the Firebird configuration changes to their state prior to the maintenance period, and jobs should no longer face any errors.

Impact: Some jobs launched on Firebird over the last three days may have failed due to missing home and project storage on the compute nodes with messages like “no such file or directory” or an absent output file. Jobs attempted mid-day on Monday, May 13, may have been queued for an extended period while repairs were made to the scheduler configuration.

Thank you for your patience as we resolved this issue. Please contact us at pace-support@oit.gatech.edu with questions or if you continue to experience errors.

PACE Maintenance Period (May 07 – May 10, 2024) 

[Update 05/09/24 04:25 PM]

Dear PACE users,   

The maintenance on the Phoenix, Hive, Firebird, and OSG Buzzard clusters has been completed. These clusters are back in production and ready for research; all jobs that had been held by the scheduler have been released. 

The ICE cluster is still under maintenance due to the RHEL9 migration, but we expect it to be ready tomorrow. Instructors teaching summer courses will be notified when it is ready. 

The POSIX user group names on the Phoenix, Hive, Firebird, and OSG Buzzard clusters have been updated so that names now start with the “pace-” prefix. If your scripts or workflows rely on POSIX group names, they will need to be updated; otherwise, no action is required on your part. This is a step towards tighter integration of PACE systems with central IAM tools, which will lead to improvements across the board in the PACE user experience.
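
To see how the rename affects you, list your group memberships and search your scripts for hard-coded group names. A minimal sketch (the path and group names below are placeholders, not actual PACE groups):

    # List the POSIX groups your account belongs to; names now carry the "pace-" prefix.
    id -Gn

    # Search your scripts for hard-coded group names that may need the new prefix.
    grep -rn -E 'chgrp|chown|newgrp|sg ' ~/my-scripts      # placeholder path

    # Example update: a group previously named "gts-example0" would now be
    # referenced as "pace-gts-example0" (placeholder names).
    chgrp pace-gts-example0 shared_file.dat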

Just a reminder that the next Maintenance Period will be August 6-8, 2024.

Thank you for your patience! 

-The PACE Team 

[Update 05/07/24 06:00 AM]

PACE Maintenance Period starts now at 6:00 AM on Tuesday, 05/07/2024, and is tentatively scheduled to conclude by 11:59 PM on Friday, 05/10/2024.

[Update 05/01/24 06:37 PM]

WHEN IS IT HAPPENING?

PACE’s next Maintenance Period starts at 6:00 AM on Tuesday, May 7th, 05/07/2024, and is tentatively scheduled to conclude by 11:59 PM on Friday, May 10th, 05/10/2024. An extra day is needed to accommodate physical work done by Databank in the Coda Data Center. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete. 

WHAT DO YOU NEED TO DO?

As usual, jobs whose resource requests would overlap the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable, including Phoenix, Hive, Firebird, ICE, CEDAR, and Buzzard. Please plan accordingly for the projected downtime. 

WHAT IS HAPPENING?

ITEMS REQUIRING USER ACTION: 

  • [all] During the maintenance period, the PACE team will rename all POSIX user groups so that names will start with the “pace-” prefix.
    • This will NOT affect numerical GIDs, but if your scripts or workflows rely on group names, they will need to be updated.
    • If you don’t use POSIX user group names in your scripts or workflows, no action is required on your part.
    • This is a step towards tighter integration of PACE systems with central IAM tools, which will lead to improvements across the board in the PACE user experience.
    • NOTE: This item was originally planned for January but was delayed to avoid integration issues with IAM services, which have now been resolved.
  • [ICE] Migrate to the RHEL 9.3 operating system – if you need access to ICE for any summer courses, please let us know! 
    • The ICE login nodes will be updated to RHEL 9.3 as well, and this WILL create new ssh host-keys on ICE login nodes – so please expect a message that “WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!” when you (or your students) next access ICE after maintenance. 
  • [ICE] We will be retiring 8 of the RTX6000 GPU nodes on ICE to prepare for the addition of several new L40 nodes the week after the maintenance period. 
  • [software] Sync Gaussian and VASP on RHEL7 pace-apps.
  • [software] Sync any remaining RHEL9 pace-apps for the OS migration.
  • [Phoenix, ICE] Upgrade Nvidia drivers on all HGX/DGX servers.
  • [Hive] The scratch deleter will not run in May and June but will resume in July.
  • [Phoenix] The scratch deleter will not run in May but will resume in June.
  • [ICE] The scratch deleter will run for Spring semester deletion during the week of May 13.

ITEMS NOT REQUIRING USER ACTION: 

  • [datacenter] Databank maintenance: replace all components of the cold loop water pump that had issues a couple of maintenance periods ago.  
  • [Hive] Upgrade the underlying GPFS filesystem to version 5.1 in preparation for RHEL9.
  • [datacenter] Repairs to one InfiniBand switch and two DDN storage controllers with degraded BBUs (Battery Backup Unit).
  • [datacenter] Upgrade storage controller firmware for DDN appliances to SFA 12.4.0. 
  • [Hive] Consolidate all the ICE access entitlements into a single one, all-pace-ice-access.
  • [Hive] Upgrade Hive compute nodes to GPFS 5.1.
  • [Phoenix] Replace cables for the Phoenix storage server.
  • [Firebird] Patch Firebird storage server to 100GbE switch and reconfigure.
  • [Firebird, Hive] Deploy Slurm scheduler CLI+Feature bits on Firebird and Hive. 
  • [datacenter] Configure LDAP on the MANTA NetApp HPCNA SVM.

WHY IS IT HAPPENING?  

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.  

WHO IS AFFECTED?  

All users across all PACE clusters.  

WHO SHOULD YOU CONTACT FOR QUESTIONS?   

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you,  

-The PACE Team 

[Update 04/22/24 09:53 AM]

WHEN IS IT HAPPENING?  

PACE’s next Maintenance Period starts at 6:00AM on Tuesday May 7th, 05/07/2024, and is tentatively scheduled to conclude by 11:59PM on Friday May 10th, 05/10/2024. The additional day is needed to accommodate physical work carried out by Databank in the Coda datacenter. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete. 

WHAT DO YOU NEED TO DO?   

As usual, jobs whose resource requests would overlap the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable, including Phoenix, Hive, Firebird, ICE, CEDAR, and Buzzard. Please plan accordingly for the projected downtime. 

WHAT IS HAPPENING?   

ITEMS REQUIRING USER ACTION: 

  • [all] During the maintenance period, the PACE team will rename all POSIX user groups so that names will start with the “pace-” prefix.  
    • This will NOT affect numerical GIDs, but if your scripts or workflows rely on group names, they will need to be updated. 
    • If you don’t use POSIX user group names in your scripts or workflows, no action is required on your part. 
    • This is a step towards tighter integration of PACE systems with central IAM tools, which will lead to improvements across the board in the PACE user experience.  
    • NOTE: This item was originally planned for January, but was delayed to avoid integration issues with IAM services, which have now been resolved.
  • [ICE] Migrate to the RHEL 9.3 operating system – if you need access to ICE for any summer courses, please let us know! 
    • Note – This WILL create new ssh host-keys on ICE login nodes – so please expect a message that “WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!” when you (or your students) next access ICE after maintenance.
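
When that warning appears after maintenance, removing the cached host key and reconnecting is usually all that is needed. A minimal sketch, using an illustrative hostname (substitute the ICE login hostname you normally connect to):

    # Remove the cached host key for the ICE login node, then reconnect and
    # accept the new RHEL9 host key when prompted.
    ssh-keygen -R login-ice.pace.gatech.edu        # hostname is illustrative
    ssh your-gt-username@login-ice.pace.gatech.edu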

ITEMS NOT REQUIRING USER ACTION: 

  • [datacenter] Databank maintenance: replace all components of the cold loop water pump that had issues a couple of maintenance periods ago.  
  • [Hive] Upgrade the underlying GPFS filesystem to version 5.1 in preparation for RHEL9 
  • [datacenter] Repairs to one InfiniBand switch and two DDN storage controllers with degraded BBUs (Battery Backup Unit) 
  • [datacenter] Upgrade storage controller firmware for DDN appliances to SFA 12.4.0. 

WHY IS IT HAPPENING?  

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.  

WHO IS AFFECTED?  

All users across all PACE clusters.  

WHO SHOULD YOU CONTACT FOR QUESTIONS?   

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Phoenix A100 CPU:GPU Ratio Change

On Phoenix, the default number of CPUs assigned to jobs requesting an Nvidia Tensor Core A100 GPU has recently changed. Now, jobs requesting one or more A100 GPUs will be assigned 8 cores per GPU by default, rather than 32 cores per GPU. You may still request up to 32 cores per GPU if you wish by using the --ntasks-per-node flag in your SBATCH script or salloc command to specify the number of CPUs per node your job requires. Any request with a CPU:GPU ratio of at most 32 will be honored.

12 of our Phoenix A100 nodes host 2 GPUs and 64 CPUs (AMD Epyc 7513), supporting a CPU:GPU ratio up to 32, and can be allocated through both the inferno (default priority) and embers (free backfill) QOSs. We have recently added 1 more A100 node with 8 GPUs and 64 CPUs (AMD Epyc 7543), requiring this change to the default ratio. This new node is available only to jobs using the embers QOS due to the funding for its purchase.
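
As an illustration of the new defaults, the batch header below requests one A100 GPU with the full 32 cores per GPU. This is a minimal sketch: the gres type name and application are placeholders, the inferno and embers QOS names are taken from this post, and your job may also need an account or other site-specific flags.

    #!/bin/bash
    #SBATCH -J a100-example
    #SBATCH -N 1
    #SBATCH --gres=gpu:A100:1          # one A100 GPU (gres type name is illustrative)
    #SBATCH --ntasks-per-node=32       # up to 32 CPUs per A100 GPU; the default is now 8
    #SBATCH -t 01:00:00
    #SBATCH -q inferno                 # or embers for free backfill

    nvidia-smi                         # confirm the allocated GPU
    srun ./my_gpu_app                  # placeholder application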

Please visit our documentation to learn more about GPU requests and QOS or about compute resources on Phoenix, and contact us with any questions about this change.