PACE Maintenance Period – 01/12/26 to 01/16/26

WHEN IS IT HAPPENING?  

PACE’s next Maintenance Period starts at 6:00AM on Monday, January 12th, and is scheduled to end no later than 11:59PM on Thursday, January 15th; ICE will open to Spring 2026 courses on Friday, January 16th. The additional day is needed to install a second cooling pump at the data center to provide redundancy for PACE clusters. PACE will release each cluster (Phoenix, Firebird, ICE, and Buzzard) as soon as maintenance work and testing are completed.

 
WHAT DO YOU NEED TO DO?   

As usual, the scheduler will hold any job whose requested resources would overlap the Maintenance Period until the maintenance is complete. During this Maintenance Period, all PACE-managed computational and storage resources will be unavailable, including Phoenix, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime. CEDAR storage will not be affected.
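
If you want to confirm that queued jobs are simply waiting for the maintenance window, you can ask Slurm for their estimated start times and pending reasons. The snippet below is a minimal sketch, not an official PACE tool, assuming Slurm's squeue command is available on a login node; jobs held for the maintenance reservation typically show a pending reason such as "ReqNodeNotAvail, Reserved for maintenance".

```python
# Minimal sketch: list your pending jobs with expected start times and reasons.
# Assumes Slurm's "squeue" command is on the PATH of the login node.
import getpass
import subprocess

user = getpass.getuser()

# "squeue --start" prints pending jobs with their estimated START_TIME and REASON;
# jobs held for the maintenance window typically report a reason like
# "ReqNodeNotAvail, Reserved for maintenance".
result = subprocess.run(
    ["squeue", "--start", "-u", user],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)
```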

WHAT IS HAPPENING?   

ITEMS REQUIRING USER ACTION: 

  • None 

ITEMS NOT REQUIRING USER ACTION: 

  • [all] DataBank will install a second cooling pump into the research hall cooling loop, providing redundancy. 
  • [all] Apply maintenance updates to all compute nodes 
  • [Phoenix, ICE, Firebird] Upgrade clusters to Slurm 25.05.5 
  • [Storage] Enable Write Back on all VAST storage for performance improvements 
  • [all] Replace some PDU and IB network switches with new equipment 
  • [Storage] Apply maintenance upgrades to Lustre file system appliances 

WHY IS IT HAPPENING?  

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.  

WHO IS AFFECTED?  

All users across all PACE clusters.  

WHO SHOULD YOU CONTACT FOR QUESTIONS?   

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Globus Connectors for cloud storage on PACE

We’d like to highlight Globus Connectors for cloud storage as the best way to transfer files between PACE and cloud storage services, including Dropbox, Box, and OneDrive. Globus Connectors make it easy to move files between PACE storage (on Phoenix, ICE, or CEDAR) and cloud storage through Globus’s web interface.

Please avoid large transfers to/from the cloud via rclone or other services on the login node (such as the Dropbox API), as these can cause heavy load on the campus network and impact other researchers. PACE has purchased the cloud connectors to provide a better option that is easier to use and puts less strain on the network.

You can learn more about how to use Globus, including the cloud connectors, on PACE in our documentation. Please contact us with questions, or to suggest other cloud storage services for which connectors could be installed to enable your research.
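
For scripted transfers, the same connectors can also be driven from the Globus Python SDK rather than the web interface. The snippet below is only a sketch, assuming you already hold a transfer-scoped Globus access token; the token and the two collection UUIDs are placeholders you would replace with values from the Globus web app, not real PACE identifiers.

```python
# Sketch of a Globus transfer between a PACE collection and a cloud-connector
# collection using the Globus Python SDK (globus_sdk). All identifiers below
# are placeholders, not real PACE or connector UUIDs.
import globus_sdk

ACCESS_TOKEN = "..."      # transfer-scoped token obtained from a Globus login flow
PACE_COLLECTION = "..."   # UUID of the PACE collection (e.g., Phoenix storage)
CLOUD_COLLECTION = "..."  # UUID of the cloud connector collection (e.g., OneDrive)

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(ACCESS_TOKEN)
)

# Describe the transfer: one file from PACE storage to the cloud collection.
tdata = globus_sdk.TransferData(
    tc, PACE_COLLECTION, CLOUD_COLLECTION, label="PACE to cloud (example)"
)
tdata.add_item("/path/on/pace/results.tar.gz", "/results.tar.gz")

# Globus manages the transfer itself (retries, integrity checks, notifications),
# so nothing heavy runs on the PACE login nodes.
task = tc.submit_transfer(tdata)
print("Submitted Globus transfer task:", task["task_id"])
```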

1-Week Reminder – PACE Maintenance Period (Oct 6 – Oct 8, 2025)

WHEN IS IT HAPPENING?

PACE’s next Maintenance Period starts at 6:00AM on Monday, 10/06/2025, and is tentatively scheduled to conclude by 11:59PM on Wednesday, 10/08/2025. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete.

WHAT DO YOU NEED TO DO? 

As usual, the scheduler will hold any job whose requested resources would overlap the Maintenance Period until the maintenance is complete. During this Maintenance Period, all PACE-managed computational and storage resources will be unavailable, including Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime.

WHAT IS HAPPENING? 

  • All Systems: Cooling tower maintenance and cleanup
  • All Systems: Updating to RHEL 9.6 Operating System
  • Phoenix: New GNR (Granite Rapids) login nodes coming online!
  • Phoenix and ICE: Filesystem checks for project and scratch
  • Phoenix: Updating load balancer for login nodes
  • IDEaS Storage: Updating LDAP configuration

WHY IS IT HAPPENING?

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system. 

WHO IS AFFECTED?

All users across all PACE clusters. 

WHO SHOULD YOU CONTACT FOR QUESTIONS? 

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns. You may read this message on our blog.

Thank you, 

The PACE Team

PACE Maintenance Period (Oct 6 – Oct 8, 2025)

WHEN IS IT HAPPENING?

PACE’s next Maintenance Period starts at 6:00AM on Monday, 10/06/2025, and is tentatively scheduled to conclude by 11:59PM on Wednesday, 10/08/2025. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete.

WHAT DO YOU NEED TO DO?

As usual, the scheduler will hold any job whose requested resources would overlap the Maintenance Period until the maintenance is complete. During this Maintenance Period, all PACE-managed computational and storage resources will be unavailable, including Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime.

WHAT IS HAPPENING?

A detailed list of updates will be provided once it is available.

WHY IS IT HAPPENING?

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED?

All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns. You may read this message on our blog.

Thank you,

-The PACE Team

Services Restored: PACE Login

Summary: PACE services have been restored after an issue with Georgia Tech’s domain name system (DNS) servers caused interruptions for users and jobs accessing PACE resources. The DNS service is back online, and OIT is working on identifying the root cause.

Details: OIT will continue to investigate the origin of this outage.

Impact: PACE users trying to log in to the Phoenix, Buzzard, Hive, Firebird, and ICE clusters and/or Open OnDemand instances were prevented from logging in. Additionally, users and jobs trying to check out software licenses or access CEDAR storage might have been unable to do so. If you continue to experience issues logging into your PACE account or accessing CEDAR, please contact pace-support@oit.gatech.edu.

PACE Login Down


Summary: Users are currently unable to complete login attempts to PACE clusters via the command line or OnDemand web portals, receiving an “unauthorized” or “permission denied” error.

Details: The PACE team is investigating and believes there is an issue with authentication of logins from the central GT access management system, but does not yet have details.

Impact: Attempts to access PACE resources may fail at this time.

Thank you for your patience as we continue investigating. Please visit https://status.gatech.edu for updates.

Disruption of the Phoenix scheduler service on August 14

The Phoenix cluster had a service interruption on Thursday, August 14, around 0:01 AM. The SLURM scheduler was in the process of restarting, which is our regular procedure for clearing jobs stuck in the CG (“completing”) state. Unfortunately, during the restart the scheduler lost its connection to the network drive that hosts the job state, and all jobs running on Phoenix at that time were terminated.

We sincerely apologize for this disruption of service. We are working on modifying our configuration to prevent this from happening in the future. We are in contact with the developers of the scheduler software and are developing an alternative, more stable way to maintain the scheduler’s job state. For Thursday night, we disabled the midnight automatic restarts. For Friday and this weekend, we will offset the scheduler restarts of the HA pair; the two nodes running the scheduler service will restart at 12:15 and 1:15 AM, lessening the load on each node. While there is a risk of losing the connection to the filesystem, we estimate this risk to be low. Next week we will look at more robust options. The cost of the jobs that were terminated at midnight on August 14 will not count towards the August usage.

Phoenix Login Outages

Summary: An issue with DNS caused researchers to receive error messages when attempting to ssh to Phoenix or to open a shell in Phoenix OnDemand beginning late Thursday evening. A workaround has been activated to restore access, but researchers may still encounter intermittent issues.

Details: The load balancer receiving ssh requests to the Phoenix login node began routing to incorrect servers late Thursday evening. The PACE team deployed a workaround at approximately 10:15 AM on Friday, which is still propagating through DNS servers.

Impact: Researchers may receive “man-in-the-middle” warnings and be presented with ssh fingerprints that do not match those published by PACE for verification. Overriding the warning might lead to further errors, as an incorrect server may have been reached. Researchers using the cluster shell access in Phoenix OnDemand may receive a connection-closed error.

It is possible to work around this outage by connecting via ssh directly to a specific Phoenix login node (-1 through -6). There is no specific workaround for the OnDemand shell, though it is possible to request an Interactive Desktop job and use the terminal within it.

Thank you for your patience; we have identified the cause and are working to resolve the issue. Please email pace-support@oit.gatech.edu with any questions or concerns. You may visit status.gatech.edu for ongoing updates.

Phoenix project storage outage, impacting login

Summary: An outage of the metadata servers on Phoenix project storage (Lustre) is preventing access to that storage and may also prevent login by ssh, access to Phoenix OnDemand, and some Globus access on Phoenix. The PACE team is working to repair the system.

Details: During the afternoon of Saturday, July 19, one of the metadata servers for Phoenix Lustre project storage stopped responding. The failover to the other metadata server was not successful. The PACE team has not yet been able to restore access and has engaged our storage vendor.

Impact: Files on the Phoenix Lustre project storage system are not accessible, and researchers may not be able to log in to Phoenix by ssh or via the OnDemand web interface. Globus on Phoenix may time out, but researchers can type another path into the Path box to bypass the home directory and enter a subdirectory directly (e.g., typing ~/scratch will allow access to scratch storage); VAST project, scratch, and CEDAR storage may still be reachable this way. Research groups that have already migrated to VAST project storage may not be impacted.

Thank you for your patience as we work to restore access to Phoenix project storage. Please contact us at pace-support@oit.gatech.edu with any questions.

[UPDATE Mon 21 Jul, 11:00]

Phoenix project storage outage is over

The outage on the Lustre project storage is over; the scheduler has been released and is accepting jobs. Access through the head nodes, Globus, and Open OnDemand is restored.

The diagnostic check of the metadata volumes, performed over the weekend, completed successfully. As a precaution, we are running a thorough check of the data volumes to verify there are no other issues. In the unlikely event of data loss, the affected data will be restored from backups. Scratch, home, VAST, and CEDAR storage systems were not affected by the outage. The cost of the jobs that were terminated due to the outage will be refunded.

We are continuing to work with the storage vendors to prevent project storage outages. The ongoing migration of project storage from Lustre to VAST systems will reduce the impact when one of the shared file systems has issues.  

Degraded performance on Phoenix storage

Dear Phoenix users,

Summary: The project storage system on Phoenix (/storage/coda1) is slower than normal due to heavy use and hard drive failures. The rebuild onto spare hard drives is ongoing; until it is complete, some users might experience slower file access on project storage.

Details: Two hard drives that support the /storage/coda1 project storage failed on July 1 at 3:30am and 9:20am, forcing a rebuild of the data onto spare drives. This rebuild usually takes 24-30 hours to complete. We are closely monitoring the rebuild process, which we expect to complete on July 2 around noon. In addition, we are temporarily moving file services from one metadata server to another and back to rebalance the load across all available systems.

Impact: Access to files is slower than usual during the drive rebuild and metadata server migration. There is no data loss for any users. For affected users, the degraded performance can be observed on login as well as compute nodes. The file system will continue to be operational while the rebuilds run in the background. Access to the Phoenix cluster is not impacted, and the scheduler is processing jobs at a normal rate.

We thank you for your patience as we work to resolve the problem.