1-Week Reminder – PACE Maintenance Period (Oct 6 – Oct 8, 2025)

WHEN IS IT HAPPENING?

PACE’s next Maintenance Period starts at 6:00 AM on Monday, 10/06/2025, and is tentatively scheduled to conclude by 11:59 PM on Wednesday, 10/08/2025. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete.

WHAT DO YOU NEED TO DO? 

As usual, the scheduler will hold any job whose requested walltime would overlap the Maintenance Period until after maintenance concludes. During this Maintenance Period, access to all PACE-managed computational and storage resources, including Phoenix, Hive, Firebird, ICE, and Buzzard, will be unavailable. Please plan accordingly for the projected downtime.
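The hold behavior described above amounts to a simple walltime check: a job can start now only if its requested walltime fits before the maintenance window opens. Below is a minimal illustrative sketch of that check, not PACE's actual scheduler code; it assumes GNU date and uses a hypothetical example submit time.

```shell
# Maintenance window opens 6:00 AM on Monday, 10/06/2025.
maint_start=$(date -d "2025-10-06 06:00" +%s)

# Hypothetical submit time and requested walltime (assumptions for illustration).
submit=$(date -d "2025-10-03 06:00" +%s)
walltime_hours=24

# Hours remaining before maintenance begins.
remaining_hours=$(( (maint_start - submit) / 3600 ))

# A job that cannot finish before maintenance starts is held until afterward.
if [ "$walltime_hours" -le "$remaining_hours" ]; then
  decision="starts now"
else
  decision="held until after maintenance"
fi
echo "$decision"
```

In practice, requesting a walltime short enough to finish before 6:00 AM on 10/06 lets a job run before the outage rather than wait in the queue.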

WHAT IS HAPPENING? 

  • All Systems: Cooling tower maintenance and cleanup
  • All Systems: Updating to RHEL 9.6 Operating System
  • Phoenix: New GNR (Granite Rapids) login nodes coming online!
  • Phoenix and ICE: Filesystem checks for project and scratch
  • Phoenix: Updating load balancer for login nodes
  • IDEaS Storage: Updating LDAP configuration

WHY IS IT HAPPENING?

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system. 

WHO IS AFFECTED?

All users across all PACE clusters. 

WHO SHOULD YOU CONTACT FOR QUESTIONS? 

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns. You may read this message on our blog.

Thank you, 

The PACE Team

PACE Maintenance Period (Oct 6 – Oct 8, 2025)

WHEN IS IT HAPPENING?

PACE’s next Maintenance Period starts at 6:00 AM on Monday, 10/06/2025, and is tentatively scheduled to conclude by 11:59 PM on Wednesday, 10/08/2025. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work is complete.

WHAT DO YOU NEED TO DO?

As usual, the scheduler will hold any job whose requested walltime would overlap the Maintenance Period until after maintenance concludes. During this Maintenance Period, access to all PACE-managed computational and storage resources, including Phoenix, Hive, Firebird, ICE, and Buzzard, will be unavailable. Please plan accordingly for the projected downtime.

WHAT IS HAPPENING?

A detailed list of updates will be provided once it is available.

WHY IS IT HAPPENING?

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.

WHO IS AFFECTED?

All users across all PACE clusters.

WHO SHOULD YOU CONTACT FOR QUESTIONS?

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns. You may read this message on our blog.

Thank you,

The PACE Team

Services Restored: PACE Login

Summary: PACE services are restored after an issue with Georgia Tech’s domain name system (DNS) servers caused interruptions for users and jobs accessing PACE resources. DNS service has been restored, and OIT is working to identify the root cause.

Details: OIT will continue to investigate the origin of this outage.

Impact: PACE users were prevented from logging in to the Phoenix, Buzzard, Hive, Firebird, and ICE clusters and/or Open OnDemand instances. Additionally, users and jobs trying to check out software licenses or access CEDAR storage might have been unable to do so. If you continue to experience issues logging in to your PACE account or accessing CEDAR, please contact pace-support@oit.gatech.edu.

PACE Login Down

Summary: Users are currently unable to complete login attempts to PACE clusters via the command line or OnDemand web portals, receiving an “unauthorized” or “permission denied” error.

Details: The PACE team is investigating and believes there is an issue with authentication of logins from GT’s central access management system, but we do not yet have details.

Impact: Attempts to access PACE resources may fail at this time.

Thank you for your patience as we continue investigating. Please visit https://status.gatech.edu for updates.

Degraded performance on Phoenix storage

Dear Phoenix users,

Summary: The project storage system on Phoenix (/storage/coda1) is slower than normal due to heavy use and hard drive failures. The rebuild onto spare hard drives is ongoing; until it is complete, some users might experience slower file access on the project storage.

Details: Two hard drives that support the /storage/coda1 project storage failed on July 1 at 3:30 AM and 9:20 AM, forcing a rebuild of the data onto spare drives. This rebuild usually takes 24-30 hours to complete. We are closely monitoring the rebuild process, which we expect to complete on July 2 around noon. In addition, we are temporarily moving file services from one metadata server to another and back to rebalance the load across all available systems.

Impact: Access to files is slower than usual during the drive rebuild and metadata server migration. There is no data loss for any users. Affected users may observe degraded performance on login as well as compute nodes. The file system will remain operational while the rebuilds run in the background. Access to the Phoenix cluster is not impacted, and the scheduler is processing jobs at a normal rate.

We thank you for your patience as we work to resolve the problem.

[Update] [storage] Phoenix Project storage degraded performance

[Updated March 31, 2025 at 4:14pm]

Dear Phoenix researchers,

As the Phoenix project storage system has stabilized, we have restored login access via ssh and resumed starting jobs.

The cost for the jobs running during the performance degradation will not count towards the March usage.

The Phoenix OnDemand portal can again be used to access project and scratch space. Any user still receiving a “Proxy Error” should contact pace-support@oit.gatech.edu for an individual reset of their OnDemand session.

Globus file transfers have resumed. We have determined that transfers to/from home, scratch, and CEDAR storage were inadvertently paused, and we apologize for any confusion. Any paused transfer should have automatically resumed.

The PACE team continues to monitor the storage system for any further issues. We are working with the vendor to identify the root cause and prevent future performance degradation.

Please contact us at pace-support@oit.gatech.edu with any questions. We appreciate your patience during this unexpected outage.

Best,

The PACE Team

[Updated March 31, 2025 at 12:41pm]

Dear Phoenix Users,

To limit the impact of the current Phoenix project filesystem issues, we have implemented the following changes to expedite troubleshooting and limit impact to currently running jobs:

New Logins to Phoenix Login Nodes are Paused

We have prevented new login attempts to the Phoenix login nodes. Users who are currently logged in will be able to stay logged in to the system.

Phoenix Jobs Prevented from Starting

Jobs that are in the queue but have not yet started are being held to prevent them from starting; these submitted jobs will remain in the queue.

Jobs that are currently running may experience decreased performance if using project storage. We are doing our best to prioritize the successful completion of these jobs.

Open OnDemand (OOD)

Users of Phoenix OOD can log in and interact with only their home directory. Project and scratch space are not available.

Some users of Open OnDemand may be unable to reach this service and are experiencing “Proxy Error” messages. We are investigating the root cause of this issue.

Globus File Transfer Paused for Project Space

File transfers to/from project storage on Globus have been paused. Other Globus transfers (Box, DropBox, and OneDrive cloud connectors; scratch; home; and CEDAR) will continue.

The PACE team is working to diagnose the current issues with support from our filesystem vendor. We will continue to share updates as we have them and apologize for this unexpected service outage.

Best,

The PACE Team