
PACE Spending Deadlines for FY25

As you plan your spending on the PACE Phoenix and Firebird clusters for the end of Georgia Tech’s FY25 on June 30, 2025, we would like to alert you to several deadlines:

  1. Due to the time it takes to process purchase requests, we would like to receive all prepaid compute and lump-sum storage purchase requests exceeding $5,000 by April 30, 2025. Please contact us if you anticipate any purchases exceeding that amount so that we may help you with planning.
    1. Purchases under $5,000 can continue without restrictions.
  2. All spending after May 31, 2025, will be held for processing in July, in FY26. This includes postpaid compute jobs run in June, monthly storage payments for June, and new prepaid purchases requested after May 31, 2025.
    1. State funds (DE worktags) expiring on June 30, 2025, may not be used for June spending.
    2. Grant funds (GR worktags) expiring June 30, 2025, may be used for postpaid compute and monthly storage in June.
  3. Existing refresh (CODA20), FY20, and prepaid compute allocations are not impacted, nor is existing prepaid storage.
  4. For worktags that are not expiring, your normal monthly limits on postpaid compute (if selected) will apply in May and June. Monthly storage will continue to be billed as normal.

Find out more about paid compute and storage services available on PACE on our website. If you have any questions or would like to meet with us to discuss the best options for your specific needs, please email us.

[Maintenance] Maintenance window EXTENDED – May 5th-9th

As part of the work needed to mitigate cooling issues in the Coda datacenter, there will be a full replacement of the cooling system water pump in the research hall of the datacenter. While we previously hoped to handle the maintenance from May 6-9th, we are now planning to start one day earlier on May 5th at 6am ET due to the volume of physical work that must be carried out in the datacenter.

Because the new start date is also the final day for instructors to submit grades, we will ensure that the ICE system remains available to instructors. As usual, reservations have been set on all clusters to prevent jobs from running into the maintenance window.

We will follow up with a full list of the planned activities during this maintenance window in our two-week reminder.

Cooling Failure in Coda Datacenter

[Update 4/3/25 9:55 AM]
Our vendors are working to restore cooling capabilities to the datacenter by fully replacing the cooling system controller and expect to have the work completed by 7:00pm ET.  
 
We hope to return all systems to service by tomorrow (Friday) evening, provided that all repairs to the cooling system are complete and that systems prove stable in testing after the shutdown. Clusters will be released tomorrow as testing is completed for each system.
 
We will provide updates on progress via status.gatech.edu and share announcements via specific mailing lists as clusters become available or the situation changes significantly.

[Update 4/2/25 5:50 PM]

Due to continued high temperatures, all Phoenix and Firebird compute nodes have been turned off, and all running jobs were cancelled. Impacted jobs will be refunded at the end of April.

[Original Post 4/2/25 5:20 PM]

Summary: The controller for the cooling system in the Coda Datacenter has failed. Many PACE nodes have been turned off given the significantly reduced cooling capacity in the datacenter. No jobs can start on research clusters.

Details: The controller for the system providing cooling to nodes in the Coda Research Hall has failed. To avoid damage, PACE has urgently shut down many compute nodes to reduce heat.

Impact: No new jobs can start on PACE’s research clusters (Phoenix, Hive, Buzzard, and Firebird). All Hive and Buzzard compute nodes have been turned off, and running jobs were cancelled. There is not yet an impact to ICE, but we may need to shut down ICE nodes as well as we monitor temperatures.

Please visit https://status.gatech.edu for ongoing updates as the situation evolves. Please contact pace-support@oit.gatech.edu with any questions.

[Update] [storage] Phoenix Project storage degraded performance

[Updated March 31, 2025 at 4:14pm]

Dear Phoenix researchers,

As the Phoenix project storage system has stabilized, we have restored login access via ssh and resumed starting jobs.

The cost of jobs that ran during the performance degradation will not count towards March usage.

The Phoenix OnDemand portal can again be used to access project and scratch space. Any user still receiving a “Proxy Error” should contact pace-support@oit.gatech.edu for an individual reset of their OnDemand session.

Globus file transfers have resumed. We have determined that transfers to/from home, scratch, and CEDAR storage were inadvertently paused, and we apologize for any confusion. Any paused transfer should have automatically resumed.

The PACE team continues to monitor the storage system for any further issues. We are working with the vendor to identify the root cause and prevent future performance degradation.

Please contact us at pace-support@oit.gatech.edu with any questions. We appreciate your patience during this unexpected outage.

Best,

The PACE Team

[Updated March 31, 2025 at 12:41pm]

Dear Phoenix Users,

To expedite troubleshooting of the current Phoenix project filesystem issues and limit the impact on currently running jobs, we have implemented the following changes:

New Logins to Phoenix Login Nodes are Paused

We have prevented new login attempts to the Phoenix login nodes. Users who are currently logged in may remain logged in.

Phoenix Jobs Prevented from Starting

Jobs in the queue that have not yet started have been held to prevent them from starting. These submitted jobs will remain in the queue.

Jobs that are currently running may experience decreased performance if using project storage. We are doing our best to prioritize the successful completion of these jobs.

Open OnDemand (OOD)

Users of Phoenix OOD can log in but may interact only with their home directory. Project and scratch space are not available.

Some users of Open OnDemand may be unable to reach this service and are experiencing “Proxy Error” messages. We are investigating the root cause of this issue.

Globus File Transfer Paused for Project Space

File transfers to/from project storage on Globus have been paused. Other Globus transfers (Box, DropBox, and OneDrive cloud connectors; scratch; home; and CEDAR) will continue.

The PACE team is working to diagnose the current issues with support from our filesystem vendor. We will continue to share updates as we have them and apologize for this unexpected service outage.

Best,

The PACE Team

[storage] Phoenix Project storage degraded performance

We are currently experiencing degraded performance on Phoenix project storage. We are investigating with the vendor and will provide updates as we learn more.

Summary: Performance of Phoenix project storage is currently degraded.

Details: Two of our metadata servers (MDS) rebooted early Monday morning, March 31, and the load average is unusually high on one of them.

Impact: Researchers may experience significant slowness in read & write performance on Phoenix project storage until we are able to mitigate the issue. Conda environments located in project storage may be very slow to load (even if the Python script to run is located elsewhere) or fail to activate, while attempts to view project storage files via the OnDemand web portal may time out.
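
If you are unsure whether a Conda environment lives on project storage, a quick check may help. This is a minimal sketch; the environment name and paths shown are illustrative:

    # List all Conda environments with their filesystem locations; entries under
    # a project path (e.g. /storage/coda1/...) reside on project storage and may
    # be slow to activate during this event.
    conda env list

    # Possible temporary workaround: environments created under your home
    # directory are not affected by project storage slowness.
    conda create --prefix ~/envs/temp-env python=3.11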

Phoenix storage performance degraded

[Update 3/21/25 12:30 PM]

Following the completion of the rebuild and copyback processes on the impacted redundant storage pool, Phoenix project storage performance has returned to normal. Please contact pace-support@oit.gatech.edu if you encounter any further issues.

[Original post 3/19/25 5:00 PM]

Summary: Performance of Phoenix project storage is currently degraded.

Details: Multiple disks in a redundant storage pool failed yesterday and today, and storage is slowed while the pool rebuilds.

Impact: Researchers may experience significant slowness in read & write performance on Phoenix project storage until the process is complete. Conda environments located in project storage may be very slow to load (even if the python script to run is located elsewhere) or fail to activate, while attempts to view project storage files via the OnDemand web portal may time out.

Please visit https://status.gatech.edu for updates and contact pace-support@oit.gatech.edu with any questions.

[Advance Notice] Planned Spring Break (March 17th-21st) Downtime

[Update 3/6/25]

Summary: All PACE compute nodes will be unavailable from 4:00 PM on Friday, March 14, through Tuesday, March 18, to repair a water leak in the Coda Datacenter Research Hall cooling system. Access to login nodes and data will remain available.  

Details: Due to a water leak discovered last month, a seal in the cooling system will be replaced at the start of Spring Break. A full replacement of the pump is planned for the May maintenance period, which will be extended by one day and is now planned for May 6-9, 2025, once all parts are available. Non-compute nodes in the Enterprise Hall will not be impacted by the Spring Break repair.

Impact: During the outage, it will not be possible to run any compute jobs on any PACE cluster (Phoenix, Hive, ICE, Firebird, Buzzard). Login nodes and storage systems will remain available. A reservation has been placed on all schedulers to prevent any jobs from starting if their walltime request extends past 4:00 PM on March 14; these jobs will be held until maintenance is complete.  
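
For jobs submitted before the outage, requesting a walltime that ends before the 4:00 PM cutoff on March 14 allows the scheduler to run them rather than hold them. A minimal Slurm sketch, where the account name and job script are hypothetical:

    # Request a 12-hour walltime; if the job starts at least 12 hours before the
    # 4:00 PM March 14 cutoff, it can complete before the outage. Jobs requesting
    # more time than remains will be held by the reservation until maintenance ends.
    sbatch --time=12:00:00 --account=gts-example my_job.sbatch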

Thank you for your patience as we work to restore full functionality of the cooling system. You can read this message on our blog.  

Best, 

-The PACE Team 

[Original Post 2/19/25]

Summary: A water leak has occurred in the CODA Datacenter Research Hall cooling system due to the failure of a pump seal. As a result, we are planning a two- to three-day outage (pending confirmation from Databank and the mechanical contractor) during the week of March 17th-21st, which we hope will have reduced impact since it falls during Spring Break. Access to login nodes and data will remain available, as these live in a different part of the datacenter. No compute services (Phoenix, Hive, ICE, Firebird, or Buzzard compute nodes) will be available. We will follow up once the exact days of the outage are finalized.

Details: A pump seal in the CODA research hall cooling system failed on Feb 16th. The leak is not currently impacting operation of any PACE resources. Databank is working on a full pump replacement ("flange-to-flange") plan to address the issue. Databank is actively sourcing the pump and associated parts and coordinating with a new mechanical contractor. We currently target the pump replacement for Spring Break (March 17-21); however, this date could change based on supply chain constraints. The mechanical work is estimated to take one to two days, depending on whether additional damage or issues are identified during the pump replacement. Upon completion of the work, the PACE team will need one business day to conduct all necessary testing on the ~2,000 systems and release the five clusters currently hosted in the Research Hall (Phoenix, Hive, ICE/AI Makerspace, Firebird, and OSG Buzzard).

Being able to perform the work during Spring Break represents a best-case scenario. Databank is actively monitoring the leak and the overall health of the cooling system. Should the situation deteriorate quickly or a catastrophic failure occur, Databank will coordinate emergency repair work to replace the pump seal itself using available on-site spare parts. Under this scenario, a complete pump replacement would be coordinated during the planned PACE Maintenance period in May.  
 
We are striving to keep the shutdown as short as possible. A reservation has been placed on the clusters to prevent jobs from being cancelled by the shutdown; some jobs will be held until the outage is over.

Thank you for your patience as we work to recover from this situation.

Best, 

-The PACE Team 

[Notice] Phoenix Scheduler Account Issue

Following the Slurm upgrade during the January 2025 maintenance window, the monthly usage reset did not execute as scheduled on February 1. Consequently, reported balances were lower than expected, as the reported usage still included January utilization. Having identified the issue, the PACE team manually reset usage across all accounts at 12:00 PM on February 4. Additionally, the vendor has been notified of the bug and asked to provide a patch before the next monthly cycle.

Any jobs that ran from the beginning of February until now will not count towards February usage, and any overages of pre-set limits will be refunded.  

We sincerely apologize for the inconvenience.  

Thank you and have a great day! 

PACE team 

[Complete] PACE Maintenance Period – January 13th-16th 2025

WHEN IS IT HAPPENING?  

PACE’s next Maintenance Period starts at 6:00 AM on Monday, January 13th, 2025, and is tentatively scheduled to conclude by 11:59 PM on Thursday, January 16th, 2025. The additional day is needed to accommodate the extra testing required by the presence of both RHEL7 and RHEL9 versions of our systems as we migrate to the new operating system. PACE will release each cluster (Phoenix, Hive, Firebird, ICE, and Buzzard) as soon as maintenance work and testing are completed. We will prioritize ICE to support Spring courses as soon as possible; for the other clusters, we plan to focus first on the largest portion of each system (for Phoenix and Firebird, where both OSs are present) to restore access to data and compute capabilities.

 
WHAT DO YOU NEED TO DO?   

As usual, jobs whose resource requests would overlap the Maintenance Period will be held by the scheduler until after the maintenance. During this Maintenance Period, access to all PACE-managed computational and storage resources will be unavailable. This includes Phoenix, Hive, Firebird, ICE, and Buzzard. Please plan accordingly for the projected downtime. CEDAR storage will not be affected.

WHAT IS HAPPENING?   

ITEMS REQUIRING USER ACTION: 

  • [Phoenix] Continue migrating nodes to the RHEL 9 operating system, which will complete after the maintenance period; Phoenix will then be 75% on the RHEL9 OS.
  • [Hive] COMPLETE migrating nodes to the RHEL 9 operating system. 
  • [Phoenix and Hive] Default login behavior will change so that login-phoenix and login-hive will point to RHEL 9 login nodes rather than RHEL 7 nodes, which WILL trigger SSH warnings; see the sketch below, and see our documentation for more information on SSH at PACE.
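
When the login aliases move to the RHEL 9 nodes, SSH clients that cached the old RHEL 7 host keys will warn about a changed host key. Removing the stale entry from your known_hosts file clears the warning. A minimal sketch; the fully qualified hostnames below are assumptions, so use the name exactly as you type it when connecting:

    # Remove the cached RHEL 7 host keys so SSH will accept the new RHEL 9 keys
    # on the next connection (hostnames are assumptions; match your own usage):
    ssh-keygen -R login-phoenix.pace.gatech.edu
    ssh-keygen -R login-hive.pace.gatech.edu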
      

ITEMS NOT REQUIRING USER ACTION: 

  • [Phoenix, Hive, Firebird, ICE] Upgrade Slurm to 24.11.10 
  • [all] DataBank will perform cooling tower cleaning requiring all machines in the research hall to be powered off 
  • [all] Upgrade border firewall hardware  
  • [Phoenix, ICE] Upgrade IB (InfiniBand) switch firmware
  • [Phoenix, Hive] Move Globus endpoints to new network to improve performance
  • [ICE] Enable self-service container builds 
  • [Phoenix] Upgrade all storage servers to latest version to support performance improvements, covering scratch and project (coda1) storage 
  • [Firebird] Upgrade underlying storage servers to improve functionality

WHY IS IT HAPPENING?  

Regular maintenance periods are necessary to reduce unplanned downtime and maintain a secure and stable system.  

WHO IS AFFECTED?  

All users across all PACE clusters.  

WHO SHOULD YOU CONTACT FOR QUESTIONS?   

Please contact PACE at pace-support@oit.gatech.edu with questions or concerns.

Thank you,  

-The PACE Team 

[Resolved] Phoenix login nodes outage on Dec 5, 2024

On the morning of December 5, 2024, the RHEL9 login nodes of the Phoenix cluster became unresponsive. The problems started at 4:37 AM, when one login node (out of two) developed a memory problem; at 6:27 AM, it crashed. The other login node crashed at 9:37 AM, rendering the RHEL9 environment on Phoenix inaccessible. Both login nodes were restarted at 11:30 AM, which resolved the issue. Jobs that crashed between 4:37 AM and 11:30 AM have been refunded.