Disruption of the Phoenix scheduler service on August 14

The Phoenix cluster had a service interruption on Thursday, August 14, at around 12:01 AM. The SLURM scheduler was restarting at the time, which is our regular procedure for clearing jobs stuck in the CG (“completing”) state. Unfortunately, during the restart the scheduler lost its connection to the network drive that hosts the job state, and all jobs running on Phoenix at that time were terminated.

We sincerely apologize for this disruption of service. We are modifying our configuration to prevent this from happening in the future. We are in contact with the developers of the scheduler software and are developing an alternative, more stable way to maintain the scheduler’s job state.

For Thursday night, we disabled the automatic midnight restarts. For Friday and this weekend, we will offset the scheduler restarts of the HA pair: the two nodes running the scheduler service will restart at 12:15 AM and 1:15 AM, respectively, lessening the load on each node. While there is still a risk of losing the connection to the network drive, we estimate this risk to be low. Next week we will look at more robust options.

The cost of the jobs that were terminated at midnight on August 14 will not count towards the August usage.