reminder – electrical work in the data center

Just a quick reminder that Facilities will be doing some electrical work in the data center, unrelated to PACE, tomorrow.  We’re not expecting any issues, but there is a remote possibility that this work could interrupt electrical power to various PACE servers, storage and network equipment.

Upcoming Quarterly Maintenance on 4/17

The first quarter of the year has passed already, and it’s time for the quarterly maintenance once again!

Our team will take all the clusters offline for regular maintenance and improvements on 04/17, for the entire day. We have a scheduler reservation in place to hold any jobs that would not complete before the maintenance day, so hopefully no jobs will need to be killed. Jobs with such long wallclock times will still be queued, but they will not be released until the maintenance is over.
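
For anyone curious how such a reservation decides which jobs are held, here is a minimal sketch in Python. It only illustrates the idea and is not how the scheduler is actually implemented; the function name, the year, and the midnight start time are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Assumed start of the maintenance window (the year and the midnight
# start time are illustrative assumptions, not taken from the scheduler).
MAINTENANCE_START = datetime(2012, 4, 17, 0, 0)


def can_start_now(now, walltime, maintenance_start=MAINTENANCE_START):
    """Return True if a job started at `now` with the requested
    `walltime` would finish before the maintenance window begins.

    Hypothetical helper: a job that fails this check stays queued
    and is released only after the maintenance is over.
    """
    return now + walltime <= maintenance_start


if __name__ == "__main__":
    now = datetime(2012, 4, 15, 12, 0)
    # A 12-hour job ends well before the window, so it can run.
    print(can_start_now(now, timedelta(hours=12)))
    # A 48-hour job would run into maintenance day, so it is held.
    print(can_start_now(now, timedelta(hours=48)))
```

In practice this means that requesting a shorter, more accurate wallclock time makes it more likely that a job will start before the maintenance day.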

Please direct your concerns/questions to PACE support at pace-support@oit.gatech.edu.

Thanks!

FYI – upcoming datacenter electrical work

In addition to our previously scheduled maintenance day activities next Tuesday, the datacenter folks are scheduling another round of electrical work during the morning of Saturday 4/21. As with last time, this should not affect any PACE-managed equipment, but just in case….

Regarding the job scheduler problems over the weekend

We experienced a major problem with one of our file servers over the weekend, which caused some of your jobs to fail. We would like to apologize for this inconvenience and provide you with more details on the issue.

In a nutshell, the management blade of the file server we use for scratch space (iw-scratch) crashed for a reason that we are still investigating. This system has a failover mechanism, which allows another blade to take over so that operations can continue. Therefore, you were still able to see your files and could use the software stack that is on this fileserver.

The node that runs the Moab server (job scheduler), on the other hand, mounts this fileserver using another mechanism that relies on a static IP. After the new blade took over operations, our Moab node kept trying to mount iw-scratch using the IP of the failed blade and, needless to say, did not succeed.

As a result, some jobs failed with messages similar to “file not found”. This problem also rendered the Moab server unresponsive until we rebooted it Saturday night. Even after the reboot, some problems persisted until we fixed the server this morning. We will keep you updated as we find out more about the nature of the problem. We are also in contact with the vendor to prevent this from happening again.

Thank you once again for your understanding and patience. Please contact us at pace-support@oit.gatech.edu with any questions or concerns.

FYI – upcoming electrical work

We’ve received word that GT-Facilities will be doing some electrical work in the data center, unrelated to PACE, on Saturday, February 18 from 6:00am until noon. They do not expect this to disrupt any PACE-managed equipment, but just in case….

Upcoming quarterly maintenance – 1/18/2012

This is a reminder that all PACE-managed clusters will be shut down on January 18 (Wednesday next week) for regular maintenance.

All currently running jobs will complete before the shutdown.  Any jobs submitted to the scheduler between now and maintenance day will either complete before the shutdown or wait until after maintenance to start.

Major items on the list this time around are:
– Improving scratch filesystem performance
– Increasing scratch filesystem size
– Completing the migration of our server infrastructure to VMs
– Further redundancy improvements to the core of the HPC network
– Adjustments to VM hypervisors, which will hopefully improve login node performance
– Installing a new binary for the scheduler to remove a limit on the number of queues (this shouldn’t change anything else, but we are mentioning it just in case..)
– Integrating some new clusters into the Infiniband fabric (again, doing this on maintenance day just in case something bad happens)

For updates about maintenance, please check the PACE blog at http://blog.pace.gatech.edu/

If you have questions or concerns, please send a note to pace-support@oit.gatech.edu.

Possible network outage on 12/13

The network team will perform maintenance next Tuesday (12/13) at 7:30am. This is not expected to affect any systems or running jobs, but there is still a ~20% chance of a network outage, which could last about an hour. The team will be on site, prepared to intervene immediately should that happen.
Please note that a network outage would affect running jobs, so you may want to wait until the maintenance is over to submit large and/or critical jobs. As always, please contact us if you have any concerns or questions.

Updated: Network troubles, redux (FIXED)

We’ve got the switch back.  The outage looks to have caused our virtual machine farm to reboot, so connections to head nodes will have been dropped.

This also affected the network path between compute nodes and the file servers.  With a little luck, the NFS traffic should resume, but you may want to check on any running jobs to make sure.

Word from the network team is that they were following published instructions from the switch vendor to integrate the two switches when the failure occurred. We’ll be looking into this pretty intensely, as these switches are seeing a lot of deployments in other OIT functions.