Posts

Maintenance Day (October 16, 2012) – complete

We have completed our maintenance activities. Head nodes are online again, and queued jobs are being released.

Our filesystem correction activities on the scratch storage found eight damaged “objects” on the v7 volume, which were automatically removed. Unfortunately, the process provides no indication of which file or directory was problematic.

As always, please follow up with pace-support@oit.gatech.edu about any problems you may see, ideally using the pace-support.sh script discussed here: http://pace.gatech.edu/support.

campus network maintenance

The Network team will be performing some scheduled maintenance this Saturday morning. This may impact connectivity from your workstations, laptops, or home networks, but should not affect jobs running within PACE. However, if your job requires access to network services outside of the PACE cluster (e.g. a remote license server), this maintenance may affect your jobs.

For further information please see the maintenance announcement on status.oit.gatech.edu.

Check the status of queue(s) using “pace-check-queue”

Dear PACE Users,

We have a new tool to announce. If you would like to check the status of any PACE queue, you can now run:

pace-check-queue <queuename>

substituting <queuename> with the name of the queue you would like to check. The output includes a column that tells you whether each node is accepting jobs, along with a human-readable explanation. At a glance, this tool provides the following information:

* Which nodes are included in the queue

* Which nodes accept jobs and which don’t (and if they don’t, why)

* How many cores and how much memory each node has, and what percentage of each is in use

* Overall usage (CPU/Memory) levels for the entire queue.

(This information is refreshed every half hour.)

We recently announced another new tool, pace-stat, for checking the status of your queues. These tools complement each other, so feel free to use both. Please report any down or problem nodes that you see in the list to pace-support@oit.gatech.edu.
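
For example, assuming you have access to a queue named "force-6" (the name here is purely illustrative; substitute one of your own queues), checking it from a head node looks like this:

$ pace-check-queue force-6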

We hope these new tools will provide you with a better HPC environment. Happy computing!

PS: These tools are continuously being developed, therefore your feedback and suggestions for improvements are always welcome!

upcoming maintenance day, 10/16 – working on the scratch storage

It’s that time again.  We’ve been working with our scratch storage vendor (Panasas) quite a lot lately, and think we finally have some good news.  Addressing the scratch space will be a major thrust of this quarterly maintenance, and we are cautiously optimistic that we will see improvements.  We will also be applying some VMware tuning to our RHEL5 virtual machines that should increase responsiveness of those head nodes & servers.  Completing upgrades to RHEL6 for a few clusters and a few other minor items round out our activities for the day.

Scratch storage

We have been testing new firmware on our loaner Panasas storage. Despite our best efforts, we have been unable to replicate our current set of problems after upgrading our loaner equipment to this firmware. This is good news! However, simply upgrading is insufficient to fully resolve our issues. So on maintenance day, we will be performing a number of tasks related to the Panasas. After the firmware update, we need to perform some basic file integrity checks (the equivalent of a UNIX fsck) on a couple of volumes. This process requires those volumes to be offline for the duration. After this, we need to perform reads of every file on the scratch that was created before the firmware upgrade. Based on our calculations, this will take weeks. Fortunately, this process can happen in the background, with the filesystems online and otherwise operating normally. The net result is that the full impact of our maintenance day improvements to the scratch will likely not be realized for a couple of weeks. If there are files (particularly large ones) that you no longer need and can delete, this process will go faster. We will also be upgrading the Panasas client software on all compute nodes to (hopefully) address performance issues.

Finally, we will also be instituting a 20TB per-user hard quota in addition to the 10TB per-user soft quota currently in place. Users who exceed the soft quota will receive warning emails, but their writes will still succeed. Writes will fail for users who attempt to exceed the hard quota.
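
If you would like to see how much scratch space you are using, and find large files that are candidates for deletion before the file-read pass begins, ordinary shell tools are sufficient. A minimal sketch, assuming your scratch directory is at ~/scratch (adjust the path to wherever your scratch space is mounted):

$ du -sh ~/scratch                                     # total space used under your scratch directory
$ find ~/scratch -type f -size +10G -exec ls -lh {} +  # list files larger than 10GB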

VMware tuning

With some assistance from the Architecture and Infrastructure directorate in OIT, we will be making a number of adjustments to our VMware environment. The most significant of these is adjusting the filesystem alignment of our RHEL5 virtual machines. Users of RHEL5 head nodes are likely to see the most improvement. We'll also be installing the VMware tools packages and applying various tuning parameters enabled by this package.

RHEL6 upgrades

The remaining RHEL5 portions of the clusters below will be upgraded to RHEL6. After maintenance day, RHEL5 will no longer be available on these clusters.

  • Uranus
  • BioCluster
  • Cygnus

Misc items

  • Configuration updates to redundant network switches serving some project storage
  • Capacity expansion of the ECE file server
  • Serial number updates to a small number of compute nodes lacking serial numbers in the BIOS
  • Interoperability testing of Mellanox Infiniband switches
  • Finish project directory migration of two remaining Optimus users

Cygnus FS pc5 online…mostly.

We have been able to bring /nv/pc5 back online, but at a cost to redundancy. One of the network interfaces, cables, or switches is misbehaving, but when we tried disconnecting various combinations of cables, we found a configuration that made the filesystem immediately available to all nodes.

Considering how close maintenance day is (10/16/12), spending time isolating the cable/switch/interface problem now would only mean more time with this filesystem offline as equipment gets retested. Waiting until maintenance day will cause the least disruption for Cygnus pc5 users finishing their last runs of jobs, and it takes some time pressure off of us to make sure we have resolved the issue in its entirety before bringing all resources back online.

Despite the loss of redundancy, functionality is NOT affected. Only in the case of an additional switch or cable failure between now and October 16 will functionality be impacted.

Cygnus File System pc5 offline

It appears that we have an issue with the server housing the /nv/pc5 filesystem, which serves a subset of the Cygnus cluster users. We're trying to isolate the source of the problem, but we have yet to find a pattern explaining why the filesystem is available on some nodes and not on others.

Joe Cluster Status

Between roughly 8:00 and 8:30pm on September 28, 2012, a power event took down the TSRB data center, knocking a significant fraction of the Joe cluster offline.

With assistance from Operations, we are now bringing these nodes back online, after determining that several of the management switches for these nodes did not recover gracefully from the event. Because these switches control our ability to manage the nodes, we had to wait until the switches were available before bringing nodes online, which began at about 4pm on September 29, 2012.

Jobs that were running on these nodes (iw-a2-* and iw-a3-*) at the time of the outage may have terminated abnormally. Jobs scheduled but not running should be fine.

UPDATE @ 4:40pm, 2012-09-29: All nodes are online.

New and Updated Software: GCC, Maxima, OpenCV, Boost, ncbi_blast

Software Installation and Updates

We have had several requests for new or updated software since the last post on August 14.
Here are the details about the updates.
All of this software is installed on RHEL6 clusters (including force-6, uranus-6, ece, math, apurimac, joe-6, etc.)

GCC 4.7.2

The GNU Compiler Collection (GCC) includes compilers for many languages (C, C++, Fortran, Java, and Go).
This latest version of GCC supports advanced optimizations for the latest compute nodes in PACE.

Here is how to use it:

$ module load gcc/4.7.2
$ gcc <source.c>
$ gfortran <source.f>
$ g++ <source.cpp>

Versions of GCC already installed on the RHEL6 clusters are gcc/4.4.5, gcc/4.6.2, and gcc/4.7.0.
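
As a rough illustration of taking advantage of the newer optimizations (the flags below are standard GCC options, not PACE-specific; choose the flags appropriate for your code and target nodes):

$ module load gcc/4.7.2
# -O2 enables general optimizations; -march=native tunes for the CPU of the
# machine the compile runs on, so compile on the same node type you plan to run on
$ gcc -O2 -march=native -o myprogram source.c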

Maxima 5.28.0

Maxima is a system for the manipulation of symbolic and numerical expressions, including differentiation, integration, Taylor series, Laplace transforms, ordinary differential equations, systems of linear equations, polynomials, and sets, lists, vectors, matrices, and tensors. Maxima yields high precision numeric results by using exact fractions, arbitrary precision integers, and variable precision floating point numbers. Maxima can plot functions and data in two and three dimensions.

Here is how to use it:

$ module load clisp/2.49.0 maxima/5.28.0
$ maxima
#If you have X-Forwarding turned on, "xmaxima" will display a GUI with a tutorial
$ xmaxima
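
For non-interactive use, for instance inside a batch job script, Maxima can evaluate an expression directly from the command line. A small sketch (the expression is only an illustration):

$ module load clisp/2.49.0 maxima/5.28.0
# Evaluate one expression in batch mode and print the result
$ maxima --very-quiet --batch-string="integrate(sin(x), x);"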

OpenCV 2.4.2

OpenCV (Open Source Computer Vision) is a library of programming functions for real time computer vision.

OpenCV is released under a BSD license and is free for both academic and commercial use. It has C++, C, Python, and (soon) Java interfaces, and it runs on Windows, Linux, Android, and Mac. The library includes more than 2500 optimized algorithms.

This installation of OpenCV has been installed with support for Python and NumPy. It has been installed without support for Intel TBB, Intel IPP, or CUDA.

Here is how to use it:

$ module load gcc/4.4.5 opencv/2.4.2
$ g++ <source.cpp> $(pkg-config --libs opencv)
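
Since this build includes the Python and NumPy bindings, a quick way to confirm they are visible from Python is to import the cv2 module (how Python itself is provided, e.g. whether a python module must be loaded first, depends on your environment):

$ module load gcc/4.4.5 opencv/2.4.2
# Print the OpenCV version through the Python bindings
$ python -c "import cv2; print(cv2.__version__)"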

Boost

Boost provides free peer-reviewed portable C++ source libraries.
Boost libraries are intended to be widely useful, and usable across a broad spectrum of applications.

Here is how to use it:

$ module load boost/1.51.0
$ g++ <source.cpp>
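
Many Boost components are header-only and need no extra flags, but the separately compiled libraries must be linked explicitly. A hedged example (the specific -l flags depend on which Boost components your code uses):

$ module load boost/1.51.0
# Header-only components: no extra link flags needed
$ g++ <source.cpp>
# Compiled components, e.g. Boost.Filesystem, need the matching libraries
$ g++ <source.cpp> -lboost_filesystem -lboost_system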

NCBI BLAST

Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences.

Here is how to use it:

$ module load gcc/4.4.5 ncbi_blast/2.2.27
$ blastn
$ blastp
$ blastx
...
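
These are the standard BLAST+ command-line programs. A minimal sketch of a nucleotide search (the query file and database names below are purely illustrative; substitute your own):

$ module load gcc/4.4.5 ncbi_blast/2.2.27
# Compare nucleotide query sequences against a nucleotide database, tabular output
$ blastn -query my_sequences.fasta -db my_database -out results.txt -outfmt 6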

Registration open for OpenACC GPU Programming Workshop

Extreme Science and Engineering Discovery Environment
http://xsede.org/

Registration open for October 2012
OpenACC GPU Programming Workshop

One hundred registrants will be accepted for the OpenACC GPU Programming Workshop, to be held October 16 and 17, 2012. The workshop includes hands-on access to Keeneland, the newest XSEDE resource, which is managed by the Georgia Institute of Technology (Georgia Tech) and the National Institute for Computational Sciences, an XSEDE partner institution.

Based on demand, the workshop is scheduled to be held at ten different sites around the country. Anyone interested in participating is asked to follow the link below and then register by clicking on the preferred site. Only the first 100 registrants will be accepted.

The workshop is offered by the Pittsburgh Supercomputing Center, the National Institute for Computational Sciences, and Georgia Tech.

Questions? Contact Tom Maiden at tmaiden@psc.edu.

Register and read more about the workshop at:
http://www.psc.edu/index.php/training/openacc-gpu-programming

[XSEDE is supported by the National Science Foundation; https://www.xsede.org, info@xsede.org.]