OIT HPC Services
NC State Office of Information Technology (OIT) offers a number of intermediate-level HPC services to support research and instruction. These services are available to all NC State faculty.
To access HPC services, an NC State faculty member requests an HPC project. Once the project is established, normally within one business day of the request, the faculty member may add individual accounts for students or collaborators.
The faculty member who requested the project is responsible
for all resource use by that project.
Distributed Memory Computing
Distributed memory computing services are provided by two Linux clusters, henry2 and sam. Access is via ssh to login nodes attached to each cluster, from which jobs can be submitted to the resource management and queuing system.
Resource-intensive interactive access (e.g., application graphical user interfaces) is also available using an HPC image from the Virtual Computing Laboratory (VCL).
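As a concrete sketch of the submission workflow described above: the document does not name the resource management and queuing system, so this example assumes an LSF-style scheduler (`bsub` and `#BSUB` directives); the directives and submission command would need to be adjusted to whatever the clusters actually run.

```shell
# Hypothetical batch job script, written from a login node shell.
# Scheduler syntax (LSF-style) is an assumption, not site-confirmed.
cat > myjob.sh <<'EOF'
#!/bin/bash
#BSUB -n 4              # request 4 processor cores
#BSUB -W 30             # wall-clock limit of 30 minutes
#BSUB -o myjob.%J.out   # write stdout to a per-job output file
echo "job running on $(hostname)"
EOF
# The script would then be handed to the queuing system, e.g.:
#   bsub < myjob.sh
```

The job itself runs on compute nodes assigned by the scheduler; the login node is used only for editing, compiling, and submitting.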
Shared Memory Computing
Shared memory computing services are provided by Opteron-based nodes integrated with the henry2 cluster. These nodes provide up to 16 shared memory processor cores and up to 128GB of memory, accessible through a dedicated queue.
Storage Services
Three types of storage service are available for users of the distributed memory and shared memory compute services:
- home directory storage is shared between both distributed memory and shared memory services and provides up to about a gigabyte of backed up storage;
- independent scratch storage services, including parallel file systems, are available for distributed memory and shared memory jobs providing up to a few terabytes of volatile storage with no backups;
- a shared mass storage service provides several terabytes of backed up storage for important files.
Application Software
A suite of applications is maintained for use on HPC compute services. These applications include Fortran and C/C++ compilers, code development tools, and libraries with which users can build their own custom applications.
Consulting and Collaboration
HPC computational science staff are available to assist with effective and efficient utilization of HPC compute services. This assistance may take the form of help with short, specific questions or issues using HPC services (consulting) or more general sustained assistance to a project or activity using HPC services (collaboration).
Disaster Recovery and Business Continuity Plan
HPC hardware resources are primarily distributed between two data centers. As described in more detail below, there is some balancing of resources between the two data centers. However, HPC users currently play a vital role in ensuring that their activities are resilient and can continue with an acceptable level of disruption in the case of any event affecting HPC services.
Hardware is easily replaced. Software, data, and staff expertise are the components that are essential for continuity of work after an event.
Distributed Memory Compute Resources
HPC distributed memory compute resources share a common hardware architecture with the Virtual Computing Laboratory (VCL) that is based on IBM BladeCenter technology. VCL/HPC BladeCenter hardware is distributed between NC State data center 1 (DC1), NC State data center 2 (DC2), and a data center at MCNC.
If one of the data centers were unavailable, the BladeCenter resources in the remaining data center would provide a reduced level of service, requiring that services and applications using BladeCenter resources be prioritized.
Shared Memory Compute Resource
The HPC shared memory compute resource is integrated with the henry2 cluster in DC2. Applications that run on the shared memory nodes can also run on the other BladeCenter nodes, though with reduced memory and processor capability.
Therefore, if DC2 (or just the shared memory nodes) were unavailable, work from the shared memory nodes would be shifted to other BladeCenter nodes and prioritized along with distributed memory and VCL workloads.
Storage Resources
- Home directory storage for henry2 is physically located on disks in DC2 and backed up to a tape library at MCNC. If DC2 were lost, the contents of /home could be restored on storage at MCNC.
- Scratch storage is physically located with the resource it supports: sam scratch storage at MCNC and henry2 scratch storage in DC2. Due to the nature of scratch storage, it is not backed up and should not contain any critical data that cannot be regenerated.
- Mass storage is distributed between DC1 and DC2: ~12TB is located in DC1 and ~34TB in DC2. Storage in DC2 is hierarchically managed, with older files migrating to a tape library located in DC2. Currently, backups of both DC2 mass storage (/ncsu/volume1) and DC1 mass storage (/ncsu/volume2) go to the tape library in DC2.
The bottom line on HPC storage is that there are scenarios (total loss of DC2) that place the /ncsu/volume1 file system data (both primary and backup copies) at risk. The HPC group is seeking affordable ways to improve the resilience of centrally managed HPC data. However, for now, users must take responsibility for maintaining a copy of critical source code and data in a location outside the /ncsu/volume1 file system.
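One simple way for a user to maintain such an independent copy is to mirror a critical directory to storage outside /ncsu/volume1. The sketch below uses throwaway demo paths under /tmp as stand-ins for the real mount points; `cp -a` is used here, though a tool such as rsync works similarly.

```shell
# Demo paths only; in practice SRC would be under /ncsu/volume1 and
# DEST would be any storage outside that file system.
SRC=/tmp/hpc_demo/project
DEST=/tmp/hpc_demo/offsite_copy

mkdir -p "$SRC" "$DEST"
echo 'int main(void) { return 0; }' > "$SRC/main.c"  # placeholder critical file

cp -a "$SRC/." "$DEST/"   # mirror the directory tree, preserving attributes
ls "$DEST"
```

Running such a copy on a regular basis (for example from cron, or after each significant change) keeps the independent copy current.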
HPC application software used on henry2 is stored on disks physically located in DC2 and backed up to the tape library at MCNC. Application software used on sam is stored on disks physically located at MCNC and is currently not backed up. These applications were delivered on media that is stored in HLB or were downloaded via the network. Therefore, in the event that the primary and backup copies of the applications were lost, they could be reloaded from physical media or downloaded again.
License keys for the applications also reside on disks in DC2 and are backed up to the tape library at MCNC. If these copies of the license keys were lost, the keys could be recovered from email: they were delivered via email, which is stored on Google email server(s) and archived on Postini.
HPC Staff Resources
There are three full-time HPC staff members plus a director with some ongoing operational responsibility. There is only limited overlap between staff members' responsibilities. The absence of a single HPC staff member results in a reduced level of service across a broad range of activities, as another staff member has to prioritize the activities of two full-time positions, one including unfamiliar responsibilities. The absence of two (or more) HPC staff members would leave substantial areas (or all areas) with no knowledgeable person to fulfill those responsibilities.
HPC Business Continuity
Data essential for HPC business continuity is maintained in a MySQL database operated by OIT Systems. This database is accessed using web interfaces that are implemented on web servers operated by both OIT Systems and HPC.
Internal HPC operational documentation is maintained on a web server operated by HPC. This documentation is backed up to HPC storage physically located in DC1 with backup in DC2.
Last modified: April 21 2012 10:29:01.