Overview
NC State Office of Information Technology (OIT) offers a
number of intermediate-level HPC services to support
research and instruction. These services are available
to all NC State faculty.
To access HPC services, an NC State faculty member
requests an HPC project. Once the project is established,
normally within one business day of the request, the faculty
member can add individual accounts for students or
collaborators.
The faculty member who requested the project is responsible
for all resource use by that project.
Distributed Memory Computing
Distributed memory computing services are provided by
a Linux cluster, henry2. Access to distributed
memory computing services is via ssh to login nodes attached to the
Linux cluster, from which jobs can be submitted to the
resource management and queuing system.
Resource-intensive interactive access
(e.g., application graphical user interfaces) is also
available using an HPC image from the
Virtual Computing Laboratory (VCL).
Shared Memory Computing
Shared memory computing services are provided by an
IBM POWER5 system running AIX. Access to shared memory
computing services is via ssh to a login node that is
part of the POWER5 system. From the login node,
compute-intensive shared memory jobs are submitted to the
resource management and queuing system.
Storage
Three types of storage service are available for
users of the distributed memory and shared memory compute
services: 1) home directory storage is shared between
both distributed memory and shared memory services and
provides up to about a gigabyte of backed up storage;
2) independent scratch storage services, including parallel
file systems, are available for distributed memory and
shared memory jobs providing up to a few terabytes of
volatile storage with no backups; and 3) a shared mass storage
service provides several terabytes of backed up storage
for important files.
Applications
A suite of applications is maintained for
use on HPC compute services. These applications include
Fortran and C/C++ compilers, code development tools,
and libraries with which users can build their own custom
applications.
Consulting and Collaboration
HPC computational science staff are available to assist
with effective and efficient utilization of HPC compute
services. This assistance may take the form of help with
short, specific questions or issues using HPC services (consulting)
or more general sustained assistance to a project or
activity using HPC services (collaboration).
Disaster Recovery and Business Continuity Plan
Overview
HPC hardware resources are distributed between two
data centers on NC State campus. As described in
more detail below, there is some balancing of
resources between the two data centers. However,
HPC users currently play a vital role in ensuring
that their activities are resilient and can
continue with minimal disruption in the case of
any event affecting HPC services.
Hardware is easily replaced. Software, data, and
staff expertise are the components that are essential
for continuity of work after an event.
Distributed Memory Compute Resources
HPC distributed memory compute resources share a
common hardware architecture with the Virtual
Computing Laboratory (VCL) that is based on IBM
BladeCenter technology. VCL/HPC BladeCenter
hardware is distributed between NC State data
center 1 (DC1) and data center 2 (DC2) with
more resources in DC2 than in DC1.
In the event of one of the data centers being
unavailable, the BladeCenter resources in the
available data center would provide a reduced
level of service, requiring that applications
using BladeCenter services be prioritized.
Shared Memory Compute Resource
The HPC shared memory compute resource is the
IBM POWER5 system, which is located in DC1.
Software currently running on the POWER5
system can be run on the BladeCenter hardware,
albeit at substantially reduced performance.
Therefore, in the event that DC1 (or just the
POWER5) were unavailable, work from the POWER5
would be shifted to the BladeCenter. This work
would have to be prioritized along with
distributed memory work to determine which
jobs would receive priority.
HPC Storage
- Home directory storage is physically
located on disks in DC2 and backed up to a
tape library in DC2. An event affecting DC2
would affect both the primary and backup copy
of HPC home directory storage.
- Scratch storage is physically located
with the resource it is supporting: POWER5 scratch
storage in DC1 and BladeCenter cluster scratch
storage in DC2. Because scratch storage is volatile,
it is not backed up and should not contain any
critical data that cannot be regenerated.
- Mass storage is distributed between
DC1 and DC2, utilizing the university storage
management system (SMS) hardware for disk storage and
the HPC tape library for tape storage.
HPC uses 16 TB of SMS space: 8 TB is physically located
in DC1 and 8 TB in DC2. All HPC SMS space is backed up
to the HPC tape library in DC2. Therefore, no mass storage
data would be at risk from an event affecting DC1 alone,
but an event affecting DC2 could affect both the primary
and backup copies of the data residing on DC2 disks.
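The exposure described in the list above can be checked with a small model.
The following is a minimal sketch, not an official planning tool: the class
and function names are hypothetical, and it assumes that an event makes
every copy held in the affected data center unavailable. The site
assignments are taken directly from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageService:
    name: str
    primary_sites: frozenset  # data centers holding the primary copy
    backup_sites: frozenset   # data centers holding a backup (empty if none)


# Locations as stated in the storage list; names are illustrative only.
SERVICES = [
    StorageService("home", frozenset({"DC2"}), frozenset({"DC2"})),
    StorageService("scratch (POWER5)", frozenset({"DC1"}), frozenset()),
    StorageService("scratch (BladeCenter)", frozenset({"DC2"}), frozenset()),
    StorageService("mass storage (DC1 disks)", frozenset({"DC1"}), frozenset({"DC2"})),
    StorageService("mass storage (DC2 disks)", frozenset({"DC2"}), frozenset({"DC2"})),
]


def at_risk(service, lost_site):
    """Return True if every copy (primary and backup) lives in lost_site."""
    all_sites = service.primary_sites | service.backup_sites
    return all_sites <= {lost_site}


def services_at_risk(lost_site):
    """List the services that would have no surviving copy."""
    return [s.name for s in SERVICES if at_risk(s, lost_site)]


if __name__ == "__main__":
    for site in ("DC1", "DC2"):
        print(site, "->", services_at_risk(site))
```

Under these assumptions, an event in DC1 leaves only POWER5 scratch storage
without a surviving copy, while an event in DC2 does the same for home
directories, BladeCenter scratch storage, and the mass storage data held on
DC2 disks.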
The bottom line on HPC storage is that there are
scenarios that place the centrally managed data
(both primary and backup copies) at risk. The HPC group
is seeking affordable ways to improve the resilience
of centrally managed HPC data. However, for now
users must take responsibility for maintaining a
copy of critical source code and data in a location
outside the centrally managed HPC storage.
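One way a user might keep such an outside copy is sketched below. This is
an illustration under stated assumptions, not an HPC-provided tool: the
function names are hypothetical, and the destination is assumed to be a
directory the user can write to that is not on centrally managed HPC
storage. Each copied file is verified with a SHA-256 checksum.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mirror_tree(source: Path, destination: Path) -> list:
    """Copy files under source into destination; return the files copied.

    Files whose destination copy already has the same checksum are skipped.
    """
    copied = []
    for src_file in sorted(p for p in source.rglob("*") if p.is_file()):
        dst_file = destination / src_file.relative_to(source)
        if dst_file.exists() and sha256_of(dst_file) == sha256_of(src_file):
            continue
        dst_file.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_file, dst_file)
        # Verify the copy before reporting it as done.
        if sha256_of(dst_file) != sha256_of(src_file):
            raise IOError(f"checksum mismatch after copying {src_file}")
        copied.append(dst_file)
    return copied
```

A call such as `mirror_tree(Path("project_src"), Path("/mnt/offsite/project_src"))`
would copy only new or changed files; both paths here are placeholders. The
sketch never deletes files from the destination, so a file removed from the
source remains in the outside copy.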
Applications
HPC application software used on both the POWER5 system
and BladeCenter cluster is stored on disks physically
located in DC2 and backed up to the tape library in DC2.
These applications were delivered on media that is
stored in HLB or were downloaded via the network.
Therefore, in the event that the primary and backup copies
of the applications were lost, they could be reloaded
from physical media or downloaded again.
License keys for the applications also reside on disks
in DC2 and are backed up to the tape library in DC2.
If these copies of the license keys were lost, the keys
could be recovered from the email in which they were
delivered, which is stored on the campus email server(s).
HPC Staff Resources
There are three full-time HPC staff members and there is
only limited overlap between their responsibilities. Absence of a
single HPC staff member would result in a reduced level of service
across a broad range of activities, as another staff member
would have to prioritize the activities of two full-time positions,
one including unfamiliar responsibilities. Absence of two
(or more) HPC staff members would leave substantial areas
of responsibility (or all of them) with no knowledgeable
person to cover them.
HPC Business Continuity
Data essential for HPC business continuity is maintained
in a MySQL database operated by OIT Systems. This database
is accessed using web interfaces that are implemented on
web servers operated by both OIT Systems and HPC.
Internal HPC operational documentation is maintained on a
web server operated by HPC. This documentation is backed up
to HPC SMS storage physically located in DC1 with backup
in DC2.