Accessibility statement: we seek to make the HPC web pages accessible to all users. If you encounter accessibility issues with HPC web pages, please send a description of the problem by email to eric_sills@ncsu.edu. Thank you.
NC State
Office of Information Technology
High Performance Computing
Services

    Overview

    NC State Office of Information Technology (OIT) offers a number of intermediate level HPC services to support research and instruction. These services are available to all NC State faculty.

      To access HPC services, an NC State faculty member requests an HPC project. Once the project is established, normally within one business day of the request, the faculty member can add individual accounts for students or collaborators. The faculty member who requested the project is responsible for all resource use by that project.

    Distributed Memory Computing

    Distributed memory computing services are provided by two Linux clusters, henry2 and sam. Users access these services by connecting via ssh to the login nodes attached to each cluster, from which jobs can be submitted to the resource management and queuing system.

    Resource-intensive interactive access (e.g., application graphical user interfaces) is also available using an HPC image from the Virtual Computing Laboratory (VCL).

    Shared Memory Computing

    Shared memory computing services are provided by Opteron-based nodes integrated with the henry2 cluster. These nodes provide up to 16 shared memory processor cores and up to 128 GB of memory, accessible through a dedicated queue.

    Storage

    Three types of storage service are available for users of the distributed memory and shared memory compute services:

    1. Home directory storage is shared between the distributed memory and shared memory services and provides up to about a gigabyte of backed up storage.
    2. Independent scratch storage services, including parallel file systems, are available for distributed memory and shared memory jobs and provide up to a few terabytes of volatile storage with no backups.
    3. A shared mass storage service provides several terabytes of backed up storage for important files.

    Applications

    A suite of applications is maintained for use on HPC compute services. These applications include Fortran and C/C++ compilers, code development tools, and libraries with which users can build their own custom applications.

    Consulting and Collaboration

    HPC computational science staff are available to assist with effective and efficient utilization of HPC compute services. This assistance may take the form of help with short, specific questions or issues using HPC services (consulting) or more general sustained assistance to a project or activity using HPC services (collaboration).




    Disaster Recovery and Business Continuity Plan

    Overview

    HPC hardware resources are primarily distributed between two data centers. As described in more detail below, there is some balancing of resources between the two data centers. However, HPC users currently play a vital role in ensuring that their activities are resilient and can continue with minimal disruption in the case of any event affecting HPC services.

    Hardware is easily replaced. Software, data, and staff expertise are the components that are essential for continuity of work after an event.

    Distributed Memory Compute Resources

    HPC distributed memory compute resources share a common hardware architecture with the Virtual Computing Laboratory (VCL) that is based on IBM BladeCenter technology. VCL/HPC BladeCenter hardware is distributed between NC State data center 1 (DC1), NC State data center 2 (DC2), and a data center at MCNC. If one of the data centers were unavailable, the BladeCenter resources in the remaining data centers would provide a reduced level of service, requiring that applications using BladeCenter services be prioritized.

    Shared Memory Compute Resource

    The HPC shared memory compute resource is integrated with the henry2 cluster in DC2. Applications that run on the shared memory nodes can also run on the other BladeCenter nodes, with reduced memory and processor capability.

    Therefore, if DC2 (or just the shared memory nodes) were unavailable, work from the shared memory nodes would be shifted to other BladeCenter nodes. This work would have to be prioritized along with distributed memory and VCL workloads to determine which jobs would run first.

    HPC Storage

    1. Home directory storage for henry2 is physically located on disks in DC2 and backed up to a tape library in DC2. An event affecting DC2 would affect both the primary and backup copy of HPC home directory storage. [Plans are underway to relocate backup of henry2 /home to a tape library at MCNC.]
    2. Scratch storage is physically located with the resource it supports: sam scratch storage at MCNC and henry2 scratch storage in DC2. Due to the nature of scratch storage, it is not backed up, and it should not contain any critical data that could not be regenerated.
    3. Mass storage is distributed between DC1 and DC2, utilizing the university storage management system (SMS) hardware for disk storage and the HPC tape library for tape storage.

      HPC uses 16 TB of SMS space: 8 TB is physically located in DC1 and 8 TB in DC2. All HPC SMS space is backed up to the HPC tape library in DC2. Therefore, no data would be at risk from an event affecting DC1, but an event affecting DC2 could destroy both the primary copy of the data residing on DC2 disks and its backup in the DC2 tape library.
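
    The placement described in items 1 to 3 can be checked mechanically. The following is a minimal Python sketch, not part of any HPC tooling: the `Store` record and the `at_risk` helper are hypothetical names introduced for illustration, and the locations are simply transcribed from the list above. It reports which stores would have no surviving copy if a given site were lost.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Store:
    name: str
    primary: str             # site holding the disk copy
    backup: Optional[str]    # site holding the tape copy, None if not backed up


# Locations transcribed from the storage list above (current state, before
# the planned move of the henry2 /home backup to MCNC).
STORES = [
    Store("henry2 /home", "DC2", "DC2"),
    Store("henry2 scratch", "DC2", None),
    Store("sam scratch", "MCNC", None),
    Store("mass storage (DC1 half, 8 TB)", "DC1", "DC2"),
    Store("mass storage (DC2 half, 8 TB)", "DC2", "DC2"),
]


def at_risk(stores, lost_site):
    """Return the names of stores whose every copy lives at lost_site."""
    result = []
    for s in stores:
        copies = {s.primary} if s.backup is None else {s.primary, s.backup}
        if copies <= {lost_site}:
            result.append(s.name)
    return result


if __name__ == "__main__":
    for site in ("DC1", "DC2", "MCNC"):
        print(site, "->", at_risk(STORES, site))
```

    Under these assumptions, losing DC1 leaves every store with at least one copy, while losing DC2 removes all copies of henry2 /home, henry2 scratch, and the DC2 half of mass storage.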

    The bottom line on HPC storage is that some scenarios place the centrally managed data (both primary and backup copies) at risk. The HPC group is seeking affordable ways to improve the resilience of centrally managed HPC data. For now, however, users must take responsibility for maintaining a copy of critical source code and data in a location outside the centrally managed HPC storage.
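
    One way a user might prepare such an outside copy is to pack a project directory into a single archive and record a checksum, so the copy can be verified after it is moved. The sketch below is an illustration only, using the Python standard library; the function names and the example paths are hypothetical, and how the archive is transferred to a location outside HPC storage is left to the user.

```python
import hashlib
import tarfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_with_checksum(source_dir, archive_path):
    """Pack source_dir into a gzip-compressed tar archive and return its digest."""
    source = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname stores paths relative to the directory name, not the full path
        tar.add(source, arcname=source.name)
    return sha256_of(archive_path)


def verify(archive_path, expected_digest):
    """Check that an archive copy still matches the digest recorded earlier."""
    return sha256_of(archive_path) == expected_digest
```

    For example, `archive_with_checksum("my_project", "my_project.tar.gz")` returns a digest that can be stored alongside the archive; after copying the archive elsewhere, `verify` on the copy confirms it arrived intact.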

    Applications

    HPC application software used on henry2 is stored on disks physically located in DC2 and backed up to the tape library in DC2. Application software used on sam is stored on disks physically located at MCNC and is currently not backed up. These applications were delivered on media that is stored in HLB or were downloaded via the network. Therefore, if the primary and backup copies of the applications were lost, they could be reloaded from physical media or downloaded again.

    License keys for the applications also reside on disks in DC2 and are backed up to the tape library in DC2. The keys were delivered via email, which is stored on the campus email server(s), so if these copies of the license keys were lost, they could be restored from there.

    HPC Staff Resources

    There are three full-time HPC staff members, and there is only limited overlap between their responsibilities. Absence of a single HPC staff member would result in a reduced level of service across a broad range of activities, as another staff member would have to prioritize the activities of two full-time positions, one of which includes unfamiliar responsibilities. Absence of two (or more) HPC staff members would leave substantial areas (or all areas) with no knowledgeable person to fulfill those responsibilities.

    HPC Business Continuity

    Data essential for HPC business continuity is maintained in a MySQL database operated by OIT Systems. This database is accessed using web interfaces that are implemented on web servers operated by both OIT Systems and HPC.

    Internal HPC operational documentation is maintained on a web server operated by HPC. This documentation is backed up to HPC SMS storage physically located in DC1 with backup in DC2.

Last modified: July 23 2009 10:34:58.
Office of Information Technology | NC State University | Raleigh, NC 27695 | Accessibility Statement | Policy Disclaimer | Contact Us