Accessibility statement: we seek to make the HPC web pages accessible to all users. If you encounter accessibility issues with HPC web pages please send a description of the problem by email to eric_sills@ncsu.edu - thank you.

High Performance and Grid Computing
   

 Getting Started with the IBM BladeCenter Linux Cluster at NC State ...



  • Henry2 System Configuration

    There are more than 175 dual-Xeon compute nodes (2.8GHz-3.2GHz) in the henry2 cluster. Each node has two Xeon processors, four gigabytes of memory, and a 36 or 40 gigabyte disk. Additional nodes are available for code development and debugging. The BladeCenter compute nodes are managed by the LSF resource manager and may be accessed only through LSF (accounts that access compute nodes directly are subject to immediate termination).

    Logins for the cluster are handled by a set of login nodes which can be accessed as login.hpc.ncsu.edu using ssh.
    Additional information on the university Linux cluster configuration is available at http://hpc.ncsu.edu/Documents/hpc_cluster_config.pdf

  • Logging onto the cluster

    The login nodes (login.hpc.ncsu.edu) are accessible via SSH. Logins are authenticated using Unity user names and passwords. NC State Windows users can obtain SSH clients from the ITECS remote access page, and users with Unity IDs can also obtain an X11 server for Windows from the same ITECS site.
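    For users connecting from a Linux or Mac workstation with OpenSSH (Windows users would use the ITECS-supplied clients instead), a host alias saves retyping the login node name. The shell function below is a sketch: it appends a Host entry to ~/.ssh/config using standard ssh_config keywords. The function name and the alias henry2 are illustrative choices, not site conventions, and unityid stands for your own Unity user name.

```shell
#!/bin/sh
# add_henry2_host UNITYID [CONFIG_FILE]
# Sketch: append an OpenSSH client entry so that "ssh henry2" connects to
# login.hpc.ncsu.edu as UNITYID with X11 forwarding enabled (for GUI
# programs). The alias "henry2" is an illustrative choice, not a site name.
add_henry2_host() {
    user=$1
    config=${2:-$HOME/.ssh/config}
    mkdir -p "$(dirname "$config")" || return 1
    # Leave the file alone if an entry for the alias already exists.
    if [ -f "$config" ] && grep -q '^Host henry2$' "$config"; then
        return 0
    fi
    cat >> "$config" <<EOF
Host henry2
    HostName login.hpc.ncsu.edu
    User $user
    ForwardX11 yes
EOF
    # ssh rejects a config file that other users can write to.
    chmod 600 "$config"
}
```

    After running add_henry2_host unityid once, ssh henry2 opens a session on login.hpc.ncsu.edu with X11 forwarding, so GUI programs display on your local X server.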

    Login nodes should not be used for interactive jobs that take any significant fraction of system resources. The usual way to run CPU-intensive codes is to submit them as batch jobs to LSF, which schedules them for execution on the compute nodes. Example LSF job submission files can be found on the Intel Compilers page.

    Nevertheless, it is sometimes necessary to use interactive, GUI-based serial pre- and post-processors on data resident in the HPC environment. Such interactive computing should be performed by requesting a VCL HPC service. To request a VCL node with the HPC environment, go to http://vcl.ncsu.edu.

    Click on "Make a VCL Reservation"

    From the list of environments, select "HPC(Redhat Linux)"

    When a node is available, you will receive a message detailing how to log in. You have exclusive use of the node for four hours (the reservation can be extended by a few hours if the system is not busy). If you have an HPC account but have problems getting an HPC VCL node, send e-mail to gary_howell@ncsu.edu.

  • File Systems

    AFS files are not available from the cluster.

    Users have a home directory that is shared by all the cluster nodes, and the /usr/local file system is also shared by all nodes. Each node currently has its own /scratch file system that is available to all users. Two shared scratch file systems, /share and /share3, are also available to all users on every node. An HPC Storage Partner Program gives faculty the option of purchasing additional storage that connects directly to NC State HPC resources. Additionally, the HPC mass storage file systems /ncsu/volume1 and /ncsu/volume2 are available from the login nodes for storage in excess of what can be accommodated in /home; these file systems are also available from other NC State HPC login nodes (e.g. the POWER5 shared memory system login node).

    User files in /home, /ncsu/volume1, and /ncsu/volume2 are backed up daily. A single backup version is maintained for each file. User files in all other file systems are not backed up.

    Important files should never be placed on storage that is not backed up unless another copy of the file exists in another location.
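    One way to follow this rule when results are produced on /scratch or /share is to copy them to backed-up storage and verify the copy before deleting anything. The function below is a minimal sketch of that idea; the name keep_copy is hypothetical, and cmp is used only to confirm that the copy matches the original byte for byte.

```shell
#!/bin/sh
# keep_copy FILE DESTDIR
# Sketch: copy FILE (e.g. a result written to /scratch or /share, which are
# not backed up) into DESTDIR on backed-up storage such as the home
# directory, and succeed only if the copy is byte-for-byte identical.
keep_copy() {
    src=$1
    destdir=$2
    mkdir -p "$destdir" || return 1
    cp -p "$src" "$destdir/" || return 1
    cmp -s "$src" "$destdir/$(basename "$src")"
}
```

    For example, keep_copy /share/myuserid/output $HOME/results copies a result file from shared scratch into a results directory under the backed-up home directory (myuserid and results are placeholders).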

    HPC projects are allocated 100GB of storage in one of the HPC mass storage file systems (volume1 or volume2). Additional backed-up space in these file systems can be purchased or leased.

    Additional information about storage on HPC resources is available from http://hpc.ncsu.edu/Documents/GettingStartedstorage.php

  • Compiling

    There are three compiler flavors available on the cluster: 1) the standard GNU compilers supplied with Linux, 2) the Intel compilers, and 3) the Portland Group compilers.

    The default GNU compilers are adequate for compiling utility programs but in most cases are not appropriate for computationally intensive applications. Overall, the best performance has been observed with the Intel compilers. The Portland Group compilers tend to be somewhat less syntactically strict and also provide somewhat better debugging capabilities.

    Additional information about the use of each of these compilers is available from the following links. In general, objects and libraries built with different compiler flavors should not be mixed, as unexpected behavior may result.
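    The rule about not mixing flavors is easiest to follow if a build script chooses one compiler up front and uses it for every compile and link step. The sketch below assumes the usual C compiler driver names for the three flavors (gcc, icc, and pgcc) are on your PATH; how the Intel and Portland Group compilers are added to the PATH on the cluster is not covered here, and -O2 is simply a common optimization level accepted by all three, not a tuned recommendation. The function names are hypothetical.

```shell
#!/bin/sh
# cc_for FLAVOR: print the C compiler driver for one compiler flavor
# (gnu -> gcc, intel -> icc, pgi -> pgcc).
cc_for() {
    case $1 in
        gnu)   echo gcc ;;
        intel) echo icc ;;
        pgi)   echo pgcc ;;
        *)     echo "unknown compiler flavor: $1" >&2; return 1 ;;
    esac
}

# build_prog FLAVOR OUTPUT SOURCE...: compile and link all sources with a
# single compiler so objects from different flavors are never mixed.
build_prog() {
    compiler=$(cc_for "$1") || return 1
    out=$2
    shift 2
    "$compiler" -O2 -o "$out" "$@"
}
```

    For example, build_prog intel myprog main.c solver.c compiles and links both (hypothetical) source files with icc only.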

    Users whose programs need more than ~1GB of memory should review the following note:
    A note on compiling executables with large (> ~1 GB) memory requirements

    Also, programs that need more than ~3GB of memory are not supported on the 32-bit Xeon architecture used on most of the cluster nodes. A number of 64-bit Xeon EM64T nodes are available, along with a 64-bit login node (login64.hpc.ncsu.edu). These nodes can support codes with larger memory requirements; however, the physical memory installed on each node is only four gigabytes.
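    The ~3GB figure follows from the address size: a 32-bit pointer can address at most 2^32 bytes = 4 GB, and a 32-bit Linux kernel normally reserves part of that range for itself, leaving a process roughly 3 GB. Before running a large-memory code it can therefore help to confirm that you are on a 64-bit node. The function below is a sketch that relies on uname -m, which reports x86_64 on a 64-bit (EM64T) Linux system; the function name is hypothetical.

```shell
#!/bin/sh
# is_64bit_arch [ARCH]
# Succeed if ARCH (default: this machine's `uname -m`) is 64-bit x86.
# EM64T nodes running 64-bit Linux report x86_64; 32-bit Xeon nodes
# report an i386-family name such as i686.
is_64bit_arch() {
    arch=${1:-$(uname -m)}
    case $arch in
        x86_64) return 0 ;;
        *)      return 1 ;;
    esac
}
```

    For example, is_64bit_arch || echo "32-bit node" prints a warning on a 32-bit node, and the standard file command reports whether an executable such as ./job.exe was built as an ELF 32-bit or ELF 64-bit program.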

  • Running Jobs

    The BladeCenter is designed to run computationally intensive jobs on compute nodes. Running jobs on the head node is possible, but if several users run computationally intensive jobs on the head node at the same time, the node can stall and require rebooting. Users who stall the head node by using it for computation will be put in the stocks on the village green and required to perform community service.

    So please be polite and limit your use of the head node to editing, compiling, and transferring files. Running more than one file transfer program (scp, sftp, cp) from the head node at a time is also undesirable.

    Computationally intensive jobs on the BladeCenter run on the compute nodes. Access to the compute nodes is managed by LSF, and all tasks for the compute nodes should be submitted to LSF.

    The following steps are used to submit jobs to LSF:

    • Create a script file containing the commands to be executed for your job:
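      #! /bin/csh
      #BSUB -o standard_output
      #BSUB -e standard_error

      # Stage the input file in the shared scratch area
      # (replace myuserid with your own user name)
      cp input /share/myuserid/input
      cd /share/myuserid
      # Run the program; job.exe must already be in /share/myuserid
      ./job.exe < input
      # Copy the results back to the backed-up home directory
      cp output /home/myuserid
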
    • Use the bsub command to submit the script to the batch system. The -W option sets the run time limit as hours:minutes; the following example requests two hours of run time:
      bsub -W 2:00 < script.csh
      
    • The bjobs command can be used to monitor the progress of a job.
    • The -e and -o options specify the files for standard error and standard output, respectively. If these are not specified, standard output and standard error will be sent by email to the account submitting the job.
    • The bpeek command can be used to view standard output and standard error for a running job.
    • The bkill command can be used to remove a job from LSF (regardless of current job status).
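    The csh job script above can also be written for /bin/sh. The sketch below additionally embeds the run time limit as a #BSUB -W directive, so bsub < script.sh is enough, and uses %J in the output file names, which LSF replaces with the job ID. The per-job directory under /share/$USER, the function name stage_and_run, and the assumptions that job.exe and input sit in the submission directory and that job.exe writes a file named output are all illustrative choices.

```shell
#!/bin/sh
#BSUB -W 2:00
#BSUB -o standard_output.%J
#BSUB -e standard_error.%J

# stage_and_run SCRATCH_ROOT RESULTS_DIR   (absolute paths)
# Copies ./input and ./job.exe from the submission directory into a
# per-job directory under SCRATCH_ROOT, runs job.exe there (assumed to
# write a file named "output"), then copies "output" to RESULTS_DIR.
# The body runs in a subshell so the "cd" does not leak out, and each
# step runs only if the previous one succeeded.
stage_and_run() (
    work="$1/job.${LSB_JOBID:-$$}"
    mkdir -p "$work" &&
    cp input job.exe "$work/" &&
    cd "$work" &&
    ./job.exe < input &&
    cp output "$2/"
)

# LSF sets LSB_JOBID inside a batch job, so the work is done only then.
if [ -n "${LSB_JOBID:-}" ]; then
    stage_and_run "/share/$USER" "$HOME"
fi
```

    Because LSB_JOBID is set only inside a batch job, running or sourcing the script by hand does nothing, which makes it safe to try the function interactively first.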

    For parallel jobs, LSF must interface with the mpirun command to pass host information. To simplify this process, an interface script, mpiexec, has been provided in the LSF bin directory. The following batch script will run a parallel job; note that the number of tasks will match the number of processors requested from LSF. The path set when bsub is invoked must include the appropriate mpirun command.

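    #! /bin/csh
    #BSUB -o standard_output
    #BSUB -e standard_error

    # mpiexec passes the LSF host list to mpirun and starts one task
    # per processor requested from LSF
    mpiexec ./parjob.exe
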

    To submit a parallel job, use the -n option of the bsub command to specify the number of processors to be used (for example, bsub -n 8 -W 2:00 < script.csh).
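    Instead of passing -n on the command line, the processor count can be embedded in the script as a #BSUB -n directive. The /bin/sh sketch below combines that with the parallel csh script above and also reports how many processor slots LSF granted, read from the LSB_HOSTS environment variable, which lists one host name per allocated slot. The counts (4 processors, one hour) and the host names in the comment are arbitrary example values, the function name is hypothetical, and the PATH must still include the appropriate mpirun command.

```shell
#!/bin/sh
#BSUB -n 4
#BSUB -W 1:00
#BSUB -o standard_output.%J
#BSUB -e standard_error.%J

# count_slots: print the number of processor slots LSF allocated.
# LSB_HOSTS repeats a host name once per slot, e.g. "n1 n1 n2 n2"
# for -n 4 spread over two dual-processor blades.
count_slots() {
    set -- ${LSB_HOSTS:-}
    echo "$#"
}

# Only launch MPI inside an LSF batch job (LSF sets LSB_JOBID).
if [ -n "${LSB_JOBID:-}" ]; then
    echo "Running on $(count_slots) processors: $LSB_HOSTS"
    mpiexec ./parjob.exe
fi
```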

    Several queues are currently configured. In general, the best queue will be selected automatically without the user specifying a queue to the bsub command. In some cases, LSF may override a user's queue choice and assign the job to a more appropriate queue.

    One queue schedules jobs on any of the blades and accepts jobs using up to 64 processors. The serial job queue schedules jobs only on selected blades. The single_chassis queue schedules jobs only on blades located within the same chassis; since each chassis holds 14 two-processor blades, jobs accepted by the single_chassis queue are limited to a maximum of 28 processors.
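    The processor limits above amount to simple arithmetic: 14 blades per chassis x 2 processors per blade = 28 processors for single_chassis, and 64 for the queue that uses any blade. The sketch below only restates those limits for a given processor count; the label "any-blade queue" is descriptive, not a real queue name, the function name is hypothetical, and LSF still makes the final queue choice. The bqueues command lists the queues that are actually configured.

```shell
#!/bin/sh
# Limits quoted in the text: 14 blades per chassis, 2 processors per
# blade, and at most 64 processors in the queue that uses any blade.
BLADES_PER_CHASSIS=14
CPUS_PER_BLADE=2
ANY_BLADE_MAX=64

# which_fit N: describe which of the parallel queues could accept an
# N-processor job. This only restates the limits; LSF makes the choice.
which_fit() {
    chassis_max=$((BLADES_PER_CHASSIS * CPUS_PER_BLADE))   # 14 x 2 = 28
    if [ "$1" -le "$chassis_max" ]; then
        echo "single_chassis or any-blade queue"
    elif [ "$1" -le "$ANY_BLADE_MAX" ]; then
        echo "any-blade queue only"
    else
        echo "too many processors"
    fi
}
```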

    A note on LSF job scheduling

    LSF writes some intermediate files in the user's home directory while the job is running. If the disk quota has been exceeded, then the batch job will fail, often without any meaningful error message.
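    A quick check of home directory usage before submitting can avoid this failure mode. The function below is a sketch that measures usage with du -sk and compares it with a limit you supply in kilobytes; the quota command, where available, reports your actual limit, and no particular quota value is assumed here. The function name is hypothetical.

```shell
#!/bin/sh
# home_usage_ok LIMIT_KB [DIR]
# Succeed if DIR (default: $HOME) currently uses fewer than LIMIT_KB
# kilobytes as measured by du. Choose LIMIT_KB a little below your
# actual quota; no quota value is assumed here.
home_usage_ok() {
    limit=$1
    dir=${2:-$HOME}
    used=$(du -sk "$dir" 2>/dev/null | awk '{print $1}')
    [ -n "$used" ] && [ "$used" -lt "$limit" ]
}
```

    For example: if home_usage_ok 900000; then bsub < script.csh; else echo "home directory is near its quota"; fi, where 900000 is only a placeholder limit.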


Copyright © 2003-2007 by NC State University and others, All Rights Reserved.
HPC & Grid (Version 1.4)

Site contact: Eric Sills, E-mail: eric_sills at ncsu dot edu, Tel: 919-513-0324, Fax: 919-513-1893, HPC and Grid Operations, Information Technology Division, Box 7109, North Carolina State University, Raleigh, NC 27695-7914, USA