-
The henry2 cluster has more than 175 dual Xeon compute nodes
with clock speeds of 2.8GHz to 3.2GHz. Each node has two Xeon processors,
four gigabytes of memory, and a 36 or 40 gigabyte disk.
Additional
nodes are available for code development and debugging. The BladeCenter
compute nodes are managed by the LSF resource manager and must
not be accessed except through LSF (accounts directly accessing
compute nodes are subject to immediate termination).
Logins for the cluster are handled by a set of login nodes which
can be accessed as login.hpc.ncsu.edu using ssh.
Additional information on the university Linux cluster
configuration is available in
http://hpc.ncsu.edu/Documents/hpc_cluster_config.pdf
-
SSH access is supported to the login
nodes (login.hpc.ncsu.edu).
Logins are authenticated using Unity user names and
passwords.
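As a minimal sketch of the login step, the helper below prints the ssh command for a given Unity user name. The host name comes from the text above; the function name hpc_ssh_command and the user name jdoe are placeholders chosen for illustration, not part of the HPC environment.

```shell
#!/bin/sh
# Login host named in the text above.
LOGIN_HOST="login.hpc.ncsu.edu"

# hpc_ssh_command is a hypothetical helper: it prints the ssh command
# for a given Unity user name instead of opening a connection.
hpc_ssh_command() {
    printf 'ssh %s@%s\n' "$1" "$LOGIN_HOST"
}

# "jdoe" is a placeholder Unity ID, not a real account.
hpc_ssh_command jdoe
```

Running the printed command from a terminal prompts for the Unity password.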
NC State Windows users can obtain ssh clients from
the ITECS remote access page. An X11 server
for Windows is also available from the same ITECS site for
users with Unity IDs.
Login nodes should not be used for interactive jobs that take any
significant fraction of system resources. The usual way to run
CPU intensive codes is to submit them as batch jobs to LSF, which
schedules them for execution on compute nodes. Example LSF job
submission files can be found in the Intel
Compilers documentation.
Nevertheless, it is sometimes necessary to use interactive, GUI-based serial
pre- and post-processors for data resident in the HPC environment.
Interactive computing in the HPC environment should be performed by
requesting a VCL HPC service. To request a VCL node with the
HPC environment, go to the web page http://vcl.ncsu.edu.
Click on "Make a VCL Reservation".
From the list of environments, select "HPC(Redhat Linux)".
When a node is available, you will receive a message detailing
how to log in. You can have exclusive use of the node for four
hours (the reservation can be extended by a few hours if the system is
not busy). If you have an HPC account but have problems getting an
HPC VCL node, send e-mail to gary_howell@ncsu.edu.
-
AFS files are not available from the
cluster.
Users have a home directory that is
shared by all the cluster nodes. Also, the
/usr/local file system is shared by
all nodes. Each node currently has its own
/scratch file system that is
available to all users. Two shared scratch file
systems /share and /share3
are also available to all users on each node.
An HPC Storage Partner Program provides faculty
the option of purchasing additional storage to
directly connect to NC State HPC resources.
Additionally, the HPC mass storage system,
/ncsu/volume1 and /ncsu/volume2,
is available from the login nodes for storage
in excess of what can be accommodated in /home.
These file systems are also available from other
NC State HPC login nodes (e.g. from the POWER5 shared
memory system login node).
User files in /home, /ncsu/volume1, and /ncsu/volume2
are backed up daily. A single backup version is
maintained for each file. User files in all other
file systems are not backed up.
Important files should never be placed on storage
that is not backed up unless another copy of the
file exists in another location.
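One way to follow this advice is to copy results from a scratch file system to backed-up storage and verify the copy before relying on it. The sketch below assumes a hypothetical helper named safe_copy; the paths in the example call are placeholders, and the mass storage file systems are reachable only from the login nodes.

```shell
#!/bin/sh
# safe_copy is a hypothetical helper: copy a file into a destination
# directory, then compare the copy byte-for-byte with the original.
safe_copy() {
    src=$1
    dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    cp -p "$src" "$dest_dir/" || return 1
    cmp -s "$src" "$dest_dir/$(basename "$src")"
}

# Example call (both paths are placeholders, not real project directories):
#   safe_copy /share/myuser/results.dat /ncsu/volume1/myproject/results
```

The function returns a nonzero status if the copy fails or the two files differ, so it can be used in a script before deleting the scratch copy.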
HPC projects are allocated 100GB of storage
in one of the HPC mass storage systems (volume1 or
volume2). Additional backed-up space in these
file systems can be purchased or leased.
Additional information about storage on HPC
resources is available from
http://hpc.ncsu.edu/Documents/GettingStartedstorage.php
-
There are three compiler flavors available
on the cluster: 1) the standard GNU compilers
supplied with Linux, 2) the Intel compilers,
and 3) the Portland Group compilers.
The default GNU compilers are okay for compiling
utility programs but in most cases are not appropriate
for computationally intensive applications.
Overall, the best performance has been observed
using the Intel compilers. The Portland Group
compilers tend to be somewhat less syntactically
strict and also provide somewhat better
debugging capabilities.
Additional information about use of each of these
compilers is available from the following links.
Generally objects and libraries built with different
compiler flavors should not be mixed as unexpected
behavior may result.
Users whose programs require more than ~1GB of memory
should review the following information:
A note on compiling
executables with large (> ~1 GB) memory requirements.
Also, programs
with memory requirements of more than ~3GB are not
supported on the 32-bit Xeon architecture used on most
of the cluster nodes. A number of 64-bit Xeon EM64T
nodes are available, along with a 64-bit login node
(login64.hpc.ncsu.edu). These nodes can support
codes with larger memory requirements; however, the
physical memory installed on the nodes is only four
gigabytes.
-
The BladeCenter is designed to run computationally intensive
jobs on compute nodes. Running jobs on the head node is
possible, but if several users run computationally intensive
jobs on the head node at one time, the node can stall
and require rebooting. Users who stall the head node by using
it for computation will be put in stocks on the village
green and be required to perform community service.
So please be polite and limit your use of the head node to editing,
compiling, and transferring files. Running more than
one file transfer program (scp, sftp, cp) from the head node
at a time is also not desirable.
To run computationally intensive jobs on the BladeCenter, we
use the compute nodes.
Access to the compute nodes is managed by LSF.
All tasks for the compute nodes should be
submitted to LSF.
The following steps are used to submit jobs
to LSF:
For parallel jobs it is necessary for LSF to interface with the
mpirun command to pass host information. To simplify this process,
an interface script, mpiexec, has been provided in the
LSF bin directory. The following batch script will
run a parallel job. Note that the number of tasks will match the
number of processors requested from LSF. The path set when bsub
is invoked must include the appropriate mpirun command.
#! /bin/csh
#BSUB -o standard_output
#BSUB -e standard_error
mpiexec ./parjob.exe
To submit a parallel job use the -n option to the
bsub command to specify the number of processors
to be used.
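The submission step can be sketched as follows. The script writes the batch script shown above to a file and prints the matching bsub command rather than running it. The file name parjob.csh and the processor count of 8 are illustrative assumptions; bsub reads the script on standard input so that the #BSUB lines are interpreted as options.

```shell
#!/bin/sh
# Write the batch script from the text above to a file.
JOB_SCRIPT="parjob.csh"   # hypothetical file name
cat > "$JOB_SCRIPT" <<'EOF'
#! /bin/csh
#BSUB -o standard_output
#BSUB -e standard_error
mpiexec ./parjob.exe
EOF

NPROCS=8   # illustrative processor count, passed to bsub with -n

# Build and print the submission command. On a login node, type the
# printed command at the prompt to submit the job.
SUBMIT_CMD="bsub -n $NPROCS < $JOB_SCRIPT"
echo "$SUBMIT_CMD"
```

With these settings, mpiexec starts 8 tasks, matching the number of processors requested from LSF.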
There are a number of queues currently configured. In
general, the best queue will be selected automatically
without the user specifying a queue to the bsub command.
In some cases LSF may override user queue choices and
assign jobs to a more appropriate queue.
There is a queue that will schedule jobs on any of the
blades and accepts jobs using up to 64 processors.
The serial job queue will schedule jobs only on
selected blades. The
single_chassis queue will schedule jobs only on blades
that are located within the same chassis. Each chassis
holds 14 blades so jobs accepted by the single_chassis
queue are limited to a maximum of 28 processors.
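The 28-processor limit follows from 14 blades per chassis times two processors per blade. As a small sketch, the check below tests whether a requested processor count fits within a single chassis; the function name fits_single_chassis is a hypothetical helper, not an LSF command.

```shell
#!/bin/sh
BLADES_PER_CHASSIS=14
PROCS_PER_BLADE=2     # each node has two Xeon processors
MAX_SINGLE_CHASSIS=$((BLADES_PER_CHASSIS * PROCS_PER_BLADE))

# fits_single_chassis is a hypothetical helper: it succeeds when the
# requested processor count is between 1 and the single-chassis limit.
fits_single_chassis() {
    [ "$1" -ge 1 ] && [ "$1" -le "$MAX_SINGLE_CHASSIS" ]
}

echo "single_chassis limit: $MAX_SINGLE_CHASSIS processors"
```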
A note on LSF job scheduling
LSF writes some intermediate files in the user's home
directory while the job is running. If the disk quota
has been exceeded, then the batch job will fail, often
without any meaningful error message.
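Because a full home directory can make a job fail silently, it can help to check disk usage before submitting. The sketch below measures usage with du and compares it to a limit. The function names are hypothetical, du reports usage rather than the quota itself, and the limit in the example call is a placeholder since the actual quota is not stated here.

```shell
#!/bin/sh
# usage_kb prints the disk usage of a directory in kilobytes.
usage_kb() {
    du -sk "$1" | awk '{print $1}'
}

# under_limit DIR LIMIT_KB succeeds while the directory's usage is
# below the given limit in kilobytes.
under_limit() {
    [ "$(usage_kb "$1")" -lt "$2" ]
}

# Example call (1000000 KB is a placeholder, not the real quota):
#   under_limit "$HOME" 1000000 || echo "clean up before running bsub"
```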