-
There are 981 dual Xeon compute nodes
in the henry2 cluster. Each node has two Xeon processors
(mix of single-, dual-, quad-, and six-core),
2 to 3 GigaBytes of memory per core, and a 36 - 146 GigaByte disk.
The nodes all have 64-bit processors. Generally, either 32-bit or 64-bit
x86 executables will run correctly.
64-bit executables are
required in order to access more than about 3GB of memory for program
data.
The BladeCenter
compute nodes are managed by the LSF resource manager and must
not be accessed except through LSF (accounts that access
compute nodes directly are subject to immediate termination).
Logins for the cluster are handled by a set of login nodes which
can be accessed as login.hpc.ncsu.edu using ssh.
Additional information on the initial henry2
configuration (c. 2004) is available in
http://hpc.ncsu.edu/Documents/hpc_cluster_config.pdf.
Some additional
information about the cluster architecture is available at
http://hpc.ncsu.edu/Hardware/henry2_architecture.php.
-
SSH access is supported to the login
nodes (login.hpc.ncsu.edu).
Logins are authenticated using Unity user names and
passwords.
Microsoft Windows users can obtain ssh clients from the
ITECS remote access page. An X11 server
for Microsoft Windows is also available from the same ITECS site.
Login nodes should not be used for interactive jobs that take any
significant amount of system resources. The usual way to run
CPU intensive codes is to submit them as batch jobs to LSF, which
schedules them for execution on computational nodes. Example LSF job
submission files can be found in Intel
Compilers.
It is sometimes necessary to use interactive, GUI-based serial
pre- and post-processors for data resident in the HPC environment.
Interactive computing in the HPC environment should be performed by
requesting a Virtual Computing Lab (VCL) HPC environment.
To use the VCL HPC environment go to the web page
http://vcl.ncsu.edu
and click on "Make a VCL Reservation". If you have not
already authenticated with your Unity ID and password you
will be prompted to do so.
From the list of environments, select "HPC (64-bit RedHat Linux)".
When the environment is ready VCL will provide information regarding
how to log in. VCL provides a dedicated environment, so heavy
interactive use will not interfere with other users.
If you have an HPC account, but have problems accessing the VCL
HPC environment, send e-mail to oit_hpc@help.ncsu.edu.
-
AFS files are not available from the
cluster (but are available on the VCL
HPC environments described above).
Users have a home directory that is
shared by all the cluster nodes. Also, the
/usr/local file system is shared by
all nodes. The home file system is backed up
daily, with one copy of each file retained.
Three NFS-mounted shared scratch file
systems, /share, /share2,
and /share3, are also available to all users.
These file systems are not backed up, and files may be
deleted from them automatically
at any time. Use of these file systems is at
the user's own risk. There is a 1TB group quota
on each of these file systems.
A parallel file system /gpfs_share is
also available. Directories on /gpfs_share
can be requested. There is a 1TB group quota
imposed on /gpfs_share. The /gpfs_share
file system is not backed up and files are subject
to being deleted at any time. Use is at the user's own
risk.
Finally, from the login nodes
the HPC mass storage file systems,
/ncsu/volume1 and /ncsu/volume2,
are available for storage
in excess of what can be accommodated in /home.
Since these file systems are not available from the
compute nodes they cannot be used for running jobs.
User files in /home, /ncsu/volume1, and /ncsu/volume2
are backed up daily. A single backup version is
maintained for each file. User files in all other
file systems are not backed up.
Important files should never be placed on storage
that is not backed up unless another copy of the
file exists in another location.
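The backup rules above can be summarized as a small lookup. The following is a minimal Python sketch that encodes only what this page states; the names BACKED_UP_ROOTS and is_backed_up, and the example paths, are hypothetical and not part of any HPC tool.

```python
# Hypothetical helper: encodes only the backup policy stated on this page.
BACKED_UP_ROOTS = ("/home", "/ncsu/volume1", "/ncsu/volume2")


def is_backed_up(path: str) -> bool:
    """Return True if path lies under a file system this page lists as backed up."""
    for root in BACKED_UP_ROOTS:
        # Match the root itself or anything below it, but not look-alike
        # names such as /homework.
        if path == root or path.startswith(root + "/"):
            return True
    # All other file systems (/share, /share2, /share3, /gpfs_share, ...)
    # are described as not backed up.
    return False


# Example paths are made up for illustration.
print(is_backed_up("/home/unityid/results.dat"))    # True
print(is_backed_up("/gpfs_share/project/tmp.dat"))  # False
```

A check like this could be run before a long job to flag output directories that need a second copy elsewhere.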
HPC projects are allocated 1TB of storage
in one of the HPC mass storage systems (volume1 or
volume2). Additional backed up space in these
file systems can be purchased or leased.
Also, a Storage Partner Program provides an option
for faculty partners to purchase additional
storage and have it network attached to the henry2
cluster either using NFS or GPFS.
Additional information about storage on HPC
resources is available from
http://hpc.ncsu.edu/Documents/GettingStartedstorage.php
-
Many software packages have already been compiled to run on the blade center.
If you click on Software in the left toolbar or go to http://www.ncsu.edu/itd/hpc/Software/Software.php, you'll see a list of
software. In many cases, there are "HowTos" which explain how to
get access and submit example jobs. Suggestions on documentation updates and
on additional software are encouraged.
-
There are three compiler flavors available
on the cluster: 1) the standard GNU compilers
supplied with Linux, 2) the Intel compilers,
and 3) the Portland Group compilers.
The default GNU compilers are okay for compiling
utility programs but in most cases are not appropriate
for computationally intensive applications.
Overall the best performance has been observed
using the Intel compilers. However, the Intel
compilers support very few extensions of the
Fortran standard - so codes written using
non-standard Fortran may fail to compile without
modifications.
The Portland Group
compilers tend to be somewhat less syntactically
strict than the Intel compilers while still
generating more efficient code than the GNU
compilers.
Additional information about use of each of these
compilers is available from the following links.
Generally objects and libraries built with different
compiler flavors should not be mixed as unexpected
behavior may result.
Users whose programs require more than ~1GB of memory
should review the following information.
A note on compiling
executables with large (> ~1 GB) memory requirements
-
The Blade Center is designed to run computationally intensive
jobs on compute nodes. Running resource intensive jobs on
the login nodes, while technically possible, is not permitted.
Please limit your use of the login nodes to editing,
compiling, and transferring files. Running more than
one concurrent file transfer program (scp, sftp, cp) from login nodes
is also discouraged.
-
To run computationally intensive jobs on the blade center
use the compute nodes.
Access to the compute nodes is managed by LSF.
All tasks for the compute nodes should be
submitted to LSF.
The following steps are used to submit jobs
to LSF:
-
After an operating system update in May 2010, many codes compiled
with MPICH-1 libraries exhibited "net_send" errors. Since the MPICH-1
libraries are no longer maintained by the mpich developers, MPICH was
updated to MPICH-2.
Compiling and running with the MPICH-2 parallel libraries uses the following
syntax.
For MPICH-2, environment variables are set with "add pgi_hydra" or
"add intel_hydra" for the PGI and Intel compilers, respectively.
Alternatively, use "source /usr/local/apps/env/pgi_hydra.csh" or
"source /usr/local/apps/env/intel_hydra.csh".
For MPICH-2 compiled codes, a job submission script bfoo looks like
#! /bin/csh
#BSUB -o standard_output.%J
#BSUB -e standard_error.%J
#BSUB -n 4
#BSUB -W 15
#BSUB -R span[ptile=2]
setenv MPICH_NO_LOCAL 1
mpiexec_hydra ./parjob.exe
The span[ptile=2] requests that job tasks be distributed
two per node. This specification is optional and can range from
1 to 12. Specifying a specific number of tasks per node may result in
a longer wait in the queue for the available resources to
match the request.
The setenv MPICH_NO_LOCAL 1 specifies that
all MPI messages will be passed through sockets, not using
shared memory available on a node.
If setenv MPICH_NO_LOCAL 1 is omitted, the
span[ptile=...] setting must remain. Some possible alternative lines would be
unsetenv MPICH_NO_LOCAL
#BSUB -R span[ptile=4]
which would allocate 4 MPI processes on each node, or
unsetenv MPICH_NO_LOCAL
#BSUB -R span[ptile=8]
which would allocate 8 MPI processes on each node with two quad-core processors (8 cores total).
"span[ptile=8]" restricts the choice of nodes on which
LSF can schedule jobs to empty quad core nodes.
Quad core nodes are usually not empty, so asking for 8 cores
on a node can entail a long wait before running.
If the number of MPI processes on each node (set with -R span[ptile=...])
is not specified, then the line "setenv MPICH_NO_LOCAL 1" is necessary.
But even with "setenv MPICH_NO_LOCAL 1", a ptile setting often helps
job execution performance. (Absent a ptile setting, many processes may land on a few nodes. Runtime bottlenecks can occur as many processes communicate through a few sockets.)
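To see how a ptile setting shapes a request, it can help to work out how many nodes LSF has to find. The following is a minimal Python sketch; nodes_needed is a hypothetical helper, and it assumes LSF fills each node with up to ptile tasks, so the last node may receive fewer.

```python
import math


def nodes_needed(num_tasks: int, ptile: int) -> int:
    """Estimate the nodes required for -n num_tasks with span[ptile=ptile].

    Assumption: each node receives at most ptile tasks.
    """
    if num_tasks < 1 or ptile < 1:
        raise ValueError("num_tasks and ptile must be positive")
    return math.ceil(num_tasks / ptile)


# The example script above: -n 4 with span[ptile=2].
print(nodes_needed(4, 2))  # 2
# The same 4 tasks with span[ptile=4] fit on a single node.
print(nodes_needed(4, 4))  # 1
```

A smaller ptile spreads the job over more nodes, while a larger ptile needs fewer nodes but each must have that many free cores, which is why large ptile values can wait longer in the queue.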
-
Henry2 nodes are a mix of single-, dual-, quad-, and six-core processors.
Total processor cores per node range from 2 to 16. All the
processor cores on a node share access to all of the
memory on the node. Individual nodes can be used to run
programs written using a shared memory programming model
- such as OpenMP.
To submit a shared memory job to use multiple cores on a single node use
the bsub options -n 1 -x. These request one task
and exclusive use of a node. An example submission
file might be
#! /bin/csh
#BSUB -o out.%J
#BSUB -e err.%J
#BSUB -n 16
#BSUB -R span[hosts=1]
#BSUB -W 15
#BSUB -q shared_memory
setenv OMP_NUM_THREADS 16
./exec
If the above file is shmemjob, it could be submitted by the command
bsub < shmemjob
and will run on a node with 16 cores.
Shared memory jobs can also be run on other nodes,
but with access to fewer total processor cores.
A script such as the following would use nodes
with two quad-core processors to access 8 total
processor cores.
#! /bin/csh
#BSUB -o out.%J
#BSUB -e err.%J
#BSUB -n 8
#BSUB -R span[hosts=1]
#BSUB -W 15
setenv OMP_NUM_THREADS 8
./exec
The number of job slots requested, -n 8 in this example,
needs to match the number of threads the parallel job will use
(OMP_NUM_THREADS). The resource request must specify
span[hosts=1] to ensure that LSF assigns all the requested
job slots on the same node - so they will have access to the
same physical memory.
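Because the slot count and the thread count must agree, one way to avoid a mismatch is to generate both from a single value. The following is a minimal Python sketch under that assumption; openmp_job_script is a hypothetical helper, and the directives it emits are copied from the 8-core example above.

```python
def openmp_job_script(threads: int, executable: str = "./exec", minutes: int = 15) -> str:
    """Build a csh submission script whose -n and OMP_NUM_THREADS always match."""
    if threads < 1:
        raise ValueError("threads must be positive")
    lines = [
        "#! /bin/csh",
        "#BSUB -o out.%J",
        "#BSUB -e err.%J",
        f"#BSUB -n {threads}",              # job slots requested
        "#BSUB -R span[hosts=1]",           # keep all slots on one node
        f"#BSUB -W {minutes}",              # wall clock limit in minutes
        f"setenv OMP_NUM_THREADS {threads}",  # same value as -n
        executable,
    ]
    return "\n".join(lines) + "\n"


# Print a script equivalent to the 8-core example above.
print(openmp_job_script(8), end="")
```

The output could be saved to a file and submitted with bsub in the same way as the hand-written scripts.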
See the individual compilers for the flags needed to compile codes to enable
OpenMP shared memory parallelism.
Short course lecture notes on OpenMP from the fall of 2009 give some instructions
for converting a Fortran or C code to use OpenMP parallelism.
-
Normally, when running a hybrid parallel job, you want to place 1 MPI process on each node and, under that MPI process, use all the cores available on that node. The following is a simple sample script that can be used to run a hybrid parallel job "hybrid-job".
#!/bin/csh
#BSUB -o standard_output.%J
#BSUB -e standard_error.%J
#BSUB -n 16
#BSUB -x
#BSUB -R "qc span[ptile=1]"
#BSUB -W 15
source /usr/local/apps/env/intel_mpich2_hydra-101.csh
setenv OMP_NUM_THREADS `grep processor /proc/cpuinfo | wc -l`; mpiexec_hydra ./hybrid-job
If the script is named hybrid-job.csh, then it can be submitted by the command
bsub < hybrid-job.csh
The following specifications in the above script are necessary for running a hybrid parallel job:
- The specification of -x requests exclusive use of each node.
- The specification of span[ptile=1] requests that 1 MPI process be placed on each node. Thus, there are 16 nodes and each node gets 1 MPI process.
- The specification of qc means that you are requesting quad-core nodes. This enables you (most probably) to get nodes with the same type of cores and the same number of cores on each node. (If the nodes have different types of cores or different numbers of cores, then some nodes may be under-utilized.) You may change qc to dc to request dual-core nodes.
- The source step is necessary for setting up the appropriate Hydra MPICH2 related environment variables. Depending on your situation, you may need to source a different file such as /usr/local/apps/env/pgi_mpich2_hydra-105.csh.
- The command
setenv OMP_NUM_THREADS `grep processor /proc/cpuinfo | wc -l`
sets the environment variable OMP_NUM_THREADS to the number of cores on each node, regardless of how many cores the node has. This ensures that all the cores on each node are utilized.
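The counting step in that command can be illustrated outside the shell. The following is a minimal Python sketch that mirrors the grep and wc pipeline by counting lines containing the word "processor"; count_processors is a hypothetical helper, and the sample text is a made-up, abbreviated /proc/cpuinfo excerpt.

```python
def count_processors(cpuinfo_text: str) -> int:
    """Count lines containing 'processor', as `grep processor | wc -l` does."""
    return sum(1 for line in cpuinfo_text.splitlines() if "processor" in line)


# Made-up excerpt in the usual x86 Linux layout, for illustration only.
SAMPLE = (
    "processor\t: 0\n"
    "model name\t: Example CPU\n"
    "\n"
    "processor\t: 1\n"
    "model name\t: Example CPU\n"
)

threads_per_node = count_processors(SAMPLE)
print(threads_per_node)       # 2
# With 1 MPI process on each of 16 nodes, total threads across the job:
print(16 * threads_per_node)  # 32
```

On a real node the text would come from reading /proc/cpuinfo, and the count reflects whatever processors the operating system reports there.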
-
A number of LSF queues are configured on the henry2
cluster. Often the best queue will be selected
without the user specifying a queue to the bsub command.
In some cases LSF may override user queue choices and
assign jobs to a more appropriate queue.
Jobs requesting 4 or fewer processors and 15 minutes or less time
are assigned to the
debug queue and run with minimal wait times. Once
a user is satisfied a job is running well, more time will typically
be requested.
Queues available to all users support jobs running on up to 128 processors
for one day or jobs running for up to a week on up to 16 processors.
Jobs that need up to two hours and up to 28 processors are run in a
queue that has access to nearly all cluster nodes [generally the
queues open to all users only have access to nodes that were purchased
with central funding]. Jobs that require 28 or fewer processors (but more
than 2 hours) are placed in the single chassis queue. Jobs in this
queue are scheduled on nodes located within the same physical blade
chassis - resulting in better message passing bandwidth and lower latency
for messages.
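The limits described above can be written down as simple checks. The following is a minimal Python sketch that encodes only the numbers stated on this page; the function names are hypothetical, and the actual LSF configuration may apply additional rules or override these choices.

```python
MINUTES_PER_DAY = 24 * 60


def is_debug_eligible(processors: int, minutes: int) -> bool:
    """Requests of 4 or fewer processors and 15 minutes or less go to debug."""
    return processors <= 4 and minutes <= 15


def fits_general_limits(processors: int, minutes: int) -> bool:
    """Check the limits stated for queues open to all users.

    Either up to 128 processors for one day,
    or up to 16 processors for one week.
    """
    one_day_wide = processors <= 128 and minutes <= MINUTES_PER_DAY
    one_week_narrow = processors <= 16 and minutes <= 7 * MINUTES_PER_DAY
    return one_day_wide or one_week_narrow


print(is_debug_eligible(4, 15))                    # True
print(fits_general_limits(64, 2 * MINUTES_PER_DAY))  # False
```

A request that fails fits_general_limits would need fewer processors, a shorter run time, or access to a partner queue.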
Partners, those who have purchased nodes to add to the henry2 cluster,
may add the bsub option -q partnerqueuename to place their job
in the partner queue. Partner queues are dedicated for use of the
partner and their project and have priority access to the
quantity of processors the partner has added to the cluster.
A note on LSF job scheduling provides
some additional details regarding how LSF is configured on henry2
cluster.
LSF writes some intermediate files in the user's home
directory as jobs are starting and running. If the
user's disk quota
has been exceeded, then the batch job will fail, often
without any meaningful error messages or output. The
quota command will display usage of the /home
file system.