Getting Started with the IBM Blade Center Linux Cluster at NC State ...
Login nodes are mainly for compiling code, copying and editing files, and submitting jobs to LSF to run as batch.
The login nodes should not be used for interactive jobs that take any
significant fraction of system resources. The usual way to run
CPU intensive codes is to submit them as batch jobs to LSF, which
schedules them for execution on computational nodes. Example LSF job
submission files can be found under Intel Compilers.
Nevertheless, it is often necessary to use interactive, GUI-based serial
pre- and post-processors for data resident in the HPC environment.
Interactive computing in the HPC environment should be performed by
requesting a VCL HPC service. To request a VCL node with the
HPC environment, go to the web page http://vcl.ncsu.edu.
- Click on "Make a VCL Reservation".
- From the list of environments, select "HPC(Redhat Linux)".
When a node is available, you will receive a message detailing
how to log in. You have exclusive use of the node for four
hours; the reservation can be extended by a few hours if the system is
not busy. If you have problems getting an HPC VCL node, e-mail
gary_howell@ncsu.edu to ask to be added to the list of eligible
users.
Henry2 System Configuration
There are currently 131 two-way nodes running at 2.8 GHz or 3.0 GHz.
Each node has two Xeon processors, four GB of memory,
and a 40 GB disk. There are an additional two nodes available
for code development and debugging. The blade center nodes are
managed by the LSF queuing system and can be accessed
only through LSF. Logins for the cluster are handled by the
login nodes (login.hpc.ncsu.edu).
Additional information on the university Linux cluster
configuration is available in
http://hpc.ncsu.edu/Documents/hpc_cluster_config.pdf
Logging onto the cluster
SSH access is supported to the login
nodes (login.hpc.ncsu.edu).
Logins are enabled using Unity IDs and
passwords.
Free SSH clients are available from various sources.
Links to some commonly used versions are included here:
- Windows
- Unix, Linux
File Systems
AFS files are not available from the
cluster. Users have a home directory that is
shared by all the cluster nodes. Also, the
/usr/local file system is shared by
all nodes. Each node currently has its own
/scratch file system that is
available to all users. Two shared scratch file
systems /share and /share3
are also available on each node.
Additionally, the HPC storage system,
/ncsu/volume1 and /ncsu/volume2,
is available from the login nodes for storage
in excess of what can be accommodated in /home.
These file systems are also available from the
IBM POWER5 system.
User files in /home, /ncsu/volume1, and /ncsu/volume2
are backed up daily. A single backup version is
maintained for each file. User files in all other
file systems are not backed up.
Important files should never be placed on storage
that is not backed up unless another copy of the
file exists in another location.
HPC projects are allocated 100GB of storage
in one of the hpc storage systems (volume1 or
volume2). Additional backed up space in these
file systems can be purchased or leased.
Additional information about storage on HPC
resources is available from
http://hpc.ncsu.edu/Documents/GettingStartedstorage.php
Compiling
There are three compiler flavors available
on the cluster: 1) the standard gnu compilers
supplied with linux, 2) the Intel compilers,
and 3) the Portland Group compilers.
The default gnu compilers are good for compiling
utility programs, but are not as appropriate
for computationally intensive applications.
Overall the best performance has been observed
using the Intel compilers. Moreover, good debuggers
and profilers are available with the Intel compilers.
See
A note on compiling
executables with large (> ~1 GB) memory requirements.
See
Serial Compilers on the
Blade Center
for information on how to compile serial codes on the blade center.
This can be useful if you want code to run on some other Linux box.
Long serial jobs to run on the Blade Center should be submitted to
the LSF queue. Running computationally intensive jobs
on the head node can lock it up and force a reboot, which costs
other users their current work, so such jobs
are killed when found.
- GNU Compilers
The gnu compilers are available in the default
path and are invoked with the cc and
f77 commands for the C/C++ and Fortran77
compilers respectively. For parallel codes
the MPICH library compiled with the gnu compilers
is available in /usr/local/gnu/mpich-rhel3/mpich-1.2.6-3.2.3/lib.
To set environmental variables to use the gnu compilers, type
add gnu
The following commands compile and link a simple parallel program:
mpif77 -c ring.f
g77 -o rring ring.o -L/usr/local/gnu/mpich-rhel3/mpich-1.2.6-3.2.3/32/lib \
-lfmpich -lmpich -L/usr/local/gnu/gcc-lib/i386-redhat-linux/3.2.3 -lg2c
If the file named brring contains
#! /bin/csh
#BSUB -W 10
#BSUB -n 4
#BSUB -o /share/foouser/ring.out.%J
#BSUB -e /share/foouser/ring.err.%J
#BSUB -J ring
mpiexec ./rring
Then executing the command (from the same window from which
/usr/local/gnu/mpich-rhel3/gnu-rhel3.csh was sourced)
bsub < brring
(having changed foouser to your own user name) submits the code for execution.
The -W 10 line sets a job
limit of ten minutes, and -n 4 asks for 4 processors. Since only
a few processors and a short time are asked for, the job will
be submitted to the debug queue, and hence return quickly.
The standard output goes to the file /share/foouser/ring.out.xxxxxx
where xxxxxx is the LSF job ID. Similarly, because of the -e flag,
the standard error goes to /share/foouser/ring.err.xxxxxx.
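The substitution of your own user name for foouser can be scripted. The following is a minimal sketch, assuming a POSIX shell and the standard id and sed utilities; the output file name brring.mine is a hypothetical choice for this example, and the bsub step is left as a comment because it only works on a login node.

```shell
# Recreate the example job script from the text (foouser is the placeholder).
cat > brring <<'EOF'
#! /bin/csh
#BSUB -W 10
#BSUB -n 4
#BSUB -o /share/foouser/ring.out.%J
#BSUB -e /share/foouser/ring.err.%J
#BSUB -J ring
mpiexec ./rring
EOF

# Substitute the current login name for the placeholder.
me=$(id -un)
sed "s/foouser/$me/g" brring > brring.mine

# Show the lines that now point at your own directory under /share.
grep '^#BSUB -[oe]' brring.mine

# On a login node the script would then be submitted with:
#   bsub < brring.mine
```

Keeping the original brring untouched makes it easy to regenerate the personalised copy if the template changes.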
After job submission, a user can track the job progress by
entering
bhist
or
bhist -l
and kill the job by entering
bkill xxxxxx
where xxxxxx is the LSF job ID returned by bhist. If the job
has started running, standard output and error can be accessed
by entering
bpeek
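The job ID is also printed by bsub at submission time, normally in a line of the form shown in the sketch below. The sketch extracts the ID from such a line so it can be passed to bhist, bpeek, or bkill; the sample message, job ID, and queue name are made up for illustration, and the LSF commands are left as comments because they need a real job.

```shell
# Sample bsub message; the job ID and queue name are made up for illustration.
msg='Job <123456> is submitted to queue <debug>.'

# Extract the digits between the first pair of angle brackets.
jobid=$(printf '%s\n' "$msg" | sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p')
echo "$jobid"

# With a real job the ID would then be passed to the LSF commands, e.g.:
#   bhist -l "$jobid"
#   bpeek "$jobid"
#   bkill "$jobid"
```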
Parallel programmers are strongly
encouraged to use the Intel or Portland Group
compilers to generate more efficient code. Those constructing
code for others to use should consider that code compiled
with the Intel compilers is likely to be portable to
other platforms.
- Intel Compilers
To use the Intel compilers it is necessary
to properly configure some environment
variables and paths. This is easily accomplished
by sourcing /usr/local/apps/env/intel.csh.
Once this file has been sourced, the Intel
compilers with links to the MPICH libraries may be invoked
with the mpif77, mpif90, mpicc and
mpiCC commands for the Fortran77, Fortran90, C and C++
compilers respectively.
As a convenience an alias - add - has been created
for csh/tcsh users to set up the environment for various
software packages. To use the Intel compilers the command
add intel
will set the necessary environment variables.
Parallel programs compiled with the Intel compilers
should be linked with the MPICH libraries located
under /usr/local/intel/mpich.
The following command line would compile a Fortran MPI
code with a high level of optimization:
mpif90 -o rring -O3 -tpp7 -xW -static ring2.f
At this time (May 2005), Intel compiled codes require
the -static (specifying use of static .a libraries ) flag for
successful execution.
Similar scripts are available for C (mpicc), C++ (mpiCC),
and Fortran77 (mpif77).
If the file named brring contains
#! /bin/csh
#BSUB -W 10
#BSUB -n 4
#BSUB -o /share/foouser/ring.out.%J
#BSUB -e /share/foouser/ring.err.%J
#BSUB -J ring
mpiexec ./rring
Then executing the command (from the same window in which
the Intel environment was set up)
bsub < brring
(having changed foouser to your own user name)
submits the code for execution. The discussion above
under the gnu compilers shows what some of the flags mean.
- Portland Group Compilers
To use the Portland Group compilers it is necessary
to properly configure some environment variables
and paths.
A shortcut is available for csh/tcsh users
through the add alias. The command
add pgi
will configure the environment to use the
Portland Group compilers. The same job submission
script as in the gnu and Intel examples also works
for code compiled with the Portland Group compilers.
It is not recommended to use the Intel
and Portland Group compilers during the same
login session.
Once the environment has been set, the Portland Group compilers
may be invoked with the pgcc, pgCC,
pgf77, pgf90, and pghpf
commands for the C, C++, Fortran77, Fortran90, and
High Performance Fortran compilers respectively.
Parallel programs compiled with the Portland Group
compilers should be linked with the MPICH libraries.
Having added the pgi environment by
add pgi
the following command line would compile an MPI
Fortran 90 code with a high level of optimization:
mpif90 -o rring -fastsse ring2.f
Running Jobs
The login nodes are shared by all users. Running computationally
intensive jobs on the login nodes can cause them to stall and need
to be rebooted. Moral: don't hog the login nodes. If you need
to make extensive use of GUI-based applications, for example to set up
your batch jobs or analyze data resulting from runs of batch jobs,
one good way is to use the VCL facility, selecting an HPC
image.
Users should also refrain from running more than
one sftp or scp session at a time.
Running computationally intensive jobs on the blade center (anything,
other than a compilation, that requires more than a minute or so
to run) is accomplished
by using LSF to submit batch tasks to the compute nodes.
All tasks for the compute nodes should be
submitted to LSF.
The following steps are used to submit jobs
to LSF:
For parallel jobs it is necessary for LSF to interface with the
mpirun command to pass host information. To simplify this process
an interface script mpiexec has been provided in the
LSF bin directory. The following batch script will
run a parallel job. Note that the number of tasks will match the
number of processors requested from LSF. The path set when bsub
is invoked must include the appropriate mpirun command.
#! /bin/csh
#BSUB -o standard_output
#BSUB -e standard_error
mpiexec ./parjob.exe
To submit a parallel job use the -n option to the
bsub command to specify the number of processors
to be used.
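As in the brring examples earlier, the processor request can also be embedded in the script as a #BSUB line. The following sketch writes such a script from a POSIX shell; the processor count of 8 and the file name parjob.bsub are hypothetical values chosen for illustration, and the submission itself is shown only as a comment.

```shell
# Hypothetical processor count for this example.
nprocs=8

# Write a batch script like the one above, with the -n request embedded.
cat > parjob.bsub <<EOF
#! /bin/csh
#BSUB -n $nprocs
#BSUB -o standard_output
#BSUB -e standard_error
mpiexec ./parjob.exe
EOF

# Display the generated script for review before submitting.
cat parjob.bsub

# On a login node the script would then be submitted with:
#   bsub < parjob.bsub
```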
There are a number of queues currently configured. In
general the best queue will be selected automatically
without the user specifying a queue to the bsub command.
In some cases LSF may override user queue choices and
assign jobs to a more appropriate queue.
There is a queue that will schedule jobs on any of the
blades and accepts jobs using up to 64 processors.
The serial job queue will schedule jobs only on
selected blades. The
single_chassis queue will schedule jobs only on blades
that are located within the same chassis. Each chassis
holds 14 blades so jobs accepted by the single_chassis
queue are limited to a maximum of 28 processors.
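The 28-processor limit follows from the hardware figures given earlier: 14 blades per chassis, each with two processors. A short shell check of that arithmetic (the variable names are chosen only for this example):

```shell
# Figures taken from the text: 14 blades per chassis, 2 processors per blade.
blades_per_chassis=14
procs_per_blade=2

# Largest job that fits inside a single chassis.
max_procs=$((blades_per_chassis * procs_per_blade))
echo "single_chassis limit: $max_procs processors"
```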
A note on LSF job scheduling
LSF writes some intermediate files in the user's home
directory while the job is running. If the disk quota
has been exceeded, then the batch job will fail, often
without any meaningful error message.
Last modified: August 01 2006 15:40:18.
Copyright © 2003-2007 by
NC State University and
others, All Rights Reserved.
Site contact: Eric Sills, E-mail:
eric_sills at ncsu dot edu , Tel: 919-513-0324, Fax: 919-513-1893,
HPC and Grid Operations, Information Technology Division,
Box 7109, North Carolina State University, Raleigh,
NC27695-7914, USA