Question. Desktop and even laptop computers are very fast. Why are there specially designated high performance computers?

Distributed memory machines typically cost less per processor than shared memory machines. Essentially they are built from standard computers networked together. For processor A to access data resident in the memory of processor B, both processors must take part in passing a message. The standard library for this interaction is MPI (the Message Passing Interface), called from C or Fortran codes. Converting a code to run in parallel is somewhat harder in the distributed memory MPI environment than in the shared memory OpenMP environment, and fewer commercial codes exist for distributed memory machines.

In defense of distributed memory computation, the largest and fastest machines in the world use the distributed memory model. Codes written for the distributed memory model also work on shared memory machines. Over the last decade, many public domain distributed memory codes have been developed. In many academic specialties, teams working to solve a given kind of problem produce open source codes. Since most academic environments support distributed memory computation on Beowulfs or other clusters, these codes can be run "everywhere".
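As a rough illustration of what "passing a message" means, the sketch below mimics two processors with two Python threads that exchange data only through queues. This is an analogy, not MPI: the names worker_b and fetch_from_b are made up for the example, and real MPI processes run on separate machines with separate memories. The point is only that A obtains B's data through a send that B matches with a receive.

```python
import queue
import threading


def worker_b(inbox, outbox):
    """'Processor B': holds data that A cannot read directly."""
    private_data = {"x": 3.14, "y": 2.71}  # lives only inside B
    key = inbox.get()                      # receive that matches A's send
    outbox.put(private_data[key])          # send the requested value back


def fetch_from_b(key):
    """'Processor A': asks B for a value by exchanging two messages."""
    a_to_b = queue.Queue()
    b_to_a = queue.Queue()
    b = threading.Thread(target=worker_b, args=(a_to_b, b_to_a))
    b.start()
    a_to_b.put(key)        # A sends its request
    value = b_to_a.get()   # A blocks until B's reply arrives
    b.join()
    return value


if __name__ == "__main__":
    print(fetch_from_b("x"))
```

In a real MPI code the two queue operations would correspond to a send on one process and a receive on the other, and both calls have to appear in the program for the transfer to complete.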
If you want to parallelize your code with the least amount of effort, you
can convert it to a shared memory code using the OpenMP paradigm. Such codes
tend to work well on up to 4 or 8 processors. For an introduction, see the
OpenMP Tutorial.
Since the p690 is crowded, that's about as many processors as you can get.
We plan to continue supporting such OpenMP codes.
The Blade Architecture

A "blade" is a 2 CPU node. There are 4 Gbytes of memory per node, for a total of 392 Gbytes of memory, so a message passing code that took all the nodes could solve a very large problem. Fourteen blades fit into a "chassis", with the chassis of 28 CPUs occupying a 4U slot in a rack. Within the 28 CPU chassis, blades are connected by a GigE (Gigabit Ethernet) switch. Five chassis and a few RAID boxes for hard drive storage can be packed into a single rack, with the chassis connected to each other by another GigE switch. An MPI "ping" between two processes on a chassis requires 60 to 100 microseconds. Hard drives in the RAID boxes and blades can be "hot" swapped. We have standby diesel power to provide uninterrupted power and cooling.

Blade Centers are a popular product for IBM. Many of the top 500 computers in the world have the same GigE interconnected blade chassis that we use. They can be thought of as a flavor of "Beowulf", but with the advantage of having the bugs worked out and with cluster management software provided by IBM. We purchase the equipment with a three year on-site hardware warranty. Thus the Blade Center is a highly reliable resource.

The Blade Center provides a Linux software environment for running and porting software. We also provide software support, aiding users in porting and using standard software packages and providing training, as we are able.

The following are some typical ways the Blade Center is used.

Use 1. Run lots of serial jobs at one time.
Suppose you have an Abaqus or Ansys job to run. Even on a large
shared memory machine, the optimal number of processors for an Abaqus
code is usually two. If you want to solve
a large problem, the Blade Center nodes are likely a better choice than your own PC
because they have more memory. Also, the global file system means you can
potentially run several instances at once. If you want more than
4 Gbytes of memory, you'll need a shared memory machine.
For example, if you have an Ansys problem too large for 4 Gbytes
of memory, you can run it on the p690 instead. It will wait longer in the queue
and won't run impressively fast. The queueing system
on the Blade Center is designed so that parallel rather than
serial jobs take most of the computing time.
Use 2. Run parallel jobs.
Often people run already developed
parallel codes. The obstacles are porting the code and learning to set
up data decks. The queueing system
is adjusted to make sure that everyone gets a share.
Jobs that request less time do not wait as long to run. The "sweet spot"
of the machine is jobs that fit within a chassis of 28 CPUs (between chassis,
latencies are longer and network bandwidth is lower).
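The chassis figures quoted in the Blade Architecture description (2 CPUs per blade, 14 blades per chassis, 4 Gbytes per blade) make it easy to check whether a job stays inside the "sweet spot". The sketch below is only arithmetic on those numbers. The function names are invented for the example, and it assumes the scheduler packs a job into as few chassis as possible, which this page does not promise.

```python
import math

CPUS_PER_BLADE = 2        # "A blade is a 2 CPU node"
BLADES_PER_CHASSIS = 14   # fourteen blades fit into a chassis
GBYTES_PER_BLADE = 4      # 4 Gbytes of memory per node
CPUS_PER_CHASSIS = CPUS_PER_BLADE * BLADES_PER_CHASSIS  # 28


def chassis_needed(ncpus):
    """Fewest chassis that can hold ncpus processes (assumes ideal packing)."""
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    return math.ceil(ncpus / CPUS_PER_CHASSIS)


def job_memory_gbytes(ncpus):
    """Total memory on the blades a job occupies, counting whole blades."""
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    blades = math.ceil(ncpus / CPUS_PER_BLADE)
    return blades * GBYTES_PER_BLADE


if __name__ == "__main__":
    for n in (4, 28, 29, 196):
        print(n, "CPUs:", chassis_needed(n), "chassis,",
              job_memory_gbytes(n), "Gbytes")
```

A job of 28 processes can fit in one chassis, while a job of 29 must cross the slower link between chassis, so rounding a request down to a multiple that fits may be worth considering.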
Use 3. Convert code from serial to parallel, or aid in
extending and developing parallel libraries.
For these purposes, you'll need to learn to use the
MPI library. See the LLNL MPI tutorial
and the NCSU HPC MPI short course. To aid debugging, the queueing system gives priority
to jobs requesting small amounts of time and a few processors.
1) As with any resource available to a large number of users, there is competition for the resource. Large jobs may spend a while in the queue.
2) To use more than 2 processors, codes must use message passing. These programs can be hard to write (but maybe someone else has already written the code, and you can learn to use it or revise it only slightly).
3) Jobs too large to fit on one chassis of 28 CPUs suffer from higher communication overhead.
4) The processors are 32 bit, so the address space is "only" 4 GBytes. (We're getting in a few 64 bit Opterons to see if they are worthwhile. They will have 9 Gbytes of RAM per 2 CPU computational node.)

1) Many users also apply for time at other sites with larger machines. Large publicly available clusters are at the Pittsburgh Supercomputing Center and the San Diego Supercomputer Center. The DOE facilities at ORNL (Oak Ridge) are available for some projects. DOD sponsored projects can get time at the MSRCs (Major Shared Resource Centers) ERDC, ARL, NAVO and Aberdeen.
2) There are some system wide resources for the North Carolina state universities; see the web page UNC Supercomputing.
3) There may be other resources on campus you can use, e.g., the PAMS cluster, the statistics cluster, some Sun boxes for Abaqus or Matlab, and the Biogrid. The PAMS cluster has about as many CPUs as the ITD blade center, but is somewhat less uniform and does not have as much RAM on each node. A number of faculty have their own clusters, to which you can conceivably get access.
4) Grid computing. There is a North Carolina Biogrid project, for which NC State is a node. There is also an on-campus grid of Macs.
5) 1-1-1 computing is planned as a way of providing virtual lab seats. Log in and a box will be assigned to you, pre-imaged with the operating system and software you requested. The assigned box is a fast Xeon with lots (for a PC) of RAM. A pilot program is scheduled for the fall of 2004.
Of the 196 CPUs in the blade center, the first 160 were donated by IBM or purchased with NC State funds. In the last several months, additional blades have been purchased by individual faculty research projects. Typically, faculty "partners" have had prior experience in purchasing their own clusters and are relieved to get the benefits of ownership without the ongoing administrative burden. Currently, the cost per blade (2 CPUs with 4 GBytes of RAM and a three year hardware warranty) is about $5K. By purchasing blades, projects get priority access. Partners get professional system maintenance, use of the global file system, and access to a larger system than the number of CPUs they purchased. When partners buy blades, other users gain too, because these resources are made available at times when the purchaser is not using them. For more information on partnering, see the web page Faculty Partners.

There are details below on submitting parallel jobs via an LSF bsub job submission script. For specific information on running on the p690, see the web page //www.ncsu.edu/itd/hpc/Documents/GettingStartedp690_content.html
For information on running on the BladeCenter, see the web page //www.ncsu.edu/itd/hpc/Documents/GettingStartedbc_content.html

We have occasional success in persuading software companies to modify license agreements and also give us the academic price. For information on commercial software currently available, see the web page //www.ncsu.edu/itd/hpc/Software/software-status.html

Among general purpose software, we have several commercial compilers. These include the Intel and Portland Group compilers on the BladeCenter. On the p690, we have the IBM compilers. Both machines have the gnu compilers.

We have two good parallel debuggers on the BladeCenter: the Portland Group pgdbg debugger and Totalview. On the p690, we have dbx and pdbx (the parallel version), both of which are good but do not have graphical interfaces.
The gdb command line debuggers are available on the BladeCenter. For information on debuggers, see the web pages //www.ncsu.edu/itd/hpc/Documents/debug2.html and A Totalview tutorial.

The Intel compilers produce the fastest code. They link well to the Intel math library. To add the Intel environment, type
>add intel
at the prompt. (Here and in the following, the > represents the prompt, so you would type only "add intel".) ifc is the Fortran compiler and icc is the C compiler. For a list of suitable compiler options, see the web page //www.ncsu.edu/itd/hpc/Documents/GettingStartedbc_content.html

The Portland Group compilers are more forgiving in linking. They tend to produce code that is somewhat slower in execution. To add the Portland Group environment, type
>add pgi
at the prompt. If in doubt about which environment you're in, log out, get a fresh xterm, and try again. The C and Fortran compilers are pgcc, pgf90, and pgf77.

The gnu environment is the most likely to work with open source codes. A disadvantage is that the gnu Fortran 90 compiler isn't yet complete. The gnu environment is the default.

In any of the environments, you can use the mpicc and mpif77 commands, which invoke the correct compilers for that environment and handle linking to the MPI parallel libraries (there is a different MPI library for each environment). The default mpiexec command (embedded within a bsub command) calls the appropriate mpirun command, thereby executing an MPI program.

To submit a job, type
>bsub < bfoo
As before, the ">" represents the command line prompt and should not be typed. The "<" is a redirect symbol and is necessary so that the bsub command interprets the script bfoo (listed next) as input.
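The script bfoo itself does not appear in this copy of the page. As a stand-in, here is a minimal sketch that prints a script of the kind the following paragraph describes, using the -W, -n, -i and -o lines it mentions. The executable name ./lu_test, the output name out.%J, and the bare "mpiexec" launch line are assumptions made for illustration; the original script may have differed.

```python
def render_bsub_script(executable, nprocs, minutes, infile=None,
                       outfile="out.%J"):
    """Return the text of an LSF submission script as one string."""
    lines = [
        f"#BSUB -W {minutes}",  # wall-clock limit, in minutes
        f"#BSUB -n {nprocs}",   # number of processors requested
    ]
    if infile is not None:
        lines.append(f"#BSUB -i {infile}")  # file used as standard input
    lines.append(f"#BSUB -o {outfile}")     # %J becomes the LSF job number
    lines.append(f"mpiexec {executable}")   # launch line (assumed form)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    script = render_bsub_script(
        "./lu_test",                              # hypothetical executable
        nprocs=4,
        minutes=5,
        infile="/home/gwhowell/blastries/LU.in",  # input file named on this page
    )
    print(script, end="")
```

Saving the printed text as bfoo and typing "bsub < bfoo" at the prompt would submit it. Check "man bsub" on the system before relying on any option.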
It will run for at most 5 minutes (from the line -W 5), which is the allowed wall-clock time. If it has not successfully completed in five minutes, it will die with error message "2". It's a good idea to ask for a bit more time than your job requires, but not too much more, as shorter jobs are scheduled to run first. The -n 4 line specifies that four processors are needed. The -i line specifies /home/gwhowell/blastries/LU.in as an input file. You could also read input in the usual way by opening a file from C or Fortran. The -o line specifies an output file. The %J puts the current LSF job number in the output file name.

The only file systems available for writing from the computational nodes are /home and /share (or /share2). The /share directories are purged after a couple of weeks, so sending output files to /share will not permanently clutter up your directories. But be sure to save your data. The file specified by -o will not appear until the job exits from LSF.

The above script, since it asks for only 4 processors and five minutes, will run in the debug queue with high priority. So if you're making quick changes to code and want to see if they work, try asking for few processors and a small amount of time.

One way to "chain" jobs, so that the next job can use the output of the last, is the -w option. For example, you can break a long computation into a chain of two hour jobs, each of which is more likely to run than a single twelve hour job. Two hour jobs are eligible to run on partner nodes not currently being used by partners.
>man bsub
will give you a man page for bsub submission scripts.

Once you've submitted a job, you may want to see if it has started or is still running.
>bhist
>bjobs
tell you the status of your current jobs. If you want to see how many jobs are lined up, try
>bjobs -u all
>bhist -l
will give you more detail about your job.
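Returning to the -w chaining option mentioned above, the sketch below builds the command lines for a chain in which each job waits for the previous one. It assumes the jobs are given names with -J and that the dependency is written as done(name), both standard LSF features as far as we know. The job names stage1, stage2, ... and the script names are invented for the example.

```python
import shlex


def chained_bsub_lines(scripts, prefix="stage"):
    """Return shell lines that submit the scripts as a dependency chain."""
    lines = []
    for k, script in enumerate(scripts, start=1):
        args = ["bsub", "-J", f"{prefix}{k}"]  # -J gives the job a name
        if k > 1:
            # -w: hold this job until the previous one finished successfully
            args += ["-w", f"done({prefix}{k - 1})"]
        quoted = " ".join(shlex.quote(a) for a in args)
        lines.append(f"{quoted} < {shlex.quote(script)}")
    return lines


if __name__ == "__main__":
    # Three hypothetical two hour pieces of one long computation.
    for line in chained_bsub_lines(["bfoo1", "bfoo2", "bfoo3"]):
        print(line)
```

Each script would still carry its own -W line (for example, two hours), and each piece has to write the files the next piece reads.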
To get a look at what's come out in the file specified by -o, try
>bpeek
To kill a parallel job, use the LSF job ID number bnum (a 4 or 5 digit number returned by bhist or bjobs) and type
>bkill bnum
Typing
>bhosts
will give you the current status of all nodes. For a more informative display, see www.ncsu.edu/itd/hpc/Monitor/LSF_status.php

For many academic specialties, there are open source codes. Financially and legally, these are easy to install. The informal mutual support groups are often better than the formal support attached to academic licenses, with specific instructions and troubleshooting cases posted on the internet. Examples:
magpar: Parallel Finite Element Micromagnetics Package
PWscf: Plane-Wave Self-Consistent Field, a set of programs for electronic structure calculations
Trilinos: open source software from Sandia

There are some general purpose building block libraries that should be considered if you want codes to run at reasonable speeds. On most modern computers, and particularly on Intel Xeons, most code runs at a small fraction of peak speed. In theory, two floating point operations can occur per clock cycle, so that two Xeon processors running at 3 GHz have a combined peak speed of 12 Gflops. In practice, the two CPUs share a data bus from RAM that delivers at most 500 million double precision numbers per second. Tuned BLAS (Basic Linear Algebra Subprograms) libraries perform matrix operations such as matrix-matrix multiplies at near the theoretical peak speed. But this is true only of tuned libraries. If you download the standard Fortran or C code off the internet and compile it, you will see much poorer performance. The LAPACK dense linear algebra package uses the BLAS library, enabling solution of linear systems or linear least squares problems at very near peak performance. LAPACK and BLAS libraries exist for each of the GNU, Portland Group, and Intel software environments on the Blade Center clusters, and as part of the ESSL library on the p690.
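The peak speed and bus figures quoted above can be combined into a rough bound on how fast a loop can run. The sketch below uses a deliberately simple model, which is our assumption rather than a measurement: a computation runs at the smaller of the arithmetic peak and the bus rate multiplied by the number of floating point operations done per number fetched from RAM.

```python
FLOPS_PER_CYCLE = 2          # floating point operations per clock cycle
CLOCK_HZ = 3.0e9             # 3 GHz Xeon
CPUS_PER_NODE = 2            # the two CPUs share one bus to RAM
DOUBLES_PER_SECOND = 500e6   # most the bus can deliver, doubles per second


def peak_flops():
    """Theoretical peak of one 2 CPU node, in flops per second."""
    return FLOPS_PER_CYCLE * CLOCK_HZ * CPUS_PER_NODE


def attainable_flops(flops_per_double):
    """Upper bound set by either the arithmetic units or the memory bus."""
    return min(peak_flops(), DOUBLES_PER_SECOND * flops_per_double)


def reuse_needed_for_peak():
    """Flops that must be done per double fetched for the bus to keep up."""
    return peak_flops() / DOUBLES_PER_SECOND


if __name__ == "__main__":
    print(f"peak: {peak_flops() / 1e9:.1f} Gflops")
    print(f"1 flop per double fetched: {attainable_flops(1.0) / 1e9:.2f} Gflops")
    print(f"flops per double needed for peak: {reuse_needed_for_peak():.0f}")
```

A vector update such as y = y + a*x does 2 operations for every 2 numbers fetched, so under this model it is limited by the bus. A matrix-matrix multiply on n by n blocks does about 2n^3 operations on 3n^2 numbers, which is why a tuned, blocked BLAS can reuse each fetched number many times and approach peak.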
As an example of how much difference BLAS can make, a code modelling alloy crystallization spent most of its time solving a system of linear equations. It required ten hours to solve a 7K by 7K matrix equation. Linking to tuned BLAS and LAPACK libraries reduced the time to less than a minute. For information on how to link to the BLAS libraries, see the web page //www.ncsu.edu/itd/hpc/Documents/linuxblas.html

In order to address more than 1 Gbyte of memory on the BladeCenter, users should use the -static flag for the Intel and gnu compilers and the -Bstatic flag for the Portland Group compilers. They should also avoid linking to .so (shared) libraries. For an explanation, see the web page //www.ncsu.edu/itd/hpc/Documents/LinuxMemoryMap_content.html
Similarly, the default allocated memory on the p690 is small. See the web page //www.ncsu.edu/itd/hpc/Documents/AIXMemoryModels_content.html for information on how to allocate large amounts of memory.

For parallel dense matrix operations, the ESSL library also works on the p690. On the BladeCenter, the SCALAPACK library (see the SCALAPACK User's Guide) can be used. So far, we've ported SCALAPACK to the Portland Group and gnu environments. A complication in a parallel solution of Ax=b is how to distribute the matrix so that entries of the matrix A lie on the correct processor. Some SCALAPACK utilities can accomplish the desired redistribution of matrix entries.

Many scientific computations rely on solution of linear equations, but typically the systems of equations are not dense. Parallel solution of a sparse system of linear equations can be accomplished by the SuperLU package. SuperLU performs Gaussian elimination on a sparse system of equations. For a sparse system, only the nonzero elements of the matrix are stored. (SuperLU is available in the Intel environment on the BladeCenter.) For yet larger sparse systems, iterative methods are needed.
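To make the last two points concrete, the sketch below stores only the nonzero entries of a matrix and then solves a small system iteratively. The storage layout (compressed sparse row) and the solver (conjugate gradient, which requires a symmetric positive definite matrix) are our choices for illustration. They are not taken from SuperLU or any other package named on this page, and a serial pure Python loop says nothing about parallel performance.

```python
def csr_from_dense(rows):
    """Keep only nonzero entries: values, their columns, and row offsets."""
    values, col_index, row_start = [], [], [0]
    for row in rows:
        for j, a in enumerate(row):
            if a != 0.0:
                values.append(a)
                col_index.append(j)
        row_start.append(len(values))
    return values, col_index, row_start


def csr_matvec(csr, x):
    """Compute y = A x, touching only the stored nonzeros."""
    values, col_index, row_start = csr
    y = []
    for i in range(len(row_start) - 1):
        s = 0.0
        for k in range(row_start[i], row_start[i + 1]):
            s += values[k] * x[col_index[k]]
        y.append(s)
    return y


def conjugate_gradient(csr, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A using only matvecs."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x for the start x = 0
    p = list(r)                      # first search direction
    rr = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        if rr ** 0.5 <= tol:         # stop once the residual norm is small
            break
        Ap = csr_matvec(csr, p)
        alpha = rr / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = sum(ri * ri for ri in r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


if __name__ == "__main__":
    n = 5
    # Tridiagonal test matrix with 2 on the diagonal and -1 beside it.
    dense = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
              for j in range(n)] for i in range(n)]
    A = csr_from_dense(dense)
    print("stored entries:", len(A[0]), "of", n * n)
    print("solution:", conjugate_gradient(A, [0.0, 0.0, 0.0, 0.0, 6.0]))
```

The solver never needs the full matrix, only products of the matrix with a vector, which is what makes iterative methods practical when the matrix is too large to factor.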
One open source package that has absorbed many man years of government funded work is PETSc (Portable Extensible Toolkit for Scientific Computation). PETSc has been installed in the gnu environment on the BladeCenter. Some other software packages from the DOE sponsored ACTS collection may also be useful to general users; see ACTS. These include (in addition to packages mentioned above) SUNDIALS (Suite of Nonlinear and Differential/Algebraic equation solvers) and TAO (Toolkit for Advanced Optimization), also installed in the gnu software environment.

For further information, please e-mail Gary Howell, gary_howell@ncsu.edu, 919-513-7672 or Eric Sills, eric_sills@ncsu.edu