>ssh -l myname login64.hpc.ncsu.edu
Copy /share3/gwhowell/pachec.tar to your home directory by
>cp /share3/gwhowell/pachec.tar .
>tar xvf pachec.tar
>cd pachec/ppmpi_f/chap03
>ls
For a not very optimized version using the Portland Group compilers:
>add pgi64_hydra
>mpif90 pmonte.f -o pmonte
>pgf90 -c pmonte.f
>pgf90 -o pmonte pmonte.o /usr/local/apps/mpich2/pgi105x64/1.3a2/lib/libfmpich.a /usr/local/apps/mpich2/pgi105x64/1.3a2/lib/libmpich.a /usr/local/apps/mpich2/pgi105x64/1.3a2/lib/libmpl.a
Finally, having produced the file "compilepmonte" make it executable by
>chmod +x compilepmonte
More complicated programs are usually compiled with "make" (see the 2nd lab).
There are several thousand computational CPUs on the blade center, most packaged with 4 or 8 cores per blade. Most blades have 2 GBytes of RAM per core, with all the cores on a blade having access to the blade's RAM. Communication between blades and to the shared file system is by GBit ethernet.
Users run jobs on the computational nodes by submitting bsub scripts
which will put their jobs in an LSF (Load Sharing
Facility) queue. Here's a sample bsub script which
runs the executable pmonte.
#! /bin/csh
#BSUB -W 5
#BSUB -n 4
#BSUB -R em64t
source /usr/local/apps/env/pgi64_hydra.csh
setenv MPICH_NO_LOCAL 1
mpiexec_hydra ./pmonte
#BSUB -o /share/gwhowell/chap03/pmonte.out.%J
#BSUB -e /share/gwhowell/chap03/pmonte.err.%J
#BSUB -J pmonte
bsub < bstry
Some explanation of the bstry script:
The -W 5 asks for 5 minutes. -n 4 asks for 4 processors. The -o names the standard out, and the -e the standard err. The %J appends the LSF job ID integer to the file names so that you don't get them appended to the same files on successive runs. The -R em64t specifies that a blade with 64 bit pointers will be used (needed with the pgi64_hydra compilers and mpich library).
Of course, you need to change the gwhowell to your own user name and make sure that the directory you propose to write to exists and that you have write privileges in it.
Writing these files to /share is convenient in that /share has no space constraint. However, files on /share are not backed up. In fact, they are purged, i.e., files on /share older than a month or so (or two weeks or whatever it takes to keep some space on the disk) are deleted.
You might look in the .err and .out files. There is only one line in the .out file which is related to the job submission. Of course, if the submission had failed there would be more. You might try changing some things to make the submission fail. For instance, I got some pretty mysterious errors by omitting the line "real*8 rand" in the pmonte.f file.
For example, give a bad path to the executable.
Log out and when you log back in, don't type "add pgi64_hydra". What happens if you compile with pgi, then use "add intel64_hydra"?
As an exercise, now try compiling and running some other program from chap03, say ring.f.
Is it actually true that the number of processors has to be even?
Actually, the program does not (usually) hang for an even number of processors. The sends return (provided adequate buffer space is available), so the processes are then ready to receive. The even-number-of-processors code is "unsafe" in that it depends on adequate buffer space.
In particular, it makes sense to
One way to time codes is by prepending time to the call to the executable, e.g.
time ./foo.exe
This gives results of "user" time. "User" time should be taken with a grain of salt for parallel computation. It can, for example, be the total time spent in the various processes launched. Really we are more interested in "wall clock time". One way to get "wall clock time" is from the start and end times reported by LSF. The example programs show the use of the "wall clock timer" MPI_Wtime(), which can be placed at the start and end of an MPI program. See Timers for a bit more.
Listing a directory in reverse order of time (for example with ls -ltr) makes the newest files appear last. So if you've just run a profiled code you can easily find the log file. For example, running a program compiled with -pg by the GNU or PGI compilers (pgf90, pgcc, gcc, g77, g++) will produce an output file gmon.out. To see the contents of gmon.out, produced by running an executable foo.exe, try typing
>gprof foo.exe gmon.out
Each line of the file will correspond to a subroutine, and will tell you how often the subroutine was called and the total time spent in that subroutine. Generally, if we want to speed a code, knowing where the code spends most of its time shows us where to concentrate. Or if a subroutine is called "oodles" of times maybe it should be inlined. One tutorial for gprof is The GNU gprof.
The gprof profile samples times in codes compiled with the -pg flag. Time spent in other parts of the code is not reported.
Here's an example run. It was compiled with the pgf77 compiler using the -Mprof=func flag. What do we see? After the executable ran, the file "pgprof.out" was produced. It follows.
PROF NODALL 0 a.out 1093292124 0
h blade1-13 23023 0 1
t 1 7
r zgebrd110 1 238 1 49.3378 23.3085
r zgebd3 1 702 1 7.52057e-05 6.07718e-05
r zlabr2 1 1342 199 26.0292 1.0356
r zgeupm 1 2064 1393 22.0503 0.918796
r zgemver 1 2342 207 2.94329 2.94329
r zgemvt2 1 3113 1393 21.1315 21.1315
r zrivbrd 1 1 1 109.268 59.9301
The driver is zrivbrd (contained in the Fortran file zrivbrd2.f). According to the logfile, it takes 59.9 seconds. Actually, looking at the code, the driver calls some library routines from LAPACK, which was linked in but not compiled with the profiling flag. Since the LAPACK codes were not compiled with the profiler, the times spent in the library routines are attributed to the driver.
Another time consuming routine is zgebrd110, which required 23.3 seconds,
most of which was actually spent running the BLAS routine dgemm
(matrix matrix multiply). The next longest time is 21.1 seconds, in zgemvt2, which calls
the BLAS matrix vector multiply dgemv. zgemvt2 was
called 1393 times (only from zgeupm, which was also called 1393 times).
The routine zgemver (starting at line 2342) was called 207 times and required 2.94 seconds
(it calls BLAS dgemv and also dger, which performs rank one updates).
For a prettier display of the results, I opened the file in the PGPROF viewer, which gave a GUI with some better explanations. For instance, the 238 is the line of the file zge062704_1.f on which the routine zgebrd110 starts. It also gives another informative column of times, which is how long the code spends in a routine and its subroutines. For example, zrivbrd and its subroutines required 109.268 seconds.
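The relation between the two time columns can be sketched in C. This is only an illustration: the field layout (start line, call count, time including subroutines, time in the routine itself) is inferred from the description above, the meaning of the field right after the routine name is not stated in the text, and the names prof_rec, parse_rec and child_time are hypothetical.

```c
#include <stdio.h>

/* One routine record ("r" line) of the pgprof.out listing.
   Field meanings are assumptions inferred from the surrounding text. */
struct prof_rec {
    char   name[64];
    int    unknown;    /* field right after the name; meaning not stated */
    int    line;       /* source line on which the routine starts */
    long   calls;      /* number of times the routine was called */
    double inclusive;  /* seconds in the routine and its subroutines */
    double self;       /* seconds in the routine itself */
};

/* Parse one "r" line; returns 1 on success, 0 for any other line. */
int parse_rec(const char *text, struct prof_rec *r)
{
    int n = sscanf(text, "r %63s %d %d %ld %lf %lf",
                   r->name, &r->unknown, &r->line, &r->calls,
                   &r->inclusive, &r->self);
    return n == 6;
}

/* Seconds attributed to profiled subroutines: inclusive minus self. */
double child_time(const struct prof_rec *r)
{
    return r->inclusive - r->self;
}
```

Applied to the zrivbrd record, child_time gives 109.268 - 59.9301, about 49.34 seconds, which agrees with the inclusive time reported for zgebrd110.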
These results (blaming all the time on the BLAS calls) seem to indicate that we should make sure we have a good BLAS library. (The results of that experiment appear below.) For a user manual for the PGPROF profiler, see PGPROF
One way, of course, is by "brute force". Brute force is plausible with some Unix commands. The one indispensable "track it down" command is grep. So for example, to search all the .f files in a directory for the ones that contain the characters ZLABR2(, type
>grep 'ZLABR2(' *.f
Or to search a bunch of .a archive files to find which one has the elusive function "foobar"
>nm -A *.a | grep foobar
where the | pipes the output of "nm -A *.a", which is a massive number of symbols, to grep. Grep throws away all the lines of the output except for those that contain "foobar". The -A flag makes sure that the name of the .a file is included on each line of the symbol table nm generates. (Some vendor versions of nm spell this flag -r; in GNU nm, -r reverses the sort order instead.)
Then by tracking down all the subroutines, you'll eventually construct a tree showing which ones call which. Or you could have just used the -pg compile option. Most of the traditional computer vendors, such as IBM, Digital, and HP, have their own utilities which will construct a call table for you. For the open source environments (gcc, g77, etc.), provided you have compiled with the -pg flag, there is the well-known, freely available program GNU gprof (by Jay Fenlason), which works with C, Fortran, and Pascal. I hope to give you a longer demo, but "info gprof" would get you going. See also the web page GNU gprof or Class notes from Rice University.
Call tables are also constructed by programs such as lint for C and ftnchek for Fortran 77. These are public domain, but some supported licensed programs can be purchased. Lint and ftnchek also give a good deal of info about possible programming problems such as mis-typed argument lists, non-portable language constructs, etc.
One problem with Fortran 90 is that the standard open source tools such as lint, ftnchek and gprof don't yet work with it. So not only do you have to buy a Fortran 90 compiler, but then you have to purchase these as development tools. Fortunately, we have the pgi flavors here. Also on the p690, we have the IBM tools.
One way to get portability is to use the standard C clock function.
Then from Fortran, use a wrapper to call it. Here's the wrapper.
#include <time.h>
/* printf("Here we are:\n clock=%12.8f\n",(float)clock()); */
In some instances, you may need to put an extra underscore after the ftime_. If you wanted to call this from C, you would take away the underscore. Compile it with
>cc -c timd.c
and then just include timd.o in the list of object files to be linked into your executable.
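A minimal sketch of what such a wrapper might look like, assuming the name ftime_ mentioned above and assuming it returns CPU seconds as a single precision value (to match a real*4 declaration on the Fortran side). This is an illustration under those assumptions, not necessarily the wrapper used for the timings reported here.

```c
#include <time.h>

/* Hypothetical wrapper: CPU seconds used by the program so far.
   The trailing underscore matches the external name that many
   Fortran compilers generate for a reference to ftime(). */
float ftime_(void)
{
    /* clock() counts in ticks; CLOCKS_PER_SEC converts to seconds. */
    return (float)clock() / (float)CLOCKS_PER_SEC;
}
```

Single precision keeps the Fortran declaration simple, but it loses resolution on long runs; returning a double (real*8 in Fortran) is a reasonable alternative.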
The Fortran or C code has calls to ftime() as follows:
Having declared pretim and entim as real*4
pretim = ftime()
Code segment to be timed
entim = ftime() - pretim
Then entim is the elapsed CPU time for the "Code segment to be timed". In C the declaration would be as "float" and semicolons would be required at the ends of lines.
The clock function is portable in that as part of the C standard, it exists everywhere C does. A disadvantage with the C clock function or Fortran90 cpu_time is that the resolution is often pretty low. Frequently the smallest nonzero time is 1.e-2 or 1.e-3 seconds. So to time a bit of code you have to get it to repeat many times. Then the data stays in cache so the code runs artificially fast. Occasionally, the compiler figures out a loop is repeated with the same data and figures out it is unnecessary. Then times can get really short.
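The repetition technique can be sketched as follows. The kernel (a sum of squares) and the names kernel and time_per_call are hypothetical stand-ins for whatever code is being timed. The caveats above still apply: the data stays in cache between repetitions, and an aggressive compiler may still simplify repeated work on identical data.

```c
#include <time.h>

/* Hypothetical code segment to be timed: sum of squares of x[0..n-1]. */
static double kernel(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i] * x[i];
    return s;
}

/* Average CPU seconds per call, measured over reps repetitions so that
   the total is well above the resolution of clock(). */
double time_per_call(const double *x, int n, long reps)
{
    /* Storing into a volatile forces every result to be written, which
       discourages (but does not strictly forbid) the compiler from
       removing the repeated work. */
    volatile double sink = 0.0;
    clock_t start = clock();
    for (long r = 0; r < reps; r++)
        sink += kernel(x, n);
    clock_t stop = clock();
    (void)sink;
    return (double)(stop - start) / CLOCKS_PER_SEC / (double)reps;
}
```

If the returned time is zero or varies wildly from run to run, reps is too small for the timer's resolution.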
Another timing function appropriate for wall clock time is the POSIX function clock_gettime, which fills in a struct whose second component (nanoseconds) typically gives a higher resolution than clock(). For wall clock time I often prefer just to use the MPI_Wtime() function. If you have an MPI library available, just link to it; then, even though your code is really serial, you can call the MPI_Wtime() wall clock function between the MPI_INIT() and MPI_Finalize() calls. It usually returns an answer with a resolution in microseconds.
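A minimal sketch of a wall clock timer built on clock_gettime, assuming a POSIX system. The function name wall_seconds is hypothetical, and CLOCK_MONOTONIC is chosen here because it is not affected by adjustments to the system date; on some older Linux systems the link step may also need -lrt.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* ask for the POSIX declarations */
#endif
#include <time.h>

/* Wall clock seconds measured from an arbitrary fixed starting point.
   Only differences between two calls are meaningful. */
double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    /* tv_sec holds whole seconds, tv_nsec the nanoseconds beyond them. */
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}
```

Usage follows the same pattern as the ftime() calls: record wall_seconds() before the code segment, and subtract that value from wall_seconds() afterwards.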
Finally, if you've isolated a section of code which can run as a stand-alone program ./fooexec, you can simply type
>time ./fooexec
to get screen output detailing how long the code took. Under the csh or tcsh shell the time command can give a good deal of other information about the code's run-time performance, e.g., how much memory it used. (See the man page.)
A good on-line reference for timers and other performance tools is LLNL performance_tools tutorial
Getting rid of the profiling and using the -O4 compiler option made little difference in the times.
Swapping the PGI supplied BLAS for the ATLAS BLAS dropped the LAPACK time down to 25 seconds, i.e., 430 Mflops of complex, equivalently 1320 Mflops for double. Theory: using the Goto BLAS with Intel compiler and flags will push the rate above 500 Mflops. For directions on linking to ATLAS, Intel, and Goto BLAS libraries on the Blade Center, see BLAS libraries
But the alternative bidiagonalization version was reduced in time only from 48 seconds to 34 seconds. A first problem was that the block size had been set to 3 for purposes of debugging. Returning the block size to the usual 16, the time went to about one half second more than the current LAPACK routine. Looking carefully at the profiler data, it turned out that the average zgemver call took longer than the average call to zgemvt. I then realized that most of the zgemver calls (one per call to zlabr2) were actually zgemvt calls with a "zero" update. Fixing these calls so that they do not call a matrix update routine avoided one write of the matrix per call to zlabr2, and reduced the execution time to .7 seconds less than the current LAPACK call. The remaining functions that take more than one second are zlabr2 and zgeupm. A remaining question is whether the conjugation (not expressed as a BLAS call) could be responsible. Perhaps some of the vector operations for which conjugation is done an element at a time need some optimization?
These timers are typically developed as part of the chip design process, and may or may not be available to the public. For example, the Digital Alpha chip had a very nice counter, but alas the sys admin never wanted to leave it on. This is because it would have some "drag" effect on the system.
Dongarra et al. have proposed a portable counter interface, PAPI. It runs on most processors and is public domain. Performance Application Programming Interface
Why? It's interesting to compare and see what bits of code can sustain the most flops per instruction.
We've seen how to time and profile. In an example, we got a big speedup (factor of two) by changing to a different BLAS library. More generally, if we found a part of the code that took a significant amount of time and did not have a corresponding optimized library call, we could try some of the techniques from the next lecture to optimize it ourselves. For example, 10% or so of the time in the example profiled code was not in BLAS routines, so I may have to try to optimize that code by hand.