Serial Profiling and Timing
A main reason to go to parallel processing is to make codes run faster.
Serial speed-ups are a first step.
The following is a laundry list of ways to improve performance in serial.
Profile and time to see which parts of the code are using time,
Try compiler optimization,
Search out to see if someone has already done good optimized libraries, e.g., BLAS (see How To Use BLAS libraries),
Make sure we're looping in the right direction (row or column, so that data called into cache is reused),
Try blocking for data locality,
Make sure that I/O is efficient,
Know which features of Fortran 90, C and C++, etc. are likely to slow code.
The material in this web page focuses on how to time codes so that we know
how long they take. Knowing which parts of codes take time is the first
step to improving performance.
A good on-line reference for timers and other performance tools is
LLNL performance_tools tutorial
The Unix Time Command
A straightforward timer is the Unix shell time command.
If you've isolated a section of code which can run as a
stand-alone program ./fooexec, then in the bash, csh, or tcsh shell, you
can simply type
>time ./fooexec
to get screen output detailing how long the code took. For example, in
a bash shell, I got the following
>time ./fooexec
real 0m0.121s
user 0m0.000s
sys 0m0.000s
The real time here is wall clock time. The user and system time correspond
to CPU time. In this particular instance, the code did very little, so the
.121 seconds was essentially just the system overhead of loading the
small executable from the hard drive and starting it.
Generally, if the wall clock time is consistently much longer than the
total of the sys and user times, then fetching data to and from the hard
drive may be consuming much of the run. In parallel codes, the gap often
reflects time spent waiting on communication between processors.
On a second run of this code, the executable was already resident in RAM:
>time ./fooexec
real 0m0.007s
user 0m0.000s
sys 0m0.010s
This timing is unusual, in that the user and sys (CPU) times should
normally total less than the real (wall clock) time. Under the csh or
tcsh shells, the time command can give a good deal of other information
about the code's run-time performance, e.g., how much memory it used
(see the man page).
In-Line Timers
Wrapping the gettimeofday function to get a Fortran wall clock timer
Using the MPI_Wtime wall clock timer
Wrapping the C clock function to get a Fortran CPU timer
Of course, for purposes of comparing codes or seeing how long a
subroutine takes to run, we can just call a timer. C and Fortran 90
have portable CPU timers; Fortran 77 does not.
One way to get portability is to use the standard C clock function
and call it from Fortran through a wrapper. Here's the wrapper.
This timer usually works, but the gettimeofday function described below is
typically of somewhat higher resolution, measuring "wall clock time"
as opposed to "CPU time".
#include <time.h>

/* Fortran-callable CPU timer: returns the CPU seconds used so far. */
float ftime_()
{
   return (float)clock()/CLOCKS_PER_SEC;
}
In some instances, you may need to put an extra underscore after the
ftime_. If you wanted to call this from C, you would take away
the underscore. Similarly, to link to Fortran codes compiled with
the xlf compiler on the p690, remove the underscore.
Compile it with
>cc -c timd.c
and then just include timd.o in the list of object files to be linked
into your executable.
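For example, if the rest of your program has been compiled into myprog.o (a placeholder name here), the link step with the pgf77 compiler used elsewhere in these notes might look something like
>pgf77 -o myprog myprog.o timd.o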
The wrapped function is called as ftime.
The Fortran or C code has calls to ftime() as follows:
Having declared pretim and entim as real*4
pretim = ftime()
...
Code segment to be timed
...
entim = ftime() - pretim
Then entim is the elapsed CPU time for the "Code segment to be timed". In C
the declaration would be as "float" and semicolons would be required at the
ends of lines.
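For instance, a minimal self-contained C version of the same pattern might look like the following sketch; the summation loop is just a placeholder workload, and the wrapper is renamed ftime (no underscore), as described above for calling from C.
#include <stdio.h>
#include <time.h>

/* Same wrapper as above, named without the underscore for calling from C. */
float ftime(void)
{
   return (float)clock() / CLOCKS_PER_SEC;
}

int main(void)
{
   float pretim, entim;
   double sum = 0.0;
   int i;

   pretim = ftime();
   /* Code segment to be timed */
   for (i = 1; i <= 10000000; i++)
      sum += 1.0 / i;
   entim = ftime() - pretim;

   printf("sum = %f, CPU time = %f seconds\n", sum, entim);
   return 0;
}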
The clock function is portable in the sense that, as part of the C standard, it exists everywhere C does. A disadvantage of
the C clock function (or the Fortran 90 cpu_time intrinsic) is that the resolution is often pretty low:
frequently the smallest nonzero time is 1.e-2 or 1.e-3 seconds. So to time a small bit
of code you have to get it to repeat many times. But then the data stays in
cache, so the code runs artificially fast. Occasionally, the compiler
figures out that a loop repeats the same work on the same data and eliminates
it, and then the times can get really short.
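A common workaround, sketched below, is to wrap the fragment in a repetition loop and divide the total time by the repetition count. The loop count and the fragment itself are only placeholders, and the cache and dead-code caveats above still apply.
#include <stdio.h>
#include <time.h>

#define NREP 1000   /* placeholder repetition count; pick enough to get well above the clock resolution */

int main(void)
{
   double x = 0.0;
   clock_t start, stop;
   int rep, i;

   start = clock();
   for (rep = 0; rep < NREP; rep++) {
      /* code fragment to be timed; summing a short series is just a stand-in */
      for (i = 1; i <= 1000; i++)
         x += 1.0 / (i + rep);    /* depend on rep so the compiler cannot hoist the work */
   }
   stop = clock();

   printf("time per repetition = %e seconds\n",
          ((double)(stop - start) / CLOCKS_PER_SEC) / NREP);
   printf("x = %f\n", x);         /* use the result so the loop is not optimized away */
   return 0;
}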
Wrapping the C wall clock timer gettimeofday to
get a Fortran wall clock timer
Another common C timing function, appropriate for wall clock time, is gettimeofday, which fills in a struct
whose second component (microseconds) typically gives a higher resolution than
clock(). The following wrapper on gettimeofday allows a Fortran
code to call a subroutine wtime.
#include <sys/time.h>
#include <stdio.h>
/*
   This timer returns wall clock time.
   The first entry is the current time in seconds,
   the second entry is microseconds.
*/
void wtime_(double tim[2], int *ierr2)
{
   struct timeval time1;
   int ierr;

   ierr = gettimeofday(&time1, NULL);
   *ierr2 = ierr;
   if (ierr != 0) printf("bad return of gettimeofday, ierr = %d \n", ierr);
   tim[0] = time1.tv_sec;
   tim[1] = time1.tv_usec;
}
The following sample Fortran routine calls wtime.
      program testtimers
c
c     This program tests the calling of C timers
c     from fortran.
c
      real*8 sum, tim(2), tim2(2), elapsed
      integer i, ierr, ierr2
      i = 1
      sum = 0.d0
      call wtime(tim, ierr)
      print*,' tim1 =', tim(1), tim(2), ierr
      do while (sum .lt. 19)
         sum = sum + 1./i
         i = i + 1
      end do
      call wtime(tim2, ierr2)
      tim(1) = tim2(1) - tim(1)
      tim(2) = tim2(2) - tim(2)
c
c     It's important to do the differences separately, as tim(1) and tim2(1)
c     are O(10^9) and are almost equal to one another. So converting
c     to a microsecond count before subtracting results in a coarse resolution.
c
      elapsed = 1.e6*tim(1) + tim(2)
      print*,' elapsed time is ', elapsed, ' microseconds'
      elapsed = elapsed/1.e6
      print*,' elapsed time is ', elapsed, ' seconds'
      stop
      end
If the C code is in timw.c and the Fortran code in test.f, the compile and
link phases could be performed by the following command lines.
pgf77 -c test.f
gcc -c timw.c
pgf77 -o test test.o timw.o
For the pgf77 compiler, we could have used an intrinsic timer, but
that timer would not be portable, for example, to g77-compiled code.
This timer (gettimeofday) typically has a higher resolution than the
clock function; for example, in Linux, Solaris, and AIX, the resolution
is microseconds. Again, to link with code compiled by the xlf compiler on
the p690, delete the _ after wtime.
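For reference, a C code can of course call gettimeofday directly. A minimal sketch follows (the loop is just a placeholder workload); it subtracts the seconds and microseconds fields separately, for the same reason given in the Fortran comments above.
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
   struct timeval t0, t1;
   double elapsed, sum = 0.0;
   int i;

   gettimeofday(&t0, NULL);
   for (i = 1; i <= 10000000; i++)   /* placeholder work to be timed */
      sum += 1.0 / i;
   gettimeofday(&t1, NULL);

   /* subtract seconds and microseconds separately, then combine */
   elapsed = (double)(t1.tv_sec - t0.tv_sec)
           + 1.e-6 * (double)(t1.tv_usec - t0.tv_usec);

   printf("sum = %f, elapsed time = %f seconds\n", sum, elapsed);
   return 0;
}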
Using the MPI_Wtime wall clock timer
For wall clock time I often prefer just to use the MPI_Wtime() function.
If you have an MPI library available, just link to it; then, between the
MPI_Init() and MPI_Finalize() calls, you can use the MPI_Wtime() wall
clock function even though your code is really serial. The following
code was used above and is modified here to use the MPI_Wtime call instead.
      program testtimers
c
c     This program times the same loop using the
c     MPI_Wtime wall clock function.
c
      include 'mpif.h'
      real*8 sum, startim, entim
      integer i, ierr
      call MPI_Init(ierr)
      i = 1
      sum = 0.d0
      startim = MPI_Wtime()
      do while (sum .lt. 19)
         sum = sum + 1./i
         i = i + 1
      end do
      entim = MPI_Wtime() - startim
      print*,' elapsed time is ', entim, ' seconds'
      call MPI_Finalize(ierr)
      stop
      end
Usually, the resolution of MPI_Wtime() is in microseconds.
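A C version of the same measurement is sketched below; the summation loop is again only a placeholder workload.
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
   double startim, entim, sum = 0.0;
   int i;

   MPI_Init(&argc, &argv);           /* required before MPI_Wtime, even for a serial code */

   startim = MPI_Wtime();
   for (i = 1; i <= 10000000; i++)   /* placeholder work to be timed */
      sum += 1.0 / i;
   entim = MPI_Wtime() - startim;

   printf("sum = %f, elapsed time = %f seconds\n", sum, entim);

   MPI_Finalize();
   return 0;
}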
Profilers
Running a code under a profiler produces a table that details how long
the code spent in various subroutines. Typically, the profiler is
turned on by compiling the code with the -p flag. Then when the code
is executed, a log file is produced. The log file can have various names, but
>ls -lrt
will list files in reverse order of modification time, so the newest files
appear last; if you've just run a profiled code, you can easily find the
log file. Each line of the file corresponds to a subroutine and tells
you how often the subroutine was called and the total time spent in
that subroutine. Generally, if we want to
speed up a code, knowing where the code spends most of its time shows
us where to concentrate. Or if a subroutine is called "oodles" of times,
maybe it should be inlined.
Example with the PGI Fortran compiler and profiler
Here's an example run.
It was compiled with the pgf77 compiler using the -p flag.
What do we see? After the executable ran, the file "pgprof.out"
was produced. It follows.
PROF NODALL 0 a.out 1093292124 0
h blade1-13 23023 0 1
t 1 7
p 0
f zge062704_1.f
r zgebrd110 1 238 1 49.3378 23.3085
r zgebd3 1 702 1 7.52057e-05 6.07718e-05
r zlabr2 1 1342 199 26.0292 1.0356
r zgeupm 1 2064 1393 22.0503 0.918796
r zgemver 1 2342 207 2.94329 2.94329
r zgemvt2 1 3113 1393 21.1315 21.1315
f zrivbrd2.f
r zrivbrd 1 1 1 109.268 59.9301
z
The driver is zrivbrd (contained in the Fortran file zrivbrd2.f). According to the log file, it takes about 59 seconds. This is wall clock time, i.e.,
elapsed time for the entire code execution. Wall clock time
may include time for I/O and also time that the CPU has spent running
other users' code. Thus it is important to profile, where possible,
on computational nodes. There we are guaranteed that no other
user is using the processor. In contrast, on the head node,
or on the p690, there may be a good deal of contention for
resources, resulting in inconsistent runs.
In this instance (from looking at the code), the
driver calls some library routines from LAPACK, which has been linked
to but not compiled with the -p flag. Since the LAPACK
codes were not compiled with the profiler, the times spent in the
library routines are attributed to the driver.
Another time-consuming routine is zgebrd110, which required 23.3 seconds,
most of which was actually spent running the BLAS routine dgemm
(matrix-matrix multiply). The next longest time, 21.1 seconds, belongs to
zgemvt2, which calls the BLAS matrix-vector multiply dgemv; zgemvt2 was
called 1393 times (and only from zgeupm, which was itself called 1393 times).
The routine zgemver was called 207 times and required 2.94 seconds
(it calls the BLAS routines dgemv and dger, a rank-one update).
For a prettier display of the results, I typed
>pgprof pgprof.out
which brought up a GUI with some better explanations. For instance,
the 238 is the line of the file zge062704_1.f on which the routine
zgebrd110 starts. It also gives another informative column of times:
how long the code spends in a routine and its subroutines.
For example, zrivbrd and its subroutines required 109.268 seconds.
These results (blaming all the time on the BLAS calls) seem to
indicate that we should make sure we have a good BLAS library. (The
results of that experiment appear below.) For a tutorial
on the PGPROF profiler, see
PGPROF
The pgprof profiler also has parallel capabilities.
gprof, Lint, ftnchek, etc.
I happen to know the code we're examining pretty well, as I wrote it.
But for someone coming in cold, the PGI log file would be a bit
confusing. It doesn't tell us, for example, which routine calls which.
Generally, when looking to modify a code, you find yourself constructing
a call table.
One way, of course, is by "brute force", which is plausible with
some Unix commands.
The one indispensable "track it down" command is grep. So, for example,
to search all the .f files in a directory for the ones that contain
the characters ZLABR2(,
>grep 'ZLABR2(' *.f
Or, to search a bunch of .a archive files to find which one has the
elusive function "foobar",
>nm -A *.a | grep foobar
where the | pipes the output of "nm -A *.a", which is a massive amount
of symbols, to grep. Grep throws away all the lines of the output
except for those that contain "foobar". The -A flag makes sure that
the name of the .a file is included on each line of the nm-generated
symbol table.
Then by tracking down all the subroutines, you'll eventually construct
a tree showing which ones call which. Or you could have just used the
-pg compile option.
Most of the traditional computer vendors, such as IBM, Digital, and HP,
have their own utilities which will construct a call table for you.
For the open source environments (gcc, g77, etc.), there is the well-known
public domain program GNU gprof (by Jay Fenlason), which works with
C, Fortran, and Pascal, provided you have compiled with the -pg flag.
I hope to give you a longer demo, but
"info gprof" will get you going. See also the web page
GNU gprof or
Class notes from Rice University.
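For example, with gcc the basic sequence is usually something like the following (fooexec and foo.c are placeholder names); running the instrumented executable writes a gmon.out file, which gprof then turns into a flat profile and call graph.
>gcc -pg -o fooexec foo.c
>./fooexec
>gprof ./fooexec gmon.out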
Call tables are also constructed by programs such as lint for C and
ftnchek for Fortran 77. These are public domain, but some supported
licensed programs can be purchased. Lint and ftnchek also give a
good deal of info about possible programming problems such as
mis-typed argument lists, non-portable language constructs, etc.
One problem with Fortran 90 is that the standard open source tools
such as lint, ftnchek, and gprof don't yet work with it. So not
only do you have to buy a Fortran 90 compiler, but you may also have
to purchase the corresponding development tools. Fortunately, we have the PGI
flavors here. Also, on the p690, we have the IBM tools.
Speed-ups by BLAS, Compiler Optimization, Intel Compiler
Initially, we had an LAPACK time (according to clock) of 52 seconds to
bidiagonalize a 1600 by 1600 matrix, corresponding to about 206 Mflops,
while the trial routine required only 48 seconds. 206 Mflops is not
as horrible as it sounds, as these are double precision complex flops,
where an add corresponds to two double precision adds, and a multiply
to 4 double precision multiplies. Assuming an equal number of adds
and multiplies, there are 3 double precision flops per complex flop, so
we have a speed of about 620 Mflops. Still, the peak speed of the processor
is around 6 Gflops.
Getting rid of the profiling and using the -O4 compiler option made
little difference in the times.
Swapping the PGI-supplied BLAS for the ATLAS BLAS dropped the
LAPACK time down to 25 seconds, i.e., 430 Mflops complex,
equivalently about 1320 Mflops in double precision terms. In theory, using
the Goto BLAS with the Intel compiler and flags should push the rate above
500 Mflops.
For directions on linking to ATLAS, Intel, and Goto BLAS libraries on
the Blade Center, see
BLAS libraries
But the alternative bidiagonalization version was reduced in time only from 48 seconds to
34 seconds. A first problem was that the block size had been set to 3
for purposes of debugging. Returning the block size to the usual 16,
the time went to about one half second more than the current LAPACK routine.
Looking carefully at the profiler data, it turned out that the average
zgemver call took longer than the average call to zgemvt. I then
realized that most of the zgemver calls (one per call to zlabr2) were
actually zgemvt calls with a "zero" update. Fixing these calls so
that they do not call a matrix update routine avoided one write of the
matrix per call to zlabr2, and reduced the time of execution to .7 seconds
less than the current LAPACK call. Some remaining functions that take
more than one second are the call to zlabr2 and the call to zgeupm. A
remaining investigation is whether the conjugation (not expressed as
a BLAS call) could be responsible. Perhaps some of the vector operations
for which conjugation is done an element at a time need some optimization.
Hardware Counters -- PAPI
In quantum mechanics or anthropology, observations influence the observed. Likewise, using profilers or timers entails inserting
"interrupts" that distort program execution.
A less intrusive approach to observing the
code in the wild is to use a hardware counter. It samples:
every so many steps -- say 1000 clock cycles -- it polls, asking
what instruction is being issued on this clock cycle. Then, after program
execution is complete, you can look and see what proportion of the time
was spent doing what.
These counters are typically developed as part of the chip design
process, and may or may not be available to the public. For example,
the Digital Alpha chip had a very nice counter, but alas, the
sys admin never wanted to leave it on, because it would
have some "drag" effect on the system.
Dongarra et al. have proposed a portable counter interface, PAPI. It runs
on most processors, and is public domain.
Performance Application Programming Interface
Why? It's interesting to compare and see what bits of code
can sustain the most flops per instruction.
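As a rough illustration, here is a sketch using PAPI's classic high-level counter interface; the exact calls and the availability of the preset events (PAPI_FP_OPS, PAPI_TOT_CYC) depend on the PAPI version and the processor, so treat this only as a starting point.
#include <stdio.h>
#include <papi.h>

int main(void)
{
   /* preset events: total floating point operations and total cycles */
   int events[2] = {PAPI_FP_OPS, PAPI_TOT_CYC};
   long long values[2];
   double sum = 0.0;
   int i;

   if (PAPI_start_counters(events, 2) != PAPI_OK) {
      printf("PAPI_start_counters failed\n");
      return 1;
   }

   for (i = 1; i <= 10000000; i++)   /* placeholder work to be measured */
      sum += 1.0 / i;

   if (PAPI_stop_counters(values, 2) != PAPI_OK) {
      printf("PAPI_stop_counters failed\n");
      return 1;
   }

   printf("sum = %f\n", sum);
   printf("floating point ops = %lld, cycles = %lld\n", values[0], values[1]);
   printf("flops per cycle = %f\n", (double)values[0] / (double)values[1]);
   return 0;
}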
Looking at the Assembler Code
Most compilers have an option to let you look at the assembler
code. Look in the compiler man page; most commonly,
it's a -S (as opposed to -c) option. If you've determined that
some bit of code could be a problem area, you can look
at the assembler output and experiment with various coding options
to decrease the number of load and store instructions, etc.
Personally, I don't actually write the assembler, just rewrite
the Fortran or C. It's also interesting to see what the
compiler optimizer has done to your code.
Conclusions
We've seen how to time and profile. In an example, we got a big speedup
(factor of two) by changing to a different BLAS library. More generally,
if we found a part of the code that took a significant amount of time
and did not have a corresponding optimized library call, we could try
some of the techniques from the next lecture to optimize it ourselves.
For example, 10% or so of the time in the example profiled code was
not in BLAS routines, so I may have to try to optimize that code by
hand.