Profiling is useful in giving some indication of which subroutines and functions are using most of the time, and hence are worth making more efficient. All the profiling tools discussed here work well for serial codes.
Adding a -pg flag on compilation and linking of codes, causes a file called gmon.out to produced at run time. If the executable is involked by
./foo
then
gprof ./foo |& tee fooprof
will put profile information in the file fooprof. Profiling gives information on how long the executable spent in each subroutine or function. It's based on sampling with a relatively course grained timer.
For pgcc and pgf90 compiled codes, compiling with the -Mprof=lines or -Mprof=funcs flag produces at runtime a file called pgprof.out, which can be examined by
pgprof pgprof.out
The intel compilers also have a profiling mechanism (see the man page for ifort or icc).
Some of the above profilers probably have been extended to work well in the multi-core case, but if so, I haven't yet figured out how to use them there. Having figured out what parts of a code can use some parallelization, I typically insert a wall clock timer to see how much speedup I can get.
Of course, the easiest way to time codes is by using
time ./foo
In the csh or tcsh Unix shells, the "time" mechanism has some interesting options (try "man time"). "time" does not give details as to how a code is spending time.
The timers in timw3.c are fine grained wall clock timers, useful in getting a closer look at code segments and more accurate assessments of how long in spent in functions and subroutines. They are based on the timeofday() Posix function, so should be useable on most machines. The C and Fortran codes below give examples of how these timers can be used. For correct usage, be sure to look at the variable declarations in the codes.
Wall clock timers are convenient in measuring parallel speedup, but note that the microsecond precision is a bit spurious. Results over small time intervals vary depending on which background processes have grabbed the processor.
Some traditional timers in Fortran give cputime. These often appear to be based on the clock() function in the Posix standard. For multi-core machines these can be confusing as they sum the time spent by all cores. Also resolution is fairly low, typically in hundredths of a second on Linux boxes and milliseconds in Windows.
The "date" timer specified by the Fortran standard gives times in seconds, so is of yet lower resolution. MPI and OpenMP also provide timers, which seem usually to be high quality wall clock timers based on Posix cputime().
Both the pgi and intel compilers provide flags for automatic parallelization of loops. Even on the simple sax-new.f and sax.c codes, I observed no parallel speedup. Also I have not found a way to get Fortran or C code with OpenMP directives inserted by the pgi or intel compilers, so customization is not easy. However, the Intel compilers will generate "reports" on parallelizing loops. These may be useful.
I have in the past found some automatic parallelization tools that insert OpenMP directives. These can be useful as a starting point. Typically the programmer goes through and removes pragmas for loops that are too short to be usefully parallelized, and also observes which loops were not automatically parallelized.
To use a debugger remove all optimization flags. For an OpenMP code, you need to leave that flag (-fopenmp for gcc4 or gfortran, -mp=numa for pgcc or pgf90, -openmp for icc or ifort). The debugger also requires a -g flag.
For gcc4 and gfortran compiled codes, ddd is a graphic front-end for the gdb debugger. So far, it works well for me in serial and with some fiddling, also for MPI codes. But not for OpenMP.
For pgcc and pgf90 compiled codes, the pgdbg debugger works well in serial and can also be made to work with MPI codes. In an initial attempt, I was not able to usefully step through a parallel OpenMP region (private variables were not recognized).
However, the totalview debugger did work, and is worth a try not only for ifort and icc compiled codes, but also for others.
To use the totalview debugger, first type
add tv
then
totalview ./fooexec < input &
(presuming that input has the data that ./fooexec would require from standard input).
It's probably best to run totalview from a VCL replica of an HPC login node (rather than from a login node) , as it runs rather slowly (go to vcl.ncsu.edu and ask for an HPC image). Generally, the HPC VCL image allows you to run computationally intensive jobs interactively, so long as only one blade is required.
Hitting a "next" button will start you stepping through the program. Typically you set a break point, then use "continue" to jump to that point. The debugger allows you to step through the code and examine variables after execution of a given line.
Inside a parallel region, you can duplicate the main GUI. Then in the bottom screen, select which thread you want to step through.
Often we can get surprizingly high speedups in serial codes, even before parallelizing. Folk wisdom in the HPC world is that most codes are written to run only at 1% peak efficiency (typically codes spend their time randomly accessing RAM, then waiting a hundred clock cycles for the next fetch from RAM). Then they are rewritten to work in parallel. Fortunately, in the shared memory case the penalty for a "cache miss" is still on the order of a hundred clock cycles (where in the distributed memory case, the penalty for accessing data on some other processor ranges from a million clock cycles to an indefinitely stalled program).
Some common and easy methods to get serial speedups are
Make sure you're using the best compiler flags for the compilers you're using.
Vendor compilers are often faster than public domain compilers.
Rather than write your own matrix routines or fast fourier transforms or sorts or .. use a library someone else already wrote and debugged. For matrix operations, tuned BLAS (Basic Linear Algebra Subroutines) are available for each supported compiler. We can also find threaded versions of the BLAS (which use more than one core).
For serial optimization, it's often helpful to realize that C stores 2-D arrays by rows, Fortran by columns. Pulling just one number from RAM to cache can take a hundred or so clock cycles. But actually when a nunber is brought from RAM to cache, a cache line of neighboring data is also brought. So when the next element in a row is used, it's usually already there (in C but not in Fortran). When the next element in a column is used, it's usually already been fetched (in Fortran but not C).
Loop unrolling is often useful. Increment loops several say k at at ime, putting the k iterations into a single pass of a loop. "Loop unrolling" avoids the overhead of incrementing the loop variable. Often the compiler will do this for you, but usually only for the innermost loop. Don't forget to write clean-up steps for the part of the loop not performed in the combined iteration.
If a variable is repeatedly updated, before being stored, it may make sense to save it as a temporary variable, then finally write it to its permanent array value. Then the code doesn't repeatedly write to RAM (writes to RAM are even more expensive than reads from RAM).
If a variable is changed by several threads, then code will run very slowly (and is likely wrong).
But the computer can also be confused if two different processors change data from the same cache line (false sharing). If two threads access data from the same cache line, the computer gives it special and slow treatment to assure "cache coherency".
Generally, we want our for or do loops to be long, and prefer static scheduling.
Thought as to what barriers can be safely removed can be helpful (especially if the barrier is frequently repeated).
From the blade center, find a directory from which you wish to work. Do
cp /home/gwhowell/progs/openmp/tblas.tar . tar xvf tblas.tar
The following codes are motivated by a need to calculate matrix matrix multiplications in cases where the matrices are long and flat or tall and skinny. Such computations arise in block Krylov methods for sparse matrix computations. (See Block Householder BLAS-3 Reduction of Sparse Matrices ) It's not clear that available BLAS libraries are well-tuned for the "tall skinny" case.
The codes dskn11.f and dskn11.c are straightforward codings of matrix matrix multiplications (with a bit of wrapping to do timings for various sizes.). For a couple of compilers, try compiling and running the code. For example,
gcc4 -fopenmp -c dskn11.c gcc -c timw3.c gcc4 -fopenmp -o dskn11 dskn11.o timw3.o gfortran -fopenmp -c dskn11.c gcc -c timw3.c gfortran -fopenmp -o dskn11 dskn11.o timw3.o
Then run it by
./dgskn11 < input
where "input" is a file perhaps consisting of
8 8 8 0
Next try with some compiler flags. How much difference do they make?
Here's a code with one loop unrolled (by 2) and OpenMP. Try compiling dskn12.f (you could also try a rewrite of dskn11.f or dskn11.c ) Did this help ? This Fortran version requires the input file to be modified to have a large number first.
program dskn
integer lda, ldk
parameter (lda=10000,ldk=30)
real*8 a(lda,ldk), b(ldk,ldk), c(lda,ldk)
integer i,j,k, m,n,l
real*8 alpha,beta
real*8 g11, g12, g13, g14
real*8 g21, g22, g23, g24
real*8 g31, g32, g33, g34
real*8 g41, g42, g43, g44
print*, 'input m <= ', lda
read*, m
print*, 'input n <= ', ldk
read*, n
print*, 'input l <= ', ldk
read*, l
do i = 1, m
do j = 1, n
a(i,j) = 1.d0
c(i,j) = 0.d0
end do
end do
do i = 1, n
do j = 1, l
b(i,j) = 1.d0
end do
end do
alpha = 1.d0
beta = 1.d0
do i = 1, m-1, 2
do j = 1, l
g11 = 0.d0
g21 = 0.d0
do k = 1, n
g11 = g11 + a(i,k) * b(k,j)
g21 = g21 + a(i+1,k) * b(k,j)
end do
c(i,j) = alpha*g11 + beta*c(i,j)
c(i+1,j) = alpha*g21 + beta*c(i+1,j)
print*,'c(',i,',',j,')=',c(i,j)
print*,'c(',i+1,',',j,')=',c(i+1,j)
end do
end do
do i = (m/2)*2+1, m
do j = 1, l
g11 = 0.d0
do k = 1, n
g11 = g11 + a(i,k) * b(k,j)
end do
c(i,j) = alpha*g11 + beta*c(i,j)
print*,'c(',i,',',j,')=',c(i,j)
end do
end do
stop
end
The loop is unrolled by a factor of 4. dskn41.f. You can skip to dskn44.f or dskn44.c if you like.
program dskn
integer lda, ldk
parameter (lda=10000,ldk=30)
real*8 a(lda,ldk), b(ldk,ldk), c(lda,ldk)
integer i,j,k, m,n,l
real*8 alpha,beta
real*8 g11, g12, g13, g14
real*8 g21, g22, g23, g24
real*8 g31, g32, g33, g34
real*8 g41, g42, g43, g44
print*, 'input m <= ', lda
read*, m
print*, 'input n <= ', ldk
read*, n
print*, 'input l <= ', ldk
read*, l
do i = 1, m
do j = 1, n
a(i,j) = 1.d0
c(i,j) = 0.d0
end do
end do
do i = 1, n
do j = 1, l
b(i,j) = 1.d0
end do
end do
alpha = 1.d0
beta = 1.d0
do i = 1, m-3, 4
do j = 1, l
g11 = 0.d0
g21 = 0.d0
g31 = 0.d0
g41 = 0.d0
do k = 1, n
g11 = g11 + a(i,k) * b(k,j)
g21 = g21 + a(i+1,k) * b(k,j)
g31 = g31 + a(i+2,k) * b(k,j)
g41 = g41 + a(i+3,k) * b(k,j)
end do
c(i,j) = alpha*g11 + beta*c(i,j)
c(i+1,j) = alpha*g21 + beta*c(i+1,j)
c(i+2,j) = alpha*g31 + beta*c(i+2,j)
c(i+3,j) = alpha*g41 + beta*c(i+3,j)
print*,'c(',i,',',j,')=',c(i,j)
print*,'c(',i+1,',',j,')=',c(i+1,j)
print*,'c(',i+2,',',j,')=',c(i+2,j)
print*,'c(',i+3,',',j,')=',c(i+3,j)
end do
end do
do i = (m/4)*4+1, m
do j = 1, l
g11 = 0.d0
do k = 1, n
g11 = g11 + a(i,k) * b(k,j)
end do
c(i,j) = alpha*g11 + beta*c(i,j)
print*,'c(',i,',',j,')=',c(i,j)
end do
end do
stop
end
This one is dskn42.f (one loop unrolled by 4 and one by 2)
program dskn
! this program attempts to speed computation of
! C <-- alpha * A * X + beta * C
! where A,C are tall and skinny , X relatively small
integer lda, ldk
parameter (lda=300010,ldk=10)
real*8 a(lda,ldk), b(ldk,ldk), c(lda,ldk)
integer i,j,k, m,n,l
real*8 alpha,beta
! variables from here will be local in a subroutine.
! output control variable for debugging ..
integer idum
! scratch variables
real*8 g11, g12, g13, g14
real*8 g21, g22, g23, g24
real*8 g31, g32, g33, g34
real*8 g41, g42, g43, g44
! timing variables
real*8 Mflop1, tim(2)
real*4 elaps1
integer ierr
external wdiff
print*, 'input m <= ', lda
read*, m
print*, 'input n <= ', ldk
read*, n
print*, 'input l <= ', ldk
read*, l
print*,' input idum > to print debug'
read*,idum
! initialization
do i = 1, m
do j = 1, n
a(i,j) = 1.d0
c(i,j) = 0.d0
end do
end do
do i = 1, n
do j = 1, l
b(i,j) = 1.d0
end do
end do
alpha = 1.d0
beta = 1.d0
! matrix matrix multiply
call wtime(tim, ierr)
!$OMP PARALLEL DO schedule(static,10000)
!$OMP& private (i,j,k,g11,g21,g31,g41,g12,g22,g32,g42)
do i = 1, m-3, 4
do j = 1, l-1, 2
g11 = 0.d0
g21 = 0.d0
g31 = 0.d0
g41 = 0.d0
g12 = 0.d0
g22 = 0.d0
g32 = 0.d0
g42 = 0.d0
do k = 1, n
g11 = g11 + a(i,k) * b(k,j)
g21 = g21 + a(i+1,k) * b(k,j)
g31 = g31 + a(i+2,k) * b(k,j)
g41 = g41 + a(i+3,k) * b(k,j)
g12 = g12 + a(i,k) * b(k,j+1)
g22 = g22 + a(i+1,k) * b(k,j+1)
g32 = g32 + a(i+2,k) * b(k,j+1)
g42 = g42 + a(i+3,k) * b(k,j+1)
end do
c(i,j) = alpha*g11 + beta*c(i,j)
c(i+1,j) = alpha*g21 + beta*c(i+1,j)
c(i+2,j) = alpha*g31 + beta*c(i+2,j)
c(i+3,j) = alpha*g41 + beta*c(i+3,j)
c(i,j+1) = alpha*g12 + beta*c(i,j+1)
c(i+1,j+1) = alpha*g21 + beta*c(i+1,j+1)
c(i+2,j+1) = alpha*g31 + beta*c(i+2,j+2)
c(i+3,j+1) = alpha*g41 + beta*c(i+3,j+2)
! if (idum.gt.0) then
! print*,'c(',i,',',j,')=',c(i,j)
! print*,'c(',i+1,',',j,')=',c(i+1,j)
! print*,'c(',i+2,',',j,')=',c(i+2,j)
! print*,'c(',i+3,',',j,')=',c(i+3,j)
! print*,'c(',i,',',j+1,')=',c(i,j+1)
! print*,'c(',i+1,',',j+1,')=',c(i+1,j+1)
! print*,'c(',i+2,',',j+1,')=',c(i+2,j+1)
! print*,'c(',i+3,',',j+1,')=',c(i+3,j+1)
! endif
end do
do j = (l/2)*2+1, l
g11 = 0.d0
g21 = 0.d0
g31 = 0.d0
g41 = 0.d0
do k = 1, n
g11 = g11 + a(i,k) * b(k,j)
g21 = g21 + a(i+1,k) * b(k,j)
g31 = g31 + a(i+2,k) * b(k,j)
g41 = g41 + a(i+3,k) * b(k,j)
end do
c(i,j) = alpha*g11 + beta*c(i,j)
c(i+1,j) = alpha*g21 + beta*c(i+1,j)
c(i+2,j) = alpha*g31 + beta*c(i+2,j)
c(i+3,j) = alpha*g41 + beta*c(i+3,j)
! if (idum.gt.0) then
! print*,'c(',i,',',j,')=',c(i,j)
! print*,'c(',i+1,',',j,')=',c(i+1,j)
! print*,'c(',i+2,',',j,')=',c(i+2,j)
! print*,'c(',i+3,',',j,')=',c(i+3,j)
! endif
end do
end do
do i = (m/4)*4+1, m
do j = 1, l
g11 = 0.d0
do k = 1, n
g11 = g11 + a(i,k) * b(k,j)
end do
c(i,j) = alpha*g11 + beta*c(i,j)
! if (idum.gt.0) then
! print*,'c(',i,',',j,')=',c(i,j)
! endif
end do
end do
elaps1 = wdiff(tim,ierr)
Mflop1 = (((2.0*m)*l)*n)/elaps1/1.d6
print*,'m =',m, 'n =',n, 'l =', n
print*,'The Mflop rate was', Mflop1
stop
end
This one is a 4 by 4 unrolling (both in Fortran and C as dskn44.c and dskn44.f). It's a 3 deep loop. Should we also unroll the innermost loop? For a startling change in speed, try removing a variable from the private list.
program dskn
! this program attempts to speed computation of
! C <-- alpha * A * X + beta * C
! where A,C are tall and skinny , X relatively small
integer lda, ldk
parameter (lda=400010,ldk=10)
real*8 a(lda,ldk), b(ldk,ldk), c(lda,ldk)
integer i,j,k, m,n,l
real*8 alpha,beta
! variables from here will be local in a subroutine.
! output control variable for debugging ..
integer idum
! variable for smoothing of timeing
integer ii
! scratch variables
real*8 g11, g12, g13, g14
real*8 g21, g22, g23, g24
real*8 g31, g32, g33, g34
real*8 g41, g42, g43, g44
! timing variables
real*8 Mflop1, tim(2)
real*4 elaps1
integer ierr
integer numth, OMP_GET_MAX_THREADS
external wdiff
! omp variables
external OMP_GET_MAX_THREADS
numth = OMP_GET_MAX_THREADS()
print*,' '
print*,'OMP_NUM_THREADS=',numth
! print*, 'input m <= ', lda
read*, m
! print*, 'input n <= ', ldk
read*, n
! print*, 'input l <= ', ldk
read*, l
! print*,' input idum > to print debug'
read*,idum
! initialization
do i = 1, m
do j = 1, n
a(i,j) = 1.d0
c(i,j) = 0.d0
end do
end do
do i = 1, n
do j = 1, l
b(i,j) = 1.d0
end do
end do
alpha = 1.d0
beta = 1.d0
! matrix matrix multiply
call wtime(tim, ierr)
!$OMP PARALLEL
do ii = 1,10
!$omp do
!$OMP& private (i,j,k,g11,g21,g31,g41,g12,g22,g32,g42)
!$OMP& private (g13,g23,g33,g43,g14,g24,g34,g44)
do i = 1, m-3, 4
do j = 1, l-3, 4
g11 = 0.d0
g21 = 0.d0
g31 = 0.d0
g41 = 0.d0
g12 = 0.d0
g22 = 0.d0
g32 = 0.d0
g42 = 0.d0
g13 = 0.d0
g23 = 0.d0
g33 = 0.d0
g43 = 0.d0
g14 = 0.d0
g24 = 0.d0
g34 = 0.d0
g44 = 0.d0
do k = 1, n
g11 = g11 + a(i,k) * b(k,j)
g21 = g21 + a(i+1,k) * b(k,j)
g31 = g31 + a(i+2,k) * b(k,j)
g41 = g41 + a(i+3,k) * b(k,j)
g12 = g12 + a(i,k) * b(k,j+1)
g22 = g22 + a(i+1,k) * b(k,j+1)
g32 = g32 + a(i+2,k) * b(k,j+1)
g42 = g42 + a(i+3,k) * b(k,j+1)
g13 = g13 + a(i,k) * b(k,j+2)
g23 = g23 + a(i+1,k) * b(k,j+2)
g33 = g33 + a(i+2,k) * b(k,j+2)
g43 = g43 + a(i+3,k) * b(k,j+2)
g14 = g14 + a(i,k) * b(k,j+3)
g24 = g24 + a(i+1,k) * b(k,j+3)
g34 = g34 + a(i+2,k) * b(k,j+3)
g44 = g44 + a(i+3,k) * b(k,j+3)
end do
c(i,j) = alpha*g11 + beta*c(i,j)
c(i+1,j) = alpha*g21 + beta*c(i+1,j)
c(i+2,j) = alpha*g31 + beta*c(i+2,j)
c(i+3,j) = alpha*g41 + beta*c(i+3,j)
c(i,j+1) = alpha*g12 + beta*c(i,j+1)
c(i+1,j+1) = alpha*g22 + beta*c(i+1,j+1)
c(i+2,j+1) = alpha*g32 + beta*c(i+2,j+1)
c(i+3,j+1) = alpha*g42 + beta*c(i+3,j+1)
c(i,j+2) = alpha*g13 + beta*c(i,j+2)
c(i+1,j+2) = alpha*g23 + beta*c(i+1,j+2)
c(i+2,j+2) = alpha*g33 + beta*c(i+2,j+2)
c(i+3,j+2) = alpha*g43 + beta*c(i+3,j+2)
c(i,j+3) = alpha*g14 + beta*c(i,j+3)
c(i+1,j+3) = alpha*g24 + beta*c(i+1,j+3)
c(i+2,j+3) = alpha*g34 + beta*c(i+2,j+3)
c(i+3,j+3) = alpha*g44 + beta*c(i+3,j+3)
! if (idum.gt.0) then
! print*,'c(',i,',',j,')=',c(i,j)
! print*,'c(',i+1,',',j,')=',c(i+1,j)
! print*,'c(',i+2,',',j,')=',c(i+2,j)
! print*,'c(',i+3,',',j,')=',c(i+3,j)
! print*,'c(',i,',',j+1,')=',c(i,j+1)
! print*,'c(',i+1,',',j+1,')=',c(i+1,j+1)
! print*,'c(',i+2,',',j+1,')=',c(i+2,j+1)
! print*,'c(',i+3,',',j+1,')=',c(i+3,j+1)
! endif
end do
do j = (l/4)*4+1, l
g11 = 0.d0
g21 = 0.d0
g31 = 0.d0
g41 = 0.d0
do k = 1, n
g11 = g11 + a(i,k) * b(k,j)
g21 = g21 + a(i+1,k) * b(k,j)
g31 = g31 + a(i+2,k) * b(k,j)
g41 = g41 + a(i+3,k) * b(k,j)
end do
c(i,j) = alpha*g11 + beta*c(i,j)
c(i+1,j) = alpha*g21 + beta*c(i+1,j)
c(i+2,j) = alpha*g31 + beta*c(i+2,j)
c(i+3,j) = alpha*g41 + beta*c(i+3,j)
! if (idum.gt.0) then
! print*,'c(',i,',',j,')=',c(i,j)
! print*,'c(',i+1,',',j,')=',c(i+1,j)
! print*,'c(',i+2,',',j,')=',c(i+2,j)
! print*,'c(',i+3,',',j,')=',c(i+3,j)
! endif
end do
end do
!$omp end do
! logical error here , what is it ?
do i = (m/4)*4+1, m
do j = 1, l
g11 = 0.d0
do k = 1, n
g11 = g11 + a(i,k) * b(k,j)
end do
c(i,j) = alpha*g11 + beta*c(i,j)
! if (idum.gt.0) then
! print*,'c(',i,',',j,')=',c(i,j)
! endif
end do
end do
end do
!$omp end parallel
elaps1 = wdiff(tim,ierr)
Mflop1 = (((2.0*m)*l)*n)/elaps1/1.d5
print*,'m =',m, 'n =',n, 'l =', n
print*,'The Mflop rate was', Mflop1
stop
end
or in C
#include
#include
#include
#include
int main()
/* translated from dskn44.f
program dskn
! this program attempts to speed computation of
! C <-- alpha * A * X + beta * C
! where A,C are tall and skinny , X relatively small */
{
int i, j, k, m, n, l, ii, jj;
double a[300010][12],b[12][12],c[300010][12];
double alpha, beta;
/* ! variables from here will be local in a function .
! output control variable for debugging .. */
/* integer idum */
int idum;
/*! scratch variables */
double g11, g12, g13, g14, g21, g22, g23, g24;
double g31, g32, g33, g34, g41, g42, g43, g44;
/* timing variables */
double Mflop1, tim[2], elaps1;
double endtim;
/* OpenMP variables */
int ierr, numth;
numth = omp_get_max_threads();
printf("\n");
printf("omp_num_threads = %d\n",numth);
scanf("%d",&n);
scanf("%d",&l);
scanf("%d",&idum);
for (m = 4000; m < 300000; m *= 2) {
/*! initialization */
for (i=0; i< m; i++) {
for (j=0; j< n; j++) {
a[i][j] = 1.0;
c[i][j] = 0.0;
}
}
for (i=0; i
One the matrices are too large for cache, the Fortran and C versions
appear to give about the same speed.
You can see quite a bit of experimentation could be used to find
the best loop unrolling parameters.
The experimentation can be automated. This
is the approach used to construct the ATLAS BLAS (Basic Linear
Algebra Subroutines). These are due mainly to Clint Whaley. You can
look at the code if you like (open source). His "kernel" routines
are in C. ATLAS source code can be loaded on any machine and in an hour
or so you have a BLAS tuned for that machine and running at better
than 80% of peak machine efficiency.
For Intel processors and compilers, the libmkl libraries are often
a more efficient than ATLAS. Formerly, intel contracted this work to
underemployed Russian nuclear scientists.
The Goto BLAS are usually the fastest (Goto is at the
University of Texas), with the libraries freely available, but not
the source code. Goto is owed a debt of gratitude for his fine
work in handtuning to get very good efficiencies.
AMD distributes acml libraries (both threaded and non-threaded) for
use on its chips.
The code callem.f benchmarks this same problem by using OpenMP
and acml libraries. This is in the directory you downloaded last week,
along with scripts for compiling it such as makegfort, makeintel,
makeintelacml, makepgi.
The actual BLAS calls distributed using OpenMP to distribute
the work among threads are in the subroutines dgskn.f, dglng.f and dglnsk.f
called by callem.f. The following figures show some results from calling acml libraries via
the dgskn.f, dglng.f and dglnsk.f subroutines. These matrices
had the larger dimension 300K along with the specified smaller dimension. Unless
other specified, they were compiled with the pgif90 compiler, vs. 7.2-2. All runs
were on a blade with two 8 core Opterons sharing memory.
Smallest Dimension vs. Gflop Rate
Gfortran smallest dimension vs. Gflop Rate
Gfortran Gflops vs. Number of Threads
pgf90 Gflops vs. Number of threads -- width 40
pgf90 Gflops vs. Number of Threads -- width 24
How do these computation speeds compare to those obtained by the loop-unrolled codes dgsk44.c and dgsk44.f ?