- What and why the BLAS
As we've seen, a simple matrix vector multiplication can run at
widely varying speeds. The initial class codes ran on a 3.06 GHz
Pentium at about 200 Mflops. Eventually, adjusting loop unrollings
and finding the correct compiler flags to use the MXX registers, we
got "in-cache" matrix vector multiplies to run at about 3800 Mflops.
Of course, if we went to a new computer and/or compiler,
we'd have to spend another few days.
Alternatively, we could just call the BLAS DGEMV operator. This can be a bit
of a pain, due to keeping all the calling arguments straight, but if
we do learn to do it, then we don't have to rewrite to go to a new computer
(just figure out what library to link to).
There are many other operators we can portably call with BLAS and
by linking to a BLAS operator we get code that runs much faster than a
first attempt. As an example, one user required ten hours for solving
a matrix equation. Using LAPACK and BLAS on a 2 CPU node, the solve
required less than four minutes. (Perhaps instead of BLAS, we should
call these the BLAST). For a complete BLAS description,
see Netlib.
The BLAS originated when the first Cray computers could not run at
advertised Mflop rates. These were vector machines. Jack Dongarra
and some other people still working developed the BLAS libraries to help
speed computations. The first BLAS
were vector-vector operations, so-called BLAS-1 operations, such as
inner products _DOTs and adding a multiple of one vector to another.
_AXPYs.
The next generation of BLAS were matrix vector operations, for example
matrix vector multiplies such as _GEMVs.
For either BLAS-1 or BLAS-2 operations the number of floating point
operations is proportional to the amount of data. When the data
is too large for cache, these operations
are therefore limited by the speed of the data bus from DRAM.
Thus on cache-based architectures, neither BLAS-1 nor BLAS-2 operations
can (for data too large for cache) work at the advertised MFlop rate.
BLAS-3 operations such as matrix matrix multiplies _GEMMs can perform
enough flops per data fetch and store to run at peak clock speed.
Operations such as matrix solves of Ax=b by factoring A =LU
or solving least squares computations, such as computing
x to minimize || b - Ax || by decomposing A = QR are almost entirely
performed by _GEMMs and hence can run at peak speed.
In the following sections, some of the more popular BLAS operators are listed.
- BLAS-1, vector-vector operations
The BLAS-1 operators are vector-vector. For all BLAS
(and LAPACK) prefixed letters indicates
arithmetic type, including S single, D double, C complex, Z double
precision complex. The BLAS-1 include
_SWAP x <--> y S, D
_SCAL x <-- alpha * x S, D, C, Z, CS, ZD
_COPY x <-- y S, D, C, Z
_AXPY y <-- alpha * x + y S, D, C, Z
_DOT dot <-- S, D, DS
_DOTU dot <-- x^T y C, Z
_DOTC dot <-- x^H y C, Z
_NRM2 nrm2 <-- || x ||_2 S, D, SC, DZ
_ASUM nrm1 <-- || x ||_1 S, D, SC, DZ (returns sum of
abs of imaginary
and real vector entries)
I_AMAX returns i of first
maximal element of (if x_i real part, y_i imaginary part)
|x_i| + |y_i| S, D, C, Z
The BLAS-1 operators, such as _COPY, _AXPY, _SWAP, and _SCAL, require
a load and a store of each vector entry. In each case the number of
flops is the same as the number of reads and the same as the number of
writes. For the _DOT, _ASUM, I_AMAX, the number of reads is
the same as the number of flops and only one store is required. For
the _DNRM2, only one write is needed and the number of flops is twice
the number of reads.
While BLAS-1 operations were efficient on vector machines, they
are less efficient on cache-based architectures where the bus bandwidth
between DRAM and CPU tends to be the bottleneck.
- BLAS-2, matrix-vector
For BLAS-2 operations, the same set of one letter prefixes is used.
The next two letters indicate the type of matrix. The following is
the LAPACK naming convention (see
LAPACK .
BD bidiagonal
DI diagonal
BG general band
GE general (unsymmetric, not always square)
GG pair of gneral matrices
GT general tridiagonal
HB (complex) Hermitian band
HE (complex) Hermitian
HG upper Hessenberg and a triangular pair of matrices
HP (complex) Hermitian, packed storage
HS upper Hessenberg
OP (real) orthogonal, packed storage
PB symmetric or Hermitian positive definite band
PO symmetric or Hermitian positive definite
PP symmetric or Hermitian positive definite, packed storage
PT symmetric or Hermitian positive definite tridiagonal
SB (real) symmetric band
SP symmetric, packed storage
ST (real) symmetric tridiagonal
SY symmetric
TB triangular band
TG pair of triangular matrices
TP triangular, packed storage
TR triangular
TZ trapezoidal
UN (complex) unitary
UP (complex) unitary, packed storage
All of these refer to dense or banded matrices. The newer BLAS
also include some sparse options.
The ___MV operations are matrix vector multiplies. These include GE,
GB, HE, HB, HP, SY, SB, SP, TR, TB, TP, TR, TBS, and TPS matrices.
The options with TBS, TPS entail multiplying by the inverse matrix.
The GE options have possible operations including multiplying by the
matrix transpose.
_GEMV y <-- alpha*A*x + beta*y, y <-- alpha*A^T*x + beta*y
_GBMV y <-- alpha*A*x + beta*y, y <-- alpha*A^T*x + beta*y
_HEMV y <-- alpha*A*x + beta*y
_HBMV y <-- alpha*A*x + beta*y
_HPMV y <-- alpha*A*x + beta*y
_SYMV y <-- alpha*A*x + beta*y
_SBMV y <-- alpha*A*x + beta*y
_SPMV y <-- alpha*A*x + beta*y
_TRMV y <-- A*x , x <-- A^T*x
_TBMV y <-- A*x , x <-- A^T*x
_TPMV y <-- A*x , x <-- A^T*x
_TPSV y <-- inv(A)*x , x <-- inv(A^T) * x
_TRSV y <-- inv(A)*x , x <-- inv(A^T) * x
_TBSV y <-- inv(A)*x , x <-- inv(A^T) * x
The matrix vector multiplications involve few writes. Also there
are at least two flops per matrix element read in (four flops for
the symmetric cases). If the main bottleneck is the data cache,
then the matrix vector multiplications can be expected to be also
twice as fast as _AXPY operations.
The ___R operations are rank-one updates and the ___R2 operations are
rank-two updates. They include
_GER, _GERU, _GERC, _HER, _HPR, _HER2, _HPR2, _SYR, _SPR, _SYR2, _SPR2
operators. Recall that a rank-one update is of the form
A <-- alpha * x y^T + A
The rank-two operators preserve symmetry (or the Hermitian property)
A <-- alpha * x * y^T + alpha * y * x^T + A
Since updates require writing all of A (as well as reading it), they
are significantly slower than matrix vector multiplications.
- BLAS-3, matrix-matrix operations
The main BLAS-3 operations are matrix vector multiplications.
These are ___MM operations, for matrix types GE, SY, HE, SY, etc.
For example (leaving out some functionality)
_GEMM C <-- alpha op(A) op(B) + beta C
_SYMM C <-- alpha AB + beta C
_HEMM C <-- alpha AB + beta C
_SYRK C <-- alpha A A^T + beta C
_HERK C <-- alpha A A^H + beta C
_SYRK2 C <-- alpha A B^T + alpha B A^T + beta C
_TRMM B < -- alpha op(A) B,
_TRSM B < -- alpha op(inv(A)) B
If A, B, and C are n by n then performing a _GEMM requires
3 n^2 reads and n^2 writes where the matrix matrix multiplication
requires 2 n^3 flops. By performing operations on subblocks of A,
we can manage to perform O(k^3) flops with O(k^2) reads and writes, thus
O(k) flops can be managed per read/write. Thus _GEMM operations can
typically run almost at the peak advertised speed. On the BladeCenter,
_GEMM operations can be about ten times as fast as _GEMV operations.
- Reference BLAS
The above account gives an introduction to BLAS capabilities.
Of course, to actually call them, we need to get long lists of
arguments straightened out for each call. For this purpose, I
keep a directory of Fortran .f files which are the source code
for the reference BLAS. The source files are found in
BLAS tar file . I hunt out the routine I want and then
copy its first line (SUBROUTINE or FUNCTION) to the routine
I want to call it from.
Then I can check carefully to see if each argument is of the
correct type. Moreover, the operations are commented in
a standard way, so that it's easy to read and see what each
argument does and what the function (subroutine) returns.
For example, then, it's relatively easy to convert a matlab
code that manipulates matrices to call BLAS routines.
Some tricks.
To reference a row of a matrix of leading dimension lda, reference
the vector with "stride" lda. For a diagonal, use dimension lda+1 .
To reference a block of a matrix A(k:k+m;l:l+n), specify
A as A(k,l) with m+1 rows and n+1 columns.
- Optimized BLAS Libraries
The reference BLAS return correct answers, but do not run
particularly fast. To get fast computations, we turn to
libraries which have been optimized for the particular computer
we're using. Ten years ago, e.g., when I spent a few summers
at Oak Ridge National Labs, vendors would typically charge
for these libraries.
More recently, Clint Whaley (who worked for Jack Dongarra, who
was the main guy behind the BLAS) developed the ATLAS BLAS.
ATLAS BLAS.
These are written in C. In compiling the code, timing experiments
are used to set loop-unrolling parameters and
block sizes. The ATLAS
BLAS typically attain speeds very nearly as fast as vendor-supplied
BLAS. There are various BLAS packages installed on the
Blade Center. In rank-ordering of speed
- GoTo BLAS
- BLAS supplied with Intel compiler
- ATLAS BLAS
(the top 3 are relatively close on a single processor,
- PGI supplied BLAS
- reference BLAS.
Both the GoTo BLAS and the ATLAS BLAS can (depending on operating
system) often be downloaded from the web and just copied into
the desired location.
If you encounter a link time error, the quickest solution is often
to search on the internet for the missing file + the BLAS package
in question + the compiler you're using.
- Converting Code to BLAS
Often it can be a pain to exactly match all the calling arguments
for BLAS calls. My own programming style is to start with a matlab
script such as the following fragment. Matlab is interpreted so
you can try your command on screen and then when you know it
works add it to a .m file, e.g., foo.m . Eventually you've converted
the matlab script into a program. The following code
fragment is embedded inside
a for loop of a larger program which reduces a matrix to bidiagonal
form.
vold = v(2:n-i+1);
zold = z(2:n-i+1);
if(i==2), % special case for the first time through the loop.
% could get rid of the special case by renaming variables.
a(i-1,i:n) = a(i-1,i:n) - u(1)*z - w(1)*v; % zeros the i-1st row.
else,
a(i-1,i:n) = a(i-1,i:n) - u(1)*z - wnew(1)*v; % zeros the i-1st row.
end;
vmat(i-1,i:n) = v;
uold = u(2:m-i+2);
if(i==2), % special case becuase the first time through use w.
wold = w(2:m-i+1);
else
wold = wnew(2:m-i+2);
end;
vtemp = a(i,i:n) - umat(i,1:i-1) * zmat(1:i-1,i:n) ...
- wmat(i,1:i-1) * vmat(1:i-1,i:n); % updates the ith row.
% the ith row has never been updated till this step.
a(i,i+1:n) = vtemp(2:n-i+1);
% The new u should depend on the update for the first few columns.
% So need the column i update done outside the loop.
a(i-1:m,i-1) = a(i-1:m,i-1) - (u'*a(i-1:m,i-1))*u; % zeros col i-1
The code is matrix oriented. Converting it to Fortran (or C)
would typically involve writing do (for) loops. But instead we
can just make it into a sequence of BLAS calls. The following shows
the BLAS call corresponding to each line
vold = v(2:n-i+1); % DCOPY
zold = z(2:n-i+1); % DCOPY
if(i==2), % special case for the first time through the loop.
% could get rid of the special case by renaming variables.
% If (I.EQ. 2) THEN
a(i-1,i:n) = a(i-1,i:n) - u(1)*z - w(1)*v; % zeros the i-1st row.
% 2 DAXPY calls
else,
% ELSE
a(i-1,i:n) = a(i-1,i:n) - u(1)*z - wnew(1)*v; % zeros the i-1st row.
% 2 DAXPY calls
end;
% ENDIF
vmat(i-1,i:n) = v; % DCOPY
uold = u(2:m-i+2); % DCOPY
if(i==2), % special case becuase the first time through use w.
% IF (I.EQ.2) THEN
wold = w(2:m-i+1); %DCOPY
else
% ELSE
wold = wnew(2:m-i+2); % DCOPY
end;
% ENDIF
vtemp = a(i,i:n) - umat(i,1:i-1) * zmat(1:i-1,i:n) ...
- wmat(i,1:i-1) * vmat(1:i-1,i:n); % updates the ith row.
% the ith row has never been updated till this step.
% 2 DGEMV calls
a(i,i+1:n) = vtemp(2:n-i+1); % DCOPY
% The new u should depend on the update for the first few columns.
% So need the column i update done outside the loop.
a(i-1:m,i-1) = a(i-1:m,i-1) - (u'*a(i-1:m,i-1))*u; % zeros col i-1
% alpha = u'*a(i-1:m,i-1) is a DDOT
% then a(i-1:m,i-1) = a(i-1:m,i-1) - alpha*a(i-1:m,i-1) is a DAXPY
So now we need the daxpy.f file to see how the conversion goes.
subroutine daxpy(n,da,dx,incx,dy,incy)
c
c constant times a vector plus a vector.
c uses unrolled loops for increments equal to one.
c jack dongarra, linpack, 3/11/78.
c modified 12/3/93, array(1) declarations changed to array(*)
c
double precision dx(*),dy(*),da
integer i,incx,incy,ix,iy,m,mp1,n
c
if(n.le.0)return
if (da .eq. 0.0d0) return
if(incx.eq.1.and.incy.eq.1)go to 20
c
c code for unequal increments or equal increments
c not equal to 1
c
ix = 1
iy = 1
if(incx.lt.0)ix = (-n+1)*incx + 1
if(incy.lt.0)iy = (-n+1)*incy + 1
do 10 i = 1,n
dy(iy) = dy(iy) + da*dx(ix)
ix = ix + incx
iy = iy + incy
10 continue
return
c
c code for both increments equal to 1
c
c
c clean-up loop
c
20 m = mod(n,4)
if( m .eq. 0 ) go to 40
do 30 i = 1,m
dy(i) = dy(i) + da*dx(i)
30 continue
if( n .lt. 4 ) return
40 mp1 = m + 1
do 50 i = mp1,n,4
dy(i) = dy(i) + da*dx(i)
dy(i + 1) = dy(i + 1) + da*dx(i + 1)
dy(i + 2) = dy(i + 2) + da*dx(i + 2)
dy(i + 3) = dy(i + 3) + da*dx(i + 3)
50 continue
return
end
Hm not as well commented as the later BLAS.
subroutine daxpy(n,da,dx,incx,dy,incy)
but this is dy <-- dy + da* dx , dx, dy vectors, of length n, da a scalar.
incx is the stride of da, incy is the stride of incy.
a(i-1,i:n) = a(i-1,i:n) - u(1)*z - w(1)*v; % zeros the i-1st row.
becomes where we want a row of A and A has leading dimension lda.
call daxpy(n-i+1 ,-u(1), z, 1, a(i-1,i), lda)
call daxpy(n-i+1, -w(1), v, 1, a(i-1,i), lda)
and then we have the idea how to do all the daxpys. Next
we can tackle the DCOPYs and then the DGEMVs.
subroutine dcopy(n,dx,incx,dy,incy)
c
c copies a vector, x, to a vector, y.
c uses unrolled loops for increments equal to one.
c jack dongarra, linpack, 3/11/78.
c modified 12/3/93, array(1) declarations changed to array(*)
DGEMV is BLAS-2, more of a team effort, better documented.
SUBROUTINE DGEMV ( TRANS, M, N, ALPHA, A, LDA, X, INCX,
$ BETA, Y, INCY )
* .. Scalar Arguments ..
DOUBLE PRECISION ALPHA, BETA
INTEGER INCX, INCY, LDA, M, N
CHARACTER*1 TRANS
* .. Array Arguments ..
DOUBLE PRECISION A( LDA, * ), X( * ), Y( * )
* ..
*
* Purpose
* =======
*
* DGEMV performs one of the matrix-vector operations
*
* y := alpha*A*x + beta*y, or y := alpha*A'*x + beta*y,
*
* where alpha and beta are scalars, x and y are vectors and A is an
* m by n matrix.
*
* Parameters
* ==========
*
* TRANS - CHARACTER*1.
* On entry, TRANS specifies the operation to be performed as
* follows:
*
* TRANS = 'N' or 'n' y := alpha*A*x + beta*y.
*
* TRANS = 'T' or 't' y := alpha*A'*x + beta*y.
*
* TRANS = 'C' or 'c' y := alpha*A'*x + beta*y.
*
* Unchanged on exit.
*
* M - INTEGER.
* On entry, M specifies the number of rows of the matrix A.
* M must be at least zero.
* Unchanged on exit.
*
* N - INTEGER.
* On entry, N specifies the number of columns of the matrix A.
* N must be at least zero.
* Unchanged on exit.
*
* ALPHA - DOUBLE PRECISION.
* On entry, ALPHA specifies the scalar alpha.
* Unchanged on exit.
*
* A - DOUBLE PRECISION array of DIMENSION ( LDA, n ).
* Before entry, the leading m by n part of the array A must
* contain the matrix of coefficients.
* Unchanged on exit.
*
* LDA - INTEGER.
* On entry, LDA specifies the first dimension of A as declared
* in the calling (sub) program. LDA must be at least
* max( 1, m ).
* Unchanged on exit.
*
* X - DOUBLE PRECISION array of DIMENSION at least
* ( 1 + ( n - 1 )*abs( INCX ) ) when TRANS = 'N' or 'n'
* and at least
* ( 1 + ( m - 1 )*abs( INCX ) ) otherwise.
* Before entry, the incremented array X must contain the
* vector x.
* Unchanged on exit.
*
* INCX - INTEGER.
* On entry, INCX specifies the increment for the elements of
* X. INCX must not be zero.
* Unchanged on exit.
*
* BETA - DOUBLE PRECISION.
* On entry, BETA specifies the scalar beta. When BETA is
* supplied as zero then Y need not be set on input.
* Unchanged on exit.
*
* Y - DOUBLE PRECISION array of DIMENSION at least
* ( 1 + ( m - 1 )*abs( INCY ) ) when TRANS = 'N' or 'n'
* and at least
* ( 1 + ( n - 1 )*abs( INCY ) ) otherwise.
* Before entry with BETA non-zero, the incremented array Y
* must contain the vector y. On exit, Y is overwritten by the
* updated vector y.
*
* INCY - INTEGER.
* On entry, INCY specifies the increment for the elements of
* Y. INCY must not be zero.
* Unchanged on exit.
*
*
* Level 2 Blas routine.
*
* -- Written on 22-October-1986.
* Jack Dongarra, Argonne National Lab.
* Jeremy Du Croz, Nag Central Office.
* Sven Hammarling, Nag Central Office.
* Richard Hanson, Sandia National Labs.
Personally, I leave
the Matlab code as comments and later can compare the Matlab
code to the Fortran code line by line with a debugger. Usually,
I get almost all the BLAS-1 calls right but maybe not the BLAS-2
calls.
On the BladeCenter, the BLAS are pre-installed,
the following sections give some specifics of how to link to the
BLAS, depending on which compiler environement you have chosen.
- The LAPACK library
The LAPACK library combines the earlier EISPACK (eigenvalues) and
LINPACK (matrix solve) libraries. It is based on BLAS operators,
allowing efficient and portable calls to high quality matrix solve
and eigenvalue routines. Matrices are assume to be dense or
banded. There is a handbook showing how to use the package.
LAPACK User's Guide, 3rd Edition, Anderson, Bai, Bischof, Blackford,
Demmel, Dongarra, Du Croz, Greenbaum, Hammarling, McKenney, Sorensen,
1999, SIAM Press. There is an online version. LAPACK User's Guide.
- A Sample make file for the intel ifc compiler.
Using LAPACK and BLAS insures fast computations, whatever machine
you choose to run on. You do have to figure out how to link to
the libraries. On the BladeCenter, choice of LAPACK library depends
on which compiler you've used. First the intel ifc compiler.
Add the Intel environment by the command line
>add intel
The following make file would link to the
BLAS library.
- LIBDIR = /usr/local/intel/mkl60/lib/32
- LIBS = -L${LIBDIR} -lmkl_ia32 -lmkl_lapack -lguide -lpthread
- FC = /usr/local/intel/compiler70/ia32/bin/ifc
- FFLAGS = -static
- foo:
- $(FC) $(FFLAGS) -c foo.f
- $(FC) $(FFLAGS) -o foo foo.o ${LIBS}
The -static flag
allows matrices using up to 231 - 1 bytes (one byte less than 2 Gigabytes).
Using the -static flag takes the memory space that would otherwise
be used for .so shared libraries.
If the above makefile is foomake, you could execute it by the command line
(making sure the $(FC) lines start with tabs).
>make -f foomake
The web page Intel support has more information on linking to the Intel math kernel library.
- Sample make files for the Portland Group pgf77 compiler.
Similarly, to link the BLAS and LAPACK libraries when using the PGI Fortran compiler, first add the pgi environment by
>add pgi
(should log out and back in if you've been using the Intel compiler environment). Then the following make file will work.
- LIBDIR = /usr/local/pgi/lunux86/5.0/lib
- LIBS = -L${LIBDIR1} -llapack -lblas
- FC = /usr/local/pgi/linux86/5.0/bin/pgf77
- foo:
- $(FC) -c foo.f
- $(FC) -o foo foo.o ${LIBS}
To allow larger matrices, use the -Bstatic flag. For faster execution link to the
Atlas BLAS library /usr/local/intel/blacs/Linux_P4SSE2/lib/libatlas.a. The following
slightly more elaborate make file uses the -Bstatic flag and Atlas BLAS. As with
the Intel compiler, the -Bstatic flag can not be used with an executable that
links to shared .so libraries.
- FFLAGS = -c -Bstatic -fastsse
- LFLAGS =
- LIBDIR = /usr/local/pgi/linux86/5.0/lib
- LIBDIR2 = /usr/local/intel/blacs/Linux_P4SSE2/lib
- LIBDIR3 = /usr/lib/gcc-lib/i386-redhat-linux/2.96
- LIBS = -L${LIBDIR2} -llapack -lcblas -lf77blas -latlas -lg2c
- foo:
- $(FC) -c foo.f
- $(FC) -o foo foo.o ${LIBS}
The -Bstatic flag allows use of larger arrays. The ATLAS BLAS is faster
than the PGI supplied BLAS.
- A Sample make file for the gnu g77 compiler.
The gnu environment is the default.
The following make file would link to the
ATLAS BLAS library.
- FC = /usr/bin/g77
- CC = /usr/bin/gcc
- LIBDIR = /usr/local/gnu/lib/Linux_P4SSE2/lib
- LIBS = -L${LIBDIR} -llapack -lcblas -lf77blas -latlas -lg2c
- FFLAGS = -static
- foo:
- $(FC) $(FFLAGS) -c foo.f
- $(FC) $(FFLAGS) -o foo foo.o ${LIBS}
The -static flag
allows matrices larger than the stack size. For example, if the
-static flag is not used, then dimensioning 3 1K by 1K double
precision matrices requiring a total of 24 MBytes of storage would
result in a run time (segmentation fault) message.
Using the -static flag takes the memory space that would otherwise
be used for .so shared libraries, so is not compatible with using
shared libraries.
If the above makefile is foomake, you could execute it by the command line
(making sure the $(FC) lines start with tabs).
>make -f foomake
- Who should use the BLAS and LAPACK libraries.
The BLAS (Basic Linear Algebra Subroutines) and LAPACK (Linear Algebra
Package) are basic building blocks for many codes. The BLAS perform
such basic operations as innner products, matrix-vector and matrix-matrix
products. The LAPACK routines use the BLAS routines to perform
dense matrix operations such as LU decomposition to solve linear equation,
QR decomposition to solve least square problems, and also singular value
and eigen problems.
Advantages of using the LAPACK and BLAS libraries are in having portable
fast code. Fortran (C is also possible with a bit more fiddling) codes
calling LAPACK and BLAS can be ported easily to a variety of archectectures.
The code is high quality, giving not only good performance, but also
handling exceptional cases and avoiding numeric under and overflow.
For problems too large to fit in cache, LAPACK codes often run in one third
or less of the time required by the predecessor packages EISPACK and LINPACK.
(For small matrices, size a few hundred or less, EISPACK and LINPACK
may sometimes be faster, having fewer levels of subroutine).
For solving a 12K by 12K linear system on a single processor, a user found that the Numeric
Recipes solver timed out after ten hours. The LAPACK solver dgesv required
fifteen minutes with the PGI compiler and PGI supplied BLAS, and about ten minutes
with the Intel compiler and BLAS.
The efficiency of the implementation depends mainly on the quality of the
underlying BLAS library. Good BLAS implementations allow matrix matrix
multiplications and such operations as LU and QR decomposition to run
at near the peak CPU clock rates.
Speeds obtained by downloading the standard BLAS source code and compiling
it are slower than for a tuned library. Several tuned BLAS and
LAPACK libraries are available on the blade cluster. Instructions on linking
to the Intel, PGI, and Atlas versions of the library are included above.
- Who should not use the BLAS and LAPACK libraries?
Many scientific computations solve large sparse linear systems. "Sparse" means
that most elements of the matrix are zero so that only the nonzero elements are
stored, enabling larger matrices to fit in the available memory. Two of the
best known publically available sparse solver packages are SuperLU (using sparse Gaussian
elimination) and PETSc (incorporating iterative solvers). For advice on choosing
an appropriate package, or other questions, you can e-mail Gary Howell, gary_howell@ncsu.edu.
|