The following is an update of Lecture 6 posted a few weeks ago. An additional code is added that enables reading of of 'RUA' Harwell-Boeing format matrices.
For example Davis's UF Sparse Matrix Collection. The Harwell Boeing format is essentially the sparse column format detailed below. For a complete description as well as "bones" of Fortran subroutines for reading the format, see Matrix Market, which also has a posted collection of matrices. A tar file with the code discussed below can be copied from /tmp/timit2.tar.
We've seen that we can get pretty good efficiencies for dense matrix vector multipliers. For dense matrices that fit "in-cache", we can get better than 60% of the advertised peak speed. For matrices that won't fit in cache, we're at something more like 13% of peak CPU speed, but we're limited mainly by the bus speed (running at a good chunk of bus speed).
Much of the masses of data used in scientific computation are stored in terms of matrices, but most typically, the matrices are sparse. Sparse matrices are mostly zero, so it's more efficient to store only the nonzero elements. How? We'll mainly follow Saad, "Iterative Methods for Sparse Linear Systems, 2nd Edition", Yousef Saad, SIAM, 2003.
Let's take an example matrix and show the most common representations. Consider the matrix
4.0 1.0 0.0 0.0 2.5 0.0 4.0 1.0 0.0 0.0 A = 0.0 1.0 4.0 0.0 1.0 0.0 0.0 1.0 4.0 0.0 2.5 0.0 0.0 0.5 4.0
which can be represented in Coordinate Format as one floating point vector
AA = 4.0 1.0 2.5 4.0 1.0 1.0 4.0 1.0 1.0 4.0 2.5 0.5 4.0
JR = 1 1 1 2 2 3 3 3 4 4 5 5 5
JC = 1 2 5 2 3 2 3 5 3 4 1 4 5
The coordinate format is straightforward. A disadvantage is that it requires
3 full length vectors, each of length
If we restrict ourselves to listing elements a row at a time then we can
compress the
AA = 4.0 1.0 2.5 4.0 1.0 1.0 4.0 1.0 1.0 4.0 2.5 0.5 4.0
JA = 1 4 6 9 11 14
IC = 1 2 5 2 3 2 3 5 3 4 1 4 5
This is called Compressed Sparse Row storage and is probably the most popular
sparse storage format.
Another format is Compressed Sparse Column, analogous, perhaps more popular.
The following snippet of code is Fortran 77 for a matrix vector multiply in a CSR format.
do i = 1, n
k1 = ja(i)
k2 = ja(i+1) - 1
y(i) = 0
do j = k1, k2
y(i) = y(i) + aa(j) * x(ic(j))
end do
end do
Another important operation is solving Lx = y, where L is a lower triangular matrix. Suppose the lower triangular matrix L has ones on the diagonal. Then we can solve Lx = y with the following lines of Fortran 77 (untried, by the way, so take it with a grain of salt) .
x(1) = y(1)
do i = 2, n
x(i) = y(i)
do j = ial(i), ial(i+1)-1
x(i) = x(i) - al(j) * x(jal(j))
end do
end do
Here's a snippet for matrix vector multiplication where $A$ is stored in
compressed column storage.
do j=1,n
k1 = ic(j)
k2 = ic(j+1) - 1
do i = k1, k2
b(ir(i)) = b(ir(i)) + a(i)*x(j)
end do
end do
sum = 0.d0
do j=1,n
sum = sum + abs(b(j))
end do
For dense matrices, the common way of solving Ax=b is by Gaussian elimination or equivalently, by decomposing A = LU. For sparse matrices, Gaussian elimination is also frequently used. A common problem is that the row operations produce fill, increasing the necessary storage.
An alternative method of solution for Ax=b is by iterative methods. Common iterative methods involve repeated matrix vector multiplications.
For dense matrix vector multiplication, the Mflops rate could be higher than 3.5 Gflops (when the matrix fits in L2 cache ) but falls off to around 700 Mflops when the matrix is too large to fit into cache. The 3.5 Gflops rate is higher than half the advertised peak rate of 6.1 Glops. The 700 Mflops corresponds to reading 350 million doubles/sec from RAM which is a rate near the bus speed.
For sparse matrix vector multiplication, Mflops rates are signficantly
lower. For example, consider the following data collected on my
Pentium IV lap top (using the g77 compiler). In this case, entries
were randomly distributed, 10 nonzero entries per column, for a variety
of matrix sizes. The second column of Mflop rates is from a 3.06 GHz
Xeon compiled with
>ifc -03 -tpp7 -xW -c spcol.f
Matrix vector multiplies were repeatedly performed, often enough to require several seconds of computational time, computing
Ax = b
setting x to b normalized and repeating. This is the power method. For instance Google employs the power method to find the eigenvector corresponding to the largest eigenvalue of a transition matrix, allowing ordering of search results.
The largest possible dense matrix on this machine's RAM would have been less than 10K by 10K (800 MBytes).
The matrix A and the integer vector ir are read in sequentially for each matrix vector multiplication, The Mflop rate stays relatively constant at about 25% of the rate for out of cache dense matrix computations -- till matrix sizes get up to about 10K. Here the 100K nonzero entries of A require 800 Kbytes of storage and the 100K nonzero entries of ir require 400 Kbytes of storage. Reading them in sequentially effectively flushes the cache so that the 80 Kbytes of storage for b must be partly re-read from cache (it has been refreshed just before the start of the iteration).
As the matrix size grows to 1M, most entries of b are individually fetched from RAM, so that the Mflops rate declines to about 17 Mflops/ sec, i.e., a flop occurs on about 1 of 200 clock cycles.
Can you think of algorithm improvements that would allow a better hit rate for elements of b?
Square nonzeros/col Mflops Kbytes for A Kbytes for b matrix and ir size 100 10 158.6 476 12 .8 200 10 156.1 435 24 1.6 400 10 152.5 408 48 3.2 1000 10 154.9 364 120 8 2000 10 152.6 328 240 16 4000 10 150.2 298 480 32 10000 10 120.2 1200 80 20000 10 100.7 281 2400 160 40000 10 81.5 120 4800 320 100000 10 33.2 42.1 12000 800 200000 10 23.1 30.4 24000 1600 400000 10 18.6 24.2 48000 3200 1000000 10 17.8 21.5 120000 8000
The above experiments are very regular, but artificial. It is a bit difficult to think of applications for which entries are so randomly distributed. As a more realistic experiment, a matrix twotone_rua_gz was downloaded. It has 120K rows and columns and 1.2M nonzero entries. The Mflop rate for matvec multiplies (henry2 with the same compiler options as above) was 182 Mflops (as compared to about 40 Mflops for the 100K square matrix with 1M nonzero entries from the table above).
The following is the code to read the Harwell-Boeing format matrix.
subroutine readhb1(lunit,lda,ldn,nrow,ncol,nnzero,
+ colptr,rowind,values,ierror)
c
c LUNIT is the unit number of the file which was opened
c in the calling program.
c lda is the maximal number of nonzeros
c ldn is the maximal number of columns
c nrow is the number of rows
c ncol is the number of cols
c nnzero is the number of nonzeros
c colptr(i) is the index of the first element of column i
c colptr(n+1) is nnzero
c rowind (i) is the row of the ith entry of values
c values (i) is the ith nonzero value (going down column starting
c at left.
c ierror returns error codes
c
C ================================================================
C ... SAMPLE CODE FOR READING A SPARSE MATRIX IN STANDARD FORMAT
C ================================================================
c
integer lda,ldn,ierror
CHARACTER TITLE*72 , KEY*8 , MXTYPE*3 ,
1 PTRFMT*16, INDFMT*16, VALFMT*20, RHSFMT*20
INTEGER TOTCRD, PTRCRD, INDCRD, VALCRD, RHSCRD,
1 NROW , NCOL , NNZERO, NELTVL
INTEGER COLPTR (*), ROWIND (*)
REAL VALUES (*)
C ------------------------
C ... READ IN HEADER BLOCK
C ------------------------
READ ( LUNIT, 1000 ) TITLE , KEY ,
+ TOTCRD, PTRCRD, INDCRD, VALCRD, RHSCRD,
+ MXTYPE, NROW , NCOL , NNZERO, NELTVL,
+ PTRFMT, INDFMT, VALFMT, RHSFMT
1000 FORMAT ( A72, A8 / 5I14 / A3, 11X, 4I14 / 2A16, 2A20 )
C -------------------------
C ... READ MATRIX STRUCTURE
C -------------------------
if (mxtype.ne.'RUA') then
print*,' The matrix is not of a form usable by spcol.f'
ierror = -1
return
endif
READ ( LUNIT, PTRFMT ) ( COLPTR (I), I = 1, NCOL+1 )
READ ( LUNIT, INDFMT ) ( ROWIND (I), I = 1, NNZERO )
IF ( VALCRD .GT. 0 ) THEN
c
C ----------------------
C ... READ MATRIX VALUES
C ----------------------
c
READ ( LUNIT, VALFMT ) ( VALUES (I), I = 1, NNZERO )
ELSE
print*,' valcrd was less than zero or undefined'
print*,' so am not reading matrix values'
ierror = -2
return
ENDIF
c successful return
ierror = 0
return
end
Comments on code? The role of ierror is fairly standard,
imitating the return code in C. As is often the case, a main
difficulty in calling the code is in keeping track of the list
of arguments.
From BLAS on the NCSU machines you can see how to link to BLAS packages. Link to 3 different BLAS packages and perform repeated matrix vector multiplications across a range of matrix sizes from 100 to 2000. Compare the results to your own best matrix vector multipliation. Due next Thursday.
The following is source code for the Fortran reference dgemv. If you want help to find the C version or binding, let me know.
SUBROUTINE DGEMV ( TRANS, M, N, ALPHA, A, LDA, X, INCX,
$ BETA, Y, INCY )
* .. Scalar Arguments ..
DOUBLE PRECISION ALPHA, BETA
INTEGER INCX, INCY, LDA, M, N
CHARACTER*1 TRANS
* .. Array Arguments ..
DOUBLE PRECISION A( LDA, * ), X( * ), Y( * )
* ..
*
* Purpose
* =======
*
* DGEMV performs one of the matrix-vector operations
*
* y := alpha*A*x + beta*y, or y := alpha*A'*x + beta*y,
*
* where alpha and beta are scalars, x and y are vectors and A is an
* m by n matrix.
*
* Parameters
* ==========
*
* TRANS - CHARACTER*1.
* On entry, TRANS specifies the operation to be performed as
* follows:
*
* TRANS = 'N' or 'n' y := alpha*A*x + beta*y.
*
* TRANS = 'T' or 't' y := alpha*A'*x + beta*y.
*
* TRANS = 'C' or 'c' y := alpha*A'*x + beta*y.
*
* Unchanged on exit.
*
* M - INTEGER.
* On entry, M specifies the number of rows of the matrix A.
* M must be at least zero.
* Unchanged on exit.
*
* N - INTEGER.
* On entry, N specifies the number of columns of the matrix A.
* N must be at least zero.
* Unchanged on exit.
*
* ALPHA - DOUBLE PRECISION.
* On entry, ALPHA specifies the scalar alpha.
* Unchanged on exit.
*
* A - DOUBLE PRECISION array of DIMENSION ( LDA, n ).
* Before entry, the leading m by n part of the array A must
* contain the matrix of coefficients.
* Unchanged on exit.
*
* LDA - INTEGER.
* On entry, LDA specifies the first dimension of A as declared
* in the calling (sub) program. LDA must be at least
* max( 1, m ).
* Unchanged on exit.
*
* X - DOUBLE PRECISION array of DIMENSION at least
* ( 1 + ( n - 1 )*abs( INCX ) ) when TRANS = 'N' or 'n'
* and at least
* ( 1 + ( m - 1 )*abs( INCX ) ) otherwise.
* Before entry, the incremented array X must contain the
* vector x.
* Unchanged on exit.
*
* INCX - INTEGER.
* On entry, INCX specifies the increment for the elements of
* X. INCX must not be zero.
* Unchanged on exit.
*
* BETA - DOUBLE PRECISION.
* On entry, BETA specifies the scalar beta. When BETA is
* supplied as zero then Y need not be set on input.
* Unchanged on exit.
*
* Y - DOUBLE PRECISION array of DIMENSION at least
* ( 1 + ( m - 1 )*abs( INCY ) ) when TRANS = 'N' or 'n'
* and at least
* ( 1 + ( n - 1 )*abs( INCY ) ) otherwise.
* Before entry with BETA non-zero, the incremented array Y
* must contain the vector y. On exit, Y is overwritten by the
* updated vector y.
*
* INCY - INTEGER.
* On entry, INCY specifies the increment for the elements of
* Y. INCY must not be zero.
* Unchanged on exit.
*
*
* Level 2 Blas routine.
*
* -- Written on 22-October-1986.
* Jack Dongarra, Argonne National Lab.
* Jeremy Du Croz, Nag Central Office.
* Sven Hammarling, Nag Central Office.
* Richard Hanson, Sandia National Labs.
*
*
* .. Parameters ..
DOUBLE PRECISION ONE , ZERO
PARAMETER ( ONE = 1.0D+0, ZERO = 0.0D+0 )
* .. Local Scalars ..
DOUBLE PRECISION TEMP
INTEGER I, INFO, IX, IY, J, JX, JY, KX, KY, LENX, LENY
* .. External Functions ..
* LOGICAL LSAME
* EXTERNAL LSAME
* .. External Subroutines ..
* EXTERNAL XERBLA
* .. Intrinsic Functions ..
INTRINSIC MAX
* ..
* .. Executable Statements ..
*
* Test the input parameters.
*
INFO = 0
IF (1.ne.1) then
c disabling tests to avoid linking to lsame -- gwh
cx ( .NOT.LSAME( TRANS, 'N' ).AND.
c $ .NOT.LSAME( TRANS, 'T' ).AND.
c $ .NOT.LSAME( TRANS, 'C' ) )THEN
INFO = 1
ELSE IF( M.LT.0 )THEN
INFO = 2
ELSE IF( N.LT.0 )THEN
INFO = 3
ELSE IF( LDA.LT.MAX( 1, M ) )THEN
INFO = 6
ELSE IF( INCX.EQ.0 )THEN
INFO = 8
ELSE IF( INCY.EQ.0 )THEN
INFO = 11
END IF
c tests disable to avoid linking to xerbla -- GWH
c IF( INFO.NE.0 )THEN
c CALL XERBLA( 'DGEMV ', INFO )
c RETURN
c END IF
*
* Quick return if possible.
*
IF( ( M.EQ.0 ).OR.( N.EQ.0 ).OR.
$ ( ( ALPHA.EQ.ZERO ).AND.( BETA.EQ.ONE ) ) )
$ RETURN
*
* Set LENX and LENY, the lengths of the vectors x and y, and set
* up the start points in X and Y.
*
c IF( LSAME( TRANS, 'N' ) )THEN
IF( TRANS.EQ. 'N' ) THEN
LENX = N
LENY = M
ELSE
LENX = M
LENY = N
END IF
IF( INCX.GT.0 )THEN
KX = 1
ELSE
KX = 1 - ( LENX - 1 )*INCX
END IF
IF( INCY.GT.0 )THEN
KY = 1
ELSE
KY = 1 - ( LENY - 1 )*INCY
END IF
*
* Start the operations. In this version the elements of A are
* accessed sequentially with one pass through A.
*
* First form y := beta*y.
*
IF( BETA.NE.ONE )THEN
IF( INCY.EQ.1 )THEN
IF( BETA.EQ.ZERO )THEN
DO 10, I = 1, LENY
Y( I ) = ZERO
10 CONTINUE
ELSE
DO 20, I = 1, LENY
Y( I ) = BETA*Y( I )
20 CONTINUE
END IF
ELSE
IY = KY
IF( BETA.EQ.ZERO )THEN
DO 30, I = 1, LENY
Y( IY ) = ZERO
IY = IY + INCY
30 CONTINUE
ELSE
DO 40, I = 1, LENY
Y( IY ) = BETA*Y( IY )
IY = IY + INCY
40 CONTINUE
END IF
END IF
END IF
IF( ALPHA.EQ.ZERO )
$ RETURN
IF( TRANS.EQ. 'N' ) THEN
* IF( LSAME( TRANS, 'N' ) )THEN
*
* Form y := alpha*A*x + y.
*
JX = KX
IF( INCY.EQ.1 )THEN
DO 60, J = 1, N
IF( X( JX ).NE.ZERO )THEN
TEMP = ALPHA*X( JX )
DO 50, I = 1, M
Y( I ) = Y( I ) + TEMP*A( I, J )
50 CONTINUE
END IF
JX = JX + INCX
60 CONTINUE
ELSE
DO 80, J = 1, N
IF( X( JX ).NE.ZERO )THEN
TEMP = ALPHA*X( JX )
IY = KY
DO 70, I = 1, M
Y( IY ) = Y( IY ) + TEMP*A( I, J )
IY = IY + INCY
70 CONTINUE
END IF
JX = JX + INCX
80 CONTINUE
END IF
ELSE
*
* Form y := alpha*A'*x + y.
*
JY = KY
IF( INCX.EQ.1 )THEN
DO 100, J = 1, N
TEMP = ZERO
DO 90, I = 1, M
TEMP = TEMP + A( I, J )*X( I )
90 CONTINUE
Y( JY ) = Y( JY ) + ALPHA*TEMP
JY = JY + INCY
100 CONTINUE
ELSE
DO 120, J = 1, N
TEMP = ZERO
IX = KX
DO 110, I = 1, M
TEMP = TEMP + A( I, J )*X( IX )
IX = IX + INCX
110 CONTINUE
Y( JY ) = Y( JY ) + ALPHA*TEMP
JY = JY + INCY
120 CONTINUE
END IF
END IF
*
RETURN
*
* End of DGEMV .
*
END