Lecture 6- CSC/MA 783 --Gary Howell
Sparse Matrix Vector Multiplication


  • Why Sparse Matrices

    The following is an update of Lecture 6 posted a few weeks ago. An additional code is added that enables reading of of 'RUA' Harwell-Boeing format matrices.

    For example Davis's UF Sparse Matrix Collection. The Harwell Boeing format is essentially the sparse column format detailed below. For a complete description as well as "bones" of Fortran subroutines for reading the format, see Matrix Market, which also has a posted collection of matrices. A tar file with the code discussed below can be copied from /tmp/timit2.tar.

    We've seen that we can get pretty good efficiencies for dense matrix vector multipliers. For dense matrices that fit "in-cache", we can get better than 60% of the advertised peak speed. For matrices that won't fit in cache, we're at something more like 13% of peak CPU speed, but we're limited mainly by the bus speed (running at a good chunk of bus speed).

    Much of the masses of data used in scientific computation are stored in terms of matrices, but most typically, the matrices are sparse. Sparse matrices are mostly zero, so it's more efficient to store only the nonzero elements. How? We'll mainly follow Saad, "Iterative Methods for Sparse Linear Systems, 2nd Edition", Yousef Saad, SIAM, 2003.

  • Storage Schemes

    Let's take an example matrix and show the most common representations. Consider the matrix

    				4.0  1.0  0.0  0.0  2.5 
    			 	0.0  4.0  1.0  0.0  0.0
    		A	=	0.0  1.0  4.0  0.0  1.0
    				0.0  0.0  1.0  4.0  0.0
    				2.5  0.0  0.0  0.5  4.0 	
    

    which can be represented in Coordinate Format as one floating point vector AA which contains the nonzero entries, and integer vectors JR which has the corresponding rows while JC which has the corresponding columns.

        AA  =  4.0  1.0  2.5  4.0  1.0  1.0  4.0  1.0  1.0  4.0  2.5  0.5  4.0
        JR  =  1    1    1    2    2    3    3    3    4    4    5    5    5
        JC  =  1    2    5    2    3    2    3    5    3    4    1    4    5 
    
    

    The coordinate format is straightforward. A disadvantage is that it requires 3 full length vectors, each of length Nz where in this example we have Nz = 13. An advantage is that elements can be listed in any order (just so all 3 vectors are permuted in the same way).

    If we restrict ourselves to listing elements a row at a time then we can compress the JR vector as follows.

        AA  =  4.0  1.0  2.5  4.0  1.0  1.0  4.0  1.0  1.0  4.0  2.5  0.5  4.0
        JA  =  1    4    6    9    11   14
        IC  =  1    2    5    2    3    2    3    5    3    4    1    4    5 
    
    

    This is called Compressed Sparse Row storage and is probably the most popular sparse storage format. AA is the same as in coordinate format and IC(i) has the column of the ith entry of AA. The ith entry of JA identifies the first entry of the ith row. If A has n rows, then the n+1st and last entry of JA is Nz + 1 .

    Another format is Compressed Sparse Column, analogous, perhaps more popular.

  • Basic Matrix Operations

    The following snippet of code is Fortran 77 for a matrix vector multiply in a CSR format.

          do i = 1, n
             k1 = ja(i)
             k2 = ja(i+1) - 1
             y(i) = 0
             do j = k1, k2
                y(i) = y(i) + aa(j) * x(ic(j))
             end do
          end do 
    

    Another important operation is solving Lx = y, where L is a lower triangular matrix. Suppose the lower triangular matrix L has ones on the diagonal. Then we can solve Lx = y with the following lines of Fortran 77 (untried, by the way, so take it with a grain of salt) .

          x(1) = y(1)
          do i = 2, n
             x(i) = y(i) 
             do j = ial(i), ial(i+1)-1
                x(i) = x(i) - al(j) * x(jal(j))
             end do
          end do
    
    Here's a snippet for matrix vector multiplication where $A$ is stored in compressed column storage.
             do j=1,n
                k1 = ic(j)
                k2 = ic(j+1) - 1
                do i = k1, k2
                   b(ir(i)) = b(ir(i)) + a(i)*x(j)
                end do
             end do
             sum = 0.d0
             do j=1,n
                sum = sum + abs(b(j))
             end do
    

    For dense matrices, the common way of solving Ax=b is by Gaussian elimination or equivalently, by decomposing A = LU. For sparse matrices, Gaussian elimination is also frequently used. A common problem is that the row operations produce fill, increasing the necessary storage.

    An alternative method of solution for Ax=b is by iterative methods. Common iterative methods involve repeated matrix vector multiplications.

  • Efficiency

    For dense matrix vector multiplication, the Mflops rate could be higher than 3.5 Gflops (when the matrix fits in L2 cache ) but falls off to around 700 Mflops when the matrix is too large to fit into cache. The 3.5 Gflops rate is higher than half the advertised peak rate of 6.1 Glops. The 700 Mflops corresponds to reading 350 million doubles/sec from RAM which is a rate near the bus speed.

    For sparse matrix vector multiplication, Mflops rates are signficantly lower. For example, consider the following data collected on my Pentium IV lap top (using the g77 compiler). In this case, entries were randomly distributed, 10 nonzero entries per column, for a variety of matrix sizes. The second column of Mflop rates is from a 3.06 GHz Xeon compiled with

    >ifc -03 -tpp7 -xW -c spcol.f 
    

    Matrix vector multiplies were repeatedly performed, often enough to require several seconds of computational time, computing

    Ax = b

    setting x to b normalized and repeating. This is the power method. For instance Google employs the power method to find the eigenvector corresponding to the largest eigenvalue of a transition matrix, allowing ordering of search results.

    The largest possible dense matrix on this machine's RAM would have been less than 10K by 10K (800 MBytes).

    The matrix A and the integer vector ir are read in sequentially for each matrix vector multiplication, The Mflop rate stays relatively constant at about 25% of the rate for out of cache dense matrix computations -- till matrix sizes get up to about 10K. Here the 100K nonzero entries of A require 800 Kbytes of storage and the 100K nonzero entries of ir require 400 Kbytes of storage. Reading them in sequentially effectively flushes the cache so that the 80 Kbytes of storage for b must be partly re-read from cache (it has been refreshed just before the start of the iteration).

    As the matrix size grows to 1M, most entries of b are individually fetched from RAM, so that the Mflops rate declines to about 17 Mflops/ sec, i.e., a flop occurs on about 1 of 200 clock cycles.

    Can you think of algorithm improvements that would allow a better hit rate for elements of b?

     
    Square       nonzeros/col      Mflops     Kbytes for A     Kbytes for b
    matrix                                     and ir
    size
    
    100             10          158.6    476      12            .8
    
    200             10          156.1    435      24           1.6
    
    400             10          152.5    408      48           3.2
    
    1000            10          154.9    364     120           8
    
    2000            10          152.6    328     240          16
    
    4000            10          150.2    298     480          32
    
    10000           10          120.2           1200          80
    
    20000           10          100.7    281    2400         160
    
    40000           10           81.5    120    4800         320
    
    100000          10           33.2    42.1  12000         800
    
    200000          10           23.1    30.4  24000        1600
    
    400000          10           18.6    24.2  48000        3200
    
    1000000         10           17.8    21.5 120000        8000
    

    The above experiments are very regular, but artificial. It is a bit difficult to think of applications for which entries are so randomly distributed. As a more realistic experiment, a matrix twotone_rua_gz was downloaded. It has 120K rows and columns and 1.2M nonzero entries. The Mflop rate for matvec multiplies (henry2 with the same compiler options as above) was 182 Mflops (as compared to about 40 Mflops for the 100K square matrix with 1M nonzero entries from the table above).

    The following is the code to read the Harwell-Boeing format matrix.

    
    
          subroutine readhb1(lunit,lda,ldn,nrow,ncol,nnzero,
         +                      colptr,rowind,values,ierror) 
    c
    c     LUNIT is the unit number of the file which was opened
    c       in the calling program.  
    c     lda is the maximal number of nonzeros
    c     ldn is the maximal number of columns 
    c     nrow is the number of rows
    c     ncol is the number of cols
    c     nnzero is the number of nonzeros
    c     colptr(i) is the index of the first element of column i
    c     colptr(n+1) is nnzero 
    c     rowind (i) is the row of the ith entry of values
    c     values (i) is the ith nonzero value (going down column starting
    c       at left.   
    c     ierror returns error codes 
    c
    C     ================================================================
    C     ... SAMPLE CODE FOR READING A SPARSE MATRIX IN STANDARD FORMAT
    C     ================================================================
    c
          integer lda,ldn,ierror
    
          CHARACTER      TITLE*72 , KEY*8    , MXTYPE*3 ,
         1               PTRFMT*16, INDFMT*16, VALFMT*20, RHSFMT*20
    
          INTEGER        TOTCRD, PTRCRD, INDCRD, VALCRD, RHSCRD,
         1               NROW  , NCOL  , NNZERO, NELTVL
    
          INTEGER        COLPTR (*), ROWIND (*)
    
          REAL           VALUES (*)
    
    C    ------------------------
    C     ... READ IN HEADER BLOCK
    C     ------------------------
    
          READ ( LUNIT, 1000 ) TITLE , KEY   ,
         +                     TOTCRD, PTRCRD, INDCRD, VALCRD, RHSCRD,
         +                     MXTYPE, NROW  , NCOL  , NNZERO, NELTVL,
         +                     PTRFMT, INDFMT, VALFMT, RHSFMT
     1000 FORMAT ( A72, A8 / 5I14 / A3, 11X, 4I14 / 2A16, 2A20 )
    
    C     -------------------------
    C     ... READ MATRIX STRUCTURE
    C     -------------------------
          if (mxtype.ne.'RUA') then
             print*,' The matrix is not of a form usable by spcol.f'
             ierror = -1
             return 
          endif 
    
          READ ( LUNIT, PTRFMT ) ( COLPTR (I), I = 1, NCOL+1 )
    
          READ ( LUNIT, INDFMT ) ( ROWIND (I), I = 1, NNZERO )
    
          IF  ( VALCRD .GT. 0 )  THEN
    c
    C         ----------------------
    C         ... READ MATRIX VALUES
    C         ----------------------
    c
              READ ( LUNIT, VALFMT ) ( VALUES (I), I = 1, NNZERO )
          ELSE
             print*,' valcrd was less than zero or undefined'
             print*,' so am not reading matrix values' 
             ierror = -2
             return 
          ENDIF
    c     successful return
          ierror = 0 
          return 
          end 
    
    Comments on code? The role of ierror is fairly standard, imitating the return code in C. As is often the case, a main difficulty in calling the code is in keeping track of the list of arguments.

  • Homework

    From BLAS on the NCSU machines you can see how to link to BLAS packages. Link to 3 different BLAS packages and perform repeated matrix vector multiplications across a range of matrix sizes from 100 to 2000. Compare the results to your own best matrix vector multipliation. Due next Thursday.

    The following is source code for the Fortran reference dgemv. If you want help to find the C version or binding, let me know.

    
    
          SUBROUTINE DGEMV ( TRANS, M, N, ALPHA, A, LDA, X, INCX,
         $                   BETA, Y, INCY )
    *     .. Scalar Arguments ..
          DOUBLE PRECISION   ALPHA, BETA
          INTEGER            INCX, INCY, LDA, M, N
          CHARACTER*1        TRANS
    *     .. Array Arguments ..
          DOUBLE PRECISION   A( LDA, * ), X( * ), Y( * )
    *     ..
    *
    *  Purpose
    *  =======
    *
    *  DGEMV  performs one of the matrix-vector operations
    *
    *     y := alpha*A*x + beta*y,   or   y := alpha*A'*x + beta*y,
    *
    *  where alpha and beta are scalars, x and y are vectors and A is an
    *  m by n matrix.
    *
    *  Parameters
    *  ==========
    *
    *  TRANS  - CHARACTER*1.
    *           On entry, TRANS specifies the operation to be performed as
    *           follows:
    *
    *              TRANS = 'N' or 'n'   y := alpha*A*x + beta*y.
    *
    *              TRANS = 'T' or 't'   y := alpha*A'*x + beta*y.
    *
    *              TRANS = 'C' or 'c'   y := alpha*A'*x + beta*y.
    *
    *           Unchanged on exit.
    *
    *  M      - INTEGER.
    *           On entry, M specifies the number of rows of the matrix A.
    *           M must be at least zero.
    *           Unchanged on exit.
    *
    *  N      - INTEGER.
    *           On entry, N specifies the number of columns of the matrix A.
    *           N must be at least zero.
    *           Unchanged on exit.
    *
    *  ALPHA  - DOUBLE PRECISION.
    *           On entry, ALPHA specifies the scalar alpha.
    *           Unchanged on exit.
    *
    *  A      - DOUBLE PRECISION array of DIMENSION ( LDA, n ).
    *           Before entry, the leading m by n part of the array A must
    *           contain the matrix of coefficients.
    *           Unchanged on exit.
    *
    *  LDA    - INTEGER.
    *           On entry, LDA specifies the first dimension of A as declared
    *           in the calling (sub) program. LDA must be at least
    *           max( 1, m ).
    *           Unchanged on exit.
    *
    *  X      - DOUBLE PRECISION array of DIMENSION at least
    *           ( 1 + ( n - 1 )*abs( INCX ) ) when TRANS = 'N' or 'n'
    *           and at least
    *           ( 1 + ( m - 1 )*abs( INCX ) ) otherwise.
    *           Before entry, the incremented array X must contain the
    *           vector x.
    *           Unchanged on exit.
    *
    *  INCX   - INTEGER.
    *           On entry, INCX specifies the increment for the elements of
    *           X. INCX must not be zero.
    *           Unchanged on exit.
    *
    *  BETA   - DOUBLE PRECISION.
    *           On entry, BETA specifies the scalar beta. When BETA is
    *           supplied as zero then Y need not be set on input.
    *           Unchanged on exit.
    *
    *  Y      - DOUBLE PRECISION array of DIMENSION at least
    *           ( 1 + ( m - 1 )*abs( INCY ) ) when TRANS = 'N' or 'n'
    *           and at least
    *           ( 1 + ( n - 1 )*abs( INCY ) ) otherwise.
    *           Before entry with BETA non-zero, the incremented array Y
    *           must contain the vector y. On exit, Y is overwritten by the
    *           updated vector y.
    *
    *  INCY   - INTEGER.
    *           On entry, INCY specifies the increment for the elements of
    *           Y. INCY must not be zero.
    *           Unchanged on exit.
    *
    *
    *  Level 2 Blas routine.
    *
    *  -- Written on 22-October-1986.
    *     Jack Dongarra, Argonne National Lab.
    *     Jeremy Du Croz, Nag Central Office.
    *     Sven Hammarling, Nag Central Office.
    *     Richard Hanson, Sandia National Labs.
    *
    *
    *     .. Parameters ..
          DOUBLE PRECISION   ONE         , ZERO
          PARAMETER        ( ONE = 1.0D+0, ZERO = 0.0D+0 )
    *     .. Local Scalars ..
          DOUBLE PRECISION   TEMP
          INTEGER            I, INFO, IX, IY, J, JX, JY, KX, KY, LENX, LENY
    *     .. External Functions ..
    *     LOGICAL            LSAME
    *     EXTERNAL           LSAME
    *     .. External Subroutines ..
    *     EXTERNAL           XERBLA
    *     .. Intrinsic Functions ..
          INTRINSIC          MAX
    *     ..
    *     .. Executable Statements ..
    *
    *     Test the input parameters.
    *
          INFO = 0
          IF    (1.ne.1) then 
    c disabling tests to avoid linking to lsame -- gwh 
    cx ( .NOT.LSAME( TRANS, 'N' ).AND.
    c    $         .NOT.LSAME( TRANS, 'T' ).AND.
    c    $         .NOT.LSAME( TRANS, 'C' )      )THEN
             INFO = 1
          ELSE IF( M.LT.0 )THEN
             INFO = 2
          ELSE IF( N.LT.0 )THEN
             INFO = 3
          ELSE IF( LDA.LT.MAX( 1, M ) )THEN
             INFO = 6
          ELSE IF( INCX.EQ.0 )THEN
             INFO = 8
          ELSE IF( INCY.EQ.0 )THEN
             INFO = 11
          END IF
    c tests disable to avoid linking to xerbla -- GWH 
    c     IF( INFO.NE.0 )THEN
    c        CALL XERBLA( 'DGEMV ', INFO )
    c        RETURN
    c     END IF
    *
    *     Quick return if possible.
    *
          IF( ( M.EQ.0 ).OR.( N.EQ.0 ).OR.
         $    ( ( ALPHA.EQ.ZERO ).AND.( BETA.EQ.ONE ) ) )
         $   RETURN
    *
    *     Set  LENX  and  LENY, the lengths of the vectors x and y, and set
    *     up the start points in  X  and  Y.
    *
    c     IF( LSAME( TRANS, 'N' ) )THEN
          IF(  TRANS.EQ. 'N' ) THEN
             LENX = N
             LENY = M
          ELSE
             LENX = M
             LENY = N
          END IF
          IF( INCX.GT.0 )THEN
             KX = 1
          ELSE
             KX = 1 - ( LENX - 1 )*INCX
          END IF
          IF( INCY.GT.0 )THEN
             KY = 1
          ELSE
             KY = 1 - ( LENY - 1 )*INCY
          END IF
    *
    *     Start the operations. In this version the elements of A are
    *     accessed sequentially with one pass through A.
    *
    *     First form  y := beta*y.
    *
          IF( BETA.NE.ONE )THEN
             IF( INCY.EQ.1 )THEN
                IF( BETA.EQ.ZERO )THEN
                   DO 10, I = 1, LENY
                      Y( I ) = ZERO
       10          CONTINUE
                ELSE
                   DO 20, I = 1, LENY
                      Y( I ) = BETA*Y( I )
       20          CONTINUE
                END IF
             ELSE
                IY = KY
                IF( BETA.EQ.ZERO )THEN
                   DO 30, I = 1, LENY
                      Y( IY ) = ZERO
                      IY      = IY   + INCY
       30          CONTINUE
                ELSE
                   DO 40, I = 1, LENY
                      Y( IY ) = BETA*Y( IY )
                      IY      = IY           + INCY
       40          CONTINUE
                END IF
             END IF
          END IF
          IF( ALPHA.EQ.ZERO )
         $   RETURN
          IF(  TRANS.EQ.  'N' ) THEN
    *     IF( LSAME( TRANS, 'N' ) )THEN
    *
    *        Form  y := alpha*A*x + y.
    *
             JX = KX
             IF( INCY.EQ.1 )THEN
                DO 60, J = 1, N
                   IF( X( JX ).NE.ZERO )THEN
                      TEMP = ALPHA*X( JX )
                      DO 50, I = 1, M
                         Y( I ) = Y( I ) + TEMP*A( I, J )
       50             CONTINUE
                   END IF
                   JX = JX + INCX
       60       CONTINUE
             ELSE
                DO 80, J = 1, N
                   IF( X( JX ).NE.ZERO )THEN
                      TEMP = ALPHA*X( JX )
                      IY   = KY
                      DO 70, I = 1, M
                         Y( IY ) = Y( IY ) + TEMP*A( I, J )
                         IY      = IY      + INCY
       70             CONTINUE
                   END IF
                   JX = JX + INCX
       80       CONTINUE
             END IF
          ELSE
    *
    *        Form  y := alpha*A'*x + y.
    *
             JY = KY
             IF( INCX.EQ.1 )THEN
                DO 100, J = 1, N
                   TEMP = ZERO
                   DO 90, I = 1, M
                      TEMP = TEMP + A( I, J )*X( I )
       90          CONTINUE
                   Y( JY ) = Y( JY ) + ALPHA*TEMP
                   JY      = JY      + INCY
      100       CONTINUE
             ELSE
                DO 120, J = 1, N
                   TEMP = ZERO
                   IX   = KX
                   DO 110, I = 1, M
                      TEMP = TEMP + A( I, J )*X( IX )
                      IX   = IX   + INCX
      110          CONTINUE
                   Y( JY ) = Y( JY ) + ALPHA*TEMP
                   JY      = JY      + INCY
      120       CONTINUE
             END IF
          END IF
    *
          RETURN
    *
    *     End of DGEMV .
    *
          END