Lecture 5- CSC/MA 783 --Gary Howell
Experiments with the Xeon Cache


  • Memory Access Times

    Moving on from "in-cache". Last time we discussed general principles of cache architectures with examples from the IBM Power architecture. This time I'm repeating a bit, but trying to explicitly try out ideas for the Xeon architecture we have on the henry2 Blades.

    Table 6.1 shows clock cycles for copying (loading and storing) a double word (64 bits= 1 double precision) on an IBM Power chip (the p690 we have here is a slightly later generation). The first column is for the preferred stride of 1 (2 to the 0 power). For 16 = 2^4 double words, 1.8 clock cycles for a load and store declining for 2^14 bytes to .688 bytes. But then for 2^15 bytes which is the cache size the time kicks back up to 1.829 clocks per load-store. So long as the total size of the message is larger than the stride and so long as the whole message fits in half of cache, then the copy times are reasonable (above the double line).

    Here's a small program to determine copy rates. I compiled and ran this on the Blade Center.

    
          program copy
    c
    c  This program times copies of various sizes and strides
    c    The copies are repeated enough times to be timed with
    c    the C clock timer, which times (in Linux) with resolution
    c    .01 seconds.   Caveat.  clock times CPU time, where
    c    it may be that wall-clock time is more appropriate.  
    c
          real*8 a(4000000), b(4000000)
          integer*8 i,j,k,n, its , jstart 
          real*4 ftime, pretim, entim
          external ftime
    
          print*, 'input message size'
          read*, n
          print*, 'input stride' 
          read*, k 
    
          do i=1,4000000 
             a(i) = i
          end do
          its = 10000000/n 
          pretim = ftime()
          i = 1 
          do j = 1, its 
    c  do its copies, so that the total number of copies is about 1.d7  
             jstart = j*i*k  - 3000000*((i*j*k)/3000000)
    c  copy n elements of a (with stride k) 
             do i=jstart,n*k+jstart,k
                b(i) = a(i) 
             end do 
          end do 
          entim = ftime()-pretim
          print*,'elapsed CPU time =', entim
          print*, ' clocks/(copied double) =', entim*3.D9/1.D7
          stop 
          end 
    
    

    The following (3rd column) data is compiled with the g77 compiler, running on blade8-4.

                   stride    	size     (clock ticks)/(double copied)
    		   1		   1		1152	  855
    		   1 	 	   2		 576    459
    		   1		   4		 315    240
    		   1		   8		 186    153
    		   1		  16		 123    111
    		   1		  32		  84     72
    		   1		  64		  63     57
    		   1 		 128		  51     45
    		   1		 256		  42     39
    		   1		 512		  39     36
    		   1		1024		  36     36
    		   2		1024		  57     57
    		   4		1024		 108    105  
    		   8		1024		 198    201
    		  16		1024		 342    345
    		  32		1024		 345    348
    		  64		1024		 345    351
    		 128		1024		 345    360
    		 256		1024		 345    387
    		 512		1024		 345    399
    		1024		1024		 345    414
    

    The 4th column is data from a run on blade1-4 compiled with

    >ifc -c tpp7 -xW -O3

    Commentary. It appears that all data is out of cache, so that the timings here are for reading a chuck of double precision numbers and then writing them. Copying 1 double precision number is pretty expensive (about 1000 clock cycles). Copying 1K double precision numbers stored consecutively requires about 36 clock cycles per number copied. But if we copy 1024 numbers stored at a stride 16 or larger, there are 345 clocks per copy. The data indicates that the cache line is 16 or fewer double precision numbers.

    For data of sizes larger than half the cache size, we can see how long a load and store from memory take. Essentially, every time the stride doubles so does the copy time. This goes till we hit the cache line size of 32 double words. Then it takes 53 clocks per double word copy.

    So for out of cache copies with big strides we can take about 80 times as long for in cache copies with stride one. Or about 30 times as long for the out of cache stride one.

    Example -- two vectors x and y too long to fit in cache. First compute
    by
    for i=1,n
    sum = sum + x(i) * y(i)
    end
    and then
    x = c*x + y by
    for i=1,n
    y(i) = c*x(i) + y(i)
    end
    Since the vectors are too long to fit in cache they both have to be read in twice, so it would be much better if they were both referenced in stride 1 (30 times better than if they were both of stride 32). Goedecker and Hoisie call stride 1 (data stored adjacent in memory) "spatial locality" and call data currently resident in-cache "temporal locality".

    As an example of how to make a code execute slowly consider performing the above two operations (dot product and _axpy) on two rows of a Fortran matrix.

    On the Xeons of the Blade Center, a stride 1 copy too large for cache goes at about 40 million doubles per second. Striding the other way on a matrix (along successive rows) goes at about 4 million doubles per second (provided the row is short enough to fit in cache).

  • Performance for Unit and Large Stride Access

    In Table 1, we see that repeated copies of vectors of stride 1 execute distinctly more slowly when the copy bigger than half the cache size. Factor of 3. On many machines the factor is even more than 3. The factor depends on the bandwidth of the memory bus relative to the in-cache performance. On the Blades, 2 Xeon CPUs share a common data bus. But the fall-off in performance for large copies does not occur for vector machines.

    Vector machines are designed to pipeline data directly to a register, without a cache. For the Cray C90 for example, bandwidth approaches a high asymptote as the length of a vector increases.

    For the Xeon, I observed that "in-cache" copies of vectors of length 1K or so took 23 clocks/double. As vectors grew to long to be in cache, copies took 30 clocks/double. As explained below I think the difference was not dramatic because in this experiment the reads were in or out of cache, but all the writes were out of cache.

    Bandwidth of access of stride 1 data (spatially local data) is increasing more or less as processor speed. The rate of access for nonlocal (large stride) data is increasing less fast. Latency as number of clock cycles to access the first bit of data is increasing.

  • Reading Data Into Cache The fall-off in rate of copying for the Xeon was not nearly as dramatic as for the chip illustrated in Goedecker and Hoisie. There is a dramatic difference between reading data in or out of cache. To see this consider flop rates for multiplying square matrices by a vector.

    matrix size  Mflops   Doubles Read/Clock Cycle  
      80          3479  		1.76
     100          3333              1.84
     120          3809              1.61
     200          1568              3.90
     300           639              9.58
     400           684              8.95
     500           714              8.57
     600           755              8.11
     700           792              7.73
     800           792              7.73
     900           807              7.58
    1000           798              7.67
    1100           826              7.72
    1200           798              7.59
    1300           806              7.67
    1700           830              7.59 
    1800           830              7.37
    1900           836              7.32
    2000           842              7.27
    

    The limiting factor in a matrix vector multiply is typically not the number of floating point operations (adds and multiplies) that a processor can perform in a second, but rather how fast elements of a matrix can be fed into the CPU. Each read in a matrix vector multiply corresponds to 2 flops (an add and a multiply). There is about a factor of 4 fall-off in flop rate as the square matrix size gets up to 300. The size of the L2 cache is 512 Kbytes = 2^19 bytes = 2^16 double precision numbers = square matrix of size 2^8 = 256 by 256. In the above table the clocks per read is calculated as 2*3.06e9/(Gflop Rate).

    When matrices take up more than about half of the 512 Kbyte cache, they start to be flushed between successive matrix vector multiplies, so the number of clock cycles for a read becomes longer. In (L2) cache reads take about 1.7 clock cycles while out-of-cache reads take about 8 clock cycles. 8 clock cycles corresponds to about 375 million double precision numbers per second. The advertised bus rate is ??

    If we express the time for a streamed copy as

    clocks to copy = (clocks to read) + (clocks to write)

    then for the out of cache copy we have

    30 = 8 + (clocks to write)

    clocks to write = 22.

    For the in cache copy we have

    23 = 2 + (clocks to write)

    clocks to write = 21

    So I think we're writing out of cache to RAM in both the "in cache" and "out of cache" experiments. And each write takes 21 or 22 clock cycles.

  • Times for In Cache and Out of Cache Writes So far, the simple program I wrote appears to measure out of cache writes. In cache writes?
  • Loop-Reordering

    Consider

         do j = 1, n
            do i = 1, n
               A(i,j) = B(i,j)
            end do
         end do 
    
    vs. the equivalent 
    
         do i = 1, n
            do j = 1, n
               B(i,j) = A(i,j)
            end do
         end do 
    

    If this is Fortran, the array A will be stored by columns. If A is in cache, then if the inner loop goes along the row (on i) then the copy will may be 3 times as fast as if the loop is reversed. If A is out of cache or too large to fit in cache, then the inner loop along the row may be 80 times faster than the other.

  • Loop-Fusion

    Consider the consective matrix vector multiplies y <-- x'* A followed by w <-- w + A * y

         do j = 1, n
            y(j) = 0.d0 
            do i = 1, m
               y(j) = y(j) + A(i,j) * x(i)
            end do 
         end do
         do j = 1, n 
            do i = 1, m
               w(i) = w(i) + A(i,j) * y(j)
            end do
         end do
    
    These loops are in the correct Fortran order, but if A is too large to be in cache, then A is read into cache twice. We could instead fuse the two loops (this is also a BLAS operator GEMVT which hopefully we'll see in some new editions of the BLAS).
         do j = 1, n
            y(j) = 0.d0 
            do i = 1, m
               y(j) = y(j) + A(i,j) * x(i)
            end do
            do i = 1, m  
               w(i) = w(i) + A(i,j) * y(j)
            end do
         end do
    
    If a column of A fits in cache, then it is only read in once.

  • Data Locality in Conceptual Design

    Suppose we have for particle i, 3 position coordinates, x, y, and z and 3 velocities u, v, and w. We could do this with 6 vectors. Or we could make a 3 D array

    > real*8 posvel(6,n)

    so that column i has the x, y, z coordinates and velocities u, v, and w. If these properties of i are usually all accessed at one time, then accessing a column will fetch them all as one cache line. About six times as fast as accessing them all separately.

  • Cache Thrashing

    Caches are not fully associative. Typically there might be 4 locations where a given element from memory can be stored in cache. If all these are full one will need to be flushed and replaced. The eligible locations are separted by powers of two. If the leading dimension of an array is a power of two, then accessing rows of the matrix may cause "cache-thrashing", so even if we reaccess an element we find that's is already been inexplicably flushed from cache.