Lecture 4- CSC/MA 783 --Gary Howell
Optimizing In-Cache Floating Point Operations -- Xeon Specifics
Generally, How Slow is Out of Cache?


  • Dual Processor Xeons

    Last time, we went through Chapter 5 of Goedecker and Hoisie, trying out their techniques on henry2. A sample code designed to run in-cache at full speed ran at a bit less half the clock speed. The highest flop rate we could get was around 1300 Mflops/second. Since the advertised speed of these processors is 6 Gflops per processor, we're peaked at a bit higher than 20% of peak speed.

    But using some architecture specific information, it turned out to be possible to write Fortran code that could do matrix vector multiplicaton at 3800 Mflops/sec.

    The Xeon Cluster is a good reference to the Xeon (Pentium 4) architecture.

    Summarizing and commenting: Two processors share a system bus to the graphics controller and RAM memory so we expect a bottleneck for data not in processor cache.

    Since the machine is 32-bit , there is a max of 2^32 = 4 Gbytes of linear memory. Henry2 blades share 4 Gbytes of memory between 2 CPUs. (Most of competing CPUs are 64 bit allowing a larger memory size. At time of writing--August 2004-- 64-bit Xeons are coming out but not all the open source software pieces are there yet).

    The Xeon has a single floating point processor. In theory, one FP operation can be performed in every clock tick, for a maximal speed of 3 Gflops using the floating point processor.

    There are 8 80-bit floating point registers . To use the FP processor, the data to be added or mulitplied has to be in the floating point register on that clock cycle. Recall that when we examined the assembler language code for our "ideal" code we found lots of loads. This appears to because we had register spill so that only about half of the instructions were floating point adds and multiplies (the rest being loads back into the floating point registers). The floating point calculations using this floating point register are IEEE standard.

    8 32-bit general purpose registers And lots of other special purpose registers: a 6 16-bit Segments registers, a 16 bit control register, 16-bit status register, 16-bit tag register, 11-bit opcode register 48 bit FPU instruction point register, 48-bit FPU data point register, 8 64-bit MMX registers, a 32 bit MXCSR Register, and lastly ...

    8 128-bit XMM Registers The MMX provide SIMD (single instruction multiple data) operations on 64-bit packed byte, word or doubleword integer data. The XXM registers provide for 128-bit single or double precision floating point, word, double or quadword integer date. For floating data the XMM registers can have 2 packed double-precision floats or 4 single-precision values.

    To use the XMM registers, SSE2 instructions are used. Loads, shifts, shuffles, arithmetic, logicals. The INTEL compilers provide workable SSE2 vectorizing .

  • Vectorizing

    SIMD (Single Instruction Multiple Device) means that when a single instruction is issued it can be performed in all registers is the same clock. Or in a specified subset. So if we can get all 8 registers with the desired data, e.g., 2 floating point numbers, then we can multiply each pair in one clock cycle. Of course it might take a few clock cycles before we're ready to do another such operation such as adding pairs of numbers.

    The XXM registers need loops that the Intel compiler can vectorize.

    The compiler needs

  • no data dependence in inside the loop.
  • needs arithmetic the SSE2 can do, e.g., adds or multiplies
  • "countable" (not while) loops and "enough" iterations
  • Only inner loops are vectorized.
  • Using the SSE2 vectorizations gives up IEEE standard arithmetic. So rounding error will not be as well controlled, and your answers will be a bit different. (An old joke at Motorola when a bug was found in Pentium division. "What's the correct pronunciation of IEEE when flying on a plane designed with a Pentium?")

  • Vectorizing with the Intel Compiler

    We have to turn on the machine specific optimization. Compiling with

    >ifc -c -O3 -tpp7 -xW calgemv.f

    increased the speed for 104 by 104 matrices from 1330 Mflops to 1900 Mflops.

    Since the compiler unrolls the inner loop itself, we can play with the depth of unrolling by using the flag -unroll k to get unrolling of length k (as opposed to the default which is 4?). Then to get the 3800 Mflops, I worked with the depth of the outer loop (the compiler only unrolls the depth of the inner loop. The fact that the vector registers hold a total of 16 double precision numbers turned out to be relevant. I won't give you my version since this is homework for Thursday.

    Here's a snippet of the assembler code which corresponds to the SIMD part of the loop. The right hand column attempts to correspond to the Fortran line number. The corresponds adds in multiplies for the floating point processor were fmul, fmulp, fadd, faddp. That code also had many more loads.

     
     
    ..B1.27:                        # Preds ..B1.27 ..B1.26
            movapd    ..1.mainLOCAL.a.52(%eax,%edi,8), %xmm5        #33.0
            mulpd     %xmm0, %xmm5                                  #33.0
            addpd     ..1.mainLOCAL.b.48(,%edi,8), %xmm5            #33.0
            movapd    ..1.mainLOCAL.a.52+16(%eax,%edi,8), %xmm4     #33.0
            movapd    ..1.mainLOCAL.a.52+32(%eax,%edi,8), %xmm3     #33.0
            movapd    ..1.mainLOCAL.a.52+48(%eax,%edi,8), %xmm2     #33.0
            movapd    %xmm5, ..1.mainLOCAL.b.48(,%edi,8)            #33.0
            mulpd     %xmm0, %xmm4                                  #33.0
            mulpd     %xmm0, %xmm3                                  #33.0
            mulpd     %xmm0, %xmm2                                  #33.0
            addpd     ..1.mainLOCAL.b.48+16(,%edi,8), %xmm4         #33.0
            addpd     ..1.mainLOCAL.b.48+32(,%edi,8), %xmm3         #33.0
            addpd     ..1.mainLOCAL.b.48+48(,%edi,8), %xmm2         #33.0
            movapd    %xmm4, ..1.mainLOCAL.b.48+16(,%edi,8)         #33.0
            movapd    %xmm3, ..1.mainLOCAL.b.48+32(,%edi,8)         #33.0
            movapd    %xmm2, ..1.mainLOCAL.b.48+48(,%edi,8)         #33.0
            addl      $8, %edi                                      #32.0
            cmpl      %edx, %edi                                    #34.0
            jl        ..B1.27       # Prob 99%                      #34.0
                                    # LOE eax edx ecx ebp esi edi xmm0 xmm1
    

    Morals: It makes sense to look on the web and see if you can get some hints on optimizing for a particular architecture. Also the compiler man pages may help you in choosing the right compiler options.

    Another usual moral at this point is to compare the speed to doing a matrix vector multiply by calling a dgemv from a good BLAS library. However, in this case the Intel supplied BLAS library runs more slowly at 2700 Mflops (compared to 3800 Mflops for the hand-tuned blocks. While we do have more faith in the reliability of the BLAS library, it is sometimes the case that for small problems that will reside in cache we can do a bit better. As an example, people working with images sometimes find they need can do their own improved tuning for small matrices which require a large fraction of their computational time.

  • Memory Access Times

    Moving on from "in-cache". One difference among various computers is the size of the cache. Having large fast caches is helpful in getting a variety of codes to run faster. But big caches are expensive as are fast data buses to keep the the cache supplied from RAM. For example, for in-cache codes the Xeon Blades can be just as fast as more expensive Itaniums. But the large cache Itaniums can cost 5 times as much per CPU. How important is "in-cache" vs. "out-of-cache" in terms of data accesses?

    Table 6.1 shows clock cycles for copying (loading and storing) a double word (64 bits= 1 double precision) on an IBM Power chip (the p690 we have here is a slightly later generation). The first column is for the preferred stride of 1 (2 to the 0 power). For 16 = 2^4 double words, 1.8 clock cycles for a load and store declining for 2^14 bytes to .688 bytes. But then for 2^15 bytes which is the cache size the time kicks back up to 1.829 clocks per load-store. So long as the total size of the message is larger than the stride and so long as the whole message fits in half of cache, then the copy times are reasonable (above the double line).

    For data of sizes larger than half the cache size, we can see how long a load and store from memory take. Essentially, every time the stride doubles so does the copy time. This goes till we hit the cache line size of 32 double words. Then it takes 53 clocks per double word copy.

    So for out of cache copies with big strides we can take about 80 times as long for in cache copies with stride one. Or about 30 times as long for the out of cache stride one.

    Example -- two vectors x and y too long to fit in cache. First compute
    by
    for i=1,n
    sum = sum + x(i) * y(i)
    end
    and then
    x = c*x + y by
    for i=1,n
    y(i) = c*x(i) + y(i)
    end
    Since the vectors are too long to fit in cache they both have to be read in twice, so it would be much better if they were both referenced in stride 1 (30 times better than if they were both of stride 32). Goedecker and Hoisie call stride 1 (data stored adjacent in memory) "spatial locality" and call data currently resident in-cache "temporal locality".

    As an example of how to make a code execute slowly consider performing the above two operations (dot product and _axpy) on two rows of a Fortran matrix.

    On the Xeons of the Blade Center, a stride 1 copy too large for cache goes at about 40 million doubles per second. Striding the other way on a matrix (along successive rows) goes at about 4 million doubles per second (provided the row is short enough to fit in cache).

  • Performance for Unit and Large Stride Access

    In Table 1, we see that repeated copies of vectors of stride 1 execute distinctly more slowly when the copy bigger than half the cache size. Factor of 3. On many machines the factor is even more than 3. The factor depends on the bandwidth of the memory bus relative to the in-cache performance. On the Blades, 2 Xeon CPUs share a common data bus. But the fall-off in performance for large copies does not occur for vectors machines.

    Vector machines are designed to pipeline data directly to a register, without a cache. For the Cray C90 for example, bandwidth approaches a high asymptote as the length of a vector increases.

    The Xeon has the behavior of the cache-based architecture. This is why we see that repeated matrix vector multiplies execute at a lower peak rate. If we have to read data that is not currently resident in cache but is stored in adjacent locations, we expect it to be copied at about the rate of the copies too large for cache.

    Bandwidth of access of stride 1 data (spatially local data) is increasing more or less as processor speed. The rate of access for nonlocal (large stride) data is increasing less fast. Latency as number of clock cycles to access the first bit of data is increasing.

  • Loop-Reordering

    Consider

         do j = 1, n
            do i = 1, n
               A(i,j) = B(i,j)
            end do
         end do 
    
    vs. the equivalent 
    
         do i = 1, n
            do j = 1, n
               B(i,j) = A(i,j)
            end do
         end do 
    

    If this is Fortran, the array A will be stored by columns. If A is in cache, then if the inner loop goes along the row (on i) then the copy will may be 3 times as fast as if the loop is reversed. If A is out of cache or too large to fit in cache, then the inner loop along the row may be 80 times faster than the other.

  • Loop-Fusion

    Consider the consective matrix vector multiplies y <-- x'* A followed by w <-- w + A * y

         do j = 1, n
            y(j) = 0.d0 
            do i = 1, m
               y(j) = y(j) + A(i,j) * x(i)
            end do 
         end do
         do j = 1, n 
            do i = 1, m
               w(i) = w(i) + A(i,j) * y(j)
            end do
         end do
    
    These loops are in the correct Fortran order, but if A is too large to be in cache, then A is read into cache twice. We could instead fuse the two loops (this is also a BLAS operator GEMVT which hopefully we'll see in some new editions of the BLAS).
         do j = 1, n
            y(j) = 0.d0 
            do i = 1, m
               y(j) = y(j) + A(i,j) * x(i)
            end do
            do i = 1, m  
               w(i) = w(i) + A(i,j) * y(j)
            end do
         end do
    
    If a column of A fits in cache, then it is only read in once.

  • Data Locality in Conceptual Design

    Suppose we have for particle i, 3 position coordinates, x, y, and z and 3 velocities u, v, and w. We could do this with 6 vectors. Or we could make a 3 D array

    > real*8 posvel(6,n)

    so that column i has the x, y, z coordinates and velocities u, v, and w. If these properties of i are usually all accessed at one time, then accessing a column will fetch them all as one cache line. About six times as fast as accessing them all separately.

  • Cache Thrashing

    Caches are not fully associative. Typically there might be 4 locations where a given element from memory can be stored in cache. If all these are full one will need to be flushed and replaced. The eligible locations are separted by powers of two. If the leading dimension of an array is a power of two, then accessing rows of the matrix may cause "cache-thrashing", so even if we reaccess an element we find that's is already been inexplicably flushed from cache.