Last time, we went through Chapter 5 of Goedecker and Hoisie, trying out their techniques on henry2. A sample code designed to run in-cache at full speed ran at a bit less half the clock speed. The highest flop rate we could get was around 1300 Mflops/second. Since the advertised speed of these processors is 6 Gflops per processor, we're peaked at a bit higher than 20% of peak speed.
But using some architecture specific information, it turned out to be possible to write Fortran code that could do matrix vector multiplicaton at 3800 Mflops/sec.
The Xeon Cluster is a good reference to the Xeon (Pentium 4) architecture.
Summarizing and commenting: Two processors share a system bus to the graphics controller and RAM memory so we expect a bottleneck for data not in processor cache.
Since the machine is 32-bit , there is a max of 2^32 = 4 Gbytes of linear memory. Henry2 blades share 4 Gbytes of memory between 2 CPUs. (Most of competing CPUs are 64 bit allowing a larger memory size. At time of writing--August 2004-- 64-bit Xeons are coming out but not all the open source software pieces are there yet).
The Xeon has a single floating point processor. In theory, one FP operation can be performed in every clock tick, for a maximal speed of 3 Gflops using the floating point processor.
There are 8 80-bit floating point registers . To use the FP processor, the data to be added or mulitplied has to be in the floating point register on that clock cycle. Recall that when we examined the assembler language code for our "ideal" code we found lots of loads. This appears to because we had register spill so that only about half of the instructions were floating point adds and multiplies (the rest being loads back into the floating point registers). The floating point calculations using this floating point register are IEEE standard.
8 32-bit general purpose registers And lots of other special purpose registers: a 6 16-bit Segments registers, a 16 bit control register, 16-bit status register, 16-bit tag register, 11-bit opcode register 48 bit FPU instruction point register, 48-bit FPU data point register, 8 64-bit MMX registers, a 32 bit MXCSR Register, and lastly ...
8 128-bit XMM Registers The MMX provide SIMD (single instruction multiple data) operations on 64-bit packed byte, word or doubleword integer data. The XXM registers provide for 128-bit single or double precision floating point, word, double or quadword integer date. For floating data the XMM registers can have 2 packed double-precision floats or 4 single-precision values.
To use the XMM registers, SSE2 instructions are used. Loads, shifts,
shuffles, arithmetic, logicals. The INTEL compilers provide
workable SSE2 vectorizing .
SIMD (Single Instruction Multiple Device) means that when a single instruction is issued it can be performed in all registers is the same clock. Or in a specified subset. So if we can get all 8 registers with the desired data, e.g., 2 floating point numbers, then we can multiply each pair in one clock cycle. Of course it might take a few clock cycles before we're ready to do another such operation such as adding pairs of numbers.
The XXM registers need loops that the Intel compiler can vectorize.
The compiler needs
We have to turn on the machine specific optimization. Compiling with
>ifc -c -O3 -tpp7 -xW calgemv.f
increased the speed for 104 by 104 matrices from 1330 Mflops to 1900 Mflops.
Since the compiler unrolls the inner loop itself, we can play with the depth of unrolling by using the flag -unroll k to get unrolling of length k (as opposed to the default which is 4?). Then to get the 3800 Mflops, I worked with the depth of the outer loop (the compiler only unrolls the depth of the inner loop. The fact that the vector registers hold a total of 16 double precision numbers turned out to be relevant. I won't give you my version since this is homework for Thursday.
Here's a snippet of the assembler code which corresponds to the SIMD part of the loop. The right hand column attempts to correspond to the Fortran line number. The corresponds adds in multiplies for the floating point processor were fmul, fmulp, fadd, faddp. That code also had many more loads.
..B1.27: # Preds ..B1.27 ..B1.26
movapd ..1.mainLOCAL.a.52(%eax,%edi,8), %xmm5 #33.0
mulpd %xmm0, %xmm5 #33.0
addpd ..1.mainLOCAL.b.48(,%edi,8), %xmm5 #33.0
movapd ..1.mainLOCAL.a.52+16(%eax,%edi,8), %xmm4 #33.0
movapd ..1.mainLOCAL.a.52+32(%eax,%edi,8), %xmm3 #33.0
movapd ..1.mainLOCAL.a.52+48(%eax,%edi,8), %xmm2 #33.0
movapd %xmm5, ..1.mainLOCAL.b.48(,%edi,8) #33.0
mulpd %xmm0, %xmm4 #33.0
mulpd %xmm0, %xmm3 #33.0
mulpd %xmm0, %xmm2 #33.0
addpd ..1.mainLOCAL.b.48+16(,%edi,8), %xmm4 #33.0
addpd ..1.mainLOCAL.b.48+32(,%edi,8), %xmm3 #33.0
addpd ..1.mainLOCAL.b.48+48(,%edi,8), %xmm2 #33.0
movapd %xmm4, ..1.mainLOCAL.b.48+16(,%edi,8) #33.0
movapd %xmm3, ..1.mainLOCAL.b.48+32(,%edi,8) #33.0
movapd %xmm2, ..1.mainLOCAL.b.48+48(,%edi,8) #33.0
addl $8, %edi #32.0
cmpl %edx, %edi #34.0
jl ..B1.27 # Prob 99% #34.0
# LOE eax edx ecx ebp esi edi xmm0 xmm1
Morals: It makes sense to look on the web and see if you can get some hints on optimizing for a particular architecture. Also the compiler man pages may help you in choosing the right compiler options.
Another usual moral at this point is to compare the speed to doing a matrix vector multiply by calling a dgemv from a good BLAS library. However, in this case the Intel supplied BLAS library runs more slowly at 2700 Mflops (compared to 3800 Mflops for the hand-tuned blocks. While we do have more faith in the reliability of the BLAS library, it is sometimes the case that for small problems that will reside in cache we can do a bit better. As an example, people working with images sometimes find they need can do their own improved tuning for small matrices which require a large fraction of their computational time.
Moving on from "in-cache". One difference among various computers is the size of the cache. Having large fast caches is helpful in getting a variety of codes to run faster. But big caches are expensive as are fast data buses to keep the the cache supplied from RAM. For example, for in-cache codes the Xeon Blades can be just as fast as more expensive Itaniums. But the large cache Itaniums can cost 5 times as much per CPU. How important is "in-cache" vs. "out-of-cache" in terms of data accesses?
Table 6.1 shows clock cycles for copying (loading and storing) a double word (64 bits= 1 double precision) on an IBM Power chip (the p690 we have here is a slightly later generation). The first column is for the preferred stride of 1 (2 to the 0 power). For 16 = 2^4 double words, 1.8 clock cycles for a load and store declining for 2^14 bytes to .688 bytes. But then for 2^15 bytes which is the cache size the time kicks back up to 1.829 clocks per load-store. So long as the total size of the message is larger than the stride and so long as the whole message fits in half of cache, then the copy times are reasonable (above the double line).
For data of sizes larger than half the cache size, we can see how long a load and store from memory take. Essentially, every time the stride doubles so does the copy time. This goes till we hit the cache line size of 32 double words. Then it takes 53 clocks per double word copy.
So for out of cache copies with big strides we can take about 80 times as long for in cache copies with stride one. Or about 30 times as long for the out of cache stride one.
Example -- two vectors x and y too long to fit in cache.
First compute
As an example of how to make a code execute slowly consider performing
the above two operations (dot product and _axpy) on two rows of a
Fortran matrix.
On the Xeons of the Blade Center, a stride 1 copy too large for cache goes at
about 40 million doubles per second. Striding the other way on a matrix
(along successive rows) goes at about 4 million doubles per second (provided
the row is short enough to fit in cache).
In Table 1, we see that repeated copies of vectors of stride 1 execute
distinctly more slowly
when the copy bigger than half the cache size. Factor of 3.
On many machines the factor is even more than 3. The factor depends
on the bandwidth of the memory bus relative to the in-cache performance.
On the Blades, 2 Xeon CPUs share a common data bus. But the fall-off in
performance for large copies does not occur for vectors machines.
Vector machines are designed to pipeline data directly
to a register, without a cache. For the Cray C90 for example, bandwidth
approaches a high asymptote as the length of a vector increases.
The Xeon has the behavior of the cache-based architecture. This is why
we see that repeated matrix vector multiplies execute at a lower peak rate.
If we have to read data that is not currently resident in cache but is
stored in adjacent locations, we expect it to be copied at about the rate
of the copies too large for cache.
Bandwidth of access of stride 1 data (spatially local data) is
increasing more or less as processor speed. The rate of access for
nonlocal (large stride) data is increasing less fast. Latency as
number of clock cycles to access the first bit of data is increasing.
Consider
If this is Fortran, the array A will be stored by columns. If A is
in cache, then if the inner loop goes along the row (on i) then the
copy will may be 3 times as fast as if the loop is reversed. If A is
out of cache or too large to fit in cache, then the inner loop along
the row may be 80 times faster than the other.
Consider the consective matrix vector multiplies y <-- x'* A followed
by w <-- w + A * y
Suppose we have for particle i, 3 position coordinates, x, y, and z and
3 velocities u, v, and w. We could do this with 6 vectors. Or we could
make a 3 D array
> real*8 posvel(6,n)
so that column i has the x, y, z coordinates and velocities u, v, and w.
If these properties of i are usually all accessed at one time, then
accessing a column will fetch them all as one cache line. About six
times as fast as accessing them all separately.
Caches are not fully associative. Typically there might be 4 locations
where a given element from memory can be stored in cache. If all these are
full one will need to be flushed and replaced. The eligible locations are
separted by powers of two. If the leading dimension of an array is a power
of two, then accessing rows of the matrix may cause "cache-thrashing", so
even if we reaccess an element we find that's is already been inexplicably
flushed from cache.
for i=1,n
sum = sum + x(i) * y(i)
end
and then
x = c*x + y by
for i=1,n
y(i) = c*x(i) + y(i)
end
Since the vectors are too long to fit in cache they both have to be read
in twice, so it would be much better if they were both referenced in stride
1 (30 times better than if they were both of stride 32). Goedecker and Hoisie call
stride 1 (data stored adjacent in memory) "spatial locality" and call data
currently resident in-cache "temporal locality".
do j = 1, n
do i = 1, n
A(i,j) = B(i,j)
end do
end do
vs. the equivalent
do i = 1, n
do j = 1, n
B(i,j) = A(i,j)
end do
end do
do j = 1, n
y(j) = 0.d0
do i = 1, m
y(j) = y(j) + A(i,j) * x(i)
end do
end do
do j = 1, n
do i = 1, m
w(i) = w(i) + A(i,j) * y(j)
end do
end do
These loops are in the correct Fortran order, but if A is too large
to be in cache, then A is read into cache twice. We could instead
fuse the two loops (this is also a BLAS operator GEMVT which hopefully
we'll see in some new editions of the BLAS).
do j = 1, n
y(j) = 0.d0
do i = 1, m
y(j) = y(j) + A(i,j) * x(i)
end do
do i = 1, m
w(i) = w(i) + A(i,j) * y(j)
end do
end do
If a column of A fits in cache, then it is only read in once.