Moving on from "in-cache". Last time we discussed general principles of cache architectures with examples from the IBM Power architecture. This time I'm repeating a bit, but trying to explicitly try out ideas for the Xeon architecture we have on the henry2 Blades.
Table 6.1 shows clock cycles for copying (loading and storing) a double word (64 bits= 1 double precision) on an IBM Power chip (the p690 we have here is a slightly later generation). The first column is for the preferred stride of 1 (2 to the 0 power). For 16 = 2^4 double words, 1.8 clock cycles for a load and store declining for 2^14 bytes to .688 bytes. But then for 2^15 bytes which is the cache size the time kicks back up to 1.829 clocks per load-store. So long as the total size of the message is larger than the stride and so long as the whole message fits in half of cache, then the copy times are reasonable (above the double line).
Here's a small program to determine copy rates. I compiled and ran this on the Blade Center.
program copy
c
c This program times copies of various sizes and strides
c The copies are repeated enough times to be timed with
c the C clock timer, which times (in Linux) with resolution
c .01 seconds. Caveat. clock times CPU time, where
c it may be that wall-clock time is more appropriate.
c
real*8 a(4000000), b(4000000)
integer*8 i,j,k,n, its , jstart
real*4 ftime, pretim, entim
external ftime
print*, 'input message size'
read*, n
print*, 'input stride'
read*, k
do i=1,4000000
a(i) = i
end do
its = 10000000/n
pretim = ftime()
i = 1
do j = 1, its
c do its copies, so that the total number of copies is about 1.d7
jstart = j*i*k - 3000000*((i*j*k)/3000000)
c copy n elements of a (with stride k)
do i=jstart,n*k+jstart,k
b(i) = a(i)
end do
end do
entim = ftime()-pretim
print*,'elapsed CPU time =', entim
print*, ' clocks/(copied double) =', entim*3.D9/1.D7
stop
end
The following (3rd column) data is compiled with the g77 compiler, running on blade8-4.
stride size (clock ticks)/(double copied)
1 1 1152 855
1 2 576 459
1 4 315 240
1 8 186 153
1 16 123 111
1 32 84 72
1 64 63 57
1 128 51 45
1 256 42 39
1 512 39 36
1 1024 36 36
2 1024 57 57
4 1024 108 105
8 1024 198 201
16 1024 342 345
32 1024 345 348
64 1024 345 351
128 1024 345 360
256 1024 345 387
512 1024 345 399
1024 1024 345 414
The 4th column is data from a run on blade1-4 compiled with
>ifc -c tpp7 -xW -O3
Commentary. It appears that all data is out of cache, so that the timings here are for reading a chuck of double precision numbers and then writing them. Copying 1 double precision number is pretty expensive (about 1000 clock cycles). Copying 1K double precision numbers stored consecutively requires about 36 clock cycles per number copied. But if we copy 1024 numbers stored at a stride 16 or larger, there are 345 clocks per copy. The data indicates that the cache line is 16 or fewer double precision numbers.
For data of sizes larger than half the cache size, we can see how long a load and store from memory take. Essentially, every time the stride doubles so does the copy time. This goes till we hit the cache line size of 32 double words. Then it takes 53 clocks per double word copy.
So for out of cache copies with big strides we can take about 80 times as long for in cache copies with stride one. Or about 30 times as long for the out of cache stride one.
Example -- two vectors x and y too long to fit in cache.
First compute
As an example of how to make a code execute slowly consider performing
the above two operations (dot product and _axpy) on two rows of a
Fortran matrix.
On the Xeons of the Blade Center, a stride 1 copy too large for cache goes at
about 40 million doubles per second. Striding the other way on a matrix
(along successive rows) goes at about 4 million doubles per second (provided
the row is short enough to fit in cache).
In Table 1, we see that repeated copies of vectors of stride 1 execute
distinctly more slowly
when the copy bigger than half the cache size. Factor of 3.
On many machines the factor is even more than 3. The factor depends
on the bandwidth of the memory bus relative to the in-cache performance.
On the Blades, 2 Xeon CPUs share a common data bus. But the fall-off in
performance for large copies does not occur for vector machines.
Vector machines are designed to pipeline data directly
to a register, without a cache. For the Cray C90 for example, bandwidth
approaches a high asymptote as the length of a vector increases.
For the Xeon, I observed that "in-cache" copies of vectors of length
1K or so took 23 clocks/double. As vectors grew to long to be
in cache, copies took 30 clocks/double. As explained below
I think the difference was not dramatic because in this experiment
the reads were in or out of cache, but all the writes were out
of cache.
Bandwidth of access of stride 1 data (spatially local data) is
increasing more or less as processor speed. The rate of access for
nonlocal (large stride) data is increasing less fast. Latency as
number of clock cycles to access the first bit of data is increasing.
The limiting factor in a matrix vector multiply is typically
not the number of floating point operations (adds and multiplies)
that a processor can perform in a second, but rather how fast
elements of a matrix can be fed into the CPU.
Each read in a matrix vector multiply corresponds to 2 flops
(an add and a multiply). There is about a factor of 4 fall-off
in flop rate as the square matrix size gets up to 300. The size
of the L2 cache is 512 Kbytes = 2^19 bytes = 2^16 double precision
numbers = square matrix of size 2^8 = 256 by 256.
In the above table the clocks per read is calculated as
2*3.06e9/(Gflop Rate).
When matrices take up more than about half of the 512 Kbyte
cache, they start to be flushed between successive matrix
vector multiplies, so the number of clock cycles for a read
becomes longer. In (L2) cache reads take about 1.7 clock
cycles while out-of-cache reads take about 8 clock cycles.
8 clock cycles corresponds to about 375 million double
precision numbers per second. The advertised bus rate is ??
If we express the time for a streamed copy as
clocks to copy = (clocks to read) + (clocks to write)
then for the out of cache copy we have
30 = 8 + (clocks to write)
clocks to write = 22.
For the in cache copy we have
23 = 2 + (clocks to write)
clocks to write = 21
So I think we're writing out of cache to RAM in both the "in cache"
and "out of cache" experiments. And each write takes 21 or 22
clock cycles.
Consider
If this is Fortran, the array A will be stored by columns. If A is
in cache, then if the inner loop goes along the row (on i) then the
copy will may be 3 times as fast as if the loop is reversed. If A is
out of cache or too large to fit in cache, then the inner loop along
the row may be 80 times faster than the other.
Consider the consective matrix vector multiplies y <-- x'* A followed
by w <-- w + A * y
Suppose we have for particle i, 3 position coordinates, x, y, and z and
3 velocities u, v, and w. We could do this with 6 vectors. Or we could
make a 3 D array
> real*8 posvel(6,n)
so that column i has the x, y, z coordinates and velocities u, v, and w.
If these properties of i are usually all accessed at one time, then
accessing a column will fetch them all as one cache line. About six
times as fast as accessing them all separately.
Caches are not fully associative. Typically there might be 4 locations
where a given element from memory can be stored in cache. If all these are
full one will need to be flushed and replaced. The eligible locations are
separted by powers of two. If the leading dimension of an array is a power
of two, then accessing rows of the matrix may cause "cache-thrashing", so
even if we reaccess an element we find that's is already been inexplicably
flushed from cache.
for i=1,n
sum = sum + x(i) * y(i)
end
and then
x = c*x + y by
for i=1,n
y(i) = c*x(i) + y(i)
end
Since the vectors are too long to fit in cache they both have to be read
in twice, so it would be much better if they were both referenced in stride
1 (30 times better than if they were both of stride 32). Goedecker and Hoisie call
stride 1 (data stored adjacent in memory) "spatial locality" and call data
currently resident in-cache "temporal locality".
matrix size Mflops Doubles Read/Clock Cycle
80 3479 1.76
100 3333 1.84
120 3809 1.61
200 1568 3.90
300 639 9.58
400 684 8.95
500 714 8.57
600 755 8.11
700 792 7.73
800 792 7.73
900 807 7.58
1000 798 7.67
1100 826 7.72
1200 798 7.59
1300 806 7.67
1700 830 7.59
1800 830 7.37
1900 836 7.32
2000 842 7.27
do j = 1, n
do i = 1, n
A(i,j) = B(i,j)
end do
end do
vs. the equivalent
do i = 1, n
do j = 1, n
B(i,j) = A(i,j)
end do
end do
do j = 1, n
y(j) = 0.d0
do i = 1, m
y(j) = y(j) + A(i,j) * x(i)
end do
end do
do j = 1, n
do i = 1, m
w(i) = w(i) + A(i,j) * y(j)
end do
end do
These loops are in the correct Fortran order, but if A is too large
to be in cache, then A is read into cache twice. We could instead
fuse the two loops (this is also a BLAS operator GEMVT which hopefully
we'll see in some new editions of the BLAS).
do j = 1, n
y(j) = 0.d0
do i = 1, m
y(j) = y(j) + A(i,j) * x(i)
end do
do i = 1, m
w(i) = w(i) + A(i,j) * y(j)
end do
end do
If a column of A fits in cache, then it is only read in once.