Two similar lectures are
Dongarra lecture at UT Knoxville
Yelick lecture at UC Berkeley
So if the assembler program was
>load a
>load b
>add a and b
>store result
then to achieve a peak computation rate, we want several of these sequences to be occurring simultaneously, so that one add can occur on each clock cycle.
The Sandia article points out that too long a pipeline to accomplish a
simple add means that the pipeline spends a lot of time being broken.
Pipelining (instruction parallelism) was a main 1990s concept. More recently, Intel touts thread level parallelism. Here is one pipeline breaks, another thread can be ready to go. The Sandia paper ended up choosing the Opteron on price performance. Xeons appear to be preferable with code that frequently breaks the pipelines -- due to ability to perform out of order execution.
Registers -- Data needs to be in a register to be operated on, e.g., added, multiplied. The usual number of registers has been 32. The Xeon has 8 registers (inherited from the 8086 days)
Caches -- associative cache -- most modern caches are associative, i.e., data from a given global memory address can be written to more than one location in cache. Not only must data be in cache, but also instructions. The TLB (translation look aside buffer) keeps track of the relation between logical addresses and physical addresses.
When data is not in cache, it has to be fetched from a lower level cache, or from RAM -- or perhaps even the hard drive.
The L1 cache is relatively small -- 8K-64Kbytes It requires only one or two clock cycles to load data from the L1 cache to registers.
The L2 cache is somewhat larger -- 256K to 8 Mbytes. (sometimes there's yet another level). It might take ten clocks to get data from L2 to register. 512KBytes on the BladeCenter.
RAM can be several Gbytes (4 Gbytes on the Blade Center, 128 Gbytes on the p690) . It may take a hundred clock cycles to access the first data in RAM. Data is brought from RAM to cache in "cache lines" typically several words. The databus on the Xeons can bring in 400 million 64 bit chunks per second. This is a frequent bottleneck. Newer generations of Xeons will raise the rate to 800 million 64 bit chunks per second.
Writes from cache back to memory take somewhat longer than reads to cache. This seems to be independent of whether a cache is "write-through" or "write-back".
Comparison of Xeon to classic workstation architecture. For 20 years, the classic workstations chips had a great advantage over the x86 architectures. When my 286 was doing 8Kflops per second, my DEC workstation could do 8MFlops. But now the clock speeds and LINPACK benchmarks of the Xeon and Opteron have about caught up. The workstation class chips, especially the Intel Itanium with a large cache are still about as fast, (actually the current Intel Itanium is perhaps twice as fast). But the Itanium costs ten times as much. The Xeon is still lacking the 64 bit address space, but the Opteron already has it, and a 64 bit Xeon is currently almost in production.
The work station chips have had some refinements for high performance architectures.
Vector -- designed to do many floating point operations quickly
Shared Memory -- cache coherence problem. Solved but threatens scalability to many processors. Also because it's a specialty item, it makes for a high cost per processor. We can buy 64 CPUs of a BladeCenter for the cost of one year's maintenance on the 32 CPU p690.
Both vector and shared memory processors are "specialty" . Vector processors were almost dead, but the Japanese put out the "Earth Simulator" giving Cray the opportunity to sell a new batch to government labs.
Adolfy and Hoisie concentrate on workstation processors-- See Tables 2.1 and 2.2
Sparc Ultra-- no longer competitive for scientific computation.
MIPS -- on last generation (SGI will no longer use)
Power3 and Power4 -- these are what we have in the p690. New generations
are still coming out.
Alpha EV6 and EV7 (these were the ones I worked on and were
arguably the fastest till last year). On its last generation.
HP PA-RISC -- on its last generation. SGI, HP (including Compaq-Digital)
are putting eggs in the Itanium basket.
A more recent successor would be the Itanium. 64 bit Intel "dream chip" Expensive. Probably the fastest now.
Characterized by large cache memories -- pipelined instructions. 64 bit.
The Top 500 Supercomputers of a 2 or 3 years ago were dominantly this clusters of workstation chips. They were "clustered". Connected by "high-speed" networks.
Two flavors were "shared memory" , "distributed memory". Shared memory codes are easier to write. OpenMP is a standard. It is often easy to parallelize codes to run with OpenMP. Drawbacks: The resulting code works only on shared memory machines Scales only to a few processors. Many valuable commercial packages only work well for shared memory, e.g., Gaussian, Abaqus.
Since shared memory machines are very expensive per processor you're not likely to get more than a few processors.
If you do want the code to scale to many processors, then you have to control where the data sits, i.e., explicitly pass messages, typically by MPI calls.
The advantage of expressing the code in MPI is then it will run on distributed memory machines, which are much cheaper.
In the last year, 64 bit Opterons and Xeons have been moving up rapidly. These don't benchmark as well as the Itanium, but are much cheaper. See the Sandia paper as to why they chose Opterons. The Xeons have higher clock speeds. Handle multiple threads better. The Opterons have faster and independent data buses for each processor.
Here we have a 32 CPU shared memory IBM p690.
A 200 CPU (soon to grow to 250 CPU) Xeon (IBM BladeCenter).
The p690 has the IBM AIX (IBM Unix) software environment.
The BladeCenter runs under Linux (the open source Unix) with gnu, Portland Group and Intel compilers. The Totalview debugger can be used. Also the Portland group debugger and profiler.
Interconnect. A parallel distributed memory computer requires a fast interconnect so that processors can communicate rapidly. The interconnect can be a significant fraction of the cost. The BladeCenter, in common with a number of recent machines, uses GiGE, characterized by a high bandwidth between processors, but latency longer than for specialized fabrics like Quadrics switches or Myrinet. Infiniband may be a future contender.
The 512CPU SC ALPHA machine I worked on for HP cost around $10M ? and was a bargain compared to IBM SPs? The 250 CPU processor BladeCenter here cost less than $700K.
Summary Using standard off-the-shelf parts to contruct a cluster has virtues in price/performance. But it's also important to consider ease of management. The IBM BladeCenter has the virtue of being well-engineered and easy to manage, as well as relatively inexpensive per processor. It stays busy and software developed for it will run on almost any parallel machine.
But some very useful and popular software is designed for shared memory machines. So even though they're much more expensive to purchase and maintain, we have a continuing need . . .
To solve huge problems, we go to parallel computation. Two reasons are to increase available RAM and to speed computation. Speeding computation is a well-understood problem in serial computation, so it makes sense to start with serial optimization. Sometimes great speedups are possible just in serial.
Selection of best algorithm.
For example, Strassen's matrix matrix multiplication algorithm
requires O(n^2.78) operations compared to O(n^3) so for large
enough matrices it's worthwhile.
Use efficient libraries.
A student told me it took ten hours
for each time step of her code (involving solution of Ax=b)
I redid it with LAPACK with various BLAS libraries and got it down to
6 minutes with 2 CPUs on the same machine. (I think the fastest
BLAS managed to do most of the flops with Strassen's?)
There are 3 BLAS levels.
BLAS-1 -- vector vector, e.g.
y <-- ax + y
(_axpy, saxpy, daxpy, caxpy, zaxpy)
When they first built the Cray vector machines ..
Have to fetch two vectors x and y and then store y, but stores take
a long time
2n fetches of data, 2n flops. n stores
The n stores dominate.
BLAS-2 -- matrix vector e.g.
y <-- Ax + y
(_gemv)
If A n by n, n^2 data to be fetched, 2n^2 flops, n stores (since A
was not changed don't need to re-store A) .
The reading of A predominates. e.g., for the Ax computation
quoted above of 700 Mflops, there are 2 flops per data fetched
and the data is fetched over a bus that will carry 400 Million
doubles per second.
BLAS-3 -- matrix matrix , e.g.,
C <-- AB + C
(_gemm)
Requires 2 n^3 flops, 3 n^2 reads, n^2 writes. If you call
an optimized BLAS routine, it keeps the reads and writes down
to the advertised order so there are many flops per read and write,
and the floating point operations predominate the computation.
So the algorithm actually runs at near clock speed.
Optimal data layout.
Fortran stores data in columns,
C by rows.
In either case, data is brought into cache in cache lines of 4 or 8
double precision numbers. If you're going to be operating on
the entire matrix it makes sense to make 4 by 4 blocks and opearate
on them. This means the neighboring elements of the matrix will
always be in the smallest (fastest) cache.
But if you want to operate on a row or column at a time, operate a column at a time for Fortran, a row at a time for C. Else it could be really slow.
So for example, if you have data in Fortran, 3 data values per point and will typically be accessing the data for a given point at a given time.
Dimension r(3,n)
is likely to be much more efficient than
dimension r(n,3)
Why? Because the compiler often rearranges computations.
How and why, the following are some of the common ways that you may also want to use.
Common expression elimination
>s1 = a + b + c
>s2 = a + b - c
becomes
>t = a + b
>s1 = t + c
>s2 = t - c
Reducing the number of adds from 4 to 3. Suppose you instead had
>s1 = a + c + b
>s2 = a + b - c
First note this would not be the same because floating point arithmetic
is not associative. So the compiler might have a good excuse for not
performing the same optimization. So maybe we should just put
(a + b) in parentheses to help the compiler out.
Similarly if r*s exists in each expression in a loop (loop-invariant code motion)
Similarly for a reference such as x(i). Note the compiler can't assume that evaluating a function with the same input will always give the same result. Example rand()
One way to speed up code is to "inline" . Some compilers have an "inlining" option. Saves 50 clock cycles?
Strength reduction
Integer addition cheaper than integer multiply. So i + i preferable to 2*i. But floating point mults and adds cost the same.
x^3 costs many times more as a power (evaluating an exponential function) as opposed to x*x*x . So if you can convert a "pow" to a multiplication that could give a factor of twenty.
Sometimes there are "fast" transcendental libraries that give up a digit in return for a factor of two speedup. Sometimes this is a compiler option.
Constant Value Propagation and Evaluation
>two = 2.d0
>x = 3.d0 * two * y
The compiler can convert to
>x = 6.d0 * y
Induction Variable Simplification Instead of computing a 2-D array address by integer multiplication, the compiler could get to the next row address by integer addition.
Register Allocation and Instruction Scheduling
Optimally allocating registers is a very tough problem, but sometimes the compiler can do some easy tricks to help. For example, if one calculation depends on the next, then the next calculation is delayed a few clock cycles for the "pipeline".
Example
>x=1.+2.*y
>s = s + x
z = 2. + 2.*r
t = t + z
would have to wait between commands 1 and 2 and between 3 and 4
>x = 1. + 2.* y
>z = 2. * 2.*r
>s = s + x
t = t + z
Of course, you don't usually want to devote this much attention to a line of code. Especially if you think the compiler is going to do it anyway.
One way to tell if the compiler "did it" is to look at the assembler code for a section. Usually compilers have a -s or -S option to see the compiler produced assembler code.
To see if the code is holding you up, you can use a "profiler". The profiler tells you how much time the code spent in each routine. So you can concentrate on the routines that ate the time. Next time, timers and profilers.