Problems grow too large to be done on a single processor. Either they take too much RAM (mind you, we may have to work to find the right options to use all available RAM), or they just take too long on a single processor, or both.
The two predominant forms of parallel computation are shared memory and distributed memory. In a distributed memory program, each processor (or possibly a node of processors) has its own chunk of RAM. To access RAM on another processor, we must send "messages", which typically involves writing to a buffer and using some kind of communication network, e.g., Gigabit Ethernet (GigE).
On a shared memory machine, memory is accessible to all processors. For example, 2 CPUs on a blade share 4 GBytes of memory. The 32 CPUs on the p690 share 128 GBytes of memory.
An advantage of shared memory is that it's typically easier to convert a code to shared memory than to a message passing code. The conversion to a shared memory code can often be gradual.
One approach to parallel computation would be to design a new computer language. In fact, several such languages have been developed, though most are vendor specific. One standard parallel language is High Performance Fortran, which is available on the p690.
Another portable way to get parallel performance on a shared memory platform is by using "threads". Threads are lightweight processes, designed to swap in and out of the CPU quickly. Pthreads is a standard low-level library for thread-based programming from C or C++.
For scientific computations, the most common means to obtain shared memory parallelism is OpenMP. OpenMP obtains parallelism by inserting "pragmas" in Fortran or C code. If OpenMP is not available then the pragmas are ignored and the code can still run, but serially. I'm mainly following the book "Parallel Programming in OpenMP", by Chandra, Dagum, Kohr, Maydan, McDonald, and Menon, Morgan Kaufmann, 2001. For online tutorials see LLNL OpenMP Tutorial and Mozdznyski.
Directives depend on runtime library routines and require setting of environment variables. Typically, the program is compiled with an OpenMP flag such as -omp (the exact flag depends on the compiler). Environment variables are set and there may be a preprocessor #include.
In Fortran 77, insert pragma lines beginning with
!$omp  c$omp  *$omp
with either a 0 or a blank in the 6th column. Any other character in the 6th column makes the line a continuation of the previous line.
In free form Fortran, any line starting with
!$omp
is a pragma. Continuation is indicated by an & at the end of the line.
In C or C++, a pragma starts with
#pragma omp
Unless the compiler is instructed to use the pragmas, it ignores them and compiles the code as a serial code.
Other statements may be calls to parallel libraries which might not be available in a serial version, which would cause problems linking. Statements that should be compiled only in the OpenMP version can be preceded, starting in the first column, by
!$  c$  *$
or, in free format, by !$ in any column so long as the preceding characters are white space.
A Fortran 77 example
      iam = 0
!$    iam = omp_get_thread_num()   ! only compiled if OpenMP enabled
c     A continuation (continued only when OpenMP enabled)
      y = x
!$   +    + offset
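For comparison, here is a rough sketch of the same idea in C, where the standard `_OPENMP` preprocessor macro plays the role of the `!$` sentinel. The helper names `my_thread_id` and `openmp_enabled` are made up for illustration; they are not part of OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>   /* only pulled in when the compiler has OpenMP enabled */
#endif

/* Hypothetical helper: the calling thread's number, or 0 in a serial build. */
int my_thread_id(void)
{
    int iam = 0;
#ifdef _OPENMP
    iam = omp_get_thread_num();   /* compiled only if OpenMP is enabled */
#endif
    return iam;
}

/* Hypothetical helper: 1 if this file was compiled with OpenMP, else 0. */
int openmp_enabled(void)
{
#ifdef _OPENMP
    return 1;
#else
    return 0;
#endif
}
```

Because the call to `omp_get_thread_num()` is guarded, the serial build never references the OpenMP runtime, so it links without it.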
Of course, one has to be careful that the serial version will still make sense if the OpenMP directives are not present.
There are 3 language extensions embodied in OpenMP: parallel control structures, data environment, and synchronization.
Parallel Control Structures alter the flow of control. The model is "fork/join". Two different threads can execute concurrently, or a parallel "do" can divide the iterations, with each processor performing a different subset of them. Examples?
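As one possible pair of examples, sketched in C: the first function lets two threads do different work concurrently (`sections`), the second divides loop iterations among threads (`parallel for`, the C counterpart of a parallel "do"). The function names and the work they do are invented for illustration only.

```c
/* Two independent tasks that may run on two different threads.
   Assumes n >= 1. Without OpenMP the pragmas are ignored and the
   two blocks simply run one after the other. */
void sum_and_max(const double *v, int n, double *sum, double *max)
{
#pragma omp parallel sections
    {
#pragma omp section
        {
            int i;
            double s = 0.0;
            for (i = 0; i < n; i++) s += v[i];
            *sum = s;
        }
#pragma omp section
        {
            int i;
            double m = v[0];
            for (i = 1; i < n; i++) if (v[i] > m) m = v[i];
            *max = m;
        }
    }   /* join: implicit barrier at the end of the parallel region */
}

/* Loop iterations divided among the threads: out[i] = i*i. */
void fill_squares(double *out, int n)
{
    int i;
#pragma omp parallel for          /* fork: a team of threads shares the loop */
    for (i = 0; i < n; i++) {
        out[i] = (double)i * (double)i;
    }                             /* join: implicit barrier after the loop */
}
```

In both cases execution is serial before the pragma and serial again after the closing brace; only the region in between is run by the team of threads.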
Communication and Data Environment
An OpenMP program starts with a single thread of execution. It has access to global variables, to automatic (stack) variables within subroutines, and to dynamically allocated (heap) variables. The global context remains throughout execution.
Each thread has its own execution context and private stack. It can use its private stack to call subroutines.
The
Synchronization
OpenMP threads communicate through reads and
writes to shared variables. Two forms of synchronization are
mutual exclusion and barriers. When multiple threads can modify
the same variable (think bank account balance as an example) a mutual
exclusion
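To make the bank account picture concrete, here is a small C sketch that uses the OpenMP `critical` directive for mutual exclusion. The function `deposit_many` and its arguments are hypothetical, chosen only to illustrate the idea.

```c
/* Hypothetical example: apply n deposits of `amount` to one shared balance.
   `balance` is declared outside the parallel loop, so all threads share it. */
double deposit_many(double balance, double amount, int n)
{
    int i;
#pragma omp parallel for
    for (i = 0; i < n; i++) {
        /* Only one thread at a time may execute the update below;
           without this, two threads could read the same old balance
           and one of the deposits would be lost. */
#pragma omp critical
        {
            balance = balance + amount;
        }
    }
    return balance;
}
```

A critical section serializes the update, so a loop that does nothing else gains no speed from the extra threads; the point here is only correctness.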
Consider the code
      subroutine saxpy(a, x, y, n)
      integer i, n
      real y(n), a, x(n)
      do i = 1, n
         y(i) = a * x(i) + y(i)
      end do
      return
      end
A parallel OpenMP version is
      subroutine saxpy(a, x, y, n)
      integer i, n
      real y(n), a, x(n)
!$omp parallel do
      do i = 1, n
         y(i) = a * x(i) + y(i)
      end do
      return
      end
Because there are no dependencies between loop iterations, the only
change required is to insert a pragma directing that the loop be parallelized.
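For readers working in C, a sketch of the equivalent routine follows. It assumes nothing beyond the Fortran version above; the only OpenMP-specific line is the pragma, and a compiler without OpenMP enabled ignores it and produces the serial loop.

```c
/* C counterpart of the Fortran saxpy: y = a*x + y, element by element. */
void saxpy(float a, const float *x, float *y, int n)
{
    int i;
#pragma omp parallel for   /* iterations are independent, so no other change is needed */
    for (i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

With GCC, for instance, OpenMP is switched on with `-fopenmp`; other compilers use different flags.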
Runtime Execution Model
(serial) Master thread executes serial portion of code
(serial) Master thread enters the saxpy subroutine
Master thread encounters "parallel do" and creates slave threads
(parallel) Master and slave threads divvy up the iterations and
each does some of them
(implicit barrier) Wait for everyone to finish
(serial) slave threads gone -- master resumes execution
A problem I see here is that dividing this loop up among processors may actually slow its execution. Question (for later): how to control how many ways the loop is subdivided.
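A tentative answer to that question, sketched in C: OpenMP provides a `num_threads` clause to request a team size for one loop and an `if` clause to fall back to serial execution when the loop is too short to be worth splitting. The function name `saxpy_limited` and the parameters `nthreads` and `min_parallel_n` are my own; a sensible threshold would have to be found by timing on the machine at hand.

```c
/* saxpy with explicit control over how the loop is split.
   nthreads       : number of threads requested for this loop (must be >= 1)
   min_parallel_n : loops shorter than this run serially (assumed tuning knob) */
void saxpy_limited(float a, const float *x, float *y, int n,
                   int nthreads, int min_parallel_n)
{
    int i;
#pragma omp parallel for if(n >= min_parallel_n) num_threads(nthreads)
    for (i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

The thread count can also be set for the whole program with the `OMP_NUM_THREADS` environment variable or a call to `omp_set_num_threads()`, without touching the loop itself.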
Communication and data scoping in the saxpy example
Each iteration of the loop reads the scalar value a. It reads x(i)
and reads, changes, and writes y(i). y(i) is not a problem because
each value of i belongs to exactly one thread, so these do not overlap.
"i" is a problem. Each thread needs its own version of the loop variable
i. The loop index "i" is
-------------------------------------------------------
Serial execution -- a,x,y,n,i are global shared
-------------------------------------------------------
Parallel execution
-- a,x,y,n are global shared
-- i | i | i | i is private to each thread
-------------------------------------------------------
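The scoping in the diagram can be spelled out explicitly with data-sharing clauses. The C sketch below is one way to write it: `default(none)` makes the compiler insist that every variable used in the loop is listed, so nothing is shared or private by accident. The name `saxpy_scoped` is only for illustration.

```c
/* saxpy with the data scoping written out:
   a, x, y, n are shared by all threads; each thread gets its own i. */
void saxpy_scoped(float a, const float *x, float *y, int n)
{
    int i;
#pragma omp parallel for default(none) shared(a, x, y, n) private(i)
    for (i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Listing the clauses does not change what the loop computes; it documents the intent and lets the compiler catch a variable that was forgotten.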
Once the
Synchronization in the Simple Loop Example
"y" is modified by multiple threads, but since each modifies its
own segment, that requires no synchronization. However, for a
Perspectives
Loop-level parallelization is straightforward and can be quickly accomplished. Usually we only get a speed-up of two or three from this sort of parallelism. How much is this compared to what we can get from using a given processor more efficiently?
Question: we have two CPUs per Xeon blade. Can we program them with OpenMP? OpenMP is listed as a compiler option for the pgi compilers, but ??