When memory is shared between processors, we can co-ordinate processors by having them store data to RAM. On distributed memory processors, we similarly coordinate by storing to a global file system, but that would be orders of magnitude slower. The most common standards for shared memory machines are OpenMP and the Posix standard Pthreads. We won't cover Pthreads here. See for example, "Pthreads Programming", Nichols, Buttlar, and Farrell. O'Reilly & Associates, 1996.
Shared memory codes can be run either on 2 processors of a Xeon Blade or on the IBM p690. On the Blade Center, a primary limitation is that the bandwidth to memory must be shared by both CPUs. But if you aren't keeping the memory bus busy, then OpenMP is an easy way to increase performance by using two processor. On the p690, OpenMP can allow a single program to use a good deal of memory. It is not too long a wait to get 4 or 8 processors and OpenMP will let us use them.
On the BladeCenter, the PGI and Intel compilers both support use of OpenMP directives. To insure that we get 2 processes per node, include in the bsub script a line #BSUB -R span[ptile=2] . The following is an example of a simple bsub script.
#!/bin/csh #BSUB -R span[ptile=2] #BSUB -W 10 #BSUB -n 2 #BSUB -o out.%J ./foo
If this script is a file bit, it would be run by
>bsub < bit
For the PGI compilers (which are pgf77, pgf90, pgcc, no C++ compiler), the OpenMP compile flag is -mp . To get the path set to use the pgi compilers, type "add pgi". For the Intel compilers ifc and icc, the flag is -openmp (having added the path by "add intel") . In theory, one should set the environmental variable OMP_NUM_THREADS to be 2, but I think this is the default.
On the p690, with the xlf (or xlf90 compiler) the OpenMP library is specified by the -qsmp=omp flag. Here you should be careful to set the OMP_NUM_THREADS environmental variable as desired. In tcsh, for example
>setenv OMP_NUM_THREADS 4
For the p690, the bsub file could be almost the same
#!/bin/csh #BSUB -W 10 #BSUB -n 4 #BSUB -o out.%J ./foo
but here we are asking for 4 processors and did not need the #BSUB -R line. is unneccesary.
Many scientific computation codes have loops which involve many computations and which require most of the time in the code. If the computations in a loop are independent of one another and if the loop size is specified in advance, then it can be easy to have each of several processors read part of the data, work on it, and write it back to the common memory.
The Fortran syntax for a parallel do directive is
!$omp parallel do [clause [,] [clause ...]]
do index = first, last, stride
computations ...
end do
[!$omp end parallel do]
The variables first, last, stride are fixed so that the compiler can decide how to divide up the computation. The C syntax is
#pragma omp parallel for [clause [,] [clause ...]]
for (index = first; test_expr; increment_expr){
computations ...
}
For C, the test_expr can involve <,<=, etc. The increment_expr needs to be the equivalent of stride, i.e., increment by the same amount for each iteration of the loop. gotos (fortran) or breaks (C) which would take code execution out of the loop (or to a different iteration of the parallel loop) are not allowed.
The square brackets [ ] indicate optional expressions.
Clauses give some additional control over the loop. Most commonly
private or shared (or reduction) scoping clauses define how variables are to be treated. Recall that a shared variable has only one value and is common to all threads.
Clauses give some additional control over the loop. Most commonly
private or shared (or reduction) scoping clauses define how variables are to be treated. Recall that a shared variable has only one value and is common to all threads and private variables have different values in each thread. The Fortran default is for the loop variable to be private. In the C default, the loop variable is shared.
While scoping clauses are the most common, other clauses include schedule, if, ordered, and copyin.
schedule clauses control how the iterations are split across threads. if clauses can cause the loop to be parallel or serial depending on a user-controlled if statement (there needs to be enough work to cover the start-up cost for the parallel loop). The ordered clause says that some iterations must be done first, limiting the choice of parallelism. The copyin initializes so-called threadprivate variables. There can be lots of scoping and copyin clauses as there can be lots of variables. But at most one each of the schedule, if, order clauses.
Consider the code
subroutine saxpy(a, x, y, n)
integer i, n
real y(n), a, x(n), y
do i = 1, n
y(i) = a * x(i) + y(i)
end do
return
end
A parallel OpenMP version is
subroutine saxpy(a, x, y, n)
integer i, n
real y(n), a, x(n), y
!$omp parallel do
do i = 1, n
y(i) = a * x(i) + y(i)
end do
return
end
Because there are no dependencies between loop iterations, the only
change was to insert a pragma directing that the loop be parallelized.
Recall that when the loop is done, all the slave threads disappear, leaving
only the master thread (fork/join paradigm). There is an implicit barrier
at the end of the parallelized loop. If we wanted we could put the
!$omp end parallel doat the end.
subroutine mvect(A,lda,m,n,x,y)
c
c This subroutine computes y' = x'*A + y'
c where A is a matrix of m rows and n columns, x is a vector of
c m elements, y is a vector of n elements.
c
c Arguments
c
integer lda,m,n
double precision A(lda,*), x(*), y(*)
c
c local variables
c
integer i, j
c
!$omp parallel do
do j=1,n
do i=1,m
y(j) = y(j) + A(i,j)*x(i)
end do
end do
!$omp end parallel do
return
end
This loop works fine because the inner loops can share all variables but y and j. j is by default private and if each thread gets some values of j then they can also update the corresponding values y(j). By Fortran default, the inner loop index i is also private. But in a C inner loop, the inner loop index would be private.
The following example is less straightforward because here every iteration of j involves a change to all the values of y. So each processor has to have its own value for each element of y (they can't divide them up). After doing its part of the computation the vectors y from each processor have to be added up. This is the reduction operator.
subroutine mvect(A,lda,m,n,x,y)
c
c This subroutine computes y = A*x
c where A is a matrix of m rows and n columns, x is a vector of
c n elements, y is a vector of m elements.
c
c Arguments
c
integer lda,m,n
double precision A(lda,*), x(*), y(*)
c
c local variables
c
integer i, j
c
do i=1,n
y(i) = 0.d9
end do
!$omp parallel do private(i, j)
!$omp+ reduce(+:y)
do j=1,n
do i=1,m
y(i) = y(i) + A(i,j)*x(j)
end do
end do
!$omp end parallel do
return
end
So I tried this out on the Blade Center, using the same code that ran at 3500 Mflops for small matrices and 700 for large. (It's j loop, was do j=1,n,8 and it was compiled with loop unrolling set to 2). It was slowed down to 1100 Mflops for small matrices and the same 700 Mflops for large matrices.
Why?
Hypothesis. The 2 CPUs on a blade share a data bus. We were keeping the data bus full with one processor, hence the second CPU doesn't really help. Conceivably we could do better for small matrices, but the overhead of splitting up the problem for 2n^2 = 2*100^2 = 2.e4 computations is not worth it.
program spcol
c
c This program does sparse matrix multiplies if A
c is stored in sparse column format
c
integer nmax, nrowmax
parameter (nmax = 10000000, nrowmax = 1000000 )
real*8 a(nmax), x(nrowmax), b(nrowmax)
real*8 frac , sum
integer ir(nmax) , ic(nrowmax)
integer i, j, k, ncol, k1, k2, rem, nz, its , iseed
integer is, itemp
real*8 temp,mflops
character*1 TRANSA,TRANSB
real*4 ftime, pretim, entim
real*8 randw
external ftime, randw
print *, 'input n'
read*,n
print*,' input fraction of nonzeros'
read*,frac
iseed = n
sum = randw(iseed)
c
c This is a way to initialize a sparse matrix
c in O(nz) as opposed to O(n^2) provided
c that 0 < frac << 1
c The number of entries per column is specified,
c but the nonzero columns are random.
c
pretim = ftime()
k = 1
do j=1,n
ic(j) = k
i = 1
100 do while ( i.lt.(int(frac*n)+1))
a(k) = 1.D0/frac/n
ir(k) = randw(iseed)*n + 1
c
c If ir(is) is a row index already used,
c then find a new row index
c
do is = ic(j), k-1
if ( ir(is) .eq. ir(k) ) then
go to 100
c Since we're looping through the rows anyway,
c do a bubble sort ( of course, other sorts
c would be more efficient if there are many nonzero
c rows. )
else if (ir(is) .gt. ir(k) ) then
itemp = ir(is)
ir(is) = ir(k)
ir(k) = itemp
endif
end do
k = k+1
i = i + 1
end do
end do
ic(n+1) = k
nz = k
do j=1, n
x(j) = 1.d0
b(j) = 0.d0
end do
entim = ftime() - pretim
print*,' initialization time was ', entim
c do i=1,nz
c print*,'ir(',i,')=',ir(i)
c end do
its = 100000000/nz
pretim = ftime()
do k =1,its
do j=1,n
k1 = ic(j)
k2 = ic(j+1) - 1
do i = k1, k2
b(ir(i)) = b(ir(i)) + a(i)*x(j)
end do
end do
sum = 0.d0
do j=1,n
sum = sum + abs(b(j))
end do
sum = 1.d0/sum
do j=1,n
x(j) = sum* b(j)
b(j) = 0.d0
end do
end do
entim = ftime() - pretim
print*, entim, its
mflops = 2.e0*its*(nz/entim)/1.e6
print *,'sum = ', sum/n
print *,' mflops =', mflops
print *,'nz =',nz
stop
end
c
c RANNOR generates normally distributed (N(0,1)) random numbers.
c
double precision function rannor(ix)
implicit double precision (a-h,o-z)
data twopi /6.28318530717959d0/
r=dsqrt(-2.d0*dlog(randw(ix)))
theta=twopi*randw(ix)
rannor = r*dcos(theta)
return
end
c
c RANDW is a linear congruential U(0,1) random number generator.
c
c For details see ``A more portable random number generator''
c by Linus Schrage in ACM TOMS, Vol. 5, No.2, (1979), pp. 132--138.
c
double precision function randw(ix)
c Portable random number generator using the recursion
c ix = ix * a mod p
c
c Some compilers, e.g. the hp3000, require the following
c declaration to be integer*4
c
integer a,p,ix,b15,b16,xhi,xalo,leftlo,fhi,k
c
c 7**5, 2**15, 2**16, 2**31 -1
c
data a/16807/, b15/32768/, b16/65536/, p/2147483647/
c
c Get 15 hi order bits of ix.
c
xhi = ix / b16
c
c Get 16 lo bits of ix anf form lo product.
c
xalo=(ix - xhi * b16)*a
c
c Get 15 hi order bits of lo product.
c
leftlo = xalo / b16
c
c Form the 31 highest bits of full product.
c
fhi = xhi * a + leftlo
c
c Get overflow past 31st bit of full product.
c
k = fhi / b15
c
c Assemble all the parts and presubtract p.
c The parantheses are essential.
c
ix = (((xalo-leftlo*b16)-p) + (fhi-k*b15)*b16) + k
c
c Add p back in if necessary.
c
if (ix .lt. 0) ix = ix + p
c
c Multiply by 1/(2**31-1)
c
randw = float(ix)*4.656612875d-10
return
end
The assignment is to insert OpenMP directives and chart matrix size vs. Mflop rate, comparing the serial and parallel versions. Hold the number of nonzero entries per column constant.
Write an explanation of the results in terms of the common data bus for the two processors and what data resides in cache.
Data sharing is controlled by setting the "scope" or variables. "scope" is private, shared or reduction. Scoping clauses include
We can explictly declare variables to be shared or private. which Here's an example of how to declare variables private or shared
!$omp parallel do private(i, j)
!$omp+ shared (x, y, A)
do j=1,n
do i=1,m
y(j) = y(j) + A(i,j)*x(i)
end do
end do
!$omp end parallel do
The discussion of how variables are classified as private or shared could be very detailed and in practice you'll probably want to go to an online reference such as LLNL OpenMP Tutorial and Mozdznyski . Let's look at an example from Chandra, et al, and see how much more detail we want.
Fortran example
subroutine caller(a, n)
integer n, a(n), i, j, m
m = 3
!$omp parallel do
do i = 1, n
do j = 1, 5
call callee(a(i), m, j)
end do
end do
end
subroutine callee(x, y, z)
common /com/ c
integer x, y, z, ii, cnt
save cnt
cnt = cnt + 1
do ii = 1, z
x = x + c
end do
return
end
Fortran
variable scope Safe? why this scope?
----------------------------------------
A shared yes declared outside // do
n shared yes declared outside // do
i private yes // loop index
j private yes Fortran sequential loop index
m shared yes declared outside // do
x shared yes This is actually shared A
y shared yes This is actually shared m
z private yes This is actually private j
c shared yes in a common block
ii private yes local stack-allocated variable of subroutine
cnt shared no save attribute
C example
void caller(int a[], int n )
{ int i, j, m = 3;
#pragma omp parallel for
for (i = 0; i< n; i++ ) {
int k = m;
for (j = 1: j<= 5; j++)
callee(&a[i], &k, j);
}
}
extern int c;
void callee(int *x, int *y, int z)
{ int ii;
static int cnt;
cnt++;
for (ii = 0; ii < z; i++)
*x = *y + c;
}
C
variable scope Safe? why this scope?
----------------------------------------
a shared yes declared outside // do
n shared yes declared outside // do
i private yes // loop index
j sharedi yes sequential loop index, not in Fortran
m shared yes declared outside // do
k private yes automatic variable declered inside //
x private yes value parameter
*x shared yes This is actually shared A
y private yes value parameter
*y private yes This is actually private k
z private yes value parameter
c shared yes declared as extern
ii private yes local stack-allocated variable of subroutine
cnt shared no declared as static
Changing Defaults
Of course, we can explicitly declare as many variables as we like, but that might get painful, or perhaps we actually want to be declare everything (as in Fortran we sometimes use "implicit none").
To change the default, we add a clause to the parallel do. In fortran we can use any of default(shared), default(private), default(none) while in C the choice is default(shared) or default(none).
default(private) means that scoped variables are private by default. Above A and n would be private. But would not impact variables inside subroutine callee. One use would be to start to convert to a distributed memory code where variables really are not shared, so can make sure logic is right before taking the leap. Or if there are lots of temporary variables used then they would all have to be explicitly listed, but the default(private) let's us avoid that.
default(shared) doesn't actually change anything.
default(none) can help catch scoping errors by forcing us to think about the scoping of all of these many variables we are scoping by hand.
What do we mean by a data dependency? Here's an example.
a(1) = 1
a(2) = 1
!$omp parallel do
do i = 2,n ! produce some terms of the Fibonacci sequence
a(i+1) = a(i) + a(i-1)
end do
This code is innately serial in that each entry new value of a(i) depends on the previous 2. So a parallel do structure around this do loop would either be (correctly) ignored or the compiler would produce erroneous code. With the intel compiler I got
>ifc -openmp fib.f -o fib
ifc: Command line warning: openmp requires C style preprocessing; fpp level is rest to 2
program FIB
fib.f(5) : (col. 0) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
So generally, we can't expect that the compiler will discover dependencies for us.
So we'd better look at each variable within a do loop and see about its dependencies. If the variable is only read and never written, then its okay. Else if the variable is written only once and not reread on other iterations, then we're okay.
If a variable is written and also read, we have a dependence. Array variables can be especially tricky. Here's a few examples
do i = 2, n, 2
a(i) = a(i) + a(i-1)
end do
This would be okay. a(i) is only written for i even, so is only written
once. a(i-1) is odd and never written.
do i=1,n/2
a(i) = a(i) + a(i + n/2)
end do
What do you think?
do i=1, n/2 + 1
a(i) = a(i) + a(i + n/2)
end do
Is this the same?
do i=1,n
a(idx(i)) = a(idx(i)) + b(idx(i))
end do
What do you think?
Here's some examples with subroutines (also from Chandra, et al).
subroutine add(c,a,b)
common /count/ cnt
real*8 c, a, b
integer cnt
c = a+b
cnt = cnt + 1
end
suboutine smooth(a, n, 1)
integer n, i
read*8 a(n)
a(i) = ( a(i) + a(i-1) + a(i+1) ) / 3.0
end
subroutine add_count(a, n, i)
common /count/ cnt
integer n, i, cnt
real*8 a(n)
a(i) = a(i) + cnt
end
do i = 1,n
call add(c(i), a(i), b(i) )
end do
Okay?
do i=2,n-1
call smooth (a, n, i)
end do
Okay?
do i = 1,n
call add_count(a, n, i)
end do
Next time, what dependencies can we fix and how?