When memory is shared between processors, we can coordinate processors by having them store data to RAM. Processors with shared RAM can coordinate on a time scale on the order of a microsecond, which allows a finer level of parallelism than is efficient for distributed memory processing.
Shared memory codes can be run either on the processors of a Xeon Blade (the number of processors per blade varies from 2 to 8) or on the 8 core Opterons, which due to higher memory bandwidth and more RAM, should perform better.
On the BladeCenter, the PGI, Intel and gnu compilers support use of OpenMP directives. Typically a user will request the same number of threads as cores on a blade and request exclusive use of a blade. The sample script
#!/bin/csh
#BSUB -n 1 -x
#BSUB -W 10
#BSUB -R dc
#BSUB -o out.%J
setenv OMP_NUM_THREADS 4
./foo
will run the executable foo on a blade with two dual cores, i.e. a total of 4 cores, so that setting the number of threads (the environmental variable OMP_NUM_THREADS) to 4 makes sense. If this script is a file named bit, it would be run by
bsub < bit
For an 8 core job, the script
#!/bin/csh
#BSUB -n 1 -x
#BSUB -W 10
#BSUB -R qc
#BSUB -o out.%J
setenv OMP_NUM_THREADS 8
./foo
will request exclusive use of a dual quad core blade.
For the PGI compilers (pgf77, pgf90, and pgcc; there is no PGI C++ compiler), the OpenMP compile flag is -mp. To get the path set to use the PGI compilers, type "add pgi". For the Intel compilers ifc and icc, the flag is -openmp (having added the path by "add intel"), while for the gnu compilers gcc, g++, and gfortran, the flag is -fopenmp. (For good compiler flags, see First lecture.)
Many scientific computation codes spend most of their time in computationally intensive loops. If the computations in a loop are independent of one another and if the loop size is specified in advance, then it can be easy to have each of several processors read part of the data, work on it, and write it back to the common memory.
The Fortran syntax for a parallel do directive is
!$omp parallel do [clause [,] [clause ...]]
do index = first, last, stride
computations ...
end do
[!$omp end parallel do]
The variables first, last, and stride cannot be adjusted during the loop, so that the compiler can decide how to divide up the computation. The C syntax is
#pragma omp parallel for [clause [,] [clause ...]]
for (index = first; test_expr; increment_expr){
computations ...
}
For C, the test_expr can involve <, <=, etc. The increment_expr needs to be the equivalent of stride, i.e., it must increment by the same amount on each iteration of the loop. Gotos (Fortran) or breaks (C) that would take execution out of the loop (or to a different iteration of the parallel loop) are not allowed.
The square brackets [ ] indicate optional expressions.
Clauses give some additional control over the loop. Most commonly,
private, shared, or reduction scoping clauses define how variables are to be treated. Recall that a shared variable has only one value, common to all threads, while private variables have different values in each thread. The index of the parallel loop itself is private by default in both Fortran and C. Fortran also makes the indices of sequential loops nested inside the parallel loop private, while in C such inner indices are shared by default.
While scoping clauses are the most common, other clauses include schedule, if, ordered, and copyin.
schedule clauses control how the iterations are split across threads. An if clause can make the loop run in parallel or serially depending on a user-supplied condition (there needs to be enough work to cover the start-up cost of the parallel loop). The ordered clause says that certain parts of the iterations must be executed in loop order, limiting the available parallelism. The copyin clause initializes so-called threadprivate variables. There can be many scoping and copyin clauses, since there can be many variables, but at most one each of the schedule, if, and ordered clauses.
The initial value of each private variable is undefined. The copyin and threadprivate constructions give a way to re-use private variables in successive parallel do loops. A restriction is that the successive loops have to use the same number of threads.
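As an illustration of the schedule and if clauses, here is a minimal C sketch (not taken from the course codes); the chunk size of 100 and the threshold n > 1000 are arbitrary values chosen for the example, and it assumes an OpenMP-enabled compile such as gcc -fopenmp.
#include <stdio.h>
#include <omp.h>

#define N 10000

int main(void) {
    static double a[N];
    int i, n = N;

    /* schedule(dynamic,100): hand out chunks of 100 iterations to threads
       as they become free.  if(n > 1000): only run in parallel when there
       is enough work to cover the thread start-up cost. */
    #pragma omp parallel for schedule(dynamic,100) if(n > 1000)
    for (i = 0; i < n; i++)
        a[i] = 2.0 * i;

    printf("a[n-1] = %f\n", a[n-1]);
    return 0;
}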
Following is quite a bit more on parallel do, but let's skip to the other OpenMP Parallel Features .
Consider the code
subroutine saxpy(a, x, y, n)
integer i, n
real y(n), a, x(n)
do i = 1, n
y(i) = a * x(i) + y(i)
end do
return
end
A parallel OpenMP version is
subroutine saxpy(a, x, y, n)
integer i, n
real y(n), a, x(n)
!$omp parallel do
do i = 1, n
y(i) = a * x(i) + y(i)
end do
return
end
or in C
int psaxpy(int *n, float *sa, float *sx, float *sy) {
/* Local variables */
int i;
/* constant times a vector plus a vector. */
#pragma omp parallel for \
private(i)
for (i = 0; i < *n; i++) {
sy[i] += *sa * sx[i];
}
return 0;
}
Because there are no dependencies between loop iterations, the only change to the Fortran was to insert a directive (and to the C a pragma) saying that the loop should be parallelized.
Recall that when the loop is done, all the slave threads disappear, leaving
only the master thread (fork/join paradigm). There is an implicit barrier
at the end of the parallelized loop. If we wanted we could put the
!$omp end parallel do at the end of the Fortran version. The main possible error is not examining the loop carefully enough to make sure that each iteration is independent. So it may be good practice to be more explicit than necessary, e.g.,
subroutine saxpy(a, x, y, n)
integer i, n
real y(n), a, x(n)
!$omp parallel do
!$omp& shared(y,x,a) private(i)
do i = 1, n
y(i) = a * x(i) + y(i)
end do
!$omp end parallel do
return
end
or in C
int psaxpy(int *n, float *sa, float *sx, float *sy) {
/* Local variables */
int i;
/* constant times a vector plus a vector. */
#pragma omp parallel for \
shared(sy,sx,sa) private(i)
for (i = 0; i < *n; i++) {
sy[i] += *sa * sx[i];
}
return 0;
}
I had a team of two novice programmers inserting OpenMP directives. For each loop, they wrote a comment on the necessary private and shared variables (or why the loop could not be made independent). Their parallelized code returned the same answers as the serial code.
I, on the other hand, sometimes get overconfident, don't comment, make mistakes, then have a hard time tracking down the bug.
subroutine mvect(A,lda,m,n,x,y)
c
c This subroutine computes y' = x'*A + y'
c where A is a matrix of m rows and n columns, x is a vector of
c m elements, y is a vector of n elements.
c
c Arguments
c
integer lda,m,n
double precision A(lda,*), x(*), y(*)
c
c local variables
c
integer i, j
c
!$omp parallel do
!$omp& shared(A,x,y) private(i,j)
do j=1,n
do i=1,m
y(j) = y(j) + A(i,j)*x(i)
end do
end do
!$omp end parallel do
return
end
or in C
int mvect(float A[100][100], int m, int n, float x[100], float y[100]) {
/* Local variables */
int i;
int j;
/* This subroutine computes y' = x'*A + y'
where A is a matrix of m rows and n columns, x is a vector of
m elements, y is a vector of n elements. */
#pragma omp parallel for \
shared(A,x,y) private(i,j)
for (j = 0; j < n; j++) {
for (i = 0; i < m; i++) {
y[j] += A[i][j] * x[i];
}
}
return 0;
}
This loop works fine: the threads can share A, x, and y, because j is private (it is the parallel loop index) and each thread gets its own set of j values, so each thread updates only its own elements y(j). By the Fortran default, the inner loop index i is also private. In C, however, a sequential inner loop index is shared by default, which is why the C version lists i explicitly in the private clause.
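One way to avoid listing inner loop indices in a private clause in C is to declare them inside the parallel loop, since variables declared inside the loop body are automatically private to each thread. Here is a minimal sketch (assuming a C99 or later compiler; the matrix sizes and the elided initialization are made up for the example):
#include <stdio.h>

#define M 100
#define N 100

int main(void) {
    static float A[M][N], x[M], y[N];  /* initialization elided; statics start at zero */
    int j;

    #pragma omp parallel for shared(A, x, y)
    for (j = 0; j < N; j++) {
        /* i is declared inside the parallel loop, so each thread
           automatically gets its own copy; no private(i) clause needed */
        for (int i = 0; i < M; i++)
            y[j] += A[i][j] * x[i];
    }
    printf("y[0] = %f\n", y[0]);
    return 0;
}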
The following example is less straightforward because here every iteration of j involves a change to all the values of y. So each processor has to have its own copy of the whole vector y (the elements can't be divided up among threads). After each processor has done its part of the computation, the partial vectors y have to be added together. This is a reduction.
subroutine mvect(A,lda,m,n,x,y)
c
c This subroutine computes y = A*x
c where A is a matrix of m rows and n columns, x is a vector of
c n elements, y is a vector of m elements.
c
c Arguments
c
integer lda,m,n
double precision A(lda,*), x(*), y(*)
c
c local variables
c
integer i, j
c
do i=1,m
y(i) = 0.d0
end do
!$omp parallel do private(i, j)
!$omp& reduction(+:y)
do j=1,n
do i=1,m
y(i) = y(i) + A(i,j)*x(j)
end do
end do
!$omp end parallel do
return
end
Data sharing is controlled by setting the "scope" of variables: private, shared, or reduction. We can explicitly declare variables to be shared or private. Here's an example of how to declare variables private or shared:
!$omp parallel do private(i, j)
!$omp+ shared (x, y, A)
do j=1,n
do i=1,m
y(j) = y(j) + A(i,j)*x(i)
end do
end do
!$omp end parallel do
The discussion of how variables are classified as private or shared could be very detailed and in practice you'll probably want to go to an online reference such as LLNL OpenMP Tutorial and Mozdznyski . Let's look at an example from Chandra, et al, and see how much more detail we want.
Fortran example
subroutine caller(a, n)
integer n, a(n), i, j, m
m = 3
!$omp parallel do
do i = 1, n
do j = 1, 5
call callee(a(i), m, j)
end do
end do
end
subroutine callee(x, y, z)
common /com/ c
integer x, y, z, ii, cnt
save cnt
cnt = cnt + 1
do ii = 1, z
x = x + c
end do
return
end
Fortran
variable   scope     Safe?   why this scope?
------------------------------------------------------------
a          shared    yes     declared outside // do
n          shared    yes     declared outside // do
i          private   yes     // loop index
j          private   yes     Fortran sequential loop index
m          shared    yes     declared outside // do
x          shared    yes     this is actually shared a
y          shared    yes     this is actually shared m
z          private   yes     this is actually private j
c          shared    yes     in a common block
ii         private   yes     local stack-allocated variable of subroutine
cnt        shared    no      save attribute
C example
void caller(int a[], int n )
{ int i, j, m = 3;
#pragma omp parallel for
for (i = 0; i< n; i++ ) {
int k = m;
for (j = 1; j <= 5; j++)
callee(&a[i], &k, j);
}
}
extern int c;
void callee(int *x, int *y, int z)
{ int ii;
static int cnt;
cnt++;
for (ii = 0; ii < z; ii++)
*x = *y + c;
}
C
variable   scope     Safe?   why this scope?
------------------------------------------------------------
a          shared    yes     declared outside // for
n          shared    yes     declared outside // for
i          private   yes     // loop index
j          shared    no      sequential loop index, shared in C (unlike Fortran)
m          shared    yes     declared outside // for
k          private   yes     automatic variable declared inside //
x          private   yes     value parameter
*x         shared    yes     this is actually shared a
y          private   yes     value parameter
*y         private   yes     this is actually private k
z          private   yes     value parameter
c          shared    yes     declared as extern
ii         private   yes     local stack-allocated variable of subroutine
cnt        shared    no      declared as static
Changing Defaults
Of course, we can explicitly declare as many variables as we like, but that might get painful, or perhaps we actually want to be forced to declare everything (as in Fortran we sometimes use "implicit none").
To change the default, we add a clause to the parallel do. In Fortran we can use any of default(shared), default(private), or default(none), while in C the choice is default(shared) or default(none).
default(private) means that scoped variables are private by default. In the example above, a and n would then be private; it would not affect variables inside subroutine callee. One use is when starting to convert to a distributed memory code, where variables really are not shared, so we can make sure the logic is right before taking the leap. Or, if there are lots of temporary variables, they would all have to be listed explicitly, but default(private) lets us avoid that.
default(shared) doesn't actually change anything.
default(none) can help catch scoping errors by forcing us to think about the scoping of every variable, since each one must then be scoped by hand.
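Here is a minimal C sketch of default(none) (not from the course codes; the array, its size, and the scale factor are invented for the example). Leaving any of a, n, or scale out of the clauses below should produce a compile-time error, which is exactly the kind of scoping slip default(none) is meant to catch.
#include <stdio.h>

#define N 1000

int main(void) {
    static double a[N];
    double scale = 2.0;
    int i, n = N;

    /* default(none) forces every variable used in the loop to be scoped
       explicitly; forgetting one becomes a compile-time error instead of
       a silent (and possibly wrong) shared default */
    #pragma omp parallel for default(none) shared(a, n, scale) private(i)
    for (i = 0; i < n; i++)
        a[i] = scale * i;

    printf("a[n-1] = %f\n", a[n-1]);
    return 0;
}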
What happens if, inside a "parallel do", each thread encounters another "parallel do"? If, in csh or tcsh, we have done
setenv OMP_NESTED true
or, in sh/bash,
export OMP_NESTED=true
then each thread becomes the master of a batch of new threads.
Nested parallelism can happen inadvertently. For instance, each thread could call a BLAS routine such as DGEMM (matrix matrix multiply). If the DGEMM is from a threaded library, then each DGEMM call launches its own set of threads. For example, consider the following Fortran program
!
! A long skinny matrix A times a small matrix B:
!
!               | A1 |              | C1 |
!               | A2 |              | C2 |
!  C = alpha *  | A3 | * B + beta * | C3 |
!               | .. |              | .. |
!               | Ak |              | Ck |
subroutine dskin(TRANSA, TRANSB, m, k, l,
+ alpha, a, lda, b, ldb,
+ beta, c, ldc )
IMPLICIT NONE
CHARACTER*1 TRANSA, TRANSB
INTEGER K, M, L, LDA, LDB, LDC
DOUBLE PRECISION ALPHA, BETA
INTEGER I, J, tnumber, OMP_GET_THREAD_NUM
INTEGER OMP_GET_NUM_THREADS
INTEGER kblock, lastbl, kbl2
double precision a(lda,*), b(ldb,*), c(ldc,*)
integer lblock,tblock
if(TRANSA.ne.'N') print*,'have only debugged dgskn for TRANSA = N'
if(TRANSB.ne.'N') print*,'have only debugged dgskn for TRANSB = N'
!$OMP PARALLEL
tnumber = omp_get_num_threads()
!$OMP END PARALLEL
kblock = m/tnumber
lblock = m - kblock * (tnumber-1)
! print*,' m is',m,'same as', lblock + kblock*(tnumber-1)
!$OMP PARALLEL DO PRIVATE(kbl2, tblock)
do i = 1, tnumber
if (i.lt.tnumber) then
kbl2 = kblock
tblock = OMP_GET_THREAD_NUM()
! print*,'i =', i,' kbl2 =',kbl2,' thread=',tblock
endif
if (i.eq.tnumber) then
kbl2 = lblock
tblock = OMP_GET_THREAD_NUM()
! print*,'i =', i,' kbl2 =',kbl2,' thread=',tblock
endif
call dgemm(TRANSA,TRANSB, kbl2, k, l,
+ alpha, a((i-1)*kblock+1,1), lda, b, ldb,
+ beta, c((i-1)*kblock+1,1), ldc )
end do
return
end
On the 16 core nodes, OMP_NUM_THREADS = 4 was fastest when dgemm came from a threaded BLAS library; OMP_NUM_THREADS = 16 was fastest when dgemm came from a non-threaded BLAS library.
Supposedly the default is that OMP_NESTED is set to false. Here the launch of additional threads by the BLAS library appeared to be independent of the OMP_NESTED setting.
An OpenMP compliant BLAS would use threads only if the OMP_NESTED environmental variable was set to true. So we'd need a preprocessor that determined which library to use?
SideNote. One way to use multiple cores to speed a program is to call a threaded BLAS. Involves no use of OpenMP.
For example, on the 16 core processor, a job that called DGEMM (the BLAS matrix-matrix multiply) ran faster with OMP_NUM_THREADS set to 4 than with OMP_NUM_THREADS set to 16. In that case the BLAS library was threaded (so that each of the 4 threads spawned 4 threads?). But I don't think the OMP_NESTED environmental variable came into play? (to be checked)
Calling with the non-threaded BLAS library, then setting OMP_NUM_THREADS to 16 ran fastest.
The next sections show some details on problems with loops that take personal attention (as opposed to automatic parallelism). But let's jump to OpenMP Parallel Features to get more perspective on what can be done with OpenMP.
What do we mean by a data dependency? Here's an example.
a(1) = 1
a(2) = 1
!$omp parallel do
do i = 2,n ! produce some terms of the Fibonacci sequence
a(i+1) = a(i) + a(i-1)
end do
This code is innately serial in that each new value a(i+1) depends on the previous two. So a parallel do directive around this loop would either be (correctly) ignored or the compiler would produce erroneous code. With the Intel compiler I got
>ifc -openmp fib.f -o fib
ifc: Command line warning: openmp requires C style preprocessing; fpp level is reset to 2
program FIB
fib.f(5) : (col. 0) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
So generally, we can't expect that the compiler will discover dependencies for us.
So we'd better look at each variable within a do loop and check its dependencies. If the variable is only read and never written, then it's okay. If the variable is written on only one iteration and never read on other iterations, then we're also okay.
If a variable is both written and read, we have a dependence. Array variables can be especially tricky. Here are a few examples.
do i = 2, n, 2
a(i) = a(i) + a(i-1)
end do
This would be okay: a(i) is written only for even i, so each element is written at most
once, and a(i-1) has an odd index and is never written.
do i=1,n/2
a(i) = a(i) + a(i + n/2)
end do
What do you think?
do i=1, n/2 + 1
a(i) = a(i) + a(i + n/2)
end do
Is this the same?
do i=1,n
a(idx(i)) = a(idx(i)) + b(idx(i))
end do
What do you think?
Here are some examples with subroutines (also from Chandra, et al).
subroutine add(c,a,b)
common /count/ cnt
real*8 c, a, b
integer cnt
c = a+b
cnt = cnt + 1
end
subroutine smooth(a, n, i)
integer n, i
real*8 a(n)
a(i) = ( a(i) + a(i-1) + a(i+1) ) / 3.0
end
subroutine add_count(a, n, i)
common /count/ cnt
integer n, i, cnt
real*8 a(n)
a(i) = a(i) + cnt
end
do i = 1,n
call add(c(i), a(i), b(i) )
end do
Okay?
do i=2,n-1
call smooth (a, n, i)
end do
Okay?
do i = 1,n
call add_count(a, n, i)
end do
Loop-carried dependencies occur when a value written by one iteration of a loop is read by another. Dependencies in one iteration of a loop do not cause problems. Here's a loop in which one iteration may or may not influence the next.
y = 0
do i= 1,n
if (rand().lt. .5) y = rand()
a(i) = y
end do
Could this loop be parallelized?
Classification of dependencies in terms of data flow can help in converting loops to be parallel.
do i=2,n-1
a(i+1) = a(i) + a(i-1)
end do
do i=1,n-1
a(i) = a(i) + a(i+1)
end do
The first loop has a true (flow) dependence: a value written in one iteration is read in a later one. The second has an antidependence: a(i+1) is read in one iteration and overwritten in the next. Would the following be correct?
!$OMP PARALLEL DO
do i=1,n-1
a(i) = a(i) + a(i+1)
end do
No. a(i+1) could be overwritten before it is used to create a(i). But this can be parallelized by making a preliminary do loop which makes a new vector.
!$OMP PARALLEL DO
do i=1,n-1
ap1(i) = a(i+1)
end do
!$OMP PARALLEL DO
do i=1,n-1
a(i) = a(i) + ap1(i)
end do
Claim: antidependencies can always be removed by creating new storage, so such loops can always be parallelized.
do i=1,n-1
a(i) = a(i) + 2
x = a(i)
end do
!$OMP PARALLEL DO PRIVATE(X)
do i=1,n
a(i) = a(i) + 2
x = a(i)
end do
is incorrect: if x is private, its value disappears at the end of the loop; if x were instead declared shared, we still wouldn't know which iteration i performs the last update.
!$OMP PARALLEL DO PRIVATE(X)
do i=1,n
a(i) = a(i) + 2
end do
!$OMP END PARALLEL DO
x = a(n)
would work fine. Or if we want to perform a few more flops,
!$OMP PARALLEL DO SHARED(A) LASTPRIVATE(X)
do i=1,n
a(i) = a(i) + 2
x = a(i)
end do
We claim that antidependencies and output dependences can always be removed. In some cases, true dependencies can also be removed. The most standard case is when the dependency corresponds to one of the reduction operators. These are commutative operations, including sum, product, max, min, and the logical operators and and or.
M = a(1)
do i=2,n
M = max(M,a(i))
end do
Is the following correct?
M = a(1)
!$OMP PARALLEL DO SHARED(a,M)
do i=2,n
M = max(M,a(i))
end do
No. With M shared and unprotected, threads can overwrite each other's updates, so we could end up with the max of only some subset of the entries. (And if we protected every update of M, only one thread could operate at a time, so all would be waiting on one another.) But the following would work.
M = a(1)
!$OMP PARALLEL DO SHARED(a) REDUCTION(MAX:M)
do i=2,n
M = max(M,a(i))
end do
If a variable k is incremented by the same amount on each iteration of the loop, then each instance "depends" on the last. But we could compute k directly from the loop counter and so avoid the dependency.
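Here is a minimal C sketch of that idea (the names k0 and stride are invented for the example): the serial code would carry k from one iteration to the next, but computing k from the loop index removes the loop-carried dependence.
#include <stdio.h>

#define N 1000

int main(void) {
    static double a[N];
    int i, k0 = 5, stride = 3;

    /* The serial version would carry an induction variable:
         k = k0;  for (i...) { a[i] = k;  k += stride; }
       Computing k from the loop index removes the loop-carried dependence. */
    #pragma omp parallel for private(i)
    for (i = 0; i < N; i++) {
        int k = k0 + i * stride;   /* each iteration is now independent */
        a[i] = (double) k;
    }

    printf("a[N-1] = %f\n", a[N-1]);
    return 0;
}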
A basic feature is the parallel construct, discussed next. Work is shared by loop constructs, sections constructs, a single construct, and (in Fortran) a workshare construct.
Loop constructs are convenient in that we can parallelize code gradually. Often a fair proportion of the code time is spent in loops that parallelize well. Amdahl famously pointed out that the remaining serial bits of the code put a lid on the potential speedup. Let's give an example.
As a thought experiment , suppose we have a machine with ten processors and that 3/4s of the work is in loops and moreover that each loop parallelizes perfectly with speedup ten.
loop -- three seconds
serial bit -- one second -- go to next iteration
loop -- three seconds
serial bit -- one second -- go to next iteration
...
so that after parallelization we have
parallel loop -- .3 seconds
serial bit -- one second -- go to next iteration
parallel loop -- .3 seconds
serial bit -- one second -- go to next iteration
...
Before parallelization, we were taking 4 seconds per iteration. Using ten processors, the time goes to 1.3 seconds, so ten processors give a speedup of about 4/1.3, i.e., about 3. How much speedup would we get with one hundred processors? 4/1.03, i.e., just less than 4. With a thousand? Throwing more processors at the problem doesn't help much, and eventually, as the number of processors grows, the work takes longer to distribute and the job slows down again. (We saw this with the short loops in the first lecture.) Often speedups of about 4 are all we can get through just using OpenMP to parallelize loops.
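For what it's worth, here is a tiny serial C program (not part of the course materials) that reproduces the arithmetic of the thought experiment for several processor counts.
#include <stdio.h>

int main(void) {
    /* one iteration = 3 s of perfectly parallel loop + 1 s of serial work */
    double t_loop = 3.0, t_serial = 1.0;
    int procs[] = {1, 10, 100, 1000};
    int i;

    for (i = 0; i < 4; i++) {
        int p = procs[i];
        double t = t_loop / p + t_serial;          /* time per iteration */
        double speedup = (t_loop + t_serial) / t;  /* relative to 4 s serial */
        printf("%4d processors: %.3f s per iteration, speedup %.2f\n",
               p, t, speedup);
    }
    return 0;
}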
The only way to get a higher potential speedup is to find some way to make the "serial" regions parallel. In OpenMP, besides parallelizing do loops, we can also get parallelism from a parallel region. Such a region causes each thread to execute the same code. For example, each thread could print "hello world"; if there are two threads, we get two "hello world"s. The "parallel do" construct takes a given amount of work and parcels it out. If you put a plain "parallel" around a do loop, the whole loop is performed by each thread. How could we use the "parallel region" to divide up work?
Returning to the thought experiment above, suppose the serial region between loops can be divided into 4 independent and equal sized tasks. Then the irreducibly serial part of an iteration becomes .25 seconds instead of 1 second, so an iteration can potentially be sped up by a factor of 16, instead of at most 4 if the "serial" region is irretrievably serial.
In order that each processor does useful work, we typically need each processor to get its own set of data. This can be accomplished by having each processor work on its own part of a matrix. Another useful technique is a common block with the threadprivate directive. We also want each processor to be able to output its own data; one way is to have each processor write to its own section of an array.
Branch statements out of (or into) a parallel region are not legal. Just as in the "parallel do" case, variables need to be divided into shared and private.
subroutine sup(n)
integer n
!$OMP PARALLEL
call mypart(n)
if (n.lt.5) return
!$OMP END PARALLEL
return
end
would not be legal.
The statement
!$OMP PARALLEL
print*,'Hello World'
!$OMP END PARALLEL
or in C
#pragma omp parallel
{
printf("Hello World\n");
}
is not very useful, but okay. We get a 'Hello World' per thread. As in the case of the parallel do, the parallel region causes a master thread to fork a bunch of slaves, which all disappear at the END PARALLEL. Again the END PARALLEL is a synchronization point. What would the following construct do?
!$OMP PARALLEL
do i = 1, 100
print*,'Hello World'
end do
!$OMP END PARALLEL
or in C
#pragma omp parallel
for (i=0;i<100;i++)
{
printf("Hello World \n");
}
More usefully, the PARALLEL construct allows OpenMP programs to work with the SPMD (single program, multiple data) paradigm commonly used in distributed memory parallel codes (e.g., MPI codes).
In C, we could do the following
#pragma omp parallel
{
printf(" This parallel region is performed by thread %d\n",
omp_get_thread_num();
if (omp_get_thread_num() == 0 ) {
/* some stuff */ ; }
if (omp_get_thread_num() == 1 ) {
/* some other stuff */ }
} /* end parallel region */
or in Fortran
!$OMP PARALLEL
if (OMP_GET_THREAD_NUM().eq.0) then
stuff...
elseif (OMP_GET_THREAD_NUM().eq.1) then
other stuff
endif
!$OMP END PARALLEL
You may want different parallel regions to use different numbers of threads. You can use the function omp_set_num_threads. Here's a Fortran example.
! Adapted from the LLNL OpenMP Tutorial
!This program shows the use of the REDUCTION clause.
PROGRAM REDUCTION
IMPLICIT NONE
INTEGER tnumber, OMP_GET_THREAD_NUM, OMP_GET_NUM_THREADS
INTEGER OMP_GET_MAX_THREADS
INTEGER I,J,K, tmax
I=1
J=1
K=1
PRINT *, "Before Par Region: I=",I," J=", J," K=",K
PRINT *, ""
call OMP_SET_NUM_THREADS(4)
! this sets the number of threads to 4
tmax = OMP_GET_MAX_THREADS()
print*,'max num threads =', tmax
!$OMP PARALLEL DEFAULT(PRIVATE) REDUCTION(+:I)
!$OMP& REDUCTION(*:J) REDUCTION(MAX:K)
tnumber=OMP_GET_THREAD_NUM()
I = tnumber
J = tnumber
K = tnumber
PRINT*,"Thread ",tnumber," I=",I," J=", J," K=",K
!$OMP END PARALLEL
PRINT *, ""
print *, "Operator + * MAX"
PRINT *, "After Par Region: I=",I," J=", J," K=",K
! can you explain what these come out to be ?
END PROGRAM REDUCTION
Here's a C example of reduction (adapted from LLNL)
#include <stdio.h>
#include <omp.h>

int main () {
int i, n, chunk;
float a[100], b[100], result;

/* Some initializations */
n = 100;
chunk = 10;
result = 0.0;
omp_set_num_threads(2);
#pragma omp parallel for private(i)
for (i=0; i < n; i++) {
  a[i] = i * 1.0;
  b[i] = i * 2.0;
}
printf("Number of threads = %d \n", omp_get_num_threads() );
omp_set_num_threads(4);
#pragma omp parallel for \
  default(shared) private(i) \
  schedule(static,chunk) \
  reduction(+:result)
for (i=0; i < n; i++)
  result = result + (a[i] * b[i]);
printf("Final result= %f\n", result);
printf("Final number of threads = %d \n", omp_get_num_threads() );
return 0;
}
We already worked with the for or do loop work sharing construct.
!$OMP PARALLEL
!$OMP DO
do i = 1, 100
print*,'Hello World'
end do
!$OMP END DO ! parallel do ends but threads still alive
!$OMP END PARALLEL ! threads rejoin
#pragma omp parallel private(i)
{
#pragma omp for
for (i=0;i<100;i++)
{
printf("Hello World\n");
} /* End parallel do, but threads are alive */
} /* End parallel region */
It would be equivalent to replace the !$OMP PARALLEL and !$OMP DO lines by the single directive !$OMP PARALLEL DO:
!$OMP PARALLEL DO
do i = 1, 100
print*,'Hello world'
end do
#pragma omp parallel for private(i)
for (i=0;i<100;i++)
{
printf("Hello World \n");
} /* end parallel for and threads rejoin */
Separating the "PARALLEL" and the "DO" allows us to control when the batch of threads are forked and joined. We might for example be able to put more than one loop inside the same parallel section. More detail below
Above, we used an if with omp_get_thread_num to assign a batch of work to a particular thread. More usually, we might use sections.
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
(void) fun1();
#pragma omp section
(void) fun2();
#pragma omp section
(void) fun3();
} /* end of sections */
} /* end of parallel region */
or
!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION
call sub1( .. )
!$OMP SECTION
call sub2( .. )
!$OMP SECTION
call sub3( .. )
!$OMP END SECTIONS
!$OMP END PARALLEL
If there are at least 3 threads, each of the 3 sections will be in parallel. If there are only 2 threads, one thread will do 2 of the sections.
Inside a parallel region you can have several work-sharing constructs, for example a parallel loop and then a set of sections. By default, the loop and the sections each end with a barrier at which all threads synchronize.
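Here is a minimal C sketch of two work-sharing constructs inside one parallel region, a loop followed by sections (the array and the printed messages are invented for the example); the same threads execute both constructs, with an implicit barrier after each.
#include <stdio.h>
#include <omp.h>

#define N 100

int main(void) {
    static double a[N];
    int i;

    #pragma omp parallel private(i)
    {
        /* first work-sharing construct: a parallel loop */
        #pragma omp for
        for (i = 0; i < N; i++)
            a[i] = 2.0 * i;
        /* implicit barrier here: all threads finish the loop first */

        /* second work-sharing construct: independent sections */
        #pragma omp sections
        {
            #pragma omp section
            printf("section 1 done by thread %d\n", omp_get_thread_num());
            #pragma omp section
            printf("section 2 done by thread %d\n", omp_get_thread_num());
        }
        /* implicit barrier at the end of the sections as well */
    }
    return 0;
}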
You can also put a serial region inside a parallel region.
!$OMP SINGLE
read*,b
!$OMP END SINGLE
or
#pragma omp single
{
scanf("%d",&b);
}
If "b" is a shared variable in the parallel region, then all the threads in the region will get the same value of "b".
As in the special case of the parallel loop, a main concern is how to get the correct data into and out of each thread. We have to be cautious when using common blocks. Here are some examples from Chandra et al that illustrate pitfalls and solutions. This first example attempts to use the thread number to specify which piece of the work each thread should perform.
program oops
common /mybnds/ ifirst, ilast
real*4 a(100000)
n = 100000
!$omp parallel private(iam, numthreads, size)
!$omp& private (ifirst, ilast)
numthreads = omp_get_num_threads()
iam = omp_get_thread_num()
size = (n + numthreads -1 )/numthreads
ifirst = iam*size + 1
ilast = min((iam+1)*size, n)
call work(a)
!$omp end parallel
end
subroutine work(a)
common /mybnds/ ifirst, ilast
real*4 a(*)
do i = ifirst,ilast ! but since ifirst and ilast are part
! of common block they don't vary with the
! stuff done in the thread ..
end do
return
end
Alas, the subroutine work pays no attention to the nice arithmetic and the private declaration in the parallel region. It sees only the globally shared common-block copies of ifirst and ilast, not the private copies set in the parallel region, hence confusion.
program worksOK
common /mybnds/ ifirst, ilast
real*4 a(100000)
n = 100000
!$omp parallel private(iam, numthreads, size)
!$omp& private (ifirst, ilast)
numthreads = omp_get_num_threads()
iam = omp_get_thread_num()
size = (N + numthreads -1 )/numthreads
ifirst = iam*size + 1
ilast = min((iam+1)*size, N)
call work(a,ifirst,ilast)
!$omp end parallel
end
subroutine work(a,ifirst,ilast)
real*4 a(*)
do i = ifirst,ilast
...... stuff on my part of a
end do
return
end
By passing ifirst and ilast as arguments to work, the references inside work are to the private (as opposed to the global) variables.
The threadprivate directive lets us avoid messy argument lists. It lets common blocks (Fortran) or global and static variables (C, C++) be private to each thread.
program works2
common /bounds/ istart, iend
!$omp threadprivate(/bounds/)
real*4 a(10000)
n = 10000
!$omp parallel private(iam, nthreads, chunk)
nthreads = omp_get_num_threads()
iam = omp_get_thread_num()
chunk = (N + nthreads -1 )/nthreads
istart = iam*chunk + 1
iend = min((iam+1)*chunk, N)
call work(a)
!$omp end parallel
end
subroutine work(a)
real*4 a(10000)
common /bounds/ istart, iend
!$omp threadprivate(/bounds/)
do i=istart,iend
...... stuff on my chunk of a
end do
return
end
If you want more than one common block in a threadprivate directive, you need commas as well as the paired slashes. Notice that the private clause for istart and iend disappeared from the parallel region. Only the one threadprivate statement appears; in fact a further private clause for istart, iend would be incorrect.
Typically, the threadprivate variables are initialized within the parallel region. But if we want to copy in data from the master thread's copies, we can specify
!$omp parallel
!$omp& copyin(m,n)
where m and n are members of a threadprivate common block (and as usual we could list more, separated by commas, e.g., copyin(m,n,j)).
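In C the analogue of a threadprivate common block is a set of threadprivate global (or static) variables. Here is a minimal sketch (the file-scope variables m and n are invented for the example, standing in for the common-block members above):
#include <stdio.h>
#include <omp.h>

/* file-scope globals playing the role of a threadprivate common block */
int m, n;
#pragma omp threadprivate(m, n)

int main(void) {
    m = 10;   /* set in the master thread before the parallel region */
    n = 20;

    /* copyin copies the master thread's m and n into each thread's
       private copies at the start of the region */
    #pragma omp parallel copyin(m, n)
    {
        printf("thread %d sees m=%d n=%d\n", omp_get_thread_num(), m, n);
    }
    return 0;
}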
Perhaps the most frequent cause of confusion and bugs is keeping track of what values exist inside each thread (much as passing data into and out of functions and subroutines can cause problems).
So it's good to be explicit about which data is shared and which is private to the threads. Also, when do the values seen by the various threads get brought back into agreement?
OpenMP has some default synchronizations that are usually helpful.
At the end of each parallel region, all the threads are joined into one thread. This is a required synchronization point. For example, consider
!$OMP PARALLEL DO
do i = 1, 100
print*,'Hello world'
end do
or in C
#pragma omp parallel for private(i)
for (i=0;i<100;i++) {
printf("Hello world \n");
}
Then "end do" here is the end of a parallel region, hence a required synchronization point.
Here's an equivalent loop.
!$OMP PARALLEL
!$OMP DO
do i = 1, 100
print*,'Hello World'
end do
!$OMP END DO NOWAIT
!$OMP BARRIER
!$OMP END PARALLEL
or in C
#pragma omp parallel
{
#pragma omp for private(i) nowait
for (i=0;i<100;i++) {
printf("Hello world \n");
} /* end of loop without synchronization */
#pragma omp barrier
} /* end of parallel region */
!$OMP END DO by itself implies a barrier, i.e., all the
threads have to reach this point before going on.
Putting !$OMP END DO NOWAIT allows the threads to go on to
the next statement without waiting. That is not very helpful here, as the next
statement is a !$OMP BARRIER, which makes them all wait there instead.
Let's recall the sax.f code which we experimented with last time.
sum = 0.d0
!$OMP PARALLEL DO
!$OMP& REDUCTION(+:sum)
do i=1,j
a(i) = alpha* b(i) + a(i)
sum = sum + a(i)
end do
!$OMP END PARALLEL DO ! all threads killed
sum = 1./sum ! the one living serial thread does the divide
!$OMP PARALLEL DO ! threads are reborn
do i =1,j
a(i) = a(i)*sum
end do
!$OMP END PARALLEL DO
sax.c
sum = 0.e0;
#pragma omp parallel for private(i) reduction(+:sum)
for (i=0; i < j; i++ ) {
a[i] = alpha*b[i] + a[i];
sum = sum + a[i];
} /* end of parallel for, threads die */
sum = 1./sum;
#pragma omp parallel for private(i)
for (i=0; i < j; i++) {
a[i] = a[i]*sum;
} /* end of parallel for, threads die */
This creates a batch of threads, then waits for them all at the first !$OMP END PARALLEL DO, then in serial, inverts sum, then immediately starts up a new parallel region and batch of threads with !$OMP PARALLEL DO.
The following does the same operations with a synchronization after the first !$OMP END DO. A difference is that the threads are not all killed and restarted. In the following, each processor does the inversion of sum.
sum = 0.d0
!$OMP PARALLEL
!$OMP DO
!$OMP& REDUCTION(+:sum)
do i=1,j
a(i) = alpha* b(i) + a(i)
sum = sum + a(i)
end do
!$OMP END DO ! a barrier here but the threads all live on
sum = 1./sum ! so everybody does the same division
!$OMP DO ! threads do some more work (but don't have to be
! created ..
do i =1,j
a(i) = a(i)*sum
end do
!$OMP END DO
!$OMP END PARALLEL
#pragma omp parallel
{
#pragma omp for reduction(+:sum) private(i)
for (i=0; i < j; i++ ) {
a[i] = alpha*b[i] + a[i];
sum = sum + a[i];
} /* end of omp for the threads do not die, they do synchronize */
sum = 1./sum; /* all the threads do this divide, by same value sum */
#pragma omp for private(i)
for (i=0; i < j; i++) {
a[i] = a[i]*sum;
} /* end of omp for */
} /* end of parallel region, now the threads die */
The time to compute 1./sum is negligible compared to the time
to synchronize threads, but if we want only one thread to compute
1./sum, we could do the following. Again there is an implied
synchronization at the !$OMP END SINGLE.
sum = 0.d0
!$OMP PARALLEL
!$OMP DO
!$OMP& REDUCTION(+:sum)
do i=1,j
a(i) = alpha* b(i) + a(i)
sum = sum + a(i)
end do
!$OMP END DO ! a barrier here
!$OMP SINGLE
sum = 1./sum ! just one of the threads does the divide ..
!$OMP END SINGLE ! barrier
!$OMP DO ! everybody uses the same value "sum"
do i =1,j
a(i) = a(i)*sum
end do
!$OMP END DO
!$OMP END PARALLEL
C version.
#pragma omp parallel
{
#pragma omp for reduction(+:sum) private(i)
for (i=0; i < j; i++ ) {
a[i] = alpha*b[i] + a[i];
sum = sum + a[i];
} /* end of omp for */
#pragma omp single
{
sum = 1./sum;
} /* this time only one thread did this divide */
#pragma omp for private(i)
for (i=0; i < j; i++) {
a[i] = a[i]*sum;
} /* end of omp for */
} /* end parallel region */
The SINGLE construct would make more sense if it entailed something time consuming, e.g., reading or writing to a file (which could entail waiting for a hard drive, which could take a couple of milliseconds, equivalent to some millions of floating point operations).
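For instance, here is a minimal C sketch in which one thread writes a checkpoint file from inside a parallel region (the file name and array are invented for the example); the other threads wait at the implicit barrier at the end of the single.
#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void) {
    static double a[N];
    int i;

    #pragma omp parallel private(i)
    {
        #pragma omp for
        for (i = 0; i < N; i++)
            a[i] = 2.0 * i;

        /* only one thread writes the (made-up) checkpoint file;
           the others wait at the implicit barrier after the single */
        #pragma omp single
        {
            FILE *fp = fopen("checkpoint.dat", "w");
            if (fp != NULL) {
                fwrite(a, sizeof(double), N, fp);
                fclose(fp);
            }
        }
        /* implicit barrier: all threads continue only after the write */
    }
    return 0;
}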
Recalling that the point of the multiplication by 1./sum is just to keep a(i) from getting too large, we could decide that not all processors need to have the same copy of sum and don't need to wait before rescaling by 1./sum.
!$OMP PARALLEL SHARED (A,alpha,B,j) PRIVATE(i,sum)
sum = 0.d0
!$OMP DO
do i=1,j
a(i) = alpha* b(i) + a(i)
sum = sum + a(i)
end do
!$OMP END DO NOWAIT
sum = 1./sum ! everybody does a divide
! with their own "sum"
! and can get started on the next do
! w/o waiting on all the slowpokes
!$OMP DO
do i =1,j
a(i) = a(i)*sum
end do
!$OMP END DO NOWAIT ! possibly a stupid compiler would synchronize
! both here and at the next step
!$OMP END PARALLEL ! always synchronize at end of parallel region
or in C ..
#pragma omp parallel shared(a,alpha,b,j) private(i,sum)
{ sum = 0.0;
#pragma omp for nowait
for (i=0; i < j; i++ ) {
a[i] = alpha*b[i] + a[i];
sum = sum + a[i];
} /* end of omp for the threads do not die, they do NOT synchronize */
sum = 1./sum; /* all the threads do this divide, each has its own "sum" */
#pragma omp for nowait
for (i=0; i < j; i++) {
a[i] = a[i]*sum; /* the value of sum could be different in different threads */
} /* end of omp for, NO synchronization */
} /* end of parallel region, now the threads die, and we synchronize */
The SINGLE region is executed by exactly one thread. In a CRITICAL region, each thread performs the computations in the region, but the threads take turns, one at a time; for example, they can safely update the same global variable. So, returning to the idea that all processors should scale by the same quantity 1./sum, we replace the reduction clause by an equivalent construct (in practice the reduction is likely to be more efficient).
sum = 0.d0
!$OMP PARALLEL shared(j,alpha,a,b,sum) private(i,mysum )
mysum = 0.d0
!$OMP DO
do i=1,j
a(i) = alpha* b(i) + a(i)
mysum = mysum + a(i)
end do
!$OMP END DO ! wait here
!$OMP CRITICAL
sum = sum + mysum ! everybody does their own bit in turn
!$OMP END CRITICAL
!$OMP BARRIER ! make sure every contribution to sum is in
!$OMP SINGLE
sum = 1./sum ! just one thread does the divide
!$OMP END SINGLE ! barrier
!$OMP DO
do i =1,j
a(i) = a(i)*sum
end do
!$OMP END DO
!$OMP END PARALLEL
or in C ..
sum = 0.0;
#pragma omp parallel shared(j,a,alpha,b,sum) private(i,mysum)
{ mysum = 0.0;
#pragma omp for
for (i=0; i < j; i++ ) {
a[i] = alpha*b[i] + a[i];
mysum = mysum + a[i];
} /* wait here at end of omp for */
#pragma omp critical
{
sum += mysum; /* everybody adds in their own update */
} /* taking turns */
#pragma omp barrier /* make sure every contribution to sum is in */
#pragma omp single
sum = 1./sum; /* only one thread does this divide; all get the same "sum" */
#pragma omp for nowait
for (i=0; i < j; i++) {
a[i] = a[i]*sum;
} /* end of omp for, no synchronization needed: sum is the same everywhere */
} /* end of parallel region, now the threads die, and we synchronize */
The main appeal of OpenMP is its simplicity. In a search for parallel efficiency with many threads, teasing parallelism out of all regions demands more tricks, so the kind of low-level programming used in Pthreads comes into play. We can use ATOMIC constructs (in effect, one-statement CRITICAL regions that let threads update a shared variable one at a time) or "locks", so that any given thread can be given exclusive write permission on a variable (and the lock can be handed around). Locks are beyond the scope of this short course, but the book Using OpenMP: Portable Shared Memory Parallel Programming by Chapman, Jost, and van der Pas (MIT Press, 2008) is useful.
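Here is a minimal C sketch of an ATOMIC construct used in place of the CRITICAL section above (the array and its size are invented for the example); the atomic directive protects the single update of the shared sum.
#include <stdio.h>
#include <omp.h>

#define N 100000

int main(void) {
    static double a[N];
    double sum = 0.0;
    int i;

    for (i = 0; i < N; i++)
        a[i] = 1.0;

    #pragma omp parallel private(i)
    {
        double mysum = 0.0;
        #pragma omp for
        for (i = 0; i < N; i++)
            mysum += a[i];

        /* atomic acts as a one-statement critical region protecting
           this single update of the shared variable sum */
        #pragma omp atomic
        sum += mysum;
    }
    printf("sum = %f (expected %d)\n", sum, N);
    return 0;
}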
Here's one more version with the nonblocking MASTER construct.
sum = 0.d0
!$OMP PARALLEL SHARED (A,alpha,B,j,sum) PRIVATE(i,mysum)
mysum = 0.d0
!$OMP DO
do i=1,j
a(i) = alpha* b(i) + a(i)
mysum = mysum + a(i)
end do
!$OMP END DO NOWAIT ! let's not wait
!$OMP MASTER
sum = 1./mysum ! only thread 0 does this with own version of mysum
!$OMP END MASTER ! no one else has to wait on thread 0
!$OMP DO ! some guys have the new sum, some have 0.d0
do i =1,j
a(i) = a(i)*sum
end do
!$OMP END DO ! synchronize here (by default)
!$OMP END PARALLEL ! synchronize here (always at end of parallel region)
or in C ..
sum = 0.0;
#pragma omp parallel shared(j,a,alpha,b,sum) private(i,mysum)
{ mysum = 0.0;
#pragma omp for nowait
for (i=0; i < j; i++ ) {
a[i] = alpha*b[i] + a[i];
mysum = mysum + a[i];
} /* end of omp for; with nowait, no synchronization here */
#pragma omp master
sum = 1./mysum; /* node zero does this .. with no synchronization .. */
#pragma omp for nowait
for (i=0; i < j; i++) {
a[i] = a[i]*sum; /* some threads multiplying by zero ? */
} /* end of omp for, NO synchronization */
} /* end of parallel region, now the threads die, and we synchronize */
Of course, we could avoid the multiplications by sum = 0. by putting
!$OMP BARRIER or #pragma omp barrier
after the !$OMP END MASTER (before the #pragma omp for). The MASTER construct has thread 0 do the "sum = 1./mysum". Since there is no barrier after the MASTER, all the threads immediately start on the loop, and some of them may still see sum as 0.d0.
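Here is a minimal C sketch of the barrier fix just described, keeping the structure (and the quirk) of the MASTER example above, in which thread 0 scales by its own partial sum; the array sizes and initialization are invented for the example.
#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void) {
    static double a[N], b[N];
    double alpha = 2.0, sum = 0.0;
    int i;

    for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 1.0; }

    #pragma omp parallel shared(a, b, alpha, sum) private(i)
    {
        double mysum = 0.0;
        #pragma omp for nowait
        for (i = 0; i < N; i++) {
            a[i] = alpha*b[i] + a[i];
            mysum += a[i];
        }
        #pragma omp master
        sum = 1./mysum;    /* thread 0 scales by its own partial sum */
        /* the barrier keeps every other thread from reading sum
           before thread 0 has written it */
        #pragma omp barrier
        #pragma omp for
        for (i = 0; i < N; i++)
            a[i] = a[i]*sum;
    }
    printf("a[0] = %f\n", a[0]);
    return 0;
}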
When would the MASTER construct be useful ?
Exercise F (C)
Take the code sax-new.f. (sax.c)
Look at your results from last time (file gput16-shared4?). How were the Mflop rates? Get rid of the SCHEDULE(runtime). How are the Mflop rates now? What do you think happened? Compare the max threads in the LSF output.
Do both the C and Fortran versions (for C, see sax.c; you can use the file makecsaxpgi to compile).
In the file makesaxpgi (or makecsaxpgi), substitute "-tp k8-64" for "-tp x64". Compare the Mflop rates. Is the pgf90 man page accurate when it recommends the "-tp x64" flag as useful for an AMD processor (such as those on the shared_memory queue)?
Go through the examples above one at a time, changing the sax-new.f or sax.c code. Recompile and rerun. Do you see differences in the computed Mflop rates? More for the long vector or the short one? Why? Please point out any errors to me.