When memory is shared between processors, we can coordinate the processors by having them store data to RAM. On distributed memory machines we could similarly coordinate by writing to a global file system, but that would be orders of magnitude slower. The most common standards for shared memory machines are OpenMP and the POSIX threads standard, Pthreads. We won't cover Pthreads here, but a Fortran interface is available on the p690, and the Intel ifc compiler has an automatic optimizer that (apparently) uses Pthreads, though you can't see the code it produces.
Shared memory codes can be run either on the 2 processors of a Xeon Blade or on the IBM p690. On the BladeCenter, a primary limitation is that the bandwidth to memory is shared by the two CPUs. But if you aren't keeping the memory bus busy, OpenMP is an easy way to increase performance by using both processors. There's a "speed-up" example below. On the p690, OpenMP allows a single program to use a good deal of memory, and it is not too long a wait to get 4 or 8 processors, which OpenMP will let us use. Again, this lecture mainly follows the text "Parallel Programming in OpenMP" by Chandra, Dagum, Kohr, Maydan, McDonald, and Menon, Morgan Kaufmann Publishers, 2001.
On the BladeCenter, the PGI and Intel compilers both support use of OpenMP directives. To ensure that we get 2 processors on the same node, include in the bsub script a line #BSUB -R span[ptile=2] . The following is an example of a simple bsub script.
#!/bin/csh
#BSUB -R span[ptile=2]
#BSUB -W 10
#BSUB -n 2
#BSUB -o out.%J
./foo
If this script is saved in a file named bit, it would be run by
>bsub < bit
For the PGI compilers (pgf77, pgf90, and pgcc; there is no C++ compiler), the OpenMP compile flag is -mp. To get the path set to use the PGI compilers, type "add pgi". The PGI compilers recognize OpenMP directives but do not automatically insert them. For the Intel compilers ifc and icc, the flag is -openmp (having added the path by "add intel"). In theory, one should set the environment variable OMP_NUM_THREADS to 2, but I think this is the default. For ifc and icc, the -openmp flag causes the compiler to recognize user-inserted OpenMP directives; judging from the assembler output (-S compiler flag), the compiler produces different code with the -openmp flag than without, even if there are no OpenMP directives. The -parallel flag enables automatic shared memory parallelization. Unfortunately, there seems to be no option to automatically insert OpenMP directives into Fortran or C code.
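For example, to compile an OpenMP code on the BladeCenter with the Intel Fortran compiler (foo.f is a placeholder file name),
>add intel
>ifc -O3 -openmp foo.f -o foo
and then submit the executable with a bsub script like the one above.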
On the p690, with the xlf (or xlf90) compiler, OpenMP is enabled by the -qsmp=omp flag. Here you should be careful to set the OMP_NUM_THREADS environment variable as desired. To disable automatic parallelization but still recognize OpenMP directives, try the compiler flag -qsmp=omp,noauto. As with the ifc and icc compilers, there appears to be no mechanism for automatically inserting OpenMP directives into Fortran or C codes.
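For example (again with foo.f as a placeholder file name),
>xlf90 -O3 -qsmp=omp,noauto foo.f -o foo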
In tcsh, for example
>setenv OMP_NUM_THREADS 4
For the p690, the bsub file could be almost the same
#!/bin/csh
#BSUB -W 10
#BSUB -n 4
#BSUB -o out.%J
./foo
but here we are asking for 4 processors, and the #BSUB -R line is unnecessary.
Many scientific codes have loops which involve many computations and which account for most of the run time. If the computations in a loop are independent of one another and if the loop size is specified in advance, then it can be easy to have each of several processors read part of the data, work on it, and write it back to the common memory.
The Fortran syntax for a parallel do directive is
!$omp parallel do [clause [,] [clause ...]]
do index = first, last, stride
computations ...
end do
[!$omp end parallel do]
The variables first, last, stride are fixed so that the compiler can decide how to divide up the computation. The C syntax is
#pragma omp parallel for [clause [,] [clause ...]]
for (index = first; test_expr; increment_expr){
computations ...
}
For C, the test_expr can involve <, <=, etc. The increment_expr needs to be the equivalent of stride, i.e., it must increment by the same amount on each iteration of the loop. gotos (Fortran) or breaks (C) which would take execution out of the loop (or to a different iteration of the parallel loop) are not allowed.
The square brackets [ ] indicate optional expressions.
Clauses give some additional control over the loop. Most commonly, private and shared (or reduction) scoping clauses define how variables are to be treated. Recall that a shared variable has only one value, common to all threads, while a private variable has a separate value in each thread. In both Fortran and C the parallel loop index is private by default. The languages differ for nested loops: in Fortran an inner loop index is also private by default, while in C an inner loop index is shared.
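Because these defaults differ between the languages, it is safest to scope loop indices explicitly. A minimal Fortran sketch (the array a and the bounds m, n are assumed for illustration):
!$omp parallel do private(i,j) shared(a)
      do j = 1, n
         do i = 1, m
            a(i,j) = 0.0
         end do
      end do
!$omp end parallel do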
While scoping clauses are the most common, other clauses include schedule, if, ordered, and copyin.
schedule clauses control how the iterations are split across threads. An if clause makes the loop parallel or serial depending on a user-supplied condition (there needs to be enough work to cover the start-up cost of the parallel loop). The ordered clause says that some work must be done in loop order, limiting the available parallelism. The copyin clause initializes so-called threadprivate variables. There can be many scoping and copyin clauses, as there can be many variables, but at most one each of the schedule, if, and ordered clauses.
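For example, here is a sketch (my example, not from the text) combining scoping, schedule, and if clauses; the cutoff of 1000 iterations is arbitrary:
!$omp parallel do shared(a,b) private(i)
!$omp+ schedule(static) if(n .gt. 1000)
      do i = 1, n
         a(i) = a(i) + b(i)
      end do
!$omp end parallel do
With fewer than 1000 iterations the loop runs serially, avoiding the thread start-up cost.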
Consider the code
subroutine saxpy(a, x, y, n)
integer i, n
real y(n), a, x(n)
do i = 1, n
y(i) = a * x(i) + y(i)
end do
return
end
A parallel OpenMP version is
subroutine saxpy(a, x, y, n)
integer i, n
real y(n), a, x(n)
!$omp parallel do
do i = 1, n
y(i) = a * x(i) + y(i)
end do
return
end
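A loop that accumulates into a single scalar can use a reduction clause instead. A sketch of a dot product (my example, not from the text):
      s = 0.0
!$omp parallel do reduction(+:s)
      do i = 1, n
         s = s + x(i)*y(i)
      end do
Each thread accumulates a private partial sum, and the partial sums are combined into s at the end of the loop.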
Consider a more complicated example. This is timing a BLAS 2.5
operator in serial vs. in parallel.
program calgemv
C This is a driver program to test the Mflop rate for dgemvt
implicit none
integer lda, n, i, j,k, idum , kmax, kout, kk
parameter(lda=2000)
real*8 a(lda,lda),x(lda),b(lda),y(lda),w(lda),z(lda)
c call dgemvt(m,n,alpha,a,lda,
c + x,incx,y,incy,
c + beta,w,incw,z,incz)
real*8 temp,mflops,sum
character*1 TRANSA,TRANSB
real*4 ftime, pretim, entim
external ftime
c print*, 'input n'
c read*, n
c Initialize matrices a and b
do kk = 1,20
n = 100*kk
do j = 1,n
do i = 1,n
a(i,j) = 1./( 2.*i + j -1.)
end do
x(j) = a(1,j)
y(j) = x(j)
w(j) = x(j)
z(j) = x(j)
b(j) = 0.d0
end do
pretim = ftime()
c
c Write your matrix vector multiply here
c
kmax = 40000. * (100./n)**2
do k = 1,kmax
call dgemvt(n,n,1.d0 ,a,lda,
+ x,1,y,1,
+ 1.d0 ,w,1,z,1)
*
* DGEMVT is a BLAS 2.5 operator. It performs
* x = beta * A'* y + z
* w = alpha * A * x
*
sum = 0.d0
do i = 1,n
sum = sum + abs(w(i))
end do
do i =1,n
x(i) = w(i)/sum
end do
end do
entim = ftime() - pretim
print *,' entim =', entim
if (entim.lt.1.e-6) then
print*,' too fast to time'
else
c
c The next formula assumes the matrix was square n by n
c Possibly, you will have to repeat the matvec
c k times to get a timing. If so you will want
c to multiply the RHS of the equation below by k.
c
mflops = n
print*,mflops,kmax
print*,' size =', n
mflops = kmax* (4.D0* mflops * mflops)/entim/1.d6
print*,' Mflop ', mflops
endif
end do
stop
end
SUBROUTINE DGEMVT( M, N, ALPHA, A, LDA,
* X, INCX, Y, INCY,
* BETA, W, INCW, Z, INCZ )
* .. Scalar Arguments ..
DOUBLE PRECISION ALPHA, BETA
INTEGER INCX, INCY, INCW, INCZ, LDA, M, N
* .. Array Arguments ..
DOUBLE PRECISION A( LDA, *), X(*), Y(*), W(*), Z(*)
DOUBLE PRECISION W2(2000)
* ..
*
* Purpose
* =======
*
* DGEMVT is a blas 2.5 operator. It performs
*
* x = beta * A'* y + z
* w = alpha * A * x
*
* GEMVT performs both a left and right matrix vector multiply
* Combining both matrix vector multiplies into one call
* is meant to reduce the volume of data transferred
* from main memory to the CPU.
*
* Parameters
* ==========
*
* M - INTEGER.
* On entry, M specifies the number of rows of the matrix A.
* M must be at least zero.
* Unchanged on exit.
*
* N - INTEGER.
* On entry, N specifies the number of columns of the matrix A.
* N must be at least zero.
* Unchanged on exit.
*
* ALPHA - DOUBLE PRECISION.
* On entry, ALPHA specifies the scalar used in the matrix vector
* multiplication on the right.
* Unchanged on exit.
*
* A - DOUBLE PRECISION array of DIMENSION ( LDA, n ).
* Before entry, the leading m by n part of the array A must
* contain the matrix of coefficients.
* Unchanged on exit.
*
* LDA - INTEGER.
* On entry, LDA specifies the first dimension of A as declared
* in the calling (sub) program. LDA must be at least
* max( 1, m ).
* Unchanged on exit.
*
* X - DOUBLE PRECISION array of DIMENSION at least
* (n-1)*abs(incx)+1.
* The input value of X is not used. X is the result of
* multiplying the vector y on the left of A and adding it to z.
* On exit, x = beta * A'* y + z.
*
* INCX - INTEGER.
* On entry, INCX specifies the increment for the elements of
* X. INCX must not be zero.
* Unchanged on exit.
*
* Y - DOUBLE PRECISION array of DIMENSION at least
* (m-1)*abs(incw)+1.
* Y is a column vector that is used in vector multiplication on
* the left.
* Unchanged on exit.
*
* INCY - INTEGER.
* On entry, INCY specifies the increment for the elements of
* Y. INCY must not be zero.
* Unchanged on exit.
*
* BETA - DOUBLE PRECISION.
* On entry, BETA specifies the scalar used in vector
* multiplication on the left.
* Unchanged on exit.
*
* W - DOUBLE PRECISION array of DIMENSION at least
* ( m - 1 ) * abs(incw) + 1.
* W is a column vector that is the result of the vector
* multiplication on the right.
* On exit, w = alpha * A * x.
*
* INCW - INTEGER.
* On entry, INCW specifies the increment for the elements of
* W. INCW must not be zero.
* Unchanged on exit.
*
* Z - DOUBLE PRECISION array of DIMENSION at least
* ( n - 1 ) * abs(incz) + 1.
* Z is a row vector that is used in vector multiplication on the
* left.
* Unchanged on exit.
*
* INCZ - INTEGER.
* On entry, INCZ specifies the increment for the elements of
* Z. INCZ must not be zero.
* Unchanged on exit.
DOUBLE PRECISION DDOT
INTEGER J ,K, I, KOLD, II
DOUBLE PRECISION TZ, XX, ZZ, XJ, XJ1, XJ2
DOUBLE PRECISION XJ3, ZJ, ZJ1, ZJ2, ZJ3
DOUBLE PRECISION XJ4, XJ5, ZJ4, ZJ5, T1, T2
DOUBLE PRECISION T3, T4, T5, T6
CHARACTER*1 TRANS
*
* code for all increments equal to one.
*
IF (INCX.eq.1.and.INCY.eq.1.and.INCZ.eq.1.and.INCW.eq.1) THEN
DO 70 J = 1,M
W(J) = 0.D0
70 CONTINUE
do 75 J = 1,2000
w2(j) = 0.d0
75 continue
*
* Make sure K is a multiple of 8 and no smaller than 4
K=MAX (8*INT(2.8D4/M/8),4)
* Make sure K is no larger than 160
K=MIN (K,160)
* K = 16
CALL DCOPY( N, Z, 1, X, 1 )
!$omp parallel do private (i,j)
!$omp+ reduction(+:w2)
DO 80 J = 1, N/K
CALL DGEMV( 'T', M, K, BETA, A(1,1+(J-1)*K), LDA,
* Y, 1, 1.D0, X(1+(J-1)*K), 1 )
CALL DGEMV( 'N', M, K, ALPHA, A(1,1+(J-1)*K), LDA,
* X(1+(J-1)*K), 1, 1.D0, W2, 1 )
80 CONTINUE
!$omp end parallel do
II = (N/K) * K
CALL DGEMV( 'T', M, N-II, BETA, A(1,1+II), LDA, Y, 1,
* 1.D0, X(1+II), 1 )
CALL DGEMV( 'N', M, N-II, ALPHA, A(1,1+II), LDA, X(1+II),
* 1, 1.D0, W2, 1 )
*
* CODE FOR UNEQUAL INCREMENTS OR EQUAL INCREMENTS
* NOT EQUAL TO ONE.
*
ELSE
* Left out here.
*
END IF
CALL DCOPY( M, W2, 1, W, 1 )
RETURN
END
For small matrices, the serial DGEMVT code is about the same speed as a plain call to the BLAS operator DGEMV.
Using column blocking, we call DGEMV twice for each column block,
where block sizes (number of columns) are chosen so that the column
block stays in cache.
These lines of code (from DGEMVT above, repeated below) specify that the number K of columns in a block is chosen as a multiple of 8. A column block has M rows, and M*K is about 2.8D4 double precision entries, corresponding to about 230 KBytes of storage, i.e., about half of the L2 cache. Changing 2.8D4 to 2.8D5 reduces the DGEMVT speed to that of DGEMV.
* Make sure K is a multiple of 8 and no smaller than 4
K=MAX (8*INT(2.8D4/M/8),4)
* Make sure K is no larger than 160
K=MIN (K,160)
* K = 16
When A is too large to fit in cache, DGEMVT runs at almost twice the Mflop rate of DGEMV (about 1200 Mflops vs. 750 Mflops for DGEMV, using the Intel BLAS library). The OpenMP version of DGEMVT runs slower than the serial version for in-cache matrices, but faster for out-of-cache matrices (1200 Mflops for the serial code vs. 1400 Mflops for the parallel code).
Note there are only 5 lines of code added to the serial version.
!$omp parallel do private (i,j)
!$omp+ reduction(+:w2)
DO 80 J = 1, N/K
CALL DGEMV( 'T', M, K, BETA, A(1,1+(J-1)*K), LDA,
* Y, 1, 1.D0, X(1+(J-1)*K), 1 )
CALL DGEMV( 'N', M, K, ALPHA, A(1,1+(J-1)*K), LDA,
* X(1+(J-1)*K), 1, 1.D0, W2, 1 )
80 CONTINUE
!$omp end parallel do
Three of the added lines were OpenMP directives. The others were needed to make W2 (a reduction variable) an array of fixed length. Originally I had used W, a "variable length" (assumed-size) array, but the compiler refused to use a "variable length" array as a reduction variable. Declaring W2 with a fixed length was easy, but it is easy to forget, so it could cause an error when the code is rerun with larger matrices.
For a 533 MHz data bus, the peak speed for DGEMVT (4 flops per double precision number in A) is 4*533 = 2133 Mflops. We attained about 1400 Mflops.
size   serial Mflops   parallel Mflops
 100       2740            1060
 200       1468            1290
 300        984            1454
 400       1006            1168
 500       1026            1053
 600       1067            1159
 700       1073            1178
 800       1096            1240
 900       1125            1278
1000       1096            1301
1100       1127            1299
1200       1128            1352
1300       1125            1352
1400       1164            1343
1500       1163            1362
1600       1158            1342
1700       1164            1363
1800       1217            1423
1900       1213            1428
2000       1230            1455
The above results were for code linked to the Intel BLAS library, run on a 2-CPU 3.06 GHz Xeon motherboard with a 533 MHz data bus. The codes were compiled with ifc -O3 -tpp7 -xM, enabling vectorization. The parallel code also had the additional flag -openmp. The -openmp flag is necessary both for compiling and for linking.
Here we used a "CPU timer", the C standard clock function. Usually, for parallel computation we should be using a wall clock timer instead. A CPU timer may not be timing I/O, for example, and what we're really interested in for parallel computation is how long the whole computation takes, so we should be timing everything.
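OpenMP provides a portable wall clock timer, omp_get_wtime(). A minimal sketch of its use (assuming the code is compiled with the OpenMP flags above):
      double precision t0, t1, omp_get_wtime
      t0 = omp_get_wtime()
c     ... the computation being timed ...
      t1 = omp_get_wtime()
      print *, ' wall clock seconds =', t1 - t0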
Automatic parallelization of the loop failed because it was "too short".
On the same 2-CPU Xeon Blades, OpenMP parallelization of an out-of-cache matrix vector multiply failed to produce any speedup over the serial version. In that case we perform only 2 flops per entry of A fetched from RAM; the theoretical peak speed would be 2*533 = 1066 Mflops, and we actually get about 700 Mflops, serial or parallel.
What do we mean by a data dependency? Here's an example.
a(1) = 1
a(2) = 1
!$omp parallel do
do i = 2,n ! produce some terms of the Fibonacci sequence
a(i+1) = a(i) + a(i-1)
end do
This code is innately serial: in each iteration, the new value a(i+1) depends on the previous two values. So a parallel do directive around this loop would either be (correctly) ignored or the compiler would produce erroneous code. With the Intel compiler I got
>ifc -openmp fib.f -o fib
ifc: Command line warning: openmp requires C style preprocessing; fpp level is reset to 2
program FIB
fib.f(5) : (col. 0) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
The compiler parallelized the loop despite the dependence, so generally we can't expect that the compiler will discover dependencies for us.
So we'd better look at each variable within a do loop and check its dependencies. If the variable is only read and never written, then it's okay. If the variable is written on only one iteration and not read on other iterations, then we're also okay. If a variable is written and also read on different iterations, we have a dependence. Array variables can be especially tricky. Here are a few examples.
do i = 2, n, 2
a(i) = a(i) + a(i-1)
end do
This would be okay: a(i) is only written for i even, so each element is written at most once, and a(i-1) has an odd index and is never written.
do i=1,n/2
a(i) = a(i) + a(i + n/2)
end do
This is okay because i + n/2 > n/2, so the elements read are never among the elements written.
do i=1, n/2 + 1
a(i) = a(i) + a(i + n/2)
end do
This would not be good, because a(n/2+1) could be updated by one processor before it is used on the right hand side by another.
do i=1,n
a(idx(i)) = a(idx(i)) + b(idx(i))
end do
If idx maps i = 1:n onto itself one-to-one and onto (a permutation), there is no confusion. But if j = idx(i) = idx(i') for two different iterations i and i', we have a race condition, which can cause different answers in serial and parallel.
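If idx is not a permutation, one possible fix (at some cost in speed) is to protect each update with an atomic directive, sketched below:
!$omp parallel do private(i)
      do i = 1, n
!$omp atomic
         a(idx(i)) = a(idx(i)) + b(idx(i))
      end do
The atomic directive guarantees that each update of an element of a completes before another thread updates the same element.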
Here are some examples with subroutines (also from Chandra, et al).
subroutine add(c,a,b)
common /count/ cnt
real*8 c, a, b
integer cnt
c = a+b
cnt = cnt + 1
end
subroutine smooth(a, n, i)
integer n, i
real*8 a(n)
a(i) = ( a(i) + a(i-1) + a(i+1) ) / 3.0
end
subroutine add_count(a, n, i)
common /count/ cnt
integer n, i, cnt
real*8 a(n)
a(i) = a(i) + cnt
end
Assume the following loops call the above subroutines.
Are iterations independent?
do i = 1,n
call add(c(i), a(i), b(i) )
end do
Okay?
do i=2,n-1
call smooth (a, n, i)
end do
Okay?
do i = 1,n
call add_count(a, n, i)
end do
Loop-carried dependencies occur when a value written by one iteration of a loop is read by another iteration. Dependencies within a single iteration do not cause problems. Here's a loop in which one iteration may or may not influence the next.
y = 0
do i= 1,n
if (rand().lt. .5) y = rand()
a(i) = y
end do
Could this loop be parallelized?
Classification of dependencies in terms of data flow can help in converting loops to run in parallel. The loop
do i=2,n-1
a(i+1) = a(i) + a(i-1)
end do
has a true (flow) dependence: a value written on one iteration is read on a later iteration, so it cannot be parallelized as written. The loop
do i=1,n-1
a(i) = a(i) + a(i+1)
end do
has instead an antidependence: a(i+1) is read on iteration i and overwritten on iteration i+1.
Then would the following be correct?
!$OMP PARALLEL DO
do i=1,n-1
a(i) = a(i) + a(i+1)
end do
No. a(i+1) could be overwritten before it is used to compute the new a(i). But this can be parallelized by making a preliminary do loop which copies into a new vector.
!$OMP PARALLEL DO
do i=1,n-1
ap1(i) = a(i+1)
end do
!$OMP PARALLEL DO
do i=1,n-1
a(i) = a(i) + ap1(i)
end do
Claim: antidependences can always be removed by creating new storage. Output dependences, where the same variable is written on more than one iteration, can also be removed. Consider
do i=1,n
a(i) = a(i) + 2
x = a(i)
end do
Here x is written on every iteration, and after the loop x = a(n). The parallel version
!$OMP PARALLEL DO PRIVATE(X)
do i=1,n
a(i) = a(i) + 2
x = a(i)
end do
is incorrect: x is private, so its value disappears at the end of the loop; and if x were instead declared shared, we still wouldn't know which iteration performs the last update.
!$OMP PARALLEL DO PRIVATE(X)
do i=1,n
a(i) = a(i) + 2
end do
!$OMP END PARALLEL DO
x = a(n)
would work fine. Or if we want to perform a few more flops,
!$OMP PARALLEL DO SHARED(A) LASTPRIVATE(X)
do i=1,n
a(i) = a(i) + 2
x = a(i)
end do
We claim that antidependences and output dependences can always be removed. In some cases, true dependencies can also be removed. The most standard case is when the dependency corresponds to one of the reduction operators. These are commutative and associative operations, including sum, product, max, min, and, and or.
M = a(1)
do i=2,n
M = max(M,a(i))
end do
Is the following correct?
M = a(1)
!$OMP PARALLEL DO SHARED(a,M)
do i=2,n
M = max(M,a(i))
end do
No: each thread would find the max over only its own subset of the iterations, and the threads would race when updating the shared M. (Protecting M so that only one thread could update it at a time would leave all the threads waiting on one another.) But the following would work.
M = a(1)
!$OMP PARALLEL DO SHARED(a) REDUCTION(MAX:M)
do i=2,n
M = max(M,a(i))
end do
If a variable k is incremented by the same amount on each iteration of a loop, then each new value "depends" on the previous one. But we can compute k directly from the loop counter and so avoid the dependency, as sketched below.
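A sketch (the arrays a, b and the starting value k0 are assumed for illustration):
c     the serial version carries k from one iteration to the next:
c        k = k0
c        do i = 1, n
c           k = k + 2
c           b(i) = a(k)
c        end do
c     computing k from i removes the dependency:
!$omp parallel do private(i,k)
      do i = 1, n
         k = k0 + 2*i
         b(i) = a(k)
      end do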
Besides parallelizing do loops, we can also get parallelism from a parallel region. A parallel region causes each thread to execute the same code. For example, each thread could print "hello world"; if there are two threads, we get two "hello world"s. The "parallel do" construct takes a given amount of work and parcels it out; a "parallel region" does more total work as there are more processors. If you put a plain "parallel" around a do loop, the entire loop will be performed by each processor.
For each processor to do useful work, we typically need each processor to get its own set of data. This can be accomplished by having each processor work on its own part of a matrix. Another useful technique is a common block with a threadprivate directive. We also want each processor to be able to output its own data; one way is to have each processor write to its own section of an array.
Branch statements out of a parallel region are not legal. Just as in the "parallel do" case, variables may be declared shared or private.
subroutine sup(n)
integer n
!$OMP PARALLEL
call mypart(n)
if (n.lt.5) return
!$OMP END PARALLEL
return
end
would not be legal.
The parallel region
!$OMP PARALLEL
print*,'Hello World'
!$OMP END PARALLEL
is not very useful, but it is okay; we get a 'Hello World' per thread. As in the case of the parallel do, the parallel region causes a master thread to fork a set of slave threads, which all disappear at the END PARALLEL. Again, the END PARALLEL is a synchronization point. What would the following construct do?
!$OMP PARALLEL
do i = 1, 100
print*,'Hello World'
end do
!$OMP END PARALLEL
The PARALLEL construct allows OpenMP programs to work with the SPMD (single program -- multiple data) paradigm, which we will also use with MPI.
The following paired OMP statements are equivalent to just having !$OMP PARALLEL DO
!$OMP PARALLEL
!$OMP DO
do i = 1, 100
print*,'Hello World'
end do
!$OMP END DO
!$OMP END PARALLEL
A main trick is how to get the correct data into and out of each parallel thread. We have to be cautious when using common blocks. Here are some examples from Chandra et al that illustrate pitfalls and solutions. This first example attempts to use the thread number to decide which part of the job each thread should perform.
program oops
common /mybnds/ ifirst, ilast
real*4 a(100000)
n = 100000
!$omp parallel private(iam, numthreads, size)
!$omp+ private (ifirst, ilast)
numthreads = omp_get_num_threads()
iam = omp_get_thread_num()
size = (n + numthreads -1 )/numthreads
ifirst = iam*size + 1
ilast = min((iam+1)*size, n)
call work(a)
!$omp end parallel
end
subroutine work(a)
common /mybnds/ ifirst, ilast
real*4 a(*)
do i = ifirst, ilast
c ... stuff on a(i), but ifirst and ilast here are the shared
c common block variables, not the private copies ...
end do
return
end
Alas, the subroutine work pays no attention to the nice arithmetic and the private declaration in the parallel region. Inside work, ifirst and ilast refer to the globally shared common block, not the private copies, hence confusion.
program worksOK
common /mybnds/ ifirst, ilast
real*4 a(100000)
n = 100000
!$omp parallel private(iam, numthreads, size)
!$omp+ private (ifirst, ilast)
numthreads = omp_get_num_threads()
iam = omp_get_thread_num()
size = (N + numthreads -1 )/numthreads
ifirst = iam*size + 1
ilast = min((iam+1)*size, N)
call work(a,ifirst,ilast)
!$omp end parallel
end
subroutine work(a, ifirst, ilast)
integer ifirst, ilast
real*4 a(*)
do i = ifirst,ilast
...... stuff on my part of a
end do
return
end
By passing ifirst and ilast as arguments to work, they refer within work to the private (as opposed to the global) variables.
The threadprivate directive lets us avoid messy argument lists. It lets common blocks (Fortran) or global variables (C, C++) be private to each thread.
program works2
common /bounds/ istart, iend
!$omp threadprivate(/bounds/)
real*4 a(10000)
n = 10000
!$omp parallel private(iam, nthreads, chunk)
nthreads = omp_get_num_threads()
iam = omp_get_thread_num()
chunk = (N + nthreads -1 )/nthreads
istart = iam*chunk + 1
iend = min((iam+1)*chunk, N)
call work(a)
!$omp end parallel
end
subroutine work(a)
common /bounds/ istart, iend
!$omp threadprivate(/bounds/)
real*4 a(10000)
do i = istart, iend
c ... stuff on my chunk of a
end do
return
end
If you want more than one common block in a threadprivate directive, you separate the blocks with commas as well as enclosing each name in paired slashes. Notice that the private clause for istart and iend has disappeared from the parallel region; only the one threadprivate directive appears, and in fact a further private clause for istart, iend would be incorrect.
Typically, the threadprivate variables are initialized within the parallel region. But if we want to get data from the master thread's thread data we can specify
!$omp parallel
!$omp+ copyin(m,n)
where m and n are members of a threadprivate common block (and, as usual, we could list more with commas, e.g., copyin(m,n,j)).
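For instance, a sketch using the /bounds/ common block from the works2 example (istart and iend are set on the master thread before the region and copied into each thread's private copy):
      common /bounds/ istart, iend
!$omp threadprivate(/bounds/)
      istart = 1
      iend = n
!$omp parallel copyin(istart, iend)
c     ... each thread starts with istart = 1 and iend = n ...
!$omp end parallel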
A topic for next time: will code run faster every time we introduce another OpenMP parallel do or parallel region? Why not?
Generally, we expect that forking off a set of threads takes some time, and waiting at a barrier takes some time. Another common problem is that when several processors want the same bit of data, "cache coherence" demands that only one processor control that data at a time. Moreover, data moves in whole cache lines. So even data you don't think of as shared can keep becoming unavailable to one processor because another processor is writing to something else in the same cache line.
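For example, the following sketch (my example; the array a and length n are assumed, and psum(8) assumes at most 8 threads) keeps per-thread partial sums in adjacent elements of a small shared array. Each thread writes only its own element, yet the elements share a cache line, so the line bounces between the processors' caches and the loop runs slower than expected. Padding psum so each thread's element sits in its own cache line avoids the problem.
      real*8 psum(8)
      integer iam, omp_get_thread_num
!$omp parallel private(iam)
      iam = omp_get_thread_num() + 1
      psum(iam) = 0.d0
c     each thread sums its share of a into its own element of psum
!$omp do private(i)
      do i = 1, n
         psum(iam) = psum(iam) + a(i)
      end do
!$omp end do
!$omp end parallel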
Also next time, we'll summarize a few other OpenMP commands.