The is a revision of Lecture 10. It discusses race conditions on OpenMP shared variables and gives a few options.
On the p690, with the xlf (or xlf90 compiler) the OpenMP library is specified by the -qsmp=omp flag. To disable automatic parallelization, but recognize OpenMP directives, try the compiler flag -qsmp=omp:noauto. As with the ifc and icc compilers, there appears to be no mechanism for inserting OpenMP directives directly into Fortran or C codes.
In tcsh you could specify use of 4 threads by
>setenv OMP_NUM_THREADS 4
For the p690, the bsub file could be almost the same as on the blades
#!/bin/csh #BSUB -W 10 #BSUB -n 4 #BSUB -o out.%J ./foo
but here we are asking for 4 processors and did not need the #BSUB -R line. The IBM automatic parallelism inserts SMP directives (SMP is the IBM set of directives, which predates OpenMP). OpenMP directives are also translated into SMP. To see a listing of the of the SMP code, compile with
>xlf -qsmp=omp:auto -qreport=smp -c spcol.f
The same spcol.f program for this week's homework compiled to give a spcol.o file and also a file spcol.lst, the listing of which follows.
XL Fortran for AIX Version 08.01.0000.0003 --- spcol.f 09/15/04 12:16:03
>>>>> OPTIONS SECTION <<<<<
*** Options In Effect ***
== On / Off Options ==
32 CCLINES ESCAPE I4
NOLM OBJECT SWAPOMP UNWIND
NOZEROSIZE
== Options Of Integer Type ==
FIXED(72) HOT() MAXMEM(2048)
OPTIMIZE(2) SPILLSIZE(512)
== Options of Integer and Character Type ==
SMP(OMP,AUTO,THRESHOLD(100),SCHEDULE(RUNTIME))
== Options Of Character Type ==
ALIAS(STD,INTPTR) ALIGN(STRUCT(NATURAL))
ARCH(COM) AUTODBL(NONE) DIRECTIVE($OMP)
FLAG(I,I) FLOAT(MAF,FOLD) HALT(S)
HOT(VECTOR) IEEE(NEAR) INTSIZE(4)
LANGLVL(EXTENDED) POSITION(APPENDOLD) REALSIZE(4)
REPORT(SMPLIST) SAVE(ALL) UNROLL(AUTO)
XFLAG() XLF77(NOLEADZERO,GEDIT77,NOBLANKPAD,OLDBOZ,INTARG,INTXOR,PERSISTENT,SOFTEOF)
XLF90(NOSIGNEDZERO,NOAUTODEALLOC)
>>>>> SOURCE SECTION <<<<<
** spcol === End of Compilation 1 ===
>>>>> PARALLELIZATION AND LOOP TRANSFORMATION SECTION <<<<<
PROGRAM spcol ()
! DIR_SCHEDTYPE loopId = -1 schedType = 5
19| #1 = _xlfBeginIO(6,257,NULL,1024,NULL,0,NULL)
CALL _dconcat("input n",7,T_13,7)
CALL _xlfWriteLDChar(%VAL(#1),T_13,7,1)
_xlfEndIO(%VAL(#1))
20| #2 = _xlfBeginIO(5,1,NULL,1024,NULL,0,NULL)
CALL _xlfReadLDInt(%VAL(#2),n,4,4)
_xlfEndIO(%VAL(#2))
21| #3 = _xlfBeginIO(6,257,NULL,1024,NULL,0,NULL)
CALL _dconcat(" input fraction of nonzeros",27,T_14,27)
CALL _xlfWriteLDChar(%VAL(#3),T_14,27,1)
_xlfEndIO(%VAL(#3))
22| #4 = _xlfBeginIO(5,1,NULL,1024,NULL,0,NULL)
CALL _xlfReadLDReal(%VAL(#4),frac,8,8)
_xlfEndIO(%VAL(#4))
23| iseed = n
24| sum = randw(iseed)
32| pretim = ftime()
33| k = 1
34| IF ((n > 0)) THEN
@CIVFINAL1 = int(n)
@CIV1 = 0
Id=1 DO @CIV1 = @CIV1, @CIVFINAL1-1
35| ic((@CIV1 + 1)) = k
36| i = 1
37| lab_2 /* loopid=9 */
IF (.NOT.(i < int((frac * real(n))) + 1)) GOTO lab_3
lab_4 /* loopid=10 */
38| a(k) = ( 1.0000000000000000E+000 / frac) / real(n)
@RET0 = randw(iseed)
39| @CSE1 = real(n)
ir(k) = int((@RET0 * @CSE1 + 1.0000000000000000E+000))
44| @CSE0 = (k - ic((@CIV1 + 1)))
IF ((@CSE0 > 0)) THEN
@CIVFINAL0 = int(@CSE0)
@CIV0 = 0
Id=2 DO @CIV0 = @CIV0, @CIVFINAL0-1
45| IF (ir((ic((@CIV1 + 1)) + @CIV0)) == ir(k)) GOTO &
& lab_2
51| IF ((ir((ic((@CIV1 + 1)) + @CIV0)) > ir(k))) THEN
52| itemp = ir((ic((@CIV1 + 1)) + @CIV0))
53| ir((ic((@CIV1 + 1)) + @CIV0)) = ir(k)
54| ir(k) = itemp
55| ENDIF
56| ENDDO
ENDIF
57| k = (k + 1)
58| i = (i + 1)
59| IF (i < int((frac * @CSE1)) + 1) GOTO lab_4
lab_3
60| ENDDO
ENDIF
61| ic((n + 1)) = k
62| nz = k
63| IF ((n > 0)) THEN
@DCIV0 = 0
@ICM.n0 = n
Id=11 DO @DCIV0 = @DCIV0, int(@ICM.n0)-1
! DIR_INDEPENDENT loopId = 0
64| x((@DCIV0 + 1)) = 1.0000000000000000E+000
63| ENDDO
@DCIV1 = 0
@ICM.n0 = n
Id=12 DO @DCIV1 = @DCIV1, int(@ICM.n0)-1
! DIR_INDEPENDENT loopId = 0
65| b((@DCIV1 + 1)) = 0.0000000000000000E+000
63| ENDDO
ENDIF
66| @RET1 = real(ftime())
67| entim = real((@RET1 - real(pretim)))
68| #5 = _xlfBeginIO(6,257,NULL,1024,NULL,0,NULL)
CALL _dconcat(" initialization time was ",25,T_15,25)
CALL _xlfWriteLDChar(%VAL(#5),T_15,25,1)
CALL _xlfWriteLDReal(%VAL(#5),entim,4,4)
_xlfEndIO(%VAL(#5))
72| its = (100000000 / nz)
73| pretim = ftime()
74| IF ((its > 0)) THEN
@CIV7 = 0
IF ((n > 0)) THEN
@ICM.n0 = n
@ICM.its3 = its
Id=4 DO @CIV7 = @CIV7, int(@ICM.its3)-1
75| @CIV4 = 0
Id=5 DO @CIV4 = @CIV4, int(@ICM.n0)-1
78| @CSE3 = (ic((@CIV4 + 2)) - ic((@CIV4 + 1)))
IF ((@CSE3 > 0)) THEN
@CIV3 = 0
@ICM1 = x((@CIV4 + 1))
@ICM2 = @CSE3
Id=6 DO @CIV3 = @CIV3, int(@ICM2)-1
79| b(ir((ic((@CIV4 + 1)) + @CIV3))) = b(ir((ic((&
& @CIV4 + 1)) + @CIV3))) + a((ic((@CIV4 + 1)) + &
& @CIV3)) * @ICM1
80| ENDDO
ENDIF
81| ENDDO
82| sum = 0.0000000000000000E+000
83| @CIV5 = 0
Id=7 DO @CIV5 = @CIV5, int(@ICM.n0)-1
84| sum = (sum + abs(b((@CIV5 + 1))))
85| ENDDO
86| sum = ( 1.0000000000000000E+000 / sum)
87| @CIV6 = 0
Id=8 DO @CIV6 = @CIV6, int(@ICM.n0)-1
! DIR_INDEPENDENT loopId = 0
88| x((@CIV6 + 1)) = sum * b((@CIV6 + 1))
89| b((@CIV6 + 1)) = 0.0000000000000000E+000
90| ENDDO
91| ENDDO
ELSEIF
@ICM.its3 = its
74|Id=3 DO @CIV7 = @CIV7, int(@ICM.its3)-1
81| sum = 0.0000000000000000E+000
sum = ( 1.0000000000000000E+000 / sum)
91| ENDDO
ENDIF
ENDIF
@RET2 = real(ftime())
92| @CSE2 = (@RET2 - real(pretim))
entim = real(@CSE2)
93| #6 = _xlfBeginIO(6,257,NULL,1024,NULL,0,NULL)
CALL _xlfWriteLDReal(%VAL(#6),entim,4,4)
CALL _xlfWriteLDInt(%VAL(#6),its,4,4)
_xlfEndIO(%VAL(#6))
94| mflops = ((( 2.0000000000000000E+000 * real(its)) * (real(nz) &
& / real(real(@CSE2)))) / 1.0000000000000000E+006)
95| #7 = _xlfBeginIO(6,257,NULL,1024,NULL,0,NULL)
CALL _dconcat("sum = ",6,T_16,6)
CALL _xlfWriteLDChar(%VAL(#7),T_16,6,1)
T_17 = (sum / real(n))
CALL _xlfWriteLDReal(%VAL(#7),T_17,8,8)
_xlfEndIO(%VAL(#7))
96| #8 = _xlfBeginIO(6,257,NULL,1024,NULL,0,NULL)
CALL _dconcat(" mflops =",9,T_18,9)
CALL _xlfWriteLDChar(%VAL(#8),T_18,9,1)
CALL _xlfWriteLDReal(%VAL(#8),mflops,8,8)
_xlfEndIO(%VAL(#8))
97| #9 = _xlfBeginIO(6,257,NULL,1024,NULL,0,NULL)
CALL _dconcat("nz =",4,T_19,4)
CALL _xlfWriteLDChar(%VAL(#9),T_19,4,1)
CALL _xlfWriteLDInt(%VAL(#9),nz,4,4)
_xlfEndIO(%VAL(#9))
99| CALL _xlfExit(0)
CALL _trap(3)
100| END PROGRAM spcol
Source Source Loop Id Action / Information
File Line
---------- ---------- ---------- ----------------------------------------------------------
0 34 1 Loop cannot be automatically parallelized. Loop
contains a call to "randw" that may have side effects.
0 44 2 Private variables recognized in this loop nest:
"itemp".
0 44 2 Loop cannot be automatically parallelized. A
dependency is carried by variable aliasing or
function call.
0 74 4 Private variables recognized in this loop nest:
"@CIV4", "@CIV3", "@CIV5", "@CIV6", and "@CIV3".
0 74 4 Loop cannot be automatically parallelized. A
dependency is carried by variable "b".
0 75 5 Private variables recognized in this loop nest:
"@CIV3".
0 75 5 Loop cannot be automatically parallelized. A
dependency is carried by variable "b".
0 78 6 Loop cannot be automatically parallelized. A
dependency is carried by variable "b".
>>>>> COMPILATION UNIT EPILOGUE SECTION <<<<<
TOTAL UNRECOVERABLE SEVERE ERROR WARNING INFORMATIONAL
(U) (S) (E) (W) (I)
0 0 0 0 0 0
>>>>> OPTIONS SECTION <<<<<
*** Options In Effect ***
== On / Off Options ==
32 CCLINES ESCAPE I4
NOLM OBJECT SWAPOMP UNWIND
NOZEROSIZE
== Options Of Integer Type ==
FIXED(72) HOT() MAXMEM(2048)
OPTIMIZE(2) SPILLSIZE(512)
== Options of Integer and Character Type ==
SMP(OMP,AUTO,THRESHOLD(100),SCHEDULE(RUNTIME))
== Options Of Character Type ==
ALIAS(STD,INTPTR) ALIGN(STRUCT(NATURAL))
ARCH(COM) AUTODBL(NONE) DIRECTIVE($OMP)
FLAG(I,I) FLOAT(MAF,FOLD) HALT(S)
HOT(VECTOR) IEEE(NEAR) INTSIZE(4)
LANGLVL(EXTENDED) POSITION(APPENDOLD) REALSIZE(4)
REPORT(SMPLIST) SAVE(ALL) UNROLL(AUTO)
XFLAG() XLF77(NOLEADZERO,GEDIT77,NOBLANKPAD,OLDBOZ,INTARG,INTXOR,PERSISTENT,SOFTEOF)
XLF90(NOSIGNEDZERO,NOAUTODEALLOC)
>>>>> SOURCE SECTION <<<<<
** rannor === End of Compilation 2 ===
>>>>> PARALLELIZATION AND LOOP TRANSFORMATION SECTION <<<<<
REAL*8 FUNCTION rannor (ix)
! DIR_SCHEDTYPE loopId = -1 schedType = 5
@RET0 = randw(ix)
110| r = sqrt((-( 2.0000000000000000E+000 * _log(%VAL(@RET0)))))
@RET1 = randw(ix)
113| RETURN
114| END FUNCTION rannor
>>>>> COMPILATION UNIT EPILOGUE SECTION <<<<<
TOTAL UNRECOVERABLE SEVERE ERROR WARNING INFORMATIONAL
(U) (S) (E) (W) (I)
0 0 0 0 0 0
>>>>> OPTIONS SECTION <<<<<
*** Options In Effect ***
== On / Off Options ==
32 CCLINES ESCAPE I4
NOLM OBJECT SWAPOMP UNWIND
NOZEROSIZE
== Options Of Integer Type ==
FIXED(72) HOT() MAXMEM(2048)
OPTIMIZE(2) SPILLSIZE(512)
== Options of Integer and Character Type ==
SMP(OMP,AUTO,THRESHOLD(100),SCHEDULE(RUNTIME))
== Options Of Character Type ==
ALIAS(STD,INTPTR) ALIGN(STRUCT(NATURAL))
ARCH(COM) AUTODBL(NONE) DIRECTIVE($OMP)
FLAG(I,I) FLOAT(MAF,FOLD) HALT(S)
HOT(VECTOR) IEEE(NEAR) INTSIZE(4)
LANGLVL(EXTENDED) POSITION(APPENDOLD) REALSIZE(4)
REPORT(SMPLIST) SAVE(ALL) UNROLL(AUTO)
XFLAG() XLF77(NOLEADZERO,GEDIT77,NOBLANKPAD,OLDBOZ,INTARG,INTXOR,PERSISTENT,SOFTEOF)
XLF90(NOSIGNEDZERO,NOAUTODEALLOC)
>>>>> SOURCE SECTION <<<<<
** randw === End of Compilation 3 ===
>>>>> PARALLELIZATION AND LOOP TRANSFORMATION SECTION <<<<<
REAL*8 FUNCTION randw (ix)
! DIR_SCHEDTYPE loopId = -1 schedType = 5
157| @CSE1 = ix / b16
@CSE4 = (ix - @CSE1 * b16) * a
@CSE0 = @CSE4 / b16
@CSE2 = @CSE1 * a
@CSE3 = (@CSE2 + @CSE0) / b15
ix = ((@CSE4 + @CSE3) + ((@CSE2 + (@CSE0 - @CSE3 * b15)) * &
& b16 - (@CSE0 * b16 + p)))
IF ( ix < 0 ) THEN
ix = ix + p
ELSE
ix = ix
ENDIF
166| RETURN
167| END FUNCTION randw
>>>>> COMPILATION UNIT EPILOGUE SECTION <<<<<
TOTAL UNRECOVERABLE SEVERE ERROR WARNING INFORMATIONAL
(U) (S) (E) (W) (I)
0 0 0 0 0 0
>>>>> FILE TABLE SECTION <<<<<
FILE CREATION FROM
FILE NO FILENAME DATE TIME FILE LINE
0 spcol.f 09/13/04 14:41:32
>>>>> COMPILATION EPILOGUE SECTION <<<<<
FORTRAN Summary of Diagnosed Conditions
TOTAL UNRECOVERABLE SEVERE ERROR WARNING INFORMATIONAL
(U) (S) (E) (W) (I)
0 0 0 0 0 0
Source records read....................................... 168
1501-510 Compilation successful for file spcol.f.
1501-543 Object file created.
Though the loop commands are different than those from OpenMP, we do
get an analaysis of each loop, whether the loop can be parallelized
and if it can be parallelized, what variables should be private or
shared. So we can use this information to aid in detecting dependencies
and in helping to correctly insert OpneMP pragmas.
In the program analyzed above, the loop for sparse matrix multiply of matrices in compressed column storage is
do j=1,n
k1 = ic(j)
k2 = ic(j+1) - 1
do i = k1, k2
b(ir(i)) = b(ir(i)) + a(i)*x(j)
end do
end do
The automatic parallelizer didn't like this loop. Question
would it be correct to parallelize with?
!$omp parallel do
do j=1,n
k1 = ic(j)
k2 = ic(j+1) - 1
do i = k1, k2
b(ir(i)) = b(ir(i)) + a(i)*x(j)
end do
end do
Possibly not. Since elements of the b vector can be updated by
several different threads, there could be a race condition on b,
so that some elements of b could be read by Proc1
and then the location for b could be read by Proc2
before Proc1 has returned its update? Then Proc2 returns an
updated which does not include the updating performed by
Proc1? The "race condition error" may be hard to produce
in the sparse example -- because usually different elements
of b are being updated by different threads.
Let's instead set up a case to see if we can produce a race
condition error.
Consider an alternate code (dense matrix vector multiplication). This was the "optimal" matvec code that got up to 3800 "in-cache" Mflops (with different compiler options than used here. For the fast code we turned on the vectorizor by -xM and set the loop unrolling level to 2. )
do j =1,n,8
do i=1,n
b(i) = b(i) + a(i,j)*x(j) + a(i,j+1)*x(j+1)
+ + a(i,j+2)*x(j+2) + a(i,j+3)*x(j+3) + a(i,j+4)*x(j+4)
+ + a(i,j+5)*x(j+5) + a(i,j+6)*x(j+6) + a(i,j+7)*x(j+7)
end do
end do
Just as in the sparse code, the b vector is updated by each value of j,
so that if we parallelize the loop, processors may encounter a loop
condition. To see what can happen, initialize all values of a
as one and also set the x vector to ones. Then each value of b
should just be the number n of columns.
So the parallelized code is
!$omp parallel do
do j =1,n,8
do i=1,n
b(i) = b(i) + a(i,j)*x(j) + a(i,j+1)*x(j+1)
+ + a(i,j+2)*x(j+2) + a(i,j+3)*x(j+3) + a(i,j+4)*x(j+4)
+ + a(i,j+5)*x(j+5) + a(i,j+6)*x(j+6) + a(i,j+7)*x(j+7)
end do
end do
Setting n to 104 (code only works for n a multiple of 8), compile with
ifc -c openmp calgemv-s.f
and get 1400 Mflops for two processors, but it turns out that some b(i) are 96 instead of 104. .7% of b(i) were miscalculated! So YES THE RACE CONDITION DID CAUSE THE CODE TO FAIL !!
Attempt II.
!$omp parallel do reduction(+:b)
do j =1,n,8
do i=1,n
b(i) = b(i) + a(i,j)*x(j) + a(i,j+1)*x(j+1)
+ + a(i,j+2)*x(j+2) + a(i,j+3)*x(j+3) + a(i,j+4)*x(j+4)
+ + a(i,j+5)*x(j+5) + a(i,j+6)*x(j+6) + a(i,j+7)*x(j+7)
end do
end do
In this case the calculcation proceeds at 777 Mflops (slower) but all
the b(i) are correct. Reservation: it appears that using vectors
as reduction variables is not a part of the standard. So the implementation
may not be portable. Moreover,
the ifc compiler accepts a vector x as a reduction variable only
if x is declared with constant lenght. For example, if x is an
argument to a subroutine
subroutine my(x,n)
double precision x(*)
integer n
!$omp parallel do reduction(+:x)
do j=1,n
do i=1,n
x(j) = x(j) + 1. /(i+j)
end do
end do
return
end
The compiler will give an error.
Attempt III.
!$omp parallel do
do j =1,n,8
do i=1,n
!$omp critical
b(i) = b(i) + a(i,j)*x(j) + a(i,j+1)*x(j+1)
+ + a(i,j+2)*x(j+2) + a(i,j+3)*x(j+3) + a(i,j+4)*x(j+4)
+ + a(i,j+5)*x(j+5) + a(i,j+6)*x(j+6) + a(i,j+7)*x(j+7)
!$omp end critical
end do
end do
The critical pragma ensures that only on processor at a time can
access any variables that are written. Again, the answers
b(i) are all correct.
Unfortunately, the Mflop rate is 47.
Moral: ensuring that only one processor at a time has access to data can be expensive.
Discussion?
To avoid the need to synchronize, we can give each thread its own version of a variable.
!$omp parallel do shared(a) private(b)then if b is updated, other threads don't worry. What happens to b at the end of the do?
If the global b is to the sum of all the local b variables, then we can
!$omp parallel do shared(a) reduction(+:b)
It could also be the case that occasional errors don't bother us. For example, for a parallel sparse matrix vector mutliply in the middle of an iteration, it might be better to just get the iteration done and get on to the next one. Perhaps we'll need a few more iterations but if they run twice as fast? Often it might be better to do more iterations than to wait before performing another.
Here was a use of the critical section code
!$omp parallel do
do j =1,n,8
do i=1,n
!$omp critical
b(i) = b(i) + a(i,j)*x(j) + a(i,j+1)*x(j+1)
+ + a(i,j+2)*x(j+2) + a(i,j+3)*x(j+3) + a(i,j+4)*x(j+4)
+ + a(i,j+5)*x(j+5) + a(i,j+6)*x(j+6) + a(i,j+7)*x(j+7)
!$omp end critical
end do
end do
More efficient in this case might be
do j =1,n,8
!$omp parallel do
do i=1,n
b(i) = b(i) + a(i,j)*x(j) + a(i,j+1)*x(j+1)
+ + a(i,j+2)*x(j+2) + a(i,j+3)*x(j+3) + a(i,j+4)*x(j+4)
+ + a(i,j+5)*x(j+5) + a(i,j+6)*x(j+6) + a(i,j+7)*x(j+7)
end do
end do
No errors. 777 Mflops. But the inner loop in the sparse case is very
short, so might not be as good there.
Another way is with a lock routine
call omp_init_lock(foo)
!$omp parallel do
do j =1,n,8
do i=1,n
call omp_set_lock(foo)
b(i) = b(i) + a(i,j)*x(j) + a(i,j+1)*x(j+1)
+ + a(i,j+2)*x(j+2) + a(i,j+3)*x(j+3) + a(i,j+4)*x(j+4)
+ + a(i,j+5)*x(j+5) + a(i,j+6)*x(j+6) + a(i,j+7)*x(j+7)
call omp_set_unlock(foo)
end do
end do
call omp_destroy_lock(foo)
There are quite a few other OpenMP synchronizations. barrier, master, atomic, flush, ordered do loops. If you seriously want to use OpenMP you will probably end up sorting through these for the best routine. A problem is that synchronization is relatively expensive. From the results below on efficiency the fork and join synchronization of a parallel do loop, run at most 600K times per second. So it takes about 1.6e-6 seconds for a synchronization?
Generally, we expect that forking off a bunch of processes will take some time. Also having a barrier and waiting will take some time. Another common problem is that when several processors want the same bit of data, then "cache-coherence" demands that only one processor have control of the data at a time. Moreover, the data comes in cache lines. So you may not think it is controlled by one processor, but it can happen that data in one cache memory becomes unavailable to the other process.
Remember our small sample program with one parallel do? The following is a bit of an elaboration of that program. The program repeats saxpys long enough to time various parallel loops.
program saxpy
integer n , its ,k ,j, its, Mflops
real*8 a(1000000), b(1000000), alpha, divn
real*4 pretim, entim
external ftime
j = 50
do while(j .lt. 400000)
alpha = 2.d2
divn = 1.0/j
its = 1.e8/j
pretim = ftime()
!$OMP PARALLEL DO
do i=1,j
a(i) = divn
b(i) = i
end do
do k = 1,its
sum = 0.d0
!$OMP PARALLEL DO
do i=1,j
a(i) = alpha* b(i) + a(i)
sum = sum + a(i)
end do
sum = 1./sum
!$OMP PARALLEL DO
do i =1,j
a(i) = a(i)*sum
end do
end do
entim = ftime() - pretim
Mflops = its*4*j/1.e6/entim
print*,' Mflops =', Mflops,j
print*,' ParDos/time =', 2*its/entim
print*,' flops/loop =', j*2,j
j = j*1.5
end do
stop
end
The above code was compiled on the Blade Center by
>ifc -c -tpp7 -xM -openmp sax.f >ifc -o sax -openmp timd.o sax.o
When run, it gave the following output file.
Sender: LSF SystemSubject: Job 84647: <#!/bin/csh;#BSUB -R span[ptile=2];#BSUB -W 10 ;#BSUB -n 2 ;#BSUB -o out.%J;set OMP_NUM_THREADS=2;./sax> Done Job <#!/bin/csh;#BSUB -R span[ptile=2];#BSUB -W 10 ;#BSUB -n 2 ;#BSUB -o out.%J;set OMP_NUM_THREADS=2;./sax> was submitted from host by user . Job was executed on host(s) <2*blade1-11>, in queue , as user . was used as the home directory. was used as the working directory. Started at Wed Sep 15 16:36:33 2004 Results reported at Wed Sep 15 16:37:09 2004 Your job looked like: ------------------------------------------------------------ # LSBATCH: User input #!/bin/csh #BSUB -R span[ptile=2] #BSUB -W 10 #BSUB -n 2 #BSUB -o out.%J set OMP_NUM_THREADS=2 ./sax ------------------------------------------------------------ Successfully completed. Resource usage summary: CPU time : 58.18 sec. Max Memory : 2 MB Max Swap : 4 MB Max Processes : 1 The output (if any) follows: Mflops = 59 50 ParDos/time = 596125.2 flops/loop = 100 50 Mflops = 89 75 ParDos/time = 593912.2 flops/loop = 150 75 Mflops = 117 112 ParDos/time = 523669.8 flops/loop = 224 112 Mflops = 177 168 ParDos/time = 529100.4 flops/loop = 336 168 Mflops = 254 252 ParDos/time = 505509.5 flops/loop = 504 252 Mflops = 399 378 ParDos/time = 529100.0 flops/loop = 756 378 Mflops = 476 567 ParDos/time = 419919.1 flops/loop = 1134 567 Mflops = 645 850 ParDos/time = 379506.4 flops/loop = 1700 850 Mflops = 975 1275 ParDos/time = 382590.3 flops/loop = 2550 1275 Mflops = 1290 1912 ParDos/time = 337425.8 flops/loop = 3824 1912 Mflops = 1481 2868 ParDos/time = 258274.1 flops/loop = 5736 2868 Mflops = 1818 4302 ParDos/time = 211318.2 flops/loop = 8604 4302 Mflops = 1999 6453 ParDos/time = 154960.0 flops/loop = 12906 6453 Mflops = 2105 9679 ParDos/time = 108747.4 flops/loop = 19358 9679 Mflops = 2352 14518 ParDos/time = 81035.30 flops/loop = 29036 14518 Mflops = 2352 21777 ParDos/time = 54023.53 flops/loop = 43554 21777 Mflops = 2352 32665 ParDos/time = 36011.77 flops/loop = 65330 32665 Mflops = 1249 48997 ParDos/time = 12750.00 flops/loop = 97994 48997 Mflops = 634 73495 ParDos/time = 4317.460 flops/loop = 146990 73495 Mflops = 366 110242 ParDos/time = 1664.220 flops/loop = 220484 110242 Mflops = 302 165363 ParDos/time = 915.1515 flops/loop = 330726 165363 Mflops = 294 248044 ParDos/time = 592.6470 flops/loop = 496088 248044 Mflops = 304 372066 ParDos/time = 409.1603 flops/loop = 744132 372066
Recompiling the code without the OpenMP directives (in serial), gave the following output file. In the serial case, even the smallest loop sizes run at the "peak" rate of 1300 Mflops. In the parallel case, the small loop sizes give very slow Mflop rates. In the fastest parallel cases (where the data fits in cache and the loop sizes are large enough to amortize the cost of the PARALLEL DO, the peak rate is 2300 Mflops (not quite twice as fast as the serial case). For problems too large to fit in cache, both the parallel and serial codes slow to about 300 Mflops.
We see that for small enough loops the parallel bottleneck is the time to execute the PARALLEL DO (fork and then join with implicit barrier). These are executed at about 600K/sec or .6M per second. Working down the list of times vs. loop size in both outputs, if there are fewer than 3800 flops/loop, then the PARALLEL DO construct causes the code to run more slowly than a serial code would.
Sender: LSF SystemSubject: Job 84653: <#!/bin/csh;#BSUB -R span[ptile=2];#BSUB -W 10 ;#BSUB -n 2 ;#BSUB -o out.%J;set OMP_NUM_THREADS=2;./sax2> Done Job <#!/bin/csh;#BSUB -R span[ptile=2];#BSUB -W 10 ;#BSUB -n 2 ;#BSUB -o out.%J;set OMP_NUM_THREADS=2;./sax2> was submitted from host by user . Job was executed on host(s) <2*blade1-3>, in queue , as user . was used as the home directory. was used as the working directory. Started at Wed Sep 15 16:55:41 2004 Results reported at Wed Sep 15 16:55:55 2004 Your job looked like: ------------------------------------------------------------ # LSBATCH: User input #!/bin/csh #BSUB -R span[ptile=2] #BSUB -W 10 #BSUB -n 2 #BSUB -o out.%J set OMP_NUM_THREADS=2 ./sax2 ------------------------------------------------------------ Successfully completed. Resource usage summary: CPU time : 13.62 sec. Max Memory : 2 MB Max Swap : 4 MB Max Processes : 1 The output (if any) follows: Mflops = 1333 50 flops/loop = 100 50 Mflops = 1290 75 flops/loop = 150 75 Mflops = 1333 112 flops/loop = 224 112 Mflops = 1379 168 flops/loop = 336 168 Mflops = 1379 252 flops/loop = 504 252 Mflops = 1428 378 flops/loop = 756 378 Mflops = 1481 567 flops/loop = 1134 567 Mflops = 1379 850 flops/loop = 1700 850 Mflops = 1379 1275 flops/loop = 2550 1275 Mflops = 1379 1912 flops/loop = 3824 1912 Mflops = 1333 2868 flops/loop = 5736 2868 Mflops = 1379 4302 flops/loop = 8604 4302 Mflops = 1379 6453 flops/loop = 12906 6453 Mflops = 1333 9679 flops/loop = 19358 9679 Mflops = 1333 14518 flops/loop = 29036 14518 Mflops = 1052 21777 flops/loop = 43554 21777 Mflops = 655 32665 flops/loop = 65330 32665 Mflops = 403 48997 flops/loop = 97994 48997 Mflops = 287 73495 flops/loop = 146990 73495 Mflops = 273 110242 flops/loop = 220484 110242 Mflops = 275 165363 flops/loop = 330726 165363 Mflops = 277 248044 flops/loop = 496088 248044 Mflops = 276 372066 flops/loop = 744132 372066
Repeat the above program. Instead of a daxpy (as this should have been called because it was double precision) make the program compute the inner product of vectors a and b and increment each element of the vector a by 1. Run the code both in serial and in parallel.
Do you find the same number of flops are required for the PARALLEL DO to be faster than the serial DO?
Use a
!$omp parallel do if (n .ge. const )
and see if you can get code that for any size loop is as fast as the faster of your serial and parallel versions. Present timing results for all three code versions.
Loop constructs are convenient in that we can parallelize code gradually. Suppose we have a machine with ten processors and that 3/4s of the work is in loops and moreover that each loop parallelizes perfectly.
loop -- three seconds serial bit -- one second loop -- three seconds serial bit -- one second ...
so that after parallelization with ten CPUs we have
parallel loop -- .3 second serial bit -- one second parallel loop -- .3 seconds serial bit -- 1.0 second ...
Before parallelization, we were taking 4 seconds per iteration. Using ten processors, the time goes to 1.3 seconds. So ten processors give a speedup of about 4/1.3 i.e., about 3. How much speedup would we get with one hundred processors? 4/1.03 , i.e., just less than 4. With a thousand? Throwing more processors at the problem doesn't seem to help much. Actually what usually happens is after a certain number of processors are used, the problem starts to take longer to distribute and the job takes longer. (We saw this with the short loops above).
One way to get a higher potential speedup is by going to a "coarser" grained parallelism. In OpenMP, other than parallelizing do loops, we can also get parallelism by using parallel regions. Such a region causes each thread to execute the same program. For example, each thread could print "hello world". If there are two threads, we could get two "hello world"s. The "parallel do" construct takes a given amount of work and parcelled it out. the "parallel region" results in more work as there are more processors. If you put a "parallel" around a do loop, all of the loop will be performed by each processor.
In order that each processor does useful work we typically need each processor to be get its own set of data. This can be accomplished by having each processor work on its own part of a matrix. Another useful technique is to use a common block with a threadprivate operation. And also we want each processor to be able to output its own data. One way is by having each processor write to its own section of an array. operation.
Branch statements out of a private region are not legal. Just as in the "parallel do" case, variables may be declared public and private.
subroutine sup(n)
integer n
!$OMP PARALLEL
call mypart(n)
if (n.lt.5) return
!$OMP END PARALLEL
return
end
would not be legal.
The loop
!$OMP PARALLEL
print*,'Hello World'
!$OMP END PARALLEL
is not very useful, but okay. Get a 'Hello World' per thread. As in the case of the parallel do, the parallel region causes a master thread to fork a bunch of slaves, which all disappear at the END PARALLEL. Again the END PARALLEL is a synchronization point. What would the following construct do?
!$OMP PARALLEL
do i = 1, 100
print*,'Hello World'
end do
!$OMP END PARALLEL
The PARALLEL construct allows OpenMP programs to work with the SPMD (single program -- multiple data) paradigm, which we will also use with MPI.
The following paired OMP statements are equivalent to just having !$OMP PARALLEL DO
!$OMP PARALLEL
!$OMP DO
do i = 1, 100
print*,'Hello World'
end do
!$OMP END DO
!$OMP END PARALLEL
A main trick is how to get the correct data in and out of each parallel thread. We have to be cautious when using common blocks. Here are some examples from Chandra et al that illustrate pitfalls and solutions. This first example attempts to use knowledge of the process thread number to specify what job a parallel region should perform.
program oops
common /mybnds/ ifirst, ilast
real*4 a(100000)
n = 100000
!$omp parallel private(iam, nunthreads, size)
!$omp+ private (ifirst, ilast)
numthreads = omp_get_num_threads()
iam = omp_get_thread_num()
size = (n + numthreads -1 )/numthreads
ifirst = iam*size + 1
ilast = min((iam+1)*size, n)
call work(a)
!$omp end parallel
end
subroutine work(a)
real*4 a(*)
do i = ifirst,ilast
...... stuff on a(i) but since I can't figure
-------- out ifirst and ilast this loop ? ...
end do
return
end
Alas, the subroutine work pays no attention to the nice arithmetic and the private declaration in the parallel region. It knows what ifirst and iend are only because they are globally shared, hence confusion ..
program worksOK
common /mybnds/ ifirst, ilast
real*4 a(100000)
n = 100000
!$omp parallel private(iam, numthreads, size)
!$omp+ private (ifirst, ilast)
numthreads = omp_get_num_threads()
iam = omp_get_thread_num()
size = (N + numthreads -1 )/numthreads
ifirst = iam*size + 1
ilast = min((iam+1)*size, N)
call work(a,ifirst,ilast)
!$omp end parallel
end
subroutine work(a,ifirst,ilast)
real*4 a(*)
do i = ifirst,ilast
...... stuff on my part of a
end do
return
end
By making ifirst and ilast as arguments to work, they refer in work to the private (as opposed to the global) variables.
The thread private directive lets us avoid messy argument lists. It lets common blocks (Fortran) global variables (C, C++) be private to each thread.
program works2
common /bounds/ istart, iend
$!omp threadprivate(/bounds/)
real*4 a(10000)
n = 10000
!$omp parallel private(iam, nthreads, chunk)
nthreads = omp_get_num_threads()
iam = omp_get_thread_num()
chunk = (N + nthreads -1 )/nthreads
istart = iam*chunk + 1
iend = min((iam+1)*chunk, N)
call work(a)
!$omp end parallel
end
subroutine work(a)
real*4 a(10000)
do i=ifirst,ilast
...... stuff on my chunk of iarray
end do
return
end
If you want more than one common block in a thread private block, you need commas as well as the paired slashes. Notice the private statement for istart and iend disappeared from the parallel region. Only the one threadprivate statement appears and in fact further private statement for istart, iedn would be incorrect.
Typically, the threadprivate variables are initialized within the parallel region. But if we want to get data from the master thread's thread data we can specify
!$omp parallel !$omp+ copyin(m,n)
where m and n are member of a threadprivate common block (and as usual we could list more with commas), e.g., copyin(m,n,j).
The SPMD (single program multiple data) model can seem limiting. But one common trick is to have different threads execute entirely different programs by having each one find its thread number, and then sorting out threads by an
if mynumber is 0 then
.....stuff for thread 0 to do
elseif mynumber is 1 then
... stuff for trhead 1 to do
endif
One common procedure is for thread 0 to be a master and do coordination
and have the other threads be slaves doing all the same thing with
different data. So thread programming actually is pretty versatile.
Limitations of threads programming? Of OpenMP in particular?
If you type
>ps -a >
on any sort of Unix computer, even your own Linux box, you'll see that lots of processors are running. For example, you can very easily edit a file in one window and from another window or in the background run an executable. For instance, one process might refresh the monitor, another might keep track of keystrokes made on the keyboard. One use of OpenMP is to let you keep several trains of execution going on the same CPU. As an alternate HW assignment, write an OpenMP I/O thread.
Rationale. In a parallel job using lots of processors, a hazard is one of the many processors will die in the middle. The common precaution is to use a "check-point restart". A problem is that when all the processors decide to dump their data, then the common file system they are all writing to can't handle all the writes at once (serial bottleneck with everyone trying to write to one controller). A reasonable way to proceed is for every processor to write to its own local hard drive. Then each processor can proceed with its work. But saving the data to a local hard drive is not enough as the local hard drive may die if the CPU does (and the data will be hard to retrieve after the job exits). One solution: after the computational thread completes the local write, have a parallel region in which the computational thread starts computing again and the I/O thread copies a file from the local hard drive to the global file system.
for ( .......)
serial thread writes back-up file to scratch
start of parallel region
iam = mythread number
if iam == zero then
compute
else
I'm the I/O guy I'll copy the file from scratch to share
endif
end of parallel region
end for
Since each of many processes can split off its own region, this can work for a large distributed memory job. The above pseudo-code assumes that the environmental variable OMP_NUM_THREADS is two. To set the environmental variable in tcsh shell would be accomplished by
setenv OMP_NUM_THREADS 2