According to Moore's Law, the number of transistors on a chip doubles every couple of years. For thirty years or so, CPU clock speeds doubled roughly every 18 months, but in the last five years they have stalled at around 2 GHz. Power consumption grows roughly with the square of the clock speed, so faster processors create too much heat for air cooling.
Moore's Law still holds in the rate at which extra transistors appear on motherboards. One use is for extra RAM. Moreover, peak theoretical speed is still increasing, due to GPUs and multi-core processors. Both require modified code if they are to speed up computations. This short course concerns programming for multi-core architectures.
GPUs (Graphics Processing Units) are very fast and increasingly used for computations. Special flavors of C are available for NVIDIA, ATI, and Cell processors. While less difficult than a few years ago, GPU-flavored C is still hard to program and is only beginning to have a standard that works across more than one architecture. Some other caveats: there are eye-popping flop rates for GPUs, but double precision arithmetic slows them down considerably, e.g., by a factor of 4. GPU floating point arithmetic is not standard in how it handles rounding, overflow, underflow, etc., so you need to make sure the application still works. There is also typically a bottleneck between the GPU-accessible RAM and the CPU-accessible RAM.
Summarizing, it's a challenge to convert large programs to use GPUs. From the user's point of view, GPUs can perhaps be viewed as accelerators, useful if someone else has written and debugged a library call. For example, Matlab could dispatch "x = A\b" to a GPU. If you are interested in GPUs, there's quite a bit of expertise in the Triangle. NVIDIA's computational GPU group is in RTP. Frank Mueller (a CSC faculty member at NCSU) has quite a bit of experience, as well as a sizable hybrid GPU cluster, ARC (A Root Cluster), to which NCSU faculty and students are allowed access. There is information on using PGI compiler tools with GPUs in the blade center How-To. GPUs are particularly used by molecular dynamics codes such as LAMMPS.
Multi-core processors are ubiquitous. Most laptops have a couple of cores sharing access to the same RAM. All the blades have at least two processors sharing memory. Many blades have two quad-core processors, so that 8 cores share 24 GBytes of memory, and our newest NC State HPC blades have two hex-core processors. We've replaced the IBM p590 with 16-core Opteron blades with 64 GBytes of memory. The number of cores will likely double a few more times over the next few years.
Whereas programming a GPU is rather specialized work, anyone who can write and compile Fortran or C code can fairly easily use OpenMP to convert it to use multiple cores. Since OpenMP is a widely used de facto standard, the converted code will compile and run on a wide variety of shared memory machines. Floating point arithmetic is fairly standard, with no large penalty for using double precision.
This short course will show how to convert codes to OpenMP and how to do a bit of tuning to enable speedups.
Programming for multi-core architectures has a headstart over programming for GPUs.
The two predominant forms of parallel computation are shared memory and distributed memory. In a distributed memory program, each processor (or possibly a node of processors) has its own chunk of RAM. To access RAM on another processor, we must send "messages", which typically involves writing to a buffer and using some kind of communication network, e.g., GigE ethernet. Distributed memory computing can use many processors, so it is the method of choice for large jobs.
Multi-core architectures are an inexpensive new form of shared memory machine. Shared memory architectures have been around for 20 years or so as high end parallel computational resources. SGI machines were particularly popular, but there were also popular offerings from HP, IBM, and Sun, to name a few. In 2003, about half of the machines on the Top 500 list were shared memory machines. Tools for high end shared memory machines are already developed and can also be used for multi-core architectures.
An advantage of shared memory is that it's typically easier to convert a code to shared memory than to message passing. Distributed memory codes may need to be re-ordered in terms of logical flow. OpenMP provides a way to gradually convert codes from serial to parallel (a loop at a time, for example).
OpenMP is a portable and relatively painless method to convert a serial code to use shared memory parallelism.
Essential idea: OpenMP directives don't affect execution of a serial build, but when more than one core is available, directives indicate "do this part in parallel". The code runs in serial, hits the parallel region, then goes back to serial. Easy to do. If you want to do more, is OpenMP the right way to go?
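As a minimal illustration of the fork/join model (a sketch, not one of the course codes; compile with, e.g., gcc -fopenmp), the program below runs serially, forks a team of threads for the parallel region, and returns to serial execution:
#include <stdio.h>
#include <omp.h>
int main(void) {
    printf("serial part: one thread\n");
#pragma omp parallel
    {
        /* every thread in the team executes this block */
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    } /* implicit barrier: the master continues alone from here */
    printf("back to serial execution\n");
    return 0;
}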
Another portable way to get parallel performance on a shared memory platform is by using "threads". Threads are lightweight processes, designed to swap in and out of the CPU quickly. Pthreads is a standard low-level library for thread-based programming from C or C++. Low-level Pthreads offers more potential for control than OpenMP, but requires more expertise. Two other C-related alternatives are Cilk++ and UPC.
Since OpenMP has evolved to version 3 in a bit more than a decade, it is a moving target. A defense of OpenMP by Ruud van der Pas (an OpenMP book author, responding to a fan of Cilk++) may be fun to read.
Disclaimer: Neglected here is a discussion of how to deal with NUMA (Non-Uniform Memory Access). The problem is that when several "sockets" in a blade share memory, each socket has faster access to its own part of the RAM. Parallelization that ignores locality of memory does not usually give speedup beyond the number of cores attached to a given socket.
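Although NUMA tuning is not pursued in this course, one common mitigation is worth sketching: "first touch" placement, where data is initialized in parallel with the same loop partitioning later used to compute on it, so that each socket's pages land in its local memory. The fragment below is only an illustration of the idea (the function name is made up):
void first_touch_init(double *a, int n) {
    int i;
    /* on Linux, a page is usually placed on the socket of the thread
       that first writes it, so initialize with the same static
       partitioning that the later compute loops will use */
#pragma omp parallel for schedule(static)
    for (i = 0; i < n; i++)
        a[i] = 0.0;
}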
OpenMP obtains parallelism by inserting "pragmas" (directives) in Fortran or C code. If OpenMP is not enabled, the pragmas are ignored and the code still runs in serial. The compiler recognizes OpenMP pragmas if the appropriate flag is set at compile time. For the GNU compilers gcc, g++, and gfortran, the flag is -fopenmp. For example,
gfortran -fopenmp foo.f -o foo
would compile the Fortran code foo.f to produce an executable file foo. Note that g77 has not been updated to use OpenMP.
For the Portland group compilers, pgf90, pgf77, pgcc, and pgCC the OpenMP flag is -mp, for example
pgcc -mp -o foo foo.c
We may also wish to access the libnuma library:
pgcc -mp -o foo foo.c -lnuma
For the Intel compilers ifort, icc, and icpc, the OpenMP flag is -openmp. The -fopenmp, -mp, and -openmp flags also typically imply the "auto" option, so that local variables are placed on the stack rather than in static storage. Usually, users will also set some optimization flags. The number of threads used is set by the environmental variable OMP_NUM_THREADS. In the blade center default tcsh shell,
setenv OMP_NUM_THREADS 4
sets the number of threads to 4. For the bash shell, this would be
export OMP_NUM_THREADS=4
Programs that call OpenMP runtime routines may also need a preprocessor include: #include <omp.h> in C, or USE omp_lib in Fortran.
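For example, a small C program (an illustrative snippet, not one of the course codes) can confirm how many threads the runtime will use after OMP_NUM_THREADS is set:
#include <stdio.h>
#include <omp.h>   /* declares the omp_* runtime routines */
int main(void) {
    /* reflects OMP_NUM_THREADS, or the implementation default */
    printf("max threads = %d\n", omp_get_max_threads());
    return 0;
}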
In fixed-form Fortran 77 or 90, insert directive lines beginning with one of the sentinels
!$omp
c$omp
*$omp
with either a 0 or a blank in the 6th column. Some other character in the 6th column makes the line a continuation of the previous directive line.
In free-form Fortran, any line starting with
!$omp
is a directive. Continuation is indicated by an & at the end of the line, with the sentinel (optionally followed by &) repeated at the start of the next line, as in
!$omp ... &
!$omp& ...
In C or C++, a pragma starts with
#pragma omp
Unless the compiler is instructed to use the pragmas, it ignores them and compiles the code as a serial code.
Other statements may be calls to parallel libraries that are not available in a serial build, which would cause problems at link time. Statements that should be compiled only in the parallel version can be preceded, starting in the first column, by one of the sentinels
!$
c$
*$
or, in free format, by !$ in any column, so long as the preceding characters are white space.
A Fortran 77 example
iam = 0
!$ iam = omp_get_thread_num() ! only compiled if OpenMP enabled
c A continuation (continued only when OpenMP enabled)
y = x
!$ + + offset
Of course, one has to be careful that the serial version still makes sense when the OpenMP directives are not compiled. There are three language extensions embodied in OpenMP: parallel control structures, the data environment, and synchronization.
Parallel Control Structures alter the flow of control. The model is "fork/join": several threads can execute concurrently, or a parallel "do" can divide the iterations of a loop, with each thread performing a different subset of the iterations.
Communication and Data Environment An OpenMP program starts with a single thread of execution. It has access to global variables and to automatic (stack) variables within subroutines as well as to dynamically allocated (heap) variables. The global context remains throughout execution.
Each thread has its own execution context and private stack. It can use its private stack to call subroutines.
Synchronization OpenMP threads communicate through reads and writes to shared variables. Two forms of synchronization are mutual exclusion and barriers. When multiple threads can modify the same variable (think of a bank account balance as an example), a mutual exclusion construct ensures that only one thread updates it at a time; a barrier makes every thread wait until all of them have reached the same point.
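For instance (an illustrative sketch, not one of the course codes), many threads adding deposits to one shared balance need a critical section (or an atomic update) so the read-modify-write is not interleaved:
#include <stdio.h>
int main(void) {
    double balance = 0.0;          /* shared by all threads */
    int i;
#pragma omp parallel for
    for (i = 0; i < 1000; i++) {
#pragma omp critical
        balance += 1.0;            /* one thread at a time updates it */
    }
    printf("balance = %f\n", balance);
    return 0;
}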
Consider the code
      subroutine saxpy(n, a, x, y)
      integer i, n
      real y(n), a, x(n)
      do i = 1, n
         y(i) = a * x(i) + y(i)
      end do
      return
      end
A parallel OpenMP version is
      subroutine saxpy(n, a, x, y)
      integer i, n
      real y(n), a, x(n)
!$omp parallel do
      do i = 1, n
         y(i) = a * x(i) + y(i)
      end do
      return
      end
In C, the serial version
int saxpy(int *n, float *sa, float *sx, float *sy) {
/* Local variables */
int i;
/* constant times a vector plus a vector. */
for (i = 0; i < *n; i++) {
sy[i] += *sa * sx[i];
}
return 0;
}
parallelizes to
int psaxpy(int *n, float *sa, float *sx, float *sy) {
/* Local variables */
int i;
/* constant times a vector plus a vector. */
#pragma omp parallel for \
private(i)
for (i = 0; i < *n; i++) {
sy[i] += *sa * sx[i];
}
return 0;
}
Because there are no dependencies between loop iterations, the only
change required is to insert a pragma directing that the loop be parallelized.
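By contrast, consider a hypothetical loop whose iterations are not independent (this counterexample is not from the course codes): a running prefix sum, where iteration i needs the value just written by iteration i-1, so simply adding the pragma would give wrong answers.
void prefix_sum(int n, float *x) {
    int i;
    /* each iteration depends on the previous one, so adding
       "#pragma omp parallel for" here would give wrong answers */
    for (i = 1; i < n; i++)
        x[i] += x[i - 1];
}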
Runtime Execution Model
(serial) Master thread executes serial portion of code
(serial) Master thread enters the saxpy subroutine
Master thread encounters "parallel do". Creates slave threads
(parallel) master and slave threads divvy up the iterations and each does some of them
(implicit barrier) Wait for everyone to finish
(serial) slave threads gone -- master resumes execution
Communication and data scoping in the saxpy example Each iteration of the loop reads the scalar value a, reads x(i), and reads, modifies, and writes y(i). The writes to y(i) are not a problem because each value of i belongs to a different thread, so they do not overlap. The loop variable "i" is a problem: each thread needs its own copy of it, so the loop index "i" must be private to each thread:
-------------------------------------------------------
Serial execution -- a,x,y,n,i are global shared
-------------------------------------------------------
Parallel execution
-- a,x,y,n are global shared
-- i | i | i | i is private to each thread
-------------------------------------------------------
Once the parallel loop completes, the threads synchronize at the implicit barrier and the master resumes serial execution.
Synchronization in the Simple Loop Example "y" is modified by multiple threads, but since each thread modifies its own segment, that requires no synchronization. However, for a loop that accumulates into a single shared variable (a sum, for example), a reduction clause or explicit synchronization would be needed.
For parallelizing a loop, the iterations must be independent of one another, and we must also keep track of which variables should be private to each thread and which are shared across threads. Detailed examples are in the next lecture.
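A common dependent case is accumulating a single sum across iterations; OpenMP's reduction clause handles it by giving each thread a private partial sum and combining the copies at the end of the loop. A small C sketch (illustrative only):
#include <stdio.h>
int main(void) {
    int n = 1000000, i;
    double sum = 0.0;
    /* each thread accumulates a private copy of sum; the copies are
       combined at the end of the parallel loop */
#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += 1.0 / (i + 1.0);
    printf("sum = %f\n", sum);
    return 0;
}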
Loop-level parallelization is straightforward and can be quickly accomplished. When does parallelization make code run faster?
Note that the clock() timer, which is part of the POSIX standard, adds together the time taken by all cores; we need "wall clock" time instead. Also, clock() is not very fine grained: on Linux it resolves thousandths of a second. A microsecond timer is helpful because it does not require many repetitions to accumulate enough time to measure. Finally, note that if results are not used, an optimizing compiler may detect that the computation is unnecessary and not perform it at all, in which case timings become miraculously fast.
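Besides gettimeofday (used below), the OpenMP runtime itself provides a wall clock timer, omp_get_wtime(), which returns seconds as a double. A brief sketch (illustrative only; note the result is printed so the compiler cannot discard the work):
#include <stdio.h>
#include <omp.h>
int main(void) {
    int n = 10000000, i;
    double s = 0.0, t0, elapsed;
    t0 = omp_get_wtime();              /* wall clock seconds */
#pragma omp parallel for reduction(+:s)
    for (i = 0; i < n; i++)
        s += (double) i;
    elapsed = omp_get_wtime() - t0;
    /* print the result so the compiler cannot discard the work */
    printf("s = %e  elapsed = %f seconds\n", s, elapsed);
    return 0;
}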
Mflops vs. vector length. These are with the pgf90 compiler using the acml4.3.0 library, running on 2.2 GHz Opteron cores. When vectors are short, the parallel do slows the computation.
Speedups for various length vectors. For short vectors (less than length 1000) the "parallel do" is horrible. For long (but not too long) vectors, the parallel do gives almost perfect speedup.
Comparing Mflops for 4 Fortran compilers. The commercial compilers (pgf90 and ifort) were faster than the open source compilers gfortran and openf95. The rate of about 4 Gflops is the peak flop rate (two flops per clock cycle) of a single core; since we are using 16 cores, the commercial compilers are running at about 1/16 of peak speed and the open source compilers at about 1/32. The ifort build was 32 bit (trying to run optimized 64 bit code gave runtime errors).
The next few sections discuss some details involved in the above graphs. The code was rerun repeatedly with choices of compiler flags, adding flags as were helpful, then taking out flags to see if the same results could be produced with fewer flags.
pgf90 was the best compiler in terms of speed. The PGI compilers support AMD processors via the -tp k8-64e flag. One disadvantage is that they check for a PGI license before running an executable.
add pgi
pgcc -c timw3.c
pgf90 -fastsse -tp k8-64e -mp=numa -Mprefetch=w -c sax.f
pgf90 -fastsse -tp k8-64e -mp=numa -Mprefetch=w -Bstatic -o sax timw3.o sax.o
gfortran produced code that ran only about half as fast as when compiled with pgf90. The gfortran compiler is free and can be installed on any current Linux operating system (as well as Windows, Macs, and some other unix systems). AMD processors are supported to the extent of having a flag -mtune=k8.
add gnu
gcc -c timw3.c
gfortran -mtune=k8 -funroll-loops -fexpensive-optimizations -ffast-math -fopenmp -O3 -c sax.f
gfortran -mtune=k8 -funroll-loops -fexpensive-optimizations -static -ffast-math -fopenmp -O3 -o sax timw3.o sax.o
ifort is the Intel compiler. It was competitive with the PGI Fortran compiler in terms of speed on the AMD processor, but it does not officially support AMD, and in fact its optimized 64 bit code would not run on the Opteron, giving a runtime error.
add intel
icc -c timw3.c
ifort -openmp -O3 -xW -c sax.f
ifort -openmp -O3 -xW -o sax timw3.o sax.o
openf95 is an open source compiler downloadable from AMD. So far, it has shown no discernible advantage over gfortran, so we are not supporting it.
Having compiled callem.f to produce callem, the PGI-compiled version was run by
bsub < btry
where btry was the following file:
#!/bin/csh
#BSUB -n 1 -x
#BSUB -W 10
#BSUB -q shared_memory
#BSUB -o /share/gwhowell/openmp/outkl.%J
#BSUB -e /share/gwhowell/openmp/errkl.%J
source /usr/local/apps/env/pgi.csh
setenv OMP_NUM_THREADS 16
./sax
"#BSUB -n 1 -x" asks for exclusive use of a node. "#BSUB -q shared_memory" runs in the shared memory queue (the 16 core Opteron blades). "setenv OMP_NUM_THREADS 16" sets an environmental variable to request 16 OpenMP threads. The -o and -e files set standard output and standard error output files, respectively. The user should set a path to which he/she has write privileges.
To see if parallelization did any good, we need to know how long code takes to run. An easy way to time codes is to put "time" in front of the executable, e.g., the btry code from above revised by inserting "time" and using calls with various numbers of threads.
#!/bin/csh
#BSUB -n 1 -x
#BSUB -W 10
#BSUB -q shared_memory
#BSUB -o /share/gwhowell/openmp/outkl.%J
#BSUB -e /share/gwhowell/openmp/errkl.%J
source /usr/local/apps/env/pgi.csh
setenv OMP_NUM_THREADS 1
time ./sax
setenv OMP_NUM_THREADS 2
time ./sax
setenv OMP_NUM_THREADS 4
time ./sax
setenv OMP_NUM_THREADS 8
time ./sax
For finer scale timings, we can put timers directly in the code, as discussed in the next section.
Here we chose to compute wall clock time from the POSIX C function gettimeofday, which is convenient in that it typically returns microseconds. An alternative for serial codes is the clock() function, which is typically a coarser timer (hundredths or thousandths of a second); clock() gives time as the total of user time on each core and neglects "system" time. Below is some C code, timw3.c (the discussion of compiler flags above included compilation of this code and showed which compilers can be used if Fortran and C codes are to be linked). timw3.c returns a 2-vector "tim" via the function wtime. The function wdiff takes the 2-vector "tim" and returns the elapsed time between the calls of wtime and wdiff. Here's a template for its use from Fortran:
! ... declaration ..
real*8 tim(2), tim2(2)
real*4 elapsed, wdiff
integer ierr
external wdiff
call wtime(tim, ierr)
! ..
! stuff to be timed
!..
elapsed = wdiff(tim, ierr)
!
From C,
/* declaration */
double tim[2];
float elapsed, wdiff_();
int ierr;
wtime_(tim, &ierr);
/*
stuff to be timed
*/
elapsed = wdiff_(tim, &ierr);
The timing code follows.
/* wdiff gives the elapsed time since tim was initialized
 * by a call to wtime
 *
 * To call from fortran, declare
 *    real*8 tim(2)
 *    real*4 wdiff, elapsed
 *    external wdiff
 *
 *    call wtime(tim, ierr)
 *    ... stuff to be timed
 *    elapsed = wdiff(tim, ierr)
 *
 * To call from C, declare
 *    int ierr;
 *    float wdiff_(), elapsed;
 *
 *    wtime_(tim, &ierr);
 *    ... stuff to be timed
 *    elapsed = wdiff_(tim, &ierr);
 */
#include <stdio.h>
#include <sys/time.h>

/* This timer returns wall clock time.  The first entry is the current
   time in seconds, the second entry is microseconds.
   Works either from fortran or C:
     from C call wtime_
     from fortran use wtime  */
void wtime_(double tim[2], int *ierr2)
{
   struct timeval time1;
   int ierr;

   ierr = gettimeofday(&time1, NULL);
   *ierr2 = ierr;
   if (ierr != 0)
      printf("bad return of gettimeofday, ierr = %d \n", ierr);
   tim[0] = time1.tv_sec;
   tim[1] = time1.tv_usec;
}

float wdiff_(double tim[2], int *ierr2)
{
   struct timeval time2;
   double tim2[2];
   int ierr;
   double wd1;

   ierr = gettimeofday(&time2, NULL);
   *ierr2 = ierr;
   if (ierr != 0)
      printf("bad return of gettimeofday, ierr = %d \n", ierr);
   /* elapsed seconds plus elapsed microseconds, converted to seconds */
   tim2[0] = time2.tv_sec - tim[0];
   tim2[1] = time2.tv_usec - tim[1];
   wd1 = (tim2[1] + 1.e6 * tim2[0]) / 1.e6;
   return (float) wd1;
}

/* in case someone wants to make a C call to a more natural
 * wdiff (as opposed to wdiff_), they can use wdiff */
float wdiff(double tim[2], int *ierr2)
{
   return wdiff_(tim, ierr2);
}
Here's the fortran code referred to in the above compiler examples.
Elapsed time is elapsed wall clock time between the call of wtime and wdiff. The section above on compiler flags indicates the corresponding C compiler to use for timw3.c for each Fortran compiler. timw3.o can also be linked to codes compiled by the C compiler used to compile it.
sax.f above was modified to be timed as follows:
      program saxpy
      integer n, its, k, j, ierr
      real*8 a(1000000), b(1000000), alpha, divn
      real*8 Mflops
      real*8 tim(2), tim2(2)
      real*4 elapsed, wdiff
      external wdiff
      integer numth, OMP_GET_MAX_THREADS
      external OMP_GET_MAX_THREADS
      numth = OMP_GET_MAX_THREADS()
      print*,'OMP_NUM_THREADS =', numth
      j = 50
      do while(j .lt. 400000)
         alpha = 2.d2
         divn = 1.0/j
         its = 1.e8/j
         call wtime(tim, ierr)
!$OMP PARALLEL DO
         do i=1,j
            a(i) = divn
            b(i) = i
         end do
         do k = 1,its
            sum = 0.d0
!$OMP PARALLEL DO
!$OMP& REDUCTION(+:sum)
            do i=1,j
               a(i) = alpha* b(i) + a(i)
               sum = sum + a(i)
            end do
            sum = 1./sum
!$OMP PARALLEL DO
            do i =1,j
               a(i) = a(i)*sum
            end do
         end do
         elapsed = wdiff(tim,ierr)
         print*,' elapsed time =', elapsed, ' sum =',sum
         Mflops = ((its*4.0)*j) /1.e6/elapsed
         print*,' Mflops =', Mflops,j
         print*,' ParDos/time =', 2*its/elapsed
         print*,' flops/loop =', j*4, 'length of loop =',j
         print*,' '
         j = j*1.5
      end do
      stop
      end
Normalizing by "sum" after each repetition avoided overflow. Printing "sum" is necessary, since some optimizing compilers can detect that none of the computations are actually output, and hence can safely skip them. Without the print of "sum", the openf95 compiler produced code that ran in negligible time, and ifort ran faster than the theoretical machine peak speed.
A C alternative, a driver that times repeated calls of a matrix-vector product routine mxv (compare the file mxv2.c used in the lab exercises below):
/* this is from Using OpenMP,
 * MIT Press, Chapman, Jost, Van der Pas,
 * 2008, p. 44 (slightly adapted to use a wall-clock timer) */
/* Driver program for the mxv routine.  The user is
 * prompted for the matrix dimensions m and n.  Memory is
 * allocated and initialized prior to the call to mxv.  Memory
 * is released after the call. */
#include <stdio.h>
#include <stdlib.h>

void mxv(int m, int n, double* a, double* b, double* c);
void wtime_(double tim[2], int* ierr);
/* float wdiff_(double tim[2], int* ierr);
   float wdiff(double tim[2], int* ierr);
   You could use one of these instead of the declaration float wdiff_(); */

int main(int argc, char* argv[])
{
   double *a, *b, *c;
   int i, j, k, m, n;
   double tim[2];
   double mflops;
   float wdiff_();
   float endtim;
   int ierr;

   printf("please give m and n: ");
   scanf("%d %d", &m, &n);
   printf("please give number of repetitions k: ");
   scanf("%d", &k);

   if ( (a = (double *) malloc(m * sizeof(double))) == NULL )
      perror("memory allocation for a");
   if ( (b = (double *) malloc(m * n * sizeof(double))) == NULL )
      perror("memory allocation for b");
   if ( (c = (double *) malloc(n * sizeof(double))) == NULL )
      perror("memory allocation for c");

   printf("Initializing matrix B and vector c\n");
   for (j = 0; j < n; j++)
      c[j] = 2.0;
   for (i = 0; i < m; i++)
      for (j = 0; j < n; j++)
         b[i * n + j] = i;

   printf("Executing mxv %d times for m = %d n = %d\n", k, m, n);
   wtime_(tim, &ierr);
   for (i = 0; i < k; i++)
      mxv(m, n, a, b, c);
   endtim = wdiff_(tim, &ierr);

   /* a matrix-vector product does about 2*m*n flops per call */
   mflops = (2.0 * m * n * k) / endtim / 1.e6;
   printf("elapsed time = %f seconds, Mflops = %f\n", endtim, mflops);

   free(a);
   free(b);
   free(c);
   return 0;
}
Lab Exercises
You can copy the codes listed above from /share3/gwhowell/openmp. The first exercise is to start with the saxpy code above and revise it to use the various OpenMP loop schedules. If you prefer C to Fortran, you can start with mxv2.c instead of sax-new.f.
For some basic information on working in Linux, see the HPC How-To and FAQ pages.
1. Schedules for OpenMP parallel loops
Log into the blade center. From the lab machines you can use putty. Often it's a good idea to open more than one putty session.
If you have a new account, you may want to make a subdirectory to avoid clutter of your home directory. For some review of unix, try
pwd
ls
mkdir openmp
ls
cd openmp
ls
pwd
cd ..
pwd
cd openmp
cp /share3/gwhowell/openmp/* .
ls
ls -l
cat sax-new.f
cat mxv2.c
cat makesaxpgi
more bshort
tail bshort
tail sax-new.f
The lines in the file makesaxpgi are
pgf90 -fastsse -tp x64 -mp=numa -Mprefetch=w -c sax-new.f
pgcc -c timw3.c
pgf90 -fastsse -tp x64 -mp=numa -Mprefetch=w -Bstatic -o sax timw3.o sax-new.o
If you type
ls -l makesaxpgi
you see something like
[gwhowell@login02 openmp]$ ls -l makesaxpgi
-rwxr-xr-x 1 gwhowell ncsu 156 Aug 25 12:04 makesaxpgi
The x's at the beginning of the second line indicate the file is executable. Generally in unix, you can type any sequence of commands into a file foo and then make the file executable by typing
chmod +x foo
The commands in makesaxpgi can be executed by
./makesaxpgi
If you prefer C, a good exercise is to change makesaxpgi into a file that compiles mxv2.c (it links to the same timw3.o). The -c flag in the first two lines compiles sax-new.f and timw3.c to produce object files sax-new.o and timw3.o. The last line links the two object files to produce an executable file named sax, because sax follows -o in the command. The OpenMP directives in sax-new.f are recognized because of the -mp flag. OpenMP for the PGI compiler turns out to need the library /usr/local/apps/lib64/libnuma.a.
Try compiling by typing
./makesaxpgi
Presumably, the command will fail unless you first type "add pgi". The "add pgi" command is equivalent to "source /usr/local/apps/env/pgi.csh". To see what it does you can type
cat /usr/local/apps/env/pgi.csh
Any of the .csh files in /usr/local/apps/env can be "added" or "sourced".
If you type
ls -lrt
the files in your directory are listed with the newest last. Hopefully, an executable file sax is listed last. sax is a binary file, readable only by the computer. Since sax would take a few minutes to run, it's not polite to run it on a login node shared by many users. As with all other computationally intensive jobs, it should be submitted to run through the job scheduler LSF.
cat bshort
shows you the contents of an LSF job submission script.
#!/bin/csh
#BSUB -n 1 -x
#BSUB -W 20
#BSUB -q shared_memory
#BSUB -o /share/gwhowell/openmp/out.%J
#BSUB -e /share/gwhowell/openmp/err.%J
setenv LD_LIBRARY_PATH /usr/local/apps/lib64:$LD_LIBRARY_PATH
setenv OMP_NUM_THREADS 16
time ./sax > gput16-shared4
#BSUB -n 1 -x requests exclusive use of a node. #BSUB -q shared_memory will cause the job to execute on one of the Opteron blades with 16 cores. The LD_LIBRARY_PATH line may not be necessary for the PGI compilers, but would be needed by the gnu compilers gfortran, gcc, or g++. The line "setenv OMP_NUM_THREADS 16" specifies that OpenMP will use 16 threads in its parallel regions, which makes sense because we have 16 cores to use. I could submit sax to run by
bsub < bshort
and you can too, except the script will fail because it won't be able to write to /share/gwhowell/openmp/out.%J and /share/gwhowell/openmp/err.%J. So you need to figure out how to open the file and edit those two lines to point to someplace you can write. (For a C exercise, use the executable you got from compiling mxv2.c.)
One way to edit bshort is by
nano bshort
The nano editor is fairly easy to use, does not need a GUI, and exists on most current Linux systems. vi is painful, but universally available on unix and does not require a GUI. emacs is perhaps the most popular unix editor, though it is usually run with a GUI.
So edit the bshort file and get the program to run.
In Fortran, the loop variable is by default private to each thread. In sax.f, the loop variable is "i". Re-edit the program to print "i" after the "end do" that ends a parallel do. The Fortran line is
print*,' i =', i
Since we haven't specified the code to be free form, the print should start at column 7 or after. Recompile and run again. Note that in a serial code, i after the loop would have the last value obtained in the loop. Try putting a line
!$OMP& lastprivate(i)
or in a C code,
#pragma omp parallel for lastprivate(i)
in the declaration of the PARALLEL DO. Does that change the printed value of i? For a parallel loop, lastprivate corresponds to the last iteration of the serial loop.
The loops here are sloppy in not specifying the shared variables. Try explicitly putting in the shared and private variables for each loop. The syntax is
!$OMP PARALLEL DO
!$OMP& SHARED(x,y,z) PRIVATE(i)
      do i = 1,n
      ...
      end do
!$OMP END PARALLEL DO
or in C
#pragma omp parallel for shared(x,y,z) private(i)
for (i=0; i < n; i++)
{
...
}
Writing these out explicitly for each loop lets a couple of people work on the same code and check on each other. You can also state the default explicitly with a DEFAULT clause; unless DEFAULT(none) is used, variables are shared by default in either Fortran or C.
!$OMP PARALLEL DO
!$OMP& DEFAULT(SHARED) PRIVATE(i,j)
What happens if you declare j as private, initialize it before the parallel loop and then print it inside the loop ? Try it and see.
Moral: each parallel region must either initialize its own private variables or use a firstprivate list.
!$OMP PARALLEL DO
!$OMP& DEFAULT(SHARED) FIRSTPRIVATE(J) PRIVATE(i)
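A small C illustration of the difference (hypothetical code, not one of the lab files): with private, each thread's copy of j starts uninitialized; with firstprivate, each copy starts with the value j had before the loop.
#include <stdio.h>
int main(void) {
    int i, j = 42;
    /* private(j): each thread gets an uninitialized copy, so the loop
       must assign j before using it */
#pragma omp parallel for private(j)
    for (i = 0; i < 4; i++) {
        j = 10 * i;
        printf("private:      i=%d j=%d\n", i, j);
    }
    /* firstprivate(j): each thread's copy starts at 42 */
#pragma omp parallel for firstprivate(j)
    for (i = 0; i < 4; i++)
        printf("firstprivate: i=%d j=%d\n", i, j);
    printf("after the loops, the original j is still %d\n", j);
    return 0;
}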
Schedules for loops can be specified. For example, the line
!$OMP PARALLEL DO
can become
!$OMP PARALLEL DO SCHEDULE(RUNTIME)
Make that change in the code and recompile. We can set the schedule as static, dynamic, or guided by the environmental variable, e.g.,
setenv OMP_SCHEDULE "static,10000"
which sets a round robin of chunks of size 10000. You can reset the environmental variable without recompiling the code. Try a few chunk sizes. You can also try dynamic and guided. Chunk size is an optional parameter, so you can also try leaving it out. Personally, I've had no luck improving on the default.
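In C the runtime schedule is requested the same way; a sketch (illustrative only) of a loop whose schedule is picked up from OMP_SCHEDULE at run time:
#include <stdio.h>
#define N 1000000
int main(void) {
    static double a[N];
    int i;
    /* the schedule comes from OMP_SCHEDULE at run time, e.g.
       setenv OMP_SCHEDULE "static,10000" */
#pragma omp parallel for schedule(runtime)
    for (i = 0; i < N; i++)
        a[i] = 2.0 * i;
    printf("a[N-1] = %f\n", a[N - 1]);
    return 0;
}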
Try breaking up the PARALLEL DO into two directives:
!$OMP PARALLEL
!$OMP DO
....
!$OMP END DO
!$OMP END PARALLEL
or in C
#pragma omp parallel
{
#pragma omp for
   ...
}
Does this still run? One advantage is that we could put several parallelized do loops inside one parallel region. There would normally be a barrier at the end of each loop, but the NOWAIT clause removes it:
!$OMP PARALLEL
!$OMP DO
..
!$OMP END DO NOWAIT
!$OMP DO
..
!$OMP END DO
!$OMP END PARALLEL
or in C
#pragma omp parallel
{
#pragma omp for nowait private(i)
   ...
#pragma omp for
   ...
}
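A complete C version of the two-loop pattern (an illustrative sketch; the arrays and loop bodies are made up, and nowait is safe here only because the two loops touch different arrays):
#include <stdio.h>
#define N 1000000
int main(void) {
    static double x[N], y[N];
    int i;
#pragma omp parallel
    {
        /* threads that finish this loop early move straight on to the
           next loop instead of waiting at a barrier */
#pragma omp for nowait
        for (i = 0; i < N; i++)
            x[i] = 1.0 * i;
#pragma omp for
        for (i = 0; i < N; i++)
            y[i] = 2.0 * i;
    } /* implicit barrier at the end of the parallel region */
    printf("x[N-1] = %f  y[N-1] = %f\n", x[N - 1], y[N - 1]);
    return 0;
}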
2. Write a small program and time it
Do you know how to use an editor? Some choices are nano, emacs, and vi. If you prefer to program by changing existing codes, you can copy files from /share3/gwhowell/openmp. Commands for compilation are in the make* files; for example, makesaxpgi compiles with the PGI compilers.
Suppose we have values a(i), i = 1,n and want to produce a smoothed function b(i), i = 1,n-3, where b(i) = (a(i) + 2*a(i+1) + 3*a(i+2) + 4*a(i+3))/10. This is a weighted average.
Write a serial function in Fortran or C that initializes a and computes b. Typically in C, the indexes of a will run from 0 to n-1.
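One possible serial version in C (a sketch to start from, using 0-based indexes; the initialization of a and the driver are left to you):
/* b[i] is a weighted average of four neighboring values of a;
   the weights 1, 2, 3, 4 sum to 10 */
void smooth(int n, const double *a, double *b) {
    int i;
    for (i = 0; i < n - 3; i++)
        b[i] = (a[i] + 2.0*a[i+1] + 3.0*a[i+2] + 4.0*a[i+3]) / 10.0;
}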
How do you know it's giving a correct answer? One way to test is by having it compute something you already know. For example, you could set all entries to one, or every other entry to one or minus one.
Run it as a batch job using LSF.
Parallelize the code using OpenMP.
Did the parallelization do any good? Time the runs with one thread vs. several by resetting the environmental variable OMP_NUM_THREADS. Also try varying the number of elements in the vectors.
One way to time an executable foo is to put
time ./foo
in place of the plain call to foo. The wall clock timer used above is more accurate.