As written the code passes messages in a ring in the direction of a higher numbered processor. Change the code so that it also passes to the left. Does the resulting code take twice as long to run? What is the estimated time to start a message?
Passing messages by persistent communication is a bit more complicated in terms of the MPI calls. Is it faster on this machine?
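One common way to think about the "time to start a message" question is the linear cost model sketched below. It is an assumption brought in for illustration, not something taken from the course code: $T(n)$ is the measured time to pass a message of $n$ bytes, $t_s$ is the startup time (latency), and $B$ is the bandwidth.

```latex
\begin{align*}
T(n) &\approx t_s + \frac{n}{B} \\
% time two message lengths n_1 < n_2, then solve for the two unknowns
B &\approx \frac{n_2 - n_1}{T(n_2) - T(n_1)}, &
t_s &\approx T(n_1) - \frac{n_1}{B}
\end{align*}
```

For very short messages the $n/B$ term is negligible, so the measured time is close to $t_s$ itself. The model does not say whether passing in both directions doubles the run time; that depends on whether the network can carry traffic both ways at once, which is what the timing experiment is meant to show.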
So a common activity is to port codes. This often stumps people, so they ask me to do it. Which is fine, unless you want it right now. So let me show you what it usually requires.
For example, FFTW, a fast Fourier transform library, is a widely used C code with a parallel version that works in an MPI environment.
First google and find it. Typically I download to my PC. Windows then tries to turn the .tar.gz file into a .tar.tar file, so use "save as" to give it the right name.
Then transfer the file to your home directory. For example,
from cygwin (a way to do Linux from a Windows PC, so I'm editing
this with vi, questionable taste, eh? )
sftp gwhowell@login02.hpc.ncsu
(enter password)
sftp>put fftw-2.1.5.tar.gz
sftp>quit
WARNING: you NEED to sftp to login01 or login02, not to login!!
Then log into the blade center.
>gunzip fftw-2.1.5.tar.gz
>tar xvf fftw-2.1.5.tar
>cd fftw-2.1.5
From the web page I downloaded from, it seems the easiest way to install is from an rpm, but that may not be feasible because you need sys admin privileges. Also, it installs by default in /usr/local, which in our case is currently full. Looking at INSTALL in this directory, there are explicit, easy instructions, but they don't look as if they will produce the MPI (parallel) version. Looking at the configure.in file, though, we can see what options there are. I ended up making a file myconfig containing the following lines.
add gnu
./configure --enable-mpi --enable-type-prefix --with-gcc --enable-float --enable-i386-hacks
Excerpt from config.log
/bin/arch = i686
/usr/bin/arch -k = unknown
/usr/convex/getsysinfo = unknown
hostinfo = unknown
/bin/machine = unknown
/usr/bin/oslevel = unknown
/bin/universe = unknown
PATH: /usr/local/gnu/mpich-rhel3/mpich-1.2.6-3.2.3/32/bin
PATH: /usr/local/gnu/mpich-rhel3/mpich-1.2.6-3.2.3/32/bin
PATH: /home/gwhowell/bin
PATH: /usr/local/lsf/6.1/linux2.4-glibc2.3-x86/bin
PATH: /usr/local/lsf/6.1/linux2.4-glibc2.3-x86/etc
PATH: /usr/kerberos/bin
PATH: /usr/local/bin
PATH: /bin
PATH: /usr/bin
PATH: /opt/xcat/bin
PATH: /opt/xcat/sbin
PATH: /opt/xcat/i686/bin
PATH: /opt/xcat/i686/sbin
PATH: /usr/X11R6/bin
make |& tee make.log
In this case, make.log has 1154 lines and looks as if it compiled correctly.
The point of compiling this library is that it produces some library
files we can link to. But where are they? The find
command will tell us. From the directory /home/gwhowell/fftw-2.1.5,
we look for *.a files in subdirectories by
[gwhowell@login01 ~/fftw-2.1.5]$ find . -name '*.a' -print
./fftw/.libs/libsfftw.a
./rfftw/.libs/libsrfftw.a
./mpi/.libs/libsfftw_mpi.a
./mpi/.libs/libsrfftw_mpi.a
So the full path to the first of these libraries is
/home/gwhowell/fftw-2.1.5/fftw/.libs/libsfftw.a
You could repeat all this as an exercise. But sitting in
/home/gwhowell, when I type
[gwhowell@login01 ~]$ du fftw-2.1.5
296 fftw-2.1.5/fftw/.deps
320 fftw-2.1.5/fftw/.libs
2636 fftw-2.1.5/fftw
272 fftw-2.1.5/rfftw/.deps
284 fftw-2.1.5/rfftw/.libs
2260 fftw-2.1.5/rfftw
16 fftw-2.1.5/tests/.deps
4 fftw-2.1.5/tests/.libs
940 fftw-2.1.5/tests
916 fftw-2.1.5/doc
44 fftw-2.1.5/threads/.deps
252 fftw-2.1.5/threads
52 fftw-2.1.5/mpi/.deps
52 fftw-2.1.5/mpi/.libs
2504 fftw-2.1.5/mpi
12 fftw-2.1.5/fortran
268 fftw-2.1.5/gensrc
28 fftw-2.1.5/matlab
68 fftw-2.1.5/cilk
44 fftw-2.1.5/FAQ/fftw-faq.html
192 fftw-2.1.5/FAQ
11668 fftw-2.1.5
Note there are test directories. We can't really be comfortable that the port was successful until we've tried these (and so far I haven't).
In a given directory, you can hunt for a desired symbol foo_, hopefully defined in an object file that is part of a .a (static) library, by
nm -A *.a | grep foo_
Googling for an undefined symbol may give some hints as to where it can be found. Typically symbols may occur in either a 32-bit or a 64-bit version: you need to be consistent about which libraries you choose.
The other common form of library is the shared or ".so" library. .so libraries are theoretically convenient in that they reduce executable file size by deferring the link into the executable until run time. Unfortunately, many .so libraries differ from one compute node to another, so they can cause run time errors.
Avoiding use of .so files is often advisable. Use of static (.a) libraries can be specified at compile time (look at the man page of the compiler you are using). Another possible solution is to copy .so files to your /home directory file space and specify their location at link time using the environment variable LD_RUN_PATH.
rm -r dir
The Intel compilers tend to produce the fastest code. Our previous efforts to port fftw with the Intel compilers have failed to link.
Many commercial developers choose pgi compilers as producing code they can link without error. If you want f90, pgi and Intel are your only choices.
Open source codes usually work in the gnu (gcc, g77, g++) environment, which motivated me to use gnu for this one. But the gnu compilers also tend to produce the slowest code.
Rather often, libraries compiled with one family of compilers do not link well with libraries compiled with another. For example, there are problems with the number of underscores appended to symbol names in library files. (Use the nm command on a library to see them.) So we often end up with several versions of various libraries. Often the order in which libraries are listed on a link line makes a difference in whether the link is successful. Intel libraries are particularly fussy, and we sometimes find we have to list the same library twice in a link line.
The peak advertised computational rates on the Blade Center processors are around 12 Gflops per node (12 billion floating point operations per second per node).
I attained this rate on a user problem. She was solving a matrix equation Ax = b, with A dense, using a code from Numerical Recipes. This was a problem in crystallography. Matrices of size 10K required about ten hours to solve, but by calling LAPACK running on top of a good BLAS, I was able to get run times of less than five minutes (factor of one hundred speedup).
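A rough operation count shows what those run times mean. This is only a back-of-envelope sketch, assuming the solve is an LU factorization costing about $2n^3/3$ floating point operations and taking "10K" to mean $n = 10^4$.

```latex
\begin{align*}
\tfrac{2}{3}\,n^3 = \tfrac{2}{3}\,(10^4)^3 &\approx 6.7 \times 10^{11} \text{ flops} \\
\text{ten hours:}\quad \frac{6.7 \times 10^{11}}{3.6 \times 10^{4}\ \text{s}} &\approx 1.9 \times 10^{7}\ \text{flops/s} \approx 19\ \text{Mflops} \\
\text{five minutes:}\quad \frac{6.7 \times 10^{11}}{300\ \text{s}} &\approx 2.2 \times 10^{9}\ \text{flops/s} \approx 2.2\ \text{Gflops} \\
\text{at 12 Gflops:}\quad \frac{6.7 \times 10^{11}}{1.2 \times 10^{10}\ \text{flops/s}} &\approx 56\ \text{s}
\end{align*}
```

Under these assumptions the original code was running at a fraction of a percent of peak, any run under five minutes is faster than about 2.2 Gflops, and a run at the full advertised rate would take just under a minute.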
Most real world problems are more closely related to sparse matrices than to dense, so we can't use the Basic Linear Algebra libraries to get these "magical" speedups. One trick that is more universally applicable is to multiply a sparse matrix by several dense vectors at the same time. The time is almost the same as multiplying the sparse matrix by one vector.
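The idea of multiplying by several vectors at once can be sketched as follows. This is a minimal illustration, not code from the course: it assumes the matrix is stored in compressed sparse row (CSR) form, and the function name csr_times_vectors and the small 3 by 3 example are made up. Plain Python is used only to show the loop order; on the Blade Center you would write the same loops in C or Fortran.

```python
def csr_times_vectors(n_rows, row_ptr, col_idx, values, X):
    """Multiply a CSR sparse matrix by k dense vectors in one pass.

    X holds the k vectors side by side: X[c][j] is entry c of vector j.
    Returns Y in the same layout, Y[i][j] = sum over c of A[i][c] * X[c][j].
    """
    k = len(X[0]) if X else 0
    Y = [[0.0] * k for _ in range(n_rows)]
    for i in range(n_rows):
        y_row = Y[i]
        for p in range(row_ptr[i], row_ptr[i + 1]):
            a = values[p]            # each matrix entry is fetched once ...
            x_row = X[col_idx[p]]
            for j in range(k):       # ... and reused for all k vectors
                y_row[j] += a * x_row[j]
    return Y


# Hypothetical 3x3 matrix, stored by rows:
#   [[2, 0, 1],
#    [0, 3, 0],
#    [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [2.0, 1.0, 3.0, 4.0, 5.0]

# Two vectors side by side: x0 = (1, 1, 1) and x1 = (1, 2, 3)
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Y = csr_times_vectors(3, row_ptr, col_idx, values, X)
print(Y)  # [[3.0, 5.0], [3.0, 6.0], [9.0, 19.0]]
```

The matrix entries and column indices are read from memory once no matter how many vectors there are. Since fetching the matrix is usually what dominates a sparse multiply, the extra vectors come nearly for free.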
The following web page shows how to link to BLAS and LAPACK on the Blade Center: BLAS on the Blade Center.