MPI Short Course


Gary Howell, March 2004


Outline:


>Why Parallel? - More Data, More Computations.

>Why MPI? - Simple MPI.

>Basic Considerations in Message Passing & How to Combine Messages

>Collective Communication


Analyze timings.

Timings

Topologies

Nonblocking and Persistent Communication


fortran files displayed:

monte.f

ring.f

ring2.f

monte3.f

monte2.f

nonblock.f

persist3.f

>Why Parallel


Science and engineering computations are now typically made on computers. Computers enable realistic modeling of physical phenomenon, allowing us to examine more than just special cases. e.g., "dusty deck" Fortran codes may represent a core engineering expertise. To improve validity of physical models, we typically add more grid points. For example, we might want a 1K by 1K by 1K grid. If there is one eight byte floating point number for each element of the grid, that would be 8 Gbytes.


One reason to go to parallel computation is just to have sufficient RAM available. Many of the standard finite element packages still are not well parallelized, but nonetheless require lots of memory to model in sufficient detail. The fastest computers now (the ones that solve physical problems with the best resolution) are parallel computers.


The dominant style of parallel computer is an MIMD (Multiple Instruction Multiple Device) computer. The style of programming I'll talk about here is Single Instruction Multiple Device (SIMD). This uses "if this group of nodes, else that group of nodes") constructs so that SENDS, RECEIVES, and synchronization points can all appear in the same code.


If all processors can access the same RAM, the computer is said to be shared memory.


If memory is distributed, message passing is the usual style of programming.


Why MPI? -- Simple MPI.


MPI (Message Passing Interface) is a standard. Before MPI (and PVM) it was necessary to rewrite the message passing parts of routines for every new parallel platform that came along.


Considering the great number of "dead" architectures, this meant that by the time code worked well (a year or two) the machine it was written for was almost out of date.


Connection Machines, DAP, BBN, Kendall Square, DEC, Intel Message Passing, Sequent, Cray, SGI ...


MPI programs work on both shared memory and distributed memory machines. So just as we have "dusty deck" Fortran codes, we will have legacy MPI-FORTRAN and MPI-C codes. Hopefully, they will be maintainable? Portably fast?


MPI is a very rich (complicated) library. But it is not necessary to use all the features. An advantage to using a simple subset is these have been optimized by most vendors. Also it makes thinking about programming easy. On the other hand, it is fun to look up other functions and see if they will simplify programming. Often the problem you are having is one addressed by some MPI construct.


There are two sets of sample programs. Due mainly to Peter Pacheco from the University of San Francisco.

ppmpi_c and ppmpi_f in C and Fortran respectively. Download from anonymous ftp.

ftp cs.fit.edu

login: anonymous

cd pub/howell


get mpi_rweed.ppt -- is a power point presentation migrated from

Mississippi State (Dr. Rick Weed).

get pachec.tar -- for a more complete set of example MPI codes.



References

Parallel Programming with MPI -- Peter Pacheco -- Morgan Kaufman Press. A good introduction to parallel programming -- and to MPI, examples in C.


For these, notes go to


http://www.ncsu.edu/itd/hpc


MPI--The Complete Reference by Snir, Otto, Huss-Ledermean, Walker, and Dongarra. Vols I and II, MIT Press Vol. I systematically presents all the MPI calls, listing both the C and Fortran bindings. Expands on the MPI standard. Vol.II presents MPI II.


Using MPI – Portable Parallel Programming with the Message-Passing Interface by Gropp, Lusk, and Skjellum,

MIT Press, 1996.


RS/6000 SP: Practical MPI Programming—www.redbooks.ibm.com by Yukiya Aoyama and Jun Nakano


www-unix.mcs.anl.gov/mpi/standard.html has the actual MPI standard documents


You can look at the mpi.h, fmpi.h, mpio.h file on your system. Usually these are found in /usr/include


I'm presenting codes in Fortran.

(Argue: Fortran 77 is the subset of C which are most useful for scientific computing, other features of C, e.g. pointer, arithmetic are likely to slow performance).


Fortran calling arguments to MPI routines are different than C. Usually, an ierr argument is appended. Where a C function would return an integer zero on successful return, the Fortran subroutine returns ierr as zero instead.




SAMPLE EASY CODES


Monte Carlo codes run many instances of a given event. The original Monte Carlo calculations were run to model an H-Bomb. The Russians already had one, how by hook or by crook to model the interactions of a neutron?


Monte Carlo calculations of a given constant (e.g., mean free path of a neutron or area under a curve) have


Error = O(1/sqrt(number of simulations) )


So to get one more digit of accuracy we have to multiply the number of simulations by one hundred. Hence a need for many simulations, i.e, so a need for parallel computation.


The simulations are independent and require little communication, so Monte Carlo codes are known as “embarrassingly parallel”.



FILE: monte.f

c

c Template for a Monte Carlo code.

c

c The root processor comes up with a list of seeds

c which it ships to all processors.

c

c In a real application, each processor would compute

c something and send it back. Here they compute

c a vector of random numbers and send it back.

c

c This version uses a loop of sends to mail out the

c seeds. And uses a loop to send data back to root.


program monte

c

include 'mpif.h'

c

integer my_rank

integer p

integer source

integer dest

integer tag

integer iseed,initseed

integer status(MPI_STATUS_SIZE)

integer ierr

integer i

real*8 ans(10), ans2(10)

real*8 startim,entim

c

c function

integer string_len

c

call MPI_Init(ierr)

c

call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)

call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)

c

if (my_rank.eq.0) then

print*,'input random seed'

read*,iseed

endif

c------------------------------------------------------------------------

c if the timing results seem peculiar, try uncommenting the next line

c call MPI_BARRIER(MPI_COMM_WORLD,ierr) !

c !

c ! What difference is there?

startim = MPI_Wtime()

if (my_rank.eq.0) then

initseed = int(random()*1000)

c startim = MPI_Wtime()

do dest = 1, p-1 ! CPU 0 loops through p-1 sends

initseed = int(random()*1000)

tag = 0



call MPI_Send(initseed, 1, MPI_INTEGER,

+ dest, tag, MPI_COMM_WORLD, ierr)

c ----------------------------------------------------------------------

c Send message consisting of

c initseed -- arg 1 message sent

c 1 -- arg 2 , length of message

c MPI_INTEGER -- arg 3 , type of data sent

c dest -- arg 4, rank of processor to which message sent

c tag -- arg 5, some integer, needs to be matched by RECV

c MPI_COMM_WORLD -- arg 6, handle of communicator, matched by RECV

c ierr -- arg 7, output from MPI_SEND, will be 0 if successful

c This call is blocking. Code will not proceed until the receiving processor

c signals that it has started to receive.

c -----------------------------------------------------------------------

end do

else ! each of p-1 CPUs gets a message from CPU 0

root = 0

tag = 0

call MPI_Recv(initseed, 100, MPI_CHARACTER, root,

+ tag, MPI_COMM_WORLD, status, ierr)

c------------------------------------------------------------------------

c Receive message consisting of

c initseed -- arg 1 message sent

c 100 -- arg 2 , maximal length of message

c MPI_INTEGER -- arg 3 , type of data sent

c root -- arg 4, rank of processor which sent message

c (could use a wild card)

c tag -- arg 5, some integer, needs to be matched by SEND

c (could use a wild card)

c MPI_COMM_WORLD -- arg 6, handle of communicator, matched by SEND

c (no wild card allowed)

c status -- arg 7 integer array status(MPI_STATUS_SIZE)

c ierr -- arg 8, output from MPI_SEND, will be 0 if successful

c The receive is blocking. Code will not go to next step until the

c receive is completed.

c------------------------------------------------------------------------

call MPI_Get_count(status, MPI_INTEGER, size, ierr)

c -----------------------------------------------------------------------

c This call tells us how long the passed message came out to be

c Information about the received message is found in status vector

c ----------------------------------------------------------------------

endif ! input phase done


c-----------------------------------------------------------------------

c Left out -- a body of code that does a bunch of particle tracking

c stuff to produce the double precision vector ans

c-----------------------------------------------------------------------

do i=1,10

ans(i) = rand() ! at least we initialize stuff to send back.

end do


if (my_rank.eq.0) then

tag = 1

do source = 1,p-1

call MPI_RECV(ans2, 10, MPI_DOUBLE_PRECISION,

+ source, tag, MPI_COMM_WORLD,status, ierr)

do i=1,10

ans(i) = ans(i) + ans2(i)

end do

end do

else

tag = 1

call MPI_SEND(ans, 10, MPI_DOUBLE_PRECISION, root,

+ tag, MPI_COMM_WORLD, ierr)

endif

if(my_rank.eq.0) then

c do some stuff to process and output ans

endif

entim = MPI_Wtime() - startim

c

call MPI_Finalize(ierr)

print*,' elapsed time =',entim, ' my_rank=',my_rank

end

cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc

c


Let's look a bit at the arguments for the send and receive.


MPI_SEND,

1st argument, variable to be sent

2nd argument, integer number of items in variable, here 1

3rd , data type--different Fortran and C bindings !!

4th , dest -- integer rank of processor to which message is sent

5th , tag -- integer message tag

6th , handle for MPI communicator

7th , ierr , integer for error message, this argument not used in C ,

output


MPI_RECV

1st arg, variable to be received (output)

2nd arg, integer of number of variables in message , This is

an upper bound for the receive

3rd arg, data type

4th arg, source -- integer rank of processor sending message

(or could be MPI_ANY_SOURCE)

5th arg, tag -- same integer as matching send,

(or could be MPI_ANY_TAG)

6th arg handle for MPI communicator, must match sending communicator

7th arg status, integer status(MPI_STATUS_SIZE) – output

8th arg ierr, output, 0 for successful return


The integer status(MPI_SOURCE) tells us the processor rank of the source of the received message. The integer status(MPI_TAG) sends us the tag off the received message. There’s also a value status(MPI_ERROR).


In general, I won't write out the MPI arguments in such detail, but as with any C or Fortran library, keeping good track of subroutine arguments is a first key to successfully using a call.


It is often helpful to write a small program to illustrate and verify the action of the routine. On a given architecture, you need to know the correct data types to match the integers used in MPI calls, e.g. source, root, tag, ierr etc. Typically 4 bytes, but ...


In order to model the running of your code, you may want to time a communication pattern.


How long does it take to start a message?

What is the bandwidth for long messages?

Would another MPI operator work better?


Fortran MPI data types include Fortran datatype

MPI_INTEGER INTEGER

MPI_REAL (single precision) REAL

MPI_DOUBLE_PRECISION DOUBLE PRECISION

MPI_CHARACTER CHARACTER(1)

MPI_COMPLEX COMPLEX

MPI_BYTE

MPI_PACKED

MPI_LOGICAL LOGICAL


C data types include

MPI_INT

MPI_FLOAT

MPI_DOUBLE

MPI_CHAR

Etc.

(See P. 34 MPI – The Complete Reference)



Let's consider instead a ring program. It passes

FILE: ring.f

c ring program

c

c pass messages around a ring.

c Input: none.

c

c See Chapter 3, pp. 41 & ff in PPMPI.

c

program ring

c

include 'mpif.h'

c

integer my_rank

integer p

integer source, source2

integer dest,dest2

integer tag,tag2

integer root

character*100 message,message2

character*10 digit_string

integer size

integer status(MPI_STATUS_SIZE)

integer ierr

integer i,nreps

real*8 startim,entim

c

c function

integer string_len

nreps = 10000

c

call MPI_Init(ierr)

c

call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)

call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)

startim = MPI_Wtime()

do i=1,nreps

call to_string(my_rank, digit_string, size)

message = 'Greetings from process ' // digit_string(1:size)

+ // '!'

if (my_rank.ne.p-1) then

dest = my_rank+1

else

dest = 0

endif

if (my_rank.ne.0) then

source = my_rank-1

else

source = p-1

endif

tag = 0 ! message from even processors have an even tag

tag2 = 1 ! messages from odd processors have an odd tag

root = 0

c

c Note this solution only works if the total number of processors is even

c

if(my_rank.eq.2*(my_rank/2)) then ! if my_rank is even

call MPI_Send(message, string_len(message), MPI_CHARACTER,

+ dest, tag, MPI_COMM_WORLD, ierr)

call MPI_Recv(message2, 100, MPI_CHARACTER, source,

+ tag2, MPI_COMM_WORLD, status, ierr)

else

call MPI_Recv(message2, 100, MPI_CHARACTER, source,

+ tag, MPI_COMM_WORLD, status, ierr)

call MPI_Send(message, string_len(message), MPI_CHARACTER,

+ dest, tag2, MPI_COMM_WORLD, ierr)

endif

call MPI_Get_count(status, MPI_CHARACTER, size, ierr)

c print*,'my_rank=',my_rank

c write(6,101) message2(1:size),my_rank

101 format(' ',a,' my_rank =',I3)

end do

entim = MPI_Wtime() - startim

c

call MPI_Finalize(ierr)

print*,' elapsed time =',entim, ' my_rank=',my_rank

if (my_rank.eq.0)print*,'number of reps =', nreps

end






Note: For each send there must be a matching receive.


An MPI_SEND is blocking. The program call MPI_SEND and waits till an acknowledgement from the matching MPI_RECV. This can be helpful in synchronizing code.

I

But notice we had to complicate things by writing if statements for odd and even rank processors.

Else we would have had a hung code.


MPI is very rich. There are many ways to accomplish this same operation. e.g.


MPI_SENDRECV

FILE: ring2.f

c send messages right around a ring.

c

c This is simpler than ring.f

c This is in that it uses the MPI_SENDRECV operator.

c

c

c Input: none.

c

c See Chapter 3, pp. 41 & ff in PPMPI.

c

program greetings

c

include 'mpif.h'

c

integer my_rank

integer p

integer source, source2, right, left

integer dest,dest2

integer tag,tag2

integer root

character*100 message,message2

character*10 digit_string

integer size

integer status(MPI_STATUS_SIZE)

integer ierr

integer i,nreps

real*8 startim,entim

c

c function

integer string_len

nreps = 10000

c

call MPI_Init(ierr)

c

call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)

call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)

startim = MPI_Wtime()

do i=1,nreps

call to_string(my_rank, digit_string, size)

message = 'Greetings from process ' // digit_string(1:size)

+ // '!'

if (my_rank.ne.p-1) then

right = my_rank+1

else

right = 0

endif

if (my_rank.ne.0) then

left = my_rank-1

else

left = p-1

endif

root = 0

c Send to the right -- receive from the left.

call MPI_SENDRECV(message, string_len(message), MPI_CHARACTER,

+ right, 0 ,

+ message2, 100 , MPI_CHARACTER,

+ left, 0 ,

+ MPI_COMM_WORLD, status, err)

call MPI_Get_count(status, MPI_CHARACTER, size, ierr)

c print*,'my_rank=',my_rank

c write(6,101) message2(1:size),my_rank

c101 format(' ',a,' my_rank =',I3)

end do

entim = MPI_Wtime() - startim

c

call MPI_Finalize(ierr)

print*,' elapsed time =',entim, ' my_rank=',my_rank

if(my_rank.eq.0) print*,' nreps =',nreps

end


The syntax is as follows

1st argument, buffer (variable) to be sent --input

2nd argument, integer number of items in buffer (think vector) --input

3rd , data type--different Fortran and C bindings –input

4th , dest -- integer rank of processor to which message is sent

input

5th tag1 -- integer message tag –input

6th receive buffer (name unchanged

on output contains received data)

7th integer upper limit on number of received items --input

8th MPI data type --input

9th integer source—rank of processor sending message –input

10th integer tag of received message – input

11th name of communicator – input

12th status integer vector (output)

13th integer ierr (output)

Timings per send receive were 1.72e-3 seconds. For either paired

sends and receives or for an MI_SENDRECV

There is also an MPI_SENDRECV_REPLACE that uses the same

send and receive buffer.



To get the basic MPI set completed, we'll also need global communications, non-blocking communications and ...




A chart of times for passing to the right


paired send and receives, 1.72e-3 seconds

nonblocking (ISENDS and IRECVs) 1.61 e-3 seconds

by a SENDRECV 1.72e-3 secs

(but an interchange via two send receive calls was 1.e-4 ? )

pass to right by one-sided (persistent) .6 e-5 secs


>BASIC CONSIDERATIONS OF MESSAGE PASSING.


Compared to a memory access or to a computation, passing messages is expensive. Passing a message is more like accessing a hard drive. Just as a program that overflows RAM will "thrash", so a program that does fine-grained communication will run very slowly. It is very easy to write parallel programs that run more slowly than serial ones.


An add requires O(1.e-9) secs in register

O(1.e-8) secs in L2 cache

O(1.e-7) secs in RAM

Other operations.

O(1.e-6 to 1.e-7) secs for a subroutine call--or local MPI call

such as MPI_PACK or MPI_UNPACK

O(1.e-4 to 1.e-5) secs for an MPI_SEND message

O(1.-2 ) secs access data from hard drive


So obviously, we want to make sure that when an add or multiply is be performed that we don't have to wait for a RAM, MPI, or hard drive fetch. So it makes sense to model communication. A simple model is


T_c = (time to start a message) + (bytes in a message)*(time/byte)

Or


T_c = T_s + L * t_b


where T_c is the time to start a message, typically 1.e-5 to 1.e-4 secs depending . t_b is the time to pass a byte is 1.e-8 secs for gigabit ethernet or myrinet. If we want to have the message time T_c in terms of the number of clock cycles or flops we would optimistically have


T_c = 1.e4 + (bytes in a message) * 200


So we have to pass at least 1Kbyte-10Kbytes before the time start the message is less than half the message time.

MESSAGES. IT IS VERY EASY TO MAKE CODES RUN SLOWER IN PARALLEL THAN IN SERIAL. THE MOST COMMON BEGINNER ERROR IS TO PASS LOTS OF SMALL MESSAGES


So if we want to make a program slow, all we need do is pass lots of messages.


On the other hand, there is hope. If we can ration the number of messages to be received, then we have a good chance of getting a good parallel efficiency.


Thinking about data locality is an intrinsic part of getting good parallel code. It does seem a shame to burden users with it. But of course, it’s also what you have to do to get good performance in serial computing. After all, it’s less expensive to get data from another processor’s RAM than it is to get information from a local hard drive. So if your problem is too big to fit in RAM, you’ll probably get better performance by using more than one CPU allotment of RAM.


The discipline of considering data locality also enables more efficient serial code. For example, in serial computing, advertised flop rates are obtained only when data in cache can be reused. Bus speed and cache size are often more important than CPU speed. For accessing data in RAM, we get a formula in clock cycles like


T_a = 200 + (number of bytes) * 50


The computer tries to hide the 200 from you by bringing data into cache in blocks, but if you access the data in the wrong order (e.g., a matrix row wise instead of columnwise), you can get factors of 10 or more slow downs.


I’ll present an example for which accessing data from another allows more efficient use of cache memory, (perhaps) enabling superlinear speedup.


Reducing both the total "volume" and the "number" of messages will speed computations.


Successful parallel computations use "local" data for computations, requiring only periodic "global" refreshing, thereby keeping the global message volume small. Physical partitions are laid out to minimize the ratio


surface area/ volume.


Arctic bears weigh 1200 pounds, Florida bears weigh 250 pounds. Fully utilize RAM on one processor and just update the data on regions which are influenced by data resident on other processors.


Successful parallel computations find ways to overlap computation and communication. For instance, use non-blocking communications. We'll explore these later.


Not only does one minimize the volume of communication, but also the number of communications should be minimized. Short messages are "packed" into longer messages.


For example, we could pack integer values in a vector, floating point values in another vector, character values in a third vector. Three calls to MPI_PACK can pack the three vectors can be packed into a buffer of type


MPI_PACKED


and sent in one message. Corresponding MPI_UNPACK calls after an MPI_RECV call can unpack the vectors. The initial integer vector can give instructions as to how many elements are to be unpacked.


Example. The following version of the Monte Carlo code packs up data into one message.

FILE: monte3.f

c

c Template for a Monte Carlo code.

c

c The root processor comes up with a list of seeds

c which it ships to all processors.

c

c In a real application, each processor would compute

c something and send it back. Here they compute

c a vector of random numbers and send it back.

c

c This version uses a single BCAST to distribute

c both integer and double precision data. It packs

c integer vectors as well as double precision data

c into an MPI_PACKED buffer.

c This illustrates the use of MPI_PACK and MPI_UNPACK commands.

c

program monte3

c

include 'mpif.h'

c

integer my_rank

integer p

integer source

integer dest

integer tag

integer iseed, initseed, initvec(200)

integer status(MPI_STATUS_SIZE)

integer ierr

integer i , j , root

integer isize, position, kquad, nj

integer itemp(100), buf(4000)

real*8 ans(10), ans2(10) , temp(100)

real*8 startim,entim

c

c

call MPI_Init(ierr)

c

call MPI_Comm_rank(MPI_COMM_WORLD, my_rank, ierr)

call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)

startim = MPI_Wtime()

c

do j = 1,100

if (my_rank.eq.0) then

iseed = 2

initvec(1) = random(iseed)

do i=2,p

initvec(i) = int(random()*1000)

end do

ISIZE = ( 26 + NJ + 2*KQUAD )

POSITION = 0

nj = 3

kquad = 4 ! actually these might have been read from a file

itemp(1) = isize

itemp(2) = kquad

do i=1,100

temp(i) = 1. ! more realistically we would read from a file

end do

CALL MPI_PACK ( ITEMP, 2, MPI_INTEGER, BUF, 4000,

+ POSITION, MPI_COMM_WORLD, IERR)

c This call packs the integer vector itemp of length 2 into buf

c starting at position 0. It increments position.

c --------------------------------------------------------------

c See below to see how to unpack.

c MPI_PACK and MPI_UNPACK are good for reducing the number of

c total calls. These calls will allow us to pass multiple

c messages for the latency of one. They are flexible in

c that the length of the unpack can be part of the message.

c The time of the calls to pack and unpack is not significant

c compared to the time to pass a message.

c

c Disadvantage: On the Compaq AlphaServerSC the packing turned

c out to be pretty loose, i.e., there were empty bytes. So combining

c short messages would speed things, but long messages might get

c enough longer that they would take more time.

c

c --------------------------------------------------------------

CALL MPI_PACK ( initvec, 10, MPI_INTEGER, BUF, 4000,

+ POSITION, MPI_COMM_WORLD, IERR)

CALL MPI_PACK ( TEMP, 100,

+ MPI_DOUBLE_PRECISION, BUF, 4000,

+ POSITION, MPI_COMM_WORLD, IERR)

endif

c pack up data into one message


root = 0

call MPI_BCAST(BUF,4000,MPI_PACKED,0,

+ MPI_COMM_WORLD,IERR)

call MPI_BARRIER(MPI_COMM_WORLD,ierr)

IF ( my_rank.NE.0 ) THEN

POSITION = 0

C for the unpack, reset position to 0. The unpack order is

c the same as the pack order. But the order of arguments

c is changed.

c The call below unpacks integer vector itemp of length 2 from

c the BUF buffer.

c unpack itemp

CALL MPI_UNPACK(BUF, 4000, POSITION, ITEMP, 2, MPI_INTEGER,

+ MPI_COMM_WORLD,IERR)

isize = ITEMP(1)

kquad = ITEMP(2)

c unpack initvec

CALL MPI_UNPACK (BUF, 4000, POSITION, initvec, 10,

+ MPI_INTEGER, MPI_COMM_WORLD, IERR)

myseed = initvec(my_rank)

c unpack temp


CALL MPI_UNPACK ( BUF, 4000, POSITION, TEMP, 100,

+ MPI_DOUBLE_PRECISION, MPI_COMM_WORLD,IERR)

ENDIF


c-----------------------------------------------------------------------

c Left out -- a body of code that does a bunch of particle tracking

c stuff to produce the double precision vector ans

c-----------------------------------------------------------------------

call MPI_BARRIER(MPI_COMM_WORLD,ierr)

ans1 = rand(myseed)

do i=1,10

ans(i) = rand() ! at least we initialize stuff to send back.

! But this call is something I had to change to get the code to

! run here.

end do


call MPI_REDUCE (ans,ans2, 10, MPI_DOUBLE_PRECISION,

+ MPI_SUM, root, MPI_COMM_WORLD, ierr)

c-------------------------------------------------------

c Get the (sum of) data back

c ans -- arg1 -- message sent from each processor

c ans2 -- arg2 -- result deposited on root -- out

c 10 -- arg3 -- length of message

c MPI_DOUBLE_PRECISION --arg4 - data type

c MPI_SUM -- arg5 -- operation performed by reduce

c root -- arg6 -- reduce deposits answer on root

c same on all processors

c MPI_COMM_WORLD -- arg7 -- all procs must have same communicator

c ierr -- arg8 -- integer error--out (only in Fortran)

c------------------------------------------------------

call MPI_BARRIER(MPI_COMM_WORLD,ierr)

if(my_rank.eq.0) then

c do some stuff to process and output ans2

endif

end do

entim = MPI_Wtime() - startim

c

call MPI_Finalize(ierr)

print*,' elapsed time =',entim, ' my_rank=',my_rank

end

cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc

The MPI_PACK and MPI_UNPACK calls are quite fast compared to the time to start a communication. The main drawback is that the MPI_PACKED data form may include empty bytes, so that for example a four byte integer might end up occupying 16 bytes in the MPI_PACKED message. So the packing of messages will help with latency but not bandwidth.


The most general MPI data type is the

MPI_TYPE_STRUCT -- the most general fixed MPI "derived data type" allows multiple types of data entries.

This MPI data structure is loosely modeled on C structs (but allows other data types than just integers). The MPI_TYPE_STRUCT may suffer from the same loose packing as the MPI_PACKED structure. For the MPI_STRUCT an MPI_Type_Commit is required as a declaration.


The MPI_Type_Commit operation requires more time than an MPI_PACK but may be worthwhile if the same structure will be used repeatedly.

The following data types also require an MPI_Type_Commit as a declaration.


MPI_VECTOR -- elements of a single type, allowing a stride.

MPI_INDEXED -- elements of a single type, with a variable indexed stride, e.g., to pass the upper triangular part of a matrix.

MPI_TYPE_HVECTOR – if we want to compute the stride in bytes.


There is some expense in committing a data type, but it only occurs once, so is worthwhile if the same message type is to be frequently reused. Production codes I’ve seen used the VECTOR, INDEX, HVECTOR constructs. These are supported in MPI-2 IO. MPI_TYPE_STRUCT

Is not.


Another "hacker" option.


Some MPI programmers just pack all their integer and double precision data into a single double precision vector and rely on type conversion. And plausibly you could also pack your character data into a double precision vector since you’re writing the “decode”.




Practical limitations of the


T_c = T_s + L * t_b


model are that it neglects:

1) noise in the system (e.g. processors have other tasks such as heartbeat, spare daemons) - T_s is sporadically large.

2) network saturation. Causes T_b to be sporadically large.

Both 1&2 can cause slower communications and motivate use of non-blocking sends and receives to allow overlapping of communication.


An advantage of the


T_c = T_s + L * t_b


model is we can model communication time via pen and paper before writing a large program. For instance we can conclude that laying out matrix data from square blocks will give total message lengths $O(n /sqrt(p))$ for p processors. Where column blocks might give

$O(n)$


EXERCISE:

Estimate T_c by repeatedly passing short messages, and L by passing long messages. Note MPI comes with a good wall clock timer MPI_Wtime() which returns a double precision number and usually times in microseconds.


startim = MPI_Wtime()


stuff to time, e.g., 1000 repetitions of some MPI call.


entim = MPI_Wtime() - startim


Then if we can track down all the communications and already know how long the computation will take in serial, then we can estimate parallel performance.


EXERCISE:

What parallel computation are you interested in? How long does it take in serial? Can you estimate how much communication is required? Can you predict parallel performance?


A typical 2-D parallel application may have communication volume O(sqrt(volume on a processor)) * log(number of processors) where the computations go as O(volume on a processor) So the total time is


T = O(sqrt(V)*log(P)) + O(V)


and the parallel efficiency is


E = O(V)/(( O(sqrt(V)*log(P) + O(V) )


which can be close to the ideal of one if O(V) sufficiently large compared to O(sqrt(V)). But slowness of communication puts a high constant on O(sqrt(V))


Example:

Consider matrix vector multiplication. Suppose we'll multiply an $n x n$ matrix $A$ times an n-vector $x$ and return the vector $x$ to the root processor.

Assume that $A$ is already distributed to all processors (communicating $A$ to all processors would require more time than computing $Ax$ on one processor). In particular, assume that $A$ is distributed with $n/p$ columns per processor.


A = [A_1 | A_2 | ... | A_p ]


so that A_i has n/p columns. Partition x with n/p elements per block

x = [x_1 | x_2 | ... | x_p ]


Then


A*x = A_1*x_1 + A_2*x_2 + ... + A_p*x_p


can be performed as

1) scatter x_i to processor i, i=1:p-1 (MPI_SCATTER)

2) In parallel perform w_i <---A_i*x_i , i=1,p-1 -- computations on each node.

3) perform an MPI_REDUCE to add all the w_i to get x on the head node.


Communication costs are mainly for the reduce, which is passing a vector of length n, log(p) times, i.e., 8*n*log(p)*t_b compared to 2*n*n/p flops for each of the A_i*x_i


so if t_b is flops/byte (how many flops can be performed in the time it takes to pass an additional byte of a message) the parallel efficiency should be


E = (2*n*n/p)/ (2*n*