The pgdbg, TotalView, and gdb debuggers should each work on parallel codes. This section describes using gdb (or its graphical front end, ddd) on a VCL node.
MPI (Message Passing Interface) is typically used for distributed-memory codes, i.e., each processor has its own memory and communicates with other processors by sending and receiving messages. For debugging with gdb, we've set up a shared-memory version of MPI, in which the messages are passed within a single shared-memory node. It is possible to start more processes than there are physical cores.
The mpirun call starts up a process. When an MPI_Init call is encountered, that root process starts up the requested number of new processes. By putting a pause after the MPI_Init, we can identify the new processes and attach each new process to a gdb session. Then we can step through each of the processes individually.
Here's an example. From one of the 32-bit login nodes, run
source /home/gwhowell/mpiches/mpich-1.2.7p1/gnu32/ch_shmem/gnu32sh.csh
This sets up the mpif77, mpicc, mpicxx commands to use the gnu32
library. Compile a simple MPI code, e.g., the monte.f code
from the MPI short course. An excerpt is as follows.
      real*8 ans(10), ans2(10)
      real*8 startim, entim, sum, sindex
c
c     function
      integer string_len
      iflag = 1
c
      call MPI_INIT(ierr)
c     stall here until a debugger sets iflag to 0
      do while (iflag.eq.1)
      end do
c
      call MPI_COMM_SIZE(MPI_COMM_WORLD, p, ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
*     print*,' I am ', my_rank
c
      if (my_rank.eq.0) then
cc       print*,'input random seed'
Notice the "do while" loop just after the MPI_INIT call. The loop spins
indefinitely (taking 100% of the cycles on some core) until iflag is changed
from the debugger. For this reason it's a good idea to debug on a VCL node
(go to vcl.ncsu.edu and get an HPC Linux image) rather than on login01 or login02.
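Before the code is run, it has to be compiled so that gdb can follow it. The following is a minimal sketch, not the exact build line used for the short course. It assumes a Bourne-style shell (sh or bash, not csh), that the source file is named monte.f, and that the mpif77 wrapper passes -g and -O0 through to the GNU compiler. The MPIF77 variable and the build_monte function are names made up for this illustration.

```shell
# Compiler wrapper put on the PATH by the gnu32sh.csh setup script.
# MPIF77 is a hypothetical override hook for this sketch.
MPIF77=${MPIF77:-mpif77}

# Build monte.f with debug symbols (-g) so gdb can show source lines and
# variables, and without optimization (-O0) so the stall loop keeps
# re-reading iflag from memory instead of being optimized into a fixed loop.
build_monte() {
    "$MPIF77" -g -O0 -o monte monte.f
}

# Usage (assumed file names): build_monte && mpirun -np 2 ./monte
```

Leaving out -g generally still lets gdb attach, but stepping by source line and printing variables by name will not work as described in this section.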
Typing
>mpirun -np 2 ./monte
starts up the code. Doing "ps -ef | grep monte"
[gwhowell@login02 ~]$ ps -ef | grep monte
gwhowell 25532 25386 0 14:39 pts/64 00:00:00 /bin/sh /home/gwhowell/mpiches/mpich-1.2.7p1/gnu32/ch_shmem/bin/mpirun -np 2 ./monte
gwhowell 25560 25532 6 14:39 pts/64 00:01:42 /home/gwhowell/ppmpi_f/chap03/./monte
gwhowell 25561 25560 10 14:39 pts/64 00:02:40 /home/gwhowell/ppmpi_f/chap03/./monte
shows that two monte processes have started up (PIDs 25560 and 25561; the first line is the mpirun wrapper script). In another window,
gdb
gdb> attach 25560
gdb> set iflag = 0
and in yet another window
gdb
gdb> attach 25561
gdb> set iflag = 0
bring up gdb sessions attached to these two processes. Setting iflag
to 0 in each session releases the process from the infinite loop. Repeatedly
typing "n" then steps through a process, so we now have two parallel
gdb debug sessions, one attached to each process. A similar approach
is outlined in "Parallel debugging". (The gdb attach syntax given there does
not quite work on the blade center.) That link also shows syntax for stalling
C codes while the debugger is attached.
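Picking the process IDs out of the ps listing by hand gets tedious with more than a few ranks. The sketch below prints one ready-to-paste gdb command line per rank. It assumes a Bourne-style shell and the `ps -ef` column layout shown above (PID in column 2, command in column 8). The function names rank_pids and attach_commands are made up for this illustration. It uses `set var iflag = 0`, which is gdb's unambiguous spelling of the `set iflag = 0` command used above, together with gdb's `-p` (attach to PID) and `-ex` (run a command at startup) options.

```shell
# Read `ps -ef` output on stdin and print the PIDs of processes whose
# executable is named $1. Comparing the last path component of column 8
# skips the mpirun wrapper (/bin/sh ...) and the grep process itself.
rank_pids() {
    awk -v prog="$1" '{
        n = split($8, parts, "/")
        if (parts[n] == prog) print $2
    }'
}

# Read PIDs on stdin and print one gdb command line per PID. Each command
# attaches to the process and clears iflag so the rank can leave the loop.
attach_commands() {
    while read -r pid; do
        printf "gdb -p %s -ex 'set var iflag = 0'\n" "$pid"
    done
}

# Usage: ps -ef | rank_pids monte | attach_commands
```

The commands are only printed, not run, since each gdb session needs its own window. Whether iflag is visible right after attaching depends on the process being stopped inside the main program, which is the likely case while it spins in the loop.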
You may prefer to start up one or more of the gdb sessions with "ddd", a
graphical front end to gdb, which can be useful. For example, when I tried
gdb>p my_rank
gdb did not know of any such local variable. By clicking the "display" button in ddd,
I was able to find a local variable called my_rank__ and print that.
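The same lookup can be done without ddd by asking gdb to list the local variables and filtering for the name. This is a rough sketch, assuming a Bourne-style shell; the GDB variable and the locals_like function are names made up for this illustration. It relies on gdb's `-batch` option (run the given commands, then exit and detach) and the `info locals` command.

```shell
# Debugger to run. GDB is a hypothetical override hook for this sketch.
GDB=${GDB:-gdb}

# Attach briefly to the process with PID $1, list the local variables of
# the frame it is stopped in, and keep only names starting with $2
# (case-insensitive), so a variant such as my_rank__ shows up when
# searching for my_rank.
locals_like() {
    "$GDB" -batch -p "$1" -ex 'info locals' | grep -i "^$2"
}

# Usage: locals_like 25560 my_rank
```

Because gdb detaches when the batch run ends, a process that is still in the stall loop keeps spinning afterwards.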
TotalView debugging can be accomplished in much the same way as outlined
here (but so far it works only for Fortran and C codes, not for C++).
For the TotalView debugger, try
source /home/gwhowell/mpiches/mpich-1.2.7p1/intel32/ch_shmem/tv32.csh