How do I use a Debugger on the HPC Machines?
An updated discussion
November 5, 2008.
This discussion is meant to give users an idea how to use debuggers and
how to apply them in parallel, but it is by no means complete.
One on-line tutorial is Norm Matloff's Debugging Tutorial. Of course, you can find many
other on-line tutorials, e.g., gdb for C and C++. If you have more questions on how to
use debuggers on the NC State blade center, please contact
me (gary_howell@ncsu.edu).
What Does a Debugger Do?
One way to debug Fortran or C code is to write print statements
and recompile and
rerun. For instance, if you have just changed a bit of code
and want to make sure that the new code executes as you think, you might
print variables to see if the code modifies them in the way you predict.
Or if, after adding or changing a subroutine, you find that the code fails to
execute correctly, you might put print statements at the start of
the subroutine to verify that variables are passed correctly.
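As a minimal sketch of the print-statement approach in C (the routine name foosub and its arguments are hypothetical, chosen only for illustration), a temporary print at the top of a routine confirms what was passed in:

```c
#include <stdio.h>

/* Hypothetical routine that was just changed; names are illustrative. */
double foosub(int n, const double *x)
{
    /* Temporary check that the arguments were passed correctly.
       Moving or removing this line means recompiling and rerunning. */
    fprintf(stderr, "foosub: n = %d, x[0] = %g\n", n, n > 0 ? x[0] : 0.0);

    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i];
    return sum;
}
```

Each new question about the code's state needs another print and another compile, which is the cost a debugger avoids.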
Using a debugger allows you to accomplish these tasks without
repeatedly recompiling. So if you've had to change hundreds of
lines of code without good test cases for each few lines,
and want to monitor the code behavior line by line, perhaps
comparing to a known test case, using a debugger
can be helpful. Learning to use a debugger may be useful
either for your own future projects or in aiding colleagues.
Stepping through programs.
A debugger allows you to step through a Fortran or C program.
At each step the program listing is displayed and before going on,
you can check current values of program variables. Before starting
program execution under the debugger, the user specifies one or
more break points. On command,
the program runs till the first break point. The user can then go on step by
step or can set a new break point and ask the program to continue
execution to the next break. All the debuggers
described below can be used in this fashion.
Examining core files.
When code execution fails, a core file is created, typically called
"core" or "core.jobnumber".
The core file is in binary format, so it is not viewable with an editor.
Assuming the code has been compiled with the -g flag, a debugger lets
you examine a core file to see which subroutine crashed
and at what line, and
which routine called it (and so on through the whole stack).
You can also print values of program variables at each
level of the stack. dbx on the IBM shared memory machine works well for examining core files.
On the blade center, Totalview allows examination of
core files; I have not succeeded in examining Fortran core files using pgdbg or gdb, but
suspect it may be possible.
Compiling Code So You Can Use a Debugger
The program should be compiled with
the -g flag, which constructs a symbol table that allows line by line
stepping through the source code. Also
turn off -O2 and all other optimizations. Compiler optimizations are quite a nice set of tricks, but they usually work by rearranging the order of operations, so they make it hard for the debugger to correlate program lines with code execution.
What Debuggers are Available?
On the Linux Blade Center, the Portland Group C and Fortran compilers work with
the gdb and pgdbg debuggers.
Totalview works with Intel and Portland Group compiled codes, as well as with
gnu codes. Below we describe a method of using the gdb debugger (or ddd) in
parallel. It is also possible to use the pgdbg and Totalview debuggers in
parallel.
On the shared memory IBM machines, the IBM supplied dbx and pdbx debuggers
work well with IBM xlf and xlc
compilers. dbx is a good serial debugger and pdbx works well in parallel.
dbx works well with core files.
Debuggers on the Linux BladeCenter
On the Linux blade center, the gdb, pgdbg, and Totalview debuggers are available.
The GDB Debugger
The pgdbg Debugger
The Totalview Debugger
The GDB Debugger
GDB is a classic open source program developed by Richard
Stallman. A GUI front end for gdb is called ddd. By making
small modifications to your code, you can also use gdb to debug parallel MPI jobs.
If you learn gdb (or ddd), you can use them on almost any Linux
based system. gdb works well with codes compiled with gcc, g++
or gfortran. In the past, it also worked well with PGI compilers,
but I have not verified that recently.
>info gdb
gives a complete and fairly easy to follow set of instructions.
If X11 forwarding works so that you can pop a GUI,
>ddd
brings up a ddd session that includes a "help" button.
For debugging
purposes, compile with the -g flag and no optimization (optimizing can
confuse things by rearranging code execution order). For example,
>gfortran foo.f -g -o foo
compiles foo.f to produce the executable file foo, where the -g preserves
the symbol table in such a way that the debugger can step through the
source code, listing the current code line. Typically at run time,
one sets a break point, lets the code execute
to that point, then steps it through a suspect section of code, observing
variables to see where they go astray.
>gdb ./foo
starts a gdb session attached to the executable foo. Similarly
>ddd ./foo
brings up a ddd GUI based version of gdb, which lets you do more
with a mouse, but which also has a window which allows the
commands given here to work.
Suppose that you know the code's problem is in SUBROUTINE FOOSUB.
At the prompt one can enter (note the trailing underscore appended to the Fortran subroutine name),
gdb>break foosub_
or
gdb>b foosub_
Then entering
gdb>run
or
gdb>r
will run the code until it enters SUBROUTINE FOOSUB.
(Actually, I've often found that the code misses the break point
at foosub_ the first time and has to be run again.)
gdb>n
(short for 'next')
steps through the executable a line at a time, stepping past
a subroutine or function call in one step. To step into a subroutine,
use
gdb>s
(short for 'step'). If ivar is a variable inside foosub,
gdb> print ivar
or
gdb> p ivar
will display the current value of ivar. Suppose that A is a two
dimensional matrix. Then
gdb> print a(2,3)@5
would print five consecutive elements from memory, starting at a(2,3).
In Fortran storage these are consecutive entries from a
column. A peculiarity of gdb is that this notation
only works in the main program. Inside subroutines, Fortran arrays
appear as a vector starting with position 0, stored by
columns. So if A has leading dimension lda and A is being used
to store a matrix of m rows and n columns,
gdb>p a(0)@m
would print the first column of A.
gdb>p a(lda)@m
would print the second column of A.
gdb does not seem to have a good way to print a section of
a Fortran matrix row (in C matrix rows are stored consecutively, so gdb
would easily display a matrix row). So a Fortran row would have to be
displayed one print statement at a time (where in pgdbg you could use
matlab notation to print a matrix row).
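The zero-based vector view follows from Fortran's column-major storage. As a minimal sketch in C (the helper names colmajor_offset and print_row are hypothetical, and lda is assumed to be the declared leading dimension), element a(i,j) sits at offset (i-1) + (j-1)*lda, so the entries of one row are lda positions apart:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Zero-based offset of Fortran element a(i,j) (1-based i and j) in a
   column-major array with leading dimension lda. */
size_t colmajor_offset(size_t i, size_t j, size_t lda)
{
    assert(i >= 1 && j >= 1 && i <= lda);
    return (i - 1) + (j - 1) * lda;
}

/* Print row i of a matrix with n columns, one strided element at a
   time; this mirrors issuing one gdb print command per element. */
void print_row(const double *a, size_t i, size_t n, size_t lda)
{
    for (size_t j = 1; j <= n; j++)
        printf("a(%zu,%zu) = %g\n", i, j, a[colmajor_offset(i, j, lda)]);
}
```

With lda = 4, colmajor_offset(1, 2, 4) is 4, so the second column begins at vector position lda, and a(2,3) is at position 9.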
Once you're stepping through foosub, and want to leap to a breakpoint at line
1142, you can set a new breakpoint.
gdb>break 1142
and jump to it by
gdb> cont
(provided your code would execute this line).
One way to tell where to put the next breakpoint
is by opening another xterm with an edit session of the source
code. Find the line number
you want (in vi, you would park the cursor on the line you want
and ascertain its line number by typing :.= ), say 1311, then
gdb> break 1311
would put a break at that line. In ddd, the source file you are debugging
is displayed as part of a split screen in which you can scroll
up and down, so the spare xterm is not quite as necessary,
though the spare xterm may still be convenient.
gdb> l 1311
lists lines around 1311 in the command line window.
gdb> quit
exits gdb.
Of course, this is just a start on how to use a debugger, but it
should give a hint of how a debugger can save the time spent recompiling
just to put in print statements.
The pgdbg Debugger
The pgdbg debugger uses most of the conventional dbg debugger commands.
For some on-line documentation, see the Portland Group user guide.
The sample session for gdb will also work for pgdbg, where
the session is initiated by
>pgdbg ./foo
Displaying a slice of a matrix is a bit easier in pgdbg. While the gdb
notation still works, the easier column slice
pgdbg> print a(2:6,3)
and row slice
pgdbg> print a(2,3:5)
notations are also available.
Pgdbg has man pages. Help is available from within the
debugging sessions by typing
pgdbg> help
The Totalview Debugger
We have a license for the Totalview debugger.
It works well with Intel ifc compiled codes.
A Totalview tutorial is available at Totalview Tutorial.
Debuggers on the IBM p590
To use the dbx debugger, first produce an executable foo.exe by compiling
with the IBM Fortran or C compilers and the -g flag.
>xlf90 -o foo.exe -g foo.f
You can start a debug session by
>dbx ./foo.exe
Breakpoints are set by giving the name of the source file and a line number.
(dbx) stop at "foo.f":1169
This will set a break at line 1169 of foo.f.
The syntax
(dbx) print a(1,2)
is valid, but there seems to be no way to show a slice of a matrix.
Another problem can be that though scalar variables print quickly, there
can be a long delay in printing elements of a matrix.
(dbx) cont
continues execution of the code to the next breakpoint.
One virtue of the dbx debugger is convenience of examining core files.
Suppose that a -g compiled code foo runs and dumps a core
> foo
Segmentation fault - core dumped
To investigate the error,
>dbx foo
Dbx reports the line where the dump occurred.
You can examine the stack (what program called the subroutine
and what program called that routine, and so on), and can print
variables on each level of the stack.
(dbx) up
(dbx) down
move up and down the stack respectively.
(dbx) quit
exits dbx.
The p590 has a long man page for dbx which includes example sessions.
Parallel Debuggers on the Linux BladeCenter
The pgdbg, Totalview, and gdb debuggers should each work in parallel.
The description here is of using gdb (or ddd) on a VCL node.
MPI (Message Passing Interface) is typically used for distributed
memory codes, i.e., each processor has its own memory and communicates
with other processors by sending and receiving messages. For purposes
of debugging with gdb, we've implemented a shared memory version of MPI.
Here the messages are passed within a single shared memory node. It's possible to start more processes than the number of physical cores.
The mpirun call starts up a process. When an MPI_Init call is encountered,
that root process starts up the requested number of new processes. By
putting a pause after the MPI_Init, we can identify the new processes and
attach each new process to a gdb session. Then we can step through
each of the processes individually.
Here's an example. From one of the 32 bit login nodes,
source /home/gwhowell/mpiches/mpich-1.2.7p1/gnu32/ch_shmem/gnu32sh.csh
This sets up the mpif77, mpicc, mpicxx commands to use the gnu32
library. Compile a simple MPI code, e.g., the monte.f code
from the MPI short course. An excerpt is as follows.
real*8 ans(10), ans2(10)
real*8 startim, entim, sum, sindex
c
c function
integer string_len
iflag = 1
c
call MPI_INIT(ierr)
do while (iflag.eq.1)
end do
c
call MPI_COMM_SIZE(MPI_COMM_WORLD, p, ierr)
call MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
* print*,' I am ', my_rank
c
if (my_rank.eq.0) then
cc print*,'input random seed'
Notice the "do while" line just after the MPI_INIT. This line stalls
the code indefinitely (taking 100% of the cycles on some core). For this
reason it's a good idea to debug on a vcl node (go to vcl.ncsu.edu and
get an HPC linux image) as opposed to login01 or login02.
Typing
>mpirun -np 2 ./monte
starts up the code. Doing "ps -ef | grep monte"
[gwhowell@login02 ~]$ ps -ef | grep monte
gwhowell 25532 25386 0 14:39 pts/64 00:00:00 /bin/sh /home/gwhowell/mpiches/mpich-1.2.7p1/gnu32/ch_shmem/bin/mpirun -np 2 ./monte
gwhowell 25560 25532 6 14:39 pts/64 00:01:42 /home/gwhowell/ppmpi_f/chap03/./monte
gwhowell 25561 25560 10 14:39 pts/64 00:02:40 /home/gwhowell/ppmpi_f/chap03/./monte
shows that 2 monte jobs have started up. In another window,
gdb
gdb> attach 25560
gdb> set iflag = 0
and in yet another window
gdb
gdb> attach 25561
gdb> set iflag = 0
bring up gdb sessions attached to these two processes. Setting iflag
to 0 in each session pulls the process out of the infinite loop. Repeatedly
typing "n" will then step through a process, so we now have two parallel
gdb debug sessions attached to the two processes. A similar approach
is outlined in Parallel debugging. (The gdb attach syntax given there does not quite work on the blade center.) The link shows syntax for stalling C codes
while the debugger is attached.
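For a C code, the same stall can be sketched as follows. This is an illustration under stated assumptions, not the method from the linked page. The helper name wait_for_debugger and the timeout are hypothetical additions, and the global iflag mirrors the Fortran variable above. The helper would be called right after MPI_Init. It prints the process id, so the "ps" step is optional, and it sleeps between checks rather than spinning at 100% of a core.

```c
#include <stdio.h>
#include <unistd.h>

/* Cleared from the debugger to release the process. Declared volatile
   so the compiler rereads it on every pass through the loop. */
volatile int iflag = 1;

/* Hypothetical helper: call right after MPI_Init(&argc, &argv).
   Waits until a debugger sets iflag to 0 or max_seconds elapse.
   Returns 0 if the flag was cleared, 1 on timeout. */
int wait_for_debugger(int max_seconds)
{
    fprintf(stderr, "pid %ld waiting for debugger\n", (long)getpid());
    for (int waited = 0; iflag != 0; waited++) {
        if (waited >= max_seconds)
            return 1;
        sleep(1);   /* yield the core instead of busy-waiting */
    }
    return 0;
}
```

After attaching gdb to the printed process id, "set var iflag = 0" releases the process, and "n" then steps through it as in the Fortran session. As before, compile with -g and without optimization.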
You may prefer to start up one or more of the gdb sessions with "ddd". Its
GUI can be useful. For example, when I tried
gdb>p my_rank
gdb did not know of any such local variable. By clicking the "display" button in ddd,
I was able to find a local variable called my_rank__ and print that.
Totalview debugging can be accomplished in much the same way as outlined
here (but so far it works only for Fortran and C codes, not for C++).
For the totalview debugger, try
source /home/gwhowell/mpiches/mpich-1.2.7p1/intel32/ch_shmem/tv32.csh