A few of the most useful commands are explained here, in particular bsub, bhist, bjobs, bqueues,
and bkill. Fuller explanations are on the man pages, e.g. type
man bsub
An executable file is submitted to run on the computational nodes
using the LSF command bsub. The options to bsub can be given
on the command line, but it is more convenient to put them in a script file.
Consider the following file bstry consisting of the lines
#! /bin/csh
source /usr/local/lsf/conf/cshrc.lsf
#BSUB -W 5
#BSUB -n 4
mpiexec ./pmonte
#BSUB -o /share/gwhowell/chap03/pmonte.out.%J
#BSUB -e /share/gwhowell/chap03/pmonte.err.%J
#BSUB -J pmonte
add intel
source /usr/local/apps/env/intel.csh
bsub < bstry
The lines of the bsub file are explained in turn.
"-W 5" asks for five minutes of time. The job
will time out after five minutes if still running.
"-n 4" asks for 4 processors.
"mpiexec ./pmonte" starts up copies of ./pmonte on 4 CPUs. Typing just ./foo, where foo is an executable file, would start up ./foo only on a "head" processor of the job. So, for example, you can use shell commands such as "cd" inside the bsub script.
"-o /share/gwhowell/chap03/pmonte.out.%J" denotes a standard
output file. Of course, I'm the only one with permission to
write to the directory /share/gwhowell/chap03, so a user with
username jsbach needs to do
cd /share
mkdir jsbach
mkdir share/jsbach
The "-e" line designates a standard error file.
The "-J" line gives a runtime name for the job.
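The directory named on the "-o" and "-e" lines must exist and be writable before the job runs. A minimal sketch of that check follows, written in Bourne shell rather than csh. The function name prepare_outdir and the chap03 layout are illustrative assumptions, and the example runs under a scratch directory instead of /share so that it can be tried anywhere.

```shell
#!/bin/sh
# Sketch: create a per-user output directory and print the matching
# "#BSUB -o/-e" lines. The function name and the chap03 subdirectory
# are illustrative, not a site convention.
prepare_outdir() {
    # $1 = root directory (on the cluster in the text this would be /share)
    dir="$1/${USER:-$(id -un)}/chap03"
    # -p creates missing parents and succeeds if the directory already exists.
    mkdir -p "$dir" || return 1
    # The job can only write its output if the directory is writable.
    [ -w "$dir" ] || return 1
    echo "#BSUB -o $dir/pmonte.out.%J"
    echo "#BSUB -e $dir/pmonte.err.%J"
}

# Tried here under a scratch directory so the sketch runs anywhere.
scratch=$(mktemp -d)
prepare_outdir "$scratch"
rm -rf "$scratch"
```

The two printed lines can be pasted into the bsub file in place of the gwhowell paths.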
A typical error file might contain
Warning: No xauth data; using fake authentication data for X11 forwarding.
Warning: No xauth data; using fake authentication data for X11 forwarding.
Warning: No xauth data; using fake authentication data for X11 forwarding.
Warning: No xauth data; using fake authentication data for X11 forwarding.
could not open
could not open
Warning: No xauth data; using fake authentication data for X11 forwarding.
Warning: No xauth data; using fake authentication data for X11 forwarding.
Warning: No xauth data; using fake authentication data for X11 forwarding.
which: no mpirun in (/home/gwhowell/bin:/home/gwhowell/R-2.4.0/bin/exec:/usr/local/lsf/6.1/linux2.4-glibc2.3-x86/bin:/home/gwhowell/bin:/home/gwhowell/R-2.4.0/bin/exec:/usr/local/lsf/6.1/linux2.6-glibc2.3-x86/bin:/usr/local/lsf/6.1/linux2.6-glibc2.3-x86/etc:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin:/usr/sbin/rsct/bin:/opt/xcat/bin:/opt/xcat/sbin:/opt/xcat/i686/bin:/opt/xcat/i686/sbin:/usr/X11R6/bin:/usr/lpp/mmfs/bin:/usr/sbin/rsct/bin:/opt/xcat/bin:/opt/xcat/sbin:/opt/xcat/i686/bin:/opt/xcat/i686/sbin)
/usr/local/lsf/6.1/linux2.4-glibc2.3-x86/bin/mpichp4_wrapper: line 465: -machinefile: command not found
Mar 8 11:59:27 2007 21495 3 6.1 PAM: An error occurred starting the PJL.
Another common error is SIGPIPE. It usually indicates a problem with finding a file or a path, or with having the necessary permissions to write a file. p4_ errors typically indicate problems with MPI calls.
Asking for four or fewer processors and fifteen minutes or less of time allows a job to run in the debug queue with a high priority, so turnaround is typically quite fast. This lets you catch your silly mistakes in small sample cases. As above, these options are specified by
#BSUB -n 4
#BSUB -W 15
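LSF only acts on comment lines that begin with #BSUB. Any other line starting with # is an ordinary shell comment, so a misspelled directive is not reported and its option is silently lost. The sketch below, in Bourne shell, flags such lines before submission. The function name check_directives and the sample script are hypothetical.

```shell
#!/bin/sh
# Sketch: flag comment lines that look like LSF directives but are not
# spelled "#BSUB". The shell treats them as plain comments.
check_directives() {
    # Prints "line N: text" for each suspicious line; prints nothing if all are fine.
    awk '/^#B[A-Z]+[ \t]/ && !/^#BSUB[ \t]/ { printf "line %d: %s\n", NR, $0 }' "$1"
}

# Hypothetical script with one misspelled directive on its third line.
script=$(mktemp)
cat > "$script" <<'EOF'
#! /bin/csh
#BSUB -n 4
#BUSB -W 15
mpiexec ./pmonte
EOF

check_directives "$script"
rm -f "$script"
```

Running check_directives on a bsub file just before "bsub < file" costs nothing and avoids a job that quietly runs with default limits.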
A typical debug procedure is to open a few windows: one for compiling code, another for modifying the bsub file and submitting jobs, and a third for looking at files in the output and error directory. The "add" command is needed in the window from which jobs are submitted. Typing
ls -lrt
in the output directory lists the files with the most recently modified last, so the newest output and error files appear at the bottom.
Having submitted an LSF job, you can type bhist or bjobs at the command line. "bhist -l" or "bjobs -l" will give some more verbose output.
[gwhowell@login02 chap03]$ bhist
Summary of time in seconds spent in various states:
JOBID   USER    JOB_NAME  PEND    PSUSP   RUN     USUSP   SSUSP   UNKWN   TOTAL
582122  gwhowel *.err.%J  5       0       0       0       0       0       5
[gwhowell@login02 chap03]$ bhist
Summary of time in seconds spent in various states:
JOBID   USER    JOB_NAME  PEND    PSUSP   RUN     USUSP   SSUSP   UNKWN   TOTAL
582122  gwhowel *.err.%J  6       0       1       0       0       0       7
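When watching a job, the PEND and RUN columns are usually the ones of interest. A small sketch in Bourne shell that picks them out of the bhist summary is given here. The function name bhist_pend_run is hypothetical, and the column positions are taken from the header shown above, so a job name containing spaces would shift them.

```shell
#!/bin/sh
# Sketch: pull the PEND and RUN columns out of a bhist summary.
# Column positions follow the header shown in the text:
# JOBID USER JOB_NAME PEND PSUSP RUN USUSP SSUSP UNKWN TOTAL
bhist_pend_run() {
    # Rows of interest start with a numeric job ID.
    awk '$1 ~ /^[0-9]+$/ { printf "job %s pend=%s run=%s\n", $1, $4, $6 }'
}

# Sample input copied from the second bhist call above; on the cluster,
# replace the here-document with:  bhist | bhist_pend_run
bhist_pend_run <<'EOF'
Summary of time in seconds spent in various states:
JOBID   USER    JOB_NAME  PEND    PSUSP   RUN     USUSP   SSUSP   UNKWN   TOTAL
582122  gwhowel *.err.%J  6       0       1       0       0       0       7
EOF
```

For the sample input this prints "job 582122 pend=6 run=1".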
If a job has started, you can type "bpeek" to see the standard output and standard error so far produced. Alternatively you can "cat" or "more" the output and error files.
Typing bqueues on the command line will give you a summary of how many jobs are running and pending.
To list the jobs of all users, type
bjobs -u all
Clicking on Monitor on the left side of the HPC home page toolbar, and then on "Availability of Blades" shows the current status of each blade.
If, after looking at the output files, you realize the job needs some other data, or for some other reason you want to get the current job out of the way and run another job, you may want to kill it. You need the job ID, which you can get by typing "bhist". Then
bkill 582122
will cause LSF to terminate the running job.
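The job ID can also be captured at submission time instead of being looked up later. The sketch below, in Bourne shell, assumes that bsub reports a line of the form "Job <ID> is submitted to ..." when a job is accepted. The function name jobid_from_bsub and the sample message are hypothetical.

```shell
#!/bin/sh
# Sketch: remember the job ID at submission time so it can be passed to bkill.
# Assumes bsub reports "Job <ID> is submitted to ..." on submission.
jobid_from_bsub() {
    # Keep only the digits between the first pair of angle brackets.
    printf '%s\n' "$1" | sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

# Hypothetical message, using the job ID from the text.
msg='Job <582122> is submitted to default queue <debug>.'
jobid=$(jobid_from_bsub "$msg")
echo "bkill $jobid"

# On the cluster this would be roughly:
#   jobid=$(jobid_from_bsub "$(bsub < bstry)")
#   bkill "$jobid"
```

The example only prints the bkill command, so it is safe to run anywhere.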
It's possible to specify which queue you want your job to run in by using the "-q" flag, e.g. you could specify the debug queue by having a line
#BSUB -q debug
in your bsub script. Specifying a queue is usually unnecessary and can cause the job to fail. For example, if you specify the debug queue but also have a flag "-W 30" asking for thirty minutes of time, LSF will kick out the job as not fitting in the specified queue (the debug queue only allows 15 minutes). If the queue had not been specified, LSF would have put the job in the short queue, and it would still have run.
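This kind of conflict can be caught before submission. The sketch below, in Bourne shell, reads the "#BSUB" lines of a script and compares them with the debug limits quoted in the text (4 processors, 15 minutes). The function name check_queue_request is hypothetical, and the sketch assumes one option per "#BSUB" line with "-W" given in plain minutes.

```shell
#!/bin/sh
# Sketch: warn when a script names the debug queue but asks for more than
# the debug limits quoted in the text (4 processors, 15 minutes).
# Assumes "-W" is in plain minutes and one option per "#BSUB" line.
check_queue_request() {
    awk '
        $1 == "#BSUB" && $2 == "-n" { n = $3 + 0 }
        $1 == "#BSUB" && $2 == "-W" { w = $3 + 0 }
        $1 == "#BSUB" && $2 == "-q" { q = $3 }
        END {
            if (q == "")               print "no queue named: LSF chooses one"
            else if (q != "debug")     print "queue " q ": limits not checked here"
            else if (n > 4 || w > 15)  print "conflict: request exceeds debug limits"
            else                       print "ok for debug"
        }' "$1"
}

# Hypothetical script that names debug but asks for thirty minutes.
job=$(mktemp)
printf '%s\n' '#! /bin/csh' '#BSUB -q debug' '#BSUB -n 4' '#BSUB -W 30' 'mpiexec ./pmonte' > "$job"
check_queue_request "$job"
rm -f "$job"
```

For the sample script this prints the conflict message. Removing the "-q debug" line gives "no queue named: LSF chooses one".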
You can see what queues are available with the bqueues command. "bqueues -l debug" would tell you what the requirements are for the debug queue. "bqueues -l standard" would tell you how much time you can ask for in the standard queue.
If you have access to a partner queue because your research project purchased some of the cluster, then you will get a higher priority by using that queue. An exception would be if your partner queue is already occupied by a long-running job that uses its available processors (which you can check by typing bqueues).
If your job will only run on 64-bit processors (you may have compiled it on login03 because you needed to support large files, for example), then your bsub file should include a "-R em64t" flag. Generally, asking for specific resources is likely to make your job wait longer to run.
Many other bsub flags can be useful. For example, if you want to run a succession of LSF jobs each of which depends on the successful completion of the last, investigate the "-w" flag.
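As a starting point for "-w", the sketch below builds a submission command that waits for an earlier job to finish successfully, using the LSF dependency condition done(ID). It is written in Bourne shell and only prints the command so that it can be inspected first. The function name dependent_submit_cmd and the script name step2 are hypothetical.

```shell
#!/bin/sh
# Sketch: build a bsub command line that waits for an earlier job to finish
# successfully. "done(ID)" is an LSF dependency condition; the script name
# step2 is hypothetical.
dependent_submit_cmd() {
    # $1 = job ID to wait for, $2 = script to submit afterwards
    printf 'bsub -w "done(%s)" < %s\n' "$1" "$2"
}

dependent_submit_cmd 582122 step2
```

The printed line can be typed at the prompt once the first job has been submitted and its ID is known.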