Some Useful LSF Commands
- LSF (Load Sharing Facility) is a job scheduling and resource management software
system developed and maintained by Platform
Computing (now acquired by IBM). Use it to
run jobs on the blade center. A job is submitted from one of the login
nodes and waits until resources become available on the compute nodes.
Jobs that ask for 4 or fewer processors and 15 minutes or less of time
are given a high priority and typically
start very quickly. Fast turnaround on such small jobs lets users
debug quickly.
A few of the most useful commands are explained here, in particular bsub, bhist, bjobs, bqueues, and bkill. Fuller explanations are on the man pages, e.g. type
man bsub
on the command line. Or see the Platform LSF Documentation.
An executable file is submitted to run on the compute nodes using the LSF command bsub. The bsub command can be made from the command line, but it is convenient to use a script file. Consider the following file "runpmonte" consisting of the lines
#! /bin/csh
#BSUB -W 5
#BSUB -n 4
#BSUB -o /share/gwhowell/chap03/pmonte.out.%J
#BSUB -e /share/gwhowell/chap03/pmonte.err.%J
#BSUB -J pmonte
mpiexec_hydra ./pmonte
Before submitting the job for execution, set the environment variables. In this case suppose that pmonte has been compiled with one of the Intel compilers. Then use
source /usr/local/apps/env/intel.csh
which is equivalent to the command
add intel
The "add intel" command enables the right choice of "mpiexec_hydra", in this case the Intel-specific version of the MPI library. To submit the job, then type
bsub < runpmonte
Line by line, here is an explanation of the bsub file.
"-W 5" asks for five minutes of time. The job will time out after five minutes if still running.
"-n 4" asks for 4 processors.
"-o pmonte.out.%J" denotes a file where standard output from the job will be saved.
The "-e" line designates a file where standard error output from the job will be saved.
The "-J" line gives a runtime name for the job.
"mpiexec_hydra ./pmonte" starts up copies of ./pmonte on 4 CPUs. Typing just ./foo where foo is an executable file would start up ./foo on just one of the processors assigned to the job. So for example, you could do shell commands such as "cd" inside the bsub script.
cd /share
mkdir jsbach
to create a directory jsbach in the /share file system. Changing the -o line to "-o /share/jsbach/pmonte.out.%J" would place the standard output in this directory. The "%J" is replaced by the unique number of the LSF job, so each job writes a new file. Without the "%J", each job's output would be appended to the existing file of the same name. An advantage of using the /share directory is that these files will be purged after a while, so your /home directory doesn't get cluttered.
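The job file described above can also be generated rather than edited by hand each time. The following is a minimal sketch in bash, not part of the LSF tools: the function name make_job_script and its argument order are hypothetical, and it simply prints the same seven lines as the "runpmonte" example with the job name, processor count, time limit, and output directory filled in.

```shell
#!/bin/bash
# Hypothetical helper (not an LSF command): print an LSF job script
# to standard output, mirroring the "runpmonte" example.
# Arguments: job name, processor count, wall-clock minutes, output directory.
make_job_script() {
    local name="$1" nproc="$2" minutes="$3" outdir="$4"
    # %J is left literal here; LSF substitutes the job number at run time.
    cat <<EOF
#! /bin/csh
#BSUB -W ${minutes}
#BSUB -n ${nproc}
#BSUB -o ${outdir}/${name}.out.%J
#BSUB -e ${outdir}/${name}.err.%J
#BSUB -J ${name}
mpiexec_hydra ./${name}
EOF
}
```

For example, "make_job_script pmonte 4 5 /share/jsbach > runpmonte" would write a job file that can then be submitted with "bsub < runpmonte". The sketch assumes the executable has the same name as the job.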
- Most common errors
A typical error file might contain
Warning: No xauth data; using fake authentication data for X11 forwarding.
Warning: No xauth data; using fake authentication data for X11 forwarding.
Warning: No xauth data; using fake authentication data for X11 forwarding.
Warning: No xauth data; using fake authentication data for X11 forwarding.
There's no error here, only a harmless warning. Equally typically, and also without actual error we could have
could not open
could not open
Warning: No xauth data; using fake authentication data for X11 forwarding.
Warning: No xauth data; using fake authentication data for X11 forwarding.
Warning: No xauth data; using fake authentication data for X11 forwarding.
The most common actual error has a line like
which: no mpirun in (/home/gwhowell/bin:/home/gwhowell/R-2.4.0/bin/exec:
/usr/local/lsf/6.1/linux2.4-glibc2.3-x86/bin:/home/gwhowell/bin:
/home/gwhowell/R-2.4.0/bin/exec:/usr/local/lsf/6.1/linux2.6-glibc2.3-x86/bin:
/usr/local/lsf/6.1/linux2.6-glibc2.3-x86/etc:/usr/kerberos/bin:/usr/local/bin:
/bin:/usr/bin:/usr/lpp/mmfs/bin:/usr/sbin/rsct/bin:/opt/xcat/bin:
/opt/xcat/sbin:/opt/xcat/i686/bin:/opt/xcat/i686/sbin:/usr/X11R6/bin:
/usr/lpp/mmfs/bin:/usr/sbin/rsct/bin:/opt/xcat/bin:/opt/xcat/sbin:
/opt/xcat/i686/bin:/opt/xcat/i686/sbin)
/usr/local/lsf/6.1/linux2.4-glibc2.3-x86/bin/mpichp4_wrapper: line 465: -machinefile: command not found
Mar 8 11:59:27 2007 21495 3 6.1 PAM: An error occurred starting the PJL.
This means you forgot to "add intel" or "add pgi" or "add gnu", so the mpiexec_hydra command can't find an mpirun in its path.
Another common error is SIGPIPE. It usually indicates a problem with finding a file or a path, or with lacking the necessary permissions to write a file. p4_ errors typically indicate problems with MPI calls.
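Since the missing-mpirun failure only shows up after the job has been queued and started, it can save a round trip to check the path before submitting. The sketch below is a hypothetical bash helper (the name check_commands is not an LSF or site command); it uses the shell builtin "command -v" to test whether each named program is on PATH. It assumes it is run from the same window in which the "add" command was given, so that it sees the same PATH.

```shell
#!/bin/bash
# Hypothetical pre-submission check (not an LSF command).
# Return 0 if every named command is found on PATH, 1 otherwise.
# Each missing command is reported on standard error.
check_commands() {
    local missing=0 cmd
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "not found on PATH: $cmd" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

A possible use is "check_commands mpiexec_hydra mpirun && bsub < runpmonte", which submits only if both programs are found.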
- The debug queue
Asking for four or fewer processors and fifteen minutes or less of time allows a job to run in the debug queue with a high priority, so turnaround is typically quite fast. So you can catch your silly mistakes in small sample cases. As above, these options are specified by
#BSUB -n 4
#BSUB -W 15
in the bsub file.
A typical debug procedure is to open a few windows: one for compiling code, another for modifying the bsub file and submitting jobs, and a third for looking at files in the output and error directory. The "add" command is needed in the window from which jobs are submitted. Typing
ls -lrt
in the output window puts the latest files last and tells you how many bytes they have.
- Monitoring job progress
Having submitted an LSF job, you can type bhist or bjobs at the command line. "bhist -l" or "bjobs -l" will give some more verbose output.
[gwhowell@login02 chap03]$ bhist
Summary of time in seconds spent in various states:
JOBID   USER    JOB_NAME  PEND    PSUSP   RUN     USUSP   SSUSP   UNKWN   TOTAL
582122  gwhowel *.err.%J  5       0       0       0       0       0       5
[gwhowell@login02 chap03]$ bhist
Summary of time in seconds spent in various states:
JOBID   USER    JOB_NAME  PEND    PSUSP   RUN     USUSP   SSUSP   UNKWN   TOTAL
582122  gwhowel *.err.%J  6       0       1       0       0       0       7
In the above lines I was impatiently typing bhist every few seconds to see if the job had started to run.
If a job has started, you can type "bpeek" to see the standard output and standard error produced so far. Alternatively you can "cat" or "more" the output and error files.
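Instead of retyping the status command by hand, the check can be put in a loop. The following bash sketch is an illustration under stated assumptions, not a site-provided tool: the function names job_state and wait_for_job are hypothetical, and the parsing assumes the default "bjobs JOBID" layout, in which the first line is a header and the third column (STAT) holds the job state. Check the output of bjobs on your system before relying on it.

```shell
#!/bin/bash
# Hypothetical helpers (not LSF commands).

# Read default "bjobs JOBID" output on standard input and print the STAT
# column. Assumes line 1 is the header and STAT is the third column.
job_state() {
    awk 'NR == 2 { print $3 }'
}

# Poll until the job is neither pending (PEND) nor running (RUN).
# Any other state, including a suspended one, ends the wait.
# Arguments: job ID, optional delay between checks in seconds (default 10).
wait_for_job() {
    local jobid="$1" delay="${2:-10}" state
    while :; do
        state="$(bjobs "$jobid" 2>/dev/null | job_state)"
        case "$state" in
            PEND|RUN) sleep "$delay" ;;
            *) echo "job $jobid: ${state:-no longer listed}"; return 0 ;;
        esac
    done
}
```

For example, "wait_for_job 582122 30" would check every thirty seconds and print the final state once the job has left the PEND and RUN states. A longer delay is kinder to the scheduler than typing the command every few seconds.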
Typing bqueues on the command line will give you a summary of how many jobs are running and pending.
bjobs -u all
will give you a list of all jobs that LSF is currently queueing or running.
Clicking on Monitor on the left side of the HPC home page toolbar, and then on "Availability of Blades", shows the current status of each blade.
- How do I kill a job?
If, after looking at the output files, you realize the job needs some other data, or for some other reason you want to get the current job out of the way and run another job, you may want to kill it. You need the job ID, which you can get by typing "bjobs". Then
bkill 582122
will cause LSF to terminate the running job.
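The job ID is also printed by bsub when the job is submitted, so it can be saved at that point for a later bkill. The bash sketch below is hypothetical (parse_job_id and submit_and_record are not LSF commands) and assumes the confirmation line has the form "Job <582122> is submitted to queue <debug>."; if your version of LSF words it differently, the pattern needs adjusting.

```shell
#!/bin/bash
# Hypothetical helpers (not LSF commands).

# Extract the job ID from the confirmation line printed by bsub, assumed
# to look like: Job <582122> is submitted to queue <debug>.
parse_job_id() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Submit a job script and save its ID in a file for a later bkill.
# Arguments: job script, file in which to record the ID.
submit_and_record() {
    local script="$1" idfile="$2" jobid
    jobid="$(bsub < "$script" | parse_job_id)"
    [ -n "$jobid" ] || { echo "could not read job ID" >&2; return 1; }
    echo "$jobid" > "$idfile"
    echo "$jobid"
}
```

For example, after "submit_and_record runpmonte last.jobid", the job could later be killed with bkill followed by the number stored in last.jobid. The file name last.jobid is only an example.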
- Should I specify a queue?
It's possible to specify which queue you want your job to run in by using the "-q" flag, e.g. you could specify the debug queue by having a line
#BSUB -q debug
in your bsub script. Specifying a queue is usually unnecessary and can cause the job to be rejected. For example, if you specify the debug queue but also have a flag "-W 30" asking for thirty minutes of time, LSF will reject the job as not fitting in the specified queue (the debug queue only allows 15 minutes). If the queue had not been specified, LSF would have put the job in the short queue, and it would still have run.
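The mismatch described above can be caught before submitting. The bash sketch below is hypothetical (fits_debug_queue is not an LSF command) and hard-codes the debug queue limits quoted on this page, 4 processors and 15 minutes; the limits on the system may change, so confirm them with "bqueues -l debug".

```shell
#!/bin/bash
# Limits quoted in the text for the debug queue. These are assumptions
# copied from this page; confirm current values with "bqueues -l debug".
DEBUG_MAX_PROCS=4
DEBUG_MAX_MINUTES=15

# Hypothetical helper (not an LSF command).
# Return 0 if a request fits the debug queue limits, 1 otherwise.
# Arguments: processor count (-n value), wall-clock minutes (-W value).
fits_debug_queue() {
    local nproc="$1" minutes="$2"
    [ "$nproc" -le "$DEBUG_MAX_PROCS" ] && [ "$minutes" -le "$DEBUG_MAX_MINUTES" ]
}
```

For example, "fits_debug_queue 4 30" fails because of the thirty-minute request, which is the case where a "-q debug" line should be left out of the bsub script.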
You can see what queues are available with the bqueues command. "bqueues -l debug" would tell you what the requirements are for the debug queue. "bqueues -l standard" would tell you how much time you can ask for in the standard queue.
If you have access to a partner queue because your research project purchased some of the cluster, then you will get a higher priority by using that queue. An exception would be if your partner queue is already occupied by a long running job that uses its available processors (which you can check by typing bqueues).
- Some more bsub options
You can request specific resource requirements for the compute nodes that LSF will assign to your job. For example you may want to ask for quad-core processors or nodes with at least 8GB of memory.
Many other bsub flags can be useful. For example, if you want to run a succession of LSF jobs each of which depends on the successful completion of the last, investigate the "-w" flag.
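As a starting point for the "-w" flag, the bash sketch below submits a job that should start only after an earlier job has completed successfully, using the dependency expression done(job ID). The function name submit_after is hypothetical, and the options in the job file are assumed to combine with the "-w" option given on the command line; see "man bsub" for the full set of dependency conditions.

```shell
#!/bin/bash
# Hypothetical helper (not an LSF command).
# Submit a job script that starts only after an earlier job finishes
# successfully. Arguments: ID of the earlier job, job script to submit.
submit_after() {
    local prev_id="$1" script="$2"
    # done(ID) is the bsub dependency condition for successful completion.
    bsub -w "done(${prev_id})" < "$script"
}
```

For example, "submit_after 582122 runstep2" would queue the job file runstep2 (an example name) to run once job 582122 has finished without error.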
Last modified: April 19 2013 11:39:06.