The following are common errors encountered when running jobs on
the blade center under LSF.
- Permission Denied
In order to keep track of how much demand there is for licensed software
packages, we require users to specifically request access. Software packages
that require access requests include nwchem, abaqus, cfx, ansys, amber,
espresso, dlpoly, ensight, and fluent. If you encounter a "permission denied"
error for one of the above packages, please contact eric_sills@ncsu.edu or
gary_howell@ncsu.edu and we will, if the software license permits, add you to
the list of users. You can check whether you are already permitted access
by looking at the appropriate line in the file /etc/group.
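As a rough sketch of that check, the helper below scans a group file for a given user. The function name in_group is made up for illustration, and it assumes each package's group is named after the package (e.g. "fluent"), which may not match the real entries in /etc/group.

```shell
# in_group GROUP USER [GROUPFILE]
# Hypothetical helper: succeeds if USER is listed as a member (4th field)
# of GROUP in GROUPFILE (default /etc/group).
# It does not look at a user's primary group, which is stored in /etc/passwd.
in_group() {
    group_name=$1
    user_name=$2
    group_file=${3:-/etc/group}
    awk -F: -v g="$group_name" -v u="$user_name" '
        $1 == g {
            # The member list is comma separated; compare each entry exactly.
            n = split($4, members, ",")
            for (i = 1; i <= n; i++) if (members[i] == u) found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "$group_file"
}
```

For example, in_group fluent "$USER" && echo "already on the list" would print the message only if your user name appears on the fluent line.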
- Common Nonerrors
From the bsub script, you can designate standard error and standard
output files by
#BSUB -o out.%J
#BSUB -e err.%J
In the error file, you may find messages such as
could not open
could not open
could not open
could not open
could not open
Warning: No xauth data; using fake authentication data for X11 forwarding.^M
Warning: No xauth data; using fake authentication data for X11 forwarding.^M
These messages are normal and do not actually interfere with the running of
jobs.
- No Output
Suppose that you submit a job and "bhist" shows it as pending, but when you
type "bhist" again the job no longer exists. If you get no standard error
or standard output, the usual problem is that your bsub script did not
specify an output file, or that you did not have permission to write to
the files you did specify. For example, if your bsub script has the lines
#BSUB -o /share3/gwhowell/o.%J
#BSUB -e /share3/gwhowell/e.%J
then, since you lack permission to write to the directory /share3/gwhowell,
your job will die quickly without output. You do have permission to
cd to /share3 and run "mkdir myusername" (substituting your own user name).
You can then write error and output files to /share3/myusername.
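Before submitting, it can help to confirm that the output directory exists and is writable. The following is a minimal sketch; the function name ensure_output_dir is hypothetical and not part of LSF.

```shell
# ensure_output_dir DIR
# Hypothetical helper: creates DIR if it does not exist, then reports
# whether the current user can write to it.
ensure_output_dir() {
    dir=$1
    # Ignore mkdir errors here; the tests below decide the outcome.
    mkdir -p "$dir" 2>/dev/null
    if [ -d "$dir" ] && [ -w "$dir" ]; then
        echo "ok: $dir"
    else
        echo "cannot write: $dir"
        return 1
    fi
}
```

Running ensure_output_dir "/share3/$USER" before bsub should catch the permission problem described above, so that the #BSUB -o and #BSUB -e lines can be pointed at a directory you own.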
- No mpirun
If your job returns quickly with a line including
"no mpirun found in"
the problem is that you have not set up a run environment with an "add" command.
The mpiexec command calls "mpirun" but cannot find one.
The "add" command to use depends on the compiler that was used to compile
your application. For example, if you compiled your code with the Intel
compilers selected by "add intel", then in the terminal window from which
you submit the bsub command, you should type "add intel" before submitting
your job.
Specifying the compiler at job submit time allows the job to run with
a parallel message passing library (MPI) compiled with the same compiler used
to compile the application.
- pjlSpawn
In this case the job appears to run normally for a minute or two
after bhist shows that it has moved from pending to running.
Then the error file shows something like the following
M: pjlSpawn: Time expired waiting for TS to register
Jul 19 09:22:34 2007 2812 3 6.1 PAM: An error occurred starting the PJL.
Jul 19 09:22:39 2007 2812 4 6.1 PAM: pjl_rwait: Didn't get all TS to report status.
Jul 19 09:22:39 2007 2812 3 6.1 PAM: pWaitRtask(): ls_rwait/pjl_rwait() failed, Communication time out.
Jul 19 09:22:44 2007 2812 4 6.1 PAM: pjl_rwait: Didn't get all TS to report status.
Jul 19 09:22:44 2007 2812 3 6.1 PAM: pWaitRtask(): ls_rwait/pjl_rwait() failed, Communication time out.
Jul 19 09:22:44 2007 2812 3 6.1 pWaitAll(): NIOS is dead
and the job stops.
The mpiexec line in your bsub script has not actually
succeeded in launching an MPI communicator, so LSF decides
something is wrong and kills the job. The typical cure is to
rewrite and recompile your code so that it calls MPI_Init, which lets
every processor succeed in starting the executable code. Of course,
converting a code to run usefully in parallel with message passing via MPI
may take some thought.
- Not found
Within a fairly short time after the job starts to run, a file cannot
be found, e.g. "libgunk.so can not be found", and the job fails. A
lib*.so file is a collection of subroutines (a library), specifically a
dynamic or shared library. Whereas a .a library is linked into the
executable at link time, a .so library is located while the code is
running (at run time). The advantage of linking at run time is that the
executable code can be smaller. The disadvantage of .so files is that they
may not exist on each node where the executable is trying to run, or the
executable may not be able to locate them.
If you encounter a missing .so error at runtime, contact us (oit_help@help.ncsu.edu) and we can try to help you get it straightened out.
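One way to see which shared libraries an executable cannot locate is the standard Linux "ldd" command, which marks unresolved libraries with "not found". The filter below is only a sketch: the function name missing_libs and the program name ./myprog are hypothetical, and the result reflects the node where ldd is run, which may differ from the compute nodes.

```shell
# missing_libs
# Hypothetical helper: reads `ldd` output on standard input and prints
# the name of each library that ldd reported as "not found".
missing_libs() {
    awk '/not found/ { print $1 }'
}
```

Typical use would be: ldd ./myprog | missing_libs. Including that list when you write to us may make the problem quicker to diagnose.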
Some other files that may not be found are input files required for the program.
The program itself may be throwing error messages to alert you to provide those.
Perhaps they exist but are not in the directory the program requires.
- Bad argument for option .. Job not submitted
If you write a text file with a Windows editor, then on transfer to a Unix box it
will have extra carriage-return symbols (^M) at the ends of lines. These often
cause problems. For example, a bsub file produced in Windows will throw an error
for the first line starting with "#BSUB". If the bsub file is foofile, then
cat -v foofile
will show you the offending "^M" symbols. One cure is to create the bsub file
with a Unix editor. An easy one is "nano" (using it may brand you as a beginner;
vi, nedit, and emacs are more standard but perhaps trickier). To convert foofile
from Windows format to foofile2 in Linux format, you could try
cp /home/gwhowell/bin/dos2unix.pl .
./dos2unix.pl < foofile > foofile2
Of course, you can also get the "bad argument" error if your #BSUB
line does not correspond to a recognized bsub
option. Try "man bsub" and see the FAQ on LSF commands.
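If the dos2unix.pl script is not available, the Windows carriage returns can also be detected and removed with standard tools. This is a minimal sketch using grep and tr; the function names has_crlf and strip_cr are made up for illustration.

```shell
# has_crlf FILE
# Hypothetical helper: succeeds if any line of FILE contains a
# carriage return (the character shown as ^M).
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# strip_cr
# Hypothetical helper: copies standard input to standard output with
# every carriage return removed.
strip_cr() {
    tr -d '\r'
}
```

For example: has_crlf foofile && strip_cr < foofile > foofile2. Writing to a second file, rather than redirecting back onto foofile, avoids truncating the input before it has been read.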
- Program suddenly does not run
If your program works one week but not the next, one possibility
is that you have exceeded your quota in /home. Typing
quota
will show your current usage and limits. One reason this matters: every time
a job is submitted to LSF, a small file is written to the user's home
directory. If LSF can't write that file, errors will occur.
Another possibility is that your job landed on an already busy blade.
For a running job, "bjobs -l xxxxx", where xxxxx is the job ID number,
will give you a list of the blades on which the job is running.
You can check how heavily a blade is being used by looking at the
Monitor web page. "Yellow" indicates that the blade is more heavily used
than the LSF-submitted jobs indicate. "Red" indicates the blade is
down. "Yellow" blades are likely to slow your job.