If you need more CPU cycles than you readily have at hand, you're welcome to use the SLAC Linux batch farm. If you need more than about 50 CPU hours in a night, you should contact SAS management to negotiate with the computing center.
There are really only a few commands you need to know to use the farm: bsub, bjobs, bqueues and bkill.
Logging into a noric system is your first step. Jobs submitted from these machines will automatically go to the Linux farm.
An easy way to operate is to use the GLAST NFS user space:
/nfs/slac/g/glast/users/glground/
Create a directory there (same as your account name) and off you go. The farm machines can access this space. Note that this space has a one-year auto-cleanup policy!
The syntax of bsub is simple:
bsub -q <queueName> command
where queueName is one of short, medium, long and xlong.
bsub will pass along any environment variables you have set at submission time, and use the directory you submitted from as the job's working directory.
The console output file will be emailed to your SLAC email account.
For example:
bsub -q medium glastpack.pl run Gleam v2r2p11 Gleam.exe myPath/jobOptions.txt
Note that you can tell Gleam where to pick up a jobOptions file by passing it in as an argument. Replace 'myPath' with the location of your jobOptions file. glastpack currently runs the executable from the Gleam version directory, so in this case from Gleam/v2r2p11/.
man bsub will tell you more.
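As a minimal sketch of the syntax above, the helper below assembles a bsub command line and only prints it, so it can be inspected before anything is submitted. The function name make_bsub_cmd is hypothetical; the queue names and the glastpack.pl example are the ones given on this page.

```shell
#!/bin/sh
# Hypothetical helper: build (but do not run) a bsub command line.
# Usage: make_bsub_cmd <queueName> <jobOptions file>
make_bsub_cmd() {
    queue=$1
    job_options=$2
    # Accept only the four standard queues named on this page.
    case "$queue" in
        short|medium|long|xlong) ;;
        *) echo "unknown queue: $queue" >&2; return 1 ;;
    esac
    echo "bsub -q $queue glastpack.pl run Gleam v2r2p11 Gleam.exe $job_options"
}

# Print the command for inspection only; nothing is submitted here.
make_bsub_cmd medium myPath/jobOptions.txt
```

Once the printed line looks right, it can be typed at the prompt or piped to sh to submit the job.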
bjobs (-l)
This gives you a summary of running jobs, including a batch id for each job, which you can use with bkill.
bash-2.05$ bjobs -l
Job <217304>, User <richard>, Project <none>, Status <RUN>, Queue <medium>, Command <glastpack.pl run Gleam v2r2p11 Gleam.exe ..
/../jobOptions-gamma100MeV-vert-NoNoise.txt>
Error Codes on exit
Exit codes from batch jobs should be interpreted similarly to those from any other unix program:
- Exit codes 1-128 are exit codes generated by the job itself
- Exit codes 129-255 usually indicate that a signal was received; the value is the signal number plus 128.
- For example, an exit code of 131 means the job received signal 3 (131 - 128), which is SIGQUIT.
A brief summary of the signal codes can be seen with the command "kill -l". A more verbose list can be found in /usr/include/sys/signal.h. Note that there are some variations in the codes by architecture. For example, SIGSTOP is 17 in AIX but 23 in Solaris.
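The rule above can be sketched as a small shell function. The name describe_exit is hypothetical, and treating 0 as success is the usual unix convention rather than something stated on this page.

```shell
#!/bin/sh
# Hypothetical helper: classify a batch job exit code using the ranges above.
describe_exit() {
    code=$1
    if [ "$code" -eq 0 ]; then
        # Assumption: 0 means success, as for any other unix program.
        echo "success"
    elif [ "$code" -ge 129 ] && [ "$code" -le 255 ]; then
        # The signal number is the exit code minus 128.
        echo "signal $((code - 128))"
    else
        echo "job exit code $code"
    fi
}

describe_exit 131   # prints "signal 3"
```

The signal number it reports can then be looked up in the output of "kill -l" on the machine where the job ran.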
bqueues (-l)
The standard queues we can submit to are
Queue    Approximate time limit
short    20 mins
medium   90 mins
long     6 hours
xlong    30 hours
Note that these times are approximate and are for a particular class of machine (called broncos). The current machines are faster, and the equivalent cutoffs are a bit lower.
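One way to use these limits is to pick the shortest queue that still covers a job's estimated run time. The sketch below does that with the nominal limits listed above, converted to minutes. The function name pick_queue is hypothetical, and because the real cutoffs on current machines are somewhat lower, the estimate passed in should include a safety margin.

```shell
#!/bin/sh
# Hypothetical helper: choose the shortest standard queue for an estimate.
# Usage: pick_queue <estimated run time in minutes>
# The limits are the nominal values from this page, not exact cutoffs.
pick_queue() {
    minutes=$1
    if [ "$minutes" -le 20 ]; then
        echo short
    elif [ "$minutes" -le 90 ]; then
        echo medium
    elif [ "$minutes" -le 360 ]; then      # 6 hours
        echo long
    elif [ "$minutes" -le 1800 ]; then     # 30 hours
        echo xlong
    else
        echo "no standard queue fits $minutes minutes" >&2
        return 1
    fi
}

pick_queue 60   # prints "medium"
```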
bkill <id>
For the above example, one would use
bkill 217304
Note that id=0 is a wildcard: bkill 0 cancels all of your jobs.
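The id that bkill needs is the number in the "Job <...>" field of the bjobs -l output. As a rough sketch, it can be pulled out with sed; the function name job_id_from_bjobs is hypothetical, and the pattern assumes each job record starts with "Job <id>" exactly as in the sample shown earlier.

```shell
#!/bin/sh
# Hypothetical helper: read bjobs -l style text on stdin, print the job ids.
# Assumes each job record begins with "Job <NNN>" as in the sample output.
job_id_from_bjobs() {
    sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

# Demonstration on the sample line from this page (no farm access needed).
echo 'Job <217304>, User <richard>, Project <none>, Status <RUN>, Queue <medium>' | job_id_from_bjobs
# prints "217304"
```

On a noric machine the same function could be fed from "bjobs -l", and each id it prints checked by eye before being handed to bkill.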
R.Dubois Last Modified: 2004-08-04 15:41:28 -0700