Using the SLAC Batch Farm

If you need more CPU cycles than you have readily at hand, you're welcome to use the SLAC Linux batch farm. If you need more than about 50 CPU hours in a night, you should contact SAS management to negotiate with the computing center.

There are only a few commands you need to know to use the farm: bsub, bjobs, bqueues and bkill.

Logging in to a noric system is your first step. Jobs submitted from these machines will automatically go to the Linux farm.

An easy way to operate is to use the GLAST NFS user space:

/nfs/slac/g/glast/users/glground/

Create a directory there (same as your account name) and off you go. The farm machines can access this space. Note that this space has a one-year auto-cleanup policy!
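
Setting up that directory could look like the sketch below. The helper name make_workdir and its two optional arguments are illustrative assumptions; only the default path comes from the text above.

```shell
# Sketch: create a per-user work directory under the GLAST NFS user space.
# make_workdir is a hypothetical helper. The optional base-directory and
# user-name arguments exist only so the sketch can be tried elsewhere.
make_workdir() {
    local base="${1:-/nfs/slac/g/glast/users/glground}"
    local user="${2:-${USER:-$(id -un)}}"
    local dir="$base/$user"
    # -p makes this safe to repeat if the directory already exists.
    mkdir -p "$dir" || return 1
    printf '%s\n' "$dir"
}
```

Called with no arguments on a noric machine, it prints the directory it created, so cd "$(make_workdir)" would leave you ready to submit from there.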

Submitting a Job

The syntax is simple:

bsub -q <queueName> command

where queueName is one of short, medium, long and xlong.

bsub will pass along any environment variables you have set at submission time, and use the directory you submitted from as the job's working directory.

The console output file will be emailed to your SLAC email account.

For example:

bsub -q medium glastpack.pl run Gleam v2r2p11 Gleam.exe myPath/jobOptions.txt

Note that you can tell Gleam where to pick up a jobOptions file by passing it in as an argument. Here, replace 'myPath' with the directory where you put your jobOptions file. glastpack currently runs the executable from the Gleam version directory, so in this case from Gleam/v2r2p11/.
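
A small wrapper can catch a mistyped queue name before anything reaches the farm. The sketch below is one way to do that; submit_job and the DRY_RUN variable are hypothetical names, and the only bsub option used is the -q shown above.

```shell
# Sketch: validate the queue name before handing the job to bsub.
# submit_job and DRY_RUN are hypothetical names used for illustration.
submit_job() {
    if [ "$#" -lt 2 ]; then
        echo "usage: submit_job <queueName> command [args...]" >&2
        return 2
    fi
    local queue="$1"
    shift
    case "$queue" in
        short|medium|long|xlong) ;;
        *) echo "unknown queue: $queue" >&2; return 2 ;;
    esac
    if [ -n "${DRY_RUN:-}" ]; then
        # Show the command instead of running it.
        echo "bsub -q $queue $*"
    else
        bsub -q "$queue" "$@"
    fi
}
```

With DRY_RUN set, submit_job medium glastpack.pl run Gleam v2r2p11 Gleam.exe myPath/jobOptions.txt only prints the bsub line, which is a cheap way to check the arguments first.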

man bsub will tell you more.

Monitoring a Job

bjobs (-l)

This gives you a summary of your running jobs, including the batch id for each job, which you can use with bkill.

bash-2.05$ bjobs -l

Job <217304>, User <richard>, Project <none>, Status <RUN>, Queue <medium>, Command <glastpack.pl run Gleam v2r2p11 Gleam.exe ..
/../jobOptions-gamma100MeV-vert-NoNoise.txt>
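
If you want the batch ids on their own, for example to pass to bkill, they can be pulled out of this output with sed. The sketch below assumes only the "Job <id>" pattern visible in the sample above; job_ids is a hypothetical helper name.

```shell
# Sketch: print the numeric job ids found in `bjobs -l` style output on stdin.
# job_ids is a hypothetical helper. It matches lines that begin with
# "Job <digits>" and prints just the digits.
job_ids() {
    sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}
```

Typical use would be bjobs -l | job_ids, which prints one id per line.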

Error Codes on Exit

Exit codes from batch jobs should be interpreted similarly to those from any other Unix program.

A brief summary of the signal codes can be seen with the command "kill -l". A more verbose list can be found in /usr/include/sys/signal.h. Note that there are some variations in the codes by architecture. For example, SIGSTOP is 17 in AIX but 23 in Solaris.
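
As an illustration, the sketch below turns an exit code into a short description. It assumes the common shell convention that a value above 128 means the job was terminated by signal number (code - 128); that convention is an assumption here, not something stated on this page, and describe_exit is a hypothetical helper.

```shell
# Sketch: describe an exit code, ASSUMING the usual shell convention that
# values above 128 mean "terminated by signal (code - 128)".
# describe_exit is a hypothetical helper, not part of the batch system.
describe_exit() {
    local code="$1"
    if [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -gt 128 ]; then
        local sig=$((code - 128))
        # kill -l <number> prints the signal name for the local architecture.
        echo "killed by signal $sig ($(kill -l "$sig" 2>/dev/null))"
    else
        echo "program exited with status $code"
    fi
}
```

Because signal numbers vary by architecture, run this on the same kind of machine the job ran on if the signal name matters.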

Queue Information

bqueues (-l)

The standard queues we can submit to are:

Queue     Approximate time limit
short     20 mins
medium    90 mins
long      6 hours
xlong     30 hours

Note that these times are approximate and are for a particular class of machine (called broncos). The current machines are faster, and the equivalent cutoffs are a bit lower.
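
Picking the smallest queue that fits an estimated run time can be sketched as follows. pick_queue is a hypothetical helper, and the cutoffs are the approximate figures quoted above, so leave yourself a margin when estimating.

```shell
# Sketch: choose the smallest standard queue for an estimated run time.
# pick_queue is a hypothetical helper; the limits (20 min, 90 min, 6 h, 30 h)
# are the approximate values quoted in the text, expressed in minutes.
pick_queue() {
    local minutes="$1"
    if   [ "$minutes" -le 20 ];   then echo short
    elif [ "$minutes" -le 90 ];   then echo medium
    elif [ "$minutes" -le 360 ];  then echo long
    elif [ "$minutes" -le 1800 ]; then echo xlong
    else
        echo "no standard queue allows $minutes minutes" >&2
        return 1
    fi
}
```

For a job expected to take about two hours, pick_queue 120 prints long.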

Killing a Bad Job

bkill <id>

For the above example, one would use

bkill 217304

Note that id=0 is a wildcard - all your jobs would be canceled.
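
Since a stray 0 cancels everything, a guarded wrapper may be worth having. The sketch below refuses id 0 and anything that is not a plain number; kill_job and the DRY_RUN variable are hypothetical names.

```shell
# Sketch: refuse to pass the wildcard id 0 (or a non-numeric id) to bkill.
# kill_job and DRY_RUN are hypothetical names used for illustration.
kill_job() {
    local id="${1:-}"
    case "$id" in
        ''|*[!0-9]*) echo "job id must be a number" >&2; return 2 ;;
    esac
    if [ "$id" -eq 0 ]; then
        echo "refusing id 0: it would cancel all your jobs" >&2
        return 2
    fi
    if [ -n "${DRY_RUN:-}" ]; then
        # Show the command instead of running it.
        echo "bkill $id"
    else
        bkill "$id"
    fi
}
```

To cancel all your jobs on purpose, call bkill 0 directly.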


R.Dubois Last Modified: 2004-08-04 15:41:28 -0700