Post-mortem of 4M background event generation

The first big run at SLAC using GlastRelease v2r1 was done this past weekend, April 25-27. The goal was to generate upwards of 10M requested backgroundavgpdr events.

The output is at

/nfs/farm/g/glast/u09/PerfEval2003Spring/backgndavgpdr/

with the individual runs in IndividualRuns/. Failures are kept in the IndividualRuns/failed/ directory.

In the end, we got 4M events requested (this translated to 193k after conservative cuts: Ntrks>0 && AcdActiveDist < -20). Here is the tale of what happened to the others:

The first run involved a submission of 250 jobs to the long queue (these jobs generally take 5k secs). It appears that they all attempted to start at once, and according to network records our fileserver lost contact at that time. Only 87 of the jobs ran.
I then tried again on Saturday morning, submitting 50 jobs. They all failed; after a couple more attempts I finally realized that our submission script had a bug when starting from a particular run number.
I then copied GlastRelease to our scratch nfs disk and submitted 250 jobs. They generally worked fine. At one point I had more than 100 running simultaneously. We saw a smallish percentage of jobs exceeding the 10k sec queue limit. We also saw a number that were simply killed; we don't know why.
I submitted another 250, but then saw some failures connecting to the MySql server on Glast01.
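
As an illustration of the conservative cuts quoted above (Ntrks>0 && AcdActiveDist < -20), here is a minimal Python sketch. The Event class, its field names, and the sample values are made up for the example and are not the actual ntuple layout; only the cut itself and the 193k / 4M figures come from the run.

```python
from dataclasses import dataclass


@dataclass
class Event:
    # Hypothetical stand-ins for the ntuple columns named in the cut.
    ntrks: int               # Ntrks: number of reconstructed tracks
    acd_active_dist: float   # AcdActiveDist, in whatever units the ntuple uses


def passes_cuts(ev):
    """Conservative cut from the text: Ntrks>0 && AcdActiveDist < -20."""
    return ev.ntrks > 0 and ev.acd_active_dist < -20


def cut_efficiency(n_passed, n_requested):
    """Fraction of requested events that survive the cuts."""
    return n_passed / n_requested


# Made-up sample events, only to exercise the cut.
sample = [Event(0, -50.0), Event(2, -35.0), Event(1, 5.0), Event(3, -20.0)]
kept = [ev for ev in sample if passes_cuts(ev)]
print(len(kept))

# Figures from the run: 193k events kept out of 4M requested.
print(f"{cut_efficiency(193_000, 4_000_000):.1%}")
```

With the numbers from this run, the surviving fraction works out to a little under 5% of the requested events.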

The list of bad runs and the ntuple file sizes are shown here. Navid wrote a Perl script to identify the bad runs, move them into their own directory for further examination, and clean up the main directory.
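
The kind of cleanup described above can be sketched roughly as follows. This is not Navid's Perl script; it is a Python illustration that assumes bad runs can be recognized by an undersized ntuple file. The function name, the "*.root" file pattern, and the min_good_bytes threshold are all hypothetical.

```python
import shutil
from pathlib import Path


def move_bad_runs(run_dir, min_good_bytes, pattern="*.root"):
    """Move undersized ntuple files from run_dir into run_dir/failed.

    Assumptions (not from the original script): a run is "bad" when its
    ntuple file is smaller than min_good_bytes, and ntuple files match
    the given glob pattern.
    Returns the names of the files that were moved, in sorted order.
    """
    run_dir = Path(run_dir)
    failed_dir = run_dir / "failed"
    failed_dir.mkdir(exist_ok=True)

    moved = []
    # sorted() materializes the listing before any file is moved.
    for path in sorted(run_dir.glob(pattern)):
        if path.is_file() and path.stat().st_size < min_good_bytes:
            shutil.move(str(path), str(failed_dir / path.name))
            moved.append(path.name)
    return moved
```

A call such as move_bad_runs("IndividualRuns", min_good_bytes=1_000_000) would leave only the full-sized files in the main directory; the threshold would have to be chosen from the observed size distribution.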

The ntuple file sizes of the bad runs group into 4 categories:

0 - pilot error. Job was not set up properly
0.347 - time exceeded. tuple file never closed properly
~725 kB - apparently arbitrarily killed
~22 - could not connect to MySql

Observations

We need to stress test the u05 fileserver holding the releases. We'll be pushing a lot harder for DC1.

Why are jobs killed, and what is this long loop? Tracy thinks it is still the propagator.

We are now dependent on access to MySql. How do we debug failures like the ones we saw? And are there ways to bypass it if we don't really need constants? This dependency came in with the CAL calibration constants.

R.Dubois Last Modified: 2010-06-01 15:48:18 -0700