Bari report: (Richard) Links to all talks may be found on the agenda page; see particularly the wrap-up talk. The meeting was fully recorded as well.
A morning was devoted to the beam test; not everything is understood. Bill feels there are issues with the beamline, materials, etc., apart from the LAT itself.
(Richard) A lot of time was spent on backgrounds. (Toby) There is a substantial improvement with Pass 6. In particular, it includes a new and very useful variable: the ratio of measured energy to the total energy deposited in the ACD. (Richard) We won't use Pass 6 for OpsSim, but hope to have it available for [something?] in March.
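As described, the new Pass 6 variable is simply a ratio. A minimal sketch of such a background-rejection discriminant, with hypothetical names and a hypothetical zero-protection floor (neither is taken from Pass 6 itself):

```python
def acd_energy_ratio(e_cal_measured, e_acd_total, floor=1e-3):
    """Illustrative discriminant: ratio of measured (calorimeter) energy
    to total energy deposited in the ACD.

    Gamma rays leave little or no energy in the ACD, so large values are
    gamma-like; charged-particle background tends toward small values.
    `floor` (an assumption of this sketch) avoids division by zero for
    events with no ACD deposit at all."""
    return e_cal_measured / max(e_acd_total, floor)
```

The interesting design point is the floor: without some protection, the purest gamma events (zero ACD energy) would be undefined rather than maximally gamma-like.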
(Seth) Riccardo Rando gave a nice talk on reducing dispersion tails. The technique will be incorporated into Pass 6. (Toby) The key was to apply an overall scaling to the dispersion.
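An "overall scaling" can be pictured as a single global factor applied to the dispersion distribution as a whole, rather than a bin-by-bin correction. A hedged sketch of that idea (the function names and the scale factor are illustrative, not Riccardo's actual method):

```python
import numpy as np

def fractional_dispersion(e_meas, e_true):
    """Fractional energy dispersion D = (E_meas - E_true) / E_true."""
    e_meas = np.asarray(e_meas, dtype=float)
    e_true = np.asarray(e_true, dtype=float)
    return (e_meas - e_true) / e_true

def rescale_dispersion(d, scale):
    """Apply one global scale factor to the dispersion values,
    narrowing (scale < 1) or widening (scale > 1) the whole
    distribution, tails included, in a single step."""
    return np.asarray(d, dtype=float) * scale
```

The appeal of a global factor is its simplicity: one number shifts the entire distribution, so the tails move together with the core instead of requiring an energy-dependent correction.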
Big run: As always, follow progress on the Big Run Checklist. (Richard) Over 300 K jobs are done, about 85%; at this rate, we should be finished tomorrow. About 2500 runs failed and will have to be re-run, an acceptably small percentage (under 1%). All-gammas are being wrapped up in Lyon. Next week we will do Level 1 processing on the first 5 days of data, modifying various parameters in order to stress the system and see what might need improvement.
Science Tools report: (Seth) talked from these comprehensive notes.
Data handling: (Dan) There isn't much to say. We're investigating ways to make the web UIs more responsive (the processing history and dataset numbers from the big run are really killing them), and getting ready to try serving data from the big run through the data catalog. (Richard) Resource relief is on the way. We'll be getting our own mail server (our auto-generated mail accounts for about half of the SLAC total!) and two new Oracle servers next week; that is, they should have physically arrived here by then.
Documentation report: (Chuck) My focus for the past week was on "how to fix" a SAS software infrastructure failure, and I've determined that there are basically three components to the problem: a remote user; a SLAC resource person (e.g., the shift coordinator) who would perform triage; and a list of on-call experts. By triage, I mean a person at SLAC who would diagnose (or confirm the diagnosis of) a problem and, if necessary, either fix it or make sure it is referred to the appropriate on-call expert.
One reason for triage is the limited diagnostic capability at remote locations. There are currently two diagnostic tools available: the Server Monitoring tool for the Tomcat servers, and Nagios for almost everything else. However, Nagios is not currently available to remote locations, and it is not clear that it will be anytime soon, if at all.
The Server Monitoring tool is not yet a completed product in that it has some rather serious problems, which Tony, Charlotte, Max, and I met yesterday afternoon to discuss in some detail. Max is currently working on the major issues, and Tony plans to incorporate a reset button, which will enable a server reboot to be initiated remotely.
A third tool, Ganglia, is also available, but it is not yet clear to me how valuable it may be for diagnostic purposes.
From a documentation point of view, all of this boils down to the need for two levels of documentation: one for the remote user, and one for the experts. The latter would probably best be implemented in Confluence, where it could be added to - and updated by - the experts. We would also probably want a history database and/or FAQ page, both of which could be implemented in Confluence. In addition, we'll need a "how to fix Confluence" page which, for obvious reasons, would have to be maintained someplace separate from Confluence. And lastly, we'll need to publish who performs the triage function and who the on-call experts are. [Thanks to Chuck for providing these in-depth notes after the meeting. ed.]
GR news: (Heather) has started a new HEAD build including the new tags Tracy made just before leaving: GlastClassify and Interleave. The former seems OK. The Interleave library builds successfully, but its test program doesn't. Do we a) include it in a new release anyway, b) hold off entirely, or c) attempt to find and fix the problem before including it in a new release tag? Especially since there seems to be no way to exercise the new code easily, (Richard) votes for c). Heather will take a look. [It turns out the test program was referring to an obsolete include file which had been removed from CVS, but apparently no real use was being made of it. Heather simply took out the reference and all is well.]
Layoffs fallout: (Richard) Three GLAST people were laid off, one in SAS and two in Flight Operations. Also Tony Johnson lost another prospect. All in all, Data Handling is particularly hard-hit. We're looking around in the Collaboration. The Italians or the French may be able to help out.
Composite event lists: (Heather) In his new tag, Tracy commented out the CEL stuff. (Richard) We can't wait until Tracy returns (the 24th at the earliest; more likely the 27th) to reinstate it. Heather and David will look into it. (David) has written a CEL application, celRelocate.exe, included in rootUtil, which demonstrates how to modify the data file paths within a ROOT CEL file. Next steps will be to document the package better, provide an application celConvert.exe able to convert a ROOT CEL file into a textual CEL file, and debug the use of TVirtualIndex when reading back a CEL. [Thanks to David for filling in all the holes in my real-time notes. ed.]
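celRelocate.exe itself lives in rootUtil and works on ROOT files. Purely to illustrate the relocation idea (this is not the actual tool or its API; the function name and prefix arguments are hypothetical), a prefix rewrite over a list of data-file paths might look like:

```python
from pathlib import PurePosixPath

def relocate_paths(paths, old_prefix, new_prefix):
    """Rewrite the leading directory of each data-file path, as a
    relocation tool must when the files referenced by a composite
    event list have moved to a new disk or site.

    Paths not under old_prefix are left unchanged, so a list mixing
    several storage areas can be relocated one area at a time."""
    out = []
    for p in paths:
        p = PurePosixPath(p)
        try:
            rel = p.relative_to(old_prefix)
        except ValueError:
            out.append(str(p))  # not under old_prefix: keep as-is
            continue
        out.append(str(PurePosixPath(new_prefix) / rel))
    return out
```

The key property is that only the prefix changes: the event-list structure (which events live in which file) is untouched, which is what makes relocation safe after a bulk data move.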
Skimmer: (David C.) The proposed new release discussed last week has been made. The next step will be to investigate how the Skimmer could accept ROOT CEL files as input. Note: a first attempt to use ROOT 5.18 resulted in a crash, so he does not expect support for ROOT 5.18 to be straightforward.
Ground software freeze: (Richard) On the 16th of March we will make a baseline release of everything that runs in the Pipeline — GlastRelease, ScienceTools, Half-pipe, etc. — as well as the Pipeline itself. After that, any changes to the baseline have to be OKed by some form of CCB before going into production. (We can continue development unimpeded, however.) NASA, of course, thinks we should make no changes except those needed to fix fatal errors. But then there is the danger, if not the certainty, that we will be stuck with something inadequate.
We will strive for no changes at all during the two weeks just before launch.
SCons status; related questions: (Navid) The new RM can run (it does checkouts, compiles, and runs test programs), but it is not yet running routinely because we wouldn't be able to see the output easily. He is waiting for Karen to finish the web frontend, expected sometime this week. Meanwhile he would like to start looking at GR.
(Heather) raised a couple of questions on Jim's behalf:
AOB: (Richard) Anders and Maria Elena have been comparing MC data to the "same" data after it's been turned into EVT files and run through L1. It's not always quite the same! Their work to date is described in detail in Confluence.