Data Processing Facility

This facility has four major functions. Before describing them, it is instructive to examine the scale of the processing problem. The downlink rate of 300 kb/s results in a daily rate of some 3 GB of data, or approximately 1 TB per year. Products generated from the raw data will perhaps double or triple this volume. Over a 5-year period this comes to some 15-30 TB: a reasonably modest volume that can easily be held entirely on disk.

The average event rate in the telemetry is expected to be 30 Hz, split between signal photons and background cosmic rays. Our current reconstruction algorithm consumes about 0.1 s per event on a 400 MHz Solaris processor. Assuming 4 GHz processors by launch time (a conservative estimate), this time drops to 0.01 s/event, allowing a single processor to keep up with the incoming data on a daily basis. If we wished to turn a full day's downlink around within 4 hours, we would require perhaps 3-5 processors. The bottom line is that disk and CPU time are not drivers for GLAST's Level 1 analysis.
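
As a rough cross-check of these numbers, the following back-of-the-envelope calculation (a Perl sketch using only the figures quoted above, plus 86400 seconds per day) reproduces the daily volume and CPU load:

    # Rough check of the data volume and CPU load quoted above.
    my $downlink_bps    = 300e3;                        # 300 kb/s average downlink
    my $bytes_per_day   = $downlink_bps * 86400 / 8;    # ~3.2e9 bytes, i.e. ~3 GB/day
    my $tb_per_year     = $bytes_per_day * 365 / 1e12;  # ~1.2 TB/year of raw data

    my $event_rate_hz   = 30;                           # signal photons + cosmic rays
    my $sec_per_event   = 0.01;                         # assumed 4 GHz processor
    my $cpu_sec_per_day = $event_rate_hz * 86400 * $sec_per_event;  # ~26000 s, ~7 h
    my $cpus_for_4h     = $cpu_sec_per_day / (4 * 3600);            # ~2; 3-5 allows headroom

    printf "raw data: %.1f GB/day, %.1f TB/year\n", $bytes_per_day / 1e9, $tb_per_year;
    printf "CPU:      %.1f h/day on one CPU, %.1f CPUs for a 4 h turnaround\n",
           $cpu_sec_per_day / 3600, $cpus_for_4h;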

These disk and CPU needs represent perhaps one percent of the SLAC Computing Center's capacity, so even a gross under-estimation of rates and volumes is easily accommodated within the existing facility.

One can inflate these estimates by requiring the capacity to re-process data, and perhaps to generate Monte Carlo simulations, concurrently with prompt processing. An estimate of the maximum computation and storage capacity required is perhaps a few tens of processors and 50 TB of disk over the life of the mission. The SLAC Computing Center is committed to supplying these disk and CPU resources at no explicit expense to GLAST. [Needs confirming by Richard Mount!]

The task at hand will be to have a sensible backup scheme for the data, and a well-designed database which can track the state of the processing (prompt processing, re-processing, and MC generation alike) and describe the resulting datasets. The database will be the heart of the operation: from it, a fully automated server can handle the data processing with a minimum of human intervention.

Database

The database will be used to track the progress of any given dataset - most likely a single file initiated from a Level 0 file - through its life in the system, from arrival at the IOC through to Level 1 processing output. The database is dataset-oriented: it must keep track of the properties of the file (location, size, and any needed metadata), as well as its state (completed, pending, failed, etc.) and how it got into that state.

Such a database is being prototyped for GLAST use, based on experience with a similar data pipeline used for the SLD experiment at SLAC. An entity relationship diagram is shown here. The database is designed for use in the engineering model tests as well as in flight mode. The tables are divided into three categories:

In addition, datasets can be grouped, so that like datasets can be easily linked together. An example would be a particular set of Monte Carlo simulations which require many individual files all with the same setup conditions.
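
Since the entity relationship diagram is not reproduced in this document, the following sketch (written with the Perl DBI interface adopted below) indicates the flavour of the tables involved. The table and column names are purely illustrative and are not the actual prototype schema.

    use DBI;

    # Placeholder connection; the real DSN and account are site-specific.
    my $dbh = DBI->connect('dbi:Oracle:GLASTP', 'pipeline', 'password',
                           { RaiseError => 1 });

    # One row per dataset (file): its properties, its state, and how it got there.
    $dbh->do(q{
        CREATE TABLE dataset (
            dataset_id    NUMBER PRIMARY KEY,
            dataset_type  VARCHAR2(32),    -- e.g. LEVEL0, LEVEL1, MC
            file_location VARCHAR2(256),
            file_size     NUMBER,
            state         VARCHAR2(16),    -- e.g. NEW, PENDING, COMPLETED, FAILED
            state_reason  VARCHAR2(256),
            group_id      NUMBER           -- optional link to a dataset group
        )
    });

    # Grouping of like datasets, e.g. one Monte Carlo run spread over many files.
    $dbh->do(q{
        CREATE TABLE dataset_group (
            group_id    NUMBER PRIMARY KEY,
            description VARCHAR2(256)
        )
    });

    # Flexible name/value metadata attached to a dataset.
    $dbh->do(q{
        CREATE TABLE dataset_metadata (
            dataset_id NUMBER REFERENCES dataset (dataset_id),
            name       VARCHAR2(64),
            value      VARCHAR2(256)
        )
    });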

Processing Server

It is assumed that the IOC will create entries in the database for new Level 0 datasets. An automated processing server can poll for new entries and take immediate action when it finds them. For event data, we currently envisage a single process acting on an input file and producing a single output file. It may be convenient for calibration tasks if a separate file of candidate calibration events is created; alternatively, those events could be tagged in the standard output file. Either approach will work and is supported by the database. Similarly, the database will support multiple processes operating on a dataset. [do we need a table to tie processing step(s) to dataset types?]

The server's life is considerably simplified by having all datasets on disk all the time. There is no urgency for backups, and the server need not even be responsible for making them. [does the database need to know where backups are?]

Questions of programming language and database technology are somewhat interconnected. A mainstream interpreted language like Perl is a good match to this kind of work. The database is assumed to be relational and SQL-based; SLAC has an Oracle site license, so that seems like a natural choice. Perl has a good interface to Oracle, so the combination of Perl and Oracle is well matched to the needs.
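
A minimal illustration of that combination, assuming the standard Perl DBI module with its Oracle driver and the illustrative dataset table sketched above (the DSN, account and password are placeholders):

    use strict;
    use warnings;
    use DBI;

    # Connect to the Oracle instance (placeholder DSN and account).
    my $dbh = DBI->connect('dbi:Oracle:GLASTP', 'pipeline', 'password',
                           { RaiseError => 1, AutoCommit => 1 });

    # Summarise how many datasets are in each state.
    my $sth = $dbh->prepare('SELECT state, COUNT(*) FROM dataset GROUP BY state');
    $sth->execute;
    while (my ($state, $count) = $sth->fetchrow_array) {
        printf "%-12s %6d\n", $state, $count;
    }
    $dbh->disconnect;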

Since datasets are independent, the server can make use of a load-balancing batch system (SLAC uses LSF) to dispatch the processing jobs. Assuming the Level 0 data are broken into small chunks (the unix filesystem already limits files to 2 GB), the server can submit the chunks to separate processors to achieve parallel throughput. Each process can then communicate its results directly to the database, or to the server, which would do the updates.
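
One pass of such a dispatch loop might look like the sketch below (reusing the $dbh handle from the example above; the LSF queue name, script name and table layout are placeholders, not a real configuration):

    # Pick up new Level 0 chunks, submit one LSF job per chunk, record the submission.
    my $new = $dbh->selectall_arrayref(q{
        SELECT dataset_id, file_location FROM dataset
        WHERE dataset_type = 'LEVEL0' AND state = 'NEW'
    });

    foreach my $row (@$new) {
        my ($id, $path) = @$row;
        my $cmd = "bsub -q glast -o /glast/logs/recon_$id.log "
                . "runRecon.pl --dataset $id --input $path";
        if (system($cmd) == 0) {
            $dbh->do(q{UPDATE dataset SET state = 'SUBMITTED' WHERE dataset_id = ?},
                     undef, $id);
        } else {
            warn "LSF submission failed for dataset $id\n";
        }
    }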

We will also need web interfaces to the server, both for watching its progress and for interacting with it. These interactions will involve direct communication with the server (restarts, etc.) as well as with the database (e.g. altering the state of a dataset to set an OK flag so that processing can resume).
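
The monitoring side could be as simple as a CGI script that queries the dataset table and lists each dataset with its current state; the sketch below uses the same placeholder DSN, account and table names as the earlier examples.

    #!/usr/bin/env perl
    # Minimal web status page: one row per dataset with its current state.
    use strict;
    use warnings;
    use CGI;
    use DBI;

    my $q = CGI->new;
    print $q->header('text/html');
    print "<html><body><h2>GLAST dataset status</h2>\n";
    print "<table border=\"1\"><tr><th>dataset</th><th>state</th></tr>\n";

    my $dbh = DBI->connect('dbi:Oracle:GLASTP', 'pipeline', 'password',
                           { RaiseError => 1 });
    my $sth = $dbh->prepare('SELECT dataset_id, state FROM dataset ORDER BY dataset_id');
    $sth->execute;
    while (my ($id, $state) = $sth->fetchrow_array) {
        print "<tr><td>$id</td><td>$state</td></tr>\n";
    }
    print "</table></body></html>\n";
    $dbh->disconnect;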

Near Real-time Feedback to the IOC

The automated processing provides the opportunity to obtain high-level diagnostics from the data and feed them back to the IOC operators in a timely fashion. Raw data arriving at the IOC can give basic, subsystem-centric information on the detector elements, but cannot tie the subsystems together, nor give higher-level measurements on those subsystems.

Examples are:

There is a long list of such diagnostics that can be used. They will likely take the form of statistics and plots, tracked for each dataset to yield performance metrics as a function of time. These will be compared to reference standards, with as many of the comparisons automated as possible. Some of the diagnostics will have to be examined by people, who will use their expertise to decide whether the current distributions are acceptable. All of these diagnostics and standards will be tracked in a database and viewable over the web.
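
As an illustration of what an automated comparison could look like, a diagnostic value for a dataset might be checked against a stored reference and the verdict recorded. The diagnostic name, tolerances and table layout below are invented for the example, and the $dbh handle and $dataset_id are assumed to come from the surrounding server code.

    # Compare one diagnostic against its reference value; deviations beyond the
    # tolerance are flagged for human inspection, larger ones marked as failures.
    sub check_diagnostic {
        my ($value, $reference, $tolerance) = @_;
        my $deviation = abs($value - $reference);
        return $deviation <= $tolerance     ? 'OK'
             : $deviation <= 2 * $tolerance ? 'WARN'
             :                                'FAIL';
    }

    # e.g. mean trigger rate for this dataset vs. the nominal 30 Hz
    my $status = check_diagnostic(29.1, 30.0, 1.5);
    $dbh->do(q{INSERT INTO diagnostic (dataset_id, name, value, status)
               VALUES (?, ?, ?, ?)},
             undef, $dataset_id, 'mean_trigger_rate', 29.1, $status);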

Calibrations

... stuff goes here ...

Monte Carlo Generation

It is anticipated that significant volumes of simulations will be generated. As much as possible, the machinery that does this generation should make use of the automated server described above. That will leverage all the benefits of the server and database.

The main issue is to record the metadata that is unique to MC: the source generator, its parameters, and the configuration and parameters of the simulation package. These are readily handled by the flexible metadata scheme in the database. The code management system also makes code version identification unambiguous.
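
With the name/value metadata table sketched earlier, recording this MC-specific information amounts to a handful of inserts. The keys and values below are purely illustrative, and $dbh and $mc_dataset_id are assumed to come from the surrounding server code.

    # Attach MC-specific metadata to a simulated dataset.
    my %mc_meta = (
        source_generator   => 'point_source_gun',         # illustrative generator name
        generator_params   => 'E=1 GeV, theta=0-60 deg',   # illustrative parameters
        simulation_package => 'glastsim v4r2',             # illustrative version tag
    );

    my $sth = $dbh->prepare(q{
        INSERT INTO dataset_metadata (dataset_id, name, value) VALUES (?, ?, ?)
    });
    $sth->execute($mc_dataset_id, $_, $mc_meta{$_}) for sort keys %mc_meta;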

Data Manager Prototype

Development of a Data Manager (automated server) prototype is currently underway at SLAC. The prototype is intended to demonstrate automation of the data processing steps described above. The prototype, and eventually the final product, are being developed in Perl, based, as mentioned above, on the availability and reliability of extensively tested libraries providing general network, web, SQL, and Oracle support. These capabilities will be important in providing efficient and convenient access to the data and to the current processing status.

As well as performing automated processing to Level 1, the data manager, in combination with the database, will provide the logic that allows users to access data sets with similar properties as a group. The data manager will work in tandem with the code management system to provide extensive version information on processing algorithms used for each stage of processing a given data set.

The current version of the prototype is able to generate MC data and run various versions of the reconstruction code on it; it will soon have the capability of logging metadata on the simulation- and reconstruction-specific algorithms to the previously described Oracle database. A block diagram providing a more detailed view of the interaction of the Data Manager with the various SAS components is here.


R.Dubois, K.Young Last Modified: 07/26/2001 15:10