Make a checksum of all files used as input to a test, before and after the test is run, to ensure nothing has changed during the test
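The before/after checksum idea could be sketched as below; SHA-256 and the function names are assumptions, since the notes don't name an algorithm or API.

```python
import hashlib

# Sketch of the before/after input-file checksum check described above.
# SHA-256 is an assumed choice; the notes don't specify an algorithm.
def checksum(path, chunk_size=65536):
    """Hash a file's contents incrementally, in fixed-size chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_inputs_unchanged(paths):
    """Record checksums before the test runs; the returned closure,
    called after the test, lists files whose contents changed."""
    before = {p: checksum(p) for p in paths}
    def check_after():
        return [p for p in paths if checksum(p) != before[p]]
    return check_after
```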
Record the number of times the test was paused in the test report
Need to be able to correlate test data with environmental housekeeping not collected by the instrument. Implies having to synchronize clocks.
Test and schema/configuration files are not tightly bound
Make a report class
Determine a test suite naming convention, e.g.:
ped.xxx.py
gain.xxx.py
Output files labeled with the form pjjjyyhhmssq.ext, as above
Logging of session activity
Session: offline building of schema/configuration, online running of tests
Security
Should test operators log in to Run Control?
Run log should contain operator name
Must handle the situation in which multiple copies of Run Control are started
Dealt with by the inability to share a socket?
We decided that looping over a test is a property of the test, not of Run Control
Do we need the ability to log all commands sent to the embedded system?
Consensus is no. Likely to be useful only for debugging.
Electronic log book discussion
Post workshop: Eduardo requires the ability to determine from the electronic logbook whether a run was flawed. Flawed runs will normally not be analyzed offline.
The flawed flag is determined from coarse metrics from subsystems, e.g. is HV on for all components for which it should be?
Do we need to be able to drill down into the flawed flag to find out why the run is flawed, or is it sufficient to say that if you’re really interested in a flawed run, go ahead and analyze it? Consensus is that there is no need to drill down.
Need to make the criteria that go into the flawed flag not too stringent, so that all runs don’t end up marked flawed
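A flawed flag derived from coarse subsystem metrics might look like the sketch below. The metric names (`hv_on_all`, `n_events`) are illustrative assumptions, not criteria agreed at the workshop.

```python
# Sketch of a flawed-run flag built from coarse subsystem metrics, per
# the discussion above. Metric names and criteria here are illustrative
# assumptions only; the real criteria were left to the subsystems.
def run_flawed(metrics):
    """Return True if the run should be marked flawed. Criteria are
    deliberately coarse and lenient so ordinary runs aren't all flagged."""
    if not metrics.get("hv_on_all", True):   # HV off on some component
        return True
    if metrics.get("n_events", 0) == 0:      # run produced no data at all
        return True
    return False
```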
Curt and Selim chaired a discussion of the lower levels of command and event network packets
Scripts will need to be able to ensure that the event pipeline is empty
This is done with the marker field of a trigger message. A trigger message is issued with a unique marker value. The script then looks at the marker field of events that come back from the embedded system. If it sees the special value, the pipe is empty.
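The drain procedure above could be sketched as follows; `send_trigger` and `next_event` stand in for the real command/event API, which these notes don't specify.

```python
import itertools

# Sketch of the marker-based pipeline drain described above. The
# send_trigger/next_event callables are stand-ins for the real
# command and event interfaces, which are not specified in the notes.
_marker_counter = itertools.count(1)

def drain_pipeline(send_trigger, next_event):
    """Issue a trigger carrying a unique marker, then read events until
    that marker comes back; at that point the event pipe is empty.
    Returns the stale events flushed out along the way."""
    marker = next(_marker_counter)        # unique marker value
    send_trigger(marker)
    drained = []
    while True:
        event = next_event()
        if event["marker"] == marker:     # special value seen: pipe empty
            return drained
        drained.append(event)             # stale event still in the pipe
```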
Selim demoed the creation of a GUI with Qt Designer on the fly.
Scripts that care should have a GUI-enabled flag so that they can interact with a GUI or the command line interface depending on whether they’re invoked interactively or in batch mode
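One way such a flag might look is sketched below; the `--batch` option name and the prompt helper are assumptions, and a real interactive path might pop up a Qt dialog rather than use `input()`.

```python
import argparse
import sys

# Sketch of the GUI-enabled flag described above: the same script runs
# interactively or in batch mode. The flag name (--batch) and the
# prompt helper are assumptions, not part of the workshop notes.
def make_parser():
    parser = argparse.ArgumentParser(description="example test script")
    parser.add_argument("--batch", action="store_true",
                        help="run non-interactively (no GUI, no prompts)")
    return parser

def ask_operator(question, batch, default="y"):
    """Return the default answer in batch mode; otherwise prompt.
    An interactive GUI version might show a Qt dialog instead."""
    if batch or not sys.stdin.isatty():
        return default
    return input(question + " [y/n] ").strip() or default
```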
Friday
Performance discussion
Selim measures command rates at ~1200 commands/second
Selim measures event rates at 2000 events/second without parsing, and 200-700 events/second with parsing, depending on event size
Byron measures event rates at ~1200 events/second using SciPy/Numeric to parse CAL data
Byron raised his concern about FSW recoding subsystem scripts
Adds risk
Processing Calorimeter data onboard is excessive; this should be done on the ground
JJ indicated that the amount of data involved is too large to ship to the ground
JJ described FSW plans:
Need to handle the case where a calibration fails on orbit
Calibrations are FSW’s best tool for determining the health of the system
Need to decide where to draw the line between ground-based and on-orbit functions
FSW will have statistics tools (means and standard deviations, etc.)
For onboard pedestal calibrations, results can be fed back into FSW
Rather not include the ground in this loop
Byron plans to design his scripts to separate data collection from analysis
FSW is not planning to import code from non-FSW groups
Byron: algorithms are given to FSW for implementation in the FSW environment
JJ indicates a constraint on the algorithms: they must make a single pass over the data, since the SSR is a write-only device
A slight modification to this: some amount of data can be buffered and analyzed before writing to the SSR
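The single-pass constraint suits online algorithms; for the means and standard deviations mentioned above, Welford's method is one standard fit. This is an illustrative sketch, not FSW's actual code.

```python
import math

# Single-pass (online) mean and standard deviation using Welford's
# algorithm, in the spirit of the statistics tools mentioned above.
# Illustrative only; not FSW's actual implementation.
class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0        # running sum of squared deviations

    def add(self, x):
        """Fold one sample into the running statistics."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        """Sample standard deviation (0.0 until two samples are seen)."""
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0
```

Each event is touched exactly once, so the same pass that writes data to the SSR can update the statistics.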
FSW may implement algorithms by distributing events to multiple CPUs and processing them in parallel
Can treat multiple “channels” of data in parallel, if crosstalk and suchlike aren’t a problem
TKR data size is large, so involving the ground may not be possible, depending on compressibility
Threshold sweep could maybe be done from the ground if compressibility allows, since it won’t be done very often
Byron asked whether pedestal normalization data will be stored on board
JJ thinks FSW will have to, yes
JJ requests some cheap (in CPU time and memory) instrument integrity tests
JJ expects calibrations to typically take less than a minute apiece
How to handle bad calibrations?
Mark the corresponding dataset bad (can’t recall it from SSR)
Try again
FSW is not planning to do several different subsystem calibrations simultaneously
CPU, memory, and bandwidth limits lead to partitioning of which portion of the detector is calibrated when
Definitions:
Calibration: actions incompatible with regular (physics) data taking
Monitoring: actions compatible with regular data taking, e.g. dead strip list accumulation
Certain things can be done better on the ground than on board
On board’s advantage is statistics – only a small fraction of the data is sent to the ground
Can’t do photon pointing on board – too hard
JJ requests subsystems submit monitoring ideas requiring high statistics to FSW
JJ expects to get pedestal data from CAL periodically during regular data taking, to monitor performance around the orbit
Action items: JJ requests from subsystems:
Information on cross talk effects
What order to go through calibrations
How to march through the DAC curve
Any information on funny effects in the real system
How calibrations should be taken, processed, and presented