



**GLAST Large Area Telescope** 

**Monthly Mission Review** 

### LAT Flight Software Status

February 8, 2007

Jana Thayer

**Stanford Linear Accelerator Center** 



- Builds available on LAT:
  - **B0-8-1** in lower bank (reboot trolling, LAT operations)
  - B0-6-15+ in upper bank (preserved for functional testing)
- Reboots (since January monthly):
  - More in Erik's RRT presentation
  - "Lost Decrementer Interrupt."
    - EPU0 at 2007-01-26 08:17:14 ~80 seconds into first 30-minute muon run.
    - EPU1 at 2007-01-27 10:46:22 with system idling for hours after initial boot.
    - EPU1 at 2007-02-06 10:45:00 during datataking after several hrs of running
    - EPU0 at 2007-02-07 09:36:44 with EPU idling during LCI run
    - EPU0 at 2007-02-08 07:13:26 late into a muon run after ~19 hrs of running
  - "Caching unexpectedly enabled for PowerPCI bridge chip registers."
    - EPU0 at 2007-01-27 07:41:29 with system idling during ACD front-end power-up.
- B0-9-0: stable build to carry us into Observatory CPT. Includes:
  - Workaround for the lost decrementer problem
  - VXW with write-through/write-back selection implemented as telecommand
  - Compression fixes
- B1-0-0: GRB algorithm
  - Serialized behind reboots
  - Identifying needs for testing of algorithm and interface
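The lost decrementer interrupt timestamps listed above give a rough sense of how often the fault recurs. A minimal sketch, assuming all five timestamps share one time zone and treating the gaps as wall-clock time (the slide does not give powered-on operating hours); the function and variable names are illustrative only:

```python
from datetime import datetime

# (EPU, timestamp) pairs copied from the "Lost Decrementer Interrupt" list.
EVENTS = [
    ("EPU0", "2007-01-26 08:17:14"),
    ("EPU1", "2007-01-27 10:46:22"),
    ("EPU1", "2007-02-06 10:45:00"),
    ("EPU0", "2007-02-07 09:36:44"),
    ("EPU0", "2007-02-08 07:13:26"),
]

def wall_clock_gaps_hours(events):
    """Return hours elapsed between consecutive events (wall clock only)."""
    times = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S") for _, ts in events]
    return [(b - a).total_seconds() / 3600.0 for a, b in zip(times, times[1:])]

gaps = wall_clock_gaps_hours(EVENTS)
for (epu, ts), gap in zip(EVENTS[1:], gaps):
    print(f"{epu} at {ts}: {gap:6.1f} h after previous occurrence")
```

If the LAT was not powered continuously over this period, these gaps overstate the operating time between occurrences.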



## Lost decrementer interrupt

- (Very) Late breaking news:
  - This problem is a bug in the design of the original MPC750. From the MPC750 user's manual:
    - "No combination of the thermal assist unit, the decrementer register, and the performance monitor can be used at any one time"
- We are using the thermal assist unit and decrementer at the same time!
  - The workaround is a trivial fix
  - Documented in JIRA FSW-863
  - Project CCB approval for FSW-863?



## **Build 0-9-0**

| Key            | Package Affected | Summary                                                                           |
|----------------|------------------|-----------------------------------------------------------------------------------|
| FSW-857        | VXW              | Revert RAD750 cache configuration to write-back                                   |
| FSW-863        | PBS              | Lost decrementer interrupt                                                        |
| FSW-861        | PBS              | Incorrect initial processor frequency/period assignment                           |
| FSW-860        | PBS              | WUT_sys_adjust does not properly calculate its return value                       |
| FSW-850        | PBS              | In the PL queue routines, inserting a queue element can disable interrupts        |
| FSW-800        | PBS              | LHK stopped sending telemetry                                                     |
| FSW-831        | EDS              | Event packet reassembly code review issues                                        |
| FSW-844        | Compression      | Segmentation fault during event decoding                                          |
| FSW-842        | Compression      | Error decoding format 4/5 events                                                  |
| FSW-744        | THS              | Incorrect CCSDS header timestamps on EPU1 LSEP datagram                           |
| FSW-802        | THS              | CTDB Bus timeout messages time tags 4.2 seconds fast                              |
| FSW-843        | LIM, LPA         | Modify LIM behavior to favor ARR over TOO and to always obey LPASTART and LPASTOP |
| FSW-837        | LIM              | Lack of task pointer check prior to invoking ITC_detachRaw                        |
| FSW-862        | LHK              | CCSDS Header sequence counter always 0 for EPU HSK packets                        |

- Stable FSW baseline that includes:
  - Reboot workarounds identified in the next week
  - Fixed compression (error rate < 1 in 80 x 10<sup>6</sup> muon events)
  - Interrupt trace capability (already in place)
    - VXW write-through mode essential for debugging reboots, but
      - There is a performance hit associated with using it
      - Combined with short interrupt latency and increased execution time of interrupt dispatch tracing, can cause VxWorks work queue panic (reboot)
    - If we run VXW in write-back mode, we cannot diagnose reboots




# LAT instrument time and B0-9-0 release

- We have been given seven upcoming 8-hour shifts on the instrument.
- Use this time to troll for reboots (2/8-2/9), install B0-9-0 (2/19-2/20), and test B0-9-0/troll some more (2/20-2/23)
- Expected availability of B0-9-0: week of 2/19
  - Code completion:
    - PBS fixes complete by ~2/12
    - Compression fixes released by ~2/15
    - THS fixes complete by ~2/15
    - LHK fix complete by ~2/12
    - VXW, LIM and all other JIRAs complete
  - Roll build: 2/15
  - Regression test: 2/15-2/19
- Upload to LAT
  - Time to upload VXW + FSW can be reduced to 2 shifts or less with LICOS\_Scripts changes (LCS-222, LCS-223): 2/19-2/20
- Regression test/troll for reboots: 2/20-2/23
  - Perform "standard" datataking loop that includes LCI and muon runs in config 1
    - Include LIM mode test
    - Enable VXW write-through mode
  - Will gladly accept any additional time given to troll for reboots/regression test



- Implementation of GRB algorithm can be split up into three pieces:
  - Internal FSW infrastructure for handling a GRB (complete)
  - GRB algorithm detecting a burst (work in progress)
    - Algorithm has been available for some time
    - Porting the algorithm to an onboard environment has begun
  - Infrastructure for testing GBM/LAT interface
    - Some thought and effort required to implement
    - Need to include features such as
      - Ability to enable/disable GRB algorithm
      - Ability to trigger algorithm via telecommand or other external stimulus so we can test interface with the GBM
    - FQT test to be written using testbed/FES



## **B1-0-0 JIRAs**

| Key            | Summary                                                                  | Fix Version/s |
|----------------|--------------------------------------------------------------------------|---------------|
| FSW-582        | Capture of layer splits in LATC does not consider the FE mode registers  | B1-0-0        |
| FSW-292        | Implement GRB detection algorithm                                        | B1-0-0        |
| FSW-693        | Command confirmation configuration report                                | B1-0-0        |
| FSW-732        | Task messaging configuration report                                      | B1-0-0        |
| FSW-576        | Bug in CAL data compression algorithm                                    | B1-0-0        |
| FSW-789        | LCI event data is inconsistent if TEM errors or diagnostics present      | B1-0-0        |
| FSW-456        | EMP and LCM do zlib compress with malloc/free, should use MBA_alloc/free | B1-0-0        |
| FSW-811        | Modify the sample parameters of the Gamma, MIP, and Heavy Ion filters    | B1-0-0        |
| FSW-841        | Implement enumerations in LCAT so they're part of the T&C database       | B1-0-0        |
| FSW-808        | Problem enabling periodic triggers                                       | B1-0-0        |
| FSW-747        | Correct two separate errors with the extended counters                   | B1-0-0        |
| FSW-723        | LATC (and RIM) XML contains duplicate tag names                          | B1-0-0        |
| FSW-164        | Add LATC Telecommand Interface to LIM                                    | B1-0-0        |



## **JIRA Metrics as of 5 February 2007**



- Open issues are divided as follows
  - 6 planned for B0-9-0
  - 13 planned for B1-0-0
  - 8 planned for B2-0-0 (post L+60)
  - 13 deferred indefinitely
  - 14 unscheduled
    - 11 being assessed by FSW team
    - 3 awaiting Project CCB adjudication




# Test bed vs LAT Reboot Investigations

# Eric J. Siskind (as munged by wnjohnson and jbthayer)

8 February 2007



- The LAT is sufficiently complicated that it cannot even digitally reproduce its own results
  - LAT contains multiple asynchronous clocking domains whose frequencies are not phase locked
- None of the strategies discussed is without some cost, often involving highly-skilled manpower for an extended period of time



- Premise: the processing of specific events generated by LAT through FSW leads to memory corruption, which ultimately results in a watchdog reboot (EPU only)
- Issue: Reboot frequency statistics
  - Assuming ~200 hrs of ops between reboots and a 500 Hz event rate on the ground, we are seeing 1 failure per 3x10^8 events.
- Consequence:
  - To reproduce reliably on test bed, need ~1x10^9 events ($3\sigma$)
    - Exceeds capabilities of the test bed FES by a factor of 4-5 (designed for 1 orbit at 10 kHz)
  - Repeatedly playing the same small sample does not help
- Problems:
  - Monte Carlo sim of 1x10^9 ground muon events is not trivial
  - No guarantee that MC captures the complexity (zoo) of the events seen by LAT



## **Data-Driven Reboots (cont)**

- Mitigation
  - Use events captured by LAT in testing and transmitted to the SSR
- Strategy:
  - Push event data from LAT (passthru filter) thru the test bed
  - Focus on datagrams from EPU just prior to reboot
- Problems:
  - The event or event sequence that caused the reboot may not have been packaged in a datagram and sent from LAT to the SSR before the reboot occurred.
  - Recent running of LAT has been with the gamma filter to reduce data volume (for dump and storage issues)
    - This reduces the complexity and rate of events into the SSR.



- Test bed DAQ side is logically an identical copy of the LAT DAQ system
- Test bed FES (front end simulator) is NOT a simulation of TKR, CAL or ACD
  - Provides trigger primitive timing
  - Provides event data formatted as each subsystem would format it
  - Does not simulate readout latencies from TACK.
  - Does not simulate subsystem command register actions or impact on data content. Readback is copy of what was commanded.
- Consequently, test bed FSW validation is a NECESSARY but not SUFFICIENT condition to guarantee successful operation on LAT.



- FSW validation step w/ fully functional DAQ system
  - Necessary but not sufficient
- FSW performance and margin testing with respect to compute cycles in SIU and EPUs
- With FES capabilities, the primary tool to verify FSW and DAQ performance at flight-like trigger rates
  - Study dead time, buffer occupancy, data transfer cycles

The test bed and FES were not designed to be a high fidelity simulation of the LAT and the complexity of its detectors.



EM RAD-750 boards in test bed use COTS MPC-750 processors

- Issues
  - BAE cache functionality errata
  - Potential other unknown problems in interactions w/ bridge chips
- Mitigation
  - Upgrade test bed EM RAD-750s to the v1.05 RAD-750 processor chips, either flight or rumored non-flight versions
  - But, if problems are software, this doesn't help
- Plan Forward
  - Bring LAT spare flight crate(s) into dataflow lab
  - Set up on uber teststand
  - Upgrade EM boards on test bed???
- Caution: putting flight processors on the testbed will not necessarily help diagnose reboots:
  - Need to know how to use the flight hardware to reproduce a reboot
    - Don't know: is problem hardware? Data driven? Can it only be reproduced if we run identically to LAT?
  - Need tools in place to capture breadcrumbs left behind after a




FES (front end simulator) is NOT a simulation of TKR, CAL or ACD

- Mitigation
  - Set up Calibration Unit in dataflow lab
- Plan Forward
  - Get CU back to SLAC and running in dataflow lab
  - Develop configurations and tests for CU
    - LCI and muon runs
  - Estimate for arrival and operation of the CU is ~ 1 month (Eduardo)
- Caution: do not have full 16 towers on CU



The SDRAM organization in the EM processor board is slightly different from that in the flight board.

- Issue
  - Only known potential impact relates to erratum #24 from BAE (maximum bank active timeout)
- Mitigation
  - Disable the timeout feature. We did, but have seen at least one watchdog reboot in this configuration.
- Path forward
  - Don't believe this difference is significant



- All test bed DAQ ASICs are the same revision as flight ASICs
- All test bed FPGAs have the same revision of VHDL as the flight FPGAs
  - Test bed and flight FPGAs target different FPGA families but logically (clock tick by tick) they function the same.







## B0-8-1: EPU0 CPU Load (write-through)

