Minutes of the Meeting on
Tracker GTRC Issues
September 26, 2003
Notes taken and written by
J.J. Russell, edited by R. Johnson
GTRC - ISSUES
The meeting centered around
two distinct problems with the GTRC: the well-known TOT problem and a newly
discovered data synch problem. The latter occurs when transferring data from
one layer's GTRC to the next layer's GTRC.
Each GTRC latches data on the clock rising edge and then outputs the
same bit on the next falling edge. The
following GTRC should therefore latch the data on the rising edge of the next
clock, ½ clock cycle later (25 ns).
What is observed at 20 MHz is that the transfer delay is too long (i.e. greater than 25 ns), so the data get latched on the following clock rising edge (1.5 cycles after output by the preceding
GTRC). Testing at the chip level and
MCM level shows good operation up to 30 MHz, but that is not sufficient. This
is a system issue that is only seen when multiple GTRC's are connected
together.
DATA SYNCH PROBLEM
As opposed to the TOT
problem, for which the problem and solution are well understood, the GTRC synch
problem is relatively new and not well understood. Various people suggested
potential ideas to explore. Gunther suggested that increasing the drive current
capability may solve the problem. Robert said he tried this to no avail.
However, he did state that this test was done at 5:00pm last night, so there
might be a good reason to repeat it more carefully.
Steve saw where this was going and made a plea to find the 'smoking gun' rather than just resort to twiddling external knobs until it 'works'.
Robert said that there is a
delay of 36 ns from the time that the falling clock edge enters the GTRC
until the time that data first starts to appear at the GTRC output. Such a long delay is not predicted by
schematic-level simulation. Since this
delay is the same in all chips, it does not necessarily explain the transfer
problem (if it is a delay in clock distribution rather than data output, for
example). However, it indicates a lack
of quantitative understanding of the internal clocking of the chip. This is also seen in the measurement of
delay from clock in to clock out (where clock out is the clock distribution to
the GTFE chips). This delay is
19 ns, which is about double what the schematic-level simulation predicts.
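As a sanity check, the timing arithmetic above can be sketched in a few lines. This is a rough back-of-the-envelope calculation using only numbers quoted in these minutes; as noted above, the 36 ns figure does not by itself prove a data-path delay, since it may lie in the clock distribution instead.

```python
# Latch-window sketch for the GTRC-to-GTRC transfer (numbers from the minutes).
# A GTRC outputs data on the falling clock edge; the next GTRC should latch it
# on the following rising edge, i.e. half a clock period later.

def latch_window_ns(freq_mhz):
    """Half the clock period in ns: falling edge to the next rising edge."""
    return 500.0 / freq_mhz

for f in (1, 15, 20, 30):
    print(f"{f:2d} MHz -> latch window = {latch_window_ns(f):6.1f} ns")

# At 20 MHz the window is 25 ns, while the measured clock-in to data-out
# delay is 36 ns -- if that delay were all in the data path, it would be
# consistent with the data being latched 1.5 cycles after output.
```

At 20 MHz the window comes out to exactly 25 ns, matching the half-cycle figure quoted above.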
Because of the nature of
LVDS (the inter-chip signaling mechanism), it is difficult to observe the
relevant signals directly. UCSC made the
measurements on the breakout board using a differential probe. The measurements on the traces between GTRCs
were made with two picoprobes, using the oscilloscope to difference the two
waveforms.
The conversation then turned
to the tactics and available personnel that might be used to solve the problem.
Gunther volunteered the following
The discussion then veered
slightly, with Steve wondering whether other chips may suffer the same fate.
Unfortunately Steve chose the words 'lack of testing/simulation' which prompted
Gunther to challenge the statement, indicating that, at the chip level,
everything that could be tested was tested. The meeting was then placed back on
the tracks, addressing the two target issues.
Robert noted that the
problem disappears at a clock frequency of less than about 15 MHz (at the
nominal 2.5V supply) and has been run successfully as low as 1 MHz. The system also runs fine above about 19
MHz, but in that case it operates in a mode of skipping one clock cycle at each
GTRC. Gunther and Dave were interested in running at 3.0V (The current TEM PS
is limited to 2.75V, so getting to 3.0V requires replacing a resistor). UCSC
had run with VDD=2.75V, and in that case the correct operational mode (i.e. no
skipping of clock cycles) extended up to about 18 MHz, and the operation with
skipping began at about 22 MHz (i.e. it was unstable at 20 MHz).
Conclusion on the Data Synch
Problem: Wait for more information.
TOT Problem
-----------
The discussion then turned
to the TOT problem. In contrast with the data synch problem, both the problem
and solution are well understood. The discussion concerned tactics and
schedule.
Jeff is only now available
to work on the GTRC. The solution is already in hand. The issue is testing that
the solution works and does not introduce any unintended side effects (i.e. it
doesn't break something that is currently working). Tools at Jeff's disposal
are
The issue of schedule was
next:
Bottom line: the earliest time to
the foundry is 2 weeks. Results from the simulation may lengthen this time.
Gunther will start the PO process so that it is in place as soon as the design
is ready.
The discussion then turned
back to technical issues. Mike Huffer
raised the question of whether there are other problems out there. There is
certainly the potential for other problems, with Mike noting that neither the
buffer overflow handling nor the treatment of parity errors occurring on
front-end destined commands has been adequately tested. (In Mike's best
estimation, by looking at the VHDL code, he believes that parity errors
occurring in the readout commands will lock the system. So here it is not a
question of the consequence, but 'do parity errors occur in the real system at
any kind of significant rate?')
Many of the tests Mike was
referring to demand that the system be run at 'nominal' rates, i.e. up to
10 kHz. So far, most of the system-level testing has been around the
cosmic rate of 30 Hz. Higher-rate testing is difficult for two different
reasons: one due to the present EGSE test setup and one due to the lack of a
known data source at 10 kHz.
The EGSE test setup currently
has two basic limitations
JJ indicated that a month's
effort on the part of FSW would be needed to move to a higher rate system.
(SIDE NOTE: Although not stated in the meeting, the commissioning of the LCB is
still in its infancy. Using an unknown quantity
as part of a test procedure seems a bit dubious.)
One way to get higher rate
data is to simply turn down the thresholds on the TKR's trigger discriminators.
This will have the desired effect of raising the trigger rate, but, apart from
just proving that the data is transported without hang-ups, checking the
integrity of the data is limited to internal consistency checks. Simply put,
since the trigger is random, there is no known pattern (like a cosmic ray
track) to serve as an anchor.
The discussion then turned
to tactics. This boils down to assessing the consequences of adopting one of
the following solutions:
1. Do nothing
2. Run a mixed system, that is some towers with old chips
and some with new chips.
3. Move the TOT measurement and logic to the TEM
4. New chips in all the towers
Taking these in order:
Here the issue is: can one
live with the timeouts, or without the TOTs? Steve asked JJ what the
recovery time from a timeout is. The idea here is that FSW must always provide
some recovery method; the question is just how long the recovery procedure
takes. JJ was tasked with answering this question, although he was admittedly
very reluctant to be led one step at a time down this path. His objections are
(this is where being the author of the meeting minutes pays off; JJ gets to
state his case without rebuttal):
- Timeouts are generally a catchall error. This strategy could lead to the masking of other problems.
- The rate of timeouts in the full LAT will go up for two reasons:
1. Simple scaling of the 6 instrumented layers to 16 x 36 = 576 layers drives the timeout rate up by roughly a factor of 100.
2. Since this problem occurs because of the coincidence of two signals, the rate of timeouts will go up non-linearly as the rate increases. However, I'll ignore this non-linearity and do simple scaling to 10 kHz. This will push the rate up by a factor of 300, given that the present cosmic trigger rate is about 30 Hz. Again, my guess is that this is a lower limit because the number of events that are within the evil coincidence time must scale non-linearly.
A lower limit, then, is that the
timeout rate would increase by a factor of at least 30,000. According to Mike,
empirically they were seeing 1 error in 10K cosmics. This translates to 100
errors a second for the full LAT at a 10 kHz trigger rate.
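The scaling arguments above amount to a short calculation. This sketch uses only the figures quoted in these minutes and ignores the non-linear coincidence effect, so the results are lower limits:

```python
# Scaling the observed timeout rate to the full LAT (figures from the minutes).

instrumented_layers = 6
full_lat_layers = 16 * 36                     # 576 layers in the full LAT
layer_factor = full_lat_layers / instrumented_layers  # ~96, "roughly 100"

cosmic_rate_hz = 30.0
nominal_rate_hz = 10_000.0
rate_factor = nominal_rate_hz / cosmic_rate_hz        # ~333, "factor of 300"

total_factor = layer_factor * rate_factor             # ~32,000, ">= 30,000"

# Empirically ~1 error per 10,000 cosmics in the 6-layer test setup:
error_prob = 1.0 / 10_000
full_lat_errors_per_s = error_prob * layer_factor * nominal_rate_hz

print(f"layer factor ~ {layer_factor:.0f}")
print(f"rate factor  ~ {rate_factor:.0f}")
print(f"total factor ~ {total_factor:.0f}")
print(f"errors/s at 10 kHz in the full LAT ~ {full_lat_errors_per_s:.0f}")
```

The last line reproduces the roughly 100 errors per second quoted above; since the coincidence rate scales non-linearly, the real number would be higher.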
Finally JJ objects that
designing error recovery for something that happens once in a blue moon is very
different from making the recovery at high rate as a part of normal running.
The other obvious tactic
within the DO NOTHING strategy is to run with the TOTs disabled, effectively
giving up on the TOT measurement for normal running. As a diagnostic, some running with the TOTs enabled could be
done. At the very least, it could be
used during Tracker calibration running.
The attraction here is one
of schedule. By admitting to a mixed system, the first production towers would
be outfitted with the current chips. Later towers could have the fixed chips.
Lowell pointed out that the law of being maximally screwed was at work, since
the first towers are destined for placement as the 4 central towers (Note added
in proof: Elliott says that I&T will start at an edge, not in the
center). Note that even with only 4
towers with bad chips, the timeout rate is still likely prohibitive. This leads to the strategy of turning off
the TOTs in the towers with the original GTRCs and leaving them on in the
others. JJ indicated that nothing in the on-board data handling (either just
data handling or the filtering algorithm) cares whether the TOTs are there or
not. This mode of operation would mean
that at least some of the towers give viable TOTs.
(Again, abusing his
privileged position as editor of the minutes, JJ quakes at the thought of a
mixed system. A good portion of commissioning an instrument is devoted to
discovering and inventing workarounds to the inevitable 'interesting' features
and interactions of the hardware and real-life running. The watchword here is
'variety kills'. Dealing with the peculiarities of 2 species of GTRCs just
sounds too gross to contemplate.)
Curiously, this is actually
both technically not too hard (i.e. the VHDL is reasonably straightforward and
doable, at least according to Mike) and possible (i.e. it does not require a
board layout change, and the ASIC has sufficient resources to accommodate the
extra logic). The advantage of this solution is that the TEM production
schedule is not as tight as the MCM production schedule. A downside is that the logic would be too
large to test in an existing TEM version.
Gunther's objections
centered on the argument that he currently has a working ASIC, so why risk
it? The counter-argument is, well, if
it doesn't work, we are no worse off than we are now. Basically, we would be in the same position as if we were to redo
the GTRC and it fails for some reason.
We would just learn to live without the TOTs. Gunther also cautioned that this would draw on important
resources of his electronics group.
Obviously, this option has both an
upside (if successful, it all works and both the TOT and data synch
problems are solved) and a downside: the
schedule to Tower A (and Tower 14) would slip by 2 months or more, in
Robert’s estimate.
In general, Robert cautioned
that redoing the GTRC has a lot of costs, some more obvious than others. The MOSIS run would cost $105,000, plus some
amount for lapping and dicing. Then
there is the cost to Gunther’s group of the design effort, to the UCSC group to
kludge the GTRC test system to help with evaluation, to UCSC to test the
resulting wafers. Then the largest cost
is probably to redo qualification testing, radiation testing, and system
testing (at a level of several trays) to validate the new design. The latter would be finished only months
after completion of the run, so any fabrication use of the chips in the interim
would be at risk. Furthermore, all of
these efforts will draw on manpower that will be in critical demand for the
flight tower production, and we all know that this effort is already
short-staffed.
Final note added in proof:
while work is still going on to understand quantitatively the delays inside the
GTRC chip, we have found that correct data-transfer operation at 22 MHz and
above can be obtained by adding a 200-ohm termination resistor between the data
lines and in parallel with the existing 700-ohm resistor (which is inside the
GTRC). The existing termination,
multiplied by the capacitance of the cable, gives a rather long RC time
constant. Decreasing that RC has the
effect of allowing the data to arrive in time, even in the presence of the
other internal and less understood delays, and the signal voltage swing is
still adequate. UCSC will explore the
margins using this modification. In the
flight system this could be incorporated by putting the resistors on the cactus
arms of the flex-circuit cable (adding them to the MCM would be very
problematic at this stage).
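The effect of the added termination can be estimated quickly. The parallel resistance follows from the quoted values; the cable capacitance below is an assumed placeholder for illustration, since the minutes do not give one:

```python
# Effect of the added termination (resistor values from the minutes): a
# 200-ohm external resistor in parallel with the existing 700-ohm internal
# termination lowers the effective R, and hence the RC time constant.

def parallel(r1, r2):
    """Resistance of two resistors in parallel, in ohms."""
    return r1 * r2 / (r1 + r2)

R_OLD = 700.0                    # ohms, termination inside the GTRC
R_NEW = parallel(700.0, 200.0)   # ~156 ohms with the added external resistor

C_CABLE_PF = 50.0                # ASSUMED cable capacitance -- placeholder only

tau_old_ns = R_OLD * C_CABLE_PF * 1e-3   # R [ohm] * C [pF] = tau [ps] -> ns
tau_new_ns = R_NEW * C_CABLE_PF * 1e-3

print(f"effective termination: {R_OLD:.0f} ohm -> {R_NEW:.0f} ohm")
print(f"RC time constant: {tau_old_ns:.1f} ns -> {tau_new_ns:.1f} ns "
      f"({tau_old_ns / tau_new_ns:.1f}x faster)")
```

Whatever the actual cable capacitance, the RC constant drops by the ratio of the resistances (about 4.5x), which is the mechanism by which the data arrive in time at 22 MHz and above.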