Minutes of the Meeting on
Tracker GTRC Issues
September 26, 2003
Notes taken and written by
J.J. Russell, edited by R. Johnson
GTRC - ISSUES
The meeting centered around
two distinct problems with the GTRC: the well-known TOT problem and a newly
discovered data synch problem. The latter occurs when transferring data from
one layer's GTRC to the next layer's GTRC.
Each GTRC latches data on the clock rising edge and then outputs the
same bit on the next falling edge. The
following GTRC should therefore latch the data on the rising edge of the next
clock, ½ clock cycle later (25 ns).
What is observed at 20 MHz is that the transfer delay is too long (i.e. greater than 25 ns), so the data get latched on the following clock rising edge (1.5 cycles after output by the preceding
GTRC). Testing at the chip level and
MCM level shows good operation up to 30 MHz, but that is not sufficient. This
is a system issue that is only seen when multiple GTRC's are connected
together.
DATA SYNCH PROBLEM
As opposed to the TOT
problem, for which the problem and solution are well understood, the GTRC synch
problem is relatively new and not well understood. Various people suggested
potential ideas to explore. Gunther suggested that increasing the drive current
capability may solve the problem. Robert said he tried this to no avail.
However, he did state that this test was done at 5:00pm last night, so there
might be a good reason to repeat it more carefully.
Steve saw where this was going and made a plea to find the 'smoking gun' rather than just resort to twiddling external knobs until it 'works'.
Robert said that there is a
delay of 36 ns from the time that the falling clock edge enters the GTRC
until the time that data first starts to appear at the GTRC output. Such a long delay is not predicted by
schematic-level simulation. Since this
delay is the same in all chips, it does not necessarily explain the transfer
problem (if it is a delay in clock distribution rather than data output, for
example). However, it indicates a lack
of quantitative understanding of the internal clocking of the chip. This is also seen in the measurement of
delay from clock in to clock out (where clock out is the clock distribution to
the GTFE chips). This delay is
19 ns, which is about double what the schematic-level simulation predicts.
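As a sanity check, the timing arithmetic above can be sketched in a few lines. This is a rough back-of-the-envelope calculation using only numbers quoted in these minutes; as noted above, the 36 ns figure does not by itself prove a data-path delay, since it may lie in the clock distribution instead.

```python
# Latch-window sketch for the GTRC-to-GTRC transfer (numbers from the minutes).
# A GTRC outputs data on the falling clock edge; the next GTRC should latch it
# on the following rising edge, i.e. half a clock period later.

def latch_window_ns(freq_mhz):
    """Half the clock period in ns: falling edge to the next rising edge."""
    return 500.0 / freq_mhz

for f in (1, 15, 20, 30):
    print(f"{f:2d} MHz -> latch window = {latch_window_ns(f):6.1f} ns")

# At 20 MHz the window is 25 ns, while the measured clock-in to data-out
# delay is 36 ns -- if that delay were all in the data path, it would be
# consistent with the data being latched 1.5 cycles after output.
```

At 20 MHz the window comes out to exactly 25 ns, matching the half-cycle figure quoted above.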
Because of the nature of
LVDS (the inter-chip signaling mechanism), it is difficult to observe the
relevant signals directly. UCSC made the
measurements on the breakout board using a differential probe. The measurements on the traces between GTRCs
were made with two picoprobes, using the oscilloscope to difference the two
waveforms.
The conversation then turned
to the tactics and available personnel that might be used to solve the problem.
Gunther volunteered the following
The discussion then veered
slightly, with Steve wondering whether other chips may suffer the same fate.
Unfortunately Steve chose the words 'lack of testing/simulation' which prompted
Gunther to challenge the statement, indicating that, at the chip level,
everything that could be tested was tested. The meeting was then placed back on
the tracks, addressing the two target issues.
Robert noted that the
problem disappears at a clock frequency of less than about 15 MHz (at the
nominal 2.5V supply) and has been run successfully as low as 1 MHz. The system also runs fine above about 19
MHz, but in that case it operates in a mode of skipping one clock cycle at each
GTRC. Gunther and Dave were interested in running at 3.0V (The current TEM PS
is limited to 2.75V, so getting to 3.0V requires replacing a resistor). UCSC
had run with VDD=2.75V, and in that case the correct operational mode (i.e. no
skipping of clock cycles) extended up to about 18 MHz, and the operation with
skipping began at about 22 MHz (i.e. it was unstable at 20 MHz).
Conclusion on the Data Synch
Problem: Wait for more information.
TOT Problem
-----------
The discussion then turned
to the TOT problem. In contrast with the data synch problem, both the problem
and solution are well understood. The discussion concerned tactics and
schedule.
Jeff is only now available
to work on the GTRC. The solution is already in hand. The issue is testing that
the solution works and does not introduce any unintended side effects (i.e. it
doesn't break something that is currently working). Tools at Jeff's disposal
are
The issue of schedule was
next:
Bottom line: the earliest time to
the foundry is 2 weeks. Results from the simulation may lengthen this time.
Gunther will start the PO process so that it is in place as soon as the design
is ready.
The discussion then turned
back to technical issues. Mike Huffer
raised the question of whether there are other problems out there. There is
certainly the potential for other problems, with Mike noting that neither the
buffer overflow handling nor the treatment of parity errors occurring on
front-end destined commands has been adequately tested. (In Mike's best
estimation, by looking at the VHDL code, he believes that parity errors
occurring in the readout commands will lock the system. So here it is not a
question of the consequence, but 'do parity errors occur in the real system at
any kind of significant rate?')
Many of the tests Mike was
referring to demand that the system be run at 'nominal' rates, i.e. up to
10 kHz. So far, most of the system-level testing has been around the
cosmic rate of 30 Hz. Higher-rate testing is difficult for two different
reasons: one due to the present EGSE test setup and one due to the lack of a
known data source at 10 kHz.
The EGSE test setup currently
has two basic limitations
JJ indicated that a month's
effort on the part of FSW would be needed to move to a higher rate system.
(SIDE NOTE: Although not stated in the meeting, the commissioning of the LCB is
still in its infancy. Using an unknown quantity
as part of a test procedure seems a bit dubious.)
One way to get higher rate
data is to simply turn down the thresholds on the TKR's trigger discriminators.
This will have the desired effect of raising the trigger rate, but, apart from
just proving that the data is transported without hang-ups, checking the
integrity of the data is limited to internal consistency checks. Simply put,
since the trigger is random, there is no known pattern (like a cosmic ray
track) to serve as an anchor.
The discussion then turned
to tactics. This boils down to assessing the consequences of adopting one of
the following solutions:
1. Do nothing
2. Run a mixed system, that is some towers with old chips
and some with new chips.
3. Move the TOT measurement and logic to the TEM
4. New chips in all the towers
Taking these in order:
Here the issue is: can one
live with the timeouts, or without the TOTs? Steve asked JJ what the
recovery time from a timeout is. The idea here is that FSW must always provide
some recovery method; the question is just how long the recovery procedure
takes. JJ was tasked with answering this question, although he was admittedly
very reluctant to be led one step at a time down this path. His objections are
(this is where being the author of the meeting minutes pays off; JJ gets to
state his case without rebuttal):
- Timeouts are generally a catchall error. This strategy could lead to the masking of other problems.
- The rate of timeouts in the full LAT will go up for two reasons:
1. Simple scaling of the 6 instrumented layers to 16 x 36 = 576 layers drives the timeout rate up by roughly a factor of 100.
2. Since this problem occurs because of the coincidence of two signals, the rate of timeouts will go up non-linearly as the rate increases. However, I'll ignore this non-linearity and do simple scaling to 10 kHz. This will push the rate up by a factor of 300, given that the present cosmic trigger rate is about 30 Hz. Again, my guess is that this is a lower limit because the number of events that are within the evil coincidence time must scale non-linearly.
A lower limit, then, is that the
timeout rate would increase by a factor of at least 30,000. According to Mike,
empirically they were seeing 1 error in 10K cosmics. This translates to 100
errors a second for the full LAT at a 10 kHz trigger rate.
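The scaling arguments above amount to a short calculation. This sketch uses only the figures quoted in these minutes and ignores the non-linear coincidence effect, so the results are lower limits:

```python
# Scaling the observed timeout rate to the full LAT (figures from the minutes).

instrumented_layers = 6
full_lat_layers = 16 * 36                     # 576 layers in the full LAT
layer_factor = full_lat_layers / instrumented_layers  # ~96, "roughly 100"

cosmic_rate_hz = 30.0
nominal_rate_hz = 10_000.0
rate_factor = nominal_rate_hz / cosmic_rate_hz        # ~333, "factor of 300"

total_factor = layer_factor * rate_factor             # ~32,000, ">= 30,000"

# Empirically ~1 error per 10,000 cosmics in the 6-layer test setup:
error_prob = 1.0 / 10_000
full_lat_errors_per_s = error_prob * layer_factor * nominal_rate_hz

print(f"layer factor ~ {layer_factor:.0f}")
print(f"rate factor  ~ {rate_factor:.0f}")
print(f"total factor ~ {total_factor:.0f}")
print(f"errors/s at 10 kHz in the full LAT ~ {full_lat_errors_per_s:.0f}")
```

The last line reproduces the roughly 100 errors per second quoted above; since the coincidence rate scales non-linearly, the real number would be higher.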
Finally JJ objects that
designing error recovery for something that happens once in a blue moon is very
different from making the recovery at high rate as a part of normal running.
The other obvious tactic
within the DO NOTHING strategy is to run with the TOTs disabled, effectively
giving up on the TOT measurement for normal running. As a diagnostic, some running with the TOTs enabled could be
done. At the very least, it could be
used during Tracker calibration running.
The attraction here is one
of schedule. By admitting to a mixed system, the first production towers would
be outfitted with the current chips. Later towers could have the fixed chips.
Lowell pointed out that the law of being maximally screwed was at work, since
the first towers are destined for placement as the 4 central towers (Note added
in proof: Elliott says that I&T will start at an edge, not in the
center). Note that even with only 4
towers with bad chips, the timeout rate is still likely prohibitive. This leads to the strategy of turning off
the TOTs in the towers with the original GTRCs and leaving them on in the
others. JJ indicated that nothing in the on-board data handling (either just
data handling or the filtering algorithm) cares whether the TOTs are there or
not. This mode of operation would mean
that at least some of the towers give viable TOTs.
(Again, abusing his
privileged position as editor of the minutes, JJ quakes at the thought of a
mixed system. A good portion of commissioning an instrument is devoted to
discovering and inventing workarounds to the inevitable 'interesting' features
and interactions of the hardware and real-life running. The watchword here is
'variety kills'. Dealing with the peculiarities of 2 species of GTRCs just
sounds too gross to contemplate.)
Curiously, this is actually
both technically not too hard (i.e. the VHDL is reasonably straightforward and
doable, at least according to Mike) and possible (i.e. it does not require a
board layout change, and the ASIC has sufficient resources to accommodate the
extra logic). The advantage of this solution is that the TEM production
schedule is not as tight as the MCM production schedule. A downside is that the logic would be too
large to test in an existing TEM version.
Gunther's objections
centered on the argument that he currently has a working ASIC, so why risk
it? The counter-argument is, well, if
it doesn't work, we are no worse off than we are now. Basically, we would be in the same position as if we were to redo
the GTRC and it fails for some reason.
We would just learn to live without the TOTs. Gunther also cautioned that this would draw on important
resources of his electronics group.
Obviously, this option has both an
upside (if successful, it all works and both the TOT and data synch
problems are solved) and a downside: the
schedule to Tower A (and Tower 14) would slip by 2 months or more, in
Robert’s estimate.
In general, Robert cautioned
that redoing the GTRC has a lot of costs, some more obvious than others. The MOSIS run would cost $105,000, plus some
amount for lapping and dicing. Then
there is the cost to Gunther’s group of the design effort, to the UCSC group to
kludge the GTRC test system to help with evaluation, to UCSC to test the
resulting wafers. Then the largest cost
is probably to redo qualification testing, radiation testing, and system
testing (at a level of several trays) to validate the new design. The latter would be finished only months
after completion of the run, so any fabrication use of the chips in the interim
would be at risk. Furthermore, all of
these efforts will draw on manpower that will be in critical demand for the
flight tower production, and we all know that this effort is already
short-staffed.
Final note added in proof:
while work is still going on to understand quantitatively the delays inside the
GTRC chip, we have found that correct data-transfer operation at 22 MHz and
above can be obtained by adding a 200-ohm termination resistor between the data
lines and in parallel with the existing 700-ohm resistor (which is inside the
GTRC). The existing termination,
multiplied by the capacitance of the cable, gives a rather long RC time
constant. Decreasing that RC has the
effect of allowing the data to arrive in time, even in the presence of the
other internal and less understood delays, and the signal voltage swing is
still adequate. UCSC will explore the
margins using this modification. In the
flight system this could be incorporated by putting the resistors on the cactus
arms of the flex-circuit cable (adding them to the MCM would be very
problematic at this stage).
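The effect of the added termination can be estimated quickly. The parallel resistance follows from the quoted values; the cable capacitance below is an assumed placeholder for illustration, since the minutes do not give one:

```python
# Effect of the added termination (resistor values from the minutes): a
# 200-ohm external resistor in parallel with the existing 700-ohm internal
# termination lowers the effective R, and hence the RC time constant.

def parallel(r1, r2):
    """Resistance of two resistors in parallel, in ohms."""
    return r1 * r2 / (r1 + r2)

R_OLD = 700.0                    # ohms, termination inside the GTRC
R_NEW = parallel(700.0, 200.0)   # ~156 ohms with the added external resistor

C_CABLE_PF = 50.0                # ASSUMED cable capacitance -- placeholder only

tau_old_ns = R_OLD * C_CABLE_PF * 1e-3   # R [ohm] * C [pF] = tau [ps] -> ns
tau_new_ns = R_NEW * C_CABLE_PF * 1e-3

print(f"effective termination: {R_OLD:.0f} ohm -> {R_NEW:.0f} ohm")
print(f"RC time constant: {tau_old_ns:.1f} ns -> {tau_new_ns:.1f} ns "
      f"({tau_old_ns / tau_new_ns:.1f}x faster)")
```

Whatever the actual cable capacitance, the RC constant drops by the ratio of the resistances (about 4.5x), which is the mechanism by which the data arrive in time at 22 MHz and above.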