Bit error rate

Kevin Kilzer

Is there any way to estimate the bit error rate of a data bus that
passes through a Xilinx FPGA? I have input gates, the block RAM, and
output gates involved in the system, and I would like to predict the
error rate of data passing through.

Kevin
 
Is there any way to estimate the bit error rate of a data bus that
passes through a Xilinx FPGA? I have input gates, the block RAM, and
output gates involved in the system, and I would like to predict the
error rate of data passing through.
I'm missing something. What kind of errors are you interested
in?

If your design is clean, the error rate from everything short of
cosmic rays should be 0. Or at least low enough so that it
is very very hard to measure.

Note that "clean" includes the logic and power supply and SI
on the input and output sides.

--
The suespammers.org mail server is located in California. So are all my
other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's. I hate spam.
 
On Tue, 30 Sep 2003 03:13:34 -0000, hmurray@suespammers.org (Hal
Murray) wrote:

Is there any way to estimate the bit error rate of a data bus that
passes through a Xilinx FPGA? I have input gates, the block RAM, and
output gates involved in the system, and I would like to predict the
error rate of data passing through.
snip
If your design is clean, the error rate from everything short of
cosmic rays should be 0. Or at least low enough so that it
is very very hard to measure.

Note that "clean" includes the logic and power supply and SI
on the input and output sides.
Then why do DRAM memory systems include a CRC or parity bit?
Certainly there is some non-zero probability that a latch will miss or
a gate will experience a random noise spike?

If what you say is true, the BER of a disk drive will be entirely the
fault of a noisy head, and not the deserializer, cache or bus drivers?

Kevin
 
Hi Kevin,

If your design is clean, the error rate from everything short of
cosmic rays should be 0. Or at least low enough so that it
is very very hard to measure.

Then why do DRAM memory systems include a CRC or parity bit?
Certainly there is some non-zero probability that a latch will miss or
a gate will experience a random noise spike?

If what you say is true, the BER of a disk drive will be entirely the
fault of a noisy head, and not the deserializer, cache or bus drivers?
DRAMs include CRC to protect against bit flips in their RAM cells. The
things they worry about are alpha particle and neutron strikes. When the
strike occurs, it can create a momentary current that flips the state of the
RAM cell -- it's a function of the number of RAM cells and the resilience of
each cell. The bigger the cap of the cell, the harder it is to flip the
value. The more there are, the higher the chance that a strike will affect
a given chip.

There are some people who are starting to worry about particle induced
glitches in logic/routing, but the consensus is that isn't a problem yet.
For those people who are very paranoid, there are techniques such as triple
modular redundancy (think of it as a circuit in triplicate that takes a best
two-out-of-three result) that essentially reduce the chance of logic faults
to zero.
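
A minimal sketch of the two-out-of-three vote described above, done bitwise on three copies of a data word. The function name and the sample values are made up for illustration; in a real TMR design the voter is logic in the fabric, not software.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise two-out-of-three vote across three copies of a word."""
    # A result bit is 1 only when at least two of the three copies agree on 1.
    return (a & b) | (a & c) | (b & c)


good = 0b10110010
upset = good ^ 0b00001000           # one copy with a single flipped bit
voted = majority_vote(good, upset, good)
print(f"{voted:08b}", voted == good)
```

A single upset in any one copy is outvoted. Two copies flipping the same bit would still get through, which is why this only reduces the chance of a fault rather than removing it.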

A far more common cause of logic faults is poor design -- if your design is
sensitive to momentary glitches (i.e. asynchronous) you are much more likely
to have a problem if some event causes a glitch. Most often, this is due to
cross-talk or other such down-to-earth electrical issues. We design the
routing in our FPGAs so that they will not glitch even with worst-case
attackers.

BTW, if I recall correctly, the biggest cause of bit errors in a hard disk is
the tininess of the bits they read & write -- they ain't 1's and 0's at
that point, more like best guesses :) You also get bit errors in the
communication medium (cheap ribbon cable) connecting the hard disk to the
HDD controller.

Regards,

Paul Leventis
Altera Corp.
 
On Tue, 30 Sep 2003 04:49:44 GMT, Kevin Kilzer
<kkilzer.remove.this@mindspring.com> wrote:

On Tue, 30 Sep 2003 03:13:34 -0000, hmurray@suespammers.org (Hal
Murray) wrote:

Is there any way to estimate the bit error rate of a data bus that
passes through a Xilinx FPGA? I have input gates, the block RAM, and
output gates involved in the system, and I would like to predict the
error rate of data passing through.
snip
If your design is clean, the error rate from everything short of
cosmic rays should be 0. Or at least low enough so that it
is very very hard to measure.

Note that "clean" includes the logic and power supply and SI
on the input and output sides.

Then why do DRAM memory systems include a CRC or parity bit?
Certainly there is some non-zero probability that a latch will miss or
a gate will experience a random noise spike?

If what you say is true, the BER of a disk drive will be entirely the
fault of a noisy head, and not the deserializer, cache or bus drivers?
The issue with DRAM cells is that they occupy such a large portion of the
die and that they are optimized for size to hold the minimum charge
necessary to keep a bit till the next refresh cycle. This makes them
particularly vulnerable to various forms of radiation. If you notice,
there are very few designs where the error probability of random logic
is controlled against errors caused by radiation.
In terms of disk drives, the main problem is the uncertainty in the
bit lengths on the media and the clock/data recovery after data is
captured by the head in addition to the noise added by the head. Again
errors caused by radiation are not a major concern in the data path or
even in the cache, as the cache is most probably static memory, which is
more resistant to such errors. With well-tested logic (external scan,
BIST, etc.), the probability of radiation-induced errors is
completely negligible in almost all designs, unless they
involve large quantities of DRAM or are used in space or life-critical
applications.


Muzaffer Kal

http://www.dspia.com
ASIC/FPGA design/verification consulting specializing in DSP algorithm implementations
 
Kevin Kilzer <kkilzer.remove.this@mindspring.com> wrote in message news:<dh2invss3jbj7j0ovr8n89urk7or73lh5b@4ax.com>...
On Tue, 30 Sep 2003 03:13:34 -0000, hmurray@suespammers.org (Hal
Murray) wrote:

Is there any way to estimate the bit error rate of a data bus that
passes through a Xilinx FPGA? I have input gates, the block RAM, and
output gates involved in the system, and I would like to predict the
error rate of data passing through.
snip
If your design is clean, the error rate from everything short of
cosmic rays should be 0. Or at least low enough so that it
is very very hard to measure.

Note that "clean" includes the logic and power supply and SI
on the input and output sides.

Then why do DRAM memory systems include a CRC or parity bit?
Certainly there is some non-zero probability that a latch will miss or
a gate will experience a random noise spike?

If what you say is true, the BER of a disk drive will be entirely the
fault of a noisy head, and not the deserializer, cache or bus drivers?
Considering that you specifically mentioned "input gates, block RAM, and
output gates" in your original posting, Hal's response was correct.

Now, if you'd actually mentioned DRAM and disk drives, I'm sure Hal's
response would have been different.

Marc
 
Kevin,

If the design has the proper amount of timing slack, and the clock for
the design has well behaved jitter, then the error rate is 0.

If there is inadequate slack in the timing, and the clock jitter is
unbounded, then the error rate is non-zero.

At some point, the error rate becomes so small that other things are
likely to occur before an error is noticed/logged/reported. Like a power
loss. Or a circuit failure somewhere (not in the FPGA).

Jitter is often modeled with Gaussian distributions, but actual
oscillators do not have infinite energy, so they don't actually have
unbounded "tails" where the jitter value keeps increasing indefinitely as
the probability decreases (true random jitter).
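
To see what the Gaussian model predicts before real oscillators cut the tails off, here is a rough sketch that computes the per-cycle probability of jitter exceeding the available slack. The function name and the 5 ps RMS jitter figure are assumptions chosen for illustration, not numbers from any datasheet.

```python
import math


def gaussian_tail(slack_ps: float, rms_jitter_ps: float) -> float:
    """One-sided probability that zero-mean Gaussian jitter exceeds the slack."""
    return 0.5 * math.erfc(slack_ps / (rms_jitter_ps * math.sqrt(2.0)))


# Hypothetical clock with 5 ps RMS jitter, checked against several slack values.
for slack in (10.0, 25.0, 50.0, 70.0):
    print(f"slack {slack:5.1f} ps -> P(violation per cycle) = "
          f"{gaussian_tail(slack, 5.0):.3e}")
```

The probability collapses very quickly once the slack is many times the RMS jitter. Under the idealized model it never reaches exactly zero; with a bounded real-world distribution it does.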

Bit errors also almost never occur at a steady rate; they occur in clumps,
or bursts, and are therefore not random at all. A channel with dribbling
bit errors is broken, and should get fixed or have error correction added
on top of it.

Check out the articles on the tech Xclusives pages on jitter, timing, and
slack.

Soft errors from cosmic rays are well understood (at least by us), so you
can also take these into account (if an error every ~1000 years is
important in your application - which it is for many today).
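
As a rough illustration of why one upset per ~1000 years can still matter, the sketch below converts that interval into FIT (failures per 10^9 device-hours) and scales it to a fleet of devices. It assumes upsets are independent and occur at a constant rate; the function names and the fleet sizes are made up for the example.

```python
HOURS_PER_YEAR = 8766.0  # 365.25 days


def fit_from_mtbf_years(mtbf_years: float) -> float:
    """Failures per 1e9 device-hours for a given mean time between upsets."""
    return 1e9 / (mtbf_years * HOURS_PER_YEAR)


def fleet_years_between_upsets(mtbf_years: float, devices: int) -> float:
    """Expected years between upsets anywhere in a fleet of identical devices."""
    return mtbf_years / devices


print(f"{fit_from_mtbf_years(1000.0):.0f} FIT per device")
for n in (1, 100, 10_000):  # hypothetical fleet sizes
    print(n, "devices ->", fleet_years_between_upsets(1000.0, n), "years")
```

With enough boxes in the field, a per-device interval of centuries turns into something that shows up in support logs.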

Check out the article on this on the Xilinx website: "1000 Years
Between Single Event Upsets" on the tech Xclusives pages.

By the way, we recently put the 90 nm Spartan 3 in the neutron beam, and
we are gratified (and delighted) that it has ~30% smaller cross section
than the 150 nm technology (i.e., it will be upset less frequently!).

(Presented at MAPLD this last month. For a copy of the presentation,
contact your FAE.)

Austin

Kevin Kilzer wrote:

Is there any way to estimate the bit error rate of a data bus that
passes through a Xilinx FPGA? I have input gates, the block RAM, and
output gates involved in the system, and I would like to predict the
error rate of data passing through.

Kevin
 
Thanks to all for the clear explanations. I will further assume that
since SRAM is similar to logic (unlike DRAM), that the SRAM is
practically immune to upset also. I'll stop worrying about random
events in the logic, and concentrate my testing on the other factors
that were mentioned.

Kevin
 
Followup to: <g94qnv8hcapk91jgl1djj6nqlvtq9ounte@4ax.com>
By author: kkilzer.remove.this@mindspring.com
In newsgroup: comp.arch.fpga
Thanks to all for the clear explanations. I will further assume that
since SRAM is similar to logic (unlike DRAM), that the SRAM is
practically immune to upset also. I'll stop worrying about random
events in the logic, and concentrate my testing on the other factors
that were mentioned.
SRAM is more immune than DRAM, but SRAM is frequently built using
process-optimized cells, and does have a (small) probability of being
affected by intrinsic radiation.

This is why most processor vendors have started using ECC on the
caches; especially the larger (L2+) caches.

That being said, I don't expect to see these in an FPGA, and it's
highly unlikely to be the source of any problems you might see.

-hpa




--
<hpa@transmeta.com> at work, <hpa@zytor.com> in private!
If you send me mail in HTML format I will assume it's spam.
"Unix gives you enough rope to shoot yourself in the foot."
Architectures needed: ia64 m68k mips64 ppc ppc64 s390 s390x sh v850 x86-64
 
Then why do DRAM memory systems include a CRC or parity bit?
Certainly there is some non-zero probability that a latch will miss or
a gate will experience a random noise spike?

If what you say is true, the BER of a disk drive will be entirely the
fault of a noisy head, and not the deserializer, cache or bus drivers?
Here is how I look at this area...

There are two types of electronics: logic and communications.

Logic includes computers. Everybody expects them to work correctly.

Communications includes things like Ethernet, fibers, and satellite
links. People (including engineers) expect a few errors.

The error rate you actually get on a communications link is determined
by the signal to noise ratio. That assumes classic Gaussian noise. See
any good communications text book. The key here is "Gaussian", which
is pretty good for fibers and satellite links.

On most communication links, there is an economic tradeoff. How
many miles can you go on a fiber before you get too many errors?
How many bits/second can you get on a satellite link before you get
too many errors?

Disks are similar to communications links. How many bits per square
inch can I get vs what is the error rate when reading them back?

On the other hand, people expect logic (and/or computers) to get the
right answer - zero errors. All that means is that they are running
with a signal/noise ratio that is very very very good relative to
communications links.

If you look at the classic errors vs signal/noise chart, you will see
that it's exponential. Make the signal stronger and the errors go
way down. Make it still stronger and you can't measure them.
Classic logic (gates and FFs) operate in a signal/noise range that
is so far off scale that they don't make enough errors to worry about.
You need to worry about other sources of errors instead - things
like meteorites landing in your lab and smashing your FPGA but
missing your error testing gear.
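
To put rough numbers on that chart, here is a sketch using the textbook bit error rate for binary antipodal signalling in Gaussian noise, BER = 0.5 * erfc(sqrt(Eb/N0)). That modulation is picked only as a familiar example; the function name and the list of SNR points are assumptions for illustration.

```python
import math


def ber_antipodal(snr_db: float) -> float:
    """BER for binary antipodal signalling in Gaussian noise, Eb/N0 given in dB."""
    ebn0 = 10.0 ** (snr_db / 10.0)
    return 0.5 * math.erfc(math.sqrt(ebn0))


# A few dB of extra signal buys many orders of magnitude in error rate.
for snr_db in (4, 8, 12, 16, 20):
    print(f"{snr_db:2d} dB -> BER {ber_antipodal(snr_db):.3e}")
```

Communications links are run at the left end of this curve because every dB costs money. Logic sits far off the right end, where the computed rate is too small to have any practical meaning.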

If you look at errors on logic/computers, you can lump them into
several buckets.
  Design/software errors
    For example, Intel's divide bug
  Fabrication errors from the factory.
    broken chips, inadequate testing, assembly errors, ...
  Systematic errors that are similar to noise. These are things
  we can localize and analyse, but often overlook:
    Clock jitter (see Austin's msg)
    crosstalk
    noise on power rails
    Alpha particles, cosmic rays...
    Signal integrity, reflections
      (see the recent Spartan3 discussions)
    ...
    These are the hardware versions of the software bugs above.
  Thermal noise:
    This is what's left after you correct for all of the above.
    If something strange happens often enough, somebody will figure
    out what's going on and put a name on it and it will get added
    to the list above.

DRAMs are interesting. They are on the border between communications
and logic. We want them to work all the time, but we also want them
to be cheap. Cheap means small which means they are more likely to
drop bits if an alpha particle hits the right place.
(When people were first starting to get interested in alpha particles,
they were black magic or "thermal" noise. As soon as people understood
what was going on then they could measure and avoid the problem.)

If you want cheap DRAMs, you will get occasional errors. (You can't
buy any other kind, so get used to it.) With ECC and good software
(scrubbing) you can get close to error free DRAMs. Similarly, with
good FEC (Forward Error Correcting) you can get close to no errors
on satellite links.
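
For a feel of how ECC turns an occasional flipped bit into a non-event, here is a toy Hamming(7,4) encoder and decoder that corrects any single-bit error. It is a deliberately small stand-in: real memory ECC uses wider codes over whole words, and the function names here are made up for the example.

```python
def hamming74_encode(nibble: int) -> list:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7 at index 0..6)."""
    d1, d2, d3, d4 = [(nibble >> i) & 1 for i in range(4)]
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code: list) -> int:
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 | (s2 << 1) | (s3 << 2)   # 1-based position of the bad bit
    if syndrome:
        c[syndrome - 1] ^= 1                # flip it back
    d1, d2, d3, d4 = c[2], c[4], c[5], c[6]
    return d1 | (d2 << 1) | (d3 << 2) | (d4 << 3)


word = hamming74_encode(0b1011)
word[4] ^= 1                                # simulate a single-bit upset
print(hamming74_decode(word) == 0b1011)
```

Scrubbing is the other half: read, correct, and write back periodically so that a second upset does not land in the same word before the first one has been repaired.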


Back to FPGAs. Roughly, they don't make any errors. What I mean
by that is that the gates and FFs work as expected. I expect there
is some thermal noise, but I doubt if you can measure it. There are
too many other things causing more interesting problems.

If your question was really how many errors to expect on a
simple input-FPGA/BRAM-output type design, my answer would be
"it depends". How solid is your design? Any metastability
involved? What sort of external noise/EMI? What are your input/output
lines connected to? Is the clock solid? When you get done
with all those questions, then you get to ask about alpha particles
and cosmic rays.

It's really really hard to prove that your design is solid.
The software/systems guys have a neat phrase. Testing can't
prove the absence of bugs. It can only demonstrate their
existence.
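
The same limit applies to measuring a bit error rate: an error-free test run only gives an upper bound. A rough sketch, assuming independent errors (which real channels with bursty errors do not obey), so treat the result as optimistic. The function names and the example bit counts are assumptions for illustration.

```python
import math


def ber_upper_bound(bits_tested: float, confidence: float = 0.95) -> float:
    """Upper bound on BER after observing zero errors in bits_tested bits."""
    # P(no errors) ~ exp(-BER * N); solve exp(-BER * N) = 1 - confidence.
    return -math.log(1.0 - confidence) / bits_tested


def bits_needed(target_ber: float, confidence: float = 0.95) -> float:
    """Error-free bits required to claim the BER is below target_ber."""
    return -math.log(1.0 - confidence) / target_ber


print(f"1e12 clean bits -> BER < {ber_upper_bound(1e12):.2e} at 95%")
print(f"to claim 1e-15  -> need {bits_needed(1e-15):.2e} clean bits")
```

At 95% confidence this is the familiar "about 3 divided by the number of bits" rule, which shows how quickly very low error rates become impractical to demonstrate by test alone.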

The software guys have a set of tricks that make looking for bugs
more productive. Similarly, with hardware, it helps to look in
the places that are likely to cause errors. Put a scope on your
clocks. Check your power. Look at the places where signals cross
clock domains. ...
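
For the clock-domain crossings, the usual textbook estimate of synchronizer reliability is MTBF = exp(t_r / tau) / (T0 * f_clk * f_data). A small sketch follows; the tau and T0 values are placeholders picked purely for illustration, since the real ones are device-specific and have to come from the vendor's characterization data.

```python
import math


def synchronizer_mtbf_seconds(resolve_time_s: float, tau_s: float, t0_s: float,
                              f_clk_hz: float, f_data_hz: float) -> float:
    """Textbook metastability MTBF for a flip-flop sampling an async signal."""
    return math.exp(resolve_time_s / tau_s) / (t0_s * f_clk_hz * f_data_hz)


# Hypothetical numbers: 100 MHz clock, 1 MHz async data, tau = 50 ps, T0 = 1 ns.
for resolve_ns in (1.0, 2.0, 5.0):
    mtbf = synchronizer_mtbf_seconds(resolve_ns * 1e-9, 50e-12, 1e-9, 100e6, 1e6)
    print(f"{resolve_ns:.0f} ns to resolve -> MTBF {mtbf:.3e} s")
```

The exponential term is why a second synchronizer flip-flop, which gives the first one nearly a full extra clock period to settle, helps so much.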

 
