Watchdog Timers for FPGA Designs

rickman

A recent thread in comp.arch.embedded concerns using watchdog timers. I
made the point that they are used with software designs because of the
many possible ways they can screw up while hardware designs tend to be
less prone to failures that would require the use of a watchdog timer to
restore operation. This of course does not include designs that are
subject to single event upset (SEU) such as space flight.

Opinions? Anyone here use watchdogs and care to share examples?

Anyone seen an ASIC that used a watchdog to get it out of a stuck state?
That would include a CPU monitoring behavior and giving the ASIC a
swift kick in the reset.

One point I was challenged on was that every FSM has the potential to
lock up, and if it can't be designed to preclude that, an internal
watchdog would reset the FSM. I don't agree that this is "always"
needed, but if the protocol specifies a timeout, then this is part of
the protocol and not a "watchdog" in the true sense of looking for
aberrant behavior.

--

Rick C
 
rickman wrote:

> A recent thread in comp.arch.embedded concerns using watchdog timers. I
> made the point that they are used with software designs because of the
> many possible ways they can screw up while hardware designs tend to be
> less prone to failures that would require the use of a watchdog timer to
> restore operation. This of course does not include designs that are
> subject to single event upset (SEU) such as space flight.
>
> Opinions? Anyone here use watchdogs and care to share examples?

I make a line of motion control interfaces. All of them have a problem that
if the CPU stops talking to them, they'd just keep commanding motion at the
same speed. One uses analog velocity outputs, so even if the local crystal
oscillator stopped, it would keep that analog voltage going to the servo
amp. If the oscillator stopped, then any kind of digital watchdog would
never trip. So, I used an external one-shot and "non-clocked" logic to trip
the E-stop FF to go into E-stop. That clears the DAC registers to zero
Volts and shuts down the digital outputs that may enable servo amps, spindle
motors, etc.

When I say "non-clocked" above, I am referring to SR FFs and such latch-like
constructs, so they are expected to work without a system clock.
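
As a rough illustration of the scheme described above, here is a minimal behavioural sketch in Python of a retriggerable one-shot feeding a latching E-stop. It is a model, not the actual circuit: the class names, the 1 ms model-time step, the 25 ms timeout and the DAC value of 1000 are all assumptions made up for the example. The loop variable stands for real elapsed time, which the external one-shot tracks without any system clock.

```python
class RetriggerableOneShot:
    """Stand-in for an external one-shot: its output stays high for
    timeout_ms after the most recent trigger, with no system clock."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self.last_trigger = None  # no trigger seen yet

    def trigger(self, now_ms):
        self.last_trigger = now_ms

    def active(self, now_ms):
        if self.last_trigger is None:
            return False
        return now_ms - self.last_trigger < self.timeout_ms


class EStopLatch:
    """SR-style latch: set when the one-shot times out, cleared only by
    an explicit reset, so motion does not resume on its own."""

    def __init__(self):
        self.tripped = False

    def update(self, watchdog_active):
        if not watchdog_active:
            self.tripped = True

    def reset(self):
        self.tripped = False


def run(kick_times_ms, timeout_ms, end_ms, dac_command=1000):
    """Step model time in 1 ms increments; return (trip time, final DAC value)."""
    one_shot = RetriggerableOneShot(timeout_ms)
    latch = EStopLatch()
    kicks = set(kick_times_ms)
    dac = dac_command
    trip_time = None
    for now in range(end_ms + 1):
        if now in kicks:
            one_shot.trigger(now)      # the CPU wrote to the interface
        latch.update(one_shot.active(now))
        if latch.tripped:
            dac = 0                    # E-stop clears the DAC to zero volts
            if trip_time is None:
                trip_time = now
    return trip_time, dac


print(run(range(0, 101, 10), 25, 100))   # CPU keeps talking: (None, 1000)
print(run([0, 10, 20, 30], 25, 100))     # CPU goes quiet after 30 ms: (55, 0)
```

Because the latch is only cleared by an explicit reset, a late write from the CPU after the trip does not restart motion, which matches the stated goal of a reliable halt rather than automatic recovery.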

Jon
 
On 5/11/2016 2:46 PM, Jon Elson wrote:
> I make a line of motion control interfaces. All of them have a problem that
> if the CPU stops talking to them, they'd just keep commanding motion at the
> same speed. One uses analog velocity outputs, so even if the local crystal
> oscillator stopped, it would keep that analog voltage going to the servo
> amp. If the oscillator stopped, then any kind of digital watchdog would
> never trip. So, I used an external one-shot and "non-clocked" logic to trip
> the E-stop FF to go into E-stop. That clears the DAC registers to zero
> Volts and shuts down the digital outputs that may enable servo amps, spindle
> motors, etc.
>
> When I say "non-clocked" above, I am referring to SR FFs and such latch-like
> constructs, so they are expected to work without a system clock.

If I understand, your watchdog is not specific to the FPGA, but is a
system watchdog in case of failure anywhere in the system, right? Have
you seen the oscillator stop on a board? I don't have much experience
with clocks faulting in the field, but obviously it is important to
protect against any failure.

The other conversation I had in c.a.e is making me wonder if anyone has
enough mistrust of their FPGA to add a watchdog in case a design fault
causes a problem. SEU is very uncommon unless you are in a high
radiation environment. There are always power supply glitches which can
upset an FPGA, especially the RAM based types. But do designers worry
about design flaws in HDL? Any examples of design mistakes you wish to
protect against, where a watchdog is useful?

--

Rick C
 
On 11/05/16 17:30, rickman wrote:
> A recent thread in comp.arch.embedded concerns using watchdog timers. I made
> the point that they are used with software designs because of the many possible
> ways they can screw up while hardware designs tend to be less prone to failures
> that would require the use of a watchdog timer to restore operation. This of
> course does not include designs that are subject to single event upset (SEU)
> such as space flight.
>
> Opinions? Anyone here use watchdogs and care to share examples?
>
> Anyone seen an ASIC that used a watchdog to get it out of a stuck state? That
> would include a CPU monitoring behavior and giving the ASIC a swift kick in the
> reset.
>
> One point I was challenged on was that every FSM has the potential to lock up,
> and if it can't be designed to preclude that, an internal watchdog would reset
> the FSM. I don't agree that this is "always" needed, but if the protocol
> specifies a timeout, then this is part of the protocol and not a "watchdog" in
> the true sense of looking for aberrant behavior.

I was once called in to fault-find a system that turned 18" pipes
into very interesting and pretty shapes. Every now and then the
controller would become catatonic and a /large/ lump of metal would
zoom off at ~1m/s - until the Big Red Switch was kicked.

The source turned out to be an infrequent hardware static-1 hazard
glitch. My recommendations were to fix the specific fault and also
to implement a hardware watchdog timer.

Yes, that was a design fault, and could have been trapped during
a design review - but it wasn't. Where all available people are
operating in new areas, there is a significant chance that design
faults will slip through despite everybody's best intentions.

Hence, in practice, it is unduly optimistic to say watchdog
timers aren't necessary because design rules are /sufficient/
to prevent design errors.

And we should, of course, always consider that equipment may not
be installed correctly, and/or an installation can degrade over
time.

Watchdog timers can be a useful last line of defence against
such events.
 
rickman wrote:


> If I understand, your watchdog is not specific to the FPGA, but is a
> system watchdog in case of failure anywhere in the system, right? Have
> you seen the oscillator stop on a board? I don't have much experience
> with clocks faulting in the field, but obviously it is important to
> protect against any failure.

Well, to protect against as many as can be done, practically. A guy I knew
a long time ago was given the task of finding ALL single points of failure
in a large central office telephone switch. They had two lockstep parallel
CPUs and a hang detector. There was a complicated network to provide a
glitch-free changeover if the master clock source failed. No matter how
hard the design engineer tried, my friend always found there was STILL a
single point of failure that would leave the system with no clock.
There were a bunch of other single points of failure in the system.

So, I'm just trying to cover as many cases as I can with simple logic. The
watchdog was much more aimed at computer or communication failure than a
clock stoppage, but this logic should handle both cases.

In my case, all I want is a reliable halt to all motion, and don't care
about automatic recovery.

> The other conversation I had in c.a.e is making me wonder if anyone has
> enough mistrust of their FPGA to add a watchdog in case a design fault
> causes a problem. SEU is very uncommon unless you are in a high
> radiation environment. There are always power supply glitches which can
> upset an FPGA, especially the RAM based types. But do designers worry
> about design flaws in HDL? Any examples of design mistakes you wish to
> protect against, where a watchdog is useful?

Yes, I did a little motion control thing that was supposed to shuttle a rack
of samples back and forth, and it was going to be in an area where access by
people would be restricted, so we wanted it to try to muddle through even
when something went wrong. I added a few lines of VHDL here and there to
try to trap abnormal cases that "should never happen" and go to the main
reset condition and keep running. These are standard practices, such as a
binary-coded state machine, where all unused states explicitly go to a
defined state.

Jon
 
On 5/11/2016 5:41 PM, Tom Gardner wrote:
> I was once called in to fault-find a system that turned 18" pipes
> into very interesting and pretty shapes. Every now and then the
> controller would become catatonic and a /large/ lump of metal would
> zoom off at ~1m/s - until the Big Red Switch was kicked.
>
> The source turned out to be an infrequent hardware static-1 hazard
> glitch. My recommendations were to fix the specific fault and also
> to implement a hardware watchdog timer.

Interesting. I learned that synchronous logic was used to provide
outputs that needed to not change state with changing inputs that were
stable within the setup time of the circuit. It can be hard to produce
combinatorial logic without static hazards, so they should not be
attempted without formal analysis of the final circuit.
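
For readers unfamiliar with the term, a static-1 hazard can be shown with a small unit-delay simulation in Python. The circuit here is a textbook 2:1 multiplexer, f = a·s + b·s', chosen purely as an assumption for illustration; the logic in the pipe-bending controller is not described in this thread. With a = b = 1 the output should stay at 1 when s changes, but the inverter delay lets both AND terms be low for one gate delay. Adding the consensus term a·b covers the gap.

```python
def simulate_mux(s_wave, a=1, b=1, with_consensus=False):
    """Unit-delay model of f = a*s + b*(not s), optionally + a*b.

    Every gate output is computed from the values its inputs had one
    tick earlier. Returns the value of f at each tick.
    """
    # Start from the settled state for the first value of s.
    s = s_wave[0]
    ns = 1 - s                      # inverter output
    t1 = a & s                      # first AND term
    t2 = b & ns                     # second AND term
    t3 = (a & b) if with_consensus else 0   # consensus term (inputs constant)
    f = t1 | t2 | t3
    out = []
    for s_new in s_wave:
        # Tuple assignment: the right-hand side uses only the old values.
        ns, t1, t2, f = (1 - s), (a & s), (b & ns), (t1 | t2 | t3)
        s = s_new
        out.append(f)
    return out


s_wave = [1, 1, 0, 0, 0, 0]         # s falls from 1 to 0
print(simulate_mux(s_wave))                        # [1, 1, 1, 1, 0, 1]
print(simulate_mux(s_wave, with_consensus=True))   # [1, 1, 1, 1, 1, 1]
```

The single 0 in the first trace is the glitch. If that output is registered after it has settled, the glitch is harmless; if it drives an asynchronous input such as a clock, reset or latch enable, it can do real damage.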


> Yes, that was a design fault, and could have been trapped during
> a design review - but it wasn't. Where all available people are
> operating in new areas, there is a significant chance that design
> faults will slip through despite everybody's best intentions.

So how did the watchdog work?


> Hence, in practice, it is unduly optimistic to say watchdog
> timers aren't necessary because design rules are /sufficient/
> to prevent design errors.
>
> And we should, of course, always consider that equipment may not
> be installed correctly, and/or an installation can degrade over
> time.
>
> Watchdog timers can be a useful last line of defence against
> such events.

--

Rick C
 
On 12/05/16 05:55, rickman wrote:
> On 5/11/2016 5:41 PM, Tom Gardner wrote:
>> I was once called in to fault-find a system that turned 18" pipes
>> into very interesting and pretty shapes. Every now and then the
>> controller would become catatonic and a /large/ lump of metal would
>> zoom off at ~1m/s - until the Big Red Switch was kicked.
>>
>> The source turned out to be an infrequent hardware static-1 hazard
>> glitch. My recommendations were to fix the specific fault and also
>> to implement a hardware watchdog timer.
>
> Interesting. I learned that synchronous logic was used to provide outputs that
> needed to not change state with changing inputs that were stable within the
> setup time of the circuit. It can be hard to produce combinatorial logic
> without static hazards, so they should not be attempted without formal analysis
> of the final circuit.
>
>> Yes, that was a design fault, and could have been trapped during
>> a design review - but it wasn't. Where all available people are
>> operating in new areas, there is a significant chance that design
>> faults will slip through despite everybody's best intentions.
>
> So how did the watchdog work?

The company didn't want to pay me to design it for them. I have
no idea what they did/didn't do after my three days were up!
 
