reset strategy FPGA Igloo

rickman · Oct 17, 2013

On 10/11/2013 3:43 PM, Joseph H Allen wrote:

In article<l39dsb$4q1$1@speranza.aioe.org>,
glen herrmannsfeldt<gah@ugcs.caltech.edu> wrote:

First, I agree with what Mark says.

In addition, note that most FPGA families have a global reset line
similar to the global clock lines. They usually keep all the FF held
at reset until configuration is done, and also allow you to use that
reset line. It is there, it is free, and you might as well use it.

You do still have to get the timing right, so you release it at
the right time relative to the clock edge.

The timing provided by the global reset line is not good.. it's nowhere near
as good as a global clock line as far as I understand.

One way to deal with this is to have all of your state machines start in a
reset state which does nothing but wait for a synchronous "start" edge which
is generated after reset with a counter or a shift register.

This has a big advantage that you no longer have to worry about global reset
timing. On the other hand, if you use libraries you may have no choice
since you can't change the logic.
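
A rough sketch of the "wait for start" idea described above, for illustration only. The entity name `start_gate`, the four-stage delay, and the two-state machine are assumptions made up for this example; the shift register could just as well be a counter, and `start` is treated here as a level rather than an edge.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity start_gate is
  port (
    clk   : in  std_logic;
    rst_n : in  std_logic;  -- asynchronous, active-low reset
    busy  : out std_logic   -- '1' once the machine has left its reset state
  );
end entity start_gate;

architecture rtl of start_gate is
  type state_t is (S_RESET, S_RUN);
  signal state : state_t;
  signal delay : std_logic_vector(3 downto 0);
  signal start : std_logic;
begin
  -- Shift register: start goes high once four '1's have been shifted in
  -- after rst_n is released.
  process (clk, rst_n)
  begin
    if rst_n = '0' then
      delay <= (others => '0');
    elsif rising_edge(clk) then
      delay <= delay(2 downto 0) & '1';
    end if;
  end process;

  start <= delay(3);

  -- State machine: sits in S_RESET doing nothing until start is seen.
  process (clk, rst_n)
  begin
    if rst_n = '0' then
      state <= S_RESET;
    elsif rising_edge(clk) then
      case state is
        when S_RESET =>
          if start = '1' then
            state <= S_RUN;
          end if;
        when S_RUN =>
          null;  -- normal operation would go here
      end case;
    end if;
  end process;

  busy <= '1' when state = S_RUN else '0';
end architecture rtl;
```

Because the state register does not change until `start` arrives several clocks after reset release, the exact release time of `rst_n` only matters for the first stage of the shift register.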

This really doesn't solve the problem. The problem is that globally
resetting all the state machines in a design can tax the routing and
timing of the global reset signal. If you do this the routing problem
can be fixed by resynchronizing the reset to the clock before using it
in a given section of logic that should be geographically local on the
chip. Otherwise you can give P&R some very tough problems to solve.
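
A minimal sketch of what resynchronizing the reset locally might look like, assuming an active-low reset and a two-stage synchronizer. The entity and port names (`reset_sync`, `arst_n`, `rst_n_out`) are made up for illustration; one such block would sit in front of each local section of logic.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity reset_sync is
  port (
    clk       : in  std_logic;
    arst_n    : in  std_logic;  -- asynchronous, active-low reset request
    rst_n_out : out std_logic   -- asserted asynchronously, released on a clk edge
  );
end entity reset_sync;

architecture rtl of reset_sync is
  signal sync : std_logic_vector(1 downto 0);
begin
  process (clk, arst_n)
  begin
    if arst_n = '0' then
      -- Assertion takes effect immediately, without waiting for a clock.
      sync <= (others => '0');
    elsif rising_edge(clk) then
      -- Release ripples through two flip-flops, so rst_n_out
      -- deasserts synchronously to clk.
      sync <= sync(0) & '1';
    end if;
  end process;

  rst_n_out <= sync(1);
end architecture rtl;
```

The local logic then uses `rst_n_out` instead of the global net, so only the short path from this block to its own section has to meet reset recovery timing.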

--

Rick

rickman · Oct 17, 2013

On 10/16/2013 11:21 AM, alb wrote:

Hi Rick,

On 16/10/2013 10:29, rickman wrote:
[]
Ok, I think I understand the problem now. The fact that you are working
on a space application makes it more clear why you are using an Actel
part. I think space applications are what kept Actel alive for so many
years.

and automotive, avionics, defense, just to mention a few more.

I'm not familiar with automotive use. I would think they would have
been a much bigger company if there had been much automotive use of
their parts. Just one design win is 500,000 a year in that market.

I sort of lump space, avionic and defense into the same category since
they have similar requirements and are all much smaller than commercial
markets. Perhaps we can call this generically "high rel" and also
include the small commercial market segment. High markup, low volume...

So you have a poor hardware design that you can't change and you are
looking for some trick within the FPGA to generate a power on reset.

correct.

That is the crux of the problem. While some brands of FPGAs do a
reasonable job of detecting power on and resetting the entire device, it
sounds like the Actel doesn't bother since... well, since it doesn't
need to as long as it can rely on the user to reset it.

I'm still not clear on why you can't use the PLL lock signal with
conditioning. If the PLL lock signal is deasserted at power on, there
is your power on reset. Then you simply need to condition it well
enough to properly control exit from reset reliably.

Am I still missing something?

Maybe you missed what I wrote in my previous post in reply to your
suggestion, so I'll paste it here for your convenience:

I'll set the counter to count the maximum lock time as specified in the
datasheet and I should be good to go. Synchronous release of the reset
line should make it simple to perform STA on the reset line and let the
P&R meet the time constraints.

Exactly as you proposed but specifying the length of the counter.

I'm not clear on this. It sounds like you are using the timer in
*place* of the lock signal from the PLL. My point was to condition the
lock signal from the PLL. But since you say below that you don't know
the state of the PLL lock signal at power up I suppose this won't work.
It would be a simple matter to test it though. Or you can contact the
factory. Since there is no configuration process the logic of the PLL
should be ready and working on power up I would expect.

Indeed there's even a more subtle detail to add: since I do not know
what is the state of the lock signal at power up, my counter should have
a fixed value at reset and start to count when lock signal is deasserted
if and only if the fixed value of the counter is present. In this way I
can guarantee that a lock signal which is undefined at the beginning
does not allow the counter to start from a random position and hit the
target value too early. Something along these lines:

code: not tested!
-- rst is the lock signal from PLL
process (clk, rst)
begin
  if rst = '0' then
    counter <= "101001000000"; -- 6 bit counter + 6 bit check
    global_rst <= '0'; -- active low reset
  elsif rising_edge(clk) then
    if counter(11 downto 6) = "101001" then
      counter <= counter + 1;
    elsif counter(6) = '0' then
      global_rst <= '1';
    end if;
  end if;
end process;
/code

I'm not sure the synthesis will prune the upper end of the bits, so I
might need to do something more clever, but the upper part of the
counter is to verify that it did have a proper reset signal at some point.

The fact that you are checking the upper bits prevents any pruning of
the counter. I think your approach is to help assure that the counter
has been reset before it will count down. I would not think this was
workable. Either the counter is reset properly or the circuit can
malfunction by lockup, possibly with global_rst = '0'. Also be aware
that you are really only using 5 check bits since the sixth is used to
flag end of timeout.

If this is a high rel application, I would not want to rely on a five
bit checksum to control a reset. You say "guarantee" and "random", but
I see the possibility that the counter starts up in the done state with
reset never having been asserted, 1 in 64 chance. If you check all the
bits in the counter for the done state it is still a 1 in 4096 chance of
malfunctioning without ever producing a reset out.
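
The odds quoted above follow if one assumes (an assumption for illustration, not something taken from the datasheet) that each flip-flop powers up as 0 or 1 independently and with equal probability. The chance that n bits land on one specific pattern is then:

```latex
P(\text{one specific } n\text{-bit pattern}) = 2^{-n}, \qquad
2^{-6} = \frac{1}{64}, \qquad 2^{-12} = \frac{1}{4096}
```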

If you can't rely on the PLL lock signal to be asserted at power up, I
don't think you can rely on *any* logic in the FPGA to compensate for
this. Why not investigate and find out if the PLL signal will work as a
power on reset? Contact the factory and/or test it yourself on the
bench. Bring the LOCK signal out to a pin and scope it while powering
up the unit.

I think you need a reset signal for the device, period. Even if you
manage to supply a reset to all the FFs in your logic, is there nothing
else on the chip that requires a reset like the PLL itself? What other
circuits are on the device other than the user configurable logic? Does
the data sheet talk about a requirement for the reset signal?

The PLL issue and the idea of using the I/O pin to generate a reset are
both issues I would contact the factory about. If I were in your shoes,
I would push hard to have the module disqualified since the FPGA can not
be assured to have been reset. That is the part that is insanity!

--

Rick

alb · Oct 18, 2013

Hi Rick,

On 16/10/2013 20:26, rickman wrote:
[]

I'm still not clear on why you can't use the PLL lock signal with
conditioning. If the PLL lock signal is deasserted at power on, there
is your power on reset. Then you simply need to condition it well
enough to properly control exit from reset reliably.

Am I still missing something?

Maybe you missed what I wrote in my previous post in reply to your
suggestion, so I'll paste it here for your convenience:

I'll set the counter to count the maximum lock time as specified in the
datasheet and I should be good to go. Synchronous release of the reset
line should make it simple to perform STA on the reset line and let the
P&R meet the time constraints.

Exactly as you proposed but specifying the length of the counter.

I'm not clear on this. It sounds like you are using the timer in
*place* of the lock signal from the PLL. My point was to condition the
lock signal from the PLL. But since you say below that you don't know
the state of the PLL lock signal at power up I suppose this won't work.
It would be a simple matter to test it though. Or you can contact the
factory. Since there is no configuration process the logic of the PLL
should be ready and working on power up I would expect.

The idea is to use the 'lock' signal *and* the timer. The lock signal
should be, at a certain point, signalling 'unlock' condition. This
condition, even if not reliable, allows the following logic:

1. use 'not lock' to set the global reset
2. start count when 'lock' is reporting 'pll locked' (unreliable) *and*
the upper part of the count is correctly set to a certain pattern
3. count up until the maximum lock time specified in the datasheet
4. release the global reset

If the 'lock' signal chatters at any stage of its life (which I suppose
only happens while locking), it does not matter; it will only mean the
counter will restart counting.
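
The four steps above could be sketched roughly as follows. This is a minimal, untested sketch and not the code posted in the thread: the entity name `lock_reset_timer`, the generic `LOCK_CYCLES` and its default value are made up for illustration, and the check pattern from step 2 is left out. It only helps if `pll_lock` really does read '0' at some point after power-up, which is exactly the assumption under discussion.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity lock_reset_timer is
  generic (
    -- Hypothetical value: maximum PLL lock time expressed in clk cycles.
    LOCK_CYCLES : positive := 1024
  );
  port (
    clk        : in  std_logic;
    pll_lock   : in  std_logic;  -- '1' = PLL reports locked
    global_rst : out std_logic   -- active-low reset
  );
end entity lock_reset_timer;

architecture rtl of lock_reset_timer is
  signal lock_sync : std_logic_vector(1 downto 0);
  signal count     : natural range 0 to LOCK_CYCLES;
begin
  process (clk, pll_lock)
  begin
    if pll_lock = '0' then
      -- Step 1: 'not lock' asserts the reset and restarts the timer,
      -- so a chattering lock signal simply restarts the count.
      lock_sync  <= "00";
      count      <= 0;
      global_rst <= '0';
    elsif rising_edge(clk) then
      -- Resynchronize the release of pll_lock before acting on it.
      lock_sync <= lock_sync(0) & '1';
      if lock_sync(1) = '1' then
        if count = LOCK_CYCLES then
          -- Step 4: release the reset synchronously.
          global_rst <= '1';
        else
          -- Steps 2 and 3: count up to the maximum lock time.
          count <= count + 1;
        end if;
      end if;
    end if;
  end process;
end architecture rtl;
```

Since `global_rst` is released on a clock edge, the reset release path can be covered by ordinary static timing analysis.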

Indeed there's even a more subtle detail to add: since I do not know
what is the state of the lock signal at power up, my counter should have
a fixed value at reset and start to count when lock signal is deasserted
if and only if the fixed value of the counter is present. In this way I
can guarantee that a lock signal which is undefined at the beginning
does not allow the counter to start from a random position and hit the
target value too early. Something along these lines:

code: not tested!
-- rst is the lock signal from PLL
process (clk, rst)
begin
  if rst = '0' then
    counter <= "101001000000"; -- 6 bit counter + 6 bit check
    global_rst <= '0'; -- active low reset
  elsif rising_edge(clk) then
    if counter(11 downto 6) = "101001" then
      counter <= counter + 1;
    elsif counter(6) = '0' then
      global_rst <= '1';
    end if;
  end if;
end process;
/code

I'm not sure the synthesis will prune the upper end of the bits, so I
might need to do something more clever, but the upper part of the
counter is to verify that it did have a proper reset signal at some
point.

The fact that you are checking the upper bits prevents any pruning of
the counter. I think your approach is to help assure that the counter
has been reset before it will count down. I would not think this was
workable. Either the counter is reset properly or the circuit can
malfunction by lockup, possibly with global_rst = '0'.

I have to start from the assumption that my PLL lock signal is, at a
certain stage, signalling a PLL not in lock. It is not an issue if the
lock signal wrongly reports a PLL not in lock, while it is an issue
if it wrongly reports a PLL in lock.

If the counter *is not* reset properly then a system lockup is at least
better than a logic that runs without a global reset.

Also be aware
that you are really only using 5 check bits since the sixth is used to
flag end of timeout.

Correct, I did not really care how many bits I'm using in the example,
it was just to show the concept.

If this is a high rel application, I would not want to rely on a five
bit checksum to control a reset.

You have a point and, yes, I was too lazy to write the example in the
proper way.

You say "guarantee" and "random", but
I see the possibility that the counter starts up in the done state with
reset never having been asserted, 1 in 64 chance. If you check all the
bits in the counter for the done state it is still a 1 in 4096 chance of
malfunctioning without ever producing a reset out.

If you can't rely on the PLL lock signal to be asserted at power up, I
don't think you can rely on *any* logic in the FPGA to compensate for
this. Why not investigate and find out if the PLL signal will work as a
power on reset? Contact the factory and/or test it yourself on the
bench. Bring the LOCK signal out to a pin and scope it while powering
up the unit.

The investigation might be an extremely tedious process. Under which
conditions should I verify the behavior? Our temperature range is -40 to
+80, should I run the test in all conditions? Should the test be
performed in thermovacuum (the application will run in low earth orbit)?
Verifying that the lock signal correctly reports a lock is not such a
straightforward path.

But I agree that relying on the fact that my pll lock signal is at least
reporting 'not locked' at a certain stage is still an assumption.

[]

Even if you
manage to supply a reset to all the FFs in your logic, is there nothing
else on the chip that requires a reset like the PLL itself?

PLL does not require an external reset.

What other
circuits are on the device other than the user configurable logic?

There's a little bit of RAM and some flash units. Not much indeed.

Does
the data sheet talk about a requirement for the reset signal?

Nope. It is up to the user to take care of meeting timing constraints.

The PLL issue and the idea of using the I/O pin to generate a reset are
both issues I would contact the factory about.

That is a good hint and I'm actually doing it. I'll try to report back
here at least to share the information with you all, somebody else might
find it useful.

If I were in your shoes,
I would push hard to have the module disqualified since the FPGA can not
be assured to have been reset. That is the part that is insanity!

I agree in principle, but pushing hard is not always beneficial and
being only a newcomer in the project I guess I do not have the
critical mass to push that hard.

HT-Lab · Oct 18, 2013

Hi Alb,

On 18/10/2013 10:19, alb wrote:
...

The investigation might be an extremely tedious process. Under which
conditions should I verify the behavior? Our temperature range is -40 to
+80, should I run the test in all conditions?

yes!

Should the test be
performed in thermovacuum (the application will run in low earth orbit).

yes!

I suspect you didn't have your CDR yet but the first thing that was
discussed when I was working on satellites was the reset/POR circuitry.
I worked on OBC's during the Wire mission (1999) and hence reset/supply
rise time/unused jtag pins etc were hot topics.

Regards,
Hans
www.ht-lab.com

Tom Gardner · Oct 18, 2013

On 18/10/13 10:19, alb wrote:

Should the test be
performed in thermovacuum (the application will run in low earth orbit).

I don't think the answer to that question will be left to chance!
(Nor its near equivalent, an answer in a usenet posting)

Thomas Stanka · Oct 20, 2013

Am Freitag, 11. Oktober 2013 23:46:33 UTC+2 schrieb alb:

In addition, note that most FPGA families have a global reset line
similar to the global clock lines. They usually keep all the FF held
at reset until configuration is done, and also allow you to use that
reset line. It is there, it is free, and you might as well use it.

Apparently I have not found a 'global reset line' for the igloo family,

That's because Actel tends to provide global resources on all families that can be used as either reset, or clock, or just a high fanout net like enable.
There are some slight differences from family to family: for some fuse-based parts there exist dedicated "clock-only" global resources, while for the flash-based ones I used there was no difference between using them as clock or reset.

bye Thomas

Thomas Stanka · Oct 22, 2013

Am Dienstag, 22. Oktober 2013 10:45:42 UTC+2 schrieb alb:

apparently there's a 'clkint' buffer which is used to route global nets
and 'buf/bufd' for high fanout nets. I haven't found yet what should be
a reasonable fanout value I should consider before inserting a dedicated
'buf/bufd' but certainly the reset line has a high fanout.

I'm not familiar with the igloo family, but I guess the allowed max fanout might be 10-20. If the fanout exceeds this value a buf can be inserted as a normal buffer tree.
Clkint is for global resources like clock, but also reset or enable with a fanout of several hundred.

If you are not able to meet timing, that very likely means you have a buf tree with too high a fanout/depth instead of a global resource.

regards Thomas

alb · Oct 22, 2013

Hi Hans,

On 18/10/2013 18:59, HT-Lab wrote:
[]

The investigation might be an extremely tedious process. Under which
conditions should I verify the behavior? Our temperature range is -40 to
+80, should I run the test in all conditions?

yes!

A full temperature test can be done at the subsystem level (board at
minimum) since board layout may affect your results. The idea to route
out to a pin the 'lock' signal and monitor it during a temperature scan
is quite painful and I may still miss aspects of the characterization
(like power variations) which are even more difficult to test unless a
special test board is built.

Those are the main reasons why I would not rely on such tests unless the
manufacturer or some other group has already done an intensive analysis
and test campaign on the device. As a small group we can not afford the
costs and time it takes to undergo such campaigns.

At the system level we are certainly performing ESS and TVT in order to
meet requirements, but those tests are not the right place where we can
verify such details.

Should the test be
performed in thermovacuum (the application will run in low earth orbit).

yes!

The above comments apply even more in this case. The FPGA itself is only
one part of a much larger system which has to have appropriate thermal
paths in order to be tested in a TVT chamber. Routing the lock signal
out of the FPGA and out of the subsystem to make it visible in a TVT
Chamber is certainly out of question and, considering the cost of a TVT,
rather unfeasible.

I suspect you didn't have your CDR yet but the first thing that was
discussed when I was working on satellites was the reset/POR circuitry.

I'm not aware of any CDR done on this component since it is a payload
subsystem. Unfortunately (or fortunately [1]) research institutes are
not always strict enough within their hierarchical organizations to
demand PDR and CDR for their subsystems. We certainly went through CDR
for our main interfaces with the hosting spacecraft (power, data,
mechanics, harness).

I worked on OBC's during the Wire mission (1999) and hence reset/supply
rise time/unused jtag pins etc were hot topics.

[1] Working without such a structure has some benefits up to some level
since it is less bureaucratic and much more pragmatic. This unstructured
approach does not scale well though and for large projects what's at
risk is not only the budget but also mission success.

alb · Oct 22, 2013

Hi Tom,

On 18/10/2013 19:14, Tom Gardner wrote:

On 18/10/13 10:19, alb wrote:
Should the test be
performed in thermovacuum (the application will run in low earth orbit).

I don't think the answer to that question will be left to chance!
(Nor its near equivalent, an answer in a usenet posting)

I agree with you, since I'm neither a believer nor a good gambler!

Let me say, though, that since a TVT for such a payload (only 30 kg) is
in the $40K range, I would never even dream of verifying my lock signal
in such conditions.

For each problem there's an appropriate environment where things need
to be verified. In a TVT I cannot be worried about an 'and' gate working
properly. Moreover the observability of your unit under test is so
limited in a TVT that you can only verify your thermal calculations were
accurate and every component is working within the specified
temperature range.

We typically run full functional tests (all possible mode configurations
and external stimuli variations) during the TVT, but certainly are not
looking at a fpga pin signal on a scope.

At the system level it is good to add as much embedded diagnostics (DFT)
as possible to enhance observability and make it possible to anticipate and/or
diagnose issues early in the process. These features are certainly
neither free of cost nor without problems themselves.

alb · Oct 22, 2013

Hi Thomas,

On 20/10/2013 23:47, Thomas Stanka wrote:
[]

Apparently I have not found a 'global reset line' for the igloo
family,

That's because Actel tends to provide global resources on all families
that can be used as either reset, or clock, or just a high fanout net like
enable. There are some slight differences from family to family: for
some fuse-based parts there exist dedicated "clock-only" global
resources, while for the flash-based ones I used there was no difference
between using them as clock or reset.

apparently there's a 'clkint' buffer which is used to route global nets
and 'buf/bufd' for high fanout nets. I haven't found yet what should be
a reasonable fanout value I should consider before inserting a dedicated
'buf/bufd' but certainly the reset line has a high fanout.

Funnily enough, the P&R fails to meet timing requirements on some
reset-to-clock paths even if reset is removed synchronously.

rickman · Oct 26, 2013

On 10/16/2013 5:16 PM, Mark Curry wrote:

In article<l3mli5$g8o$1@dont-email.me>, rickman<gnuarm@gmail.com> wrote:
On 10/11/2013 3:43 PM, Joseph H Allen wrote:
In article<l39dsb$4q1$1@speranza.aioe.org>,
glen herrmannsfeldt<gah@ugcs.caltech.edu> wrote:

First, I agree with what Mark says.

In addition, note that most FPGA families have a global reset line
similar to the global clock lines. They usually keep all the FF held
at reset until configuration is done, and also allow you to use that
reset line. It is there, it is free, and you might as well use it.

You do still have to get the timing right, so you release it at
the right time relative to the clock edge.

The timing provided by the global reset line is not good.. it's nowhere near
as good as a global clock line as far as I understand.

One way to deal with this is to have all of your state machines start in a
reset state which does nothing but wait for a synchronous "start" edge which
is generated after reset with a counter or a shift register.

This has a big advantage that you no longer have to worry about global reset
timing. On the other hand, if you use libraries you may have no choice
since you can't change the logic.

This really doesn't solve the problem. The problem is that globally
resetting all the state machines in a design can tax the routing and
timing of the global reset signal. If you do this the routing problem
can be fixed by resynchronizing the reset to the clock before using it
in a given section of logic that should be geographically local on the
chip. Otherwise you can give P&R some very tough problems to solve.

Wow - can't keep up - lots of replies here. The GSR from Xilinx isn't
the end all solution that Xilinx touts it as. The release of the global
GSR is completely asynchronous. It doesn't really matter much that the
router may have trouble routing this signal with low skew. It's
asynchronous - Murphy's law says that one FF's going to see the inactive
edge of reset on one clock edge - the next FF's going to see it on the
following clock cycle. Low skew likely lowers the likelihood of the event,
but it can happen none-the-less and should be accounted for in your design.

I've gone into this in the Xilinx forums some, but you've got to be careful
on using that GSR...

I won't argue that for a moment. Xilinx has exactly the same problem as
the Actel devices. GSR is not a solution, you need to locally resync
any reset to the clock. Further you need to design your circuit so that
each reset section works if it is not released from reset on the same
clock as other sections. But the problem is compounded if there is no
GSR at all. Then you have to use other routing resources to spread the
reset signal. But I guess it is six of one vs. half dozen of the other.
With a GSR the resources have already been gobbled up by the GSR net
before you even run the router.

--

Rick

rickman · Oct 26, 2013

On 10/22/2013 4:45 AM, alb wrote:

Hi Thomas,

On 20/10/2013 23:47, Thomas Stanka wrote:
[]
Apparently I have not found a 'global reset line' for the igloo
family,

That's because Actel tends to provide global resources on all families
that can be used as either reset, or clock, or just a high fanout net like
enable. There are some slight differences from family to family: for
some fuse-based parts there exist dedicated "clock-only" global
resources, while for the flash-based ones I used there was no difference
between using them as clock or reset.

apparently there's a 'clkint' buffer which is used to route global nets
and 'buf/bufd' for high fanout nets. I haven't found yet what should be
a reasonable fanout value I should consider before inserting a dedicated
'buf/bufd' but certainly the reset line has a high fanout.

Funnily enough, the P&R fails to meet timing requirements on some
reset-to-clock paths even if reset is removed synchronously.

Just a comment on timing analysis. We were doing a retrofit of an
existing hardware design using an Altera Flex 10K part, IIRC. The tool
was MAX+ II. The company I worked for had identified a problem with the
tool that allowed it to pass timing analysis and fail on the bench. We
decided it was clearly a timing issue because of the temperature
sensitivity. Warm it up and it fails, cool it down and it passes. Not
sensitive to which chip (other than small differences in the threshold
temp) was used. Design changes modified it only slightly. We figured
it was a poor timing estimation of a heavily loaded net, but were never
able to prove that. Altera was no help for this problem sticking their
head in the sand since they would be dropping support for this tool in
another year.

Just a caution that passing static timing analysis is no indication that
the design is actually meeting timing.

--

Rick

Tom Gardner · Oct 26, 2013

On 26/10/13 09:20, rickman wrote:

Just a comment on timing analysis. We were doing a retrofit of an existing hardware design using an Altera Flex 10K part, IIRC. The tool was MAX+ II. The company I worked for had identified a
problem with the tool that allowed it to pass timing analysis and fail on the bench. We decided it was clearly a timing issue because of the temperature sensitivity. Warm it up and it fails, cool it
down and it passes. Not sensitive to which chip (other than small differences in the threshold temp) was used. Design changes modified it only slightly. We figured it was a poor timing estimation
of a heavily loaded net, but were never able to prove that. Altera was no help for this problem sticking their head in the sand since they would be dropping support for this tool in another year.

Just a caution that passing static timing analysis is no indication that the design is actually meeting timing.

Nasty nasty nasty.

So, how do you[1] convince yourselves and your customers that each individual chip actually /is/ working with a reasonable margin?

No, I don't expect a neat easy response.

[1] the impersonal pronoun, since "how does one" sounds too stilted to the modern ear/brain combination

rickman · Oct 27, 2013

On 10/26/2013 4:59 AM, Tom Gardner wrote:

On 26/10/13 09:20, rickman wrote:
Just a comment on timing analysis. We were doing a retrofit of an
existing hardware design using an Altera Flex 10K part, IIRC. The tool
was MAX+ II. The company I worked for had identified a
problem with the tool that allowed it to pass timing analysis and fail
on the bench. We decided it was clearly a timing issue because of the
temperature sensitivity. Warm it up and it fails, cool it
down and it passes. Not sensitive to which chip (other than small
differences in the threshold temp) was used. Design changes modified
it only slightly. We figured it was a poor timing estimation
of a heavily loaded net, but were never able to prove that. Altera was
no help for this problem sticking their head in the sand since they
would be dropping support for this tool in another year.

Just a caution that passing static timing analysis is no indication
that the design is actually meeting timing.

Nasty nasty nasty.

So, how do you[1] convince yourselves and your customers that each
individual chip actually /is/ working with a reasonable margin?

No, I don't expect a neat easy response.

[1] the impersonal pronoun, since "how does one" sounds too stilted to
the modern ear/brain combination

Can "one" ever assure "one's" customers that "one's" designs are
entirely bug free? I have never been able to do that with *any* design.
Why would FPGAs be any different?

My statement above may be a bit strong. Surely the static timing
analysis tool is intended to verify timing. But it *can* be wrong, that
is my point.

--

Rick

Tom Gardner · Oct 27, 2013

On 27/10/13 16:34, rickman wrote:

On 10/26/2013 4:59 AM, Tom Gardner wrote:
On 26/10/13 09:20, rickman wrote:
Just a comment on timing analysis. We were doing a retrofit of an
existing hardware design using an Altera Flex 10K part, IIRC. The tool
was MAX+ II. The company I worked for had identified a
problem with the tool that allowed it to pass timing analysis and fail
on the bench. We decided it was clearly a timing issue because of the
temperature sensitivity. Warm it up and it fails, cool it
down and it passes. Not sensitive to which chip (other than small
differences in the threshold temp) was used. Design changes modified
it only slightly. We figured it was a poor timing estimation
of a heavily loaded net, but were never able to prove that. Altera was
no help for this problem sticking their head in the sand since they
would be dropping support for this tool in another year.

Just a caution that passing static timing analysis is no indication
that the design is actually meeting timing.

Nasty nasty nasty.

So, how do you[1] convince yourselves and your customers that each
individual chip actually /is/ working with a reasonable margin?

No, I don't expect a neat easy response.

[1] the impersonal pronoun, since "how does one" sounds too stilted to
the modern ear/brain combination

Can "one" ever assure "one's" customers that "one's" designs are entirely bug free? I have never been able to do that with *any* design. Why would FPGAs be any different?

Of course, but that's a bog-standard and therefore uninteresting
point. But it is the margin (or lack of it) that is the interesting
question.

So, how do you assess the margin?

If there's a problem, how do you positively determine
the cause is an internal margin problem? (As opposed to
merely presuming)

> My statement above may be a bit strong. Surely the static timing analysis tool is intended to verify timing. But it *can* be wrong, that is my point.

If the problem was the static timing analysis of the chip's internals,
then I'm concerned because internal points are somewhat difficult to
directly observe. External timing is a different issue.

rickman · Oct 28, 2013

On 10/27/2013 1:46 PM, Tom Gardner wrote:

On 27/10/13 16:34, rickman wrote:
On 10/26/2013 4:59 AM, Tom Gardner wrote:
On 26/10/13 09:20, rickman wrote:
Just a comment on timing analysis. We were doing a retrofit of an
existing hardware design using an Altera Flex 10K part, IIRC. The tool
was MAX+ II. The company I worked for had identified a
problem with the tool that allowed it to pass timing analysis and fail
on the bench. We decided it was clearly a timing issue because of the
temperature sensitivity. Warm it up and it fails, cool it
down and it passes. Not sensitive to which chip (other than small
differences in the threshold temp) was used. Design changes modified
it only slightly. We figured it was a poor timing estimation
of a heavily loaded net, but were never able to prove that. Altera was
no help for this problem sticking their head in the sand since they
would be dropping support for this tool in another year.

Just a caution that passing static timing analysis is no indication
that the design is actually meeting timing.

Nasty nasty nasty.

So, how do you[1] convince yourselves and your customers that each
individual chip actually /is/ working with a reasonable margin?

No, I don't expect a neat easy response.

[1] the impersonal pronoun, since "how does one" sounds too stilted to
the modern ear/brain combination

Can "one" ever assure "one's" customers that "one's" designs are
entirely bug free? I have never been able to do that with *any*
design. Why would FPGAs be any different?

Of course, but that's a bog-standard and therefore uninteresting
point. But it is the margin (or lack of it) that is the interesting
question.

So, how do you assess the margin?

If there's a problem, how do you positively determine
the cause is an internal margin problem? (As opposed to
merely presuming)

My statement above may be a bit strong. Surely the static timing
analysis tool is intended to verify timing. But it *can* be wrong,
that is my point.

If the problem was the static timing analysis of the chip's internals,
then I'm concerned because internal points are somewhat difficult to
directly observe. External timing is a different issue.

Yes, I agree. The symptoms were, first, that the failure was erratic
from one route to the next. Second, it would go away when the chip was
cooled down. Third, some chips were consistently more sensitive than
others. Finally, it was also sensitive to Vcc. This all points to
timing. Can't prove it, but we acted on that assumption and wrote some
timing analysis tools ourselves. We eventually got a route that passed
timing at elevated temperatures and low Vcc voltage and shipped the
product.
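
As a rough illustration of why heat and low Vcc point to timing, here is a minimal Python sketch. It assumes a simple linear derating model in which path delay grows with temperature and shrinks with supply voltage. The function names, the coefficients `k_temp` and `k_vcc`, the corner values and the 10% estimation error are all invented for illustration; none of them come from a Flex 10K data sheet or from the tools described in this thread.

```python
from dataclasses import dataclass


@dataclass
class Corner:
    name: str
    temp_c: float  # die temperature in degrees C
    vcc: float     # supply voltage in volts


def derated_delay(nominal_ns, corner, nominal_temp_c=25.0, nominal_vcc=5.0,
                  k_temp=0.002, k_vcc=0.5):
    """Scale a nominal path delay: slower when hotter, slower at lower Vcc.

    k_temp and k_vcc are hypothetical linear coefficients, not vendor data.
    """
    temp_factor = 1.0 + k_temp * (corner.temp_c - nominal_temp_c)
    vcc_factor = 1.0 + k_vcc * (nominal_vcc - corner.vcc) / nominal_vcc
    return nominal_ns * temp_factor * vcc_factor


def slack(clock_period_ns, setup_ns, nominal_delay_ns, corner):
    """Setup slack at a corner; a negative value means the path fails timing."""
    return clock_period_ns - setup_ns - derated_delay(nominal_delay_ns, corner)


hot = Corner("hot, low Vcc", 85.0, 4.75)
cold = Corner("cold, nominal Vcc", 0.0, 5.0)

tool_estimate_ns = 20.0                # what the timing tool believes
real_delay_ns = tool_estimate_ns * 1.10  # assumed 10% under-estimate of a loaded net

for corner in (hot, cold):
    print(corner.name,
          "tool slack:", round(slack(25.0, 1.0, tool_estimate_ns, corner), 2),
          "real slack:", round(slack(25.0, 1.0, real_delay_ns, corner), 2))
```

With these made-up numbers the tool's estimate still shows positive slack at the hot, low-Vcc corner, while the slower real net goes negative there and recovers when cooled. That is the same pattern as a design that passes static timing analysis, fails warm on the bench and passes cold.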

Ever since I have not trusted the tools 100%. But then like I said,
this was a product that was being replaced by Quartus in less than a
year, so Altera wouldn't put any effort into working on the problem,
even to see if it really existed.

You can draw your own conclusions.

--

Rick

Hal Murray · Nov 1, 2013

In article <l4k5rr$nqs$1@dont-email.me>,
rickman <gnuarm@gmail.com> writes:

>Ever since I have not trusted the tools 100%.

Software geeks have been fighting compiler bugs for a long long time.

The thing that makes FPGA timing bugs so nasty is that you can't
reasonably check the output. With a compiler, you can look at the
instructions it produces. With a PCB router, you can eyeball the gerbers.

Many years ago, I made a list of all the possible places that could
cause a problem on a board I was working on. At the high level, there
were things like
bugs in the board design,
bugs in the individual PCB or in assembling the board,
bugs in the firmware or FPGA or ...
bugs in the driver

Mixed in with those were things like
bugs in the tools (there are a lot of them)
the board layout tools, their libraries
the assembler for the firmware (which we had written)
the FPGA tools
bugs in the data sheets
bugs in my reading of the data sheets

--
These are my opinions. I hate spam.

alb · Nov 1, 2013

Hi Hal,

On 31/10/2013 21:23, Hal Murray wrote:

In article <l4k5rr$nqs$1@dont-email.me>,
rickman <gnuarm@gmail.com> writes:

Ever since I have not trusted the tools 100%.

Software geeks have been fighting compiler bugs for a long long time.

It does not take 'software geeks' to fight bugs. It takes a process of
development and verification that has a level of complexity that is
certainly beyond anything that 'software geeks' can hope to conceive
alone. The process is also full of constraint-driven compromises, and of
pitfalls as well. For further reading refer to [1].

The thing that makes FPGA timing bugs so nasty is that you can't
reasonably check the output.

I'm not sure what makes you think that you cannot 'reasonably' check the
output. A synthesis tool provides a netlist and the netlist is
verifiable. A P&R tool provides a bitstream which is verifiable.

The problem is that, unfortunately, the tools being proprietary software
with a non-standardized output format, the output is difficult for the
*end user* to check. But the main developers of the tools can certainly
check at each level of complexity they want; it is all a matter of
'pain vs. gain'.

In the open source software world (without wandering even further in the
'libre' software world) there's a level of peer-review that is orders of
magnitude higher than in proprietary software and that is why open
software is - by far - less buggy than proprietary software.

Now try to convince any EDA company to release their source code...

With a compiler, you can look at the
instructions it produces.

Not knowing the instruction set is exactly the same as not knowing the
bitstream format for an FPGA. Having said that, even assuming you know
what the instruction set looks like, there's still a lot of work to
'reasonably' verify the tool.

Be aware also that companies have a clear level of 'reasonableness' in
mind. Since complex bugs are difficult to find (meaning they cost the
company money), they pick the level of 'reasonableness' according to
their market.

> With a PCB router, you can eyeball the gerbers.

Complex designs need a verification plan. There's no eyeballing that can
help you. With a verification plan you can minimize the amount of time
you spend on the bench to debug it, but it's all a matter of the amount
of risk you want to deal with.

Many years ago, I made a list of all the possible places that could
cause a problem on a board I was working on. At the high level, there
were things like
[...]

> bugs in my reading of the data sheets

every other bug you referred to has a root in this last one. Every spec,
at each level of the design flow, might be misinterpreted since there's
no process, AFAIK, that can verify the correct interpretation of a
requirement.

Al

[1] An Assessment of Space Shuttle Flight Software Development
Processes: http://www.nap.edu/catalog.php?record_id=2222

alb · Nov 6, 2013

Hi all, here is some feedback from the FAEs at Microsemi concerning the
power up reset, please see my comments inline if you are interested.

On 16/10/2013 10:02, alb wrote:
[]

On 15/10/2013 03:59, rickman wrote:
[]
If there is no reset, what can you know about the state of the FFs on
powerup? If they are random, I don't think you can make this work
without a power up reset.

In the AN I posted there's a solution the vendor proposes to implement a
POR. They suggest to rely on an external weak pull-up and profit of the
different time for input/output configuration during a power-up
sequence. I do not have an external pull-up, but I/Os can be opted with
a weak pull-up and maybe the result is the same.

According to the FAE it is possible to configure the internal weak
pull-up resistor in the pin configuration and take advantage of the same
mechanism described in the AN I was referring to
(http://www.actel.com/documents/LPF_AC380_AN.pdf), therefore *without*
the need for an additional external pull-up resistor.
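
For readers who want to see the general idea, here is a minimal cycle-level sketch in Python. It is not the circuit from the app note. It only assumes that a reset input reads '1' (through the weak pull-up) while the I/Os are not yet active and reads '0' afterwards, and that the internal reset is asserted asynchronously and released through a short chain of flip-flops. The function name `reset_release`, the `stages` and `initial` parameters and the sample sequences are all hypothetical.

```python
def reset_release(rst_pin_samples, stages=2, initial=None):
    """Model an asynchronously asserted, synchronously released reset.

    rst_pin_samples: raw reset input seen at each rising clock edge
                     (1 = reset asserted, e.g. pin held high by a pull-up).
    stages:          number of flip-flops in the release chain.
    initial:         power-up contents of the chain (unknown in real
                     hardware); defaults to all zeros here.
    Returns the internal reset seen by the logic after each clock edge.
    """
    sync = list(initial) if initial is not None else [0] * stages
    internal = []
    for raw in rst_pin_samples:
        if raw:
            # Asynchronous assert: every stage is forced to '1' at once.
            sync = [1] * stages
        else:
            # Synchronous release: a '0' shifts in one stage per clock edge.
            sync = [0] + sync[:-1]
        internal.append(sync[-1])
    return internal


# The pin reads '1' for two edges during power-up, then goes low.
print(reset_release([1, 1, 0, 0, 0, 0]))

# The power-up contents of the chain do not matter once the pin has read '1'.
print(reset_release([1, 0, 0, 0], initial=[0, 1]) ==
      reset_release([1, 0, 0, 0], initial=[1, 0]))
```

The point of the sketch is that the unknown power-up state of the flip-flops stops mattering as soon as the reset input has been seen high once, and that the release always lands a fixed number of clock edges after the input goes low.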

Al

rickman · Nov 8, 2013

On 11/6/2013 11:07 AM, alb wrote:

Hi all, here is some feedback from the FAEs at Microsemi concerning the
power up reset, please see my comments inline if you are interested.

On 16/10/2013 10:02, alb wrote:
[]
On 15/10/2013 03:59, rickman wrote:
[]
If there is no reset, what can you know about the state of the FFs on
powerup? If they are random, I don't think you can make this work
without a power up reset.

In the AN I posted there's a solution the vendor proposes to implement a
POR. They suggest to rely on an external weak pull-up and profit of the
different time for input/output configuration during a power-up
sequence. I do not have an external pull-up, but I/Os can be opted with
a weak pull-up and maybe the result is the same.

According to the FAE it is possible to configure the internal weak
pull-up resistor in the pin configuration and take advantage of the same
mechanism described in the AN I was referring to
(http://www.actel.com/documents/LPF_AC380_AN.pdf), therefore *without*
the need for an additional external pull-up resistor.

The app note goes into great detail about the timing of VCC and VCCI.
In this discussion I believe they are talking about the input from the
IBUF (RST_p) when they say, "The I/Os are tristated and the core logic
detects '1' on the inputs from the boundary scan register (BSR)." It is
not clear what sets the value in the BSR. It is also not clear how this
determines the value of the RST_p signal.

Do you understand this portion of the reset design?

This entire circuit seems to depend on VCC reaching "its functional
voltage level" before VCCI. Do you know that this is true for your board?

It would be good to have a dialog with the person who wrote the app
note, but they don't say who this is. Much of the language usage would
indicate it is someone for whom English is a second language and so
might not be easy to converse with.

--

Rick
