V6 BUFR -> BUFG clocking structure (hold issue?)

mmihai · Nov 30, 2012

Hi!

I have had a Xilinx webcase open for about two months about this, and it is going nowhere ... maybe I'll have better luck here.

My problem:
- V6 design
- clocking structure with an IBUF to BUFR which drives a BUFG, so both BUFR/BUFG are on the same clock domain
- the BUFR also clocks a few flops
- BUFG clocks main logic
- par finishes w/o hold errs
- I can detect data transfer errors between the flops clocked by BUFR and the flops clocked by BUFG (direction is data from BUFR flops -> BUFG flops, no logic, just data transfer).
- timingan reports no hold errs on those paths
- different runs (different placement) will produce a fully working design
[- ISE 13.4... but it should not matter]

Anyone seen this? Any feedback about this structure?

Goal is to be able to produce predictable results... Now I have no way to do that unless I try it on HW ... but my confidence level is low (i.e. if it works on one device will it work on //all//?).

--
Thanks,
mmihai

glen herrmannsfeldt · Nov 30, 2012

mmihai <iiahim@yahoo.com> wrote:

I have a Xilinx webcase for about 2mo about this that
goes nowhere ... may be better luck here.

My problem:
- V6 design
- clocking structure with a IBUF to BUFR which drives a BUFG,
so both BUFR/BUFG are on the same clock domain
- the BUFR also clocks few flops
- BUFG clocks main logic
- par finishes w/o hold errs
- I can detect data transfer errors between the flops clocked
by BUFR and the flops clocked by BUFG (direction is data from
BUFR flops -> BUFG flops, no logic, just data transfer).
- timingan reports no hold errs on those paths
- different runs (different placement) will produce a full working design
[- ISE 13.4... but it should not matter]

Maybe I am missing something, but unless you tell the timing
analysis the relative timing of the two clocks, it can't do
setup/hold analysis on them.

Is it supposed to follow the timing through that combination?

-- glen

mmihai · Nov 30, 2012

On Thursday, November 29, 2012 7:32:03 PM UTC-8, glen herrmannsfeldt wrote:

Maybe I am missing something, but unless you tell the timing
analysis the relative timing of the two clocks, it can't do
setup/hold analysis on them.

Both clocks are internal; no extra timing is required. Only Xilinx's tools know the relative timing, since it is just the delay through the FPGA itself.

Is it supposed to follow the timing through that combination?

a) BUFR output is the input of BUFG; same clock domain.
b) the tool is propagating the clock through the design and is aware of the propagation delay through buffers (and routing?)

I can ask timingan to report the paths between the flops and nothing is obviously wrong and hold slack is >=0.

Unfortunately it is not that verbose on the clock propagation time but the clock timing looks like:

Clock Path Skew: 1.851ns (2.677 - 0.826)
Source Clock: ipclk rising at 0.000ns
Destination Clock: pclk rising at 0.000ns
Clock Uncertainty: 0.035ns

ipclk is the output of BUFR, pclk is the output of BUFG.
I guess 'Clock Path Skew' contains the added BUFG propagation delay and BUFR->BUFG routing.
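
A rough sketch of how a hold check like this is put together. It is a simplified model, not the exact calculation the Xilinx timing analyzer performs. Only the skew and uncertainty figures come from the report above; the clock-to-Q, routing, and hold values are made-up placeholders.

```python
# Simplified hold-slack model (an illustration, not the exact tool calculation).
# Only the skew and uncertainty values come from the report quoted above;
# the clock-to-Q, routing and hold numbers are hypothetical placeholders.

def hold_slack(t_cko_min, t_route_min, t_hold, clock_skew, clock_uncertainty):
    """Return the hold slack in ns for a register-to-register path.

    clock_skew = destination clock arrival - source clock arrival.
    A positive skew means the capturing flop is clocked later, so the
    data has to arrive later as well to avoid racing through.
    """
    data_arrival = t_cko_min + t_route_min
    required = clock_skew + clock_uncertainty + t_hold
    return data_arrival - required

skew = 2.677 - 0.826          # ns, from the report (1.851 ns)
slack = hold_slack(
    t_cko_min=0.30,           # hypothetical minimum clock-to-Q
    t_route_min=1.80,         # hypothetical minimum routing delay
    t_hold=0.10,              # hypothetical hold requirement
    clock_skew=skew,
    clock_uncertainty=0.035,  # ns, from the report
)
print(f"hold slack = {slack:.3f} ns")
```

With a skew this large, the data path has to be slower than the clock offset for the slack to stay non-negative, so the result is only as good as the minimum delays the tool assumes.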

In summary, as far as I can tell, the path is constrained in the tools and has non-negative slack as reported. But I can see the HW failing...

Anyone using a BUFR feeding a BUFG?

--
mmihai

glen herrmannsfeldt · Nov 30, 2012

mmihai <iiahim@yahoo.com> wrote:

(snip, I wrote)

Maybe I am missing something, but unless you tell the timing
analysis the relative timing of the two clocks, it can't do
setup/hold analysis on them.

Both clocks are internal; no extra timing is required.
Only the Xilinx's tools know the relative timing since
it's the delay only through the FPGA itself.

You might look at

http://www.xilinx.com/support/documentation/user_guides/ug362.pdf

I looked some, but didn't see the answer.

Just because two signals are internal doesn't guarantee that
the timing is known, though.

-- glen

Allan Herriman · Nov 30, 2012

On Thu, 29 Nov 2012 16:35:43 -0800, mmihai wrote:

Hi!

I have a Xilinx webcase for about 2mo about this that goes nowhere ...
may be better luck here.

My problem:
- V6 design
- clocking structure with a IBUF to BUFR which drives a BUFG, so both BUFR/BUFG are on the same clock domain
- the BUFR also clocks few flops
- BUFG clocks main logic
- par finishes w/o hold errs
- I can detect data transfer errors between the flops clocked by BUFR and the flops clocked by BUFG (direction is data from BUFR flops -> BUFG flops, no logic, just data transfer).
- timingan reports no hold errs on those paths
- different runs (different placement) will produce a full working design
[- ISE 13.4... but it should not matter]

Anyone seen this? Any feedback about this structure?

Goal is to be able to produce predictable results... Now I have no way
to do that unless I try it on HW ... but my confidence level is low
(i.e. if it works on one device will it work on //all//?).

I understand that the circuit looks like this:

Pin--->IBUF--->BUFR--+--->BUFG---+-->
                     |           |
                     |           |
                     |           |
                     |           |
                  +-----+     +-----+
                  | FF Q|---->|D FF |
                  +-----+     +-----+
                               ^
                               |
                           hold time
                          errors here

I ran into that exact same problem a couple of years ago. I was given
the task of fixing (someone else's) design that featured a similar misuse
of clock buffers in a Virtex 4. I think the tools might have been ISE
8.2.

PAR and Trace said it was fine. Actual tests on the chip over
temperature showed otherwise.

Moral: BUFGs have a large delay. Don't expect PAR to be able to make up
for that amount of hold time using routing.
You need to avoid going from your BUFR domain into the BUFG domain on the
same clock edge.
One solution might be to insert FFs clocked from the other edge of the
BUFG clock.
Another solution might be to connect the BUFG input to the IBUF output
(not via the BUFR).

Regards,
Allan

Andy · Nov 30, 2012

While I agree that there should be ways to avoid this with other design choices, the tool is clearly identifying the clocks as related (it reported skew and relevant edges), but apparently it does not always find/report some hold timing violations.

Are there options in Xilinx STA to run 4 corner vs 2 corner timing analysis? Or does it always run 4 corner? If it is not running 4 corner, that could be the reason it is missing the hold time problem. I've seen other tools that offer the choice occasionally miss a timing problem in 2 corner timing mode. 2 vs 4 corner analysis has to do with whether all 4 combinations of min/max clock and min/max data are analyzed. Tricks to make hold time analyses less pessimistic (by accounting for correlations in propagation delays) are always a potential issue.
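
A small sketch of why the distinction can matter. It assumes that "2 corner" means pairing min clock with min data and max clock with max data, which is one reading of the description above. All delay values are hypothetical.

```python
from itertools import product

def hold_slack(clock_skew, data_delay, t_hold=0.0):
    """Hold slack in ns: the data must arrive after the skewed capture edge."""
    return data_delay - clock_skew - t_hold

# Hypothetical min/max delays in ns (not taken from any report).
clock_skew = {"min": 1.60, "max": 1.85}
data_delay = {"min": 1.70, "max": 2.40}

# 4-corner: every combination of min/max clock with min/max data.
four_corner = {
    (c, d): hold_slack(clock_skew[c], data_delay[d])
    for c, d in product(("min", "max"), repeat=2)
}

# 2-corner (assumed meaning): only min/min and max/max are checked.
two_corner = {k: v for k, v in four_corner.items() if k[0] == k[1]}

worst4 = min(four_corner, key=four_corner.get)
worst2 = min(two_corner, key=two_corner.get)
print("4-corner worst:", worst4, round(four_corner[worst4], 3))
print("2-corner worst:", worst2, round(two_corner[worst2], 3))
```

In this made-up case the max-clock/min-data combination violates hold, while both correlated corners pass, so a 2-corner run would report the path as clean.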

Andy

glen herrmannsfeldt · Nov 30, 2012

Allan Herriman <allanherriman@hotmail.com> wrote:

On Thu, 29 Nov 2012 16:35:43 -0800, mmihai wrote:

(snip)

My problem:
- V6 design
- clocking structure with a IBUF to BUFR which drives a BUFG, so both BUFR/BUFG are on the same clock domain
- the BUFR also clocks few flops
- BUFG clocks main logic
- par finishes w/o hold errs
- I can detect data transfer errors between the flops clocked by BUFR and the flops clocked by BUFG (direction is data from BUFR flops -> BUFG flops, no logic, just data transfer).
- timingan reports no hold errs on those paths
- different runs (different placement) will produce a full working design
[- ISE 13.4... but it should not matter]

Seems like according to

http://www.xilinx.com/support/documentation/user_guides/ug362.pdf

especially in the summary near the end, that BUFR --> BUFG
is allowed, though it doesn't say anything about the timing.

Anyone seen this? Any feedback about this structure?

Goal is to be able to produce predictable results... Now I have no way
to do that unless I try it on HW ... but my confidence level is low
(i.e. if it works on one device will it work on //all//?).

I understand that the circuit looks like this:

Pin--->IBUF--->BUFR--+--->BUFG---+-->
                     |           |
                     |           |
                     |           |
                     |           |
                  +-----+     +-----+
                  | FF Q|---->|D FF |
                  +-----+     +-----+
                               ^
                               |
                           hold time
                          errors here

I wonder if MMCMs would help?

I ran into that exact same problem a couple of years ago. I was given
the task of fixing (someone else's) design that featured a similar misuse
of clock buffers in a Virtex 4. I think the tools might have been ISE
8.2.

As well as I understand it, for FF's clocked off the same clock,
(and clock edge) you should never have hold problems. The minimum
logic between Q and the next D is long enough that, even with
maximum clock skew, a D can never change that fast.
(Usually described by saying that the hold time is 0.)

There is much discussion on using MMCMs to generate zero delay clocks.

That is, the MMCM provides enough delay such that, for a clock of
constant frequency, it can match the given edge.

PAR and Trace said it was fine. Actual tests on the chip over
temperature showed otherwise.

Moral: BUFGs have a large delay. Don't expect PAR to be able to make up
for that amount of hold time using routing.

As well as I understand it (which might not be all that well) it
never tries to make up hold time.

You need to avoid going from your BUFR domain into the BUFG domain on the
same clock edge.
One solution might be to insert FFs clocked from the other edge of the
BUFG clock.
Another solution might be to connect the BUFG input to the IBUF output
(not via the BUFR).

That is what I would have thought one would do.

-- glen

mmihai · Nov 30, 2012

On Friday, November 30, 2012 4:46:51 AM UTC-8, Allan Herriman wrote:

Thanks for your comments.

Most interesting ... a different chip (V4) had the same issues....

Moral: BUFGs have a large delay. Don't expect PAR to be able to make up
for that amount of hold time using routing.

I don't think it is the BUFG's delay; my guess is that it is more related to routing.
Based on the datasheet, BUFG delay is 0.10ns .... the reported "Clock Path Skew" is 1.851ns... whatever that includes.

You need to avoid going from your BUFR domain into the BUFG domain on the
same clock edge.
One solution might be to insert FFs clocked from the other edge of the
BUFG clock.

I thought about that .... can't do it: the clock is fast and the path won't meet setup in half a clock cycle.
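
A minimal sketch of the half-cycle budget problem for the hop from an opposite-edge flop to a rising-edge flop in the BUFG domain. The clock period and all delays are hypothetical placeholders, since the actual clock rate is not given here.

```python
def setup_slack(period, edge_fraction, t_cko_max, t_route_max, t_setup,
                clock_skew=0.0):
    """Setup slack in ns.

    edge_fraction is 1.0 for a same-edge (full cycle) transfer and
    0.5 for a transfer between opposite clock edges.
    """
    available = period * edge_fraction + clock_skew
    return available - (t_cko_max + t_route_max + t_setup)

period = 3.0  # ns, hypothetical fast clock

# Same path delays, launched and captured on the same edge vs opposite edges.
full = setup_slack(period, 1.0, t_cko_max=0.45, t_route_max=1.50, t_setup=0.10)
half = setup_slack(period, 0.5, t_cko_max=0.45, t_route_max=1.50, t_setup=0.10)
print(f"full-cycle slack = {full:.2f} ns, half-cycle slack = {half:.2f} ns")
```

The opposite-edge stage gives up half a period of setup budget, which is why it stops being an option once the clock is fast enough.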

Another solution might be to connect the BUFG input to the IBUF output
(not via the BUFR).

Can't do that either :-(
a) pin is not BUFG capable
b) even if it were capable, it adds too much delay .... the flops clocked by BUFR sample the input, and clocking them from a BUFG won't meet hold time on the IOs because the clock is delayed too much.

Some more notes:
- I don't constrain the placement of BUFR/BUFG.
- out of 27 signals always the ones failing have the smallest hold slack
(less than .250ns?). Depending on placement the number of failing signals
is anywhere between 0 and 5
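
The observation about small hold slack can be turned into a quick screening step on the timing report. This is a sketch only: the signal names and slack values are invented, and the 0.250 ns threshold is simply the figure mentioned above, not a vendor recommendation.

```python
HOLD_MARGIN_NS = 0.250  # threshold taken from the observation above

def marginal_paths(slacks, margin=HOLD_MARGIN_NS):
    """Return (name, slack) pairs below the margin, smallest slack first."""
    flagged = [(name, s) for name, s in slacks.items() if s < margin]
    return sorted(flagged, key=lambda pair: pair[1])

# Hypothetical per-signal hold slacks in ns, as copied from a timing report.
slacks = {
    "d[0]": 0.412,
    "d[1]": 0.087,
    "d[2]": 0.243,
    "d[3]": 0.655,
    "d[4]": 0.019,
}

for name, s in marginal_paths(slacks):
    print(f"{name}: hold slack {s:.3f} ns is below {HOLD_MARGIN_NS:.3f} ns")
```

Paths that pass analysis but land under the margin are the ones to compare against the failures seen on hardware.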

I find it strange the tools can not handle the clock tree..... I do not think my structure is that exotic. What is the use of regional clocks if one can not transfer data to a global clock?

Any way to constrain the hold target >0.0 only for some specific paths?

--
mmihai

mmihai · Nov 30, 2012

On Friday, November 30, 2012 8:12:49 AM UTC-8, glen herrmannsfeldt wrote:

http://www.xilinx.com/support/documentation/user_guides/ug362.pdf

especially in the summary near the end, that BUFR --> BUFG
is allowed, though it doesn't say anything about the timing.

I did open a webcase with Xilinx ... I've sent them my routed .ncd.
Nobody said my clocking scheme is not allowed or not supported. I don't know if I've moved past 1st tier support though ..... so far I have not received any meaningful help to solve my problem :-(

I wonder if MMCMs would help?

Thought about this too .... it won't work:
input freq can change ... I don't think the PLL/DLL would like that

--
mmihai

Allan Herriman · Dec 1, 2012

On Fri, 30 Nov 2012 16:12:49 +0000, glen herrmannsfeldt wrote:

Allan Herriman <allanherriman@hotmail.com> wrote:
On Thu, 29 Nov 2012 16:35:43 -0800, mmihai wrote:

(snip)
My problem:
- V6 design
- clocking structure with a IBUF to BUFR which drives a BUFG, so both BUFR/BUFG are on the same clock domain
- the BUFR also clocks few flops
- BUFG clocks main logic
- par finishes w/o hold errs
- I can detect data transfer errors between the flops clocked by BUFR and the flops clocked by BUFG (direction is data from BUFR flops -> BUFG flops, no logic, just data transfer).
- timingan reports no hold errs on those paths
- different runs (different placement) will produce a full working design
[- ISE 13.4... but it should not matter]

Seems like according to

http://www.xilinx.com/support/documentation/user_guides/ug362.pdf

especially in the summary near the end, that BUFR --> BUFG is allowed,
though it doesn't say anything about the timing.

Anyone seen this? Any feedback about this structure?

Goal is to be able to produce predictable results... Now I have no way
to do that unless I try it on HW ... but my confidence level is low
(i.e. if it works on one device will it work on //all//?).

I understand that the circuit looks like this:

Pin--->IBUF--->BUFR--+--->BUFG---+-->
                     |           |
                     |           |
                     |           |
                     |           |
                  +-----+     +-----+
                  | FF Q|---->|D FF |
                  +-----+     +-----+
                               ^
                               |
                           hold time
                          errors here

I wonder if MMCMs would help?

MMCMs would not help for the problem I saw - the "clock" was bursty.

I ran into that exact same problem a couple of years ago. I was given
the task of fixing (someone else's) design that featured a similar
misuse of clock buffers in a Virtex 4. I think the tools might have
been ISE 8.2.

As well as I understand it, for FF's clocked off the same clock,
(and clock edge) you should never have hold problems. The minimum logic
between Q and the next D is long enough that, even with maximum clock
skew, a D can never change that fast.
(Usually described by saying that the hold time is 0.)

I fully agree that it *should* work. However, for at least V4 (and now
it seems V6 as well) Xilinx's model of min / max delays on their chips
isn't too good, and PAR will fail to compensate for clock skew if that
skew is as large as a BUFG delay.

Please note I said BUFG delay not BUFG skew. BUFG delay is > 1ns. BUFG
skew is usually < 0.3ns.
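
To put numbers on the delay versus skew distinction, here is a back-of-the-envelope sketch of how much data-path routing delay is needed to satisfy hold. The 0.3 ns and 1.0 ns figures are the rough bounds quoted above; the clock-to-Q and hold values are hypothetical.

```python
def routing_needed_for_hold(clock_offset, t_hold, t_cko_min, margin=0.0):
    """Minimum data routing delay (ns) so that the hold check passes.

    clock_offset is how much later the capture flop sees the clock edge
    than the launch flop does.
    """
    return max(0.0, clock_offset + t_hold + margin - t_cko_min)

T_CKO_MIN = 0.30   # hypothetical minimum clock-to-Q, ns
T_HOLD = 0.10      # hypothetical hold requirement, ns

# Both flops on the same buffer: only the buffer's skew separates them.
same_buffer = routing_needed_for_hold(0.3, T_HOLD, T_CKO_MIN)

# Capture flop behind an extra BUFG: skew plus the BUFG insertion delay.
through_bufg = routing_needed_for_hold(0.3 + 1.0, T_HOLD, T_CKO_MIN)

print(f"same buffer:  {same_buffer:.2f} ns of routing needed")
print(f"through BUFG: {through_bufg:.2f} ns of routing needed")
```

The extra nanosecond has to come entirely from added routing on every affected data path, and it has to hold at the fast corner.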

Moral: BUFGs have a large delay. Don't expect PAR to be able to make
up for that amount of hold time using routing.

As well as I understand it (which might not be all that well) it never
tries to make up hold time.

PAR always tries to make up hold time. There is an entire pass in PAR
dedicated to that process. I don't have any PAR log files handy, but I
believe it's easy to spot: look for the timing score. It will have a
score for setup and another score for hold. The initial passes in PAR
will reduce the setup score, with the hold time score remaining
constant. Then towards the end (when it's finished working on the setup
times) the hold time score will drop, usually to zero.

Regards,
Allan

Allan Herriman · Dec 1, 2012

On Fri, 30 Nov 2012 11:22:35 -0800, mmihai wrote:

Any way to constrain the hold target >0.0 only for some specific paths?

Simply by specifying clocked logic you have constrained the hold time to
be > 0 ns.

I don't know of a way of adding extra margin though. I believe the best
approach is to avoid the need for extra margin.

This is a tool bug. You have zero chance of fixing the tool, however you
do have a good chance of being able to step around the bug.

Some other suggestions:

- Lock the placement of the BUFG and BUFR. You might find there is some
magic combination of placements that just works. Earlier you said that
some runs of PAR would produce designs that worked. Copy the placement
from those runs as a starting point.

- constrain the logic in the BUFG domain to be physically apart from the
BUFR region. This forces longer routes on the chip that will improve
your hold time margin.

- Finally the brute force approach: treat the BUFG and BUFR clocks as if
they were different clock domains. Use some sort of FIFO that is
designed to handle different clock domains to pass data from the BUFR
domain to the BUFG domain.
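
As background on the FIFO option: clock-domain-crossing FIFOs commonly pass their read and write pointers between domains in Gray code, so that only one bit changes per increment and a pointer sampled mid-transition is off by at most one. That is an assumption about how such a FIFO is typically built, not something specific to this design, and a vendor FIFO core would handle it internally. A minimal Python sketch of the encoding:

```python
def bin_to_gray(n):
    """Convert a non-negative integer to its reflected Gray code."""
    return n ^ (n >> 1)

def gray_to_bin(g):
    """Convert a reflected Gray code back to the plain binary value."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

DEPTH = 16  # hypothetical FIFO depth (power of two)
codes = [bin_to_gray(i) for i in range(DEPTH)]

for i, code in enumerate(codes):
    nxt = codes[(i + 1) % DEPTH]
    changed_bits = bin(code ^ nxt).count("1")
    print(f"{i:2d} -> {code:04b} ({changed_bits} bit changes to next)")
```
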

Regards,
Allan

Brian Drummond · Dec 1, 2012

On Thu, 29 Nov 2012 16:35:43 -0800, mmihai wrote:

Hi!

I have a Xilinx webcase for about 2mo about this that goes nowhere ...
may be better luck here.

My problem:
- V6 design
- clocking structure with a IBUF to BUFR which drives a BUFG, so both BUFR/BUFG are on the same clock domain
- the BUFR also clocks few flops
- BUFG clocks main logic
- par finishes w/o hold errs
- I can detect data transfer errors between the flops clocked by BUFR and the flops clocked by BUFG

Anyone seen this? Any feedback about this structure?

Goal is to be able to produce predictable results... Now I have no way
to do that unless I try it on HW ... but my confidence level is low
(i.e. if it works on one device will it work on //all//?).

Is this a new design for the V6 or a port from another FPGA or a previous
ISE release? Are you directly instantiating these primitives and checking
that they are still there in the RTL view?

I had a problem some years ago when moving a design from ISE7 to ISE10
and the tools silently changed what I asked into something completely
different; in my case it moved a DCM from the BUFG where it generated a
nicely aligned x2 clock, to the BUFG input signal - considerably
increasing skew between these clocks!

So bugs in this area are not particularly new...

- Brian

mmihai · Dec 2, 2012

On Friday, November 30, 2012 4:23:51 PM UTC-8, Allan Herriman wrote:

On Fri, 30 Nov 2012 11:22:35 -0800, mmihai wrote:

Any way to constrain the hold target >0.0 only for some specific paths?

Simply by specifying clocked logic you have constrained the hold time to
be > 0 ns.

I don't know of a way of adding extra margin though. I believe the best
approach is to avoid the need for extra margin.

Yes, I meant extra margin. It seems the hold target is 0.0ns (I would guess the numbers include some padding).

This is a tool bug. You have zero chance of fixing the tool, however you
do have a good chance of being able to step around the bug.

It looks like a tool bug.
It is very disturbing that it is not related to a particular version and it's on multiple [virtex] families...

I would expect the things to work if STA has good numbers.
My confidence in the tools took a hit ...

Some other suggestions:

- Lock the placement of the BUFG and BUFR. You might find there is some
magic combination of placements that just works. Earlier you said that
some runs of PAR would produce designs that worked. Copy the placement
from those runs as a starting point.

- constrain the logic in the BUFG domain to be physically apart from the
BUFR region. This forces longer routes on the chip that will improve
your hold time margin.

a) I did not see any correlation between passing/failing and particular BUFR/BUFG placement
b) I think it is a risky approach; if I can get a particular map/par run to work on some systems ... I have no guarantee it will be fine on //all// systems, over PVT.

- Finally the brute force approach: treat the BUFG and BUFR clocks as if
they were different clock domains. Use some sort of FIFO that is
designed to handle different clock domains to pass data from the BUFR
domain to the BUFG domain.

Something like this could be the best solution, if doable ... but it's a pity to add logic because Xilinx tools can't handle the clock tree properly....

My logic looks very much like Figure 1-24/Page 28 from UG362, except I don't use BUFIO, so it is not that exotic.

--
mmihai

mmihai · Dec 2, 2012

On Saturday, December 1, 2012 2:28:43 AM UTC-8, Brian Drummond wrote:

Is this a new design for the V6 or a port from another FPGA or a previous
ISE release? Are you directly instantiating these primitives and checking
that they are still there in the RTL view?

New design ... and BUFR/BUFG instantiated by hand.
I've looked in fpga_editor and the buffers are there.

--
mmihai

Brian Drummond · Dec 2, 2012