V6 BUFR -> BUFG clocking structure (hold issue?)

mmihai <iiahim@yahoo.com> wrote:

(snip, someone wrote)

This is a tool bug. You have zero chance of fixing the tool,
however you do have a good chance of being able to step
around the bug.

It looks like a tool bug.
It is very disturbing that it is not related to a particular
version and it's on multiple [virtex] families...

I would expect things to work if STA has good numbers.
My confidence in the tools took a hit ...
(snip)

Something like this could be the best solution, if doable ...
but it's a pity to add logic because Xilinx tools can't
handle the clock tree properly....
It seems to me that they do pretty well.

Well, the effects of voltage and temperature should be pretty
much the same for all transistors on a chip. But process variations
could be very different.

They verify that the usual paths have delay variations that they
can account for, and compute delays based on those. If there are
some paths whose delays they can't account for, at least not to the
accuracy required, then they don't guarantee those.

As far as I understand, though mostly in general, the idea is
to make clock skew in a clock tree small enough, relative to
the minimum delay through routing, that two FFs clocked off
the same clock can't violate hold time. The skew also must
be added to the delay when verifying setup time.
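
To make the hold and setup checks described above concrete, here is a minimal sketch. It assumes skew is signed (destination clock arrival minus source clock arrival); a tool that only knows a bound on the skew would charge it against both checks. The function names and the example numbers other than 1.851 are hypothetical, not taken from any Xilinx report.

```python
def hold_slack(tco_min, route_min, skew, t_hold):
    """Hold slack in ns.

    The earliest new data reaches the destination flop at tco_min + route_min
    after the source clock edge. The destination flop needs the old data to
    stay stable until skew + t_hold after that same source edge.
    """
    return (tco_min + route_min) - (skew + t_hold)


def setup_slack(period, tco_max, route_max, skew, t_setup):
    """Setup slack in ns for a same-edge, single-cycle path.

    The capturing edge arrives at period + skew; the latest data arrives at
    tco_max + route_max and must be there t_setup before the capturing edge.
    """
    return (period + skew) - (tco_max + route_max + t_setup)


# Hypothetical flop and routing numbers (ns), for illustration only.
TCO_MIN, ROUTE_MIN, T_HOLD = 0.3, 0.2, 0.1

# Small skew inside one clock tree: hold is met.
small_skew_slack = hold_slack(TCO_MIN, ROUTE_MIN, skew=0.1, t_hold=T_HOLD)

# The 1.851 ns skew reported in this thread: hold fails unless the router
# adds well over a nanosecond of extra data delay.
large_skew_slack = hold_slack(TCO_MIN, ROUTE_MIN, skew=1.851, t_hold=T_HOLD)

print(f"hold slack, skew 0.100 ns: {small_skew_slack:+.3f} ns")
print(f"hold slack, skew 1.851 ns: {large_skew_slack:+.3f} ns")
```

With these made-up flop numbers the first case has positive slack and the second is negative by more than a nanosecond, which is the amount routing would have to make up.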

But that only works within one clock tree. Computing the
variation between two clock trees is different.

Now, it would be nice to say that some delay is not characterized
enough to use, and so far I haven't seen that they do say that,
but it isn't the tools' fault if the data isn't available.

-- glen
 
On Fri, 30 Nov 2012 11:22:35 -0800, mmihai wrote:

On Friday, November 30, 2012 4:46:51 AM UTC-8, Allan Herriman wrote:

Thanks for your comments.

Most interesting ... different chip(V4) had same issues....

Moral: BUFGs have a large delay. Don't expect PAR to be able to make
up for that amount of hold time using routing.

I don't think it is the BUFG delay; my guess is that it is more related
to routing. Based on the datasheet, the BUFG delay is 0.10ns .... the
reported "Clock Path Skew" is 1.851ns... whatever that includes.
Sorry I missed that earlier. You seem to be mixing up skew and delay.

datasheet BUFG delay is 0.10ns
That figure is the BUFG skew, not the BUFG delay.
It represents the worst case timing difference between outputs on the
same BUFG. It isn't relevant to your problem.


The "Clock Path Skew" is the important figure. It is the difference
between the time of arrival of the clock at the source (clocked from
BUFR) flip flops and destination (clocked from BUFG) flip flops. In this
case it is mostly made up of the BUFG delay.

PAR has to include a routing delay to compensate for that skew.

An earlier comment:

One solution might be to insert FFs clocked from the other edge of the
BUFG clock.

I thought about that .... can't do it, the clock is fast and it
won't meet setup for half clock cycle.
It might not meet setup for half a clock cycle, but it doesn't have to!
The skew works in your favour when using opposite edges and the
requirement for setup time is half a clock cycle + 1.851ns. Unless you
have a GHz clock that doesn't sound too hard.
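
The opposite-edge arithmetic above can be sketched as follows. The 1.851 ns skew is the figure from the timing report in this thread; the 4.0 ns clock period is purely a hypothetical example (the thread only says the clock is "fast"), and the function names are made up for illustration.

```python
def opposite_edge_setup_budget(period, skew):
    """Time (ns) available for Tco + routing + setup.

    Launch is on the rising edge of the early (BUFR) clock; capture is on the
    next falling edge of the late (BUFG) clock, which arrives `skew` later.
    """
    return period / 2 + skew


def opposite_edge_hold_margin(period, skew):
    """Extra hold margin (ns) gained from using the opposite edge.

    The previous capturing (falling) edge sits at skew - period/2 relative to
    the launch edge, so the hold check gets period/2 - skew of margin on top
    of the usual tco_min + route_min - t_hold.
    """
    return period / 2 - skew


SKEW_NS = 1.851    # "Clock Path Skew" reported in the thread
PERIOD_NS = 4.0    # hypothetical clock period, not from the thread

budget = opposite_edge_setup_budget(PERIOD_NS, SKEW_NS)
margin = opposite_edge_hold_margin(PERIOD_NS, SKEW_NS)

print(f"setup budget: {budget:.3f} ns")
print(f"hold margin:  {margin:+.3f} ns")
```

Under these assumptions the setup budget is half a period plus the skew, and the hold margin stays positive as long as half a period exceeds the skew.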


Regards,
Allan
 
Glen,

All three (voltage, temperature and process) vary over a single die, but not by much. The trick is always "by how much?" Are we willing to live with slower guaranteed performance in order to simplify the analysis, or is it worth it to invest more in the analysis (NRE) to "speed up" the parts (recurring profit)?

Managing hold time is a lot more complicated than it used to be. In the past, the clock skew could always be less than Tco plus minimum routing by design, so they did not even spec hold time for the registers. Over time, the raw speed of the devices has out-stripped the skew of the clock tree, and hold time is a real problem that has to be taken care of in placement and routing. We users just don't have control over the clock tree itself to deal with the problem, like in other domains.

Andy
 
jonesandy@comcast.net wrote:

All three (voltage, temperature and process) vary over a
single die, but not by much. The trick is always "by how much?"
Are we willing to live with slower guaranteed performance in
order to simplify the analysis, or is it worth it to invest
more in the analysis (NRE) to "speed up" the parts
(recurring profit)?

Managing hold time is a lot more complicated than it used to be.
In the past, the clock skew could always be less than Tco plus
minimum routing by design, so they did not even spec hold time
for the registers.
I do remember specs of 0ns hold time. Hold time can even go
negative in some cases. I think I remember some TTL parts with
negative hold time, but that is some years ago.

Over time, the raw speed of the devices has
out-stripped the skew of the clock tree, and hold time is a real
problem that has to be taken care of in placement and routing.
Xilinx used to publish actual books about their parts.
We could read about them, understand them, and use them
appropriately. Yes, I am remembering from some generations ago.

We users just don't have control over the clock tree itself to
deal with the problem, like in other domains.
-- glen
 
On Monday, December 3, 2012 4:55:56 AM UTC-8, Allan Herriman wrote:

Sorry I missed that earlier. You seem to be mixing up skew and delay.

datasheet BUFG delay is 0.10ns

That figure is the BUFG skew, not the BUFG delay.
It represents the worst case timing difference between outputs on the
same BUFG. It isn't relevant to your problem.


The "Clock Path Skew" is the important figure. It is the difference
between the time of arrival of the clock at the source (clocked from
BUFR) flip flops and destination (clocked from BUFG) flip flops. In this
case it is mostly made up of the BUFG delay.
I don't think I am mixing skew w/ delay.

From DS152 (Virtex 6 AC-DC):

Table 59: Global Clock Switching Characteristics (Including BUFGCTRL)

TBCCKO_O(2) BUFGCTRL delay from I0/I1 to O 0.07 0.08 0.10 0.10 ns

From v6 speedprint:

BUFG

Tbgcko_O (33/35) (66/70)

BUFGCTRL

Tbccko_O (33/35) (66/70)

So the delay through BUFGCTRL itself is small. However, the total clock insertion delay is much bigger since it includes the routing. Xilinx is not verbose with the clock path, or at least I don't know how to generate it....

In the timingan report, all I get is:

Clock Path Skew: 1.851ns (2.677 - 0.826)

Yes, it is skew; it is bigger than usual because the 1st clock (0.826ns insertion delay) is driven by the BUFR and the 2nd clock, the target, (2.677ns insertion delay) is driven by the BUFG fed by BUFR. The delta, 1.851ns, is much higher than the prop delay through the buffer - my interpretation is 0.1ns in buffer, rest in routing and/or distribution (both to and from BUFG); we don't see netlists. I don't think Xilinx has much info about that clock routing available for general public.
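
As a quick check of the arithmetic above, here is a small sketch that parses the quoted report line and splits the skew the way it is interpreted in this post. The regular expression is an assumption based only on the single line quoted here, not on any documented report format, and reading the parenthesised pair as (destination - source) follows the post's own interpretation. The 0.10 ns figure is the datasheet BUFGCTRL number quoted above.

```python
import re

# Pattern written against the one line quoted in this thread (assumed format).
NUM = r"(-?\d+(?:\.\d+)?)"
SKEW_LINE = re.compile(
    r"Clock Path Skew:\s*" + NUM + r"ns\s*\(\s*" + NUM + r"\s*-\s*" + NUM + r"\s*\)"
)


def parse_clock_path_skew(line):
    """Return (skew, dest_insertion, src_insertion) in ns from a report line."""
    match = SKEW_LINE.search(line)
    if match is None:
        raise ValueError(f"not a Clock Path Skew line: {line!r}")
    skew, dest, src = (float(group) for group in match.groups())
    return skew, dest, src


skew, dest, src = parse_clock_path_skew("Clock Path Skew: 1.851ns (2.677 - 0.826)")

# The reported skew should equal the difference of the two insertion delays.
consistent = abs((dest - src) - skew) < 0.0005

# Datasheet BUFGCTRL delay quoted above; the remainder is what this post
# attributes to routing and clock distribution.
BUFGCTRL_DELAY_NS = 0.10
remainder = skew - BUFGCTRL_DELAY_NS

print(f"skew {skew:.3f} ns, consistent with insertion delays: {consistent}")
print(f"not explained by the datasheet buffer delay: {remainder:.3f} ns")
```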

--
mmihai
 
On Sunday, December 2, 2012 2:22:37 PM UTC-8, glen herrmannsfeldt wrote:

It seems to me that they do pretty well.
I would not say 'pretty well'; they're not bad but not very good either.
Otherwise I would not have problems on the hardware when I'm getting >0.0ns hold slack. And from this thread I would say I am not alone in seeing this problem, and it is not limited to one tool version or one chip.

Well, the effects of voltage and temperature should be pretty
much the same for all transistors on a chip. But process variations
could be very different.
Huh? Numbers for STA should cover PVT. That 'P' stands for process. I am not sure what your point is: that the numbers could be wrong because the process has variations?

Now, it would be nice to say that some delay is not characterized
enough to use, and so far I haven't seen that they do say that,
but it isn't the tools' fault if the data isn't available.
I would like to think they're extracting/characterizing all the delays involved in their fabric.... otherwise nobody would use these devices; they wouldn't work in a reliable fashion.

For a successful STA you need good delay extraction and a good algorithm for understanding the design/constraints and computing paths. In this case both the extraction/delay computation and the timing analysis tools are from Xilinx. In my case it looks like the delays might be off.... but since the delay data is from Xilinx (and you have no 2nd choice) I'll still call it bad STA in their flow....

--
mmihai
 
On Monday, December 3, 2012 4:55:56 AM UTC-8, Allan Herriman wrote:

One solution might be to insert FFs clocked from the other edge of the
BUFG clock.

I thought about that .... can't do it, the clock is fast and it
won't meet setup for half clock cycle.

It might not meet setup for half a clock cycle, but it doesn't have to!
The skew works in your favour when using opposite edges and the
requirement for setup time is half a clock cycle + 1.851ns. Unless you
have a GHz clock that doesn't sound too hard.
I was starting to review that this weekend ...

It could be that the next logic level had the issue.... because I ended up with half a clock cycle for the next stage. That should not be a problem though; I can add a new set of flops to realign to the proper edge w/o any logic in between.

This is the path I am exploring right now.

--
mmihai
 
On Mon, 03 Dec 2012 12:29:40 -0800, mmihai wrote:

On Monday, December 3, 2012 4:55:56 AM UTC-8, Allan Herriman wrote:

Sorry I missed that earlier. You seem to be mixing up skew and delay.

datasheet BUFG delay is 0.10ns

That figure is the BUFG skew, not the BUFG delay.
It represents the worst case timing difference between outputs on the
same BUFG. It isn't relevant to your problem.


The "Clock Path Skew" is the important figure. It is the difference
between the time of arrival of the clock at the source (clocked from
BUFR) flip flops and destination (clocked from BUFG) flip flops. In
this case it is mostly made up of the BUFG delay.

I don't think I am mixing skew w/ delay.

From DS152 (Virtex 6 AC-DC):

Table 59: Global Clock Switching Characteristics (Including
BUFGCTRL)

TBCCKO_O(2) BUFGCTRL delay from I0/I1 to O 0.07 0.08 0.10 0.10 ns

From v6 speedprint:

BUFG

Tbgcko_O (33/35)
(66/70)

BUFGCTRL

Tbccko_O (33/35)
(66/70)

So the delay through BUFGCTRL is small. However, the delay is much
bigger since it includes the routing. Xilinx is not verbose with the
clock path, or at least I don't know how to generate it....

For timingan report all I get is:

Clock Path Skew: 1.851ns (2.677 - 0.826)

Yes, it is skew; it is bigger than usual because the 1st clock (0.826ns
insertion delay) is driven by the BUFR and the 2nd clock, the target,
(2.677ns insertion delay) is driven by the BUFG fed by BUFR. The delta,
1.851ns, is much higher than the prop delay through the buffer - my
interpretation is 0.1ns in buffer, rest in routing and/or distribution
(both to and from BUFG); we don't see netlists. I don't think Xilinx has
much info about that clock routing available for general public.

Yes, I see those figures in the datasheet. They don't make much sense to
me though - I'm fairly sure the actual delay through the BUFG is much
larger than 0.10 ns worst case. Your STA results seem to be in
agreement with me.

This might be one of those cases where the datasheet timing model doesn't
match reality. Total delay through the routing to the BUFG plus the
BUFGMUX logic plus the distribution tree itself plus the routing out of
the BUFG comes to 1.851ns. Since those figures can't really be separated
(in that only their sum matters) Xilinx can assign any figure it wants to
some internal part that gets published in the datasheet.

All of this is speculation on my part, of course. It's unfortunate that
knowledgeable Xilinx staff don't contribute in this newsgroup. You could
ask the same question on the Xilinx forums, but I find it's unusual to
get a good answer there.

Regards,
Allan
 
On Monday, December 3, 2012 1:56:49 PM UTC-8, Allan Herriman wrote:

Yes, I see those figures in the datasheet. They don't make much sense to
me though - I'm fairly sure the actual delay through the BUFG is much
larger than 0.10 ns worst case. Your STA results seem to be in
agreement with me.

This might be one of those cases where the datasheet timing model doesn't
match reality. Total delay through the routing to the BUFG plus the
BUFGMUX logic plus the distribution tree itself plus the routing out of
the BUFG comes to 1.851ns. Since those figures can't really be separated
(in that only their sum matters) Xilinx can assign any figure it wants to
some internal part that gets published in the datasheet.
I think we are on the same page here. I just wanted to point out that I am not mixing up skew and delay :)
I don't expect the clock tree to have a single big buffer (i.e. one gate) that drives it. I think the number in the datasheet covers only one (input) gate of the clock tree; the following drivers & routing are lumped into the delay number reported by timingan.

All of this is speculation on my part, of course. It's unfortunate that
knowledgeable Xilinx staff don't contribute in this newsgroup. You could
ask the same question on the Xilinx forums, but I find it's unusual to
get a good answer there.
I do agree with you on this one too :)

I've posted the head of this thread on the Xilinx forums... no reply so far.

--
mmihai
 
