Aligned PLL clocks in RTL simulation

Jonathan Bromley
Every cloud has a silver lining, but it seems
every rose has its thorns too.

PLLs/DCMs/DLLs (or whatever your favourite FPGA
happens to offer) provide a wonderful way to create
multiplied-up clocks within the device. What's more,
you can line up the active clock edges so closely
that you can treat the x1 and xN clock domains as
if they were one single clock domain; hold times
can be avoided when crossing the boundary in either
direction.

Until recently I've always avoided taking advantage
of this, and have treated the x1 and xN clock domains
as if they were asynchronous, using FIFOs or whatever
to convey things across the boundary. But in a recent
client engagement I was faced with a design in which
a x2 and x4 clock, from the same PLL, were used in
a completely sensible way as if they were in the same
clock domain as the original x1 clock. The TimeQuest
timing analyzer (for it was Brand A that was in use
on this occasion) was quite happy to deal with these
crossings, giving clear-headed and (as it turned out)
accurate reports of what was going on. There is no
doubt that this is cool.

However, it's not so cool in RTL simulation. The
PLL simulation models, not too surprisingly,
introduce some delta delays between the
nominally coincident clock edges. Consequently
I get everything working when going in one direction
(from fast clock to slow clock, as it turns out)
but I get shoot-through behaviour, the RTL equivalent
of a hold time violation, when crossing from slow to
fast clock; data is arriving one or more delta cycles
*before* the clock.

We've easily enough got around this for the present
design, but I'd love to know what all you seasoned
PLL/DCM users out there do about it. Do you
introduce small non-zero time delays in all the
signals crossing the clock domains, so that it all
works in simulation? Do you treat the various
clock domains as if they were asynchronous, thereby
losing one of the nicest benefits of the PLLs? Or
do you simply accept that it's necessary to do timing
simulation in order to see what will really happen?

This is partly a plague of VHDL RTL sim (hence the
posting to c.l.vhdl as well); in Verilog you can
model clock gating and PLL-ish behaviour with "less"
zero delay than the nonblocking assignments to your
flip-flops, by taking care to use blocking assignment
in all your clock paths. I have not yet tried the
Verilog simulation models for the PLLs to see whether
that makes any difference.
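To illustrate what I mean, here is a sketch (names invented; not any vendor's model) of a behavioural x2 multiplier whose output edges are produced by blocking assignments, so they are scheduled ahead of the nonblocking flip-flop updates in the same time step:

```verilog
// Simulation-only sketch of a x2 clock multiplier (hypothetical,
// not a vendor model).  The blocking assignments to clk_x2 are
// evaluated in the active event region, before the nonblocking
// updates that model flip-flop outputs, so x1 -> x2 transfers
// do not shoot through.
module sim_pll_x2 (input wire clk_x1, output reg clk_x2);
  realtime t_last = 0.0;
  realtime period = 0.0;
  initial  clk_x2 = 1'b0;
  always @(posedge clk_x1) begin
    period = $realtime - t_last;   // measured x1 period
    t_last = $realtime;            // (approximate until two edges seen)
    clk_x2 = 1'b1;                 // rising edge coincident with clk_x1
    if (period > 0.0) begin
      #(period/4.0) clk_x2 = 1'b0;
      #(period/4.0) clk_x2 = 1'b1;
      #(period/4.0) clk_x2 = 1'b0;
    end
  end
endmodule
```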

One further whinge: I haven't tried this in Brand X
recently, but the Altera PLL models are spectacularly
inefficient for RTL simulation. In our modest-size
project - think SDRAM controller, a few FIFOs occupying
most of the blockRAM, and a fairly small bunch of
additional logic - the two PLLs are responsible for
at least 90% of the simulation time - OUCH. I swapped in
much simpler, but perfectly adequate, in-house models and
got a x10 simulation speedup.
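For what it's worth, the replacement need be nothing fancier than something like this (a sketch, not our actual model; names and the 10 ns period are invented, and a real model also needs reset/locked handling). Because both outputs are driven from a single process, their coincident edges occur in the same delta cycle:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Sketch of a simple stand-in for the vendor PLL model.
entity sim_pll is
  generic (PERIOD_X1 : time := 10 ns);  -- invented x1 period
  port (clk_x1 : out std_logic;
        clk_x2 : out std_logic);
end entity sim_pll;

architecture sketch of sim_pll is
begin
  gen : process
  begin
    -- coincident rising edges: driven from one process, same delta
    clk_x1 <= '1';  clk_x2 <= '1';
    wait for PERIOD_X1 / 4;
    clk_x2 <= '0';
    wait for PERIOD_X1 / 4;
    clk_x1 <= '0';  clk_x2 <= '1';
    wait for PERIOD_X1 / 4;
    clk_x2 <= '0';
    wait for PERIOD_X1 / 4;
  end process gen;
end architecture sketch;
```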

Opinions/rants/insults welcomed. Thanks in advance.
--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan.bromley@MYCOMPANY.com
http://www.MYCOMPANY.com

The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.
 
In comp.arch.fpga Jonathan Bromley <jonathan.bromley@mycompany.com> wrote:
....

Opinions/rants/insults welcomed. Thanks in advance.
I have a similar problem:
a 20 MHz "clock_in" is multiplied by five internally and used as "clk",
and also doubled to give "clkx2".
The clock_in itself is not used directly.
I use this:

`ifdef __ICARUS__
   // simulation-only clock generation for Icarus Verilog
   reg clkx2 = 0;
   reg clk   = 0;

   always @(posedge clk_in)
      {clk, clkx2} <= {clk, clkx2} + {2{clk_in}};
   assign alu_ctl_bits[`CMD_RST] = 1'b0;
`else
   wire clk, clkx2;
   wire clk80, clkacdcm, clkacx2dcm;

   clk100 dcm0 (
      .CLKIN_IN        (clk_in),
      .RST_IN          (alu_ctl_cmd[`CMD_RST]),
      .CLKFX_OUT       (clk80),
      .CLKIN_IBUFG_OUT ()
   );

   DCM dcmac (
      .CLKIN  (clk80),
      .CLKFB  (clkx2),
      .RST    (alu_ctl_cmd[`CMD_RST]),
      .CLK0   (clkacdcm),
      .CLK2X  (clkacx2dcm),
      .LOCKED (alu_ctl_bits[`CMD_RST])
   );
   BUFG clkbuf   (.I(clkacdcm),   .O(clk));
   BUFG clkx2buf (.I(clkacx2dcm), .O(clkx2));
`endif // !`ifdef __ICARUS__

For simulation I now use clk_in == clk
--
Uwe Bonnes bon@elektron.ikp.physik.tu-darmstadt.de

Institut fuer Kernphysik Schlossgartenstrasse 9 64289 Darmstadt
--------- Tel. 06151 162516 -------- Fax. 06151 164321 ----------
 
Jonathan Bromley wrote:

We've easily enough got around this for the present
design, but I'd love to know what all you seasoned
PLL/DCM users out there do about it. Do you
introduce small non-zero time delays in all the
signals crossing the clock domains, so that it all
works in simulation? Do you treat the various
clock domains as if they were asynchronous, thereby
losing one of the nicest benefits of the PLLs? Or
do you simply accept that it's necessary to do timing
simulation in order to see what will really happen?
Naturally, I would prefer to fix up the design
to use a single clock or another "known good"
synchronization scheme.

If I were forced to use both clocks,
and to trust that the vendor got the
analog portions of the PLL right, I would
write a simplified rtl model that
just trusted the vendor specs.
I don't think a gate sim would make me feel better.
Maybe a SPICE sim would ;)

This is analogous to the case of a
two flop bit synchronizer. I might simplify
a model that gave me 'U' outputs for
setup violations because I "believe"
the synchronizer will work well enough.

... I swapped-in
much simpler, but perfectly adequate in-house models and
got x10 simulation speedup.
Sounds reasonable to me.


-- Mike Treseler
 
Jonathan Bromley wrote:

I swapped-in
much simpler, but perfectly adequate in-house models and
got x10 simulation speedup.
Ditto!

Regards,

--
Mark McDougall, Engineer
Virtual Logic Pty Ltd, <http://www.vl.com.au>
21-25 King St, Rockdale, 2216
Ph: +612-9599-3255 Fax: +612-9599-3266
 
Jonathan Bromley <jonathan.bromley@MYCOMPANY.com> wrote in
news:5ra3i41ksqpt0r432qv6tulvijk28s7qta@4ax.com:

Every cloud has a silver lining, but it seems
every rose has its thorns too.

[snip]

Opinions/rants/insults welcomed. Thanks in advance.

I use a behavioural clock generator that has 0 skew outputs, specifically
to avoid many of the problems you observe with vendors' PLLs.

Yet another problem: Some PLL models can't accept jitter. I recently
had an Altera PLL tell me that it was unlocking because my input clock
was changing frequency. My input clock had a stable frequency, but with
a jitter equal to the timing resolution of the simulator (which is
necessary to simulate clocks whose period isn't an integer
multiple of the resolution, e.g. 155.52 MHz with a 1 ns resolution).

Regards,
Allan
 
On Nov 17, 3:38 pm, Mike Treseler <mtrese...@gmail.com> wrote:
[snip]

This is analogous to the case of a
two flop bit synchronizer. I might simplify
a model that gave me 'U' outputs for
setup violations because I "believe"
the synchronizer will work well enough.

... I swapped-in
much simpler, but perfectly adequate in-house models and
got x10 simulation speedup.

Sounds reasonable to me.

   -- Mike Treseler
Not quite where Jonathan was headed with this, but:

Applying "standard" synchronization techniques to not-quite-
asynchronous interfaces can cause, and has caused, problems. With truly
asynchronous interfaces, the probability that an input will fall
within the narrow region that causes metastability lasting long enough
to be a problem (with a two-flop synchronizer) is extremely low.
However, if the two clock domains are related, such an event can
happen much more often (or never at all). If such events do happen
(i.e. the stars align...) they will happen much more frequently
(i.e. the stars will stay aligned).

If at all possible I would take steps to ensure that either the clocks
are related and a fully synchronous interface is employed, or that
they are not related and asynchronous interface techniques are
employed. Failing that, a three stage synchronizer should be
considered.

I have solved the simulation problem in the past by running the main
clock through the same module where the DCM is, and providing a 1:1
clock output that is delayed (RTL) for the same number of delta cycles
as the DCM delays its output. That delayed 1:1 output is used to drive
the rest of the design. This is not always easy, especially when the
DCM would otherwise best be buried down at an appropriate level of
hierarchy along with its associated functionality.
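In VHDL the idea looks something like this (names invented here; the number of delta delays the real DCM model and buffers add must be checked against the vendor model actually in use):

```vhdl
-- Fragment of the clock module, next to the real DCM instance.
-- Assume (to be verified!) that the DCM model plus BUFG adds two
-- delta delays to its outputs; the 1:1 copy of the input clock is
-- therefore re-assigned twice, so its edges land in the same delta
-- cycle as the DCM's buffered outputs.
clk_in_d <= clk_in;     -- delta 1: mirrors the DCM output stage
clk_x1_o <= clk_in_d;   -- delta 2: mirrors the BUFG
```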

Andy
 
On Tue, 18 Nov 2008 06:36:22 -0800 (PST), Andy wrote:

Applying "standard" synchronization techniques to not-quite-
asynchronous interfaces can and has caused problems. With truly
asynchronous interfaces, the probability that an input will fall
within the narrow region that causes metastability lasting long enough
to be a problem (with two flop synchronizers) is extremely rare.
However, if the two clock domains are related, such an event can
happen much more often (or never at all). If they do happen (i.e. the
stars align...) they will happen much more frequently (i.e. the stars
will stay aligned).
Yes. Worse still, you can easily lose track of which source
clock gave rise to the datum on a given destination clock,
because the quasi-static phase relationship between the
two clocks is unknown and highly variable from one instance
of the design to another. I suffered this on the same
recent project: part of the design was, for very good reasons,
clocked by exactly the main system clock that had been through
a chain of external buffers (thereby allowing the design to
track temperature/voltage/process variations in the behaviour
of other signals that went through similar external buffers).
I had the devil of a time trying to persuade the designers
that we needed to know the window within which the delayed
clock would fall, so that we could decide which edge of it
belonged with which edge of the master clock. Of course,
no-one had thought to provide a synchronous "data valid"
signal that could have been used to track this.

I have solved the simulation problem in the past by running the main
clock through the same module where the DCM is, and providing a 1:1
clock output that is delayed (RTL) for the same number of delta cycles
as the DCM delays its output. That delayed 1:1 output is used to drive
the rest of the design. This is not always easy, especially when the
DCM would otherwise best be buried down at an appropriate level of
hierarchy along with its associated functionality.
Perfect summary of the issues I was hoping to raise. Thanks.
--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan.bromley@MYCOMPANY.com
http://www.MYCOMPANY.com

The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.
 
"Jonathan Bromley" <jonathan.bromley@MYCOMPANY.com> wrote in message
news:jgn5i4hgvo65a0stlpsfsmlo6de0q1lugs@4ax.com...
I have solved the simulation problem in the past by running the main
clock through the same module where the DCM is, and providing a 1:1
clock output that is delayed (RTL) for the same number of delta cycles
as the DCM delays its output. That delayed 1:1 output is used to drive
the rest of the design. This is not always easy, especially when the
DCM would otherwise best be buried down at an appropriate level of
hierarchy along with it's associated functionality.

Perfect summary of the issues I was hoping to raise. Thanks.
--
Jonathan Bromley, Consultant

It also seems that if the design uses only the outputs from the DCM,
i.e. CLK0, CLKDV, CLK2X, which is the way they are 'meant' to be used, then
they are already aligned. Problems arise when folks subsequently add stuff
to their VHDL like:-

my_clock <= his_clock;

This assignment is optimised away in real life, but in the simulation,
my_clock is now a delta later than his_clock, and maybe no longer aligns
with his_clock_2X.

HTH., Syms.
 
PLLs/DCMs/DLLs (or whatever your favourite FPGA
happens to offer) provide a wonderful way to create
multiplied-up clocks within the device. What's more,
you can line up the active clock edges so closely
that you can treat the x1 and xN clock domains as
if they were one single clock domain; hold times
can be avoided when crossing the boundary in either
direction.
Do the vendors actually support that mode?

It seems reasonable, but I remember some discussion from
a year or three ago where somebody eventually tracked
a bug down to it not quite working.

Newer silicon might take that into account.

The basic idea is that the Xilinx tools don't bother
checking hold times. All their FFs have "0 hold time".
What that really means is that the min clock-to-out time
plus min prop delays are enough to cover the hold time
and the clock skew.

The catch is that you can get additional skew if you
are using two clocks even though they should be aligned.

--
These are my opinions, not necessarily my employer's. I hate spam.
 
Jonathan Bromley wrote:
Every cloud has a silver lining, but it seems
every rose has its thorns too.

PLLs/DCMs/DLLs . . .

We've easily enough got around this for the present
design, but I'd love to know what all you seasoned
PLL/DCM users out there do about it. Do you
introduce small non-zero time delays in all the
signals crossing the clock domains, so that it all
works in simulation? Do you treat the various
clock domains as if they were asynchronous, thereby
losing one of the nicest benefits of the PLLs? Or
do you simply accept that it's necessary to do timing
simulation in order to see what will really happen?
I haven't had to do this, so I will introduce a fourth option:
if all the clocks are truly aligned, have you tried removing
delta-cycle differences by adding a small non-zero time
delay (less than tperiod_Clk/2) to the clock outputs?

Clk_X1_DS <= Clk_X1 after 1 ns ;
Clk_X2_DS <= Clk_X2 after 1 ns ;
Clk_X4_DS <= Clk_X4 after 1 ns ;

Since synthesis tools ignore after (or at least are supposed to),
this should be ok to add to the RTL code.

Cheers,
Jim
--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jim Lewis SynthWorks VHDL Training http://www.synthworks.com

A bird in the hand may be worth two in the bush,
but it sure makes it hard to type.
 
Symon wrote:
It also seems that if the design uses only the outputs from the DCM,
i.e. CLK0, CLKDV, CLK2X, which is the way they are 'meant' to be used,
then they are already aligned.

That's how the DCM model is _supposed_ to work, but I've
encountered problems with delta delay offsets in the Xilinx
DCM models in years past:

http://groups.google.com/group/comp.arch.fpga/msg/6e5b0b6da92b4ad1

Other thoughts:

As suggested elsewhere on the thread, I usually attempt
to bundle all the DCMs into a clock module that can be
replaced by a simpler model for functional sims.

This module also gives a handy spot to take care of any DCM
reset sequencing, unlock logic, and the startup enables needed
to avoid the insidious initialized BRAM corruption feature.
( IIRC, xapp873.zip has an example of this sort of thing )

One additional suggestion: even if one replaces the DCM module
for functional sims, it is very helpful to thoroughly beat up
the actual DCM module in its own simulation testbench to make
sure all of the required startup and unlock recovery sequencing
is done properly without the simulation model throwing any errors.

Brian
 
Jonathan Bromley wrote:

We've easily enough got around this for the present
design, but I'd love to know what all you seasoned
PLL/DCM users out there do about it. Do you
introduce small non-zero time delays in all the
signals crossing the clock domains, so that it all
works in simulation? Do you treat the various
One trick I use with ModelSim is to force the
clock signals onto the DCM outputs from the simulator.
When forced signals are created they work in the same
delta cycle. The same trick can also be used with ASICs
that contain clock buffers for clock-tree roots, etc.

--Kim
 
Kim wrote:

At least one trick I use with modelsim is to force
clock signal to the DCM outputs from the simulator.
When forced signals are created they work on the same
delta cycle. The same trick can be used also with asics
that contain clock buffers for clock tree roots etc.

--Kim
This is good to know as VHDL-2008 gives you this capability
directly in code.
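For example (hierarchy names invented), an external name plus a force assignment lets the testbench override the DCM outputs with zero-skew clocks:

```vhdl
-- VHDL-2008 testbench fragment; the paths are hypothetical.
-- clk_tb and clkx2_tb come from a single testbench process,
-- so their coincident edges share the same delta cycle;
-- forcing them onto the DCM outputs removes the model's skew.
<< signal .tb.dut.u_clkgen.clk   : std_logic >> <= force clk_tb;
<< signal .tb.dut.u_clkgen.clkx2 : std_logic >> <= force clkx2_tb;
```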

Jim

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jim Lewis SynthWorks VHDL Training http://www.synthworks.com

A bird in the hand may be worth two in the bush,
but it sure makes it hard to type.
 
