Optimizing an inferred counter

Marty Ryba

Hello everyone,

After banging our heads for the last few weeks (sometimes literally), I
figure I'll query the group of experts here. We have a design that is
functionally correct (ModelSim test bench) but it appears to be very iffy
when it gets on the real chip. I have a couple copies of "identical" boards
with Virtex2-1000 chips on them. I'll check again soon, but I believe they
are -6 parts. We've been synthesizing the design in Synplify Pro (v7.5
though others are available; this design has some history of working fine).
Sometimes it works on one or more boards, other times after I load it (and
verify) with iMPACT this counter acts screwy (it messes up critical timing,
and it also looks all wrong on Chipscope). Today I got brave enough to load
it into the EEPROM; it worked this afternoon but who knows tomorrow (grrr).
Looking at the Synplify timing report (with -4 speed setting in Synplify),
the timing is marginal for a specified clock of 100 MHz around this path,
but the chip is really running at 66 MHz (PCI clock). The key code is very
simple (some syntax may be a bit off since I'm doing it from memory). We
tried trimming the size of the counter from 32 bits down to 20 and it seems
to help some.

signal my_counter : std_logic_vector(COUNT_WIDTH-1 downto 0);

countdown_process: process (CLK)
begin
  if rising_edge(CLK) then  -- do everything synchronous
    if RESET = '1' then
      my_counter <= ZEROIZE_COUNT;
    elsif counter_load = '1' then
      my_counter <= input_bus(COUNT_WIDTH-1 downto 0);
    elsif making_data = '1' then
      my_counter <= my_counter - '1';
    end if;
  end if;  -- CLK
end process countdown_process;

Another process block checks that this counter is nonzero and a few other
requirements to set the value of making_data true or false.

Ideas/suggestions? My main FPGA engineer has been working with the local
Xilinx FAE but no "Eureka!" moments yet. A very similar (if anything,
somewhat larger) design has never shown this trouble on the same boards. The
current design takes about 45% of the LUTs.

I notice the RTL shows this using an adder (it fans out making_data to
COUNT_WIDTH bits so that it becomes either the 0 or -1 to add). Isn't there
a simpler structure to define a count(down) counter? I sure can buy a simple
counter in 74xx series logic. I notice neither std_logic_arith nor numeric_std
define special operators (a la C's ++ and -- operators) to specify
increment/decrement, so maybe there is no simple way to create a counter
structure in fewer FPGA logic elements. Seems odd to me (non-EE).

Thanks in advance for your sagacity.

-Dr. Marty Ryba
Mad GNSS scientist
 
Marty Ryba wrote:

Another process block checks that this counter is nonzero and a few other
requirements to set the value of making_data true or false.

Ideas/suggestions?
Combine the two processes into one.

-- Mike Treseler
 
Marty Ryba wrote:
countdown_process: process (CLK)
begin
  if rising_edge(CLK) then  -- do everything synchronous
    if RESET = '1' then
      my_counter <= ZEROIZE_COUNT;
    elsif counter_load = '1' then
      my_counter <= input_bus(COUNT_WIDTH-1 downto 0);
    elsif making_data = '1' then
      my_counter <= my_counter - '1';
    end if;
  end if;  -- CLK
end process countdown_process;

Another process block checks that this counter is nonzero and a few other
requirements to set the value of making_data true or false.
I've been banging my head as well for the last month, trying to get poor
legacy code to pass timing at 250 MHz. Based on this experience,
I'd suggest registering *everything*... well, as much as possible.
Make sure your making_data is a flop, 'cos if it is combinatorial and
based on my_counter, that's a recipe for failure.

HTH,
-P@
 
Marty Ryba wrote:

Hello everyone,

After banging our heads for last few weeks (sometimes literally), I
figure I'll query the group of experts here. We have a design that is
functionally correct (ModelSim test bench) but it appears to be very iffy
when it gets on the real chip. I have a couple copies of "identical" boards
with Virtex2-1000 chips on them. I'll check again soon, but I believe they
are -6 parts. We've been synthesizing the design in Synplify Pro (v7.5
though others are available; this design has some history of working fine).
Sometimes it works on one or more boards, other times after I load it (and
verify) with iMPACT this counter acts screwy (it messes up critical timing,
and it also looks all wrong on Chipscope). Today I got brave enough to load
it into the EEPROM; it worked this afternoon but who knows tomorrow (grrr).
Looking at the Synplify timing report (with -4 speed setting in Synplify),
the timing is marginal for a specified clock of 100 MHz around this path,
but the chip is really running at 66 MHz (PCI clock). The key code is very
simple (some syntax may be a bit off since I'm doing it from memory). We
tried trimming the size of the counter from 32 bits down to 20 and it seems
to help some.
When you have symptoms like this, which suggest the real limit is lower
than the tools report, have you tried varying the clock speed
to check whether it DOES work properly at 10 MHz or 1 MHz?

I notice the RTL shows this using an adder (it fans out making_data to
COUNT_WIDTH bits so that it becomes either the 0 or -1 to add). Isn't there
a simpler structure to define a count(down) counter? I sure can buy a simple
counter in 74xx series logic. I notice neither arith_std or numeric_std
define special operators (a la C's ++ and -- operators) to specify
increment/decrement, so maybe there is no simple way to create a counter
structure in fewer FPGA logic elements. Seems odd to me (non-EE).
In a counter, you usually need to 'see' the state of the lower bits
to decide when to toggle the upper bit - and in an FPGA the carry chain
is often faster than other paths, so that makes adders a natural
counter solution. Certainly easy to write.

For long counters, the carry pathway can limit the speed, then you can
split it and make it more complex, but faster.
Look at 74161 for a faster carry scheme.

-jg
 
I notice neither arith_std or numeric_std
define special operators (a la C's ++ and -- operators) to specify
increment/decrement, so maybe there is no simple way to create a counter
structure in fewer FPGA logic elements. Seems odd to me (non-EE).
That's because ++ and -- are nothing special: they still require an adder
with the first input fed from the adder's registered output and the
second tied to +1 or -1. An FPGA is just an array of LUTs, flip-flops and
RAMs, not a lot more (some FPGAs may have dedicated multipliers too).

As for "making_data" becoming the second adder input, I'm surprised.
It might be better to try to force it to synthesize
"making_data" as the adder's register enable rather than the second adder
input, so that you can keep the second adder input as a constant -1. As
to how to do this, I'm not sure. How about changing "my_counter" into
an unsigned (or signed, it makes no difference) using the
numeric_std package (implementation is IEEE defined) instead of the
std_logic_arith package (implementation is vendor defined and
non-standard)?
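
For what it's worth, here is a minimal sketch of that suggestion: the same counter written with numeric_std, with making_data acting only as an enable on a constant decrement. The entity name down_counter, the default width, the 32-bit input_bus and the all-zeros reset value (standing in for ZEROIZE_COUNT) are assumptions for illustration, and whether the tool really maps making_data onto the flip-flop clock enable is up to the synthesizer.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity down_counter is
  generic (COUNT_WIDTH : positive := 20);  -- assumed width, must be <= 32 here
  port (
    CLK          : in  std_logic;
    RESET        : in  std_logic;
    counter_load : in  std_logic;
    making_data  : in  std_logic;
    input_bus    : in  std_logic_vector(31 downto 0);
    count_out    : out std_logic_vector(COUNT_WIDTH-1 downto 0)
  );
end entity down_counter;

architecture rtl of down_counter is
  signal my_counter : unsigned(COUNT_WIDTH-1 downto 0) := (others => '0');
begin
  countdown_process : process (CLK)
  begin
    if rising_edge(CLK) then
      if RESET = '1' then
        my_counter <= (others => '0');   -- assumed reset value
      elsif counter_load = '1' then
        my_counter <= unsigned(input_bus(COUNT_WIDTH-1 downto 0));
      elsif making_data = '1' then
        my_counter <= my_counter - 1;    -- constant decrement, gated by the enable
      end if;
    end if;
  end process countdown_process;

  count_out <= std_logic_vector(my_counter);
end architecture rtl;
```

The only functional difference from the posted code is the type: the subtraction uses the IEEE-defined "-" for unsigned and an integer, so it no longer depends on which vendor package happens to be compiled in.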
 
"Marty Ryba" <martin.ryba.nospam@verizon.net> writes:

Hello everyone,

After banging our heads for last few weeks (sometimes literally), I
figure I'll query the group of experts here. We have a design that is
functionally correct (ModelSim test bench) but it appears to be very iffy
when it gets on the real chip. I have a couple copies of "identical" boards
with Virtex2-1000 chips on them. I'll check again soon, but I believe they
are -6 parts. We've been synthesizing the design in Synplify Pro (v7.5
though others are available; this design has some history of working fine).
Sometimes it works on one or more boards, other times after I load it (and
verify) with iMPACT this counter acts screwy (it messes up critical timing,
and it also looks all wrong on Chipscope). Today I got brave enough to load
it into the EEPROM; it worked this afternoon but who knows tomorrow (grrr).
Looking at the Synplify timing report (with -4 speed setting in Synplify),
the timing is marginal for a specified clock of 100 MHz around this path,
but the chip is really running at 66 MHz (PCI clock).
What does the Xilinx timing report say? Have you constrained the
clock correctly (or indeed at all :)?

Synplify's report is an educated guess on the part of the tools.
Xilinx's represents what they think the absolute worst-case is, so if
it thinks you meet your timing constraints, then any chip you get will
run that design. Of course, that depends on your constraints being
right :)

You say this design has a history of working fine - what's changed
since then?

Cheers,
Martin

--
martin.j.thompson@trw.com
TRW Conekt - Consultancy in Engineering, Knowledge and Technology
http://www.conekt.net/electronics.html
 
On 19 Mar, 04:13, "Marty Ryba" <martin.ryba.nos...@verizon.net> wrote:
signal my_counter : std_logic_vector(COUNT_WIDTH-1 downto 0);

countdown_process: process (CLK)
begin
  if rising_edge(CLK) then        -- do everything synchronous
    if RESET = '1' then
      my_counter <= ZEROIZE_COUNT;
    elsif counter_load = '1' then
      my_counter <= input_bus(COUNT_WIDTH-1 downto 0);
    elsif making_data = '1' then
      my_counter <= my_counter - '1';
    end if;
  end if;  -- CLK
end process countdown_process;

Another process block checks that this counter is nonzero and a few other
requirements to set the value of making_data true or false.

Just some ideas:
Are all control signals synchronous to "CLK" ? Is the clock "clean"?
What about supply voltage (DC-level, ripple, decoupling etc)? Could it
be a board layout problem (insufficient ground plane, crosstalk)?
If the long carry chain is the problem, you may divide the counter
into 2 smaller counters with a pipelined carry chain.
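
A rough sketch of the split-counter idea, under stated assumptions: the entity name split_down_counter, the 16/16 split and the borrow register are hypothetical choices, not anything from the original design. The low half decrements every enabled cycle; when it wraps, a registered borrow decrements the high half one clock later, so neither carry chain is longer than half the counter.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity split_down_counter is
  generic (
    LO_WIDTH : positive := 16;  -- assumed split point
    HI_WIDTH : positive := 16
  );
  port (
    CLK          : in  std_logic;
    RESET        : in  std_logic;
    counter_load : in  std_logic;
    making_data  : in  std_logic;
    input_bus    : in  std_logic_vector(HI_WIDTH+LO_WIDTH-1 downto 0);
    count        : out std_logic_vector(HI_WIDTH+LO_WIDTH-1 downto 0)
  );
end entity split_down_counter;

architecture rtl of split_down_counter is
  signal lo     : unsigned(LO_WIDTH-1 downto 0) := (others => '0');
  signal hi     : unsigned(HI_WIDTH-1 downto 0) := (others => '0');
  signal borrow : std_logic := '0';  -- pipelined borrow from lo into hi
begin
  process (CLK)
  begin
    if rising_edge(CLK) then
      if RESET = '1' then
        lo     <= (others => '0');
        hi     <= (others => '0');
        borrow <= '0';
      elsif counter_load = '1' then
        lo     <= unsigned(input_bus(LO_WIDTH-1 downto 0));
        hi     <= unsigned(input_bus(HI_WIDTH+LO_WIDTH-1 downto LO_WIDTH));
        borrow <= '0';
      else
        -- Apply a borrow that was registered on the previous clock.
        if borrow = '1' then
          hi <= hi - 1;
        end if;
        borrow <= '0';
        if making_data = '1' then
          lo <= lo - 1;
          if lo = 0 then
            borrow <= '1';  -- lo wraps now; hi follows one clock later
          end if;
        end if;
      end if;
    end if;
  end process;

  count <= std_logic_vector(hi & lo);
end architecture rtl;
```

The price is that the upper half of count is stale for one clock after each wrap of the lower half, so anything that compares the full count (a zero detector, for instance) has to allow for that.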

/Peter
 
On Wed, 19 Mar 2008 03:13:40 GMT, "Marty Ryba" <martin.ryba.nospam@verizon.net>
wrote:

Hello everyone,

After banging our heads for last few weeks (sometimes literally), I
figure I'll query the group of experts here. We have a design that is
functionally correct (ModelSim test bench) but it appears to be very iffy
when it gets on the real chip.

the timing is marginal for a specified clock of 100 MHz around this path,
but the chip is really running at 66 MHz (PCI clock).
One thought: Are you using anything like a DCM or (since it's a Virtex) DLL to
clean up the clock?

The PCI clock can be stopped, and switched between 33 and 66 MHz, during a PC's boot
sequence (I have watched this on a scope). This can confuse a DLL; you may need
some means to reset it after any such change, or use an alternative (constant
frequency) clock.

- Brian
 
On Mar 18, 11:13 pm, "Marty Ryba" <martin.ryba.nos...@verizon.net>
wrote:
Hello everyone,

    After banging our heads for last few weeks (sometimes literally), I
figure I'll query the group of experts here. We have a design that is
functionally correct (ModelSim test bench) but it appears to be very iffy
when it gets on the real chip. I have a couple copies of "identical" boards
with Virtex2-1000 chips on them. I'll check again soon, but I believe they
are -6 parts. We've been synthesizing the design in Synplify Pro (v7.5
though others are available; this design has some history of working fine).
Which design are you referring to here that has some history of
working fine? The PCB design or the FPGA design?

Sometimes it works on one or more boards, other times after I load it (and
verify) with iMPACT this counter acts screwy (it messes up critical timing,
and it also looks all wrong on Chipscope).
You might want to clarify what you mean by 'messes up critical timing'
and 'looks all wrong'. I'm assuming here that the counter starts off
correctly and just doesn't decrement properly, which would lead one to
suspect problems related in some way to the signal 'making_data',
but again you should clarify this.

In any case, the problem is one of the following (not in any
particular order):
1. Inadequate power supply. Check the Vcc at the chip with a high
speed scope and good probing techniques, make sure that you're within
spec. If only 'slightly' out that's not likely the cause of your
symptoms but is still something that needs to be addressed.
2. Timing. Are the signals 'RESET', 'counter_load' and 'making_data'
all synchronized to 'CLK'? As I said, I'm not sure exactly which symptoms
you're seeing, but I'm guessing that it resets and initializes
properly and just doesn't count correctly, in which case 'making_data'
is the likely culprit.
3. More timing. There is more to timing than just clock frequency.
There are also setup/hold time requirements. Do 'counter_load' or
'making_data' come from external I/O pins? If so, then
- Do the signals on the board meet the timing requirements that you
specified?
- You did specify a timing requirement on the inputs?
- Did the computed setup time from the P&R timing report (not
Synplify's estimated timing) meet all requirements?
4. Yet more timing. 'CLK' isn't a gated clock is it?
5. Clock signal quality. Put a scope on the input clock. Is it
absolutely monotonic through the entire Vih voltage range? Both
edges? No dips and bounces anywhere between Vih(min) and Vih(max)?

Go through the above checklist and I'm fairly confident that you'll
find the cause.

We
tried trimming the size of the counter from 32 bits down to 20 and it seems
to help some.
This is a symptom of failing timing (see items #2, 3 and 4) or of double
clocking (see item #5 above).

Another process block checks that this counter is nonzero and a few other
requirements to set the value of making_data true or false.
This is a process block that is clocked by 'CLK' I presume?

How about the inputs into that process block? The same sort of timing
considerations mentioned previously apply here as well. Violating
timing may cause 'making_data' to miss or double hit occasionally.

I notice the RTL shows this using an adder (it fans out making_data to
COUNT_WIDTH bits so that it becomes either the 0 or -1 to add). Isn't there
a simpler structure to define a count(down) counter? I sure can buy a simple
counter in 74xx series logic. I notice neither arith_std or numeric_std
define special operators (a la C's ++ and -- operators) to specify
increment/decrement, so maybe there is no simple way to create a counter
structure in fewer FPGA logic elements. Seems odd to me (non-EE).

The RTL viewer is a graphical view of your SOURCE code; it is not a
view of the final routed design. Have no fear, the adder that adds -1
and the mux that selects the final output will get optimized
appropriately.

Good luck

Kevin Jennings
 
On Mar 19, 8:38 am, Brian Drummond <brian_drumm...@btconnect.com>
wrote:
One thought: Are you using anything like a DCM or (since it's a Virtex) DLL to
clean up the clock?

The PCI clock can be stopped, and switched between 33 and 66 MHz, during a PC's boot
sequence (I have watched this on a scope). This can confuse a DLL; you may need
some means to reset it after any such change, or use an alternative (constant
frequency) clock.

- Brian
My experience agrees with the common wisdom that it's almost always
clocks or power. But sometimes in mysterious, non-obvious ways :)

Basic things to check on the timing side:

- how do you *know* that this is the process that's giving you
trouble? I'm assuming that there's a lot more logic than that in a
V2-1000 :) Step back and ask yourself why you're so sure this is the
culprit.

- do you know that the clock is OK inside the chip? Have you brought
the clock out to a pin (or even a divide-by-two or by-four version,
using a simple flop, *not* a DCM/DLL) and scoped it out?

- are any of the signals in the process generated or used by a
different clock? If so, fix that first, and make sure any clock
crossing logic is (1) designed right, and (2) laced with
do_not_replicate and do_not_retime synthesis attributes (details differ
by synthesis tool)

- Xilinx DLLs and DCMs tend to exhibit peculiar behaviour in that their
LOCKED output can assert even though the output clock is completely
unstable, or possibly just running at a harmonic, like half-rate. If
you are using one of these devices, you need to manually implement
your own "LOCKED" output: a frequency measurement module using a pair
of counters, one from the DCM and one from a known rock-solid (XTAL input)
reference, verifying that they count up at the right rate relative
to each other. Otherwise you need to keep hitting them with a reset.
Then once everything else is stable, reset the rest of the chip. This
problem can happen when input clocks are changed, as Brian mentioned,
and can be made worse if you have a cascade of DLLs, depending on
each other, producing a series of clocks.

- do you have enough synthesis constraints on the clock and any pin
inputs that may drive signals going into the state machine?

- have you tried running a post-place-and-route timing annotated
simulation to see if any timing errors show up there?
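
The home-made lock check from the list above could be sketched roughly as follows. This is one possible arrangement, not a reference design: instead of comparing two multi-bit counters directly, it sends a single divided-down toggle from the DCM clock domain into the reference domain, so only one bit crosses clocks. The entity name clock_rate_check and all three generics are hypothetical; the defaults are placeholders picked for a roughly 66 MHz DCM output against a 100 MHz reference.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity clock_rate_check is
  generic (
    DIV_CYCLES : positive := 64;   -- DCM clocks between toggles (placeholder)
    REF_MIN    : natural  := 90;   -- lowest acceptable ref_count (placeholder)
    REF_MAX    : natural  := 102   -- highest acceptable ref_count (placeholder)
  );
  port (
    dcm_clk : in  std_logic;  -- clock under test
    ref_clk : in  std_logic;  -- trusted crystal reference
    rate_ok : out std_logic
  );
end entity clock_rate_check;

architecture rtl of clock_rate_check is
  signal dcm_count : natural range 0 to DIV_CYCLES-1 := 0;
  signal toggle    : std_logic := '0';
  signal sync      : std_logic_vector(2 downto 0) := (others => '0');
  signal ref_count : natural range 0 to REF_MAX+1 := 0;
  signal ok        : std_logic := '0';
begin
  -- Counter 1: divide the clock under test down to a slow toggle.
  dcm_side : process (dcm_clk)
  begin
    if rising_edge(dcm_clk) then
      if dcm_count = DIV_CYCLES-1 then
        dcm_count <= 0;
        toggle    <= not toggle;
      else
        dcm_count <= dcm_count + 1;
      end if;
    end if;
  end process dcm_side;

  -- Counter 2: measure the toggle spacing in reference clocks.
  ref_side : process (ref_clk)
  begin
    if rising_edge(ref_clk) then
      sync <= sync(1 downto 0) & toggle;   -- synchronize the toggle
      if sync(2) /= sync(1) then           -- toggle edge seen
        -- ref_count holds the spacing in ref clocks, minus one
        if ref_count >= REF_MIN and ref_count <= REF_MAX then
          ok <= '1';
        else
          ok <= '0';
        end if;
        ref_count <= 0;
      elsif ref_count > REF_MAX then       -- no edge for too long: saturate
        ok <= '0';
      else
        ref_count <= ref_count + 1;
      end if;
    end if;
  end process ref_side;

  rate_ok <= ok;
end architecture rtl;
```

A clock running at half rate doubles the toggle spacing and a stopped clock never toggles, so in both cases rate_ok drops; it could then be used to hold the DCM in reset. The very first measurement after power-up is not meaningful, so a real design would probably want a few consecutive good windows before trusting it.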

Good luck,

- Kenn
 
On Mar 18, 8:13 pm, "Marty Ryba" <martin.ryba.nos...@verizon.net>
wrote:
I notice the RTL shows this using an adder (it fans out making_data to
COUNT_WIDTH bits so that it becomes either the 0 or -1 to add). Isn't there
a simpler structure to define a count(down) counter? I sure can buy a simple
counter in 74xx series logic. I notice neither arith_std or numeric_std
define special operators (a la C's ++ and -- operators) to specify
increment/decrement, so maybe there is no simple way to create a counter
structure in fewer FPGA logic elements. Seems odd to me (non-EE).

Marty, for the past 15 years, all Xilinx FPGAs have had a built-in
ripple-carry structure which assures that a counter takes only one
flip-flop per bit, up to 32 bits and beyond.
If I were you, I would test this counter with a clean clock and adjust
the clock frequency up until the counter fails. That should eliminate
(or illuminate) the alleged frequency dependence.
If your clock comes from PCI, then you should be very leery about
running it through a DCM, which inherently does not tolerate abrupt
frequency changes or excessive jitter. But the raw counter itself is
very forgiving.
Peter Alfke, Xilinx
 
On Mar 18, 9:03 pm, Mike Treseler <mike_trese...@comcast.net> wrote:
Marty Ryba wrote:
Another process block checks that this counter is nonzero and a few other
requirements to set the value of making_data true or false.

Ideas/suggestions?

Combine the two processes into one.

-- Mike Treseler
Mike is spot-on here. In my experience, the comparison(s) is (are)
the problem, not the counter operation.

You need to be a little clever and predict when the counter will be
zero (or whatever):

elsif making_data = '1' then
  my_counter <= my_counter - '1';

  if my_counter = "0001" then  -- If it's one, and we're counting down,
    will_be_zero <= '1';       -- it will be zero by the time we latch this
  end if;
end if;
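
Fleshed out into a complete process, the look-ahead idea might look like the sketch below. It is written with numeric_std rather than the original std_logic_vector arithmetic, and the entity name countdown_with_flag, the zero_flag register and the all-zeros reset value are assumptions for illustration. The flag is kept equal to "my_counter is zero" on every path (reset, load, decrement, hold), so other logic can use one registered bit instead of a wide comparator.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity countdown_with_flag is
  generic (COUNT_WIDTH : positive := 20);  -- assumed width
  port (
    CLK          : in  std_logic;
    RESET        : in  std_logic;
    counter_load : in  std_logic;
    making_data  : in  std_logic;
    input_bus    : in  std_logic_vector(COUNT_WIDTH-1 downto 0);
    is_zero      : out std_logic
  );
end entity countdown_with_flag;

architecture rtl of countdown_with_flag is
  signal my_counter : unsigned(COUNT_WIDTH-1 downto 0) := (others => '0');
  signal zero_flag  : std_logic := '1';  -- registered copy of (my_counter = 0)
begin
  process (CLK)
  begin
    if rising_edge(CLK) then
      if RESET = '1' then
        my_counter <= (others => '0');   -- assumed reset value
        zero_flag  <= '1';
      elsif counter_load = '1' then
        my_counter <= unsigned(input_bus);
        if unsigned(input_bus) = 0 then  -- flag follows the value being loaded
          zero_flag <= '1';
        else
          zero_flag <= '0';
        end if;
      elsif making_data = '1' then
        my_counter <= my_counter - 1;
        if my_counter = 1 then           -- it will be zero after this edge
          zero_flag <= '1';
        else
          zero_flag <= '0';              -- also covers wrapping past zero
        end if;
      end if;
    end if;
  end process;

  is_zero <= zero_flag;
end architecture rtl;
```

The wide compare is still there, but it now sits between two registers instead of in front of whatever logic generates making_data. The compare on input_bus during a load is only as good as the timing of input_bus itself.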

Hope that helps,
G.
 
On Mar 19, 11:16 am, ghel...@lycos.com wrote:
Mike is spot-on here.  In my experience, the comparison(s) is (are)
the problem, not the counter operation.

You need to be a little clever and predict when the counter will be
zero (or whatever):

     elsif making_data = '1' then
       my_counter <= my_counter - '1';

       if my_counter = "0001" then  -- If it's one, and we're counting down,
         will_be_zero <= '1';       -- it will be zero by the time we latch this
       end if;
     end if;

Hope that helps,
G.
More generally, and from a hardware point of view:
When the counter approaches zero, the most significant bits go to zero
first, and it is only the least significant bits that determine the
final count of zero. So most of the inputs to your big comparator of
32 bits are stable long before the zero count is reached. That means
you can divide up, and even pipeline, your 32-bit comparator. You might
also be clever and detect underflow at the most significant bit, and
avoid the detector AND gate altogether (but after one additional
clock pulse).
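
The underflow trick can be sketched like this, assuming the counter is made one bit wider than the value loaded into it. The entity name underflow_counter and the done output are hypothetical. Loading clears the extra top bit; decrementing from zero sets it, and that single bit replaces the wide zero comparator.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity underflow_counter is
  generic (COUNT_WIDTH : positive := 20);  -- assumed width of the loaded value
  port (
    CLK          : in  std_logic;
    RESET        : in  std_logic;
    counter_load : in  std_logic;
    making_data  : in  std_logic;
    input_bus    : in  std_logic_vector(COUNT_WIDTH-1 downto 0);
    done         : out std_logic
  );
end entity underflow_counter;

architecture rtl of underflow_counter is
  -- One extra bit on top: it is '0' after a load and becomes '1'
  -- when the counter is decremented past zero.
  signal cnt : unsigned(COUNT_WIDTH downto 0) := (others => '0');
begin
  process (CLK)
  begin
    if rising_edge(CLK) then
      if RESET = '1' then
        cnt <= (others => '0');
      elsif counter_load = '1' then
        cnt <= '0' & unsigned(input_bus);
      elsif making_data = '1' and cnt(COUNT_WIDTH) = '0' then
        cnt <= cnt - 1;  -- stops by itself once the top bit is set
      end if;
    end if;
  end process;

  done <= cnt(COUNT_WIDTH);  -- single flip-flop, no wide compare
end architecture rtl;
```

As noted above, done goes high one clock after the count reaches zero, so either load N-1 instead of N or let the surrounding logic tolerate the extra cycle.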
Peter Alfke, Xilinx
 
kennheinrich@sympatico.ca writes:

- Xilinx DLLs and DCM tend to exhibit peculiar behaviour in that their
LOCK output can assert even though the output clock is completely
unstable, or possibly just running at a harmonic, like half-rate.
Really!? What's the point of the LOCKED output then? Doesn't that flaw
make them a bit useless?

Cheers,
Martin

--
martin.j.thompson@trw.com
TRW Conekt - Consultancy in Engineering, Knowledge and Technology
http://www.conekt.net/electronics.html
 
