Describing pipelined hardware

  • Thread starter Jonathan Bromley
"KJ" <kkjennings@sbcglobal.net> wrote in message
news:pWbig.42644$fb2.9829@newssvr27.news.prodigy.net...
Minor error in the equation for 'c' in the 'Y' process,
You're right, of course - whoops! I had an async interrupt half-way through
writing that post and I obviously suffered some stack corruption on the way
back.

but simple enough to try on a few different tools...
That would be really interesting - do post your results if you can.

You might also try:

Z: process (clock)
begin
   if rising_edge(clock) then
      c <= (d and a and b) or (c and not (a and b));
   end if;
end process;

(Almost certainly just a 4-input function generator plus register).

Oh, and apologies to anyone reading this on comp.lang.verilog and wondering
why some weirdo keeps posting code snippets in a superior HDL ;-) I'm sure
the equivalent Verilog constructs would be treated similarly by the
synthesizer.

Cheers,

-Ben-
 
Weird synthesis tricks...

"Hold my beer and watch this!"

process (clk)
   variable a, b : std_logic;
begin
   if rising_edge(clk) then
      a := b;
      b := input;
      out1 <= a xor b; -- registered xor of combo a, b
   end if;
   out2 <= a xor b; -- combo xor of registered a, b
end process;

Both out1 and out2 simulate _exactly_ the same (including down to the
delta cycle).

If I comment out the out1 assignment, out2 is a combinatorial xor of
registered a and b values.

If I comment out the out2 assignment, out1 is a registered xor of
combinatorial values for a and b (i.e. b and input).

If I leave both in, Synplicity recognizes them as being the same, and
makes both of them share the registered xor implementation from out1.
Note that retiming was not turned on for this exercise, and there was
no mention of retiming out2.

Andy


Mike Treseler wrote:
KJ wrote:
....
Try what? Do you have two examples of 'good' and functionally
equivalent code where the style makes a difference?

No. I don't think the style makes any difference.
I have never seen any "different synthesis results"

-- Mike Treseler
 
Andy wrote:
Weird synthesis tricks...
"Hold my beer and watch this!"
Sorry, I got distracted and finished it.

process (clk)
variable a, b : std_logic;
begin
if rising_edge(clk) then
a := b;
b := input;
out1 <= a xor b; -- registered xor of combo a, b
end if;
out2 <= a xor b; -- combo xor of registered a, b
end process;

Both out1 and out2 simulate _exactly_ the same (including down to the
delta cycle).
If I comment out the out1 assignment, out2 is a combinatorial xor of
registered a and b values.
To maintain compatibility with my a_rst template,
I keep all logic inside the main IF
and only wires outside. This also eliminates
the possibility of unregistered outputs.

I would code your example as:
....
   variable a_v, b_v, out1_v : std_logic;
begin
   if rising_edge(clk) then
      a_v    := b_v;
      b_v    := input; -- expect input-[dq]-
      out1_v := a_v xor b_v;
   end if;
   out1 <= out1_v;
end process;
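
For reference, a minimal sketch of what the full a_rst-style version of the same
process might look like; the reset name, its active-high polarity, and the reset
values are assumptions for illustration, not taken from the template itself:

process (clk, a_rst)
   variable a_v, b_v, out1_v : std_logic;
begin
   if a_rst = '1' then                 -- assumed asynchronous, active-high reset
      a_v    := '0';
      b_v    := '0';
      out1_v := '0';
   elsif rising_edge(clk) then         -- all logic inside the main IF
      a_v    := b_v;
      b_v    := input;
      out1_v := a_v xor b_v;
   end if;
   out1 <= out1_v;                     -- only a wire outside the IF
end process;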

-- Mike Treseler
 
KJ wrote:

OK, now if all the stages are designed say to adhere to Altera's Avalon
specification (as an example, not a sales pitch) then both stage 3 and stage
25 would be designed with a master interface for accessing the slave memory
and if your above statement is true then you would simply find that the
stage 3 read/write output signals do not happen to be set at the same time
as the stage 25 read/write output signals. That being the case, one can
- Simply add an assert to validate during simulation that this condition is
never violated.
That is too late a stage to detect that kind of problem if the violation should
not be there. It should have been caught already at the documentation phase.
By the simulation stage the code is already written. If the assumptions in the
documentation were incorrect, the block needs to be recoded, and that translates
to a slip in the schedule.

Also, it is hard to be sure that the simulation achieves 100% coverage.
Of course, assertions plus formal checking handle that side, if the tools can
handle the block.

- Or, detect and report the condition in a status bit
Reporting it in a status bit does not help. Maybe the implementation is in an ASIC
and there is no way to fix it anymore. One worrying trend with FPGAs is sloppiness
of design coming from the software side: "Let's just code this quickly, and why
should we simulate? We can test this in the lab, and we can always update the image."

- Or, cover yourself and realize that stage 3 and stage 25 are competing for
a shared resource and add a simple arbiter.
And if arbitration is needed, one of the stages has to stall, and that has to
be handled with buffering. Then we run into the interesting questions of
stall probabilities, the needed buffer sizes, etc.

If you did the design with this approach, you'd find that while you're
working on getting the stage 3 functionality up you would not need to know
or care about stage 25 (or any other stage). Same can be said for stage 25.
When it comes time to writing the logic that ties them all together (for the
most part the 'logic' is simply connecting the outputs of one stage to the
inputs to the next) the simple arbiter that you would need would cost at
most a single logic cell.
Plus the buffer memory to store the data during stalls if memory access is
arbitrated. And the system-level simulations of the performance penalties.

Those dependencies are really hard to handle.

Then don't do it the hard way ;)
Sometimes there is no way to code the functionality without resorting to
very tight control of the pipeline and its resources.

Formal model checking can
be a good tool to prove that hazards are not possible given the constraints
on the incoming data.

I've never happened to use them though. How good are they in practice and
how much work are they to use?
The tools are usable nowadays, but the design style matters quite a lot. If the
state is stored in big memories the tools are quite bad, but if the amount of
state information is small and kept in flip-flops the tools have an easier time.
In terms of capacity, model checkers are still purely block-level tools.

Also, the constraints for the design can be problematic to write, and the initial
state might be hard to get right with model checkers. Some tools address that area
nowadays with hybrid approaches: they have a simulator engine and formal tools
integrated, and the formal tool runs forward from the simulated states.

--Kim
 
"Kim Enkovaara" <kim.enkovaara@iki.fi> wrote in message
news:04ajg.1651$IW.870@reader1.news.jippii.net...
KJ wrote:

OK, now if all the stages are designed say to adhere to Altera's Avalon
specification (as an example, not a sales pitch) then both stage 3 and
stage 25 would be designed with a master interface for accessing the
slave memory and if your above statement is true then you would simply
find that the stage 3 read/write output signals do not happen to be set
at the same time as the stage 25 read/write output signals. That being
the case, one can
- Simply add an assert to validate during simulation that this condition
is never violated.

That is too late a stage to detect that kind of problem if the violation should
not be there. It should have been caught already at the documentation phase.
By the simulation stage the code is already written. If the assumptions in the
documentation were incorrect, the block needs to be recoded, and that translates
to a slip in the schedule.

Well, personally I don't think it's ever too late to detect a
problem....that's the first step in fixing it. And that would be the whole
point of adding the minimalist assert....to detect that incorrect assumption
that was not caught until simulation. Otherwise, what mechanism do you use
to 'catch' it in simulation? Looking at the waveforms? Adding the assert
to verify that something you know to be true really is true is the way to
go here.
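
To make that minimalist assert concrete, a sketch of the kind of check that could
sit in the interconnect is shown below; the signal names stage3_write and
stage25_write are hypothetical stand-ins for the two masters' access strobes
toward the shared memory:

-- Sketch only: stage3_write and stage25_write are made-up names for the two
-- masters' access strobes toward the shared memory.
assert not (stage3_write = '1' and stage25_write = '1')
   report "stage 3 and stage 25 accessed the shared memory simultaneously"
   severity error;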

As for the recoding, in this case it would not be either of the individual
blocks that needs recoding (if you had followed my approach of using a
standardized I/O model for each block's interfaces) but the interconnect
logic that interfaces stages 3 and 25 to the memory....in other words the
arbitration logic to the memory. The error is not that stages 3 and 25 need
access to a shared resource; the error was believing that they don't happen
to need simultaneous access to the memory and then basing design decisions
on that....and getting burned.

- Or, cover yourself and realize that stage 3 and stage 25 are competing
for a shared resource and add a simple arbiter.

And if arbitration is needed, one of the stages has to stall, and that has to
be handled with buffering. Then we run into the interesting questions of
stall probabilities, the needed buffer sizes, etc.

Well, the earlier statement that you gave as an example about 'complex'
dependencies said that this won't happen, so in theory an arbiter wouldn't be
needed; my point here was that you could cover the case where you
discover late that it really does happen by having the arbiter. In other
words, recognize the shared-resource architecture right up front and design
for it.

As for the stall probabilities and needed buffer size, I agree, figuring
that out is a normal part of designing any circuit that has multiple masters
competing for a shared resource as it is in the example that you posed.
Better to accept that this is what you have right up front and deal with it
properly. You seem to be implying that perhaps with a 'clever' design
you can avoid having the stages compete for the memory at the same time;
in certain situations that may be true, but relying on it generally adds
undue risk (i.e. the uncaught condition that isn't found until too late)
and is certainly not something I would recommend for an ASIC design, where
the cost of fixing it later is far larger than with an FPGA. The need to possibly
stall and/or figure out buffer sizes is not a consequence of the I/O model;
it is a result of having two stages compete for a shared resource, which was
a given in your example.

If you did the design with this approach, you'd find that while you're
working on getting the stage 3 functionality up you would not need to
know or care about stage 25 (or any other stage). Same can be said for
stage 25. When it comes time to writing the logic that ties them all
together (for the most part the 'logic' is simply connecting the outputs
of one stage to the inputs to the next) the simple arbiter that you would
need would cost at most a single logic cell.

Plus the buffer memory to store the data during stalls if memory access is
arbitrated. And the system-level simulations of the performance penalties.

Not sure what your point here is. If the basic architecture requires stages
3 and 25 to access a shared resource (i.e. memory in this case) then you
have to arbitrate between the two. The a priori knowledge that the stages
'shouldn't need' simultaneous access would simply mean that the arbiter
itself would not need to be very fancy at all.
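
As a sketch of just how little such an arbiter can be, a fixed-priority version
might come down to two concurrent assignments. The request/grant names here are
made up for illustration, and the address/data muxing toward the memory (plus any
waitrequest handling) is omitted:

-- Sketch only: fixed priority, stage 3 wins on a tie; names are hypothetical.
stage3_grant  <= stage3_req;
stage25_grant <= stage25_req and not stage3_req;

The stage that loses the tie simply stalls (or buffers) until its grant arrives.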

Those dependencies are really hard to handle.

Then don't do it the hard way ;)

Sometimes there is no way to code the functionality without resorting to
very tight control of the pipeline and its resources.

And standardizing on a good and scalable I/O model for getting data in and
out will not hinder that in any way.

Kevin Jennings
 
Re-labelled your three proposals as 'W', 'X' and 'Y' and added yet
another form, 'Z'. Of the four forms for the input there were two
different forms that popped out after synthesis when varying the tool
and the targeted device.

Forms 'W', 'X' and 'Z' as input always got synthesized to a netlist
of the form of 'W'.

Form 'Y' tended to get synthesized into the form of 'Z' as written (in fact that is
why I added 'Z': to make it easier to describe the results, even though
using 'Z' as input always produced 'W' as a result...go figure).
Sometimes the tools did actually see that 'Y' really is equivalent to
'W' and implemented it in that fashion.

Form 'Z' as written implies two flip flops and, using Synplify targeting
Spartan 3, that is what popped out. All the other times that form 'Y'
produced a netlist of the form of 'Z', the tool was able to figure out
that only one flip flop was needed.

Using Synplify 8.1 to target Xilinx Spartan, Virtex, XC3000,
Altera Stratix, Stratix II, Lattice ISPXPGA, Actel PA, or ProASIC3E all
produced the same results: outputs 'c1', 'c2' and 'c4' all implemented
in the form of 'W'; output 'c3' implemented in the form of 'Z'.

Using Synplify 8.1 to target Spartan 3E all four outputs were
implemented in the form of 'W'.

Using Quartus 5.0 to target Altera Stratix, Stratix II, Cyclone or
Cyclone II all four outputs were implemented in the form of 'W'.

I was having trouble getting ISE going so I didn't try using that.

So I guess at this point one can conclude that, using Synplify 8.1,
the 'style' of the code can affect the synthesis results. One could
also conclude that Quartus 5.0 seems not to let the 'style' affect the
synthesis results. Both statements carry the caveat that I only tried
this over a limited set of devices.

Kevin Jennings

See below for the actual code that I used
------- START OF VHDL ----------
library ieee;
use ieee.std_logic_1164.all;

entity Simple is
   port (
      clock : in  std_ulogic;
      a     : in  std_ulogic;
      b     : in  std_ulogic;
      d     : in  std_ulogic;
      c1    : out std_ulogic;
      c2    : out std_ulogic;
      c3    : out std_ulogic;
      c4    : out std_ulogic);
end Simple;

architecture RTL of Simple is
   signal c2_int         : std_ulogic;
   signal c3_int         : std_ulogic;
   signal c4_int         : std_ulogic;
   signal c4_int_delayed : std_ulogic;
begin
   W: process (clock)
   begin
      if rising_edge(clock) then
         if (a and b) = '1' then
            c1 <= d;
         end if;
      end if;
   end process;

   X: process (clock)
   begin
      if rising_edge(clock) then
         if a = '1' then
            c2_int <= (d and b) or (c2_int and not b);
         end if;
      end if;
   end process;
   c2 <= c2_int;

   Y: process (clock)
   begin
      if rising_edge(clock) then
         c3_int <= (d and a and b) or (c3_int and not (a and b));
      end if;
   end process;
   c3 <= c3_int;

   Z: process (clock)
   begin
      if rising_edge(clock) then
         c4             <= c4_int;
         c4_int_delayed <= c4_int;
      end if;
   end process;
   c4_int <= d when ((a and b) = '1') else c4_int_delayed;
end RTL;
------- END OF VHDL ----------
 
On Tue, 06 Jun 2006 17:25:32 +0100, Jonathan Bromley
<jonathan.bromley@MYCOMPANY.com> wrote:

So, here's my question: When writing pipelined designs,
what do all you experts out there do to make the overall
data and control flow as clear and obvious as possible?
Thanks to all the contributors for some fascinating responses
and insights. At least you've given me some level of
confidence that I'm not missing something painfully
obvious...
--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan.bromley@MYCOMPANY.com
http://www.MYCOMPANY.com

The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.
 
Ben Jones wrote:

For timing diagrams I have a neat web-based tool I wrote myself (based on an
idea I stole shamelessly from Frank Vorstenbosch). It's not actually on the
Some people have no shame.

Frank
 
Hi

A trick that will work in some cases (but not all) is to use concurrent
processes rather than a pipeline. So you would create a state machine
for the process, instance several of them and allow them to iterate
concurrently.

Obviously, this will tend to duplicate any costly resources, eg
multipliers, but you can separate those out into separate modules and
arrange for accesses to be sequentialised via an arbiter. So you end up
with the control logic in the processes, the critical resources being
out-boarded, and rather a complex dataflow that goes back and forth
between the processes and the resources.

This way, however, the "program logic" of the entire process is
captured in a single state machine implementation which is about as
clean as you can get for arbitrary algorithms.
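
As a rough sketch of that structure (everything here is made up for illustration:
the names, the 16-bit operand width, the trivial per-worker state machine, and the
fact that the shared multiplier is folded into the arbiter process rather than
out-boarded as its own module):

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity shared_mult_demo is
   generic (N : positive := 4);
   port (
      clk    : in  std_ulogic;
      start  : in  std_ulogic_vector(N-1 downto 0);
      done   : out std_ulogic_vector(N-1 downto 0);
      result : out unsigned(31 downto 0));   -- product of the last granted worker
end shared_mult_demo;

architecture sketch of shared_mult_demo is
   type u16_array is array (0 to N-1) of unsigned(15 downto 0);
   signal req     : std_ulogic_vector(N-1 downto 0) := (others => '0');
   signal grant   : std_ulogic_vector(N-1 downto 0) := (others => '0');
   signal operand : u16_array := (others => (others => '0'));
begin
   -- One small state machine per worker, all iterating concurrently.
   -- Each worker only reaches the costly multiplier via req/grant.
   workers : for i in 0 to N-1 generate
      worker : process (clk)
         type state_t is (idle, waiting);
         variable state : state_t := idle;
         variable count : unsigned(15 downto 0) := (others => '0');
      begin
         if rising_edge(clk) then
            done(i) <= '0';
            case state is
               when idle =>
                  if start(i) = '1' then
                     count      := count + 1;   -- stand-in for per-worker data
                     operand(i) <= count;
                     req(i)     <= '1';
                     state      := waiting;
                  end if;
               when waiting =>
                  if grant(i) = '1' then        -- shared multiply has completed
                     req(i)  <= '0';
                     done(i) <= '1';
                     state   := idle;
                  end if;
            end case;
         end if;
      end process;
   end generate;

   -- The shared multiplier plus a fixed-priority arbiter that sequentialises
   -- access to it (the lowest-numbered requester wins).
   arbiter : process (clk)
   begin
      if rising_edge(clk) then
         grant <= (others => '0');
         for i in 0 to N-1 loop
            if req(i) = '1' then
               grant(i) <= '1';
               result   <= operand(i) * operand(i);
               exit;
            end if;
         end loop;
      end if;
   end process;
end sketch;

The point of the structure is that each worker's control flow stays a plain,
readable state machine, and only the req/grant handshake knows anything about
the shared resource.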

The real big problem here is that there's no longer a guarantee that
processes will exhibit side effects or complete in the same order you
kicked them off.

Just some food for thought.

Cheers, John
 
Hi, Frank. :)

"Frank A. Vorstenbosch" <frank@kingswood-consulting.co.uk> wrote in message
news:128tv4ffrpl0gf9@corp.supernews.com...
Ben Jones wrote:

For timing diagrams I have a neat web-based tool I wrote myself (based on an
idea I stole shamelessly from Frank Vorstenbosch). It's not actually on the

Some people have no shame.

Frank
If you have no shame, but you freely admit that you have no shame, does it
really count? :)

FWIW the main difference was I used a single image to hold all the glyphs
and some CSS to change the background offset within each table cell, rather
than having a sea of tiny image files. Using the images as backgrounds,
rather than <img>-type objects, has the advantage that you can add (small)
annotations to show the current value of a bus (or what-have-you).

Cheers,

-Ben-
 
