For loop synthesis and repeated structures, any examples?

Marwan
Peace,

Context:

I am trying to discuss and demonstrate the inefficiency of the for
loop in synthesis and hardware as compared to an equivalent counter
implementation.

Main body:

According to The Verilog Golden Reference Guide, the for loop
synthesizes into repeated hardware structures, provided the loop
bounds are fixed.

I am currently having difficulty in coding an example module which
demonstrates this in synthesis. Does anyone have some simple example
code that they could share?

I am using Xilinx ISE to do the simulation, synthesis, testing, etc.

My understanding of a for loop is that in hardware/synthesis it
'unrolls' into repeated structures, which is obviously extremely
wasteful of resources.

It is also my understanding that a more efficient manner of
implementing a for loop, one which allows every iteration to be
controlled by a clock edge and which does not repeat structures, is
to use counters with bit select to replace for loops and nested for
loops. This would be the second example code that I will need to use
to demonstrate more efficient coding of for-like functionality. This
code would be a modification of the code using the for loop.

Summary:

I am looking for synthesizable example verilog code of a module which
demonstrates the production of repeated hardware structures using a
for loop.

Of currently secondary importance is synthesizable sample code which
reproduces the functionality of the for loop code without repeated
hardware structures, but using counters.

Thanks in advance to any who attempt to answer.
 
Marwan wrote:

It is also my understanding that a more efficient manner of
implementing a for loop, one which allows every iteration to be
controlled by a clock edge and which does not repeat structures, is
to use counters with bit select to replace for loops and nested for
loops. This would be the second example code that I will need to use
to demonstrate more efficient coding of for-like functionality. This
code would be a modification of the code using the for loop.

Here is an example of using a synchronous
clock enable to control a begin/end block of statements.
See begin : a_carry and
begin : b_carry here:
http://mysite.verizon.net/miketreseler/count_enable.v
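
(In case that link is no longer reachable, here is a minimal sketch
of the same idea: a synchronous clock enable gating a named block of
register updates. The module and signal names are illustrative only,
not taken from that file.)

  module count_enable_sketch (
    input            clock,
    input            enable,  // synchronous clock enable
    output reg [3:0] count );
    always @(posedge clock)
      if (enable) begin : a_count  // whole block updates only when enabled
        count <= count + 1;
      end
  endmodule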

-- Mike Treseler
 
On Wed, 28 May 2008 09:41:00 -0700 (PDT), Marwan wrote:

I am trying to discuss and demonstrate the inefficiency of the for
loop in synthesis and hardware as compared to an equivalent counter
implementation.

Do we understand from this that you have reached your conclusion
already, and seek evidence retrospectively?

Main body:

According to The Verilog Golden Reference Guide, the for loop
synthesizes into repeated hardware structures, provided the loop
bounds are fixed.

I am currently having difficulty in coding an example module which
demonstrates this in synthesis. Does anyone have some simple example
code that they could share?

If it's our Golden Reference Guide you're talking about, then
you'll find a short simple example immediately after the
paragraph on "Synthesis" that you quote above. Here it is
again, fully dressed-up in a module so that you can synthesise
it without any further work:

  module for_loop_demonstrator (
    input [3:0] A, B,
    output reg [3:0] F,
    output reg V );
    always @* begin : iterated_logic
      integer I;
      V = 0;
      for (I=0; I<4; I=I+1) begin
        F[I] = A[I] & B[3-I];
        V = V ^ A[I];
      end
    end
  endmodule

My understanding of a for loop is that in hardware/synthesis it
'unrolls' into repeated structures, which is obviously extremely
wasteful of resources.

I don't know where the "extremely" value-judgment comes from.
The loop is in every way identical to the equivalent unrolled
code. It builds parallel, single-cycle hardware. If that's
what my design needs, then the 'for' loop is a simple, clean
way of describing it.

It is also my understanding that a more efficient manner of
implementing a for loop, one which allows every iteration to be
controlled by a clock edge and which does not repeat structures, is
to use counters with bit select to replace for loops and nested for
loops. This would be the second example code that I will need to use
to demonstrate more efficient coding of for-like functionality. This
code would be a modification of the code using the for loop.

You're thinking like a programmer, not a hardware designer.
However, let's try to do the same thing sequentially instead
using a counter to provide the necessary state information.
Note that we can process one new set of data only on every
fourth clock cycle, whereas the parallel implementation can
process the entire data set on every cycle - the well-known
hardware speed/space tradeoff.

  module sequential_loop_demonstrator (
    input clock, reset, // needed to make the counter work
    input [3:0] A, B,
    output reg [3:0] F,
    output reg V,
    output reg result_valid );

    reg [1:0] I;    // loop counter
    reg xor_acc;    // accumulates the XOR results

    // Here's the datapath logic that you are so
    // keen to minimise. One XOR gate, one AND gate,
    // lots of bit-select logic to steer appropriate
    // values to their inputs.
    wire xor_result = xor_acc ^ A[I];
    wire and_result = A[I] & B[3-I];

    always @(posedge clock or posedge reset)
      if (reset) begin
        I <= 0;
        F <= 0;
        V <= 0;
        xor_acc <= 0;
        result_valid <= 0;
      end else begin
        I <= I + 1;
        F[I] <= and_result;
        if (I==3) begin
          // Spit out the calculation result, and start again
          result_valid <= 1;
          V <= xor_result;  // the accumulator folded with the final A[3] term
          xor_acc <= 0;
        end else begin
          result_valid <= 0;
          xor_acc <= xor_result;
        end
      end

  endmodule

I fear I have failed to provide the evidence needed
by your kangaroo court :)
--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan.bromley@MYCOMPANY.com
http://www.MYCOMPANY.com

The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.
 
On Wed, 28 May 2008 11:14:15 -0700 (PDT), Marwan wrote:

I did not mean that a for loop is converted into an inefficient
hardware design; I meant that the for loop IS an inefficient construct
in the sense that it would parallelise the hardware...

What's "inefficient" about that? The for-loop gives hardware
whose size and performance is identical to what would be obtained
by unrolling the loop by hand, and it is (evidently) a very
compact style of coding.

[parallel hardware]
which, I should
probably have stated, is what I do not want.

Ah. But that's a matter of scale. If you have N operations
to perform, you have a choice of performing all N in the same
clock cycle using N operator blocks, or taking N clock cycles
to perform them using one operator - or, indeed, many other
possible compromises between these two extremes. Of course
you want to use the smallest hardware that will give you the
performance you desire. The use, or non-use, of a "for" loop
has almost nothing to do with that design compromise. What
matters is the relative size of the operator logic and the
control logic that's needed to serialise its use. Large
blocks such as multipliers are a good example of something
that's worth serialising if you can. The ultimate example
is the main memory of a computer; it's so big that you have
no choice but to serialise access to it, and it's that
sequential access to memory that dominates performance
concerns for most computers.

Having actually synthesized your first code example, there is no
repeated structure produced, just one AND gate and one XOR.

But it *is* repeated! The code inside the for loop implies
a 2-input AND and a 2-input XOR gate; those are indeed
repeated 4 times. The XOR gets folded into a single LUT
on FPGAs, thanks to the synth tool seeking optimizations;
I don't know how you see only one AND gate when it has
four outputs to drive...

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

always @(posedge clk)
begin
  for (i=1; i<=5; i=i+1)
  begin
    c[i] = a[i] * b[i];
  end
end

and then produce on synthesis 5 multipliers so that the 5
multiplications happen in one cycle.

My next optimised example would then use a counter so that the 5
multiplications are performed one after the other, reusing the same
multiplier.

Sure. That's the obvious compromise I already described.
The sequential implementation must store intermediate
results somewhere, if each trip around the loop depends
on earlier trips; in your simple example it doesn't, so
it's merely a matter of steering the correct inputs to
the multiplier, and capturing its output into the right
place, on each clock cycle. You must decide for yourself
whether that extra steering logic is sufficiently small
to make the compromise favourable. You can also sometimes
do nice tricks with shift registers so that the multiplier
remains connected to the same registers at all times, but
the data slides into the appropriate place on each clock.
Whether this is practicable depends on the details of
your problem.
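
For your five-multiplier example, a minimal sketch of that compromise
might look like the following (a sketch only: the operands are packed
into flat vectors because plain Verilog does not allow array ports,
and all names are mine):

  module serial_mult (
    input             clk, reset,
    input      [24:0] a_flat,   // five 5-bit operands, packed
    input      [24:0] b_flat,
    output reg [49:0] c_flat,   // five 10-bit products, packed
    output reg        done );

    reg [2:0] i;  // counter providing the loop's state

    // Steering logic: select one pair of operands per cycle.
    wire [4:0] a_i = a_flat[i*5 +: 5];
    wire [4:0] b_i = b_flat[i*5 +: 5];

    // The single, shared multiplier.
    wire [9:0] product = a_i * b_i;

    always @(posedge clk)
      if (reset) begin
        i    <= 0;
        done <= 0;
      end else begin
        c_flat[i*10 +: 10] <= product;  // capture into the right slot
        if (i == 4) begin
          i    <= 0;
          done <= 1;  // all five products are now valid
        end else begin
          i    <= i + 1;
          done <= 0;
        end
      end
  endmodule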

Since you're only processing one data set on each
N clock cycles, you must ensure that the input data is
captured at an appropriate moment and the output result is
registered at an appropriate moment. A control state machine
is the classic way to do all this kind of stuff. Basically
you need to lay out a huge table describing what's happening
on each clock cycle - what data appears on what datapath
inputs, which registers are enabled for capturing their
inputs, which strobe outputs are asserted and so on.
Once you have this table or other cycle-by-cycle list
of activity you can start to design the corresponding
state machine.
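
For the five-multiply example, such a table might start out like this
(illustrative only):

  cycle | operands steered | register captured | strobes
  ------+------------------+-------------------+---------
    0   | a[1], b[1]       | c[1]              | (none)
    1   | a[2], b[2]       | c[2]              | (none)
   ...  |      ...         |       ...         |  ...
    4   | a[5], b[5]       | c[5]              | done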

If your algorithm is really complicated, with many branching
control paths and conditional operations, it may be preferable
to use some kind of CPU to manage the sequential activity
under control of a program. Sometimes it's a good idea to
design a specialised CPU targeted to controlling your
specialised datapath - the graphics processor folk do that
sort of thing all the time. Often, though, a standard
CPU - even a very simple one such as PicoBlaze - will do
what you need.

Ultimately I must ask: if you are so determined to serialise
your algorithm, why bother with FPGA implementation at all?
Couldn't you run it on a fast general-purpose CPU, or perhaps
on a DSP machine? FPGAs offer massively parallel computation
when each computational block is relatively simple; they offer
rich, wide connectivity between blocks. If you don't want
or need to use those attributes, it's far less pain to do
things in software.

Enjoy.
--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan.bromley@MYCOMPANY.com
http://www.MYCOMPANY.com

The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.
 
Do we understand from this that you have reached your conclusion
already, and seek evidence retrospectively?

Correct.


If it's our Golden Reference Guide you're talking about, then
you'll find a short simple example immediately after the
paragraph on "Synthesis" that you quote above.  Here it is
again, fully dressed-up in a module so that you can synthesise
it without any further work:

  module for_loop_demonstrator (
    input [3:0] A, B,
    output reg [3:0] F,
    output reg V );
    always @* begin : iterated_logic
      integer I;
      V = 0;
      for (I=0; I<4; I=I+1) begin
        F[I] = A[I] & B[3-I];
        V = V ^ A[I];
      end
    end
  endmodule

Indeed, and a useful guide it is too :).


I don't know where the "extremely" value-judgment comes from.
The loop is in every way identical to the equivalent unrolled
code.  

I did not mean that a for loop is converted into an inefficient
hardware design; I meant that the for loop IS an inefficient construct
in the sense that it would parallelise the hardware... which, I should
probably have stated, is what I do not want.

To clarify, my thinking is actually motivated by hardware. I have a
design in behavioral verilog, which is probably as un-optimised as it
can get and likely unsynthesizable in the extreme. Nevertheless, the
code (which is for a relatively complicated algorithm) is replete with
for loops which encapsulate many calculations like multiplications,
divisions and additions with ~20 bit values.

As I would like this code to actually reuse one multiplier, one
divider and one dedicated adder, I do not want to use a for loop which
would parallelise the calculation and therefore require multiple
multipliers, dividers etc...


You're thinking like a programmer, not a hardware designer.

See my previous paragraph.

...

Having actually synthesized your first code example, there is no
repeated structure produced, just one AND gate and one XOR... the
example I am looking for would, let's say, do the following:

reg [4:0] a [1:5];
reg [4:0] b [1:5];
reg [9:0] c [1:5];
integer i;

always @(posedge clk)
begin
  for (i=1; i<=5; i=i+1)
  begin
    c[i] = a[i] * b[i];
  end
end

and then produce on synthesis 5 multipliers so that the 5
multiplications happen in one cycle.

My next optimised example would then use a counter so that the 5
multiplications are performed one after the other, reusing the same
multiplier.

Any comments?

--

Thank you for the time you put into the response along with the
content of the response.

Peace
 
But it *is* repeated! The code inside the for loop implies
a 2-input AND and a 2-input XOR gate; those are indeed
repeated 4 times.  The XOR gets folded into a single LUT
on FPGAs, thanks to the synth tool seeking optimizations;
I don't know how you see only one AND gate when it has
four outputs to drive...

You are right... the foolish mistake I made was to look at the RTL
schematic as opposed to the technology schematic.
Apologies for that. I have not used that function in Xilinx ISE for
quite a while...

Thank you for your advice; it is important and relevant to what I need
to think about. And thank you for the time you took to write your
response...

Peace,
Marwan
 
Marwan <marwanboustany@gmail.com> writes:

Thank you for your advice; it is important and relevant to what I need
to think about. And thank you for the time you took to write your
response...

Although you seem to have at least partially gotten your answer, let
me suggest another alternative. Since it sounds like you have a
"software design", you are trying to synthesize, you might look into
companies that are building tools to do just that. I saw a couple of
presentations on such tools a couple of years back. I'm not sure
I remember the names exactly correctly, but "code blue", "forte" and
possibly "coware" stick in my mind. I'm certain that at least one of
those names should be a good lead, if you decide to look into this.

What these tools do, if I understand correctly, is look at a software
design and figure out some components (e.g. multipliers) that they
have in libraries and then figure out state machine logic to drive
those elements at various levels of parallelism. You then pick, from
the various implementations, the one that has the right area/speed
tradeoff for your application.

Hope this helps,
-Chris
 
Jonathan Bromley wrote:

What's "inefficient" about that? The for-loop gives hardware
whose size and performance is identical to what would be obtained
by unrolling the loop by hand, and it is (evidently) a very
compact style of coding.

What is bad about that is that people used to writing C expect
an iterative structure instead of unrolled code. (Not that C
compilers won't unroll small loops.) Note especially
what happens if the loop limits are variable instead of
constant.
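
(For example, a sketch of the usual workaround, with names of my own
choosing: a data-dependent bound is generally not synthesizable, so
the loop is unrolled to a compile-time maximum and each iteration is
guarded instead.)

  module var_bound_sum #(parameter W = 8, parameter MAX = 4) (
    input  [MAX*W-1:0] d,   // MAX packed W-bit values
    input  [2:0]       n,   // how many of them to sum (run-time value)
    output reg [W+1:0] acc );
    integer i;
    always @* begin
      acc = 0;
      for (i = 0; i < MAX; i = i + 1)  // constant bound: synthesizable
        if (i < n)                     // guard gives run-time behaviour
          acc = acc + d[i*W +: W];
    end
  endmodule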

-- glen
 
On May 29, 3:47 am, Chris F Clark <c...@shell01.TheWorld.com> wrote:
Marwan <marwanboust...@gmail.com> writes:
Thank you for your advice; it is important and relevant to what I need
to think about. And thank you for the time you took to write your
response...

Although you seem to have at least partially gotten your answer, let
me suggest another alternative. Since it sounds like you have a
"software design", you are trying to synthesize, you might look into
companies that are building tools to do just that. I saw a couple of
presentations on such tools a couple of years back. I'm not sure
I remember the names exactly correctly, but "code blue", "forte" and
possibly "coware" stick in my mind. I'm certain that at least one of
those names should be a good lead, if you decide to look into this.

What these tools do, if I understand correctly, is look at a software
design and figure out some components (e.g. multipliers) that they
have in libraries and then figure out state machine logic to drive
those elements at various levels of parallelism. You then pick, from
the various implementations, the one that has the right area/speed
tradeoff for your application.

Hope this helps,
-Chris

First, think of for-loops as efficient ways to CODE large, repetitive
and/or recursive, parallel hardware designs. They also allow more
flexible coding that can automatically adapt to differing widths (or
even quantities) of inputs and/or outputs. An RTL description must by
definition specify the cycle-accurate behavior of the design. If the
RTL takes multiple clock cycles to complete a task, so will the
hardware. Using a for-loop will not change that.
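
As a small illustration of that width flexibility (a sketch; the
names are mine), the same loop body serves any width via a parameter:

  module parity #(parameter W = 8) (
    input  [W-1:0] d,
    output reg     p );
    integer i;
    always @* begin
      p = 1'b0;
      for (i = 0; i < W; i = i + 1)
        p = p ^ d[i];   // unrolls into a W-input XOR tree
    end
  endmodule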

Some synthesis tools allow embedding wait statements (waiting on the
clock) inside a loop, which lets the loop execution span more than one
clock cycle. Such code turns into "implied" state machines in
hardware. As the tools get better about this, one could play around
with the addition or elimination of wait statements inside the loop to
optimize the performance and resources used by the design.
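
A sketch of that style, using Marwan's five multiplications (the
names and operand packing are mine; many RTL synthesizers reject
event controls inside loops, so treat this as tool-dependent):

  module implied_fsm_sketch (
    input         clk,
    input  [24:0] a_flat, b_flat );  // five packed 5-bit operands
    reg [49:0] c_flat;               // five packed 10-bit products
    integer i;
    // The event control inside the loop spreads the five
    // multiplications over five clock cycles, implying a state
    // machine with one shared multiplier.
    always begin
      for (i = 0; i < 5; i = i + 1) begin
        @(posedge clk);
        c_flat[i*10 +: 10] <= a_flat[i*5 +: 5] * b_flat[i*5 +: 5];
      end
    end
  endmodule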

Following up on Chris' post:

Mentor's Catapult C is a tool that takes a SW algorithm expressed as
an ordinary C function, and lets you play with different types of
input/output interfaces and storage, while trading latency, throughput
and resources (hardware) interactively to arrive at an optimal
hardware design to implement the algorithm. It also has very nice
hooks to allow using the same test routines written for the SW
function to exercise the hardware design in an HDL test bench.

It's pretty cool, but it ain't cheap!

Andy
 
glen herrmannsfeldt wrote:
Jonathan Bromley wrote:

What's "inefficient" about that? The for-loop gives hardware whose
size and performance is identical to what would be obtained
by unrolling the loop by hand, and it is (evidently) a very
compact style of coding.

What is bad about that is that people used to writing C expect
an iterative structure instead of unrolled code. (Not that C
compilers won't unroll small loops.) Note especially
what happens if the loop limits are variable instead of
constant.

Software guys *do* know about compile-time constants.
Embedded software guys know about interrupt routines
that update registers quickly at a fixed frequency.

-- Mike Treseler
 
Mike Treseler wrote:
glen herrmannsfeldt wrote:
Jonathan Bromley wrote:

What's "inefficient" about that? The for-loop gives hardware whose
size and performance is identical to what would be obtained
by unrolling the loop by hand, and it is (evidently) a very
compact style of coding.

What is bad about that is that people used to writing C expect
an iterative structure instead of unrolled code. (Not that C
compilers won't unroll small loops.) Note especially
what happens if the loop limits are variable instead of
constant.

Software guys *do* know about compile-time constants.
Embedded software guys know about interrupt routines
that update registers quickly at a fixed frequency.

I think you can't generalize; different people think
about things in different ways. For people who learned
logic design working with TTL gates, it isn't so hard to
think about logic as hardware. If your first experience
with logic design is verilog, it might be harder.

Even so, for small iteration counts it isn't hard
to instantiate a module the appropriate number of times.
I think I was doing that before for loops were added.

-- glen
 
glen herrmannsfeldt wrote:

I think you can't generalize, different people think
about things in different ways.

My point exactly.
Not all C writers will expect that
an iterative hardware description will make a CPU.

They may have to learn about gates and flops,
but tend to do well with the RTL viewer
and simulation tools once they do.

For people who learned
logic design working with TTL gates, it isn't so hard to
think about logic as hardware.

For people who learned
procedural coding with C, it isn't so hard to
visualize hardware from a synchronous block
with multiple variables.

-- Mike Treseler
 
