Counting ones

A

Aart van Beuzekom

Guest
Hi,

Can anybody tell me a practical way (in VHDL) to count the number of
ones in a bit pattern? Counting should be done with combinatorial logic,
which means that a for-to loop cannot be used here (but being quite new
to FPGA and VHDL, I'm not even sure). The number of bits to be counted
is about 30, so a single look-up table is not the solution, using only a
Spartan II.

Any suggestions? Purpose is amongst others pattern matching, where I
count how many bits differ from the expected result.

Thanks,

Aart
 
Aart van Beuzekom wrote:

Hi,

Can anybody tell me a practical way (in VHDL) to count the number of
ones in a bit pattern? Counting should be done with combinatorial logic,
which means that a for-to loop cannot be used here (but being quite new
to FPGA and VHDL, I'm not even sure). The number of bits to be counted
is about 30, so a single look-up table is not the solution, using only a
Spartan II.

Any suggestions? Purpose is amongst others pattern matching, where I
count how many bits differ from the expected result.

Thanks,

Aart

I forgot to mention that the for-to loop cannot be used because I do not
want the system to use any clock cycles - the result just has to be there.
 
Aart van Beuzekom schrieb:
Hi,

Can anybody tell me a practical way (in VHDL) to count the number of
ones in a bit pattern? Counting should be done with combinatorial logic,
which means that a for-to loop cannot be used here (but being quite new
to FPGA and VHDL, I'm not even sure). The number of bits to be counted
is about 30, so a single look-up table is not the solution, using only a
Spartan II.

Its not VHDL, but:
if you have bits numberd x0, x1, x2, x3 ... x30 than you can simply add
them: Y = x0 + x1 + x2 + ... x30 for counting ONE's

greetings, Bertram

--

Bertram Geiger, Graz - AUSTRIA
Private Mail: remove the letters "b a d" from my reply address
 
Uwe Bonnes wrote:

Aart van Beuzekom <aart@westcontrol.dontspamme.com> wrote:
: Hi,

: Can anybody tell me a practical way (in VHDL) to count the number of
: ones in a bit pattern? Counting should be done with combinatorial logic,
: which means that a for-to loop cannot be used here (but being quite new
: to FPGA and VHDL, I'm not even sure). The number of bits to be counted
: is about 30, so a single look-up table is not the solution, using only a
: Spartan II.

: Any suggestions? Purpose is amongst others pattern matching, where I
: count how many bits differ from the expected result.

It has been discussed before. Try www.deja.com to search this group for old
answers.

Bye
Hi Uwe,

Thanks for four response. Fortunately, I checked this newsgroup's
archive before posting. The problem is that it isn't exactly pattern
matching I want, but counting the number of errors (at the other end of
a poor communication link) in a bitstream, compared to an expected
pattern. Something that for example can be used to find a kind of
preamble signal.
Any more suggestions?

Aart
 
"Bertram Geiger" <badnews_geiger@aon.at> wrote in message
news:3F784164.2000108@aon.at...
Aart van Beuzekom schrieb:
Hi,

Can anybody tell me a practical way (in VHDL) to count the number of
ones in a bit pattern? Counting should be done with combinatorial logic,
which means that a for-to loop cannot be used here (but being quite new
to FPGA and VHDL, I'm not even sure). The number of bits to be counted
is about 30, so a single look-up table is not the solution, using only a
Spartan II.

Its not VHDL, but:
if you have bits numberd x0, x1, x2, x3 ... x30 than you can simply add
them: Y = x0 + x1 + x2 + ... x30 for counting ONE's
I don't remember it being discussed so recently. It might be too far back
to find, or maybe another newsgroup. comp.arch might be a good choice.

Anyway, adding with carry save adders is the best combinatorial algorithm I
know of. I don't know how well it works in an FPGA, though.

-- glen
 
Hi Aart

There is an excellent example of a combinational ones counter in the course
notes on our Comprehensive VHDL course ;-) Check out my employers website
for more details...

Our example does use a for loop - I'm not sure why you feel a for-loop
cannot be used for combinational logic, unless you are putting wait
statements inside the for loop, and in that case it probably won't
synthesise (unless you have a behavoral compiler). To work out what a loop
will do, simply replicate the contents of the loop once for each iteration
of the loop, replacing the loop parameter with a constant. eg

process(b , c)
begin
for i in 0 to 3 loop
a(i) = b(i) and c(3-i);
end loop
end process;

is equivalent to

process(b , c)
begin
a(0) = b(0) and c(3);
a(1) = b(1) and c(2);
a(2) = b(2) and c(1);
a(3) = b(3) and c(0);
end process;


Try a for-loop looping the entire 'range of your error vector. Inside the
for loop, sum all the bits. So long as you get the types right and don't put
any timing in the loop, it will synthesise to combinational logic.

HTH

Ian Poole
--
Ian Poole, Consultant and Trainer

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * Perl * Tcl/Tk * Verification * Project Services

The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.


"Aart van Beuzekom" <aart@westcontrol.dontspamme.com> wrote in message
news:bl8u9v$1v5$1@news.netpower.no...
Hi,

Can anybody tell me a practical way (in VHDL) to count the number of
ones in a bit pattern? Counting should be done with combinatorial logic,
which means that a for-to loop cannot be used here (but being quite new
to FPGA and VHDL, I'm not even sure). The number of bits to be counted
is about 30, so a single look-up table is not the solution, using only a
Spartan II.

Any suggestions? Purpose is amongst others pattern matching, where I
count how many bits differ from the expected result.

Thanks,

Aart
 
Aart van Beuzekom <aart@westcontrol.dontspamme.com> wrote:
: Hi,

: Can anybody tell me a practical way (in VHDL) to count the number of
: ones in a bit pattern? Counting should be done with combinatorial logic,
: which means that a for-to loop cannot be used here (but being quite new
: to FPGA and VHDL, I'm not even sure). The number of bits to be counted
: is about 30, so a single look-up table is not the solution, using only a
: Spartan II.

: Any suggestions? Purpose is amongst others pattern matching, where I
: count how many bits differ from the expected result.

It has been discussed before. Try www.deja.com to search this group for old
answers.

Bye
--
Uwe Bonnes bon@elektron.ikp.physik.tu-darmstadt.de

Institut fuer Kernphysik Schlossgartenstrasse 9 64289 Darmstadt
--------- Tel. 06151 162516 -------- Fax. 06151 164321 ----------
 
"Uwe Bonnes" <bon@elektron.ikp.physik.tu-darmstadt.de> wrote in message
news:bl9cg6$7uq$1@news.tu-darmstadt.de...
Aart van Beuzekom <aart@westcontrol.dontspamme.com> wrote:
: Hi,

: Can anybody tell me a practical way (in VHDL) to count the number of
: ones in a bit pattern?
<snip>

In either the Xilinx or Altera architecture, it's probably most efficient to
pre-add in groups of 4 bits then add thse results in an adder tree. For 32
bits (for instance) you can get 8 values with counts of 0-4 with simple
LUTs. The next stages give 4 values of 0-8 by adding the previous values in
pairs, the next level is 2 values of 0-16 and the last stage is a single
adder with a result from 0 to 32. This can be extended to whatever quantity
of bit errors you want to count per-cycle.

If you just try to add all 32 bits in one line - which is entirely valid -
some synthesizers will give you poor results. The structured coding tesnds
to give structured results. If you need higher speed, you can pipeline the
adder tree to get extreme clock rates.

- John_H
 
Aart van Beuzekom wrote:
Hi,

Can anybody tell me a practical way (in VHDL) to count the number of
ones in a bit pattern?
Related thread:

http://groups.google.com/groups?q=pattern_recognizer


-- Mike Treseler
 
Aart van Beuzekom wrote:
Aart van Beuzekom wrote:

Hi,

Can anybody tell me a practical way (in VHDL) to count the number of
ones in a bit pattern? Counting should be done with combinatorial logic,
which means that a for-to loop cannot be used here (but being quite new
to FPGA and VHDL, I'm not even sure). The number of bits to be counted
is about 30, so a single look-up table is not the solution, using only a
Spartan II.

Any suggestions? Purpose is amongst others pattern matching, where I
count how many bits differ from the expected result.

Thanks,

Aart

I forgot to mention that the for-to loop cannot be used because I do not
want the system to use any clock cycles - the result just has to be there.
There is nothing inherent in a for loop that uses "clock cycles". As
long as you don't use a wait statement on a clock edge, you will not
have a FF (which is what you are referring to as using clock cycles).

Rather than to think of a VHDL solution from the problem statement, I
prefer to think in terms of what hardware I would design to solve the
problem and then use VHDL to describe the hardware. In your case you
need a simple bank of adders to add in succession. You can optimize
this for speed or logic which will depend on your means of implementing
the design. This will be different in an FPGA with 4 input LUTs vs. a
gate oriented array.

Once you have decided on an architecture to solve the addition problem,
you can then code it in VHDL easily. Very possibly a for loop can be
used without creating FFs.

--

Rick "rickman" Collins

rick.collins@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design URL http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX
 
Aart van Beuzekom <aart@westcontrol.dontspamme.com> wrote:
: Uwe Bonnes wrote:

:> Aart van Beuzekom <aart@westcontrol.dontspamme.com> wrote:
....
:> It has been discussed before. Try www.deja.com to search this group for old
:> answers.
:>
:> Bye

: Hi Uwe,

: Thanks for four response. Fortunately, I checked this newsgroup's
: archive before posting. The problem is that it isn't exactly pattern
: matching I want, but counting the number of errors (at the other end of
: a poor communication link) in a bitstream, compared to an expected
: pattern. Something that for example can be used to find a kind of

The second hit on deja in a search for "fpga counting ones" gives:

Re: Counting the number of ones present ...
comp.lang.verilog - 8 Jul 2002 by John_H - View Thread (10 articles)
http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&threadm=3D29BF6F.C2CA3906%40mail.com&rnum=2&prev=/groups%3Fq%3Dfpga%2Bcounting%2Bones%26hl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3DUTF-8%26start%3D0%26sa%3DN

The 6th hit is
Re: Count 1's algorithm...
.... To count 16 ones you would need 5 CLBs and have one ... you have to do it and how many
bits you are counting. ... with a fanin and fanout of 2. The FPGA LUTs generally ...
comp.arch.fpga - 4 Feb 2000 by Dragon - View Thread (11 articles)

Hope this helps

--
Uwe Bonnes bon@elektron.ikp.physik.tu-darmstadt.de

Institut fuer Kernphysik Schlossgartenstrasse 9 64289 Darmstadt
--------- Tel. 06151 162516 -------- Fax. 06151 164321 ----------
 
"Ian Poole" <ian.poole@doulos.delete-this-bit.com> wrote in message
news:bl9ica$hqk$1$8300dec7@news.demon.co.uk...

There is an excellent example of a combinational ones counter in the
course
notes on our Comprehensive VHDL course ;-) Check out my employers website
for more details...

Our example does use a for loop - I'm not sure why you feel a for-loop
cannot be used for combinational logic, unless you are putting wait
statements inside the for loop, and in that case it probably won't
synthesise (unless you have a behavoral compiler). To work out what a loop
will do, simply replicate the contents of the loop once for each iteration
of the loop, replacing the loop parameter with a constant. eg
If you think in terms of traditional programming languages, for loop implies
iteration, though as you say that isn't required. The PL/I preprocessor,
with the %DO statement, would be an example of where it didn't.

There are people who would like to use C as a source language for synthesis,
which I still don't believe in.

Maybe asking for a combinatorial, instead of iterative design, would have
been better.

-- glen
 
Hi Aart-

In the Xilinx FPGAs, a LUT RAM block can conveniently be 32x1. You
could use three of these, with five identical inputs to provide a
look-up sum of five bits. Then, take the results of these sums (five
inputs at a time) into an adder tree. For example, if A, B, C, D, E,
F are the sets of 5-bits of a 30-bit input, then you'd sum A+B, C+D,
E+F. Two more levels and you'd have a complete sum. I'm not sure if
this is what the group decided on as the most efficient method, but
it's the first one that comes to mind. You could use BRAMs if you'd
like for a larger lookup (10-bits at a time to a 4-bit result if you
use RAMB4_S4_S4, for example).

Jake



Aart van Beuzekom <aart@westcontrol.dontspamme.com> wrote in message news:<bl8u9v$1v5$1@news.netpower.no>...
Hi,

Can anybody tell me a practical way (in VHDL) to count the number of
ones in a bit pattern? Counting should be done with combinatorial logic,
which means that a for-to loop cannot be used here (but being quite new
to FPGA and VHDL, I'm not even sure). The number of bits to be counted
is about 30, so a single look-up table is not the solution, using only a
Spartan II.

Any suggestions? Purpose is amongst others pattern matching, where I
count how many bits differ from the expected result.

Thanks,

Aart
 
Uwe Bonnes wrote:
Aart van Beuzekom <aart@westcontrol.dontspamme.com> wrote:
: Uwe Bonnes wrote:

:> Aart van Beuzekom <aart@westcontrol.dontspamme.com> wrote:
...
:> It has been discussed before. Try www.deja.com to search this group for old
:> answers.
:
:> Bye

: Hi Uwe,

: Thanks for four response. Fortunately, I checked this newsgroup's
: archive before posting. The problem is that it isn't exactly pattern
: matching I want, but counting the number of errors (at the other end of
: a poor communication link) in a bitstream, compared to an expected
: pattern. Something that for example can be used to find a kind of

The second hit on deja in a search for "fpga counting ones" gives:


Re: Counting the number of ones present ...
comp.lang.verilog - 8 Jul 2002 by John_H - View Thread (10 articles)


http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&threadm=3D29BF6F.C2CA3906%40mail.com&rnum=2&prev=/groups%3Fq%3Dfpga%2Bcounting%2Bones%26hl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3DUTF-8%26start%3D0%26sa%3DN

The 6th hit is
Re: Count 1's algorithm...
... To count 16 ones you would need 5 CLBs and have one ... you have to do it and how many
bits you are counting. ... with a fanin and fanout of 2. The FPGA LUTs generally ...
comp.arch.fpga - 4 Feb 2000 by Dragon - View Thread (11 articles)

Hope this helps

Hi Uwe,

OK, I didn't even know that newsgroup existed, I'll check it.

Aart
 
Aart van Beuzekom wrote:
Uwe Bonnes wrote:

Aart van Beuzekom <aart@westcontrol.dontspamme.com> wrote:
: Uwe Bonnes wrote:

:> Aart van Beuzekom <aart@westcontrol.dontspamme.com> wrote:
...
:> It has been discussed before. Try www.deja.com to search this group
for old
:> answers.
:> :> Bye

: Hi Uwe,

: Thanks for four response. Fortunately, I checked this newsgroup's :
archive before posting. The problem is that it isn't exactly pattern :
matching I want, but counting the number of errors (at the other end
of : a poor communication link) in a bitstream, compared to an
expected : pattern. Something that for example can be used to find a
kind of
The second hit on deja in a search for "fpga counting ones" gives:


Re: Counting the number of ones present ...
comp.lang.verilog - 8 Jul 2002 by John_H - View Thread (10 articles)



http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&threadm=3D29BF6F.C2CA3906%40mail.com&rnum=2&prev=/groups%3Fq%3Dfpga%2Bcounting%2Bones%26hl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3DUTF-8%26start%3D0%26sa%3DN


The 6th hit is
Re: Count 1's algorithm...
... To count 16 ones you would need 5 CLBs and have one ... you have
to do it and how many
bits you are counting. ... with a fanin and fanout of 2. The FPGA LUTs
generally ...
comp.arch.fpga - 4 Feb 2000 by Dragon - View Thread (11 articles)

Hope this helps

Hi Uwe,

OK, I didn't even know that newsgroup existed, I'll check it.

Aart

Thanks to all for the reponse!

Aart
 
To inject a fresh idea into this long thread:

One dual-ported BlockRAM, configured as a 4K x 4 ROM can obviously take
12 inputs and generate 4 binary encoded outputs, but it can do this
twice, once per port. (Both ports look at the same code converter simultaneously).
So one BlockRAM takes 24 inputs and generates two sets of 4 binary
outputs which can be combined in one CLB to generate the 5-bit result.
Two BlockRAMs can similarily encode 48 inputs, and the final result is
available in substantially less than 10 ns.
Note that the BlockRAM requires an input clock edge to initiate the conversion.

Since Virtex-II devices contain tens and even hundreds of such
BlockRAMs, it is highly likely that a few of them are unused, which
makes this design not only simple and fast, but also almost zero cost.

Peter Alfke, Xilinx Applications.
===============================
Jake Janovetz wrote:
Hi Aart-

In the Xilinx FPGAs, a LUT RAM block can conveniently be 32x1. You
could use three of these, with five identical inputs to provide a
look-up sum of five bits. Then, take the results of these sums (five
inputs at a time) into an adder tree. For example, if A, B, C, D, E,
F are the sets of 5-bits of a 30-bit input, then you'd sum A+B, C+D,
E+F. Two more levels and you'd have a complete sum. I'm not sure if
this is what the group decided on as the most efficient method, but
it's the first one that comes to mind. You could use BRAMs if you'd
like for a larger lookup (10-bits at a time to a 4-bit result if you
use RAMB4_S4_S4, for example).

Jake

Aart van Beuzekom <aart@westcontrol.dontspamme.com> wrote in message news:<bl8u9v$1v5$1@news.netpower.no>...
Hi,

Can anybody tell me a practical way (in VHDL) to count the number of
ones in a bit pattern? Counting should be done with combinatorial logic,
which means that a for-to loop cannot be used here (but being quite new
to FPGA and VHDL, I'm not even sure). The number of bits to be counted
is about 30, so a single look-up table is not the solution, using only a
Spartan II.

Any suggestions? Purpose is amongst others pattern matching, where I
count how many bits differ from the expected result.

Thanks,

Aart
 
There is an excellent example of a combinational ones counter in the
course
notes on our Comprehensive VHDL course ;-) Check out my employers
website
for more details...

Our example does use a for loop - I'm not sure why you feel a for-loop
cannot be used for combinational logic, unless you are putting wait
statements inside the for loop, and in that case it probably won't
synthesise (unless you have a behavoral compiler). To work out what a
loop
will do, simply replicate the contents of the loop once for each
iteration
of the loop, replacing the loop parameter with a constant. eg

The for loop method generates n-1 adders for n bits.
Probably quite slow as well.
The synthesizers might of course be smarter than I think...

--
Best Regards,
Ulf Samuelsson ulf@a-t-m-e-l.com
This is a personal view which may or may not be
share by my Employer Atmel Nordic AB
 
Last time I tried it that way, it wound up with something pretty slow. The
synth doesn't know how to make small tally adders with LUTs.

Ulf Samuelsson wrote:

There is an excellent example of a combinational ones counter in the
course
notes on our Comprehensive VHDL course ;-) Check out my employers
website
for more details...

Our example does use a for loop - I'm not sure why you feel a for-loop
cannot be used for combinational logic, unless you are putting wait
statements inside the for loop, and in that case it probably won't
synthesise (unless you have a behavoral compiler). To work out what a
loop
will do, simply replicate the contents of the loop once for each
iteration
of the loop, replacing the loop parameter with a constant. eg


The for loop method generates n-1 adders for n bits.
Probably quite slow as well.
The synthesizers might of course be smarter than I think...

--
Best Regards,
Ulf Samuelsson ulf@a-t-m-e-l.com
This is a personal view which may or may not be
share by my Employer Atmel Nordic AB
--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759
 
"John_H" <johnhandwork@mail.com> wrote in message news:<pDYdb.14$XP3.1342@news-west.eli.net>...
snip
In either the Xilinx or Altera architecture, it's probably most efficient to
pre-add in groups of 4 bits then add thse results in an adder tree. For 32
bits (for instance) you can get 8 values with counts of 0-4 with simple
LUTs.
snip
- John_H
I love coming back to a subject and gaining a
little more insight!

In the current discussion (and the thread from
3 years ago) I made a mistake (and so has
everyone else contributing, a surprising
event on usenet). There is a seeming paradox,
in that it is more efficient to use the LUTs
to do a three input sum at the leaves of the adder
tree than to do a 4 input sum. I'll show below the
most efficient (area wise) FPGA implementations that
I know (for 16 bits, and OP's 30 bits), and challenge
anyone to come back and show an even better way.
After that, some simple words about the 'paradox'
of why the three input/LUT solution is better than
four input/LUT in this case.

Look first at a _LUT_only_ implementation (16 bits),
which applies to any LUT based FPGA...

A full adder (FA3) can be implemented in 2 LUTs
These are 3-LUTs!, not 4-LUTS, although in practice
4-LUTs are used because that is what is available.
It is also called a (3,2) counter.
See L.Dadda's papers, including "Pipelined Adders"
in IEEE Transactions on Computers, Mar 1996,
for a good discussion of adders, or Patterson et.al
on "Optimal Carry Save Networks", available at Citeseer,
for better/deeper discussions than I can give.

This is the full adder:
___
-0-| F |
-0-| A |-1-
-0-|_3_|-0-

(the numbers are powers of 2 at a position)
Repeating: The FA3 sums 3 bits, using 2 3-LUTs.
Next, 4 full adders are arranged to sum 7 bits,
using 8x 3-LUTs:
___
0--| F |
0--| A |-1--------+
0--|_3_|-0---+ | ___
| +-----1-| F |
___ +--|----------1-| A |---2
0--| F | | | ___ +1-|_3_|---1
0--| A |-1+ +0-| F | |
0--|_3_|-0----0-| A |-1+
0-------------0-|___|-0-----------0

Call the above a 7 Adder (7Add):
___
| 7 |
7 | A |-2-
0-/-| d |-1-
|_d_|-0-

(a more correct term might be '(7,3) counter',
but '7Add' fits in the ascii drawing)

Two 7Adds can be used to sum 14 bits into
two 3-bit numbers, using 16x 3-LUTs.

3x 4-LUTs are used to sum 4 bits to a 3-bit
result in a 4Add:
___
| 4 |
4 | A |---2
0-/-| d |---1
|_d_|---0

Produce the final result from the output of the
two 7Adds plus the remaining two bits, using
two 4Adds and an FA:
___
| 7 |
7 | A |-2-------------------------------+ ___
0-/-| d |-1-------------------+ +-2-| 4 |
|_d_|-0-----+ +-------|-------------2-| A |--4
| | +-|-------------2-| d |--3
| | | | ___ +-2-|_d_|--2
___ +---|-----+ | +-1-| F | |
| 7 | | +-|-----------|---1-| A |-2-+
7 | A |-2-+ | | ___ | +-1-|___|--------------1
0-/-| d |-1---+ +-0-| 4 | | |
|_d_|-0-------0-| A |-2-+ |
0-----------------0-| d |-1---+
0-----------------0-|_d_|----------------------------0

Total LUT count is 24 (3 Virtex CLBs), using
6x 4-LUTs and 18x 3-LUTs. This is the absolute
minimum using 3-LUT and 4-LUT based logic alone
(that I know of).

Now look at how the VirtexII carry logic may
improve things...

At the leaves of the tree (left hand side), two 3-LUTs
are still used to build full adders. The next step is
building a 7Add, which sums the output of 2 FAs plus
one more bit. Here, either carry logic can be used,
at a total cost of 8 LUTs, or a LUT only solution, also
costing 8 LUTs. A 15Add sums two 7Adds and another bit.
This costs two 7Adds (16 LUTs), plus a standard 3 bit
carry logic based adder with carry-in and carry-out, another
5 LUTs. Adding in the last bit is a 4 bit increment with
carry-out, costing 6 LUTs using carry logic. Total: 27 LUTs!
Three more than the LUT only circuit. Frustrating, isn't it?
For this size, a sum of 16 bits, trying to use carry logic
does worse than a LUT only implementation. For a 15Add, carry
logic allows a marginal improvement of one LUT, using only
21 LUTs instead of 22. For larger bit counts, the carry logic
becomes increasingly important, as will be explained below.

Finally, a few simple-minded words about the 'paradox'...

Many operations can be viewed as an exercise in compression,
reducing a number of inputs to a smaller number of outputs
through some sort of multi-level tree structure. In this case,
inputs are the bits to count, and outputs are the count. At any
tree level, a simple, naive figure of merit qualifies a circuit:
(InputBits/OutputBits) * (InputBits/LUTs)
The larger this number, the better the circuit is for that level.
The first ratio helps the tree converge faster, the second reduces
the LUT count. For 4-LUT based logic, the highest possible figure
of merit would be 16, indicating four input bits produce a single
output bit, with a single LUT. Parity trees, "and" trees, "or"
trees acheive 16.
Implementing the leaf side initial level of bit counting
with 4-LUTs gives (4/3)*(4/3)=1.778. Implementing the initial bit
counting with 3-LUTs gives (3/2)*(3/2)=2.25. At any stage where
bits of the same weight are aggregated and compressed, the 3-LUT
implementation is better if they can be grouped by threes.
At the end of the 16Add tree, some 4-LUTs are used to advantage
because there are 4 signals to combine that cannot be cleanly
split into groups of three. For every other location in the tree,
the 3-LUTs work out better.
When carry logic is used at a combining stage, uniting
two addends of the same size plus an LSB, the metric becomes:
((2*Size+1)/(Size+1))*((2*Size+1)/(Size+2))
where Size is the number of bits in each addend.
This gives:
Size Merit
1 1.5
2 2.083
3 2.45
4 2.7
... approaching 4

When 3-LUT based full adders are cascaded to add
two numbers and a carry-in, the metric becomes:
((2*Size+1)/(Size+1))*((Size+1)/(2*Size))
Giving:
Size Merit
1 2.25
2 2.083
3 2.042
4 2.025
... approaching 1

Comparing the tables, for combining three equal weight
bits, the LUT solution is better; for going from full adder
to 7Add the two circuits have equal LUT count; for
going from 7Add to 15Add carry logic should be used.
The optimum (30,5) counter (that I know of) uses a mix of
3-LUTs and carry-logic, and no 4-LUTs.

Whew! Talk about being over-anal(ytical). Hope I haven't
bored anyone, just wanted to get the 'best' circuits public
(in hopes someone else has a better one), and show something
not immediately obvious about the leaves of the adder tree.
Peter's BlockRam implementation is also slick!

I'll finish with a question:
Just what are the best/worst results from HDL synthesis
tools that folks get with this function? I.e., barring
forcing the mapping, and letting the tool optimize from
something like:
Count <= Bit0 + Bit1 + Bit2 +...

Regards,
John

p.s. Original poster asked about 30 bits...here's the
most compact way I know, but I haven't spent much
time on 30 bits. Here the carry logic helps. Note
that none of the LUTs are configured as 4-LUT....
___
0-| F |
0-| A |-1-------+
0-|_3_|-0-----+ |
___ | | ___
0-| F | | +-1-| F |
0-| A |-1-----|---1-| A |-2-----------------+
0-|_3_|-0-+ +-|---1-|_3_|-1---------------+ |
___ | | | ___ | |
0-| F | +-|-|---0-| F | | |
0-| A |-1---+ +---0-| A |-1---+ | |
0-|_3_|-0---------0-|_3_|-0---|---------+ | |
___ | | | |
0-| F | | ___ | | |
0-| A |-1-------+ + | F | | | |
0-|_3_|-0-----+ | +---1-| A |-|-|-|-2---+
___ | | ___ | +-1-|_3_|-|-|-|-1-+ |
0-| F | | +-1-| F | | | | | | | |
0-| A |-1-----|---1-| A |-2-|-|-------+ | | | | | ___
0-|_3_|-0-+ +-|---1-|_3_|-1-+ | | | | | | | '0'-4-| |
___ | | | ___ | | | | | | | '0'-3-| |
0-| F | +-|-|---0-| F | | | | | | | +------2-| C |
0-| A |-1---+ +---0-| A |-1---|-----+ | | | | +--------1-| Y |-4
0-|_3_|-0---------0-|_3_|-0---|---+ | | | | | '0'-0-| A |-3
___ | | | | | | | ___ | d |-2
0-| F | | | | | | | |'0'-3-| |-4-| d |-1
0-| A |-1-------+ | | | | | | +----2-| C |-3-| |-0
0-|_3_|-0-----+ | | | | | | +------1-| Y |-2-| |
___ | | ___ | | | | +--------0-| A |-1-| |
0-| F | | +-1-| F | | | | | ___ | d |-0-|___|
0-| A |-1-----|---1-| A |-2---|-+ | | +--2 | C |-3-| d | |
0-|_3_|-0-+ +-|---1-|_3_|-1---+ | | +----1-| Y |-2-| | 0
___ | | | ___ | +------0-| A |-1-| | |
0-| F | +-|-|---0-| F | +--------2-| d |-0-|___| |
0-| A |-1---+ +---0-| A |-1--------------1-| d | | |
0-|_3_|-0---------0-|_3_|-0--------------0-|___| 0 |
| | |
0 | |
0--------------------------------------------+ | |
0----------------------------------------------------+ |
0------------------------------------------------------------+

LUT Count:
18 + 12 + 2 + 5 + 6 + 6 = 49

49 LUTs = 8.125 CLBs
(Pipeline to taste)
HTH
(apologies for being late to the thread, suffered a disk crash
recently, could not post)
 
John, just to be contrary:
Remember my suggestion of using a 4K x 4 dual-ported BlockROM. So it can
take 24 inputs and generate two 4-bit outputs that then have to be
combined externally. On two BlockROMs you can have 48 inputs and then
combine 4 sets of 4 inputs in one or two CLBs. Fast and simple, but
needs a trigger clock, since BlockROMs are synchronous.

Peter Alfke, Xilinx Applications

John wrote:
"John_H" <johnhandwork@mail.com> wrote in message news:<pDYdb.14$XP3.1342@news-west.eli.net>...
snip
In either the Xilinx or Altera architecture, it's probably most efficient to
pre-add in groups of 4 bits then add thse results in an adder tree. For 32
bits (for instance) you can get 8 values with counts of 0-4 with simple
LUTs.
snip
- John_H

I love coming back to a subject and gaining a
little more insight!

In the current discussion (and the thread from
3 years ago) I made a mistake (and so has
everyone else contributing, a surprising
event on usenet). There is a seeming paradox,
in that it is more efficient to use the LUTs
to do a three input sum at the leaves of the adder
tree than to do a 4 input sum. I'll show below the
most efficient (area wise) FPGA implementations that
I know (for 16 bits, and OP's 30 bits), and challenge
anyone to come back and show an even better way.
After that, some simple words about the 'paradox'
of why the three input/LUT solution is better than
four input/LUT in this case.

Look first at a _LUT_only_ implementation (16 bits),
which applies to any LUT based FPGA...

A full adder (FA3) can be implemented in 2 LUTs
These are 3-LUTs!, not 4-LUTS, although in practice
4-LUTs are used because that is what is available.
It is also called a (3,2) counter.
See L.Dadda's papers, including "Pipelined Adders"
in IEEE Transactions on Computers, Mar 1996,
for a good discussion of adders, or Patterson et.al
on "Optimal Carry Save Networks", available at Citeseer,
for better/deeper discussions than I can give.

This is the full adder:
___
-0-| F |
-0-| A |-1-
-0-|_3_|-0-

(the numbers are powers of 2 at a position)
Repeating: The FA3 sums 3 bits, using 2 3-LUTs.
Next, 4 full adders are arranged to sum 7 bits,
using 8x 3-LUTs:
___
0--| F |
0--| A |-1--------+
0--|_3_|-0---+ | ___
| +-----1-| F |
___ +--|----------1-| A |---2
0--| F | | | ___ +1-|_3_|---1
0--| A |-1+ +0-| F | |
0--|_3_|-0----0-| A |-1+
0-------------0-|___|-0-----------0

Call the above a 7 Adder (7Add):
___
| 7 |
7 | A |-2-
0-/-| d |-1-
|_d_|-0-

(a more correct term might be '(7,3) counter',
but '7Add' fits in the ascii drawing)

Two 7Adds can be used to sum 14 bits into
two 3-bit numbers, using 16x 3-LUTs.

3x 4-LUTs are used to sum 4 bits to a 3-bit
result in a 4Add:
___
| 4 |
4 | A |---2
0-/-| d |---1
|_d_|---0

Produce the final result from the output of the
two 7Adds plus the remaining two bits, using
two 4Adds and an FA:
___
| 7 |
7 | A |-2-------------------------------+ ___
0-/-| d |-1-------------------+ +-2-| 4 |
|_d_|-0-----+ +-------|-------------2-| A |--4
| | +-|-------------2-| d |--3
| | | | ___ +-2-|_d_|--2
___ +---|-----+ | +-1-| F | |
| 7 | | +-|-----------|---1-| A |-2-+
7 | A |-2-+ | | ___ | +-1-|___|--------------1
0-/-| d |-1---+ +-0-| 4 | | |
|_d_|-0-------0-| A |-2-+ |
0-----------------0-| d |-1---+
0-----------------0-|_d_|----------------------------0

Total LUT count is 24 (3 Virtex CLBs), using
6x 4-LUTs and 18x 3-LUTs. This is the absolute
minimum using 3-LUT and 4-LUT based logic alone
(that I know of).

Now look at how the VirtexII carry logic may
improve things...

At the leaves of the tree (left hand side), two 3-LUTs
are still used to build full adders. The next step is
building a 7Add, which sums the output of 2 FAs plus
one more bit. Here, either carry logic can be used,
at a total cost of 8 LUTs, or a LUT only solution, also
costing 8 LUTs. A 15Add sums two 7Adds and another bit.
This costs two 7Adds (16 LUTs), plus a standard 3 bit
carry logic based adder with carry-in and carry-out, another
5 LUTs. Adding in the last bit is a 4 bit increment with
carry-out, costing 6 LUTs using carry logic. Total: 27 LUTs!
Three more than the LUT only circuit. Frustrating, isn't it?
For this size, a sum of 16 bits, trying to use carry logic
does worse than a LUT only implementation. For a 15Add, carry
logic allows a marginal improvement of one LUT, using only
21 LUTs instead of 22. For larger bit counts, the carry logic
becomes increasingly important, as will be explained below.

Finally, a few simple-minded words about the 'paradox'...

Many operations can be viewed as an exercise in compression,
reducing a number of inputs to a smaller number of outputs
through some sort of multi-level tree structure. In this case,
inputs are the bits to count, and outputs are the count. At any
tree level, a simple, naive figure of merit qualifies a circuit:
(InputBits/OutputBits) * (InputBits/LUTs)
The larger this number, the better the circuit is for that level.
The first ratio helps the tree converge faster, the second reduces
the LUT count. For 4-LUT based logic, the highest possible figure
of merit would be 16, indicating four input bits produce a single
output bit, with a single LUT. Parity trees, "and" trees, "or"
trees acheive 16.
Implementing the leaf side initial level of bit counting
with 4-LUTs gives (4/3)*(4/3)=1.778. Implementing the initial bit
counting with 3-LUTs gives (3/2)*(3/2)=2.25. At any stage where
bits of the same weight are aggregated and compressed, the 3-LUT
implementation is better if they can be grouped by threes.
At the end of the 16Add tree, some 4-LUTs are used to advantage
because there are 4 signals to combine that cannot be cleanly
split into groups of three. For every other location in the tree,
the 3-LUTs work out better.
When carry logic is used at a combining stage, uniting
two addends of the same size plus an LSB, the metric becomes:
((2*Size+1)/(Size+1))*((2*Size+1)/(Size+2))
where Size is the number of bits in each addend.
This gives:
Size Merit
1 1.5
2 2.083
3 2.45
4 2.7
... approaching 4

When 3-LUT based full adders are cascaded to add
two numbers and a carry-in, the metric becomes:
((2*Size+1)/(Size+1))*((Size+1)/(2*Size))
Giving:
Size Merit
1 2.25
2 2.083
3 2.042
4 2.025
... approaching 1

Comparing the tables, for combining three equal weight
bits, the LUT solution is better; for going from full adder
to 7Add the two circuits have equal LUT count; for
going from 7Add to 15Add carry logic should be used.
The optimum (30,5) counter (that I know of) uses a mix of
3-LUTs and carry-logic, and no 4-LUTs.

Whew! Talk about being over-anal(ytical). Hope I haven't
bored anyone, just wanted to get the 'best' circuits public
(in hopes someone else has a better one), and show something
not immediately obvious about the leaves of the adder tree.
Peter's BlockRam implementation is also slick!

I'll finish with a question:
Just what are the best/worst results from HDL synthesis
tools that folks get with this function? I.e., barring
forcing the mapping, and letting the tool optimize from
something like:
Count <= Bit0 + Bit1 + Bit2 +...

Regards,
John

p.s. Original poster asked about 30 bits...here's the
most compact way I know, but I haven't spent much
time on 30 bits. Here the carry logic helps. Note
that none of the LUTs are configured as 4-LUT....
___
0-| F |
0-| A |-1-------+
0-|_3_|-0-----+ |
___ | | ___
0-| F | | +-1-| F |
0-| A |-1-----|---1-| A |-2-----------------+
0-|_3_|-0-+ +-|---1-|_3_|-1---------------+ |
___ | | | ___ | |
0-| F | +-|-|---0-| F | | |
0-| A |-1---+ +---0-| A |-1---+ | |
0-|_3_|-0---------0-|_3_|-0---|---------+ | |
___ | | | |
0-| F | | ___ | | |
0-| A |-1-------+ + | F | | | |
0-|_3_|-0-----+ | +---1-| A |-|-|-|-2---+
___ | | ___ | +-1-|_3_|-|-|-|-1-+ |
0-| F | | +-1-| F | | | | | | | |
0-| A |-1-----|---1-| A |-2-|-|-------+ | | | | | ___
0-|_3_|-0-+ +-|---1-|_3_|-1-+ | | | | | | | '0'-4-| |
___ | | | ___ | | | | | | | '0'-3-| |
0-| F | +-|-|---0-| F | | | | | | | +------2-| C |
0-| A |-1---+ +---0-| A |-1---|-----+ | | | | +--------1-| Y |-4
0-|_3_|-0---------0-|_3_|-0---|---+ | | | | | '0'-0-| A |-3
___ | | | | | | | ___ | d |-2
0-| F | | | | | | | |'0'-3-| |-4-| d |-1
0-| A |-1-------+ | | | | | | +----2-| C |-3-| |-0
0-|_3_|-0-----+ | | | | | | +------1-| Y |-2-| |
___ | | ___ | | | | +--------0-| A |-1-| |
0-| F | | +-1-| F | | | | | ___ | d |-0-|___|
0-| A |-1-----|---1-| A |-2---|-+ | | +--2 | C |-3-| d | |
0-|_3_|-0-+ +-|---1-|_3_|-1---+ | | +----1-| Y |-2-| | 0
___ | | | ___ | +------0-| A |-1-| | |
0-| F | +-|-|---0-| F | +--------2-| d |-0-|___| |
0-| A |-1---+ +---0-| A |-1--------------1-| d | | |
0-|_3_|-0---------0-|_3_|-0--------------0-|___| 0 |
| | |
0 | |
0--------------------------------------------+ | |
0----------------------------------------------------+ |
0------------------------------------------------------------+

LUT Count:
18 + 12 + 2 + 5 + 6 + 6 = 49

49 LUTs = 8.125 CLBs
(Pipeline to taste)
HTH
(apologies for being late to the thread, suffered a disk crash
recently, could not post)
 

Welcome to EDABoard.com

Sponsor

Back
Top