Counting ones

Peter's: solution for 48 bits:
2 BlockRAMs, 16 LUTs
BlockRAM tcko + 2 levels carry chain
John's solution for 30 bits:
Resources: 46 LUTs (giving him 1 LUT per CYAdd bit)
Delay: 2 levels LUTs + 3 levels carry chain
John_H's solution for 31 bits (mine, at end):
Resources: 41 LUTs
Delay: 3 levels LUTs + 2 levels carry chain

"Peter Alfke" <peter@xilinx.com> wrote in message
news:3F870649.447B1CA3@xilinx.com...
John, just to be contrary:
Remember my suggestion of using a 4K x 4 dual-ported BlockROM. So it can
take 24 inputs and generate two 4-bit outputs that then have to be
combined externally. On two BlockROMs you can have 48 inputs and then
combine 4 sets of 4 inputs in one or two CLBs. Fast and simple, but
needs a trigger clock, since BlockROMs are synchronous.

Peter Alfke, Xilinx Applications

See original post for John's earlier comments as well as my own. In John's
form of ASCII art is the 3-input LUT form of a 31 bit adder that won't work
for 32 bits wihtout adding a bit of extra junk but is a more common need.

- John_H

___
0-| F | ___
0-| A |-1-----1-| |-2-----+
0-|_3_|-0-----0-| * |-1---+ |
___ +---1-| |-0-+ | |
0-| F | | +-0-|___| | | |
0-| A |-1-+ | | | | |
0-|_3_|-0---+ 0 | | | ___
0-----------------+ | | +-2-| C |
___ | +---1-| Y |-3-------+
0-| F | ___ +-----0-| A |-2-----+ |
0-| A |-1-----1-| |-2-------2-| d |-1---+ | |
0-|_3_|-0-----0-| * |-1-------1-| d |-0-+ | | |
___ +---1-| |-0-------0-|___| | | | |
0-| F | | +-0-|___| | | | | |
0-| A |-1-+ | | 0 | | | |
0-|_3_|-0---+ 0 | | | | |
0-----------------+ | | | | |
0---------------------------------+ | | | |
___ | | | |
0-| F | ___ | | | |
0-| A |-1-----1-| |-2-----+ | | | |
0-|_3_|-0-----0-| * |-1---+ | | | | |
___ +---1-| |-0-+ | | | | | | ___
0-| F | | +-0-|___| | | | | | | +-3-| |
0-| A |-1-+ | | | | | | | +---2-| C |-4
0-|_3_|-0---+ 0 | | | ___ | +-----1-| Y |-3
0-----------------+ | | +-2-| C | +-------0-| A |-2
___ | +---1-| Y |-3---------3-| d |-1
0-| F | ___ +-----0-| A |-2---------2-| d |-0
0-| A |-1-----1-| |-2-------2-| d |-1---------1-| |
0-|_3_|-0-----0-| * |-1-------1-| d |-0---------0-|___|
___ +---1-| |-0-------0-|___| |
0-| F | | +-0-|___| | 0
0-| A |-1-+ | | 0 |
0-|_3_|-0---+ 0 | |
0-----------------+ | |
0---------------------------------+ |
0---------------------------------------------------+

LUT count:
16 12 8 5 = 41

* The 2bit+2bit+1bit LUTS (0-7 result) are implemented as
3 levelsof logic implemented with 3-input LUTs.
 
....oops.
Mine was 45 LUTs. 16 luts in the second stage, not 12.
Delays stand as indicated
 
From a silicon-area efficiency the BlockRAM approach loses, for two
BRAMs occupy as much area as 24 CLBs = 192 LUTs.
But since this is not an a la carte restaurant where you have the option
to pay only for what you use, the BlockRAM can sometimes be considered
free, if nothing else needs it.
Who said choices were easy...
Peter Alfke
==================================
John_H wrote:
Peter's: solution for 48 bits:
2 BlockRAMs, 16 LUTs
BlockRAM tcko + 2 levels carry chain
John's solution for 30 bits:
Resources: 46 LUTs (giving him 1 LUT per CYAdd bit)
Delay: 2 levels LUTs + 3 levels carry chain
John_H's solution for 31 bits (mine, at end):
Resources: 41 LUTs
Delay: 3 levels LUTs + 2 levels carry chain

"Peter Alfke" <peter@xilinx.com> wrote in message
news:3F870649.447B1CA3@xilinx.com...
John, just to be contrary:
Remember my suggestion of using a 4K x 4 dual-ported BlockROM. So it can
take 24 inputs and generate two 4-bit outputs that then have to be
combined externally. On two BlockROMs you can have 48 inputs and then
combine 4 sets of 4 inputs in one or two CLBs. Fast and simple, but
needs a trigger clock, since BlockROMs are synchronous.

Peter Alfke, Xilinx Applications

See original post for John's earlier comments as well as my own. In John's
form of ASCII art is the 3-input LUT form of a 31 bit adder that won't work
for 32 bits wihtout adding a bit of extra junk but is a more common need.

- John_H

___
0-| F | ___
0-| A |-1-----1-| |-2-----+
0-|_3_|-0-----0-| * |-1---+ |
___ +---1-| |-0-+ | |
0-| F | | +-0-|___| | | |
0-| A |-1-+ | | | | |
0-|_3_|-0---+ 0 | | | ___
0-----------------+ | | +-2-| C |
___ | +---1-| Y |-3-------+
0-| F | ___ +-----0-| A |-2-----+ |
0-| A |-1-----1-| |-2-------2-| d |-1---+ | |
0-|_3_|-0-----0-| * |-1-------1-| d |-0-+ | | |
___ +---1-| |-0-------0-|___| | | | |
0-| F | | +-0-|___| | | | | |
0-| A |-1-+ | | 0 | | | |
0-|_3_|-0---+ 0 | | | | |
0-----------------+ | | | | |
0---------------------------------+ | | | |
___ | | | |
0-| F | ___ | | | |
0-| A |-1-----1-| |-2-----+ | | | |
0-|_3_|-0-----0-| * |-1---+ | | | | |
___ +---1-| |-0-+ | | | | | | ___
0-| F | | +-0-|___| | | | | | | +-3-| |
0-| A |-1-+ | | | | | | | +---2-| C |-4
0-|_3_|-0---+ 0 | | | ___ | +-----1-| Y |-3
0-----------------+ | | +-2-| C | +-------0-| A |-2
___ | +---1-| Y |-3---------3-| d |-1
0-| F | ___ +-----0-| A |-2---------2-| d |-0
0-| A |-1-----1-| |-2-------2-| d |-1---------1-| |
0-|_3_|-0-----0-| * |-1-------1-| d |-0---------0-|___|
___ +---1-| |-0-------0-|___| |
0-| F | | +-0-|___| | 0
0-| A |-1-+ | | 0 |
0-|_3_|-0---+ 0 | |
0-----------------+ | |
0---------------------------------+ |
0---------------------------------------------------+

LUT count:
16 12 8 5 = 41

* The 2bit+2bit+1bit LUTS (0-7 result) are implemented as
3 levelsof logic implemented with 3-input LUTs.
 
john.l.smith@titan.com (John) wrote in message news:<5b9931fd.0310101049.45df7d20@posting.google.com>...

<lots of stuff, prolly too much, snipped>

Correction/amendment to (30,5) counter (6 CLBs):
___
0-| F |
0-| A |-1-------+
0-|_3_|-0-----+ |
___ | | ___
0-| F | | +-1-| F |
0-| A |-1-----|---1-| A |-2-----------+
0-|_3_|-0-+ +-|---1-|_3_|-1---------+ |
___ | | | ___ | | ___
0-| F | +-|-|---0-| F | | +--2-| |
0-| A |-1---+ +---0-| A |-1--+ +----1-| C |
0-|_3_|-0---------0-|_3_|-0--|-----------0-| Y |
| ___ | A |
+--1-| F | | d |-3----+
+---1-| A |--2-| d |-2---+|
| +-1-|_3_|--1-| |-1--+||
| | +-0-|___|-0-+|||
| | '0'--+ | ||||
| | 0 ||||
0---------------------------|-|--------------+ ||||
___ | | ||||
0-| F | | | ||||
0-| A |-1-------+ | | ||||
0-|_3_|-0-----+ | | | ||||
___ | | ___ | | ||||
0-| F | | +-1-| F | | | ||||
0-| A |-1-----|---1-| A |-2-|-|-------+ |||| ___
0-|_3_|-0-+ +-|---1-|_3_|-1-+ | | |||| | |
___ | | | ___ | | |||+-3-| |
0-| F | +-|-|---0-| F | | | ||+--2-| C |
0-| A |-1---+ +---0-| A |-1---|-----+ | |+---1-| Y |-4
0-|_3_|-0---------0-|_3_|-0---|---+ | | +----0-| A |-3
___ | | | | | d |-2
0-| F | | | | | | d |-1
0-| A |-1-------+ | | | | +----3-| |-0
0-|_3_|-0-----+ | | | | | |+---2-| |
___ | | ___ | | | | ||+--1-| |
0-| F | | +-1-| F | | | | | ___ |||+-0-|___|
0-| A |-1-----|---1-| A |-2---|-+ | | +--2-| C |-3-+||| |
0-|_3_|-0-+ +-|---1-|_3_|-1---+ | | +----1-| Y |-2--+|| 0
___ | | | ___ | +------0-| A |-1---+| |
0-| F | +-|-|---0-| F | +--------2-| d |-0----+ |
0-| A |-1---+ +---0-| A |-1--------------1-| d | |
0-|_3_|-0---------0-|_3_|-0--------------0-|___| |
| |
0 |
0--------------------------------------------+ |
0-----------------------------------------------------------+

LUT Count:
18 + 12 + 2 + 10 + 6 = 48
48 LUTs = 6 CLBs
Pipeline to taste...
 
My final answer (until someone asks an interesting
variation on the question...)
OP asked about VHDL implementation, John_H and I gave
graphs, so the code below is illustration of how to
build the graphs in VHDL, for both the other John's
tree (simple tree), and mine (modified tree), for 30 bits.
Some observations on synthesis results using XST:

LUT utilization (from map report)...
Simple - 44 LUTs
Modified - 43 LUTs

This is a trivial difference. Because the code for the simple
tree is, well, simpler, it is preferred. Smart folk should be
able to genericize it so that the tree structure is built
automatically for arbitrary input size, by a synthesis tool
that accepts unconstrained bus inputs and outputs. (Does XST
do this yet?)

Timing (from par report smallest Spartan3)...
Pad to Pad delay...
Simple - 17+ ns
Modified - 17+ ns
Clock period (with the FFs at the end commented back in,
so as to take the pad delay out of the equation)...
Simple - 10+ ns ( 6 levels )
Modified - 10+ ns ( 6 levels )

No pin assignment or other attempts to map logic used.

Timing can be pushed around a little by playing with
map/par options, but not necessarily the way you'd expect
(who called it 'pushing the rope'? Great metaphor!)
The keep attribute, trying to force the tool, may
backfire.

So there it is. I like this one, because it is such a
neat example of how even the simplest of operations
( Counting ones!!?!?! ) may not be so simple. You need
to be aware of what the compiler actually does with a
behavioural model, and how to push it, and how not to
push it.

Regards,
John
p.s. thanks to Dan C., local X-FAE, for helping me
get this thing through XST. Like the OP, I'm not a big
VHDL guy either (just don't like the wordiness/aesthetics
of it), I am liking Handel-C more and more.

p.p.s the code (un-simulated, you get what you pay for):

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

entity TestAddBits_Top is
generic (UseFA3 : boolean := FALSE );
Port ( Clock: in std_logic;
InBits : in std_logic_vector(29 downto 0);
SumOut : out std_logic_vector(4 downto 0));
end TestAddBits_Top;

architecture VirtexII of TestAddBits_Top is
subtype TwoBit is std_logic_vector(1 downto 0);
subtype ThreeBit is std_logic_vector(2 downto 0);
subtype FourBit is std_logic_vector(3 downto 0);
type Leaf is array (0 to 8) of TwoBit;
type Level1 is array (0 to 3) of ThreeBit;
type Level2 is array (0 to 1) of FourBit;
type Level1A is array (0 to 5) of TwoBit;
signal LeafSum: Leaf ;
signal L1Sum: Level1;
signal L2Sum: Level2;
signal L1SumA: Level1A;
signal FA3: std_logic_vector(1 downto 0);
signal SumBits: std_logic_vector(4 downto 0);
signal Bits: std_logic_vector(29 downto 0);
--attribute keep : string;
--attribute keep of LeafSum: signal is "true";
--attribute keep of L1Sum: signal is "true";
--attribute keep of L2Sum: signal is "true";
begin
-- Simple adder tree to count 30 bits. Uses 44 LUTs
T1: if ( UseFA3 = FALSE ) generate
addleaf: for I in 0 to 7 generate -- Add 8 leafs
LeafSum(I) <= ('0' & Bits(3*I) )
+ ('0' & Bits(3*I+1))
+ ('0' & Bits(3*I+2));
end generate;
addlevel1: for J in 0 to 3 generate -- Add Level1
L1Sum(J) <= ( '0' & LeafSum(2*J) )
+ ( '0' & LeafSum(2*J+1))
+ ("00" & Bits(24+J) );
end generate;
addlevel2:for K in 0 to 1 generate -- Add Level2
L2Sum(K) <= ( '0' & L1Sum(2*K) )
+ ( '0' & L1Sum(2*K+1) )
+ ("000" & Bits(28+K) );
end generate;
SumBits <= ('0' & L2Sum(0)) -- Final Answer
+ ('0' & L2Sum(1));
end generate;
-- Modified adder tree to count 30 bits. Uses 43 LUTs,
-- 1 less LUT than simple adder tree
T2: if ( UseFA3 = TRUE ) generate -- Use FA3 = True
addleafsA: for I in 0 to 8 generate -- Add 9 leafs
LeafSum(I) <= ('0' & Bits(3*I) )
+ ('0' & Bits(3*I+1))
+ ('0' & Bits(3*I+2));
end generate;
addlevel1A: for J in 0 to 2 generate -- Add Level1A
inner: for K in 0 to 1 generate
L1SumA(2*J+K) <= ( '0' & LeafSum(3*J)(K) )
+ ( '0' & LeafSum(3*J+1)(K) )
+ ( '0' & LeafSum(3*J+2)(K) );
end generate;
end generate;
FA3 <= ( '0' & L1SumA(0)(1) ) -- Add weight 1 bits
+ ( '0' & L1SumA(2)(1) )
+ ( '0' & L1SumA(4)(1) );
L2Sum(0) <= ( "00" & Bits(27) ) -- Add Level2
+ ( '0' & L1SumA(1) & L1SumA(0)(0) )
+ ( '0' & L1SumA(3) & L1SumA(2)(0) );
L2Sum(1) <= ( '0' & L1SumA(5) & L1SumA(4)(0) )
+ ( '0' & FA3 & Bits(28) );
SumBits <= ( '0' & L2Sum(0)) -- Final Answer
+ ( '0' & L2Sum(1))
+ ( "0000" & Bits(29));
end generate;
process ( Clock ) begin
if RISING_EDGE( Clock ) then
Bits <= InBits;
SumOut <= SumBits;
end if;
end process;
end VirtexII;

"John_H" <johnhandwork@mail.com> wrote in message news:<whGhb.24$jU3.10392@news-west.eli.net>...
Peter's: solution for 48 bits:
2 BlockRAMs, 16 LUTs
BlockRAM tcko + 2 levels carry chain
John's solution for 30 bits:
Resources: 46 LUTs (giving him 1 LUT per CYAdd bit)
Delay: 2 levels LUTs + 3 levels carry chain
John_H's solution for 31 bits (mine, at end):
Resources: 41 LUTs
Delay: 3 levels LUTs + 2 levels carry chain
 

Welcome to EDABoard.com

Sponsor

Back
Top