Use of hardware adders with long words to perform multiple a

  • Thread starter Wojciech M. Zabolotny
  • Start date
W

Wojciech M. Zabolotny

Guest
I was solving a problem, when I needed to calculate every clock a sum of multiple values
encoded on a small number of bits (the latency of a few clocks is allowed).

A natural solution seemed to be a binary tree of adders, consisting of N levels,
when on each level I calculate a sum of two values.
E.g. assuming, that I have to calculate a sum of 8 values, I can calculate:

On the 1st level:
Y0 = X0 + X1, Y1=X2+X3, Y2=X4+X5, Y3=X6+X7 (4 adders)

On the 2nd level:
V0 = Y0+Y1, V1=Y2+Y3 (2 adders)

On the 3rd level:
Z = V0+V1 (1 adder)

If each level is equipped with a pipeline register, I can get a short critical path,
and the final latency is equal to 3 clocks, the new values may be entered every clock,
and the result is availeble every clock. The whole design uses 7 adders.

However modern FPGAs are equipped with adders using long words. E.g. the Xilinx family 7
chips use adders with 25 bit input operands.
If we assume, that the input values are encoded only at 5 bits, we can significantly
reduce consumption of resources.
Lets encode input words X0..X7 on bits of operands on the 1st level as follows:
A(4 downto 0)=X0; A(5)='0';
A(10 downto 6)=X2; A(11)='0';
A(16 downto 12)=X4; A(17)='0';
A(22 downto 18)=X6; A(23)='0';

B(4 downto 0)=X1; B(5)='0';
B(10 downto 6)=X3; B(11)='0';
B(16 downto 12)=X5; B(17)='0';
B(22 downto 18)=X7; B(23)='0';

Then on the first layer we can perform all calculations using only single adder:
C=A+B, and sub-sums are encoded as follows:
C(5 downto 0)= X0+X1=Y0; C(11 downto 6)=X2+X3=Y1; C(17 downto 12)=X4+X5=Y2; C(23 downto 18)=X6+X7=Y3

On the 2nd level we work with 6-bit values (7-bit after addition of leading '0'), so we can perform
up to 3 additions in a single adder (but we need only 2)

D(5 downto 0)=Y0; D(6)='0'; D(12 downto 7)=Y2; D(13)='0';
E(5 downto 0)=Y1; D(6)='0'; D(12 downto 7)=Y3; D(13)='0';

After addition:
F=D+E we get:
F(6 downto 0)=Y0+Y1=V0 ; F(13 downto 7)=Y2+Y3=V1;

The final addition may be performed in a standard way.
Please note, that now we had to use only 3 adders!

The interesting problem is, how we should organize our adders' tree for different lengths of word
in the adders, different lengths of the input values and different number of the input values.
I've prepared a simple tool, written in Python, which automatically generates the appropriate
VHDL code.
The sources have been posted on the usenet alt.sources group:
Subject: Generator of VHDL code for parallel adder (using long-word hardware adders to perform multiple additions in parallel)
Google archive: https://groups.google.com/forum/#!topic/alt.sources/8DqGgELScDM

However I'm interested if it can be done in pure VHDL?
I hope that the above idea will be useful for someone...

--
Regards,
Wojciech M. Zabolotny
wzab@ise.pw.edu.pl

My GPG/PGP keys:
standard: B191 ACF0 7909 83FA 3F9B 450C 407E 3C4B 4569 D119
confidential: 2BF3 F90F 6EA8 7D35 59FD 5080 78ED 33DE 1312 D8F8
 
Wojciech M. Zabolotny <wzab@ise.pw.edu.pl> wrote:
I was solving a problem, when I needed to calculate every clock a
sum of multiple values encoded on a small number of bits
(the latency of a few clocks is allowed).

A natural solution seemed to be a binary tree of adders, consisting of N levels,
when on each level I calculate a sum of two values.
E.g. assuming, that I have to calculate a sum of 8 values, I can calculate:

On the 1st level:
Y0 = X0 + X1, Y1=X2+X3, Y2=X4+X5, Y3=X6+X7 (4 adders)

On the 2nd level:
V0 = Y0+Y1, V1=Y2+Y3 (2 adders)

On the 3rd level:
Z = V0+V1 (1 adder)

The traditional solution to this problem is the carry save adder.

It isn't always so obvious that the traditional solutions are still
optimal in FPGAs with built-in carry logic, but I think it still works.

-- glen
 
On Sat, 30 Nov 2013 21:38:32 +0000, Wojciech M. Zabolotny wrote:

I was solving a problem, when I needed to calculate every clock a sum of
multiple values encoded on a small number of bits (the latency of a few
clocks is allowed).

A natural solution seemed to be a binary tree of adders, consisting of N
levels,
when on each level I calculate a sum of two values.
E.g. assuming, that I have to calculate a sum of 8 values, I can
calculate:

On the 1st level:
Y0 = X0 + X1, Y1=X2+X3, Y2=X4+X5, Y3=X6+X7 (4 adders)

On the 2nd level:
V0 = Y0+Y1, V1=Y2+Y3 (2 adders)

On the 3rd level:
Z = V0+V1 (1 adder)

If each level is equipped with a pipeline register, I can get a short
critical path,
and the final latency is equal to 3 clocks, the new values may be
entered every clock,
and the result is availeble every clock. The whole design uses 7 adders.

However modern FPGAs are equipped with adders using long words. E.g. the
Xilinx family 7 chips use adders with 25 bit input operands.
If we assume, that the input values are encoded only at 5 bits, we can
significantly reduce consumption of resources.
Lets encode input words X0..X7 on bits of operands on the 1st level as
follows:
A(4 downto 0)=X0; A(5)='0';
A(10 downto 6)=X2; A(11)='0';
A(16 downto 12)=X4; A(17)='0';
A(22 downto 18)=X6; A(23)='0';

B(4 downto 0)=X1; B(5)='0';
B(10 downto 6)=X3; B(11)='0';
B(16 downto 12)=X5; B(17)='0';
B(22 downto 18)=X7; B(23)='0';

Then on the first layer we can perform all calculations using only
single adder:
C=A+B, and sub-sums are encoded as follows:
C(5 downto 0)= X0+X1=Y0; C(11 downto 6)=X2+X3=Y1; C(17 downto
12)=X4+X5=Y2; C(23 downto 18)=X6+X7=Y3

On the 2nd level we work with 6-bit values (7-bit after addition of
leading '0'), so we can perform up to 3 additions in a single adder (but
we need only 2)

D(5 downto 0)=Y0; D(6)='0'; D(12 downto 7)=Y2; D(13)='0';
E(5 downto 0)=Y1; D(6)='0'; D(12 downto 7)=Y3; D(13)='0';

After addition:
F=D+E we get:
F(6 downto 0)=Y0+Y1=V0 ; F(13 downto 7)=Y2+Y3=V1;

The final addition may be performed in a standard way.
Please note, that now we had to use only 3 adders!

The interesting problem is, how we should organize our adders' tree for
different lengths of word in the adders, different lengths of the input
values and different number of the input values.
I've prepared a simple tool, written in Python, which automatically
generates the appropriate VHDL code.
The sources have been posted on the usenet alt.sources group:
Subject: Generator of VHDL code for parallel adder (using long-word
hardware adders to perform multiple additions in parallel)
Google archive:
https://groups.google.com/forum/#!topic/alt.sources/8DqGgELScDM

However I'm interested if it can be done in pure VHDL?
I hope that the above idea will be useful for someone...

Not a direct answer to your questions... have you researched Wallace
Trees? Like most solutions to computer arithmetic, it dates from about
half a century ago.
Wallace tried to find a way of making a faster hardware multiplier. Part
of that involves adding a large number of binary numbers, and Wallace's
solution allowed for a shorter critical path in the adder tree.

http://en.wikipedia.org/wiki/Wallace_tree

BTW, yes, it can be done in pure VHDL. Expect to use a lot of if-
generate and for-generate constructs, possibly with functions to work out
the ranges of the for-generates.

A few years back a wrote a "popcount" (e.g. population count) module in
VHDL. It had an input width that was configurable, and added up the
number of '1' bits in that input word. E.g. a 512 bit input would result
in a 10 bit output word. The whole thing worked in only a modest number
of levels of logic and was moderately fast (180MHz after PAR in the logic
of the day) because I used a Wallace Tree. The first stages (which were
adding six 1 bit numbers together) were implemented directly in six-input
LUTs. The second (and perhaps third) stages were implemented using
similar LUTs. I only switched to using behavioural (synthesising to
ripple carry) adders once the words became wider a few stages into the
tree.

Regards,
Allan
 
W dniu niedziela, 1 grudnia 2013 01:40:26 UTC+1 użytkownik Allan Herriman napisał:
On Sat, 30 Nov 2013 21:38:32 +0000, Wojciech M. Zabolotny wrote:



I was solving a problem, when I needed to calculate every clock a sum of

multiple values encoded on a small number of bits (the latency of a few

clocks is allowed).



A natural solution seemed to be a binary tree of adders, consisting of N

levels,

when on each level I calculate a sum of two values.

E.g. assuming, that I have to calculate a sum of 8 values, I can

calculate:



On the 1st level:

Y0 = X0 + X1, Y1=X2+X3, Y2=X4+X5, Y3=X6+X7 (4 adders)



On the 2nd level:

V0 = Y0+Y1, V1=Y2+Y3 (2 adders)



On the 3rd level:

Z = V0+V1 (1 adder)



If each level is equipped with a pipeline register, I can get a short

critical path,

and the final latency is equal to 3 clocks, the new values may be

entered every clock,

and the result is availeble every clock. The whole design uses 7 adders..



However modern FPGAs are equipped with adders using long words. E.g. the

Xilinx family 7 chips use adders with 25 bit input operands.

If we assume, that the input values are encoded only at 5 bits, we can

significantly reduce consumption of resources.

Lets encode input words X0..X7 on bits of operands on the 1st level as

follows:

A(4 downto 0)=X0; A(5)='0';

A(10 downto 6)=X2; A(11)='0';

A(16 downto 12)=X4; A(17)='0';

A(22 downto 18)=X6; A(23)='0';



B(4 downto 0)=X1; B(5)='0';

B(10 downto 6)=X3; B(11)='0';

B(16 downto 12)=X5; B(17)='0';

B(22 downto 18)=X7; B(23)='0';



Then on the first layer we can perform all calculations using only

single adder:

C=A+B, and sub-sums are encoded as follows:

C(5 downto 0)= X0+X1=Y0; C(11 downto 6)=X2+X3=Y1; C(17 downto

12)=X4+X5=Y2; C(23 downto 18)=X6+X7=Y3



On the 2nd level we work with 6-bit values (7-bit after addition of

leading '0'), so we can perform up to 3 additions in a single adder (but

we need only 2)



D(5 downto 0)=Y0; D(6)='0'; D(12 downto 7)=Y2; D(13)='0';

E(5 downto 0)=Y1; D(6)='0'; D(12 downto 7)=Y3; D(13)='0';



After addition:

F=D+E we get:

F(6 downto 0)=Y0+Y1=V0 ; F(13 downto 7)=Y2+Y3=V1;



The final addition may be performed in a standard way.

Please note, that now we had to use only 3 adders!



The interesting problem is, how we should organize our adders' tree for

different lengths of word in the adders, different lengths of the input

values and different number of the input values.

I've prepared a simple tool, written in Python, which automatically

generates the appropriate VHDL code.

The sources have been posted on the usenet alt.sources group:

Subject: Generator of VHDL code for parallel adder (using long-word

hardware adders to perform multiple additions in parallel)

Google archive:

https://groups.google.com/forum/#!topic/alt.sources/8DqGgELScDM



However I'm interested if it can be done in pure VHDL?

I hope that the above idea will be useful for someone...





Not a direct answer to your questions... have you researched Wallace

Trees? Like most solutions to computer arithmetic, it dates from about

half a century ago.

Wallace tried to find a way of making a faster hardware multiplier. Part

of that involves adding a large number of binary numbers, and Wallace's

solution allowed for a shorter critical path in the adder tree.



http://en.wikipedia.org/wiki/Wallace_tree



BTW, yes, it can be done in pure VHDL. Expect to use a lot of if-

generate and for-generate constructs, possibly with functions to work out

the ranges of the for-generates.



A few years back a wrote a "popcount" (e.g. population count) module in

VHDL. It had an input width that was configurable, and added up the

number of '1' bits in that input word. E.g. a 512 bit input would result

in a 10 bit output word. The whole thing worked in only a modest number

of levels of logic and was moderately fast (180MHz after PAR in the logic

of the day) because I used a Wallace Tree. The first stages (which were

adding six 1 bit numbers together) were implemented directly in six-input

LUTs. The second (and perhaps third) stages were implemented using

similar LUTs. I only switched to using behavioural (synthesising to

ripple carry) adders once the words became wider a few stages into the

tree.



Regards,

Allan

Thanks a lot for suggestions.
I have not described it, but in my design I have yet another constraint.
I need to squeeze as much of those summing systems in single FPGA, while maintaining latency on the lowest possible level.

That's why I tried to fully utilize the hardware features of my FPGA, and reuse adders to perform multiple operations in parallel.

In fact my problems starts at the level where "the words become wider" (quoting from your post).

Regards,
Wojtek
 

Welcome to EDABoard.com

Sponsor

Back
Top