modulo 2**32-1 arith

I

Ilya Kalistru

Guest
Hello.
I need to add two unsigned numbers modulo 2**32-1.
Now it's done in very inefficient way: at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that.
Does anybody know a better way to do that?
 
Ilya Kalistru wrote:
Hello.
I need to add two unsigned numbers modulo 2**32-1.
Now it's done in very inefficient way: at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that.
Does anybody know a better way to do that?

Sounds like you want an "end around carry." That means that the
carry out of a 32-bit full adder wraps back to its carry in. In
that way a 32-bit full adder would do what you want in one cycle.

--
Gabor
 
On Tuesday, December 15, 2015 at 5:32:14 AM UTC-5, Ilya Kalistru wrote:
Hello.
I need to add two unsigned numbers modulo 2**32-1.
Now it's done in very inefficient way: at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that.
Does anybody know a better way to do that?

Maybe you should revisit the need for 'modulo 2**32-1' instead of 'modulo 2**32'. Synthesis results of your method and Rickman's method indicate that the modulo portion is consuming significantly more logic than the addition itself. Given the code posted below, here are the synthesis results using Quartus to target a Cyclone IV GX:

Method# Logic Elements Notes
======= ============== ====0 32 Sum <= A+B, no 'module 2**32-1' as a baseline
1 32 Sum <= A+B+1, as another reference point
2 76 Ilya's method
3 98 Rickman's method

Kevin Jennings

===== START OF CODE ====library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity Custom_Adder is generic(METHOD: integer range 0 to 3);
port(
A: in unsigned(31 downto 0);
B: in unsigned(31 downto 0);
C: out unsigned(31 downto 0)
);
end Custom_Adder;

architecture RTL of Custom_Adder is
signal Sum: unsigned(32 downto 0);
signal Sum_P1: unsigned(32 downto 0);
constant MAX: unsigned(32 downto 0) := '0' & x"FFFF_FFFF";
begin
Sum <= resize(A, Sum'length) + resize(B, Sum'length);
Sum_P1 <= Sum + 1;

-- KJ note: Performing all calculations on one clock cycle in order to determine logic cells.

GEN_METHOD0: if (METHOD = 0) generate
C <= Sum(C'range);
end generate GEN_METHOD0;

GEN_METHOD1: if (METHOD = 1) generate
C <= Sum_P1(C'range);
end generate GEN_METHOD1;

GEN_METHOD2: if (METHOD = 2) generate
-- Ilya's method:
-- at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result
-- and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that
C <= Sum_P1(C'range) when (Sum >= MAX) else Sum(C'range);
end generate GEN_METHOD2;

GEN_METHOD3: if (METHOD = 3) generate
signal Mod_Res: unsigned(32 downto 0);
begin
-- sum <= RESIZE(a, 33) + RESIZE(b, 33);
-- temp <= sum + 1;
-- mod_result <= sum + temp(32);
-- answer <= mod_result(31 downto 0);
-- The computation of 'mod_result' above
Mod_Res <= Sum + unsigned'(Sum_P1(32 downto 32));
C <= Sum_P1(C'range) when (Sum_P1(32) = '1') else Sum(C'range);
end generate GEN_METHOD3;
end RTL;
===== END OF CODE =====
 
On Tuesday, December 15, 2015 at 3:38:57 PM UTC-5, KJ wrote:
OOPS, correction to the Method3 code:
Original: C <= Sum_P1(C'range) when (Sum_P1(32) = '1') else C <= Mod_Res(C'range);
Corrected: C <= Mod_Res(C'range);

Number of logic elements used is 97 rather than 98.

Kevin
 
GaborSzakacs wrote:
Ilya Kalistru wrote:
Hello.
I need to add two unsigned numbers modulo 2**32-1.
Now it's done in very inefficient way: at first clock cycle there is
simple addition of two 32-bit unsigned numbers with 33-bit result and
on second cycle if the result >= 2**32-1, we add 1 and take only 32
bits of that.
Does anybody know a better way to do that?

Sounds like you want an "end around carry." That means that the
carry out of a 32-bit full adder wraps back to its carry in. In
that way a 32-bit full adder would do what you want in one cycle.

Hmmm... On second thought, that would not handle the case where
the inputs add up to exactly 2**32-1, since there would be no carry
out, but you would want to add one to end up with zero.

--
Gabor
 
On 12/15/2015 5:32 AM, Ilya Kalistru wrote:
Hello. I need to add two unsigned numbers modulo 2**32-1. Now it's
done in very inefficient way: at first clock cycle there is simple
addition of two 32-bit unsigned numbers with 33-bit result and on
second cycle if the result >= 2**32-1, we add 1 and take only 32 bits
of that. Does anybody know a better way to do that?

Not sure how efficient your implementation would be this way. You can
achieve the same result by adding one to the sum and adding the high bit
of this result to the original sum to get your answer. This should give
a simpler result because of the comparison your approach uses requires a
full adder rather than the incrementer of this approach.


signal sum, temp, mod_result : unsigned (32 downto 0);
signal answer : unsigned (31 downto 0);

sum <= RESIZE(a, 33) + RESIZE(b, 33);

temp <= sum + 1;

mod_result <= sum + temp(32);

answer <= mod_result(31 downto 0);


Doing the addition with an end around carry will not cover the case of
the sum being 2^n - 1.

No guarantees of the code above. I'm a bit rusty these days.

--

Rick
 
KJ wrote:

On Tuesday, December 15, 2015 at 5:32:14 AM UTC-5, Ilya Kalistru wrote:
Hello.
I need to add two unsigned numbers modulo 2**32-1.
Now it's done in very inefficient way: at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that.
Does anybody know a better way to do that?

Maybe you should revisit the need for 'modulo 2**32-1' instead of 'modulo 2**32'. Synthesis results of your method and Rickman's method indicate that the modulo portion is consuming significantly more logic than the addition itself. Given the code posted below, here are the synthesis results using Quartus to target a Cyclone IV GX:

Method# Logic Elements Notes
======= ============== =====
0 32 Sum <= A+B, no 'module 2**32-1' as a baseline
1 32 Sum <= A+B+1, as another reference point
2 76 Ilya's method
3 98 Rickman's method

Kevin Jennings

===== START OF CODE =====
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity Custom_Adder is generic(METHOD: integer range 0 to 3);
port(
A: in unsigned(31 downto 0);
B: in unsigned(31 downto 0);
C: out unsigned(31 downto 0)
);
end Custom_Adder;

architecture RTL of Custom_Adder is
signal Sum: unsigned(32 downto 0);
signal Sum_P1: unsigned(32 downto 0);
constant MAX: unsigned(32 downto 0) := '0' & x"FFFF_FFFF";
begin
Sum <= resize(A, Sum'length) + resize(B, Sum'length);
Sum_P1 <= Sum + 1;

-- KJ note: Performing all calculations on one clock cycle in order to determine logic cells.

GEN_METHOD0: if (METHOD = 0) generate
C <= Sum(C'range);
end generate GEN_METHOD0;

GEN_METHOD1: if (METHOD = 1) generate
C <= Sum_P1(C'range);
end generate GEN_METHOD1;

GEN_METHOD2: if (METHOD = 2) generate
-- Ilya's method:
-- at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result
-- and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that
C <= Sum_P1(C'range) when (Sum >= MAX) else Sum(C'range);
end generate GEN_METHOD2;

GEN_METHOD3: if (METHOD = 3) generate
signal Mod_Res: unsigned(32 downto 0);
begin
-- sum <= RESIZE(a, 33) + RESIZE(b, 33);
-- temp <= sum + 1;
-- mod_result <= sum + temp(32);
-- answer <= mod_result(31 downto 0);
-- The computation of 'mod_result' above
Mod_Res <= Sum + unsigned'(Sum_P1(32 downto 32));
C <= Sum_P1(C'range) when (Sum_P1(32) = '1') else Sum(C'range);
end generate GEN_METHOD3;
end RTL;
===== END OF CODE =====

I'm guessing the requirement for modulo 2**32-1 is driven by the
algorithm, possibly some checksummy sort of thing. I know Fletcher uses
2**N-1 modulo pretty heavily.

If your concern is speed rather than size, you could probably do it by
running two parallel adders, Y0 = A+B and Y1 = A+B+1. If the Y1
add carries out then Y = Y1 else Y = Y0.

Alternatively, the way you do it in an optimized Fletcher is in blocks.
Add a whole bunch of samples together with sufficient extra bits on the
high end to count the overflows, then periodicially stop taking in new
data and add the overflows back into the LSBs. You could easily get
that "periodically" down to once out of every 1024 samples.

--
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order. See above to fix.
 
On 12/15/2015 4:31 PM, Rob Gaddi wrote:
KJ wrote:

On Tuesday, December 15, 2015 at 5:32:14 AM UTC-5, Ilya Kalistru wrote:
Hello.
I need to add two unsigned numbers modulo 2**32-1.
Now it's done in very inefficient way: at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that.
Does anybody know a better way to do that?

Maybe you should revisit the need for 'modulo 2**32-1' instead of 'modulo 2**32'. Synthesis results of your method and Rickman's method indicate that the modulo portion is consuming significantly more logic than the addition itself. Given the code posted below, here are the synthesis results using Quartus to target a Cyclone IV GX:

Method# Logic Elements Notes
======= ============== =====
0 32 Sum <= A+B, no 'module 2**32-1' as a baseline
1 32 Sum <= A+B+1, as another reference point
2 76 Ilya's method
3 98 Rickman's method

Kevin Jennings

I'm not clear on how "Ilya's method" uses only 76 LEs. I assume an LE is
a 4 input LUT and/or a register. I count at least 131. Producing Sum
uses 33, producing Sum_P1 uses another 33, evaluating (Sum >= MAX) uses
33 then there are 32 used in the mux to select the result. How can that
reduce to 76 LEs?


===== START OF CODE =====
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity Custom_Adder is generic(METHOD: integer range 0 to 3);
port(
A: in unsigned(31 downto 0);
B: in unsigned(31 downto 0);
C: out unsigned(31 downto 0)
);
end Custom_Adder;

architecture RTL of Custom_Adder is
signal Sum: unsigned(32 downto 0);
signal Sum_P1: unsigned(32 downto 0);
constant MAX: unsigned(32 downto 0) := '0' & x"FFFF_FFFF";
begin
Sum <= resize(A, Sum'length) + resize(B, Sum'length);
Sum_P1 <= Sum + 1;

-- KJ note: Performing all calculations on one clock cycle in order to determine logic cells.

GEN_METHOD0: if (METHOD = 0) generate
C <= Sum(C'range);
end generate GEN_METHOD0;

GEN_METHOD1: if (METHOD = 1) generate
C <= Sum_P1(C'range);
end generate GEN_METHOD1;

GEN_METHOD2: if (METHOD = 2) generate
-- Ilya's method:
-- at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result
-- and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that
C <= Sum_P1(C'range) when (Sum >= MAX) else Sum(C'range);
end generate GEN_METHOD2;

GEN_METHOD3: if (METHOD = 3) generate
signal Mod_Res: unsigned(32 downto 0);
begin
-- sum <= RESIZE(a, 33) + RESIZE(b, 33);
-- temp <= sum + 1;
-- mod_result <= sum + temp(32);
-- answer <= mod_result(31 downto 0);
-- The computation of 'mod_result' above
Mod_Res <= Sum + unsigned'(Sum_P1(32 downto 32));
C <= Sum_P1(C'range) when (Sum_P1(32) = '1') else Sum(C'range);
end generate GEN_METHOD3;
end RTL;
===== END OF CODE =====

I'm guessing the requirement for modulo 2**32-1 is driven by the
algorithm, possibly some checksummy sort of thing. I know Fletcher uses
2**N-1 modulo pretty heavily.

If your concern is speed rather than size, you could probably do it by
running two parallel adders, Y0 = A+B and Y1 = A+B+1. If the Y1
add carries out then Y = Y1 else Y = Y0.

That would reduce to the same complexity as my method as implemented
above. My approach can be optimized by producing temp (Y1) from the two
inputs. Then in parallel adding bit 32 into the sum of the two inputs
that produces Y. Take Y modulo 2^n as the final step using 66 LEs. Like
this...

signal Y: unsigned(31 downto 0);
signal Y, Y1, A, B: unsigned(32 downto 0);
Y1 <= A + B + 1;
Y <= resize(A + B + resize(Y1(32 downto 32),33), 32);

If the tool is of any use it will add the upper bit of Y1 as a carry in
to the sum of A + B utilizing a total of 65 LEs.

--

Rick
 
On Tuesday, December 15, 2015 at 7:16:42 PM UTC-5, rickman wrote:
I'm not clear on how "Ilya's method" uses only 76 LEs. I assume an LE is
a 4 input LUT and/or a register.

The target device family was stated.

I count at least 131. Producing Sum
uses 33, producing Sum_P1 uses another 33, evaluating (Sum >= MAX) uses
33 then there are 32 used in the mux to select the result. How can that
reduce to 76 LEs?

It reduces because synthesis does not work how you how you have described it above.

That would reduce to the same complexity as my method as implemented
above. My approach can be optimized by producing temp (Y1) from the two
inputs. Then in parallel adding bit 32 into the sum of the two inputs
that produces Y. Take Y modulo 2^n as the final step using 66 LEs. Like
this...

signal Y: unsigned(31 downto 0);
signal Y, Y1, A, B: unsigned(32 downto 0);
Y1 <= A + B + 1;
Y <= resize(A + B + resize(Y1(32 downto 32),33), 32);

If the tool is of any use it will add the upper bit of Y1 as a carry in
to the sum of A + B utilizing a total of 65 LEs.
Your updated algorithm is actually the same as your first. There are also a few problems with your code that indicate that you did not bother to compile your design to produce actual results so your conclusions are simply speculating (and they are incorrect).
- You've declared 'Y' twice, once to be a 32 bit vector, the other 33 bits. Minor thing, easily fixed
- The calculation of Y1 is not correct and will result in Y1(32) always being 0 which then affects the computation of Y in the next line. This error functionally changes the output to simply be the 32 bit sum of the inputs mod 2^32, not mod 2^32-1 as requested.
- The code you posted with the fix to remove the declaration of Y as a 33 bit number results in 32 logic cell usage not 66 as you speculated. However this number is not really meaningful because it is not computing what the OP wanted due to the second error. When that second error is corrected as shown in the code posted below, it result in the same 97 logic cells as your originally posted method.

Bottom line is that what you have outlined for your method takes more logic simply because there is an additional adder required for your method that is not needed with Ilya's method.

You and Rob are also not correct in thinking that computing Sum+1 as A+B+1 will save anything. In the updated code I've posted at the end, I've added another generic control called SP1_METHOD which is used to control how 'Sum + 1' gets computed. Sum + 1 can now be computed as:
Sum_P1 <= Sum + 1; (as it was in the original code)
or
Sum_P1 <= A+B+1;

Quartus sees through the smoke and produces the exact same logic independent of SP1_METHOD. However, Quartus can be made to overshoot: if you compute Sum + 1 as 'A+1+B' rather than 'A+B+1', then the resource usage shoots up from 76 to 108 using Ilya's method. This is the result posted for Method=6, SP1_Method=1. Obviously, Quartus is able to spot the common 'A+B' expression and not recompute it. The reason why changing the order of the two adds makes a difference is because the VHDL language specifies that the expression be evaluated from left to right and 'A+B+1' is not the same as 'A+1+B' in that context. The LRM does not allow for a 'commutative property of addition'. To take advantage of that mathematical property, the code must be explicitly written in a way that takes advantage of it.

Back to the OP's problem. Barring a discovery of some other mathematical relationship that can be taken advantage of (which is what is really needed), what Ilya originally described is optimal. In fact, spreading the computation over two clock cycles as he described is the best approach. Whereas computing the output all within a single clock cycle takes 76 logic cells, spreading the computation into two clocks only takes 65 (see new Method=5). The tradeoff is the additional clock cycle of latency. Whether or not that is important in Ilya's application is for him to decide.

Results for all of the variations are:

Logic Elements
Method# SP1=0 SP1=1 Notes
0 32 32 C <= A+B, no modulo '2**32-1' as a baseline
1 32 32 C <= A+B+1, as another reference point
2 76 76 Ilya's method although implemented all in one clock cycle
3 97 97 Rickman's method #1
4 97 97 Rickman's method #2
5 65 65 Ilya's method as stated, two clock cycles
6 76 108 Same as method 2, but when SP1_METHOD=1 it will computes Sum + 1 as A+1+B rather than A+B+1

Kevin Jennings

==== START OF CODE ===library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity Custom_Adder is generic(
SP1_METHOD: integer range 0 to 1; -- Controls how 'Sum + 1' is calculated
METHOD: integer range 0 to 6);
port(
Clock: in std_ulogic;
A: in unsigned(31 downto 0);
B: in unsigned(31 downto 0);
C: out unsigned(31 downto 0)
);
end Custom_Adder;

-- Results:
-- Logic Elements
-- Method# SP1=0 SP1=1 Notes
-- 0 32 32 C <= A+B, no modulo '2**32-1' as a baseline
-- 1 32 32 C <= A+B+1, as another reference point
-- 2 76 76 Ilya's method although implemented all in one clock cycle
-- 3 97 97 Rickman's method #1
-- 4 97 97 Rickman's method #2
-- 5 65 65 Ilya's method as stated, two clock cycles
-- 6 76 108 Same as method 2, but when SP1_METHOD=1 it will computes Sum + 1 as A+1+B rather than A+B+1
architecture RTL of Custom_Adder is
signal Sum: unsigned(32 downto 0);
signal Sum_P1: unsigned(32 downto 0);
constant MAX: unsigned(32 downto 0) := '0' & x"FFFF_FFFF";
begin
Sum <= resize(A, Sum'length) + resize(B, Sum'length);
Sum_P1 <= Sum + 1 when (SP1_METHOD = 0) else resize(A, Sum'length) + resize(B, Sum'length) + 1;
-- KJ note: If you change Sum_P1 to compute A+1+B rather than A+B+1 then the number of logic cells
-- increases by 27 (i.e. almost one for each bit of the adder

-- KJ note: Performing all calculations on one clock cycle in order to determine logic cells.

GEN_METHOD0: if (METHOD = 0) generate
C <= Sum(C'range);
end generate GEN_METHOD0;

GEN_METHOD1: if (METHOD = 1) generate
C <= Sum_P1(C'range);
end generate GEN_METHOD1;

GEN_METHOD2: if (METHOD = 2) generate
-- Ilya's method, implemented all in one clock cycle:
-- at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result
-- and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that
C <= Sum_P1(C'range) when (Sum >= MAX) else Sum(C'range);
end generate GEN_METHOD2;

GEN_METHOD3: if (METHOD = 3) generate
signal Mod_Res: unsigned(32 downto 0);
begin
-- Rickman's first method
-- sum <= RESIZE(a, 33) + RESIZE(b, 33);
-- temp <= sum + 1;
-- mod_result <= sum + temp(32);
-- answer <= mod_result(31 downto 0);
Mod_Res <= Sum + unsigned'(Sum_P1(32 downto 32));
C <= Mod_Res(C'range);
end generate GEN_METHOD3;
GEN_METHOD4: if (METHOD = 4) generate
-- Rickman's second method
-- signal Y: unsigned(31 downto 0);
-- signal Y, Y1, A, B: unsigned(32 downto 0); <-- KJ note: Redclares 'Y'
-- Y1 <= A + B + 1; <-- KJ note: Error: must have 33 elements not 32
-- Y <= resize(A + B + resize(Y1(32 downto 32),33), 32);
signal Y: unsigned(31 downto 0);
signal Y1: unsigned(32 downto 0);
begin
Y1 <= resize(A, 33) + resize(B, 33) + 1;
Y <= resize(A + B + resize(Y1(32 downto 32),33), 32);
C <= Y;
end generate GEN_METHOD4;

GEN_METHOD5: if (METHOD = 5) generate
-- Ilya's method:
-- at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result
-- and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that
signal Sum_Dlyd: unsigned(Sum'range);
signal Sum_Dlyd_P1: unsigned(Sum'range);
begin
Sum_Dlyd <= Sum when rising_edge(Clock);
Sum_Dlyd_P1 <= Sum_Dlyd + 1;
C <= Sum_Dlyd_P1(C'range) when (Sum_Dlyd(32) = '1') else Sum_Dlyd(C'range);
end generate GEN_METHOD5;

GEN_METHOD6: if (METHOD = 6) generate
-- Same as method 2 (Ilya's method, implemented all in one clock cycle) except
-- the ordering of operands for the computation of 'Sum + 1' is modified.
signal Sum_P1: unsigned(Sum'range);
begin
Sum_P1 <= Sum + 1 when (SP1_METHOD = 0) else resize(A, Sum'length) + 1 + resize(B, Sum'length);
C <= Sum_P1(C'range) when (Sum >= MAX) else Sum(C'range);
end generate GEN_METHOD6;

end RTL;
==== END OF CODE ====
 
On Tuesday, December 15, 2015 at 4:33:51 PM UTC-5, Rob Gaddi wrote:
I'm guessing the requirement for modulo 2**32-1 is driven by the
algorithm, possibly some checksummy sort of thing. I know Fletcher uses
2**N-1 modulo pretty heavily.

Could be right, at least there is some possible context now to the question.. Thanks.

If your concern is speed rather than size, you could probably do it by
running two parallel adders, Y0 = A+B and Y1 = A+B+1. If the Y1
add carries out then Y = Y1 else Y = Y0.

See my earlier post today. Quartus will implement 'Y1=A+B+1' exactly the same as 'Y1=Y+1'.

Alternatively, the way you do it in an optimized Fletcher is in blocks.
Add a whole bunch of samples together with sufficient extra bits on the
high end to count the overflows, then periodicially stop taking in new
data and add the overflows back into the LSBs. You could easily get
that "periodically" down to once out of every 1024 samples.

Interesting idea, but I don't think it saves anything over Ilya's method. Ilya's method as originally specified (i.e. spread out over two clock cycles) takes essentially 2 logic cells per bit (but it takes ~2.4 per bit for single clock cycle). When it comes time to periodically update the overflows that you mentioned, you will end up with another adder and consume another 32 logic cells (one per bit).

Kevin Jennings
 
Am Dienstag, 15. Dezember 2015 11:32:14 UTC+1 schrieb Ilya Kalistru:
Hello.
I need to add two unsigned numbers modulo 2**32-1.
Now it's done in very inefficient way: at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that.
Does anybody know a better way to do that?

Which logic family?
What is "better"? Less ressources? Faster? Do you need a result every cycle?

If you can do with a result every other cycle, you could try following:
- Make an 32b adder with carry in and carry out
- First cycle: Add the two numbers with carry.
- Second cycle: Check if the output carry is set. If yes, we already have the result. Otherwise clear the input carry and use the new result.

This should be below 40 LEs. But I have not tried it out...

Regards,

Thomas

www.entner-electronics.com - Home of EEBlaster
 
KJ wrote:

On Tuesday, December 15, 2015 at 4:33:51 PM UTC-5, Rob Gaddi wrote:
I'm guessing the requirement for modulo 2**32-1 is driven by the
algorithm, possibly some checksummy sort of thing. I know Fletcher uses
2**N-1 modulo pretty heavily.


Could be right, at least there is some possible context now to the question. Thanks.

If your concern is speed rather than size, you could probably do it by
running two parallel adders, Y0 = A+B and Y1 = A+B+1. If the Y1
add carries out then Y = Y1 else Y = Y0.


See my earlier post today. Quartus will implement 'Y1=A+B+1' exactly the same as 'Y1=Y+1'.

Y0=A+B is a 32-bit adder with CIN=0. Y1=A+B+1 is a second 32-bit adder
with CIN=1. Choosing between them is a 32-bit 2:1 mux. So my rough
math puts my answer at 96 LEs and 32 flops. But the worst case
propagation path is from the LSB, through the 32-bit Y1 carry chain, to
the mux select, to the output, which should be pretty screamingly
fast.

Whereas Ilya's (and rickman's as well, if I'm reading it right) use the
carry-out of the A+B+1 add as the carry in to the "real" add. That
gives you the total prop delay of a 64-bit carry chain, plus some slop.

The OP, who seems to have vanished off into the ether, never specified
what "efficiency" he was trying to optimize on, or whether pipeline
delays were acceptable, etc.

Alternatively, the way you do it in an optimized Fletcher is in blocks.
Add a whole bunch of samples together with sufficient extra bits on the
high end to count the overflows, then periodicially stop taking in new
data and add the overflows back into the LSBs. You could easily get
that "periodically" down to once out of every 1024 samples.

Interesting idea, but I don't think it saves anything over Ilya's method. Ilya's method as originally specified (i.e. spread out over two clock cycles) takes essentially 2 logic cells per bit (but it takes ~2.4 per bit for single clock cycle). When it comes time to periodically update the overflows that you mentioned, you will end up with another adder and consume another 32 logic cells (one per bit).

My above comment again. If I'm right about the checksum, then Ilya's
method (at least on a reading) gives you an operating duty cycle of
50%, you can put new data no faster than every other clock because
you're waiting for the result of that "add one more decision" to
percolate back around. Leave out the pipeline registers and your fmax
gets ugly.

If you just carry the excess and cook it periodically, you get a "data
accept" duty cycle of well past 99%. There may be some cleverer way to
get to a 100% duty cycle pipeline, but with only one cup of coffee in me
I don't see it.

> Kevin Jennings

--
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order. See above to fix.
 
On Wednesday, December 16, 2015 at 12:12:01 PM UTC-5, rickman wrote:
The tool was not using the carry in. Here is a version that uses the
carry in. 66 LEs in one clock cycle. It can be reduced to about 40 LEs
if done in two clock cycles.
Here is the fitter summary showing 98 logic cells for this design that you posted. Presumably you're using a different tool or different target device to come up with 66 so that number only becomes relevant if you use that same tool and target device to synthesize all of the other variants.
+---------------------------------------------------------------------------------+
; Fitter Summary ;
+------------------------------------+--------------------------------------------+
; Fitter Status ; Successful - Wed Dec 16 12:31:58 2015 ;
; Quartus II 64-Bit Version ; 13.1.0 Build 162 10/23/2013 SJ Web Edition ;
; Revision Name ; Junk ;
; Top-level Entity Name ; FPGA ;
; Family ; Cyclone IV GX ;
; Device ; EP4CGX22CF19C6 ;
; Timing Models ; Final ;
; Total logic elements ; 98 / 21,280 ( < 1 % ) ;
; Total combinational functions ; 98 / 21,280 ( < 1 % ) ;
; Dedicated logic registers ; 0 / 21,280 ( 0 % ) ;
; Total registers ; 0 ;
; Total pins ; 96 / 167 ( 57 % ) ;
; Total virtual pins ; 0 ;
; Total memory bits ; 0 / 774,144 ( 0 % ) ;
; Embedded Multiplier 9-bit elements ; 0 / 80 ( 0 % ) ;
; Total GXB Receiver Channel PCS ; 0 / 4 ( 0 % ) ;
; Total GXB Receiver Channel PMA ; 0 / 4 ( 0 % ) ;
; Total GXB Transmitter Channel PCS ; 0 / 4 ( 0 % ) ;
; Total GXB Transmitter Channel PMA ; 0 / 4 ( 0 % ) ;
; Total PLLs ; 0 / 4 ( 0 % ) ;
+------------------------------------+--------------------------------------------+

Kevin Jennings
 
On Wednesday, December 16, 2015 at 9:23:49 AM UTC-5, thomas....@gmail.com
If you can do with a result every other cycle, you could try following:
- Make an 32b adder with carry in and carry out
- First cycle: Add the two numbers with carry.
- Second cycle: Check if the output carry is set. If yes, we already have the result. Otherwise clear the input carry and use the new result.

This should be below 40 LEs. But I have not tried it out...

I have yours coming in at 36 if I coded up what you said correctly, not tested. You're in the lead now with Ilya in second.

Kevin Jennings

==== START OF CODE ====
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity Custom_Adder is generic(
SP1_METHOD: integer range 0 to 1; -- Controls how 'Sum + 1' is calculated
METHOD: integer range 0 to 7);
port(
Clock: in std_ulogic;
A: in unsigned(31 downto 0);
B: in unsigned(31 downto 0);
C: out unsigned(31 downto 0);
Valid: out std_ulogic
);
end Custom_Adder;

-- Results:
-- Logic Elements
-- Method# SP1=0 SP1=1 Notes
-- 0 32 32 C <= A+B, no modulo '2**32-1' as a baseline
-- 1 32 32 C <= A+B+1, as another reference point
-- 2 76 76 Ilya's method although implemented all in one clock cycle
-- 3 97 97 Rickman's method #1
-- 4 97 97 Rickman's method #2
-- 5 65 65 Ilya's method as stated, two clock cycles
-- 6 76 108 Same as method 2, but when SP1_METHOD=1 it will computes Sum + 1 as A+1+B rather than A+B+1
-- 7 36 36 Thomas' method, two clock cycles

architecture RTL of Custom_Adder is
signal Sum: unsigned(32 downto 0);
signal Sum_P1: unsigned(32 downto 0);
constant MAX: unsigned(32 downto 0) := '0' & x"FFFF_FFFF";
begin
Sum <= resize(A, Sum'length) + resize(B, Sum'length);
Sum_P1 <= Sum + 1 when (SP1_METHOD = 0) else resize(A, Sum'length) + resize(B, Sum'length) + 1;
-- KJ note: If you change Sum_P1 to compute A+1+B rather than A+B+1 then the number of logic cells
-- increases by 27 (i.e. almost one for each bit of the adder

-- KJ note: Performing all calculations on one clock cycle in order to determine logic cells.

GEN_METHOD0: if (METHOD = 0) generate
C <= Sum(C'range);
end generate GEN_METHOD0;

GEN_METHOD1: if (METHOD = 1) generate
C <= Sum_P1(C'range);
end generate GEN_METHOD1;

GEN_METHOD2: if (METHOD = 2) generate
-- Ilya's method, implemented all in one clock cycle:
-- at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result
-- and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that
C <= Sum_P1(C'range) when (Sum >= MAX) else Sum(C'range);
end generate GEN_METHOD2;

GEN_METHOD3: if (METHOD = 3) generate
signal Mod_Res: unsigned(32 downto 0);
begin
-- Rickman's first method
-- sum <= RESIZE(a, 33) + RESIZE(b, 33);
-- temp <= sum + 1;
-- mod_result <= sum + temp(32);
-- answer <= mod_result(31 downto 0);
Mod_Res <= Sum + unsigned'(Sum_P1(32 downto 32));
C <= Mod_Res(C'range);
end generate GEN_METHOD3;
GEN_METHOD4: if (METHOD = 4) generate
-- Rickman's second method
-- signal Y: unsigned(31 downto 0);
-- signal Y, Y1, A, B: unsigned(32 downto 0); <-- KJ note: Redclares 'Y'
-- Y1 <= A + B + 1; <-- KJ note: Error: must have 33 elements not 32
-- Y <= resize(A + B + resize(Y1(32 downto 32),33), 32);
signal Y: unsigned(31 downto 0);
signal Y1: unsigned(32 downto 0);
begin
Y1 <= resize(A, 33) + resize(B, 33) + 1;
Y <= resize(A + B + resize(Y1(32 downto 32),33), 32);
C <= Y;
end generate GEN_METHOD4;

GEN_METHOD5: if (METHOD = 5) generate
-- Ilya's method:
-- at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result
-- and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that
signal Sum_Dlyd: unsigned(Sum'range);
signal Sum_Dlyd_P1: unsigned(Sum'range);
begin
Sum_Dlyd <= Sum when rising_edge(Clock);
Sum_Dlyd_P1 <= Sum_Dlyd + 1;
C <= Sum_Dlyd_P1(C'range) when (Sum_Dlyd(32) = '1') else Sum_Dlyd(C'range);
end generate GEN_METHOD5;

GEN_METHOD6: if (METHOD = 6) generate
-- Same as method 2 (Ilya's method, implemented all in one clock cycle) except
-- the ordering of operands for the computation of 'Sum + 1' is modified.
signal Sum_P1: unsigned(Sum'range);
begin
Sum_P1 <= Sum + 1 when (SP1_METHOD = 0) else resize(A, Sum'length) + 1 + resize(B, Sum'length);
C <= Sum_P1(C'range) when (Sum >= MAX) else Sum(C'range);
end generate GEN_METHOD6;

GEN_METHOD7: if (METHOD = 7) generate
-- Thomas' method
-- Make an 32b adder with carry in and carry out
-- First cycle: Add the two numbers with carry.
-- Second cycle: Check if the output carry is set. If yes, we already have the result. Otherwise clear the input carry and use the new result.
-- This should be below 40 LEs. But I have not tried it out...
signal Sum_With_Carry: unsigned(Sum'range);
signal Carry_In: unsigned(0 downto 0);
signal Carry_Out: std_ulogic;
begin
Carry_In <= not(Carry_In);
Sum_With_Carry <= resize(A, Sum_With_Carry'length) + resize(B, Sum_With_Carry'length) + Carry_In;
Carry_Out <= Sum_With_Carry(32) when rising_edge(Clock);
C <= Sum_With_Carry(C'range);
Valid <= (Carry_In(0) and Sum_With_Carry(32)) or (not(Carry_In(0)) and not(Carry_Out));
end generate GEN_METHOD7;
end RTL;
==== END OF CODE ====
 
On Wednesday, December 16, 2015 at 12:45:05 PM UTC-5, KJ wrote:

Correction to Method 7:
Posted: Carry_In <= not(Carry_In);
Supposed to be: Carry_In <= not(Carry_In) when rising_edge(Clock);

Results are not changed, still 36 Logic Elements.

Kevin
 
On Wednesday, December 16, 2015 at 12:12:15 PM UTC-5, Rob Gaddi wrote:
If I'm right about the checksum, then Ilya's
method (at least on a reading) gives you an operating duty cycle of
50%, you can put new data no faster than every other clock because
you're waiting for the result of that "add one more decision" to
percolate back around. Leave out the pipeline registers and your fmax
gets ugly.

Ilya's method (Method=6 in the code I posted) should be computing one output every clock cycle. The output will just have a one clock cycle latency compared with the pure combinatorial logic version (Method=2). Thomas' method (Method=7) though will only accept one input every other clock cycle but consumes roughly half the logic resource.

Kevin Jennings
 
The tool was not using the carry in. Here is a version that uses the
carry in. 66 LEs in one clock cycle. It can be reduced to about 40 LEs
if done in two clock cycles.

BTW, did you test any of the designs you synthesized? How do you know
they actually work?

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use std.textio.all;

ENTITY FPGA IS
GENERIC (
WIDTH : POSITIVE := 32
);
port(
-- Board Clock
A, B : in unsigned(WIDTH - 1 downto 0);
Y : out unsigned(WIDTH - 1 downto 0)
);
END FPGA;

ARCHITECTURE behavior OF FPGA IS

signal Y1 : unsigned(WIDTH downto 0);
signal temp : unsigned(WIDTH downto 0);

BEGIN

Y1 <= ('0' & A) + ('0' & B) + 1;
temp <= (A & '1') + (B & Y1(32));
Y <= temp (WIDTH downto 1);

END;

--

Rick
 
On 12/16/2015 12:36 PM, KJ wrote:
On Wednesday, December 16, 2015 at 12:12:01 PM UTC-5, rickman wrote:
The tool was not using the carry in. Here is a version that uses the
carry in. 66 LEs in one clock cycle. It can be reduced to about 40 LEs
if done in two clock cycles.

Here is the fitter summary showing 98 logic cells for this design that you posted. Presumably you're using a different tool or different target device to come up with 66 so that number only becomes relevant if you use that same tool and target device to synthesize all of the other variants.
+---------------------------------------------------------------------------------+
; Fitter Summary ;
+------------------------------------+--------------------------------------------+
; Fitter Status ; Successful - Wed Dec 16 12:31:58 2015 ;
; Quartus II 64-Bit Version ; 13.1.0 Build 162 10/23/2013 SJ Web Edition ;
; Revision Name ; Junk ;
; Top-level Entity Name ; FPGA ;
; Family ; Cyclone IV GX ;
; Device ; EP4CGX22CF19C6 ;
; Timing Models ; Final ;
; Total logic elements ; 98 / 21,280 ( < 1 % ) ;
; Total combinational functions ; 98 / 21,280 ( < 1 % ) ;
; Dedicated logic registers ; 0 / 21,280 ( 0 % ) ;
; Total registers ; 0 ;
; Total pins ; 96 / 167 ( 57 % ) ;
; Total virtual pins ; 0 ;
; Total memory bits ; 0 / 774,144 ( 0 % ) ;
; Embedded Multiplier 9-bit elements ; 0 / 80 ( 0 % ) ;
; Total GXB Receiver Channel PCS ; 0 / 4 ( 0 % ) ;
; Total GXB Receiver Channel PMA ; 0 / 4 ( 0 % ) ;
; Total GXB Transmitter Channel PCS ; 0 / 4 ( 0 % ) ;
; Total GXB Transmitter Channel PMA ; 0 / 4 ( 0 % ) ;
; Total PLLs ; 0 / 4 ( 0 % ) ;
+------------------------------------+--------------------------------------------+

I don't know what to tell you. I used Lattice Diamond 3.3. The design
is simple, two 33 bit adders, 1 LUT per bit. There is nothing special
about either the tool or the device (standard 4 input LUTs with carry
support for adders). I think your Quartus tool is broken. Try looking
at the actual logic produced. BTW, doesn't the Cyclone IV have 6 input
LUTs? What is the Altera tool doing that is so inefficient?

--

Rick
 
On 12/16/2015 12:45 PM, KJ wrote:
On Wednesday, December 16, 2015 at 9:23:49 AM UTC-5, thomas....@gmail.com
If you can do with a result every other cycle, you could try following:
- Make an 32b adder with carry in and carry out
- First cycle: Add the two numbers with carry.
- Second cycle: Check if the output carry is set. If yes, we already have the result. Otherwise clear the input carry and use the new result.

This should be below 40 LEs. But I have not tried it out...

I have yours coming in at 36 if I coded up what you said correctly, not tested. You're in the lead now with Ilya in second.

Kevin Jennings

==== START OF CODE ====
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity Custom_Adder is generic(
SP1_METHOD: integer range 0 to 1; -- Controls how 'Sum + 1' is calculated
METHOD: integer range 0 to 7);
port(
Clock: in std_ulogic;
A: in unsigned(31 downto 0);
B: in unsigned(31 downto 0);
C: out unsigned(31 downto 0);
Valid: out std_ulogic
);
end Custom_Adder;

-- Results:
-- Logic Elements
-- Method# SP1=0 SP1=1 Notes
-- 0 32 32 C <= A+B, no modulo '2**32-1' as a baseline
-- 1 32 32 C <= A+B+1, as another reference point
-- 2 76 76 Ilya's method although implemented all in one clock cycle
-- 3 97 97 Rickman's method #1
-- 4 97 97 Rickman's method #2
-- 5 65 65 Ilya's method as stated, two clock cycles
-- 6 76 108 Same as method 2, but when SP1_METHOD=1 it will computes Sum + 1 as A+1+B rather than A+B+1
-- 7 36 36 Thomas' method, two clock cycles

architecture RTL of Custom_Adder is
signal Sum: unsigned(32 downto 0);
signal Sum_P1: unsigned(32 downto 0);
constant MAX: unsigned(32 downto 0) := '0' & x"FFFF_FFFF";
begin
Sum <= resize(A, Sum'length) + resize(B, Sum'length);
Sum_P1 <= Sum + 1 when (SP1_METHOD = 0) else resize(A, Sum'length) + resize(B, Sum'length) + 1;
-- KJ note: If you change Sum_P1 to compute A+1+B rather than A+B+1 then the number of logic cells
-- increases by 27 (i.e. almost one for each bit of the adder

-- KJ note: Performing all calculations on one clock cycle in order to determine logic cells.

GEN_METHOD0: if (METHOD = 0) generate
C <= Sum(C'range);
end generate GEN_METHOD0;

GEN_METHOD1: if (METHOD = 1) generate
C <= Sum_P1(C'range);
end generate GEN_METHOD1;

GEN_METHOD2: if (METHOD = 2) generate
-- Ilya's method, implemented all in one clock cycle:
-- at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result
-- and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that
C <= Sum_P1(C'range) when (Sum >= MAX) else Sum(C'range);
end generate GEN_METHOD2;

GEN_METHOD3: if (METHOD = 3) generate
signal Mod_Res: unsigned(32 downto 0);
begin
-- Rickman's first method
-- sum <= RESIZE(a, 33) + RESIZE(b, 33);
-- temp <= sum + 1;
-- mod_result <= sum + temp(32);
-- answer <= mod_result(31 downto 0);
Mod_Res <= Sum + unsigned'(Sum_P1(32 downto 32));
C <= Mod_Res(C'range);
end generate GEN_METHOD3;
GEN_METHOD4: if (METHOD = 4) generate
-- Rickman's second method
-- signal Y: unsigned(31 downto 0);
-- signal Y, Y1, A, B: unsigned(32 downto 0); <-- KJ note: Redclares 'Y'
-- Y1 <= A + B + 1; <-- KJ note: Error: must have 33 elements not 32
-- Y <= resize(A + B + resize(Y1(32 downto 32),33), 32);
signal Y: unsigned(31 downto 0);
signal Y1: unsigned(32 downto 0);
begin
Y1 <= resize(A, 33) + resize(B, 33) + 1;
Y <= resize(A + B + resize(Y1(32 downto 32),33), 32);
C <= Y;
end generate GEN_METHOD4;

GEN_METHOD5: if (METHOD = 5) generate
-- Ilya's method:
-- at first clock cycle there is simple addition of two 32-bit unsigned numbers with 33-bit result
-- and on second cycle if the result >= 2**32-1, we add 1 and take only 32 bits of that
signal Sum_Dlyd: unsigned(Sum'range);
signal Sum_Dlyd_P1: unsigned(Sum'range);
begin
Sum_Dlyd <= Sum when rising_edge(Clock);
Sum_Dlyd_P1 <= Sum_Dlyd + 1;
C <= Sum_Dlyd_P1(C'range) when (Sum_Dlyd(32) = '1') else Sum_Dlyd(C'range);
end generate GEN_METHOD5;

GEN_METHOD6: if (METHOD = 6) generate
-- Same as method 2 (Ilya's method, implemented all in one clock cycle) except
-- the ordering of operands for the computation of 'Sum + 1' is modified.
signal Sum_P1: unsigned(Sum'range);
begin
Sum_P1 <= Sum + 1 when (SP1_METHOD = 0) else resize(A, Sum'length) + 1 + resize(B, Sum'length);
C <= Sum_P1(C'range) when (Sum >= MAX) else Sum(C'range);
end generate GEN_METHOD6;

GEN_METHOD7: if (METHOD = 7) generate
-- Thomas' method
-- Make an 32b adder with carry in and carry out
-- First cycle: Add the two numbers with carry.
-- Second cycle: Check if the output carry is set. If yes, we already have the result. Otherwise clear the input carry and use the new result.
-- This should be below 40 LEs. But I have not tried it out...
signal Sum_With_Carry: unsigned(Sum'range);
signal Carry_In: unsigned(0 downto 0);
signal Carry_Out: std_ulogic;
begin
Carry_In <= not(Carry_In);
Sum_With_Carry <= resize(A, Sum_With_Carry'length) + resize(B, Sum_With_Carry'length) + Carry_In;
Carry_Out <= Sum_With_Carry(32) when rising_edge(Clock);
C <= Sum_With_Carry(C'range);
Valid <= (Carry_In(0) and Sum_With_Carry(32)) or (not(Carry_In(0)) and not(Carry_Out));
end generate GEN_METHOD7;
end RTL;
==== END OF CODE ====

I see one issue. You are calculating Sum and Sum_P1 even for methods
that don't use them. This would then depend on the tool to optimize
away this logic. I see in some methods you recompute these values. Why
not fix this and add it to the code for the methods that use it?

--

Rick
 
I thank you all for this very interesting and useful discussion. There is a lot to think about.
But I should apologize: later on it turns out that the definition of "modulo 2**32-1" which is used in the algorithm description other than I had thought. It is weird from the mathematical point of view:
A+B if A+B<2**32
A+B-2**32+1 if A+B>=2**32
I manged to meet timings by using Sum(32) to decide what sum I should use.

Sum <= resize(A, 33)+resize(B, 33);
Sum_P1 <= resize(A, 33)+resize(B, 33)+1;

process(clk)
begin
if rising_edge(clk) then
A <= A_in;
B <= B_in;

if Sum(32) = '1' then
C<=resize(Sum_P1,32);
else
C<=resize(Sum,32);
end if;
end if;
end process;

In my case (Artix-7) this code gives me a critical path of LUT2 then 8 CARRY4 then LUT3 and it's pretty fast (3.09 ns estimated before the routing step). It consumes 96 luts estimated.

If I use
Sum_P1 <= Sum+1;
istead, it gives me a critical path of LUT2 + 9 CARRY4 + LUT2 + 8 CARRY4 and it doesn't meet timings (4.232 ns estimated before the routing step). It consumes 64 luts estimated.


If we return to our original task with a proper modulo 2*32-1 addition and use Sun_P1(32) for the select input of the multiplexer
Sum <= resize(A, 33)+resize(B, 33);
Sum_P1 <= resize(A, 33)+resize(B, 33)+1;

process(clk)
begin
if rising_edge(clk) then
A <= A_in;
B <= B_in;

if Sum_P(32) = '1' then
C<=resize(Sum_P1,32);
else
C<=resize(Sum,32);
end if;
end if;
end process;
we would have exactly the same results on Artix-7 as with "weird" addition. (And it makes sense)

in case of Sum_P1 <= Sum+1; we have LUT2 + 7 CARRI4 + LUT + 2 CARRY4 + LUT2 and we don't meet timings(4.444 ns estimated before the routing step). It consumes 96 luts estimated.

if we do it like that:
Y1 <= resize(A, 33) + resize(B, 33) + 1;
Y <= resize(A + B + resize(Y1(32 downto 32),33), 32);

process(clk)
begin
if rising_edge(clk) then
A <= A_in;
B <= B_in;

C <= Y;
end if;
end process;
we have in critical path LUT2 + 8*CARRY4 + LUT2 + 8*CARRY4 and we don't meet timings(4.338 ns estimated before the routing step). It consumes 64 luts estimated.
 

Welcome to EDABoard.com

Sponsor

Back
Top