Ilya Kalistru
I should also apologize for my long absence. The device I'm working on suddenly started to lose its PCIe link (for no apparent reason), and I had to put off all other work until this problem is resolved.
I thank you all for this very interesting and useful discussion. There is a lot to think about.
But I should apologize: later on it turned out that the definition of "modulo 2**32-1" used in the algorithm description is different from what I had thought. It is weird from the mathematical point of view:
A+B if A+B<2**32
A+B-2**32+1 if A+B>=2**32
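For reference, this nonstandard definition is easy to model in software. Here is a quick Python sketch (the function names are mine) that contrasts it with a true modulo:

```python
M = 2**32

def weird_add(a, b):
    """The algorithm's "modulo 2**32-1": reduce only once the sum
    reaches 2**32, so a result of 2**32-1 is possible."""
    s = a + b
    return s if s < M else s - M + 1

def true_mod_add(a, b):
    """A genuine modulo-(2**32-1) addition, for comparison."""
    return (a + b) % (M - 1)

# The two differ exactly when the sum hits 2**32-1:
print(weird_add(M - 2, 1))      # 4294967295
print(true_mod_add(M - 2, 1))   # 0
```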
I managed to meet the timings by using Sum(32) to decide which sum I should use.
Sum <= resize(A, 33)+resize(B, 33);
Sum_P1 <= resize(A, 33)+resize(B, 33)+1;
process(clk)
begin
if rising_edge(clk) then
A <= A_in;
B <= B_in;
if Sum(32) = '1' then
C<=resize(Sum_P1,32);
else
C<=resize(Sum,32);
end if;
end if;
end process;
In my case (Artix-7) this code gives me a critical path of LUT2, then 8 CARRY4, then LUT3, and it's pretty fast (3.09 ns estimated before the routing step). It consumes an estimated 96 LUTs.
If I use
Sum_P1 <= Sum+1;
instead, it gives me a critical path of LUT2 + 9 CARRY4 + LUT2 + 8 CARRY4 and it doesn't meet timings (4.232 ns estimated before the routing step). It consumes an estimated 64 LUTs.
If we return to our original task with a proper modulo 2**32-1 addition and use Sum_P1(32) for the select input of the multiplexer
Sum <= resize(A, 33)+resize(B, 33);
Sum_P1 <= resize(A, 33)+resize(B, 33)+1;
process(clk)
begin
if rising_edge(clk) then
A <= A_in;
B <= B_in;
if Sum_P1(32) = '1' then
C<=resize(Sum_P1,32);
else
C<=resize(Sum,32);
end if;
end if;
end process;
we would have exactly the same results on Artix-7 as with "weird" addition. (And it makes sense)
In the case of Sum_P1 <= Sum+1; we have LUT2 + 7 CARRY4 + LUT + 2 CARRY4 + LUT2 and we don't meet timings (4.444 ns estimated before the routing step). It consumes an estimated 96 LUTs.
If we do it like this:
Y1 <= resize(A, 33) + resize(B, 33) + 1;
Y <= resize(A + B + resize(Y1(32 downto 32),33), 32);
process(clk)
begin
if rising_edge(clk) then
A <= A_in;
B <= B_in;
C <= Y;
end if;
end process;
we have LUT2 + 8*CARRY4 + LUT2 + 8*CARRY4 in the critical path and we don't meet timings (4.338 ns estimated before the routing step). It consumes an estimated 64 LUTs.
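As a software cross-check of these variants, here is a small Python model (my own sketch) of the Sum_P1(32)-selected version above; selecting on Sum_P1(32) is what makes it a true modulo 2**32-1:

```python
import random

M = 2**32

def mux_add(a, b):
    """Model of the VHDL above: two 33-bit sums, mux selected by Sum_P1(32)."""
    s   = a + b        # Sum    <= resize(A,33) + resize(B,33)
    sp1 = a + b + 1    # Sum_P1 <= resize(A,33) + resize(B,33) + 1
    return (sp1 if sp1 >= M else s) & (M - 1)   # C <= resize(..., 32)

# Agrees with a true modulo-(2**32-1) addition over the valid residue range:
for _ in range(100_000):
    a, b = random.randrange(M - 1), random.randrange(M - 1)
    assert mux_add(a, b) == (a + b) % (M - 1)
```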
That doesn't make sense. Either it is one definition or the other.
On Saturday, December 19, 2015 at 10:18:36 PM UTC+3, rickman wrote:
A couple of comments. The timing from the synthesis step is pretty much
garbage. The only timing results that matter are those from place and
route. At least that has been my experience. So I would never toss an
approach based on timings from the synthesis step unless they were very
far apart.
It's just a way to get a rough estimate of the timings without inserting this block into an actual design or dealing with timings from/to the input ports.
Maybe I'll check these designs with
set_false_path -from A_in -to A
and so on.
The actual definition of "modulo 2**32-1" would be:
A+B if A+B<2**32-1
A+B-2**32+1 if A+B>=2**32-1
It is. But as I said, this particular algorithm uses a nonstandard definition. (Obviously they wanted to make the hardware implementation easier.)
In the last example you give the tool seems to treat resize(Y1(32 downto
32),33) as another operator and devotes a full adder 33 bits long to add
that to the sum of the other two operands, obviously not efficient. My
other example gets around that. But I have another way of expressing it.
This method calculates a carry out of the 32 bit addition of A+B+1 which
is added to the sum A+B to get the final answer. Rather than trying to
force the tools to use a carry in to propagate the carry, just make it
one longer adder. Built in carries tend to be fast, so it will be a
foot race between the carry propagation and the LUT and routing delays
of using the mux.
signal A,B : unsigned(WIDTH-1 downto 0);
signal AxA, BxB, Y : unsigned(2*WIDTH-1 downto 0);
BEGIN
AxA <= A & A;
BxB <= B & B;
Y <= AxA + BxB + 1;
process(clk)
begin
if rising_edge(clk) then
A <= A_in;
B <= B_in;
C <= Y (2*WIDTH-1 downto WIDTH);
end if;
end process;
This produced 64 bits of adder and no additional logic. I didn't
simulate it to make sure the logic is correct. You will need to do your
own timing analysis since your target device is different from what I
work with. Whether it is faster depends on the speed of the carry
chains in your device.
--
Rick
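Rick's doubled-adder trick can be modeled in Python (a sketch of mine, with WIDTH=32; the concatenation A & A becomes a shift-and-or). It verifies that the top half of Y implements the true modulo 2**32-1 sum for inputs in the valid residue range:

```python
import random

W = 32
M = 1 << W

def doubled_adder(a, b):
    """Y <= (A & A) + (B & B) + 1; the result is Y(2*WIDTH-1 downto WIDTH)."""
    axa = (a << W) | a                          # AxA <= A & A
    bxb = (b << W) | b                          # BxB <= B & B
    y = (axa + bxb + 1) & ((1 << 2 * W) - 1)    # one 2*WIDTH-bit adder
    return y >> W                               # take the upper half

for _ in range(100_000):
    a, b = random.randrange(M - 1), random.randrange(M - 1)  # 0 .. 2**32-2
    assert doubled_adder(a, b) == (a + b) % (M - 1)
```

The lower half computes A+B+1, whose carry out is exactly the "end-around" carry that gets added into the upper copy of A+B.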
Cool! It makes LUT2 + 16*CARRY4 with 3.112 ns estimated propagation time (before routing) and uses 64 LUTs.
I believe your code
can be simply modified to produce the same result, just use Sum_P1(32)
in the conditional instead of Sum(32).
It's exactly what I've done in my second post - there are test results for both definitions of "modulo", with Sum_P1(32) and Sum(32) used as the select input of the multiplexer.
I don't know how you can use
your code if you need a modulo function since it will produce a result
of 2**n-1 which is not in the range of mod 2**n-1 and so is not a valid
result.
Sorry, I don't understand you here. Could you rephrase it a bit more simply or in more detail?
I don't need a modulo function; I need a function which is like modulo but with this strange difference.
I've tested both definitions for the sake of comparison, and to be able to compare my "standard modulo" results on Xilinx with KJ's "standard modulo" results on Altera.
Since the carry chains are not part of the routing, you will find little
additional delay if the placement is good.
I want to check how routing affects performance in a real design once I fix this problem with the bloody PCIe. I hope I won't forget to report my results here.
Using Sum high bit as the selector means the mux will produce an output
of 2**n-1. This is not a valid output for a modulo 2**n-1 function. In
other words this is not a valid function to produce the modulo 2**n-1
function. It is the wrong circuit.
I can maybe illustrate this better with 8 bits. Mod 255 means the
output will range from 0 to 254. So if you used Sum high bit to control
the mux it will produce a range of 0 to 255 which is not correct. Use
Sum_p1 high bit to control the mux and the output will range from 0 to
254 which is correct.
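This 8-bit illustration can be checked exhaustively in Python (a sketch of my own, with bit 8 of the 9-bit sums playing the role of Sum(32)/Sum_P1(32)):

```python
M = 256  # 8-bit example: "modulo 255"

def add_sel_sum(a, b):
    """Mux controlled by Sum's high bit."""
    s, sp1 = a + b, a + b + 1
    return (sp1 if s >= M else s) & (M - 1)

def add_sel_sum_p1(a, b):
    """Mux controlled by Sum_P1's high bit."""
    s, sp1 = a + b, a + b + 1
    return (sp1 if sp1 >= M else s) & (M - 1)

valid = range(M - 1)  # valid residues 0 .. 254
range_sum    = max(add_sel_sum(a, b)    for a in valid for b in valid)
range_sum_p1 = max(add_sel_sum_p1(a, b) for a in valid for b in valid)
print(range_sum, range_sum_p1)  # 255 254
```

Only the Sum_P1-controlled mux keeps the output inside the mod-255 residue range.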
I'm interested. Placement will have a huge effect. I looked at the routed
results for my implementation and saw one route of 4.5 ns. Looks like
they shoved the input FFs into the IOBs, so the design can't all be
close together as the IO is scattered around the chip perimeter. I'd
need to kill that or add another layer of FFs so the routes of interest
are all internal FFs which could be co-located. Still doesn't say they
*will* be close together. I find tools to be hard to manage in that
regard unless you do floor-planning. I much prefer a slow clock... lol
If that is your requirement, then the other solutions are wrong, like
mine.
In your original post you say you want a function to produce "modulo
2**32-1". The description you supply is *not* "modulo 2**32-1". It
sounds like you have an existing implementation that may or may not be
correct but you have been tasked with duplicating it. If I were tasked
with this I would go through channels to get verification that the
specification is correct. It wouldn't be the first time there was an
error in an implementation that works most of the time, but fails 1 time
in 2^32 in this case.
What happens downstream if your function spits out a number that is not
in the range of "modulo 2**32-1"?
The problem I found was the input FFs were shoved into the IOB (a common
optimization). This spreads the FFs around the IOs, which are spaced far
apart. You need to tell the tools not to do that *or* place additional
FFs in the circuit so the ones in the IOBs are not part of the path
being timed. Maybe you already have IOB FFs disabled.
Using Sum high bit as the selector means the mux will produce an output
of 2**n-1. This is not a valid output for a modulo 2**n-1 function. In
other words this is not a valid function to produce the modulo 2**n-1
function. It is the wrong circuit.
I can maybe illustrate this better with 8 bits. Mod 255 means the
output will range from 0 to 254. So if you used Sum high bit to control
the mux it will produce a range of 0 to 255 which is not correct. Use
Sum_p1 high bit to control the mux and the output will range from 0 to
254 which is correct.
Ok. It's the wrong function to produce modulo 2**n-1. But I don't need a correct modulo function; I need a function which gives us this:
A+B if A+B<2**32
A+B-2**32+1 if A+B>=2**32
It's absolutely different from a modulo function, but it's what they use in the description of the algorithm. It's not my responsibility to change the algorithm; it is already used in a software implementation, and if I changed this function to a correct modulo function, my hardware implementation would be incompatible with the software implementation and wouldn't comply with the government's requirements.
I'm interested. Placement will have a huge effect. I look at routed
results for my implementation and I saw one route of 4.5 ns. Looks like
they shoved the input FFs into the IOBs, so the design can't all be
close together as the IO is scattered around the chip perimeter. I'd
need to kill that or add another layer of FFs so the routes of interest
are all internal FFs which could be co-located. Still doesn't say they
*will* be close together. I find tools to be hard to manage in that
regard unless you do floor-planning. I much prefer a slow clock... lol
That's why I didn't use post-placement results in my first posts. Later I worked around it. As you can see, I've used A_in, B_in, C_out as ports and A, B, C as FFs. I've also made a rule that the paths from the *_in ports to the A, B FFs and from the C FFs to the C_out ports are "don't care" paths.
set_false_path -from [get_ports A_in*] -to [get_cells A_reg*]
set_false_path -from [get_ports B_in*] -to [get_cells B_reg*]
set_false_path -from [get_cells C_reg*] -to [get_ports C_out*]
This allows the tool to place the FFs anywhere on the chip.
I usually don't do any floor-planning at all, because setting the constraints well is usually enough for my designs at ~250 MHz, and the Vivado placement tool does its job well.
On 12/20/15 3:57 AM, Ilya Kalistru wrote:
Ok. It's wrong function to produce the modulo 2**n-1. But I don't need
correct modulo function, I need function which gives us that:
A+B if A+B<2**32
A+B-2**32+1 if A+B>=2**32
It's absolutely different from modulo function, but it's what they use
in the description of algorithm. It's not my responsibility to change
algorithm and this algorithm is used in software implementation
already and if I changed this function to correct modulo function my
hardware implementation would be incompatible with software
implementation and wouldn't comply to requirements of government.
This is absolutely a modulo function. It just treats 2^32-1 as == 0
(which it is, modulo 2^32-1).
If that is your requirement, then the other solutions are wrong, like
mine.
They are. But they give me ideas and approaches.
In your original post you say you want a function to produce "modulo
2**32-1". The description you supply is *not* "modulo 2**32-1". It
sounds like you have an existing implementation that may or may not be
correct but you have been tasked with duplicating it. If I were tasked
with this I would go through channels to get verification that the
specification is correct. It wouldn't be the first time there was an
error in an implementation that works most of the time, but fails 1 time
in 2^32 in this case.
In my original post I was wrong. I forgot about this oddity and gave you the wrong description. This specification is 26 years old and I have tons of implementations. I don't think it's a good idea to try to get verification. It's an old standard set by the government and it would be a very long way to go - especially if the part of the work you are responsible for hasn't been done yet.
What happens downstream if your function spits out a number that is not
in the range of "modulo 2**32-1"?
Nothing special. It's just a way to generate pseudorandom data. It could be done this way or another way, but it should be the same on all devices. The only difference is security, but I won't second-guess the algorithm after professional mathematicians - not my business.
The problem I found was the input FFs were shoved into the IOB (a common
optimization). This spreads the FFs around the IOs, which are spaced far
apart. You need to tell the tools not to do that *or* place additional
FFs in the circuit so the ones in the IOBs are not part of the path
being timed. Maybe you already have IOB FFs disabled.
I think it's pointless to shove FFs into the IOBs if there is a rule that we don't care about timings between the ports and the FFs. Why would the tool do that?
I haven't disabled IOB FFs.
On 12/19/2015 1:03 PM, Ilya Kalistru wrote:
I thank you all for this very interesting and useful discussion. There
is a lot to think about.
But I should apologize: later on it turned out that the definition of
"modulo 2**32-1" used in the algorithm description is different from what
I had thought. It is weird from the mathematical point of view:
A+B if A+B<2**32
A+B-2**32+1 if A+B>=2**32
So in fact this is what I originally suggested, i.e. "end around carry."
This is often used in checksum algorithms. It's the equivalent of
(without the type / size changes):
sum <= A + B;
out <= sum(31 downto 0) + sum(32);
Note that if you do the whole thing in one cycle by looping the carry
out of a 32-bit full adder to its carry in, you have not only two passes
through a 32-bit carry chain (worst case) but also the non-dedicated
routing in the loopback connection from carry out to carry in.
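Gabor's end-around-carry form can be modeled directly in Python (my sketch); it reproduces the algorithm's nonstandard definition exactly:

```python
M = 2**32

def end_around_carry(a, b):
    """sum <= A + B;  out <= sum(31 downto 0) + sum(32)"""
    s = a + b                         # 33-bit sum
    return (s & (M - 1)) + (s >> 32)  # low 32 bits plus the carry

# Matches: A+B if A+B < 2**32, else A+B - 2**32 + 1
for a, b in [(0, 0), (M - 1, 1), (M - 1, M - 1), (123, M - 100)]:
    s = a + b
    expected = s if s < M else s - M + 1
    assert end_around_carry(a, b) == expected
```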
On 12/20/2015 5:11 PM, Gabor Szakacs wrote:
On 12/19/2015 1:03 PM, Ilya Kalistru wrote:
I thank you all for this very interesting and useful discussion. There
is a lot to think about.
But I should apologize: later on it turned out that the definition of
"modulo 2**32-1" used in the algorithm description is different from what
I had thought. It is weird from the mathematical point of view:
A+B if A+B<2**32
A+B-2**32+1 if A+B>=2**32
So in fact this is what I originally suggested, i.e. "end around carry."
This is often used in checksum algorithms. It's the equivalent of
(without the type / size changes):
sum <= A + B;
out <= sum(31 downto 0) + sum(32);
Note that if you do the whole thing in one cycle by looping the carry
out of a 32-bit full adder to its carry in, you have not only two passes
through a 32-bit carry chain (worst case) but also the non-dedicated
routing in the loopback connection from carry out to carry in.
Another problem with this method is that it creates a combinatorial loop
which prevents proper static timing analysis. It is very efficient with
the logic however.
I'm just very curious about why this is the desired algorithm while it
is being called "mod 2^32-1", but obviously we will never know the
origin of this.
On 12/20/2015 11:23 PM, rickman wrote:
On 12/20/2015 5:11 PM, Gabor Szakacs wrote:
On 12/19/2015 1:03 PM, Ilya Kalistru wrote:
I thank you all for this very interesting and useful discussion. There
is a lot to think about.
But I should apologize: later on it turned out that the definition of
"modulo 2**32-1" used in the algorithm description is different from what
I had thought. It is weird from the mathematical point of view:
A+B if A+B<2**32
A+B-2**32+1 if A+B>=2**32
So in fact this is what I originally suggested, i.e. "end around carry."
This is often used in checksum algorithms. it's the equivalent of
(without the type / size changes):
sum <= A + B;
out <= sum(31 downto 0) + sum(32);
Note that if you do the whole thing in one cycle by looping the carry
out of a 32-bit full adder to its carry in, you have not only two passes
through a 32-bit carry chain (worst case) but also the non-dedicated
routing in the loopback connection from carry out to carry in.
Another problem with this method is that it creates a combinatorial loop
which prevents proper static timing analysis. It is very efficient with
the logic however.
I'm just very curious about why this is the desired algorithm while it
is being called "mod 2^32-1", but obviously we will never know the
origin of this.
I have always heard this method referred to as "end around carry,"
however if you are using this as an accumulator, i.e. A is the next
data input and B is the result of the previous sum, then it is in
effect the same as taking the sum of inputs modulo 2**32 - 1, with
the only difference being that the final result can be equal to
2**32 - 1 where it would normally be zero. Intermediate results
equal to 2**32-1 do not change the final outcome vs. doing a true
modulo operator.
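Gabor's accumulator argument can be demonstrated with a short Python experiment (a sketch I wrote): accumulate random words with the end-around-carry add, and the final value is always congruent to the true modulo-(2**32-1) sum:

```python
import random

M = 2**32

def eac_add(a, b):
    """End-around-carry add: reduces at 2**32; may return 2**32-1."""
    s = a + b
    return (s & (M - 1)) + (s >> 32)

data = [random.randrange(M) for _ in range(10_000)]

acc = 0
for x in data:
    acc = eac_add(acc, x)   # intermediate values may pass through 2**32-1

# Each eac_add subtracts a multiple of 2**32-1, so the residue is preserved:
assert acc % (M - 1) == sum(data) % (M - 1)
```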
I'm interested. Placement will have a huge effect. I look at routed
results for my implementation and I saw one route of 4.5 ns. Looks like
they shoved the input FFs into the IOBs, so the design can't all be
close together as the IO is scattered around the chip perimeter. I'd
need to kill that or add another layer of FFs so the routes of interest
are all internal FFs which could be co-located. Still doesn't say they
*will* be close together. I find tools to be hard to manage in that
regard unless you do floor-planning. I much prefer a slow clock... lol
Today I've tried the new method in the real design and it turned out that it works quite well. The biggest negative timing slack was about -0.047 ns or so. If I can take advantage of the single-cycle operation, I'll try to implement it there.
The total delay of the critical path there is about 4 ns; it's bigger than in my simple test mostly because of worse routing to the LUTs at the beginning and end of the dedicated CARRY network.