Signed multiplication revisited

  • Thread starter Jonathan Bromley
  • Start date
J

Jonathan Bromley

Guest
After all the recent brouhaha about signed arithmetic
and multiplication, I decided to try a little test.

I remember once getting badly burnt because I needed
an 18x18->36 UNSIGNED multiplier in a Xilinx device,
and of course the native 18x18 multipliers are SIGNED,
so I got a bunch of unexpected extra logic. So
I thought I'd try to work out for myself what it
takes to get 18x18 UNSIGNED multiplication if you
have SIGNED multipliers to work with. Apologies to
those for whom this is familiar or child's play.

With apologies to rickman, I stuck with my
preferred way of thinking about twos complement
signed numbers: the MSB represents -(2**(n-1)),
all other bits are positive. So let's consider
an 18-bit number to be in two pieces:

- an unsigned 17-bit value, call it "u"
- a 1-bit MSB, call it "s", whose value is...
- for signed, either 0 or -(2**17),
- for unsigned, either 0 or +(2**17)

So we have two unsigned numbers A, B each composed
of these two components, and we wish to multiply them
together using unsigned arithmetic to get a 36-bit
unsigned result:

uns_result = (sA+uA)*(sB+uB)
= sA*sB + sA*uB + sB*uA + uA*uB

But unfortunately we are stuck with a SIGNED multiplier
which treats these numbers as though they were:

sgn_result = (-sA+uA)*(-sB+uB)
= sA*sB - sA*uB - sB*uA + uA*uB

Woo-hoo! It's nearly the same! So we see that

uns_result = sgn_result + 2*(sA*uB + sB*uA)

But that is SO easy to do! Remember that sA and sB
are single-bit values, at the MSB end (bit 17) of the
operands. Let's do a little rearrangement:

sA = (2**17)*msbA, sB = (2**17)*msbB

so we have

uns_result = sgn_result + (2**18)*(msbA*uB + msbB*uA)

We can calculate the sum (msbA*uB + msbB*uA) in parallel
with the multiplication, and after the multiplication
we can add that value (left-shifted by 18 bits) to the
signed product, and we have the correct unsigned product.

OK, so here's the payoff. I coded this rather carefully
in VHDL to force a synthesis tool to build the
conditional adder (msbA*uB + msbB*uA) in only one LUT
per bit:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity umult18x18 is
port
( clk : in std_logic
; a, b : in unsigned(17 downto 0)
; y : out unsigned(35 downto 0)
);
end;

architecture JB of umult18x18 is
signal ar, br: unsigned(a'range);
begin
process (clk)
variable sy, fixup: unsigned(35 downto 0);
variable sbXua, saXuB: unsigned(17 downto 0);
begin
if rising_edge(clk) then
-- Input registers
ar <= a;
br <= b;
-- Compute signed product
sy := unsigned(signed(ar) * signed(br));
-- Compute conditional operands to adder
if ar(17) = '1' then
saXub := ('0' & br(16 downto 0));
else
saXub := (others => '0');
end if;
if br(17) = '1' then
sbXua := ('0' & ar(16 downto 0));
else
sbXua := (others => '0');
end if;
-- Compute sum of conditional operands and shift it left by 18
fixup := (sbXua + saXuB) & (17 downto 0 => '0');
-- Output register is corrected product
y <= fixup + sy;
end if;
end process;
end;

and, guess what..... at least one synth tool gives significantly
better timing results on my code than on a simple A*B operation.
So the synth vendor hasn't worked out the neatest way to get
unsigned multiply from a signed multiplier block.

In fairness, the difference wasn't outrageously large. But
it was disappointing that the synth tool's built-in solution
to a standard problem was measurably sub-optimal.

Maybe hand-optimized arithmetic is not yet dead...
--
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan.bromley@MYCOMPANY.com
http://www.MYCOMPANY.com

The contents of this message may contain personal views which
are not the views of Doulos Ltd., unless specifically stated.
 
On Tue, 09 Sep 2008 20:47:29 +0100, Jonathan Bromley
<jonathan.bromley@MYCOMPANY.com> wrote:

After all the recent brouhaha about signed arithmetic
and multiplication, I decided to try a little test.

I remember once getting badly burnt because I needed
an 18x18->36 UNSIGNED multiplier in a Xilinx device,
and of course the native 18x18 multipliers are SIGNED,
so I got a bunch of unexpected extra logic. So
I thought I'd try to work out for myself what it
takes to get 18x18 UNSIGNED multiplication if you
have SIGNED multipliers to work with.
....
and, guess what..... at least one synth tool gives significantly
better timing results on my code than on a simple A*B operation.
So the synth vendor hasn't worked out the neatest way to get
unsigned multiply from a signed multiplier block.
Doesn't entirely surprise me.

Maybe hand-optimized arithmetic is not yet dead...
I believe not...

My normal approach would be to use the built-in block as a 17*17
unsigned, and combining it with two 17*1s and a 1*1, in the fabric.
There may be neat tricks with the DSP48s to help, but this would also
work with V2Pro. Apologies for not having tried it against yours yet. I
will...

But first, here is a trick question for the VHDL gurus...

When does the order of signal assignment matter within a process?



.... my answer below, but meanwhile here is a 32*32 unsigned
multiplier...

I have hinted at this little gem before, since I found it last December,
but only now found time to look at it in detail.

---------------------------------------------------------------------
LIBRARY ieee;
USE ieee.std_logic_1164.all;
USE ieee.numeric_std.all;

ENTITY MULT_SILLY IS

PORT(
A : IN unsigned (31 DOWNTO 0);
B : IN unsigned (31 DOWNTO 0);
clk : IN std_logic;
M : OUT unsigned (31 DOWNTO 0)
);

-- Simple pipelined multiplier, making 3 16*16 partial products
-- in the first cycle, and summing them in 2 further cycles.
-- For 32-bit out, let's ignore the LSB*LSB partial product and
-- rounding.

END MULT_SILLY ;

ARCHITECTURE Struct OF MULT_SILLY IS

subtype part_prod_u is unsigned(31 downto 0);

SIGNAL PartP_H : part_prod_u;
SIGNAL PartP_L1 : part_prod_u;
SIGNAL PartP_L2 : part_prod_u;
SIGNAL PartP_H2 : part_prod_u;
SIGNAL PartP_L : part_prod_u;

BEGIN

MULT: Process(clk) is

procedure mul_32_32_32
(
SIGNAL In_A, In_B : IN unsigned (31 DOWNTO 0);
SIGNAL PP_H, PP_H2, PP_L1, PP_L2, Sum_L : INOUT part_prod_u;
SIGNAL Product : OUT part_prod_u
) is

begin
-- third cycle: scale and sum to extract final product
Product <= (PP_h2(23 downto 0) & X"00")
+ Sum_L(31 downto 8);
-- second cycle: pipeline high bits during low-order summation
PP_H2 <= PP_H;
Sum_L <= PP_L1 + PP_L2;
-- first cycle: three partial products
PP_H <= In_A(31 DOWNTO 16) * In_B(31 downto 16);
PP_L1 <= In_A(31 DOWNTO 16) * In_B(15 downto 0);
PP_L2 <= In_A(15 DOWNTO 0) * In_B(31 downto 16);

end mul_32_32_32;

begin
if rising_edge(clk) then
-- for one mult, a procedure is overkill.
-- For an array of multipliers, ...
mul_32_32_32 ( A, B, PartP_H, PartP_H2, PartP_L1,
PartP_L2, PartP_L, M);
end if;
end process;

END Struct;

---------------------------------------------------------------------

Now I have said before that I like to use signals for my pipeline
registers, so that I can describe my pipelines in order. So why on earth
have I described it backwards here? Using variables, I'd have to. But
I'm using signals!

Synthesis in XST (for V2Pro, -6, ISE10.1) gives the expected three
multipliers, a bunch of LUTs and FFs, and 211 MHz predicted speed.
Not too bad. (And post-synthesis simulation confirms that it takes three
clock cycles, as expected).

So reverse the order of the assignment statements. It won't make any
difference. If you don't believe me, you can prove it by simulation.

But now, synthesis in XST gives 103 MHz. And post-synthesis sim reveals
the outputs appearing two cycles early.

So...

When does the order of signal assignment matter within a process?
Apparently, when you are assigning them within a procedure, and using
XST for synthesis.

Frankly, I am astonished.
I have seen this in XST 7.1, 9.1, 9.2, and now 10.1.
I would expect every synthesis tool to get this right.

(Incidentally, using variables instead of signals does exactly the same
thing; i.e. works exactly as expected. Writing the multiplications out
longhand in the process, instead of using a procedure, works as
expected. But if you have an array of multiplications, it gets ugly.)

- Brian
 
On Tue, 09 Sep 2008 20:47:29 +0100, Jonathan Bromley
<jonathan.bromley@MYCOMPANY.com> wrote:

After all the recent brouhaha about signed arithmetic
and multiplication, I decided to try a little test.
....
and, guess what..... at least one synth tool gives significantly
better timing results on my code than on a simple A*B operation.
So the synth vendor hasn't worked out the neatest way to get
unsigned multiply from a signed multiplier block.

Maybe hand-optimized arithmetic is not yet dead...
Another data point...

My approach was to use the built-in block as a 17*17 unsigned, and
combine it with two 17*1s and a 1*1, in the fabric.

It does use more resources...
Especially since left to itself, XST will instantiate multiplier blocks
even for the *1 multipliers. This is explicitly controlled by the
mult_style attributes in the example below.

Every time I revisit it, I mean to parameterise it for a generic m*n
multiplier, 17 < (m,n) <= 34. This would be quite simple, but I haven't
taken the trouble yet.

----------------------------------------------------------------
LIBRARY ieee;
USE ieee.std_logic_1164.all;
USE ieee.numeric_std.all;

ENTITY MULT_18_18_u IS

PORT(
A : IN unsigned (17 DOWNTO 0);
B : IN unsigned (17 DOWNTO 0);
clk : IN std_logic;
Y : OUT unsigned (35 DOWNTO 0)
);

-- Simple pipelined multiplier, making 4 partial products
-- in the first cycle, and summing them in 2 further cycles.
END MULT_18_18_u ;

ARCHITECTURE struct OF MULT_18_18_u IS

-- Internal signal declarations

SIGNAL PP1,REG1 : unsigned(33 DOWNTO 0);
SIGNAL PP2, PP3, REG2, REG3 : unsigned(34 DOWNTO 17);
SIGNAL PP4,REG4 : unsigned(35 DOWNTO 34);

SIGNAL SUM_HL : unsigned(35 DOWNTO 0);
SIGNAL SUM_MM : unsigned(35 DOWNTO 17);

attribute mult_style : string;
attribute mult_style of PP1 : signal is "block";
attribute mult_style of PP2 : signal is "lut";
attribute mult_style of PP3 : signal is "lut";
attribute mult_style of PP4 : signal is "lut";

BEGIN

-- Partial products. Kept as separate signals to attach attributes.
-- TODO: test if attributes on the registered signals work.
PP1 <= A(16 downto 0) * B(16 downto 0);
PP2 <= A(17 downto 17) * B(16 downto 0);
PP3 <= A(16 downto 0) * B(17 downto 17) ;
PP4 <= A(17 downto 17) * B(17 downto 17);

Process(clk)
begin
if clk'event and (clk = '1') then
-- first cycle
REG1 <= PP1 ;
REG2 <= PP2 ;
REG3 <= PP3 ;
REG4 <= PP4 ;
-- second cycle
SUM_HL <= REG4 & REG1;
SUM_MM <= ('0' & REG2) + ('0' & REG3);
-- third cycle
Y <= SUM_HL + SUM_MM;
end if;
end process;

END struct;
-------------------------------------------------------------------

Now I have only tried synthesis for V2Pro30, speed -6, so my results may
not hold for other technologies. More datapoints welcome...

Results for "architecture JB of umult18x18":
Device utilization summary:
---------------------------
Number of Slices: 49 out of 13696 0%
Number of Slice Flip Flops: 72 out of 27392 0%
Number of 4 input LUTs: 35 out of 27392 0%
Number of MULT18X18s: 1 out of 136 0%

Timing Summary:
---------------
Speed Grade: -6

Minimum period: 7.911ns (Maximum Frequency: 126.409MHz)
Minimum input arrival time before clock: 1.543ns
Maximum output required time after clock: 3.615ns

but the synth report contained a warning
INFO:Xst:2385 - HDL ADVISOR - You can improve the performance of
the multiplier Mmult_y_mult0000 by adding 1 register level(s).

So I took the liberty of adding another pipeline stage, as follows:
----------------------------------------------------
-- added pipeline stage ar2,br2,sy2 as below.
-- Not tested for correctness
--
architecture JB of umult18x18 is
signal ar, br, ar2, br2: unsigned(a'range);
begin
process (clk)
variable sy, sy2,fixup: unsigned(35 downto 0);
variable sbXua, saXuB: unsigned(17 downto 0);
begin
if rising_edge(clk) then
-- Input registers
ar2 <= a;
br2 <= b;
ar <= ar2;
br <= br2;
-- Compute signed product
sy := sy2;
sy2 := unsigned(signed(ar2) * signed(br2));
-- Compute conditional operands to adder
----------------------------------------------------

Not tested for correctness; but I believe the criterion is that ar, br
and sy should be aligned in time, and it does that. Definitely faster:

Device utilization summary:
---------------------------
Number of Slices: 70 out of 13696 0%
Number of Slice Flip Flops: 108 out of 27392 0%
Number of 4 input LUTs: 35 out of 27392 0%
Number of MULT18X18s: 1 out of 136 0%

Timing Summary:
---------------
Minimum period: 5.594ns (Maximum Frequency: 178.779MHz)
Minimum input arrival time before clock: 1.543ns
Maximum output required time after clock: 3.615ns

And the alternative code above:

Device utilization summary:
---------------------------
Number of Slices: 70 out of 13696 0%
Number of Slice Flip Flops: 124 out of 27392 0%
Number of 4 input LUTs: 54 out of 27392 0%
Number of MULT18X18s: 1 out of 136 0%

Timing Summary:
---------------

Minimum period: 4.069ns (Maximum Frequency: 245.791MHz)
Minimum input arrival time before clock: 3.421ns
Maximum output required time after clock: 3.615ns

Definitely uses more resources, and it takes 3 cycles.
I believe the longer input arrival time is because the inputs are
directly fed into the multiplier, not registered. If the previous stage
involved some logic, another pipeline stage may be necessary to fix
this.

- Brian
 
On Sep 10, 1:16 pm, Brian Drummond <brian_drumm...@btconnect.com>
wrote:
(Incidentally,  using variables instead of signals does exactly the same
thing; i.e. works exactly as expected. Writing the multiplications out
longhand in the process, instead of using a procedure, works as
expected. But if you have an array of multiplications, it gets ugly.)
That's what loops and arrays of vectors are for... ;^)

I just wish there was a way to synthesize procedures that span clock
cycles, rather than having to pass inout args for intermediate outputs/
registers. OTOH, most of the cases where I would want that, I would
need a fork/join type capability to avoid a separate process anyway.

Interesting... especially since XST figures it out correctly with
variables. FYI, I just checked it with Synplify, and it works
correctly regardless of the order of signal assignments in the
procedure.

Andy
 
On Sep 10, 2:16 pm, Brian Drummond <brian_drumm...@btconnect.com>
wrote:
On Tue, 09 Sep 2008 20:47:29 +0100, Jonathan Bromley

When does the order of signal assignment matter within a process?
Apparently, when you are assigning them within a procedure, and using
XST for synthesis.
You've opened a case with Xilinx about this, right?

Frankly, I am astonished.
I have seen this in XST 7.1, 9.1, 9.2, and now 10.1.
Makes me wonder why you would continue using the tool.

I would expect every synthesis tool to get this right.
Or at least respond a bit more promptly to bug reports to fix the
tool.

Kevin Jennings
 
On Thu, 11 Sep 2008 04:39:45 -0700 (PDT), KJ <kkjennings@sbcglobal.net>
wrote:

On Sep 10, 2:16 pm, Brian Drummond <brian_drumm...@btconnect.com
wrote:
On Tue, 09 Sep 2008 20:47:29 +0100, Jonathan Bromley

When does the order of signal assignment matter within a process?
Apparently, when you are assigning them within a procedure, and using
XST for synthesis.


You've opened a case with Xilinx about this, right?
Naturally. Not when I originally found it, only when I had time to
create a simple test case. (My experience is that there's no point in
opening a WebCase unless you have time to support the process properly).

Haven't heard back on this one yet, but it's only been four working days
so far.

Frankly, I am astonished.
I have seen this in XST 7.1, 9.1, 9.2, and now 10.1.

Makes me wonder why you would continue using the tool.
....very carefully...

- Brian
 
Andy wrote:

I just wish there was a way to synthesize procedures that span clock
cycles, rather than having to pass inout args for intermediate outputs/
registers.
I assume you are ruling out procedures declared
or overloaded in process scope.

These synthesize just fine for any register structure declared
in the same process.

-- Mike Treseler
 
On Mon, 29 Sep 2008 03:27:06 -0700, Mike Treseler <mtreseler@gmail.com>
wrote:

Andy wrote:

I just wish there was a way to synthesize procedures that span clock
cycles, rather than having to pass inout args for intermediate outputs/
registers.

I assume you are ruling out procedures declared
or overloaded in process scope.

These synthesize just fine for any register structure declared
in the same process.
Uhhh...

unfortunately, not in XST, if you are passing signals to procedures, to
use as pipeline registers.

It handles variables just fine though!

- Brian
 

Welcome to EDABoard.com

Sponsor

Back
Top