addsubs on FPGA

Jan 8, 2014

Hi,

I have a question about RTL design for addsub-based implementations.

I have heard that addsubs are not preferred on FPGAs because they produce worse area and timing QoR. Is that true? Is resource sharing not preferred on FPGAs in general?

However, when I try the very simple addsub design shown below, I see no difference. Maybe with small examples the difference in implementation is not evident, which is why I wanted to ask a broader audience.

Reasoning and example cases for both 'yes' and 'no' would help in understanding the cause.

Thanks
Vipin

module addsub(a, b, oper, res);
  input        oper;   // 0: add, 1: subtract
  input  [7:0] a;
  input  [7:0] b;
  output [7:0] res;
  reg    [7:0] res;

  // Combinational: res follows a, b and oper
  always @(a or b or oper)
  begin
    if (oper == 1'b0)
      res = a + b;
    else
      res = a - b;
  end
endmodule

glen herrmannsfeldt · Jan 9, 2014

sh.vipin@gmail.com wrote:

> I have a query on the RTL designing for addsub based implementations.

> I heard that addsubs are not preferred on FPGAs as they produce
> worse area and timing QoR. Is it true ? Is resource sharing not
> preferred in general on FPGAs.

> However, if I try a very simple design of addsub shown below
> it shows me no difference. May be in case of small examples,
> the difference in implementation might not be evident.
> That is why I wanted to ask a broader audience.

> The reasoning & cases for both 'yes' and 'no' will help in
> understanding the cause ?

I first got interested in FPGA addition and subtraction
in the XC4000 days. The XC4000 has a special carry logic
that may or may not do this operation. The carry logic
changed completely between the XC4000 series and later series,
though.

In the pre-IC days, it was common to build logic, called ALU,
which can implement add, subtract, and some bitwise logic operations
using an optimal number of transistors or gates. Similar logic
went into TTL.

> module addsub(a, b, oper, res);
> input oper;
> input [7:0] a;
> input [7:0] b;
> output [7:0] res;
> reg [7:0] res;
> always @(a or b or oper)
> begin
> if (oper == 1'b0)
> res = a + b;
> else
> res = a - b;
> end
> endmodule

Well, one possible implementation is adder and subtractor,
followed by mux to select. But modern logic optimization tools
should be able to do better. You could also write:

res = a + (oper ? b:-b);

which may or may not fit the FPGA better. (Seems to me closer
to the way that the carry logic works, though.)
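
A minimal sketch of a third way to write the same function, for comparison in a synthesis report. It relies on the two's-complement identity a - b = a + ~b + 1: oper conditionally inverts b and also serves as the carry-in. The module name addsub_xor is made up for illustration, and oper keeps the polarity of the original module (0 adds, 1 subtracts); note that the one-liner above adds when oper is 1. Whether any of these forms maps better onto a given carry chain is something only the tool reports for the target device can show.

```verilog
// Hypothetical module name; same ports as the addsub module in the question.
module addsub_xor(a, b, oper, res);
  input        oper;   // 0: add, 1: subtract
  input  [7:0] a;
  input  [7:0] b;
  output [7:0] res;

  // oper = 0: a + b + 0
  // oper = 1: a + ~b + 1, which equals a - b modulo 2^8
  assign res = a + (b ^ {8{oper}}) + oper;
endmodule
```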

If you want optimal LUT use, or minimal delay, then you need
to look more carefully at what it is doing. Otherwise, the logic
minimization will apply to the whole system, such that it may
or may not matter.

-- glen

Jan 9, 2014

On Wednesday, January 8, 2014 4:33:39 PM UTC-6, sh.v...@gmail.com wrote:

> Hi, I have a query on the RTL designing for addsub based implementations. I
> heard that addsubs are not preferred on FPGAs as they produce worse area and
> timing QoR.

Statements like this about what FPGAs prefer do not always apply to every manufacturer's FPGAs, or even to every family from the same manufacturer. What did not work well months or years ago may not be an issue today with another FPGA family. Your tests seem to show it works fine for your target FPGA and tools. Different synthesis tools (including different versions of the same tool) may also affect the results.

On a slightly different issue, IMHO, creating a design where an adder and/or subtractor is a separate module to be instantiated makes the larger project's code less readable and understandable, unless you are specifically trying to re-use a given adder or subtractor's implementation (not just the code) to save utilization on the project.

Don't borrow trouble unless you have to. Write the RTL so that you can understand the function it has to perform (not the way you'd design the hardware) first, then see if that meets your performance/utilization requirements (not your personal desire to make the "best" implementation). You'd be amazed what a good synthesis tool can do these days. The folks that have to maintain your design (which may be yourself in 6 weeks/months/years) will thank you for it.

Andy

glen herrmannsfeldt · Jan 9, 2014

jonesandy@comcast.net wrote:

> On Wednesday, January 8, 2014 4:33:39 PM UTC-6, sh.v...@gmail.com wrote:
>> Hi, I have a query on the RTL designing for addsub based implementations. I
>> heard that addsubs are not preferred on FPGAs as they produce worse area and
>> timing QoR.

> Such statements often heard about preferences in FPGAs are not
> always applicable to all manufacturers' FPGAs or even all of the
> same manufacturer's FPGA families. What might not have worked well
> at some time months or years ago may not be an issue today with
> another FPGA family. Your tests seem to show it works fine for
> your target FPGA and tools. Different synthesis tools (including
> different versions of the same tool) may also affect the results.

Yes. As I noted, there was a big change after the XC4000.

> On a slightly different issue, IMHO, creating a design where
> an adder and/or subtractor is a separate module to be
> instantiated makes the larger project's code less readable
> and understandable, unless you are specifically trying to
> re-use a given adder or subtractor's implementation (not
> just the code) to save utilization on the project.

Hmm. Hard to say, but in the ones I work on, it is more readable
as a separate module. But it might be that the OP was using this
to show the question, and not actually code that way.

As far as I know, the tools first flatten the netlist, so it
doesn't change the result at all.

> Don't borrow trouble unless you have to. Write the RTL so
> that you can understand the function it has to perform
> (not the way you'd design the hardware) first, then see
> if that meets your performance/utilization requirements
> (not your personal desire to make the "best" implementation).

It has always seemed to me that people who knew how to design
hardware, knew about gates and such, wrote better HDL. That is,
not think of it as writing software (like C), but as wiring
up gates.

But yes, as with software, write for readability.

> You'd be amazed what a good synthesis tool can do these days.
> The folks that have to maintain your design (which may be
> yourself in 6 weeks/months/years) will thank you for it.

There are cases where the performance goal is "as fast as possible."
In this case, compare the logic against the logic of a fixed
adder. If it is the same speed, then use it. If it is a lot
slower, then see why it is slow. Another possibility is to
pipeline the complement stage before an adder.
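
A rough sketch of that last idea, under the assumption that a clock is available and one extra cycle of latency is acceptable. The module and register names (addsub_pipe, a_q, b_q, cin_q) are invented for illustration. The first stage conditionally inverts b and carries oper along as a carry-in; the second stage is then a plain registered adder, so the add/subtract selection is no longer in the adder's path.

```verilog
// Hypothetical two-stage add/subtract; the result appears two clocks after the inputs.
module addsub_pipe(clk, a, b, oper, res);
  input        clk;
  input        oper;   // 0: add, 1: subtract
  input  [7:0] a;
  input  [7:0] b;
  output [7:0] res;
  reg    [7:0] res;

  reg    [7:0] a_q;    // a delayed by one stage to stay aligned with b_q
  reg    [7:0] b_q;    // b, inverted when subtracting
  reg          cin_q;  // carry-in that completes the two's complement

  always @(posedge clk)
  begin
    // Stage 1: complement stage
    a_q   <= a;
    b_q   <= oper ? ~b : b;
    cin_q <= oper;
    // Stage 2: fixed adder with carry-in
    res   <= a_q + b_q + cin_q;
  end
endmodule
```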

-- glen

Walter Banks · Jan 10, 2014

glen herrmannsfeldt wrote:

> sh.vipin@gmail.com wrote:

>> I have a query on the RTL designing for addsub based implementations.

>> I heard that addsubs are not preferred on FPGAs as they produce
>> worse area and timing QoR. Is it true ? Is resource sharing not
>> preferred in general on FPGAs.

>> However, if I try a very simple design of addsub shown below
>> it shows me no difference. May be in case of small examples,
>> the difference in implementation might not be evident.
>> That is why I wanted to ask a broader audience.

>> The reasoning & cases for both 'yes' and 'no' will help in
>> understanding the cause ?

> I first got interested in FPGA addition and subtraction
> in the XC4000 days. The XC4000 has a special carry logic
> that may or may not do this operation. The carry logic
> changed completely between the XC4000 series and later series,
> though.

> In the pre-IC days, it was common to build logic, called ALU,
> which can implement add, subtract, and some bitwise logic operations
> using an optimal number of transistors or gates. Similar logic
> went into TTL.

>> module addsub(a, b, oper, res);
>> input oper;
>> input [7:0] a;
>> input [7:0] b;
>> output [7:0] res;
>> reg [7:0] res;
>> always @(a or b or oper)
>> begin
>> if (oper == 1'b0)
>> res = a + b;
>> else
>> res = a - b;
>> end
>> endmodule

> Well, one possible implementation is adder and subtractor,
> followed by mux to select. But modern logic optimization tools
> should be able to do better. You could also write:

> res = a + (oper ? b:-b);

> which may or may not fit the FPGA better. (Seems to me closer
> to the way that the carry logic works, though.)

The above has an ambiguous carry out depending on how the -b is
implemented.

If -b is implemented as ~b+1 then for subtract
res = a + ~b + 1
which makes the carry out the result of the +1 increment and not the
addition.
A simple test case is when a and b are 0.

If the -b is a true -b then res = 0 Carry = 0
If the -b is ~b+1 then res = 0 Carry = 1

Might be better to restate the above as

res = (oper ? b:-b) + a;

which doesn't have this ambiguity.

I run into this a lot writing code generators for compilers.

w..

Jan 10, 2014

On Thursday, January 9, 2014 2:07:28 PM UTC-6, Walter Banks wrote:
> The above has an ambiguous carry out depending on how the -b is implemented.

Interesting, but since res, a and b are all the same size (in bits), in this Verilog statement, there is no observable carry out, so there is no ambiguity.

If res were bigger than a and b, then I'm not sure what it would do (but I'm sure it's defined somewhere). I use VHDL.

Andy

Jan 10, 2014

On Thursday, January 9, 2014 4:38:07 PM UTC-6, glen herrmannsfeldt wrote:

> It has always seemed to me that people who knew how to design hardware, knew
> about gates and such, wrote better HDL. That is, not think of it as writing
> software (like C), but as wiring up gates.

I'm almost the opposite. I see RTL written by very experienced HW (not HDL) designers, and it often reads like a netlist. Might as well have coded it in edif and saved the cost of a synthesis license.

It's not their fault. We don't spend time teaching HDL designers how a synthesis tool analyzes their code, and why it infers a register, a latch(!), a RAM, or combinatorial gates. We teach all these cook-book approaches to designing FPGAs and ASICs using the same primitive functions they used with schematics.

We are sequential thinkers, not parallel thinkers. Therefore, it is best that we describe the desired behavior (on a clock cycle basis) in a sequential context (an always block or process), and let the synthesis tool infer parallelism where it is possible (they're excellent at that). Use functions and procedures to break out subsets of sequential behaviors. Instead of thinking in registers (circuit elements), think in clock cycles of delay (behavior). The registers are going to get shuffled around by retiming/pipelining optimizations anyway. The clock cycle delays will still be there. Just be careful around asynchronous inputs!

Of course, when the functionality is so complex that it cannot be easily expressed in a single sequential context, then it must be broken up into separately instantiated parallel contexts (entities or modules), each including their own detailed behavior in a sequential context.

My point is, we can understand (and therefore express and maintain) more complex behavior when it is conveyed in a sequential context. Imagine a casserole recipe written in concurrent statements.

> There are cases where the performance goal is "as fast as possible."

In my professional experience, such cases are pretty rare. But fun when they happen.

> Another possibility is to pipeline the complement stage before an adder.

Especially if oper and b are both available early!

andy

glen herrmannsfeldt · Jan 10, 2014

jonesandy@comcast.net wrote:

> On Thursday, January 9, 2014 2:07:28 PM UTC-6, Walter Banks wrote:
>> The above has an ambiguous carry out depending on how
>> the -b is implemented.

> Interesting, but since res, a and b are all the same size
> (in bits), in this Verilog statement, there is no observable
> carry out, so there is no ambiguity.

> If res were bigger than a and b, then I'm not sure what it
> would do (but I'm sure it's defined somewhere). I use VHDL.

I would have to look up the rule if I was actually doing it,
but yes, verilog knows about carry if the register is wide enough,
and it is supposed to ignore the carry if there aren't more bits.

I have found some synthesis tools that complain about the loss
of the carry. Unlike most programming languages, verilog looks
at the size of the destination (left side of assignment).
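
A small sketch of that sizing rule, assuming 8-bit operands and a 9-bit destination (the module and signal names are invented for illustration). In wide_diff the operands are extended to 9 bits before the subtraction, so bit 8 behaves as a borrow flag. In alu_sum the inversion happens at 8 bits inside the concatenation, so bit 8 is the carry out of a + ~b + 1, which is the inverse of the borrow. With a = 0 and b = 0 the low eight bits are 0 in both cases, but bit 8 is 0 in wide_diff and 1 in alu_sum, which matches the two cases described earlier in the thread.

```verilog
// Hypothetical module showing how the destination width exposes bit 8.
module addsub_carry(a, b, wide_diff, alu_sum);
  input  [7:0] a;
  input  [7:0] b;
  output [8:0] wide_diff;
  output [8:0] alu_sum;

  // a and b are zero-extended to 9 bits, then subtracted:
  // bit 8 is set exactly when b > a (a borrow).
  assign wide_diff = a - b;

  // Operands inside a concatenation keep their own width, so ~b is 8 bits.
  // Bit 8 is the adder's carry out, set exactly when a >= b.
  assign alu_sum = {1'b0, a} + {1'b0, ~b} + 9'd1;
endmodule
```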

Well, I usually write continuous assignment, not behavioral
assignment. I believe the rules are the same, but I am not
sure about that.

Does VHDL have something like the verilog continuous assignment?

-- glen

Jan 10, 2014

On Friday, January 10, 2014 4:15:01 PM UTC-6, glen herrmannsfeldt wrote:
> Does VHDL have something like the verilog continuous assignment?

Yes, VHDL has concurrent assignment statements in several forms: direct, conditional and selected (like a case statement on the RHS), as well as concurrent procedure calls.

It is difficult to describe an iterative behavior, such as priority encoding or "counting ones," with concurrent statements; these are much easier with sequential statements.

Andy

glen herrmannsfeldt · Jan 10, 2014

jonesandy@comcast.net wrote:

> On Thursday, January 9, 2014 4:38:07 PM UTC-6, glen herrmannsfeldt wrote:
>> It has always seemed to me that people who knew how to design hardware, knew
>> about gates and such, wrote better HDL. That is, not think of it as writing
>> software (like C), but as wiring up gates.

> I'm almost the opposite. I see RTL written by very experienced
> HW (not HDL) designers, and it often reads like a netlist.
> Might as well have coded it in edif and saved the cost of
> a synthesis license.

> It's not their fault. We don't spend time teaching HDL
> designers how a synthesis tool analyzes their code,
> and why it infers a register, a latch(!), a RAM, or
> combinatorial gates. We teach all these cook-book approaches
> to designing FPGAs and ASICs using the same primitive
> functions they used with schematics.

> We are sequential thinkers, not parallel thinkers.

OK, but HDL is inherently parallel, and so, more and more, is software
programming, as multicore systems become more popular.

> Therefore, it is best that we describe the desired behavior
> (on a clock cycle basis) in a sequential context (an
> always block or process), and let the synthesis tool infer
> parallelism where it is possible (they're excellent at that).

I believe that C programmers, and other high-level language
programmers, who know how to write assembler code tend to
write better HLL code. They don't have to think about the
generated code for each statement, but still know which constructs
generate better code.

> Use functions and procedures to break out subsets of
> sequential behaviors. Instead of thinking in registers
> (circuit elements), think in clock cycles of delay (behavior).
> The registers are going to get shuffled around by
> retiming/pipelining optimizations anyway. The clock
> cycle delays will still be there. Just be careful around
> asynchronous inputs!

Some time ago, I was designing systolic arrays with the goal
of at most two level of logic (two LUTs) between registers.

But registers are what make systolic arrays work, so there
really isn't any ignoring them.

> Of course, when the functionality is so complex that it
> cannot be easily expressed in a single sequential context,
> then it must be broken up into separately instantiated
> parallel contexts (entities or modules), each including
> their own detailed behavior in a sequential context.

A systolic array is a long array, hundreds to thousands of stages,
of fairly simple unit cells.

Mostly, I don't have anything against behavioral HDL, but am
less sure about people who want to write HDL in C.

> My point is, we can understand (and therefore express and
> maintain) more complex behavior when it is conveyed in a
> sequential context. Imagine a casserole recipe written
> in concurrent statements.

If you are building a factory to produce thousands of them
a day, then you probably have to consider it in parallel.
For home cooking, though, serial usually works.

>> There are cases where the performance goal is "as fast
>> as possible."

> In my professional experience, such cases are pretty rare.
> But fun when they happen.

(snip)

-- glen

glen herrmannsfeldt · Jan 11, 2014

jonesandy@comcast.net wrote:

(snip, I wrote)

>> Does VHDL have something like the verilog continuous assignment?

> Yes, VHDL has concurrent assignment statements in several
> forms: direct, conditional and selected (like a case statement
> on the RHS), as well as concurrent procedure calls.

Verilog has the conditional operator (?:), like C and Java.

> It is difficult to describe an iterative behavior, such
> as priority encoding or "counting ones," with concurrent
> statements; these are much easier with sequential
> statements.

Not so hard, as I think I have done both of them.

The usual implementation of counting ones is a carry save
adder tree. It isn't so hard to write, but, yes, the usual tools
generate them pretty well.

Once I needed a ones counter that would report zero, one, two,
three, or more than three set bits from a 40-bit input, with
one pipeline stage. I wrote the logic for an 8-bit version,
used five of those, a register stage, and then enough logic
to combine the results.

Counting up to 8 bits is about as easy with and without a loop.
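
To make the comparison concrete, here is a sketch of an 8-bit ones count written both ways. The module names are invented for illustration, and this is not the saturating 40-bit design described above, only the plain 8-bit count. Both forms describe the same function; which one a given tool turns into a better adder tree is not something the source text settles.

```verilog
// Sequential-statement version: a loop inside an always block.
module count_ones8_loop(d, n);
  input  [7:0] d;
  output [3:0] n;      // 0..8 fits in 4 bits
  reg    [3:0] n;
  integer i;

  always @(d)
  begin
    n = 4'd0;
    for (i = 0; i < 8; i = i + 1)
      n = n + d[i];    // each bit adds 0 or 1
  end
endmodule

// Continuous-assignment version of the same count.
module count_ones8_assign(d, n);
  input  [7:0] d;
  output [3:0] n;

  // The 4-bit destination widens each 1-bit operand before the additions.
  assign n = d[0] + d[1] + d[2] + d[3] + d[4] + d[5] + d[6] + d[7];
endmodule
```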

-- glen
