Sum of 8 numbers in FPGA

b2508

Guest
How do I most efficiently add 8 numbers in FPGA?
What is the best way to save LUTs?
How is data width affecting LUT consumption?
Thanks in advance
--------------------------------------
Posted through http://www.FPGARelated.com
 
On Tuesday, October 20, 2015 at 7:40:35 AM UTC-4, b2508 wrote:
> How do I most efficiently add 8 numbers in FPGA?
With an adder. You haven't stated any requirements so any answer here would be OK. Consider:
- You didn't specify your latency or processing speed requirements
- You didn't specify your efficiency metric (i.e. power? LUTs? Something else?)

> What is the best way to save LUTs?
Using an accumulator and streaming the numbers in sequentially might use fewer LUTs.

> How is data width affecting LUT consumption?
More LUTs will be used when you increase the data width.

Kevin
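
A minimal Verilog sketch of the accumulator approach Kevin mentions. The module name `acc_sum8`, the `valid`/`sum_valid` handshake and the parameter `W` are assumptions made for illustration; nothing in the thread specifies an interface. It uses one (W+3)-bit adder and needs eight input cycles per sum, so the adder (and, roughly, the LUT count) grows linearly with `W`.

```verilog
// Word-serial accumulator: one adder, one input word per clock.
// Hypothetical interface; assumes unsigned W-bit inputs, 8 words per sum.
module acc_sum8 #(
    parameter W = 8
) (
    input  wire         clk,
    input  wire         rst,       // synchronous reset
    input  wire         valid,     // din carries a new word this cycle
    input  wire [W-1:0] din,
    output reg  [W+2:0] sum,       // W+3 bits hold the sum of 8 W-bit words
    output reg          sum_valid  // high for one cycle when sum is complete
);
    reg [W+2:0] acc;
    reg [2:0]   count;

    always @(posedge clk) begin
        sum_valid <= 1'b0;
        if (rst) begin
            acc   <= 0;
            count <= 0;
        end else if (valid) begin
            if (count == 3'd7) begin
                sum       <= acc + din;  // eighth word completes the sum
                sum_valid <= 1'b1;
                acc       <= 0;
                count     <= 0;
            end else begin
                acc   <= acc + din;
                count <= count + 3'd1;
            end
        end
    end
endmodule
```

The trade is area for time: a fully parallel sum needs seven adders but produces a result every cycle.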
 
b2508 <108118@fpgarelated> wrote:
> How do I most efficiently add 8 numbers in FPGA?
> What is the best way to save LUTs?
> How is data width affecting LUT consumption?

The most efficient adder is the carry save adder.

But the actual implementation depends on many other details,
such as the timing of the availability of the numbers,
and also the bit width.

-- glen
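
For readers who have not met the carry-save adder glen refers to, here is a rough Verilog sketch. A 3:2 carry-save stage reduces three words to two without propagating carries; six such stages in four levels reduce eight words to two, and one ordinary adder finishes the job. The module names and the parameter `W` are assumptions for illustration, and whether this beats a plain `+` expression on a given FPGA is something to check in the tools, since FPGA carry chains are already fast.

```verilog
// 3:2 carry-save stage: a + b + c == s + (cy << 1), no carry propagation.
module csa3 #(parameter W = 8) (
    input  wire [W-1:0] a, b, c,
    output wire [W-1:0] s,   // bitwise sum
    output wire [W-1:0] cy   // bitwise carry, weight 2
);
    assign s  = a ^ b ^ c;
    assign cy = (a & b) | (a & c) | (b & c);
endmodule

// Sum of 8 unsigned W-bit words: 8 -> 6 -> 4 -> 3 -> 2 words, then one adder.
// All words are zero-extended to N = W+3 bits so no intermediate value overflows.
module sum8_csa #(parameter W = 8) (
    input  wire [W-1:0] a, b, c, d, e, f, g, h,
    output wire [W+2:0] sum
);
    localparam N = W + 3;
    wire [N-1:0] xa = a, xb = b, xc = c, xd = d,
                 xe = e, xf = f, xg = g, xh = h;
    wire [N-1:0] s0, c0, s1, c1, s2, c2, s3, c3, s4, c4, s5, c5;

    // Carry words have weight 2, so shift them left by one before reuse.
    wire [N-1:0] c0s = c0 << 1;
    wire [N-1:0] c1s = c1 << 1;
    wire [N-1:0] c2s = c2 << 1;
    wire [N-1:0] c3s = c3 << 1;
    wire [N-1:0] c4s = c4 << 1;
    wire [N-1:0] c5s = c5 << 1;

    csa3 #(.W(N)) u0 (.a(xa),  .b(xb),  .c(xc),  .s(s0), .cy(c0)); // level 1
    csa3 #(.W(N)) u1 (.a(xd),  .b(xe),  .c(xf),  .s(s1), .cy(c1));
    csa3 #(.W(N)) u2 (.a(s0),  .b(c0s), .c(s1),  .s(s2), .cy(c2)); // level 2
    csa3 #(.W(N)) u3 (.a(c1s), .b(xg),  .c(xh),  .s(s3), .cy(c3));
    csa3 #(.W(N)) u4 (.a(s2),  .b(c2s), .c(s3),  .s(s4), .cy(c4)); // level 3
    csa3 #(.W(N)) u5 (.a(s4),  .b(c4s), .c(c3s), .s(s5), .cy(c5)); // level 4

    assign sum = s5 + c5s;  // the only carry-propagate addition
endmodule
```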
 
On Tuesday, October 20, 2015 at 6:40:35 AM UTC-5, b2508 wrote:
> How do I most efficiently add 8 numbers in FPGA?
> What is the best way to save LUTs?
> How is data width affecting LUT consumption?
> Thanks in advance.

At the risk of doing someone else's homework:

> How do I most efficiently add 8 numbers in FPGA?
The latest parts from Xilinx and Altera will add three numbers at a time using a single carry chain.

> What is the best way to save LUTs?
A: Doing serial arithmetic using block RAM to hold inputs & outputs.
B: Using DSP adders in place of LUT carry chains.

Jim
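
As a sketch of what Jim's three-at-a-time point looks like in source, the sum can be grouped into three-input additions, which gives two adder levels instead of the three of a binary tree. The module name and parameter `W` are assumptions for illustration, and whether each three-input add really lands on a single carry chain depends on the device family and the synthesis tool.

```verilog
// Sum of 8 unsigned W-bit words written as three-input additions.
// Mapping each 3-input add to one carry chain is up to the tool and device.
module sum8_ternary #(parameter W = 8) (
    input  wire [W-1:0] a, b, c, d, e, f, g, h,
    output wire [W+2:0] sum
);
    // Three W-bit words need W+2 bits; two words need W+1 bits.
    wire [W+1:0] t0 = a + b + c;
    wire [W+1:0] t1 = d + e + f;
    wire [W:0]   t2 = g + h;

    assign sum = t0 + t1 + t2;  // second level, again a 3-input add
endmodule
```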
 
rickman <gnuarm@gmail.com> wrote:
> On 10/20/2015 7:40 AM, b2508 wrote:
>> How do I most efficiently add 8 numbers in FPGA?
>> What is the best way to save LUTs?
>> How is data width affecting LUT consumption?
>> Thanks in advance.

> This sounds like a homework problem.

Yes, but even so, leaving lots of unknowns.

> In an FPGA there aren't many ways to save LUTs for adders.

If you have 8 n-bit inputs and need the sum as fast as
possible, there aren't a huge number of choices.
Though it does depend a little on n.

> Unless you can process your data serially, the

In this case, there are two choices. You can process the data
bit serial, or word serial. (Or, I suppose somewhere in between.)

Choosing one of those would depend on how the data was supplied,
and again, how fast you need the result. In addition, only one
set of eight, or many?

> only thing I can think of is to do the additions in a tree structure
> which saves you a very few LUTs from the bit growth of the result
> compared to processing the additions serially, it's also faster.
> (((a+b)+(c+d))+((e+f)+(g+h))) vs. ((((((a+b)+c)+d)+e)+f)+g)+h

If you just chain adders, the usual tools will optimize them.

But you might also want some registers in there, too.

Also, this could be a lab homework problem, where the student is
supposed to try things out and see what happens.

-- glen
 
On 10/20/2015 7:40 AM, b2508 wrote:
> How do I most efficiently add 8 numbers in FPGA?
> What is the best way to save LUTs?
> How is data width affecting LUT consumption?
> Thanks in advance.

This sounds like a homework problem. In an FPGA there aren't many ways
to save LUTs for adders. Unless you can process your data serially, the
only thing I can think of is to do the additions in a tree structure
which saves you a very few LUTs from the bit growth of the result
compared to processing the additions serially, it's also faster.
(((a+b)+(c+d))+((e+f)+(g+h))) vs. ((((((a+b)+c)+d)+e)+f)+g)+h

--

Rick
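
To make Rick's bit-growth point concrete, here is a rough Verilog sketch of both groupings with each intermediate result given the minimum width it needs. Module names and the parameter `W` are assumptions for illustration. Counting adder output bits as a crude proxy for LUTs, the tree needs 4(W+1) + 2(W+2) + (W+3) = 7W + 11 bits and the chain needs (W+1) + 2(W+2) + 4(W+3) = 7W + 17 bits, so the tree saves about six bits regardless of `W`. The tree also has three adders in its longest path, where the chain has seven.

```verilog
// Tree: (((a+b)+(c+d))+((e+f)+(g+h)))
module sum8_tree #(parameter W = 8) (
    input  wire [W-1:0] a, b, c, d, e, f, g, h,
    output wire [W+2:0] sum
);
    wire [W:0]   ab   = a + b;      // level 1: four (W+1)-bit results
    wire [W:0]   cd   = c + d;
    wire [W:0]   ef   = e + f;
    wire [W:0]   gh   = g + h;
    wire [W+1:0] abcd = ab + cd;    // level 2: two (W+2)-bit results
    wire [W+1:0] efgh = ef + gh;
    assign sum = abcd + efgh;       // level 3: one (W+3)-bit result
endmodule

// Chain: ((((((a+b)+c)+d)+e)+f)+g)+h
module sum8_chain #(parameter W = 8) (
    input  wire [W-1:0] a, b, c, d, e, f, g, h,
    output wire [W+2:0] sum
);
    wire [W:0]   s1 = a + b;        // 2 words: W+1 bits
    wire [W+1:0] s2 = s1 + c;       // 3 words: W+2 bits
    wire [W+1:0] s3 = s2 + d;       // 4 words: W+2 bits
    wire [W+2:0] s4 = s3 + e;       // 5 or more words: W+3 bits
    wire [W+2:0] s5 = s4 + f;
    wire [W+2:0] s6 = s5 + g;
    assign sum = s6 + h;
endmodule
```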
 
On 20/10/2015 22:52, jim.brakefield@ieee.org wrote:
> On Tuesday, October 20, 2015 at 6:40:35 AM UTC-5, b2508 wrote:
>> How do I most efficiently add 8 numbers in FPGA?
>> What is the best way to save LUTs?
>> How is data width affecting LUT consumption?

Why not try it out? Run one of the tool chains and see what happens when
you build the adder in different ways, and then if it's not what you expect
come and ask on here. The tool chains will show you what the LUT usage
is. I was a tad surprised to find that when I coded:-

byteout <= byte1+byte2+byte3+byte4+byte5+byte6+byte7+byte8 ;

and compared it with

temp1 <= byte1 + byte2 + byte3 + byte4 ;
temp2 <= byte5 + byte6 + byte7 + byte8 ;
byteout <= temp1 + temp2 ;

I got the same number of LUTs and Slices used....
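
For anyone wanting to repeat this experiment, the two forms could be wrapped as separate top-level modules and synthesized one at a time, comparing the utilization reports. This is only a sketch: the module names are invented, and the widths (8-bit inputs, 10-bit temporaries, 11-bit result) are assumptions, since the declarations were not posted.

```verilog
// Form 1: everything in one expression.
module sum8_flat (
    input  wire [7:0]  byte1, byte2, byte3, byte4,
    input  wire [7:0]  byte5, byte6, byte7, byte8,
    output wire [10:0] byteout   // assumed wide enough for 8 * 255
);
    assign byteout = byte1 + byte2 + byte3 + byte4
                   + byte5 + byte6 + byte7 + byte8;
endmodule

// Form 2: two partial sums, then a final add.
module sum8_split (
    input  wire [7:0]  byte1, byte2, byte3, byte4,
    input  wire [7:0]  byte5, byte6, byte7, byte8,
    output wire [10:0] byteout
);
    wire [9:0] temp1 = byte1 + byte2 + byte3 + byte4;  // up to 4 * 255
    wire [9:0] temp2 = byte5 + byte6 + byte7 + byte8;
    assign byteout = temp1 + temp2;
endmodule
```

If the temporaries were declared only 8 bits wide, the split form would truncate and no longer compute the same function, so the widths matter when comparing.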


>> Thanks in advance.
>
> At the risk of doing someone else's homework:
>
>> How do I most efficiently add 8 numbers in FPGA?

Define efficiency. Almost all efficiency is a trade off between space
and performance.

> The latest parts from Xilinx and Altera will add three numbers at a time using a single carry chain.

In this modern world of optimising tool chains why not just put them all
in one expression and let the tool chain work out what is best for the chip.

>> What is the best way to save LUTs?
> A: Doing serial arithmetic using block RAM to hold inputs & outputs.

A classical trade off of speed, as it's now serial, for gates used. If
you do it serially then you may need to do 7 separate serial additions,
which will need more LUTs for the carry latches....

> B: Using DSP adders in place of LUT carry chains.

Assuming your chip has one?


Just my two cents/pence/yuan...
And Jim, Nothing personal, your comments seemed a suitable place to hang
my hat....

Dave
 
David Wade <dave.g4ugm@gmail.com> wrote:
> On 20/10/2015 22:52, jim.brakefield@ieee.org wrote:
>> On Tuesday, October 20, 2015 at 6:40:35 AM UTC-5, b2508 wrote:
>>> How do I most efficiently add 8 numbers in FPGA?
>>> What is the best way to save LUTs?
>>> How is data width affecting LUT consumption?
>
> Why not try it out? Run one of the tool chains and see what happens when
> you build the adder in different ways, and then if it's not what you expect
> come and ask on here. The tool chains will show you what the LUT usage
> is. I was a tad surprised to find that when I coded:-
>
> byteout <= byte1+byte2+byte3+byte4+byte5+byte6+byte7+byte8 ;
>
> and compared it with
>
> temp1 <= byte1 + byte2 + byte3 + byte4 ;
> temp2 <= byte5 + byte6 + byte7 + byte8 ;
> byteout <= temp1 + temp2 ;
>
> I got the same number of LUTs and Slices used....

Yes, the optimizers can likely figure that one out.

Some years ago, I needed a 36 bit population count.
That is, how many '1' bits there are in a 36 bit word.

The usual way to make one is with carry save adders, so
I built one up, I think first 8 bits, and then combined those.

It was a little unusual, since I needed to know 0, 1, 2, 3, more than 3.

It wasn't hard to make, but it turns out that if you just say:

p=x[0]+x[1]+x[2]+x[3]+ ... x[35];

it works just about as well. It might be that I had to pipeline
it also, but it still would have been easier to write.
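
A rough Verilog sketch of the "just add the bits" form of the population count, with the count also clamped so that anything above 3 reports as one value. The module name, the port names and the choice of 4 as the code for "more than 3" are assumptions for illustration; the original design was not posted.

```verilog
// 36-bit population count written as a plain sum of bits.
// p_sat reports 0, 1, 2, 3, or 4 (4 is an assumed code meaning "more than 3").
module popcount36_sat (
    input  wire [35:0] x,
    output wire [5:0]  p,      // full count, 0..36
    output wire [2:0]  p_sat
);
    integer   i;
    reg [5:0] cnt;

    always @* begin
        cnt = 6'd0;
        for (i = 0; i < 36; i = i + 1)
            cnt = cnt + x[i];  // the tool is left to build the adder tree
    end

    assign p     = cnt;
    assign p_sat = (cnt > 6'd3) ? 3'd4 : cnt[2:0];
endmodule
```

If only `p_sat` is connected, synthesis is free to drop logic that only feeds the unused upper bits of the full count.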

(snip)

>> The latest parts from Xilinx and Altera will add three numbers
>> at a time using a single carry chain.
>
> In this modern world of optimising tool chains why not just put them all
> in one expression and let the tool chain work out what is best for the chip.

You mean ones with 6 input LUTs? I haven't looked at those much yet.

(snip)

My favorite test of the optimizer is when I make a tiny mistake, which
turns out to cause some signal to never change, and the optimizer
optimizes out all the logic! Nothing at all left!

-- glen
 
On Wednesday, October 21, 2015 at 8:08:59 PM UTC-5, glen herrmannsfeldt wrote:
> You mean ones with 6 input LUTs? I haven't looked at those much yet.
6LUTs are a favorite of mine:
One 4-to-1 mux or two 2-to-1 muxes
2-to-1 mux and an add/subtract

IMHO their reason for being is that they reduce the number of logic levels. Routing delay is now larger than logic delay, so reducing logic levels is a big speed win, more so than the greater logic capability.

The ALUT/ALM is somewhat different and more complicated. I'm not currently using it, but it does appear to have overall characteristics similar to the 6LUT.

Jim
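
To illustrate the input counting behind Jim's list, here is a small Verilog sketch. Per output bit, a 4-to-1 mux has four data inputs and two select inputs, six in total, which is why it fits one 6-input LUT. A 2-to-1 mux feeding an add/subtract has five inputs per bit (`a`, `b0`, `b1`, `sel`, `sub`) plus the carry. Module and port names are assumptions for illustration, and whether a given tool packs the second module into one LUT per bit plus the carry chain should be checked in its reports.

```verilog
// 4-to-1 mux: 4 data bits + 2 select bits = 6 inputs per output bit.
module mux4 #(parameter W = 8) (
    input  wire [W-1:0] d0, d1, d2, d3,
    input  wire [1:0]   sel,
    output reg  [W-1:0] y
);
    always @* begin
        case (sel)
            2'd0:    y = d0;
            2'd1:    y = d1;
            2'd2:    y = d2;
            default: y = d3;
        endcase
    end
endmodule

// 2-to-1 mux on one operand, followed by add/subtract.
module mux_addsub #(parameter W = 8) (
    input  wire [W-1:0] a, b0, b1,
    input  wire         sel,   // picks b1 when high, b0 when low
    input  wire         sub,   // subtract when high, add when low
    output wire [W:0]   y      // top bit: carry out (add) or borrow (subtract)
);
    wire [W-1:0] b = sel ? b1 : b0;
    assign y = sub ? (a - b) : (a + b);
endmodule
```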
 
I expect it to be most efficient to use 8 adders in parallel when the
incoming data does not always fully occupy the vector widths, since the
compiler might discover unused bits and shorten the carry chain length
appropriately.

To meet timing, I always add FFs behind and use register balancing and
retiming, giving the compiler all options for optimization.
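
A minimal Verilog sketch of the "FFs behind" style described here: the sum is written as one combinational expression followed by a few plain registers, and the tool's retiming or register balancing option (which has to be enabled in the tool; option names vary by vendor) is left to push those registers back into the adder logic. The module name and the parameters `W` and `STAGES` are assumptions for illustration. The registers have no reset, on the assumption that plain registers are easier for a tool to move.

```verilog
// Flat sum followed by STAGES registers for the tool to retime.
module sum8_retime #(
    parameter W      = 8,
    parameter STAGES = 2      // number of trailing registers, at least 1
) (
    input  wire         clk,
    input  wire [W-1:0] a, b, c, d, e, f, g, h,
    output wire [W+2:0] sum   // valid STAGES clocks after the inputs
);
    wire [W+2:0] comb = a + b + c + d + e + f + g + h;

    reg [W+2:0] pipe [0:STAGES-1];
    integer i;

    always @(posedge clk) begin
        pipe[0] <= comb;
        for (i = 1; i < STAGES; i = i + 1)
            pipe[i] <= pipe[i-1];
    end

    assign sum = pipe[STAGES-1];
endmodule
```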


