All-real FFT for FPGA

Tim Wescott <tim@seemywebsite.com> wrote:

On Tue, 14 Feb 2017 06:52:32 +0000, Steve Pope wrote:

If for example you compute (a * b) and also compute (a * -b),
the synthesizer is smart enough to know there are not two full
multipliers needed.

I would be very leery of an optimizer that felt free to optimize things
so that they are no longer bit-exact --

Of course, synthesizers need to be bit-exact and conform to the
HDL language spec

and for some combinations of
bits, I'm pretty sure that -(a * b) is not necessarily (a * -b).

That is not the example I gave, but in either example you still would
not need two multipliers, just one multiplier and some small amount of
logic; any reasonable synthesizer would not use two multipliers
worth of gates.
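
A minimal sketch of the sharing being described, modeled in Python with
fixed-width two's-complement arithmetic (the 8-bit width and the helper
names are illustrative assumptions, not anything from the thread): one
true multiplier computes a * b, and a * -b falls out of a negation plus
a corner-case check.

    def wrap(v, bits):
        # Wrap v to a bits-wide two's-complement value.
        m = 1 << bits
        v &= m - 1
        return v - m if v >= m // 2 else v

    def shared_products(a, b, bits=8):
        # One "multiplier" plus cheap extra logic yields both products.
        p = wrap(a * b, 2 * bits)       # the single shared multiplier
        if b == -(1 << (bits - 1)):     # -b is unrepresentable at this width,
            q = p                       # so it wraps back to b: a*(-b) == a*b
        else:
            q = wrap(-p, 2 * bits)      # ordinary case: just a negation
        return p, q

    print(shared_products(3, 5))       # (15, -15)
    print(shared_products(3, -128))    # (-384, -384): the corner case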

Steve
 
<eric.jacobsen@ieee.org> wrote:

On Tue, 14 Feb 2017 06:52:32 +0000 (UTC), spope33@speedymail.org (Steve Pope) wrote:

It's very unclear to me that it would not optimize past the first butterfly.
Synthesizers can also optimize across pipeline stages, and can modify
the number of bits of state and their logic values within a pipeline
register.

The results of synthesizer optimization on something like an FFT will
depend a LOT on the architecture used. A highly parallel
architecture, or an architecture where the FFT stages are
independently pipelined, would likely benefit more from optimization
when the imaginary inputs are static zeroes than a serial architecture
would.

I agree. Sequential designs, especially those that resemble a microcoded
architecture, will not optimize well.

An FFT built around a single multiplier-accumulator would not optimize;
on the other hand, it isn't very high gate-count to begin with.

Steve
 
On Tue, 14 Feb 2017 06:52:32 +0000, Steve Pope wrote:

Tim Wescott <seemywebsite@myfooter.really> wrote:

On Mon, 13 Feb 2017 22:36:41 -0500, rickman wrote:

On 2/13/2017 9:39 PM, Tim Wescott wrote:

Are you referring to the nature of the FFT on real signals (i.e.
data sets that have zeros for the imaginary data)? That would only
help with the first pass of butterflies. Once you perform those,
the imaginary part is not zero anymore.

Yes, but the imaginary and real parts are still symmetrical around f = 0,
so you should neither have to compute nor use them.

That's why you get very little optimization by plugging in the zero
data and letting the tools work it out.

Yes, unless the optimizers get _really_ smart.

We are talking two different things. The HDL synthesis optimizers are
optimizing *logic*. They don't know diddly about what you are using
the logic for.

Well, yes, but these computers are getting smarter all the time. When
you tell Synopsys to synthesize and it says "your fly is unzipped",
you'll know that the optimizers are too damned smart.

If for example you compute (a * b) and also compute (a * -b),
the synthesizer is smart enough to know there are not two full
multipliers needed.

I would be very leery of an optimizer that felt free to optimize things
so that they are no longer bit-exact -- and for some combinations of
bits, I'm pretty sure that -(a * b) is not necessarily (a * -b).

--
Tim Wescott
Control systems, embedded software and circuit design
I'm looking for work! See my website if you're interested
http://www.wescottdesign.com
 
rickman <gnuarm@gmail.com> wrote:

On 2/14/2017 11:39 AM, Steve Pope wrote:

Tim Wescott <tim@seemywebsite.com> wrote:

On Tue, 14 Feb 2017 06:52:32 +0000, Steve Pope wrote:

If for example you compute (a * b) and also compute (a * -b),
the synthesizer is smart enough to know there are not two full
multipliers needed.

I would be very leery of an optimizer that felt free to optimize things
so that they are no longer bit-exact --

Of course, synthesizers need to be bit-exact and conform to the
HDL language spec

and for some combinations of
bits, I'm pretty sure that -(a * b) is not necessarily (a * -b).

That is not the example I gave, but in either example you still would
not need two multipliers, just one multiplier and some small amount of
logic; any reasonable synthesizer would not use two multipliers
worth of gates.

How is that not the example you gave? If the tool is going to use a
single multiplier for calculating (a * b) and (a * -b),

Okay so far

that implies it calculates (a * b) and then negates that to get
(a * -b) as -(a * b), no?

Not necessarily exactly this.

I don't know that -(a * b) wouldn't be exactly the same result as (a *
-b).

You just answered your own question.

Steve
 
On Mon, 13 Feb 2017 22:36:41 -0500, rickman <gnuarm@gmail.com> wrote:

On 2/13/2017 9:39 PM, Tim Wescott wrote:
On Mon, 13 Feb 2017 16:08:00 -0500, rickman wrote:

On 2/13/2017 12:56 PM, Tim Wescott wrote:
On Mon, 13 Feb 2017 01:48:49 -0500, rickman wrote:

On 2/13/2017 12:03 AM, Tim Wescott wrote:
On Sun, 12 Feb 2017 20:32:59 -0500, rickman wrote:

On 2/12/2017 1:05 PM, Tim Wescott wrote:
So, there are algorithms out there to perform an FFT on real data,
that save (I think) roughly 2x the calculations of FFTs for complex
data.

I did a quick search, but didn't find any that are made
specifically for FPGAs. Was my search too quick, or are there no
IP sources to do this?

It would seem like a slam-dunk for Xilinx and Intel/Altera to
include these algorithms in their FFT libraries.

I thought I replied to this, but my computer has been crashing a bit
so maybe I lost that one.

An FFT is inherently a complex function. Real data can be processed
by taking advantage of some of the symmetries in the FFT. I don't
recall the details as it has been over a decade since I worked with
this, but you can fold half of the real data into the imaginary
portion, run a size N/2 FFT and then unfold the results. I believe
you have to run a final N-sized complex pass on these results to get
your final answer. I do recall it saved a *lot* when performing FFTs,
nearly 50%.
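
For anyone wanting to see the fold/unfold concretely, here is a minimal
numpy sketch of one common variant of that trick (even samples packed
into the real part, odd samples into the imaginary part; the names are
mine, not from any vendor library):

    import numpy as np

    def real_fft_via_half(x):
        # Real-input FFT of length N using one complex FFT of length N/2.
        N = len(x)
        z = x[0::2] + 1j * x[1::2]          # fold: evens real, odds imaginary
        Z = np.fft.fft(z)                   # one N/2-point complex FFT
        k = np.arange(N // 2)
        Zc = np.conj(Z[(-k) % (N // 2)])    # Z[(N/2 - k) mod N/2], conjugated
        Xe = 0.5 * (Z + Zc)                 # spectrum of the even samples
        Xo = -0.5j * (Z - Zc)               # spectrum of the odd samples
        # The final "unfold" pass recombines them with twiddle factors;
        # the upper half of the spectrum follows by conjugate symmetry.
        return Xe + np.exp(-2j * np.pi * k / N) * Xo   # X[0 .. N/2-1]

    x = np.random.randn(16)
    print(np.allclose(real_fft_via_half(x), np.fft.fft(x)[:8]))   # True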

My understanding is that there were some software packages that baked
that into the algorithm, for what savings I don't know. I was
wondering if it was done for FFTs as well.

I'm not sure what you mean. When you say "baking" it into the
algorithm, it would do pretty much what I described. That *is* the
algorithm. I haven't heard of any other optimizations. The savings
is in time/size. Essentially it does an N/2-size FFT with an extra
pass, so instead of N*log(N) steps it takes (N+1)*log(N/2) steps.

There's a distinct symmetry to the Fourier transform that you could use
at each step of the way instead of doing the whole thing and fixing it
up at the end.

I don't know if it would save steps, but it would certainly be easier
on someone who just wants to apply an algorithm.

Are you referring to the nature of the FFT on real signals (i.e. data
sets that have zeros for the imaginary data)? That would only help with
the first pass of butterflies. Once you perform those, the imaginary
part is not zero anymore.

Yes, but the imaginary and real parts are still symmetrical around f = 0,
so you should neither have to compute nor use them.

That's why you get very little optimization by plugging in the zero data
and letting the tools work it out.

Yes, unless the optimizers get _really_ smart.

We are talking two different things. The HDL synthesis optimizers are
optimizing *logic*. They don't know diddly about what you are using the
logic for.


On the other hand, the vendor's FFT logic generator should support real
data inputs. It is a common enough need.

I agree, but when I looked I didn't see it.

The Xilinx logic generators also seem to only support 2^n vector sizes.
I don't know how convoluted things get, but I know that with a general-
purpose processor, a prime-value-sized FFT is nearly as fast as a 2^n
one. Don't ask me how -- it's just a box with magic inside for me at
this point.

When I first learned FFTs, I did it by looking at the equations in terms
of the twiddle factors each input was multiplied by as it contributed to
the final value. It works out to be the same calculation as the DFT.
It is taking advantage of the cyclic nature of the twiddle factors to
avoid redundant calculations. The same basis provides the various
symmetries.
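
Those symmetries are quick to verify numerically. For purely real input
the spectrum is conjugate-symmetric, which is exactly why half of it
never needs to be computed or stored (a throwaway numpy check):

    import numpy as np

    x = np.random.randn(8)                 # purely real input
    X = np.fft.fft(x)
    k = np.arange(1, 8)
    # Hermitian symmetry: negative frequencies mirror the positive ones.
    print(np.allclose(X[8 - k], np.conj(X[k])))    # True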

Another way to understand it is that the DFT is a linear-algebra
multiplication of the length-N input vector with the NxN transform
matrix. Before electronics were prevalent, astronomers would
hand-calculate DFTs (for orbit calculations, I think), and after doing
this a lot Gauss observed that many computations were repeated in a
pattern, which allowed the transform matrix to be factored to save
redundant calculations. This is why Gauss is often credited with
"inventing" the FFT, and others much later rediscovered the same
thing.
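
A compact way to see the factoring is to compare the dense matrix
multiply with its radix-2 decimation-in-time decomposition; a Python
sketch for power-of-two lengths (textbook material, not any particular
tool's implementation):

    import numpy as np

    def dft(x):
        # Direct transform: the dense N x N matrix, N^2 complex products.
        N = len(x)
        n = np.arange(N)
        return np.exp(-2j * np.pi * np.outer(n, n) / N) @ x

    def fft_radix2(x):
        # Factored form: the repeated products become two half-size
        # transforms plus N/2 twiddle multiplies per level.
        N = len(x)
        if N == 1:
            return x.astype(complex)
        E = fft_radix2(x[0::2])            # even-index samples
        O = fft_radix2(x[1::2])            # odd-index samples
        t = np.exp(-2j * np.pi * np.arange(N // 2) / N) * O
        return np.concatenate([E + t, E - t])

    x = np.random.randn(8) + 1j * np.random.randn(8)
    print(np.allclose(dft(x), fft_radix2(x)))      # True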


I can't say how prime sized data sets work out. I'm very surprised they
even exist. I can't picture how the butterflies would work.

There are multiple ways to do non-power-of-two-sized FFTs, some are
more efficient than others. Alternative radix or even multiple radix
transforms are sometimes used.
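
One such method is Bluestein's chirp-z trick, which rewrites a DFT of
any length (prime lengths included) as a convolution that ordinary
power-of-two FFTs can evaluate; a numpy sketch, offered as an
illustration rather than anything from the thread:

    import numpy as np

    def bluestein_dft(x):
        # DFT of arbitrary length N via a circular convolution of
        # length M >= 2N-1, where M is a power of two.
        N = len(x)
        n = np.arange(N)
        chirp = np.exp(-1j * np.pi * n * n / N)
        M = 1 << (2 * N - 1).bit_length()
        a = np.zeros(M, complex)
        a[:N] = x * chirp
        b = np.zeros(M, complex)
        b[:N] = np.conj(chirp)
        b[-(N - 1):] = np.conj(chirp[1:][::-1])   # wrap for circular conv
        return chirp * np.fft.ifft(np.fft.fft(a) * np.fft.fft(b))[:N]

    x = np.random.randn(13) + 1j * np.random.randn(13)   # prime length
    print(np.allclose(bluestein_dft(x), np.fft.fft(x)))  # True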



 
On Tue, 14 Feb 2017 06:52:32 +0000 (UTC), spope33@speedymail.org
(Steve Pope) wrote:

Tim Wescott <seemywebsite@myfooter.really> wrote:

On Mon, 13 Feb 2017 22:36:41 -0500, rickman wrote:

On 2/13/2017 9:39 PM, Tim Wescott wrote:

Are you referring to the nature of the FFT on real signals (i.e. data
sets that have zeros for the imaginary data)? That would only help
with the first pass of butterflies. Once you perform those, the
imaginary part is not zero anymore.

Yes, but the imaginary and real parts are still symmetrical around f = 0,
so you should neither have to compute nor use them.

That's why you get very little optimization by plugging in the zero
data and letting the tools work it out.

Yes, unless the optimizers get _really_ smart.

We are talking two different things. The HDL synthesis optimizers are
optimizing *logic*. They don't know diddly about what you are using the
logic for.

Well, yes, but these computers are getting smarter all the time. When
you tell Synopsys to synthesize and it says "your fly is unzipped",
you'll know that the optimizers are too damned smart.

If for example you compute (a * b) and also compute (a * -b),
the synthesizer is smart enough to know there are not two full
multipliers needed.

It's very unclear to me that it would not optimize past the first butterfly.
Synthesizers can also optimize across pipeline stages, and can modify
the number of bits of state and their logic values within a pipeline
register.

The results of synthesizer optimization on something like an FFT will
depend a LOT on the architecture used. A highly parallel
architecture, or an architecture where the FFT stages are
independently pipelined, would likely benefit more from optimization
when the imaginary inputs are static zeroes than a serial architecture
would.

But, to know for sure, try running it and see what happens.
(And you want to use Synopsys or the equivalent, and probably not the
Xilinx/Altera native synthesizers...)

If faced with this design I would certainly give it a shot and
see if the synthesis is good enough, rather than assuming it isn't
and embarking on a major bit of perhaps unneeded design work.



Steve

 
rickman <gnuarm@gmail.com> wrote:

On 2/14/2017 4:56 PM, Steve Pope wrote:

rickman <gnuarm@gmail.com> wrote:

On 2/14/2017 11:39 AM, Steve Pope wrote:

Tim Wescott <tim@seemywebsite.com> wrote:

On Tue, 14 Feb 2017 06:52:32 +0000, Steve Pope wrote:

If for example you compute (a * b) and also compute (a * -b),
the synthesizer is smart enough to know there are not two full
multipliers needed.

Note: I did not say that in an HDL, -(a * b) is equal in a bit-exact
sense to (a * -b).

I'm pretty sure that -(a * b) is not necessarily (a * -b).

We agree

If the tool is going to use a
single multiplier for calculating (a * b) and (a * -b),

Okay so far

that implies it calculates (a * b) and then negates that to get
(a * -b) as -(a * b), no?

Not necessarily exactly this.

If not this, then what? Why are you being so coy?

Because one would have to look at the HDL language definition,
the declared bit widths and declared data types and possibly other stuff.

But you still wouldn't need two multipliers to compute the two
values in my example; just one multiplier plus some other logic
(much less than the gate count of a second multiplier) to take care of
the extremal cases to make the output exactly match.

Steve
 
On 2/14/2017 11:39 AM, Steve Pope wrote:
Tim Wescott <tim@seemywebsite.com> wrote:

On Tue, 14 Feb 2017 06:52:32 +0000, Steve Pope wrote:

If for example you compute (a * b) and also compute (a * -b),
the synthesizer is smart enough to know there are not two full
multipliers needed.

I would be very leery of an optimizer that felt free to optimize things
so that they are no longer bit-exact --

Of course, synthesizers need to be bit-exact and conform to the
HDL language spec

and for some combinations of
bits, I'm pretty sure that -(a * b) is not necessarily (a * -b).

That is not the example I gave, but in either example you still would
not need two multipliers, just one multiplier and some small amount of
logic; any reasonable synthesizer would not use two multipliers
worth of gates.

How is that not the example you gave? If the tool is going to use a
single multiplier for calculating (a * b) and (a * -b), that implies it
calculates (a * b) and then negates that to get (a * -b) as -(a * b), no?

I don't know that -(a * b) wouldn't be exactly the same result as (a *
-b).

--

Rick C
 
rickman <gnuarm@gmail.com> wrote:

On 2/14/2017 5:48 PM, Steve Pope wrote:

But you still wouldn't need two multipliers to compute the two
values in my example; just one multiplier plus some other logic
(much less than the gate count of a second multiplier) to take care of
the extremal cases to make the output exactly match.

It was a *long* way around the woods to get here. I seriously doubt any
logic synthesis tool is going to replace two multipliers and an adder
with a single multiplier and a bunch of other logic.

I very strongly disagree ... minimizing purely combinatorial logic
is the _sine_qua_non_ of a synthesis tool.

Lots of other things synthesizers try to do are more intricate and less
predictable, but not this.

Steve
 
Am 14.02.17 um 20:21 schrieb rickman:
I don't know that -(a * b) wouldn't be exactly the same result as (a * -b).

It is bit-exact the same. In IEEE float, the sign is signalled by a
single bit; it doesn't use two's complement or similar. Therefore, -x
simply inverts the sign bit, and it doesn't matter if you do this before
or after the multiplication. The only possible corner cases are special
values like NaN and Inf; I believe that even then, the result is
bit-exact the same, but I'm too lazy to check the standard.
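
A quick empirical check of that claim, short of reading the standard
(random data, so it exercises ordinary values rather than NaN/Inf
corner cases):

    import numpy as np

    rng = np.random.default_rng(0)
    a = rng.standard_normal(10**6)
    b = rng.standard_normal(10**6)
    # IEEE-754 negation only flips the sign bit, so the two orderings
    # should produce identical bit patterns.
    lhs = -(a * b)
    rhs = a * (-b)
    print(np.array_equal(lhs.view(np.uint64), rhs.view(np.uint64)))   # True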

Christian
 
On Tue, 14 Feb 2017 20:42:12 +0100, Christian Gollwitzer wrote:

Am 14.02.17 um 20:21 schrieb rickman:
I don't know that -(a * b) wouldn't be exactly the same result as (a *
-b).


It is bit-exact the same. In IEEE float, the sign is signalled by a
single bit; it doesn't use two's complement or similar. Therefore, -x
simply inverts the sign bit, and it doesn't matter if you do this before
or after the multiplication. The only possible corner cases are special
values like NaN and Inf; I believe that even then, the result is
bit-exact the same, but I'm too lazy to check the standard.

While the use of floats in FPGAs is getting more and more common, I
suspect that there are still loads of FFTs getting done in fixed-point.

I would need to do some very tedious searching to know for sure, but I
suspect that if there's truncation, potential rollover, or 2's complement
values of 'b1000...0 involved, then my condition may be violated.
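
That suspicion is easy to reproduce if the negation happens at the
narrow operand width before the product is widened, as Verilog-style
self-determined sizing can do; a Python model of that situation (the
8-bit width is an illustrative assumption):

    def wrap(v, bits):
        # Wrap v to a bits-wide two's-complement value, like a
        # width-limited HDL expression.
        m = 1 << bits
        v &= m - 1
        return v - m if v >= m // 2 else v

    a, b = 3, -128                 # b = 'b1000_0000: no positive 8-bit twin
    neg_b = wrap(-b, 8)            # negating at 8 bits wraps -(-128) to -128
    print(wrap(a * neg_b, 16))     # a * (-b)  -> -384
    print(wrap(-(a * b), 16))      # -(a * b)  ->  384: not bit-exact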

And I don't know how much an optimizer is going to analyze what's going
on inside of a DSP block -- if you instantiate one in your design and
feed it all zeros, is the optimizer smart enough to take it out?

--
Tim Wescott
Control systems, embedded software and circuit design
I'm looking for work! See my website if you're interested
http://www.wescottdesign.com
 
On 2/14/2017 4:56 PM, Steve Pope wrote:
rickman <gnuarm@gmail.com> wrote:

On 2/14/2017 11:39 AM, Steve Pope wrote:

Tim Wescott <tim@seemywebsite.com> wrote:

On Tue, 14 Feb 2017 06:52:32 +0000, Steve Pope wrote:

If for example you compute (a * b) and also compute (a * -b),
the synthesizer is smart enough to know there are not two full
multipliers needed.

I would be very leery of an optimizer that felt free to optimize things
so that they are no longer bit-exact --

Of course, synthesizers need to be bit-exact and conform to the
HDL language spec

and for some combinations of
bits, I'm pretty sure that -(a * b) is not necessarily (a * -b).

That is not the example I gave, but in either example you still would
not need two multipliers, just one multiplier and some small amount of
logic; any reasonable synthesizer would not use two multipliers
worth of gates.

How is that not the example you gave? If the tool is going to use a
single multiplier for calculating (a * b) and (a * -b),

Okay so far

that implies it calculates (a * b) and then negates that to get
(a * -b) as -(a * b), no?

Not necessarily exactly this.

If not this, then what? Why are you being so coy?


I don't know that -(a * b) wouldn't be exactly the same result as (a *
-b).

You just answered your own question.

No, that's a separate question. In fact, as Chris pointed out, the IEEE
floating point format is sign-magnitude. So it doesn't matter when you
flip the sign bit; the rest of the number will be the same. For two's
complement the only issue would be using the value of max negative int,
where there is no positive number with the same magnitude. But I can't
construct a case where that would be a problem for this example. Still,
if that is a problem for one combination, then there will be cases that
fail for the other combination. Typically this is avoided by not using
the max negative value regardless of the optimizations.

--

Rick C
 
On 2/14/2017 3:22 PM, Tim Wescott wrote:
On Tue, 14 Feb 2017 20:42:12 +0100, Christian Gollwitzer wrote:

Am 14.02.17 um 20:21 schrieb rickman:
I don't know that -(a * b) wouldn't be exactly the same result as (a *
-b).


It is bit-exact the same. In IEEE float, the sign is signalled by a
single bit; it doesn't use two's complement or similar. Therefore, -x
simply inverts the sign bit, and it doesn't matter if you do this before
or after the multiplication. The only possible corner cases are special
values like NaN and Inf; I believe that even then, the result is
bit-exact the same, but I'm too lazy to check the standard.

While the use of floats in FPGAs is getting more and more common, I
suspect that there are still loads of FFTs getting done in fixed-point.

I would need to do some very tedious searching to know for sure, but I
suspect that if there's truncation, potential rollover, or 2's complement
values of 'b1000...0 involved, then my condition may be violated.

And I don't know how much an optimizer is going to analyze what's going
on inside of a DSP block -- if you instantiate one in your design and
feed it all zeros, is the optimizer smart enough to take it out?

There is such a thing as "hard" IP, meaning a block of logic that is
already mapped, placed and routed. Those are rare, though, and not
subject to any optimizations. Otherwise yes, if you feed a constant into
any block of logic, synthesis will not only replace that logic with
constants on the output, it will also eliminate the registers that were
there to improve throughput, since they end up carrying only constants.
There are synthesis commands to prevent that, but otherwise the tool
does as it sees optimal.

The significant optimizations in an FFT come from recognizing all the
redundant computations which the optimizer will *not* be able to see as
they are dispersed across columns and rows or time. Logic optimizations
are much more obvious and basic. Otherwise you could just feed the tool
a DFT and let it work out the details. Heck, that might get you some
significant savings. In a DFT done all at once, the tool might actually
spot that you are multiplying a given input sample by the same
coefficient multiple times. But that's still not an FFT which is much
more subtle.

--

Rick C
 
On 2/14/2017 5:48 PM, Steve Pope wrote:
rickman <gnuarm@gmail.com> wrote:

On 2/14/2017 4:56 PM, Steve Pope wrote:

rickman <gnuarm@gmail.com> wrote:

On 2/14/2017 11:39 AM, Steve Pope wrote:

Tim Wescott <tim@seemywebsite.com> wrote:

On Tue, 14 Feb 2017 06:52:32 +0000, Steve Pope wrote:

If for example you compute (a * b) and also compute (a * -b),
the synthesizer is smart enough to know there are not two full
multipliers needed.

Note: I did not say that in an HDL, -(a * b) is equal in a bit-exact
sense to (a * -b).

I'm pretty sure that -(a * b) is not necessarily (a * -b).

We agree

If the tool is going to use a
single multiplier for calculating (a * b) and (a * -b),

Okay so far

that implies it calculates (a * b) and then negates that to get
(a * -b) as -(a * b), no?

Not necessarily exactly this.

If not this, then what? Why are you being so coy?

Because one would have to look at the HDL language definition,
the declared bit widths and declared data types and possibly other stuff.

But you still wouldn't need two multipliers to compute the two
values in my example; just one multiplier plus some other logic
(much less than the gate count of a second multiplier) to take care of
the extremal cases to make the output exactly match.

It was a *long* way around the woods to get here. I seriously doubt any
logic synthesis tool is going to replace two multipliers and an adder
with a single multiplier and a bunch of other logic. Have you seen a
tool that was that smart? If so, that is a pretty advanced tool.

BTW, what *are* the extremal[sic] cases?

--

Rick C
 
On 2/14/2017 7:35 PM, Steve Pope wrote:
rickman <gnuarm@gmail.com> wrote:

On 2/14/2017 5:48 PM, Steve Pope wrote:

But you still wouldn't need two multpliers to compute the two
values in my example; just one multiplier plus some other logic
(much less than the gate count of a second multiplier) to take care of
the extremal cases to make the output exactly match.

It was a *long* way around the woods to get here. I seriously doubt any
logic synthesis tool is going to substitute two multipliers and an adder
with a single multiplier and a bunch of other logic.

I very strongly disagree ... minimizing purely combinatorial logic
is the _sine_qua_non_ of a synthesis tool.

Lots of other things synthesizers try to do are more intricate and less
predictable, but not this.

I hear you, but I don't think the tools will be able to "see" the
simplifications as they are not strictly logic simplifications. Most of
it is trig.

Take a look at just how the FFT works. The combinations of
multiplications work out because of properties of the sine function, not
because of Boolean logic.

--

Rick C
 
