Cascaded floating-point reduction?

Saad Zafar
y1= 1.5f*y0 - x*y0*y0*y0

Note that all quantities are single-precision floating point. I can't write this equation in behavioral form for the synthesizer to optimize, because it has to be broken down and fed into FP multipliers. I have both y0 and x available in one clock cycle, ready to be plugged into this equation. Now I'm stuck on evaluating this equation in the fewest cycles: right now I have a cascaded series of FP multipliers feeding into a final FP subtractor, with each multiplication consuming one clock cycle.

What, in your opinion, is the best way to map this equation to hardware? Is there an alternative form of the equation that would be more suitable for implementation?

Regards.
 
Saad Zafar <saad1024@gmail.com> wrote:
> y1= 1.5f*y0 - x*y0*y0*y0

The usual one would be to factor out a y0, so

y1 = (1.5f - x*y0*y0)*y0;

That saves one multiplier, but maybe takes the same number of
pipeline stages.

If you factor it as 1.5f*y0 - (x*y0)*(y0*y0), then you can do it
in one less pipeline stage (x*y0 and y0*y0 proceed in parallel in
the first stage), but with no fewer multipliers.

-- glen
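
For concreteness, here is a minimal sketch of the second factoring as a three-stage pipeline. It assumes the VHDL-2008 float_pkg operators synthesize combinationally, a point debated below; with vendor FP multiplier/subtractor cores the structure would be the same, just built from component instantiations. The entity and signal names are made up for illustration.

library ieee;
use ieee.std_logic_1164.all;
use ieee.float_pkg.all;

entity nr_step is
  port (
    clk : in  std_logic;
    x   : in  float32;
    y0  : in  float32;
    y1  : out float32);
end entity nr_step;

architecture rtl of nr_step is
  signal xy, yy, t, t_d, p : float32;  -- pipeline registers
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- stage 1: the three multiplies proceed in parallel
      xy  <= x * y0;
      yy  <= y0 * y0;
      t   <= to_float(1.5) * y0;
      -- stage 2: combine partial products; delay t to stay aligned
      p   <= xy * yy;
      t_d <= t;
      -- stage 3: the final subtract
      y1  <= t_d - p;
    end if;
  end process;
end architecture rtl;

Four multipliers either way, but the multiply depth drops from three to two, so the result is ready one cycle earlier than the fully cascaded version.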
 
I think the ieee floating point library's * operator is synthesizable, but synthesis would try to build the fp multipliers out of fixed point multipliers (e.g. DSP blocks) itself, which may take more than one clock cycle.

If the above works, you could enable retiming and pipelining, use your original expression, and run the result through multiple pipeline stages. Retiming/pipelining can redistribute the operations and/or logic among the pipeline stages.

I have seen cases where synthesis tools did this automatically when assembling smaller fixed point multipliers into one larger multiplier, so long as there were pipeline register stages (clock cycles) available to spread across.

Andy
 
In article <593a4792-bb97-421a-a338-3f644de0256a@googlegroups.com>,
<jonesandy@comcast.net> wrote:
> I think the ieee floating point library's * operator is synthesizable,
> but synthesis would try to build the fp multipliers out of fixed point
> multipliers (e.g. DSP blocks) itself, which may take more than
> one clock cycle.

Not in any synthesizer I know. Floating point types aren't handled
at all, much less operations like multiplication on them.
I wouldn't expect them to do so *EVER*. Too much overhead,
and too small a customer base would need/want it.

Regards,

Mark
 
On Wednesday, August 21, 2013 6:55:15 PM UTC-5, Mark Curry wrote:
> Not in any synthesizer I know. Floating point types aren't
> handled at all, much less operations like multiplication on them.
> I wouldn't expect them to do so *EVER*. Too much overhead,
> and too small a customer base would need/want it.

Mark,

Ok, I checked our FPGA synthesis tool's documentation.

The Synplify Pro reference guide states the following in regards to the built-in "real" data type:

"When one of the following constructs in encountered, compilation continues,
but will subsequently error out if logic must be generated for the construct.

• real data types (real data expressions are supported in VHDL-2008 IEEE
float_pkg.vhd) – real data types are supported as constant declarations or
as constants used in expressions as long as no floating point logic must
be generated"

Thus, you cannot use the built-in real data type or expressions thereof to generate logic.

However, the reference guide also states the following:

"The following packages are supported in VHDL 2008:
• fixed_pkg.vhd, float_pkg.vhd, fixed_generic_pkg.vhd, float_generic_pkg.vhd,
fixed_float_types.vhd – IEEE fixed and floating point packages
....
String and text I/O functions in the above packages are not supported. These
functions include read(), write(), to_string()."

Significantly, it states no other limitations on the support for float_pkg.

The float_generic_pkg package (the generic package from which float_pkg is instantiated) defines the "*" operator for type float.

From ieee.float_generic_pkg-body.vhdl, the following indicates that the package is synthesizable:

-- This deferred constant will tell you if the package body is synthesizable
-- or implemented as real numbers, set to "true" if synthesizable.
constant fphdlsynth_or_real : BOOLEAN := true; -- deferred constant

So, while I have not tried it myself, it appears that there are at least definite plans, if not the current ability, to synthesize floating point hardware long before *EVER* gets here.

The resulting hardware may not be particularly efficient, and may not be operable in a single clock cycle at any reasonable clock rate, but that is where retiming and pipelining come in.

Andy
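
Since fphdlsynth_or_real is a deferred constant, declared in the package and given its value in the body, a design can check at elaboration which body it actually got. A small sketch of that idea:

library ieee;
use ieee.float_pkg.all;

entity fp_body_check is
end entity fp_body_check;

architecture sim of fp_body_check is
begin
  -- warn if the float_pkg body in use is the real-number
  -- model rather than the synthesizable one
  assert fphdlsynth_or_real
    report "float_pkg body is not the synthesizable version"
    severity warning;
end architecture sim;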
 
In article <3d527338-9687-41dc-b4ab-a60e7a1bba19@googlegroups.com>,
<jonesandy@comcast.net> wrote:
(snip regarding float_pkg synthesis support)

Andy,

I stand corrected. Being a Verilog user, I wasn't familiar with these
updates in VHDL-2008.

Looks like they've done it correctly. There's default support for IEEE 754
32-bit and IEEE 754 64-bit, but users can (and very likely should) use the
generic float types, specifying all the settings, including exponent width,
fraction width, rounding options, normalization options, etc. One wonders,
however, how exceptions (NaN, etc.) will be handled in synthesis.

The generic 32-bit (and worse, 64-bit) IEEE 754 floating point is rarely EVER
appropriate for FPGA (and even ASIC) designs. For both, you're almost always
designing something for a specific problem, and there aren't going to be many
valid cases where a specific wire needs all that dynamic range. For generic
processors (and DSPs), yeah, it may be appropriate.

But more controlled "floating point" like these libraries provide might be
useful. I tend to think they'll also be dangerous in the hands of
inexperienced HW designers, who will just take the defaults and go.

Thanks for the pointer.

Mark
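
As a sketch of that point: float_pkg's float type is unconstrained, so a design can declare a problem-sized format rather than taking the float32 default. The widths and names below are made up for illustration.

library ieee;
use ieee.float_pkg.all;

package my_float_types is
  -- a reduced-range format: 1 sign, 6 exponent, 11 fraction bits,
  -- instead of single precision's 8 and 23
  subtype float18 is float (6 downto -11);
end package my_float_types;

The package also exposes the operators as functions (multiply, add, and so on) with round_style and denormalize parameters, so rounding and denormal handling can be relaxed where the extra hardware isn't warranted.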
 
jonesandy@comcast.net wrote:
> On Wednesday, August 21, 2013 6:55:15 PM UTC-5, Mark Curry wrote:
>> Not in any synthesizer I know. Floating point types aren't
>> handled at all, much less operations like multiplication on them.
>> I wouldn't expect them to do so *EVER*. Too much overhead,
>> and too small a customer base would need/want it.

(snip)
"When one of the following constructs in encountered,
compilation continues, but will subsequently error out if
logic must be generated for the construct."

Most of the time, you want internal pipelining on the floating
point operations. There is nowhere to specify that with the
usual arithmetic operators, but it is easy if you reference
a module to do it.

-- glen
 
On Thursday, August 22, 2013 1:15:40 PM UTC-5, glen herrmannsfeldt wrote:
> Most of the time, you want internal pipelining on the floating point
> operations. There is nowhere to specify that with the usual arithmetic
> operators, but it is easy if you reference a module to do it.

Most of the time you will need the extra pipelining if you want to infer built-in multipliers.

This is where retiming and pipelining synthesis optimizations come in handy. If you follow up (and/or precede) the expression assignment with a few extra clock cycles of latency (pipeline register stages), the synthesis tool can distribute the HW across the extra clock cycles automatically.

Whether synthesis can do it as well as you can manually, I don't know. But if it is good enough to work, does it really need to be as good as you could have done manually? I'd rather have the maintainability of the mathematical expression, if it will work.

Andy
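
A minimal sketch of that pattern, again assuming the float_pkg operators synthesize: write the whole expression in one assignment, then pad it with empty register stages for the retiming pass to fill. The three-deep pipe and the names here are arbitrary choices for illustration.

library ieee;
use ieee.std_logic_1164.all;
use ieee.float_pkg.all;

entity nr_step_retimed is
  port (
    clk : in  std_logic;
    x   : in  float32;
    y0  : in  float32;
    y1  : out float32);
end entity nr_step_retimed;

architecture rtl of nr_step_retimed is
  type f32_vec is array (natural range <>) of float32;
  signal pipe : f32_vec(1 to 3);  -- extra latency for retiming to use
begin
  process (clk)
  begin
    if rising_edge(clk) then
      -- the whole expression in one assignment...
      pipe(1) <= to_float(1.5) * y0 - x * y0 * y0 * y0;
      -- ...followed by empty register stages the tool can
      -- retime/pipeline the multipliers into
      pipe(2) <= pipe(1);
      pipe(3) <= pipe(2);
    end if;
  end process;
  y1 <= pipe(3);
end architecture rtl;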
 
jonesandy@comcast.net wrote:

(snip regarding pipelining)

> Most of the time you will need the extra pipelining if you want
> to infer built-in multipliers.
>
> This is where retiming and pipelining synthesis optimizations
> come in handy. If you follow up (and/or precede) the expression
> assignment with a few extra clock cycles of latency
> (pipeline register stages), the synthesis tool can distribute
> the HW across the extra clock cycles automatically.

Which tools do that? That sounds pretty useful.

As I am not the OP, the things that I try to do are different.
One that I have wondered about is the ability to add extra register
stages to speed up the critical path. I work on very long, fixed point
pipelines, so usually there are, at some point, some very long routes
which limit the speed. If I could put registers in them, it could
run a lot faster.

> Whether synthesis can do it as well as you can manually,
> I don't know. But if it is good enough to work, does it
> really need to be as good as you could have done manually?
> I'd rather have the maintainability of the mathematical
> expression, if it will work.

Well, for really large problems every ns counts. For 5%
difference, maybe I wouldn't worry about it, but 20% or 30%
is worth working for.

-- glen
 
In article <kv86fb$h37$1@speranza.aioe.org>,
glen herrmannsfeldt <gah@ugcs.caltech.edu> wrote:
> jonesandy@comcast.net wrote:
>
> (snip regarding pipelining)
>
>> Most of the time you will need the extra pipelining if you want
>> to infer built-in multipliers.
>>
>> This is where retiming and pipelining synthesis optimizations
>> come in handy. If you follow up (and/or precede) the expression
>> assignment with a few extra clock cycles of latency
>> (pipeline register stages), the synthesis tool can distribute
>> the HW across the extra clock cycles automatically.
>
> Which tools do that? That sounds pretty useful.

In Xilinx XST, the switch you're looking for is:
-register_balancing yes

I now leave it on by default; it rarely makes things worse. It
seems to help: I notice in the log file that it does move flops forward
and backward through the combinational logic in an attempt to better
balance the pipeline paths. How well it does the job, I've not dug
in that deep.
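
For reference, the switch goes in the XST script (.xst options) file with the rest of the run options; the project, part, and top-level names below are placeholders:

run
-ifn design.prj
-ifmt mixed
-ofn design
-top my_top
-p xc6slx45-2-csg324
-register_balancing yes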

> As I am not the OP, the things that I try to do are different.
> One that I have wondered about is the ability to add extra register
> stages to speed up the critical path. I work on very long, fixed point
> pipelines, so usually there are, at some point, some very long routes
> which limit the speed. If I could put registers in them, it could
> run a lot faster.

Sounds just like what the tool is targeting. If you have access
to it, I'd suggest giving it a shot.

Regards,

Mark
 
Glen,

I know Synplify Pro has a retiming/pipelining option (for Xilinx and Altera targets), and I think Altera's and Xilinx's own tools do as well.

The last time I checked, straight retiming may only move logic into an adjacent clock cycle, but pipelining of functions such as multipliers or multiplexers can spread that logic over several clock cycles. I have seen examples where a large multiply (larger than a DSP block could handle) was automatically partitioned and pipelined to use multiple DSP blocks.

Since straight retiming may be limited to adjacent clock cycles, it might be best to provide additional clock cycles of latency both before and after the expression, so that two empty, adjacent clock cycles are available. Note that retiming does not need empty clock cycles to share logic across, but there does need to be positive slack in those adjacent clock cycles in order to "make room" for any retimed logic.

As far as timing or utilization is concerned, as long as I have positive slack in both, with any margin requirements met, I prefer to have the most understandable, maintainable description possible, even if a lesser description would cut either (or both) by half. This was very hard to do when I started VHDL-based FPGA design many years ago; just meeting timing and utilization was tougher in those devices and with those tools, and the "optimizer" in me was hard to re-calibrate. I now try to optimize for maintainability whenever possible.

Andy
 
