Hardware floating point?


Tim Wescott

So, just doing a brief search, it looks like Altera is touting a floating
point slice in at least one of their lines.

Is this really a thing, or are they wrapping some more familiar fixed-
point processing with IP to make it floating point?

And, anything else you know.

TIA.

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

I'm looking for work -- see my website!
 
On 1/24/2017 11:59 PM, Tim Wescott wrote:
So, just doing a brief search, it looks like Altera is touting a floating
point slice in at least one of their lines.

Is this really a thing, or are they wrapping some more familiar fixed-
point processing with IP to make it floating point?

And, anything else you know.

I'm not sure what you are asking. What do you think floating point is
exactly? The core of floating point is just fixed point arithmetic with
an extra bit (uh, rereading this I need to make clear this is the
British "bit" meaning part :) to express the exponent of a binary
multiplier. To perform addition or subtraction on floating point
numbers, the mantissas need to be normalized, meaning the bits must be
lined up so they are all of equal weight. This requires adjusting one of
the exponents so the two are equal while shifting the mantissa to match.
Then the addition can be done on the mantissas and the result adjusted
so the MSB of the mantissa is in the correct alignment.

Multiplication is actually easier in that normalization is not required,
but exponents are added and the result is adjusted for correct alignment
of the mantissa.

So the heart of a floating point operation is a fixed point ALU with
barrel shifters before and after.
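That flow (an alignment pre-shift, a fixed-point add, a normalization post-shift) can be sketched in software. The (sign, exponent, mantissa) layout, the MBITS width, and the helper names below are illustrative assumptions, and rounding and IEEE-754 special cases are deliberately omitted:

```python
import math

# Toy float format: (sign, exponent, mantissa) with an explicit
# leading 1 at bit MBITS. Sketch only: no rounding, no specials.
MBITS = 23

def fp_add(a, b):
    (sa, ea, ma), (sb, eb, mb) = a, b
    if ea < eb:                      # make `a` the larger-exponent operand
        (sa, ea, ma), (sb, eb, mb) = (sb, eb, mb), (sa, ea, ma)
    mb >>= ea - eb                   # pre-shift: align the smaller mantissa
    m = ma + mb if sa == sb else ma - mb   # fixed-point add/subtract
    s = sa
    if m < 0:
        s, m = 1 - s, -m
    if m == 0:
        return (0, 0, 0)
    e = ea                           # post-shift: put the leading 1
    while m >= 1 << (MBITS + 1):     # back at bit MBITS
        m >>= 1
        e += 1
    while m < 1 << MBITS:
        m <<= 1
        e -= 1
    return (s, e, m)

def from_float(x):
    s = 0 if x >= 0 else 1
    f, e = math.frexp(abs(x))        # |x| = f * 2**e with 0.5 <= f < 1
    return (s, e - 1, int(f * (1 << (MBITS + 1))))

def to_float(v):
    s, e, m = v
    return (-1.0 if s else 1.0) * m * 2.0 ** (e - MBITS)
```

For example, `to_float(fp_add(from_float(1.5), from_float(2.25)))` gives 3.75; the pre- and post-shifts play the role of the barrel shifters before and after the fixed-point ALU.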

--

Rick C
 
I'm not sure what you are asking. What do you think floating point is
exactly? The core of floating point is just fixed point arithmetic with
an extra bit (uh, rereading this I need to make clear this is the
British "bit" meaning part :) to express the exponent of a binary
multiplier. To perform addition or subtraction on floating point
numbers the mantissa needs to be normalized meaning the bits must be
lined up so they are all equal weight. This requires adjusting one of
the exponents so the two are equal while shifting the mantissa to match.
Then the addition can be done on the mantissa and the result adjusted
so the msb of the mantissa is in the correct alignment.

Multiplication is actually easier in that normalization is not required,
but exponents are added and the result is adjusted for correct alignment
of the mantissa.


So the heart of a floating point operation is a fixed point ALU with
barrel shifters before and after.

I think you oversimplify FP. It works a lot better with dedicated hardware.
 
The only complications are from the various error and special case
handling of the IEEE-754 format. I doubt the FPGA is implementing that,
but possibly.

Altera claims it IS IEEE-754 compliant, but it is surprisingly hard to find any more detailed facts. And we all know how FPGA marketing works, so a bit of doubt is very understandable...

The best I could find is this:
http://www.bogdan-pasca.org/resources/publications/2015_langhammer_pasca_fp_dsp_block_architecture_for_fpgas.pdf

In short: It appears that infinities and NaNs are supported, however subnormals are treated as 0 and only one rounding mode is supported...

Somewhere there is a video which shows that using the floating-point DSPs cuts the LE-usage by about 90%, so if you need floating point, I think Arria/Stratix 10 are really the best way to go...

Regards,

Thomas

www.entner-electronics.com - Home of EEBlaster and JPEG-Codec
 
On 1/25/2017 9:15 PM, thomas.entner99@gmail.com wrote:
The only complications are from the various error and special case
handling of the IEEE-754 format. I doubt the FPGA is implementing that,
but possibly.

Altera claims it IS IEEE-754 compliant, but it is surprisingly hard to find any more detailed facts. And we all know how FPGA marketing works, so a bit of doubt is very understandable...

The best I could find is this:
http://www.bogdan-pasca.org/resources/publications/2015_langhammer_pasca_fp_dsp_block_architecture_for_fpgas.pdf

In short: It appears that infinities and NaNs are supported, however subnormals are treated as 0 and only one rounding mode is supported...

Somewhere there is a video which shows that using the floating-point DSPs cuts the LE-usage by about 90%, so if you need floating point, I think Arria/Stratix 10 are really the best way to go...

That video may be for the *entire* floating point unit in the fabric.
Most FPGAs have dedicated integer multipliers which can be used for both
the multiplier and the barrel shifters in a floating point ALU. The
adders and random logic would need to be in the fabric, but will be
*much* smaller.

--

Rick C
 
On 1/25/2017 5:07 PM, Kevin Neilson wrote:
I'm not sure what you are asking. What do you think floating point is
exactly? The core of floating point is just fixed point arithmetic with
an extra bit (uh, rereading this I need to make clear this is the
British "bit" meaning part :) to express the exponent of a binary
multiplier. To perform addition or subtraction on floating point
numbers the mantissa needs to be normalized meaning the bits must be
lined up so they are all equal weight. This requires adjusting one of
the exponents so the two are equal while shifting the mantissa to match.
Then the addition can be done on the mantissa and the result adjusted
so the msb of the mantissa is in the correct alignment.

Multiplication is actually easier in that normalization is not required,
but exponents are added and the result is adjusted for correct alignment
of the mantissa.


So the heart of a floating point operation is a fixed point ALU with
barrel shifters before and after.

I think you oversimplify FP. It works a lot better with dedicated hardware.

Not sure what your point is. The principles are the same in software or
hardware. I was describing hardware I have worked on, the ST-100 from Star
Technologies. I became very intimate with its inner workings.

The only complications are from the various error and special case
handling of the IEEE-754 format. I doubt the FPGA is implementing that,
but possibly. The basics are still the same. Adds use a barrel shifter
to denormalize the mantissa so the exponents are equal, an integer adder
and a normalization barrel shifter to produce the result. Multiplies
use a multiplier for the mantissas and an adder for the exponents (with
adjustment for exponent bias) followed by a simple shifter to normalize
the result.

Both add and multiply are about the same level of complexity, as a barrel
shifter is almost as much logic as the multiplier.
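A minimal sketch of that multiply path, assuming IEEE-754 single precision's bias of 127 and 23-bit fraction (the function and its names are mine; rounding and special cases are again left out):

```python
BIAS = 127    # assumed IEEE-754 single-precision exponent bias

def fp_mul_fields(e1, m1, e2, m2, mbits=23):
    """Multiply two normalized significands (hidden 1 at bit `mbits`)
    carrying biased exponents e1 and e2. Toy sketch only."""
    # Both stored exponents include the bias, so one copy of it must
    # be subtracted after adding them to keep the result biased once.
    e = e1 + e2 - BIAS
    m = (m1 * m2) >> mbits           # fixed-point multiply of mantissas
    # A product of two values in [1, 2) lies in [1, 4), so at most one
    # normalization shift is needed: the "simple shifter" above.
    if m >= 1 << (mbits + 1):
        m >>= 1
        e += 1
    return e, m
```

Multiplying 1.5 by 1.5 (both stored as exponent 127, mantissa 0xC00000) yields exponent 128 and mantissa 0x900000, i.e. 2.25.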

Other than the special case handling of IEEE-754, what do you think I am
missing?

--

Rick C
 
On Thursday, January 26, 2017 at 1:59:09 PM UTC-5, rickman wrote:
If you look around I think you will find many uses for floating point in
the DSP market. It's not just a selling gimmick. I don't think the
many floating point DSP devices are sold because they look good in the
product's spec sheet.

Heck back in the day when DSP was done on mainframes the hot rods of
computing were all floating point. Cray-1, ST-100...

I am attempting to design a 40-bit single and 80-bit double hardware-
expressed form of an n-bit floating point "unum" (universal number)
engine, as per the design by John Gustafson:

http://ubiquity.acm.org/article.cfm?id=3001758

I intend an FPU, and 4x vector FPU for SIMD:

https://github.com/RickCHodgin/libsf/blob/master/arxoda/core/fpu/fpu.png
https://github.com/RickCHodgin/libsf/blob/master/arxoda/core/fpu_vector/fpu_vector.png

In my Arxoda CPU (design still in progress):

https://github.com/RickCHodgin/libsf/blob/master/arxoda/core/overall_design.png

Thank you,
Rick C. Hodgin
 
On Thu, 26 Jan 2017 01:10:14 -0500, rickman wrote:

On 1/25/2017 9:15 PM, thomas.entner99@gmail.com wrote:

The only complications are from the various error and special case
handling of the IEEE-754 format. I doubt the FPGA is implementing
that,
but possibly.

Altera claims it IS IEEE-754 compliant, but it is surprisingly hard to
find any more detailed facts. And we all know how FPGA marketing works,
so a bit of doubt is very understandable...

The best I could find is this:
http://www.bogdan-pasca.org/resources/
publications/2015_langhammer_pasca_fp_dsp_block_architecture_for_fpgas.pdf

In short: It appears that infinities and NaNs are supported, however
subnormals are treated as 0 and only one rounding mode is supported...

Somewhere there is a video which shows that using the floating-point
DSPs cuts the LE-usage by about 90%, so if you need floating point, I
think Arria/Stratix 10 are really the best way to go...

That video may be for the *entire* floating point unit in the fabric.
Most FPGAs have dedicated integer multipliers which can be used for both
the multiplier and the barrel shifters in a floating point ALU. The
adders and random logic would need to be in the fabric, but will be
*much* smaller.

Xilinx and Altera both support "DSP blocks" that do a multiply and add
(they say multiply and accumulate, but it's more versatile than that).

According to the above paper, Altera has paired up their DSP blocks and
added logic to each pair so that they become a floating-point arithmetic
block. Personally I think that for most "regular" DSP uses you're going
to know the range of the incoming data and will, therefore, only need
fixed-point -- but it looks like they're chasing the "FPGA as a
supercomputer" market (hence, the purchase by Intel), and for that you
need floating point just as a selling point.

--
Tim Wescott
Control systems, embedded software and circuit design
I'm looking for work! See my website if you're interested
http://www.wescottdesign.com
 
On 1/26/2017 11:19 AM, Tim Wescott wrote:
On Thu, 26 Jan 2017 01:10:14 -0500, rickman wrote:

On 1/25/2017 9:15 PM, thomas.entner99@gmail.com wrote:

The only complications are from the various error and special case
handling of the IEEE-754 format. I doubt the FPGA is implementing
that,
but possibly.

Altera claims it IS IEEE-754 compliant, but it is surprisingly hard to
find any more detailed facts. And we all know how FPGA marketing works,
so a bit of doubt is very understandable...

The best I could find is this:
http://www.bogdan-pasca.org/resources/
publications/2015_langhammer_pasca_fp_dsp_block_architecture_for_fpgas.pdf

In short: It appears that infinities and NaNs are supported, however
subnormals are treated as 0 and only one rounding mode is supported...

Somewhere there is a video which shows that using the floating-point
DSPs cuts the LE-usage by about 90%, so if you need floating point, I
think Arria/Stratix 10 are really the best way to go...

That video may be for the *entire* floating point unit in the fabric.
Most FPGAs have dedicated integer multipliers which can be used for both
the multiplier and the barrel shifters in a floating point ALU. The
adders and random logic would need to be in the fabric, but will be
*much* smaller.

Xilinx and Altera both support "DSP blocks" that do a multiply and add
(they say multiply and accumulate, but it's more versatile than that).

According to the above paper, Altera has paired up their DSP blocks and
added logic to each pair so that they become a floating-point arithmetic
block. Personally I think that for most "regular" DSP uses you're going
to know the range of the incoming data and will, therefore, only need
fixed-point -- but it looks like they're chasing the "FPGA as a
supercomputer" market (hence, the purchase by Intel), and for that you
need floating point just as a selling point.

If you look around I think you will find many uses for floating point in
the DSP market. It's not just a selling gimmick. I don't think the
many floating point DSP devices are sold because they look good in the
product's spec sheet.

Heck back in the day when DSP was done on mainframes the hot rods of
computing were all floating point. Cray-1, ST-100...

--

Rick C
 
On Thursday, January 26, 2017 at 12:59:09 PM UTC-6, rickman wrote:
On 1/26/2017 11:19 AM, Tim Wescott wrote:
On Thu, 26 Jan 2017 01:10:14 -0500, rickman wrote:

On 1/25/2017 9:15 PM, thomas.entner99@gmail.com wrote:

The only complications are from the various error and special case
handling of the IEEE-754 format. I doubt the FPGA is implementing
that,
but possibly.

Altera claims it IS IEEE-754 compliant, but it is surprisingly hard to
find any more detailed facts. And we all know how FPGA marketing works,
so a bit of doubt is very understandable...

The best I could find is this:
http://www.bogdan-pasca.org/resources/
publications/2015_langhammer_pasca_fp_dsp_block_architecture_for_fpgas.pdf

In short: It appears that infinities and NaNs are supported, however
subnormals are treated as 0 and only one rounding mode is supported...

Somewhere there is a video which shows that using the floating-point
DSPs cuts the LE-usage by about 90%, so if you need floating point, I
think Arria/Stratix 10 are really the best way to go...

That video may be for the *entire* floating point unit in the fabric.
Most FPGAs have dedicated integer multipliers which can be used for both
the multiplier and the barrel shifters in a floating point ALU. The
adders and random logic would need to be in the fabric, but will be
*much* smaller.

Xilinx and Altera both support "DSP blocks" that do a multiply and add
(they say multiply and accumulate, but it's more versatile than that).

According to the above paper, Altera has paired up their DSP blocks and
added logic to each pair so that they become a floating-point arithmetic
block. Personally I think that for most "regular" DSP uses you're going
to know the range of the incoming data and will, therefore, only need
fixed-point -- but it looks like they're chasing the "FPGA as a
supercomputer" market (hence, the purchase by Intel), and for that you
need floating point just as a selling point.

If you look around I think you will find many uses for floating point in
the DSP market. It's not just a selling gimmick. I don't think the
many floating point DSP devices are sold because they look good in the
product's spec sheet.

Heck back in the day when DSP was done on mainframes the hot rods of
computing were all floating point. Cray-1, ST-100...

--

Rick C

]> If you look around I think you will find many uses for floating point in
]> the DSP market.

There was a rule of thumb in voice compression that floating point DSP took a third fewer operations than fixed point DSP. Plus probably faster code development not having to keep track of the scaling.

Jim Brakefield
 
I think you oversimplify FP. It works a lot better with dedicated hardware.

Not sure what your point is. The principles are the same in software or
hardware. I was describing hardware I have worked on. ST-100 from Star
Technologies. I became very intimate with the inner workings.

The only complications are from the various error and special case
handling of the IEEE-754 format. I doubt the FPGA is implementing that,
but possibly. The basics are still the same. Adds use a barrel shifter
to denormalize the mantissa so the exponents are equal, an integer adder
and a normalization barrel shifter to produce the result. Multiplies
use a multiplier for the mantissas and an adder for the exponents (with
adjustment for exponent bias) followed by a simple shifter to normalize
the result.

Both add and multiply are about the same level of complexity as a barrel
shifter is almost as much logic as the multiplier.

Other than the special case handling of IEEE-754, what do you think I am
missing?

--

Rick C

It just all works better with dedicated hardware. Finding the leading one for normalization is somewhat slow in the FPGA and is something that benefits from dedicated hardware. Using a DSP48 (if we're talking about Xilinx) for a barrel shifter is fairly fast, but requires 3 cycles of latency, can only shift up to 18 bits, and is overkill for the task. You're using a full multiplier as a shifter; a dedicated shifter would be smaller and faster. All this stuff adds latency. When I pull up CoreGen and ask for the basic FP adder, I get something that uses only 2 DSP48s but has 12 cycles of latency. And there is a lot of fabric routing so timing is not very deterministic.
 
On 1/26/2017 9:38 PM, Kevin Neilson wrote:
I think you oversimplify FP. It works a lot better with
dedicated hardware.

Not sure what your point is. The principles are the same in
software or hardware. I was describing hardware I have worked on.
ST-100 from Star Technologies. I became very intimate with the
inner workings.

The only complications are from the various error and special case
handling of the IEEE-754 format. I doubt the FPGA is implementing
that, but possibly. The basics are still the same. Adds use a
barrel shifter to denormalize the mantissa so the exponents are
equal, an integer adder and a normalization barrel shifter to
produce the result. Multiplies use a multiplier for the mantissas
and an adder for the exponents (with adjustment for exponent bias)
followed by a simple shifter to normalize the result.

Both add and multiply are about the same level of complexity as a
barrel shifter is almost as much logic as the multiplier.

Other than the special case handling of IEEE-754, what do you think
I am missing?

--

Rick C

It just all works better with dedicated hardware. Finding the
leading one for normalization is somewhat slow in the FPGA and is
something that benefits from dedicated hardware. Using a DSP48 (if
we're talking about Xilinx) for a barrel shifter is fairly fast, but
requires 3 cycles of latency, can only shift up to 18 bits, and is
overkill for the task. You're using a full multiplier as a shifter;
a dedicated shifter would be smaller and faster. All this stuff adds
latency. When I pull up CoreGen and ask for the basic FP adder, I
get something that uses only 2 DSP48s but has 12 cycles of latency.
And there is a lot of fabric routing so timing is not very
deterministic.

I'm not sure how much you know about multipliers and shifters.
Multipliers are not magical. Multiplexers *are* big. A multiplier has
N stages with a one bit adder at every bit position. A barrel
multiplexer has nearly as many bit positions (you typically don't need
all the possible outputs), but uses a bit less logic at each position.
Each bit position still needs a full 4 input LUT. Not tons of
difference in complexity.

The multipliers I've seen have selectable latency down to 1 clock.
Rolling a barrel shifter will generate many layers of logic that will
need to be pipelined as well to reach high speeds, likely many more
layers for the same speeds.

What do you get if you design a floating point adder in the fabric? I
can only imagine it will be *much* larger and slower.

--

Rick C
 
On 27/01/17 05:39, rickman wrote:
On 1/26/2017 9:38 PM, Kevin Neilson wrote:

I think you oversimplify FP. It works a lot better with
dedicated hardware.

Not sure what your point is. The principles are the same in
software or hardware. I was describing hardware I have worked on.
ST-100 from Star Technologies. I became very intimate with the
inner workings.

The only complications are from the various error and special case
handling of the IEEE-754 format. I doubt the FPGA is implementing
that, but possibly. The basics are still the same. Adds use a
barrel shifter to denormalize the mantissa so the exponents are
equal, an integer adder and a normalization barrel shifter to
produce the result. Multiplies use a multiplier for the mantissas
and an adder for the exponents (with adjustment for exponent bias)
followed by a simple shifter to normalize the result.

Both add and multiply are about the same level of complexity as a
barrel shifter is almost as much logic as the multiplier.

Other than the special case handling of IEEE-754, what do you think
I am missing?

--

Rick C

It just all works better with dedicated hardware. Finding the
leading one for normalization is somewhat slow in the FPGA and is
something that benefits from dedicated hardware. Using a DSP48 (if
we're talking about Xilinx) for a barrel shifter is fairly fast, but
requires 3 cycles of latency, can only shift up to 18 bits, and is
overkill for the task. You're using a full multiplier as a shifter;
a dedicated shifter would be smaller and faster. All this stuff adds
latency. When I pull up CoreGen and ask for the basic FP adder, I
get something that uses only 2 DSP48s but has 12 cycles of latency.
And there is a lot of fabric routing so timing is not very
deterministic.

I'm not sure how much you know about multipliers and shifters.
Multipliers are not magical. Multiplexers *are* big. A multiplier has
N stages with a one bit adder at every bit position. A barrel
multiplexer has nearly as many bit positions (you typically don't need
all the possible outputs), but uses a bit less logic at each position.
Each bit position still needs a full 4 input LUT. Not tons of
difference in complexity.

A 32-bit barrel shifter can be made with 5 steps, each step being a set
of 32 two-input multiplexers. Dedicated hardware for that will be
/much/ smaller and more efficient than using LUTs or a full multiplier.

Normalisation of FP results also requires a "find first 1" operation.
Again, dedicated hardware is going to be a lot smaller and more
efficient than using LUTs.

So a DSP block that has dedicated FP support is going to be smaller and
faster than using integer DSP blocks with LUTs to do the same job.
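In software form, the two helpers just described (a log2(n)-stage shifter and the "find first 1" step) can be sketched roughly as follows; the 32-bit width and the function names are my own choices:

```python
WIDTH = 32

def barrel_shift_right(x, amount):
    """Right shift by 0..31 via 5 conditional stages of 1, 2, 4, 8, 16;
    each stage corresponds to one row of 2-input muxes in hardware."""
    for stage in range(5):               # log2(32) stages
        if (amount >> stage) & 1:
            x >>= 1 << stage
    return x & ((1 << WIDTH) - 1)

def find_first_one(x):
    """Bit position of the leading 1 (the normalization distance),
    or -1 for zero, halving the search window at each step."""
    if x == 0:
        return -1
    pos = 0
    for step in (16, 8, 4, 2, 1):
        if x >> step:                    # a 1 exists above bit `step`
            x >>= step
            pos += step
    return pos
```

For example, `barrel_shift_right(0xFF00, 8)` gives `0xFF`, and `find_first_one(0x80000000)` gives 31. Each conditional level maps onto one layer of multiplexers, which is where the five-layer depth (and its combinational delay) comes from.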

The multipliers I've seen have selectable latency down to 1 clock.
Rolling a barrel shifter will generate many layers of logic that will
need to be pipelined as well to reach high speeds, likely many more
layers for the same speeds.

What do you get if you design a floating point adder in the fabric? I
can only imagine it will be *much* larger and slower.
 
On Friday, January 27, 2017 at 3:17:21 AM UTC-5, David Brown wrote:
On 27/01/17 05:39, rickman wrote:
On 1/26/2017 9:38 PM, Kevin Neilson wrote:

I think you oversimplify FP. It works a lot better with
dedicated hardware.

Not sure what your point is. The principles are the same in
software or hardware. I was describing hardware I have worked on.
ST-100 from Star Technologies. I became very intimate with the
inner workings.

The only complications are from the various error and special case
handling of the IEEE-754 format. I doubt the FPGA is implementing
that, but possibly. The basics are still the same. Adds use a
barrel shifter to denormalize the mantissa so the exponents are
equal, an integer adder and a normalization barrel shifter to
produce the result. Multiplies use a multiplier for the mantissas
and an adder for the exponents (with adjustment for exponent bias)
followed by a simple shifter to normalize the result.

Both add and multiply are about the same level of complexity as a
barrel shifter is almost as much logic as the multiplier.

Other than the special case handling of IEEE-754, what do you think
I am missing?

--

Rick C

It just all works better with dedicated hardware. Finding the
leading one for normalization is somewhat slow in the FPGA and is
something that benefits from dedicated hardware. Using a DSP48 (if
we're talking about Xilinx) for a barrel shifter is fairly fast, but
requires 3 cycles of latency, can only shift up to 18 bits, and is
overkill for the task. You're using a full multiplier as a shifter;
a dedicated shifter would be smaller and faster. All this stuff adds
latency. When I pull up CoreGen and ask for the basic FP adder, I
get something that uses only 2 DSP48s but has 12 cycles of latency.
And there is a lot of fabric routing so timing is not very
deterministic.

I'm not sure how much you know about multipliers and shifters.
Multipliers are not magical. Multiplexers *are* big. A multiplier has
N stages with a one bit adder at every bit position. A barrel
multiplexer has nearly as many bit positions (you typically don't need
all the possible outputs), but uses a bit less logic at each position.
Each bit position still needs a full 4 input LUT. Not tons of
difference in complexity.


A 32-bit barrel shifter can be made with 5 steps, each step being a set
of 32 two-input multiplexers. Dedicated hardware for that will be
/much/ smaller and more efficient than using LUTs or a full multiplier.

Normalisation of FP results also requires a "find first 1" operation.
Again, dedicated hardware is going to be a lot smaller and more
efficient than using LUT's.

So a DSP block that has dedicated FP support is going to be smaller and
faster than using integer DSP blocks with LUT's to do the same job.

The multipliers I've seen have selectable latency down to 1 clock.
Rolling a barrel shifter will generate many layers of logic that will
need to be pipelined as well to reach high speeds, likely many more
layers for the same speeds.

What do you get if you design a floating point adder in the fabric? I
can only imagine it will be *much* larger and slower.

If I understand correctly, you can do a barrel shifter with log2(n) complexity, hence your 5 steps, but you will have the combinational delays of 5 muxes, which could limit your maximum clock frequency. A brute-force approach will use more resources but will probably allow a higher clock frequency.
 
On Friday, January 27, 2017 at 2:00:10 PM UTC-5, rickman wrote:
On 1/27/2017 10:12 AM, Benjamin Couillard wrote:
On Friday, January 27, 2017 at 3:17:21 AM UTC-5, David Brown wrote:
On 27/01/17 05:39, rickman wrote:
On 1/26/2017 9:38 PM, Kevin Neilson wrote:

I think you oversimplify FP. It works a lot better with
dedicated hardware.

Not sure what your point is. The principles are the same in
software or hardware. I was describing hardware I have worked on.
ST-100 from Star Technologies. I became very intimate with the
inner workings.

The only complications are from the various error and special case
handling of the IEEE-754 format. I doubt the FPGA is implementing
that, but possibly. The basics are still the same. Adds use a
barrel shifter to denormalize the mantissa so the exponents are
equal, a integer adder and a normalization barrel shifter to
produce the result. Multiplies use a multiplier for the mantissas
and an adder for the exponents (with adjustment for exponent bias)
followed by a simple shifter to normalize the result.

Both add and multiply are about the same level of complexity as a
barrel shifter is almost as much logic as the multiplier.

Other than the special case handling of IEEE-754, what do you think
I am missing?

--

Rick C

It just all works better with dedicated hardware. Finding the
leading one for normalization is somewhat slow in the FPGA and is
something that benefits from dedicated hardware. Using a DSP48 (if
we're talking about Xilinx) for a barrel shifter is fairly fast, but
requires 3 cycles of latency, can only shift up to 18 bits, and is
overkill for the task. You're using a full multiplier as a shifter;
a dedicated shifter would be smaller and faster. All this stuff adds
latency. When I pull up CoreGen and ask for the basic FP adder, I
get something that uses only 2 DSP48s but has 12 cycles of latency.
And there is a lot of fabric routing so timing is not very
deterministic.

I'm not sure how much you know about multipliers and shifters.
Multipliers are not magical. Multiplexers *are* big. A multiplier has
N stages with a one bit adder at every bit position. A barrel
multiplexer has nearly as many bit positions (you typically don't need
all the possible outputs), but uses a bit less logic at each position.
Each bit position still needs a full 4 input LUT. Not tons of
difference in complexity.


A 32-bit barrel shifter can be made with 5 steps, each step being a set
of 32 two-input multiplexers. Dedicated hardware for that will be
/much/ smaller and more efficient than using LUTs or a full multiplier.

Normalisation of FP results also requires a "find first 1" operation.
Again, dedicated hardware is going to be a lot smaller and more
efficient than using LUT's.

So a DSP block that has dedicated FP support is going to be smaller and
faster than using integer DSP blocks with LUT's to do the same job.

The multipliers I've seen have selectable latency down to 1 clock.
Rolling a barrel shifter will generate many layers of logic that will
need to be pipelined as well to reach high speeds, likely many more
layers for the same speeds.

What do you get if you design a floating point adder in the fabric? I
can only imagine it will be *much* larger and slower.


If I understand correctly, you can do a barrel shifter with log2(n) complexity, hence your 5 steps, but you will have the combinational delays of 5 muxes, which could limit your maximum clock frequency. A brute-force approach will use more resources but will probably allow a higher clock frequency.

Technically N log(N).

--

Rick C

Yep true, thanks for the clarification
 
On 27/01/17 16:12, Benjamin Couillard wrote:
On Friday, January 27, 2017 at 3:17:21 AM UTC-5, David Brown wrote:
On 27/01/17 05:39, rickman wrote:
On 1/26/2017 9:38 PM, Kevin Neilson wrote:

I think you oversimplify FP. It works a lot better with
dedicated hardware.

Not sure what your point is. The principles are the same in
software or hardware. I was describing hardware I have
worked on. ST-100 from Star Technologies. I became very
intimate with the inner workings.

The only complications are from the various error and special
case handling of the IEEE-754 format. I doubt the FPGA is
implementing that, but possibly. The basics are still the
same. Adds use a barrel shifter to denormalize the mantissa
so the exponents are equal, an integer adder and a
normalization barrel shifter to produce the result.
Multiplies use a multiplier for the mantissas and an adder
for the exponents (with adjustment for exponent bias)
followed by a simple shifter to normalize the result.

Both add and multiply are about the same level of complexity
as a barrel shifter is almost as much logic as the
multiplier.

Other than the special case handling of IEEE-754, what do you
think I am missing?

--

Rick C

It just all works better with dedicated hardware. Finding the
leading one for normalization is somewhat slow in the FPGA and
is something that benefits from dedicated hardware. Using a
DSP48 (if we're talking about Xilinx) for a barrel shifter is
fairly fast, but requires 3 cycles of latency, can only shift
up to 18 bits, and is overkill for the task. You're using a
full multiplier as a shifter; a dedicated shifter would be
smaller and faster. All this stuff adds latency. When I pull
up CoreGen and ask for the basic FP adder, I get something that
uses only 2 DSP48s but has 12 cycles of latency. And there is a
lot of fabric routing so timing is not very deterministic.

I'm not sure how much you know about multipliers and shifters.
Multipliers are not magical. Multiplexers *are* big. A
multiplier has N stages with a one bit adder at every bit
position. A barrel multiplexer has nearly as many bit positions
(you typically don't need all the possible outputs), but uses a
bit less logic at each position. Each bit position still needs a
full 4 input LUT. Not tons of difference in complexity.


A 32-bit barrel shifter can be made with 5 steps, each step being a
set of 32 two-input multiplexers. Dedicated hardware for that will
be /much/ smaller and more efficient than using LUTs or a full
multiplier.

Normalisation of FP results also requires a "find first 1"
operation. Again, dedicated hardware is going to be a lot smaller
and more efficient than using LUT's.

So a DSP block that has dedicated FP support is going to be smaller
and faster than using integer DSP blocks with LUT's to do the same
job.

The multipliers I've seen have selectable latency down to 1
clock. Rolling a barrel shifter will generate many layers of
logic that will need to be pipelined as well to reach high
speeds, likely many more layers for the same speeds.

What do you get if you design a floating point adder in the
fabric? I can only imagine it will be *much* larger and slower.


If I understand correctly, you can do a barrel shifter with log2(n)
complexity, hence your 5 steps, but you will have the combinational
delays of 5 muxes, which could limit your maximum clock frequency. A
brute-force approach will use more resources but will probably allow a
higher clock frequency.

The "brute force" method would be 1 layer of 32 32-input multiplexers.
And how do you implement a 32-input multiplexer in gates? You basically
have 5 layers of 2-input multiplexers.

If the depth of the multiplexer is high enough, you might use tri-state
gates but I suspect that in this case you'd implement it with normal logic.
 
On 1/27/2017 3:17 AM, David Brown wrote:
On 27/01/17 05:39, rickman wrote:
On 1/26/2017 9:38 PM, Kevin Neilson wrote:

I think you oversimplify FP. It works a lot better with
dedicated hardware.

Not sure what your point is. The principles are the same in
software or hardware. I was describing hardware I have worked on.
ST-100 from Star Technologies. I became very intimate with the
inner workings.

The only complications are from the various error and special case
handling of the IEEE-754 format. I doubt the FPGA is implementing
that, but possibly. The basics are still the same. Adds use a
barrel shifter to denormalize the mantissa so the exponents are
equal, an integer adder and a normalization barrel shifter to
produce the result. Multiplies use a multiplier for the mantissas
and an adder for the exponents (with adjustment for exponent bias)
followed by a simple shifter to normalize the result.
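Those add and multiply paths can be modelled in a few lines (a sketch only: unsigned mantissas with value mantissa * 2**(exp - 23), no sign handling, rounding, or IEEE-754 special cases; all names here are made up for illustration):

```python
WIDTH = 24  # mantissa bits; a normalized mantissa lies in [2**23, 2**24)

def fp_add(exp_a, man_a, exp_b, man_b):
    """Add path: align with a shift, integer-add, renormalize."""
    if exp_b > exp_a:                        # ensure A has the larger exponent
        exp_a, man_a, exp_b, man_b = exp_b, man_b, exp_a, man_a
    man_b >>= (exp_a - exp_b)                # denormalize: line up bit weights
    man, exp = man_a + man_b, exp_a          # plain integer addition
    while man >= 1 << WIDTH:                 # renormalize: msb back into place
        man >>= 1
        exp += 1
    while man and man < 1 << (WIDTH - 1):
        man <<= 1
        exp -= 1
    return exp, man

def fp_mul(exp_a, man_a, exp_b, man_b):
    """Multiply path: integer multiply, add exponents, small fixup shift."""
    man = man_a * man_b                      # 2*WIDTH or 2*WIDTH-1 bit product
    exp = exp_a + exp_b
    if man >= 1 << (2 * WIDTH - 1):          # one-position normalization
        return exp + 1, man >> WIDTH
    return exp, man >> (WIDTH - 1)
```

Note that the add path needs the two variable barrel shifts (alignment and normalization), while the multiply path only ever shifts by zero or one position, which is why multiplication is the easier case.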

Both add and multiply are about the same level of complexity as a
barrel shifter is almost as much logic as the multiplier.

Other than the special case handling of IEEE-754, what do you think
I am missing?

--

Rick C

It just all works better with dedicated hardware. Finding the
leading one for normalization is somewhat slow in the FPGA and is
something that benefits from dedicated hardware. Using a DSP48 (if
we're talking about Xilinx) for a barrel shifter is fairly fast, but
requires 3 cycles of latency, can only shift up to 18 bits, and is
overkill for the task. You're using a full multiplier as a shifter;
a dedicated shifter would be smaller and faster. All this stuff adds
latency. When I pull up CoreGen and ask for the basic FP adder, I
get something that uses only 2 DSP48s but has 12 cycles of latency.
And there is a lot of fabric routing so timing is not very
deterministic.

I'm not sure how much you know about multipliers and shifters.
Multipliers are not magical. Multiplexers *are* big. A multiplier has
N stages with a one bit adder at every bit position. A barrel
multiplexer has nearly as many bit positions (you typically don't need
all the possible outputs), but uses a bit less logic at each position.
Each bit position still needs a full 4 input LUT. Not tons of
difference in complexity.


A 32-bit barrel shifter can be made with 5 steps, each step being a set
of 32 two-input multiplexers. Dedicated hardware for that will be
/much/ smaller and more efficient than using LUTs or a full multiplier.

Yes, I stand corrected. Still, it is hardly a "waste" of multipliers to
use them for multiplexers.


Normalisation of FP results also requires a "find first 1" operation.
Again, dedicated hardware is going to be a lot smaller and more
efficient than using LUT's.

Find first 1 can be done using a carry chain, which is quite fast. It is
the same function as used in Gray code operations.
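One carry-chain trick of that kind (an illustration of the idea, not necessarily the exact circuit meant above): in two's complement, x & -x isolates the lowest set bit, because negating x propagates a carry exactly up to that bit. Feeding in the bit-reversed word turns it into a leading-one finder:

```python
def lowest_set_bit(x: int) -> int:
    """Isolate the least significant 1 using the adder's carry chain."""
    return x & -x                            # e.g. 0b101100 -> 0b000100

def leading_one_position(x: int, width: int = 32) -> int:
    """Index of the most significant set bit, via bit-reversal + carry trick."""
    if x == 0:
        return -1                            # no leading one to find
    rev = int(format(x, f"0{width}b")[::-1], 2)  # bit-reverse the word
    return width - (rev & -rev).bit_length()
```

The leading-one index is exactly the normalization shift amount an FP adder needs after the mantissa addition.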


So a DSP block that has dedicated FP support is going to be smaller and
faster than using integer DSP blocks with LUT's to do the same job.

Who said it wouldn't be? I say exactly that below. My point was just
that floating point isn't too hard to wrap your head around and not so
horribly different from fixed point. You just need to stick a few
functions onto a fixed point multiplier/adder.

I was responding to:

"Is this really a thing, or are they wrapping some more familiar fixed-
point processing with IP to make it floating point?"

The difference between fixed and floating point operations requires a few
functions beyond the basic integer operations which we have been
discussing. Floating point is not magic or incredibly hard to do. It
has not been included on FPGAs up until now because the primary market
is integer based.

Some 15 years ago I discussed the need for hard IP in FPGAs and was told
by certain Xilinx employees that it isn't practical to include hard IP
because of the proliferation of combinations and wasted resources that
result. The trouble is the ratio of silicon area required for hard IP
vs. FPGA fabric gets worse with each larger generation. So as we see
now FPGAs are including all manner of function blocks... like other
devices.

What I don't get is why FPGAs are so special that they are the last
holdouts in becoming system-on-chip devices.


The multipliers I've seen have selectable latency down to 1 clock.
Rolling a barrel shifter will generate many layers of logic that will
need to be pipelined as well to reach high speeds, likely many more
layers for the same speeds.

What do you get if you design a floating point adder in the fabric? I
can only imagine it will be *much* larger and slower.

--

Rick C
 
On 1/27/2017 10:12 AM, Benjamin Couillard wrote:
On Friday, January 27, 2017 at 03:17:21 UTC-5, David Brown wrote:
On 27/01/17 05:39, rickman wrote:
On 1/26/2017 9:38 PM, Kevin Neilson wrote:

I think you oversimplify FP. It works a lot better with
dedicated hardware.

Not sure what your point is. The principles are the same in
software or hardware. I was describing hardware I have worked on.
ST-100 from Star Technologies. I became very intimate with the
inner workings.

The only complications are from the various error and special case
handling of the IEEE-754 format. I doubt the FPGA is implementing
that, but possibly. The basics are still the same. Adds use a
barrel shifter to denormalize the mantissa so the exponents are
equal, an integer adder and a normalization barrel shifter to
produce the result. Multiplies use a multiplier for the mantissas
and an adder for the exponents (with adjustment for exponent bias)
followed by a simple shifter to normalize the result.

Both add and multiply are about the same level of complexity as a
barrel shifter is almost as much logic as the multiplier.

Other than the special case handling of IEEE-754, what do you think
I am missing?

--

Rick C

It just all works better with dedicated hardware. Finding the
leading one for normalization is somewhat slow in the FPGA and is
something that benefits from dedicated hardware. Using a DSP48 (if
we're talking about Xilinx) for a barrel shifter is fairly fast, but
requires 3 cycles of latency, can only shift up to 18 bits, and is
overkill for the task. You're using a full multiplier as a shifter;
a dedicated shifter would be smaller and faster. All this stuff adds
latency. When I pull up CoreGen and ask for the basic FP adder, I
get something that uses only 2 DSP48s but has 12 cycles of latency.
And there is a lot of fabric routing so timing is not very
deterministic.

I'm not sure how much you know about multipliers and shifters.
Multipliers are not magical. Multiplexers *are* big. A multiplier has
N stages with a one bit adder at every bit position. A barrel
multiplexer has nearly as many bit positions (you typically don't need
all the possible outputs), but uses a bit less logic at each position.
Each bit position still needs a full 4 input LUT. Not tons of
difference in complexity.


A 32-bit barrel shifter can be made with 5 steps, each step being a set
of 32 two-input multiplexers. Dedicated hardware for that will be
/much/ smaller and more efficient than using LUTs or a full multiplier.

Normalisation of FP results also requires a "find first 1" operation.
Again, dedicated hardware is going to be a lot smaller and more
efficient than using LUT's.

So a DSP block that has dedicated FP support is going to be smaller and
faster than using integer DSP blocks with LUT's to do the same job.

The multipliers I've seen have selectable latency down to 1 clock.
Rolling a barrel shifter will generate many layers of logic that will
need to be pipelined as well to reach high speeds, likely many more
layers for the same speeds.

What do you get if you design a floating point adder in the fabric? I
can only imagine it will be *much* larger and slower.


If I understand correctly, you can do a barrel shifter with log2(n) complexity, hence your 5 steps, but you will have the combinational delays of 5 muxes, which could limit your maximum clock frequency. A brute-force approach will use more resources but will probably allow a higher clock frequency.

Technically N log(N).

--

Rick C
 
On 1/27/2017 11:33 AM, David Brown wrote:
On 27/01/17 16:12, Benjamin Couillard wrote:
On Friday, January 27, 2017 at 03:17:21 UTC-5, David Brown wrote:
On 27/01/17 05:39, rickman wrote:
On 1/26/2017 9:38 PM, Kevin Neilson wrote:

I think you oversimplify FP. It works a lot better with
dedicated hardware.

Not sure what your point is. The principles are the same in
software or hardware. I was describing hardware I have
worked on. ST-100 from Star Technologies. I became very
intimate with the inner workings.

The only complications are from the various error and special
case handling of the IEEE-754 format. I doubt the FPGA is
implementing that, but possibly. The basics are still the
same. Adds use a barrel shifter to denormalize the mantissa
so the exponents are equal, an integer adder and a
normalization barrel shifter to produce the result.
Multiplies use a multiplier for the mantissas and an adder
for the exponents (with adjustment for exponent bias)
followed by a simple shifter to normalize the result.

Both add and multiply are about the same level of complexity
as a barrel shifter is almost as much logic as the
multiplier.

Other than the special case handling of IEEE-754, what do you
think I am missing?

--

Rick C

It just all works better with dedicated hardware. Finding the
leading one for normalization is somewhat slow in the FPGA and
is something that benefits from dedicated hardware. Using a
DSP48 (if we're talking about Xilinx) for a barrel shifter is
fairly fast, but requires 3 cycles of latency, can only shift
up to 18 bits, and is overkill for the task. You're using a
full multiplier as a shifter; a dedicated shifter would be
smaller and faster. All this stuff adds latency. When I pull
up CoreGen and ask for the basic FP adder, I get something that
uses only 2 DSP48s but has 12 cycles of latency. And there is a
lot of fabric routing so timing is not very deterministic.

I'm not sure how much you know about multipliers and shifters.
Multipliers are not magical. Multiplexers *are* big. A
multiplier has N stages with a one bit adder at every bit
position. A barrel multiplexer has nearly as many bit positions
(you typically don't need all the possible outputs), but uses a
bit less logic at each position. Each bit position still needs a
full 4 input LUT. Not tons of difference in complexity.


A 32-bit barrel shifter can be made with 5 steps, each step being a
set of 32 two-input multiplexers. Dedicated hardware for that will
be /much/ smaller and more efficient than using LUTs or a full
multiplier.

Normalisation of FP results also requires a "find first 1"
operation. Again, dedicated hardware is going to be a lot smaller
and more efficient than using LUT's.

So a DSP block that has dedicated FP support is going to be smaller
and faster than using integer DSP blocks with LUT's to do the same
job.

The multipliers I've seen have selectable latency down to 1
clock. Rolling a barrel shifter will generate many layers of
logic that will need to be pipelined as well to reach high
speeds, likely many more layers for the same speeds.

What do you get if you design a floating point adder in the
fabric? I can only imagine it will be *much* larger and slower.


If I understand correctly, you can do a barrel shifter with log2(n)
complexity, hence your 5 steps, but you will have the combinational
delays of 5 muxes, which could limit your maximum clock frequency. A
brute-force approach will use more resources but will probably allow a
higher clock frequency.


The "brute force" method would be 1 layer of 32 32-input multiplexers.
And how do you implement a 32-input multiplexer in gates? You basically
have 5 layers of 2-input multiplexers.

If the depth of the multiplexer is high enough, you might use tri-state
gates but I suspect that in this case you'd implement it with normal logic.

A barrel shifter is simpler than that. Much as terms are shared between
the butterfly stages of an FFT, the intermediate terms in a barrel
shifter can be shared between stages. (pseudo vhdl)


function barrel_shl (indata : unsigned(31 downto 0);
                     sel    : unsigned(4 downto 0))
    return unsigned is
    variable a, b, c, d, e : unsigned(31 downto 0);
begin
    -- each stage shifts by a power of two when its select bit is set,
    -- otherwise it passes the previous stage's result through
    a := indata(30 downto 0) & '0'     when sel(0) = '1' else indata;
    b := a(29 downto 0) & "00"         when sel(1) = '1' else a;
    c := b(27 downto 0) & "0000"       when sel(2) = '1' else b;
    d := c(23 downto 0) & x"00"        when sel(3) = '1' else c;
    e := d(15 downto 0) & x"0000"      when sel(4) = '1' else d;

    return e;
end;

--

Rick C
 
On 27/01/17 19:59, rickman wrote:
On 1/27/2017 3:17 AM, David Brown wrote:
On 27/01/17 05:39, rickman wrote:
On 1/26/2017 9:38 PM, Kevin Neilson wrote:

I think you oversimplify FP. It works a lot better with
dedicated hardware.

Not sure what your point is. The principles are the same in
software or hardware. I was describing hardware I have worked on.
ST-100 from Star Technologies. I became very intimate with the
inner workings.

The only complications are from the various error and special case
handling of the IEEE-754 format. I doubt the FPGA is implementing
that, but possibly. The basics are still the same. Adds use a
barrel shifter to denormalize the mantissa so the exponents are
equal, an integer adder and a normalization barrel shifter to
produce the result. Multiplies use a multiplier for the mantissas
and an adder for the exponents (with adjustment for exponent bias)
followed by a simple shifter to normalize the result.

Both add and multiply are about the same level of complexity as a
barrel shifter is almost as much logic as the multiplier.

Other than the special case handling of IEEE-754, what do you think
I am missing?

--

Rick C

It just all works better with dedicated hardware. Finding the
leading one for normalization is somewhat slow in the FPGA and is
something that benefits from dedicated hardware. Using a DSP48 (if
we're talking about Xilinx) for a barrel shifter is fairly fast, but
requires 3 cycles of latency, can only shift up to 18 bits, and is
overkill for the task. You're using a full multiplier as a shifter;
a dedicated shifter would be smaller and faster. All this stuff adds
latency. When I pull up CoreGen and ask for the basic FP adder, I
get something that uses only 2 DSP48s but has 12 cycles of latency.
And there is a lot of fabric routing so timing is not very
deterministic.

I'm not sure how much you know about multipliers and shifters.
Multipliers are not magical. Multiplexers *are* big. A multiplier has
N stages with a one bit adder at every bit position. A barrel
multiplexer has nearly as many bit positions (you typically don't need
all the possible outputs), but uses a bit less logic at each position.
Each bit position still needs a full 4 input LUT. Not tons of
difference in complexity.


A 32-bit barrel shifter can be made with 5 steps, each step being a set
of 32 two-input multiplexers. Dedicated hardware for that will be
/much/ smaller and more efficient than using LUTs or a full multiplier.

Yes, I stand corrected. Still, it is hardly a "waste" of multipliers to
use them for multiplexers.

Well, if the multipliers are already there and you don't have
alternative dedicated hardware, then I agree you are not wasting the
multipliers in using them for a shifter.

Normalisation of FP results also requires a "find first 1" operation.
Again, dedicated hardware is going to be a lot smaller and more
efficient than using LUT's.

Find first 1 can be done using a carry chain, which is quite fast. It is
the same function as used in Gray code operations.

It is not something I have looked into, but I'll happily take your word
for it. However, like pretty much /any/ function, it will be smaller
and faster in dedicated hardware than in logic blocks.

So a DSP block that has dedicated FP support is going to be smaller and
faster than using integer DSP blocks with LUT's to do the same job.

Who said it wouldn't be? I say exactly that below. My point was just
that floating point isn't too hard to wrap your head around and not so
horribly different from fixed point. You just need to stick a few
functions onto a fixed point multiplier/adder.

Fair enough.

I was responding to:

"Is this really a thing, or are they wrapping some more familiar fixed-
point processing with IP to make it floating point?"

The difference between fixed and floating point operations requires a few
functions beyond the basic integer operations which we have been
discussing. Floating point is not magic or incredibly hard to do. It
has not been included on FPGAs up until now because the primary market
is integer based.

Okay.

Some 15 years ago I discussed the need for hard IP in FPGAs and was told
by certain Xilinx employees that it isn't practical to include hard IP
because of the proliferation of combinations and wasted resources that
result. The trouble is the ratio of silicon area required for hard IP
vs. FPGA fabric gets worse with each larger generation. So as we see
now FPGAs are including all manner of function blocks... like other
devices.

What I don't get is why FPGAs are so special that they are the last
holdouts in becoming system-on-chip devices.

I think this has come up before in this newsgroup. But I can't remember
if any conclusion was reached (probably not!).

The multipliers I've seen have selectable latency down to 1 clock.
Rolling a barrel shifter will generate many layers of logic that will
need to be pipelined as well to reach high speeds, likely many more
layers for the same speeds.

What do you get if you design a floating point adder in the fabric? I
can only imagine it will be *much* larger and slower.
 
