Hardware floating point?

Normalisation of FP results also requires a "find first 1" operation.
Again, dedicated hardware is going to be a lot smaller and more
efficient than using LUTs.

Find first 1 can be done using a carry chain which is quite fast. It is
the same function as used in Gray code operations.


It is not something I have looked into, but I'll happily take your word
for it. However, like pretty much /any/ function, it will be smaller
and faster in dedicated hardware than in logic blocks.

I've done it in a Xilinx, and it's not fast. First you have to go across the routing fabric and go through a set of LUTs to get onto the carry chain. The carry chain is pretty fast; getting on and off the carry chain is slow. After you get off the carry chain, you have to go through the general routing fabric again. This is where most of your clock cycle gets eaten up. Remember, if you had dedicated hardware, this would be a dedicated route. Now you get into a second set of LUTs, where you have to AND the data from the carry chain with the original number in order to get a one-hot bus with only the leading 1 set. Now you have to encode that into a number which you can use for your shifter. You may be able to do this with the same set of LUTs; I can't remember.
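
The steps described above (carry chain, AND with the original number, encode for the shifter) can be sketched in software. This is a minimal Python model, not the poster's actual circuit. It assumes the carry chain acts as a two's-complement negate on the bit-reversed input, and the names `bit_reverse`, `leading_one_onehot`, and `normalize_shift` are hypothetical.

```python
WIDTH = 32  # assumed word width for illustration


def bit_reverse(x: int, width: int = WIDTH) -> int:
    """Reverse the bit order of x (in hardware this is only wiring)."""
    result = 0
    for i in range(width):
        if (x >> i) & 1:
            result |= 1 << (width - 1 - i)
    return result


def leading_one_onehot(x: int, width: int = WIDTH) -> int:
    """Return a one-hot word in which only the leading 1 of x is set."""
    mask = (1 << width) - 1
    r = bit_reverse(x, width)
    # The negate models the carry chain; the AND with the (reversed)
    # original keeps only the lowest set bit of r.
    lowest = r & (-r & mask)
    # Undo the reversal so the surviving bit is the leading 1 of x.
    return bit_reverse(lowest, width)


def normalize_shift(x: int, width: int = WIDTH) -> int:
    """Encode the one-hot leading 1 as the left shift that moves it to the MSB."""
    onehot = leading_one_onehot(x, width)
    if onehot == 0:
        raise ValueError("no bit set, nothing to normalize")
    return width - onehot.bit_length()
```

In this model `normalize_shift(0b00010110, 8)` gives the shift count a normalization shifter would need for an 8-bit word. The model says nothing about timing; the routing delays on and off the carry chain are the point of the post above.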
 
On Friday, 27 January 2017 at 11:34:00 UTC-5, David Brown wrote:
On 27/01/17 16:12, Benjamin Couillard wrote:
On Friday, 27 January 2017 at 03:17:21 UTC-5, David Brown wrote:
On 27/01/17 05:39, rickman wrote:
On 1/26/2017 9:38 PM, Kevin Neilson wrote:

I think you oversimplify FP. It works a lot better with
dedicated hardware.

Not sure what your point is. The principles are the same in
software or hardware. I was describing hardware I have
worked on. ST-100 from Star Technologies. I became very
intimate with the inner workings.

The only complications are from the various error and special
case handling of the IEEE-754 format. I doubt the FPGA is
implementing that, but possibly. The basics are still the
same. Adds use a barrel shifter to denormalize the mantissa
so the exponents are equal, an integer adder and a
normalization barrel shifter to produce the result.
Multiplies use a multiplier for the mantissas and an adder
for the exponents (with adjustment for exponent bias)
followed by a simple shifter to normalize the result.
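
The add and multiply data paths described above can be modelled in a few lines. This is a toy sketch, not IEEE-754 and not the ST-100 design: it assumes a 24-bit significand, an unbiased exponent (so no bias adjustment is needed), truncation instead of rounding, and no special cases. The names `from_float`, `to_float`, `normalize`, `fp_add`, and `fp_mul` are hypothetical.

```python
import math

MANT_BITS = 24  # assumed significand width, including the leading 1


def from_float(x: float):
    """Convert to a toy (exponent, signed mantissa) pair; value = mant * 2**exp."""
    if x == 0.0:
        return 0, 0
    m, e = math.frexp(x)  # x = m * 2**e with 0.5 <= |m| < 1
    return e - MANT_BITS, int(m * (1 << MANT_BITS))


def to_float(v) -> float:
    exp, mant = v
    return math.ldexp(mant, exp)


def normalize(exp: int, mant: int):
    """Shift so the leading 1 of |mant| sits at bit MANT_BITS - 1."""
    if mant == 0:
        return 0, 0
    sign = -1 if mant < 0 else 1
    mag = abs(mant)
    shift = mag.bit_length() - MANT_BITS  # the "find first 1" step
    mag = mag >> shift if shift > 0 else mag << -shift  # normalization shifter
    return exp + shift, sign * mag


def fp_add(a, b):
    (ea, ma), (eb, mb) = a, b
    if ea < eb:  # make a the operand with the larger exponent
        (ea, ma), (eb, mb) = (eb, mb), (ea, ma)
    # Denormalize: shift the smaller operand right until exponents match.
    aligned = (abs(mb) >> (ea - eb)) * (-1 if mb < 0 else 1)
    return normalize(ea, ma + aligned)  # integer adder, then normalize


def fp_mul(a, b):
    (ea, ma), (eb, mb) = a, b
    # Multiplier for the mantissas, adder for the exponents.
    return normalize(ea + eb, ma * mb)
```

For example, `to_float(fp_add(from_float(1.0), from_float(-0.9375)))` exercises the case where cancellation forces a multi-bit normalization shift. In the multiply, the product's leading 1 can only land in one of two positions, which is why a simple shifter suffices there.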

Both add and multiply are about the same level of complexity
as a barrel shifter is almost as much logic as the
multiplier.

Other than the special case handling of IEEE-754, what do you
think I am missing?

--

Rick C

It just all works better with dedicated hardware. Finding the
leading one for normalization is somewhat slow in the FPGA and
is something that benefits from dedicated hardware. Using a
DSP48 (if we're talking about Xilinx) for a barrel shifter is
fairly fast, but requires 3 cycles of latency, can only shift
up to 18 bits, and is overkill for the task. You're using a
full multiplier as a shifter; a dedicated shifter would be
smaller and faster. All this stuff adds latency. When I pull
up CoreGen and ask for the basic FP adder, I get something that
uses only 2 DSP48s but has 12 cycles of latency. And there is a
lot of fabric routing so timing is not very deterministic.

I'm not sure how much you know about multipliers and shifters.
Multipliers are not magical. Multiplexers *are* big. A
multiplier has N stages with a one bit adder at every bit
position. A barrel multiplexer has nearly as many bit positions
(you typically don't need all the possible outputs), but uses a
bit less logic at each position. Each bit position still needs a
full 4 input LUT. Not tons of difference in complexity.


A 32-bit barrel shifter can be made with 5 steps, each step being a
set of 32 two-input multiplexers. Dedicated hardware for that will
be /much/ smaller and more efficient than using LUTs or a full
multiplier.
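
As a rough illustration of the five-step structure described above, here is a Python model in which each stage is a row of 32 two-input multiplexers and stage k shifts by 2**k when bit k of the shift amount is set. The names `mux2` and `barrel_shift_left` are hypothetical, and the model shows only the logic structure, not area or timing.

```python
WIDTH = 32
STAGES = 5  # log2(32)


def mux2(sel: int, a: int, b: int) -> int:
    """Two-input multiplexer: returns a when sel is 0, b when sel is 1."""
    return b if sel else a


def barrel_shift_left(value: int, amount: int) -> int:
    """Logical left shift of a 32-bit value by 0..31 using 5 mux stages."""
    bits = [(value >> i) & 1 for i in range(WIDTH)]
    for stage in range(STAGES):
        step = 1 << stage               # this stage shifts by 1, 2, 4, 8, 16
        sel = (amount >> stage) & 1     # one select line shared by the row
        bits = [
            mux2(sel, bits[i], bits[i - step] if i >= step else 0)
            for i in range(WIDTH)
        ]
    return sum(bit << i for i, bit in enumerate(bits))
```

The model uses 5 x 32 = 160 two-input multiplexers in total, which is the count the argument above relies on.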

Normalisation of FP results also requires a "find first 1"
operation. Again, dedicated hardware is going to be a lot smaller
and more efficient than using LUTs.

So a DSP block that has dedicated FP support is going to be smaller
and faster than using integer DSP blocks with LUTs to do the same
job.

The multipliers I've seen have selectable latency down to 1
clock. Rolling a barrel shifter will generate many layers of
logic that will need to be pipelined as well to reach high
speeds, likely many more layers for the same speeds.

What do you get if you design a floating point adder in the
fabric? I can only imagine it will be *much* larger and slower.


If I understand, you can do a barrel shifter with log2(n) complexity,
hence your 5 steps, but you will have the combinational delay of 5
muxes, which could limit your maximum clock frequency. A brute-force
approach will use more resources but will probably allow a higher
clock frequency.


The "brute force" method would be 1 layer of 32 32-input multiplexers.
And how do you implement a 32-input multiplexer in gates? You basically
have 5 layers of 2-input multiplexers.

If the depth of the multiplexer is high enough, you might use tri-state
gates but I suspect that in this case you'd implement it with normal logic.

Yeah, you're right.
 
On Mon, 30 Jan 2017 10:40:39 -0800, Kevin Neilson wrote:

Normalisation of FP results also requires a "find first 1"
operation.
Again, dedicated hardware is going to be a lot smaller and more
efficient than using LUTs.

Find first 1 can be done using a carry chain which is quite fast. It
is the same function as used in Gray code operations.


It is not something I have looked into, but I'll happily take your word
for it. However, like pretty much /any/ function, it will be smaller
and faster in dedicated hardware than in logic blocks.


I've done it in a Xilinx, and it's not fast. First you have to go
across the routing fabric and go through a set of LUTs to get onto the
carry chain. The carry chain is pretty fast; getting on and off the
carry chain is slow. After you get off the carry chain, you have to go
through the general routing fabric again. This is where most of your
clock cycle gets eaten up. Remember, if you had dedicated hardware,
this would be a dedicated route. Now you get into a second set of LUTs,
where you have to AND the data from the carry chain with the original
number in order to get a one-hot bus with only the leading 1 set. Now
you have to encode that into a number which you can use for your
shifter. You may be able to do this with the same set of LUTs; I can't
remember.

What Xilinx part?

The Altera Stratix 10 (I think that's the one) uses paired DSP blocks
that are designed with a bit of extra logic so that you can use the pair
of them as a floating-point block, or each one as a fixed-point block.
(I'm not using their terminology).

Apparently there's enough stuff going on at the really high end that
floating point is better.

--
Tim Wescott
Control systems, embedded software and circuit design
I'm looking for work! See my website if you're interested
http://www.wescottdesign.com
 
On 1/30/2017 1:40 PM, Kevin Neilson wrote:
Normalisation of FP results also requires a "find first 1" operation.
Again, dedicated hardware is going to be a lot smaller and more
efficient than using LUTs.

Find first 1 can be done using a carry chain which is quite fast. It is
the same function as used in Gray code operations.


It is not something I have looked into, but I'll happily take your word
for it. However, like pretty much /any/ function, it will be smaller
and faster in dedicated hardware than in logic blocks.


I've done it in a Xilinx, and it's not fast. First you have to go across the routing fabric and go through a set of LUTs to get onto the carry chain. The carry chain is pretty fast; getting on and off the carry chain is slow. After you get off the carry chain, you have to go through the general routing fabric again. This is where most of your clock cycle gets eaten up. Remember, if you had dedicated hardware, this would be a dedicated route. Now you get into a second set of LUTs, where you have to AND the data from the carry chain with the original number in order to get a one-hot bus with only the leading 1 set. Now you have to encode that into a number which you can use for your shifter. You may be able to do this with the same set of LUTs; I can't remember.

The comparison is using a carry chain vs. not using a carry chain.
First 1 in LUTs is either log2(N) in depth and linear in size or log2(N)
in size and linear in depth (speed). Using general routing and LUTs
this is very slow. Using a fast carry uses a LUT to enter the carry
chain and a LUT to exit the carry chain. The carry chain is a fraction
of a nanosecond per bit.
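
The depth trade-off mentioned above can be illustrated with two software models of "find first 1", expressed as a leading-zero count. This is only a sketch of the two structures: one scans bit by bit (linear depth), the other halves the search window at each step (log2(N) depth). The names `clz_linear` and `clz_log` are hypothetical, and `clz_log` assumes the width is a power of two.

```python
def clz_linear(x: int, width: int = 32) -> int:
    """Count leading zeros by scanning from the MSB: up to `width` steps."""
    for i in range(width):
        if (x >> (width - 1 - i)) & 1:
            return i
    return width


def clz_log(x: int, width: int = 32) -> int:
    """Count leading zeros by binary search: log2(width) steps."""
    if x == 0:
        return width
    mask = (1 << width) - 1
    count = 0
    half = width // 2
    while half:
        if (x >> (width - half)) == 0:   # are the top `half` bits all zero?
            count += half                # yes: record them and
            x = (x << half) & mask       # move the remainder to the top
        half //= 2
    return count
```

Both functions return the same result for every input; they differ only in how many dependent steps sit between input and output, which is what maps to logic depth in the fabric.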

--

Rick C
 
On 1/30/2017 3:54 PM, Tim Wescott wrote:
On Mon, 30 Jan 2017 10:40:39 -0800, Kevin Neilson wrote:

Normalisation of FP results also requires a "find first 1"
operation.
Again, dedicated hardware is going to be a lot smaller and more
efficient than using LUTs.

Find first 1 can be done using a carry chain which is quite fast. It
is the same function as used in Gray code operations.


It is not something I have looked into, but I'll happily take your word
for it. However, like pretty much /any/ function, it will be smaller
and faster in dedicated hardware than in logic blocks.


I've done it in a Xilinx, and it's not fast. First you have to go
across the routing fabric and go through a set of LUTs to get onto the
carry chain. The carry chain is pretty fast; getting on and off the
carry chain is slow. After you get off the carry chain, you have to go
through the general routing fabric again. This is where most of your
clock cycle gets eaten up. Remember, if you had dedicated hardware,
this would be a dedicated route. Now you get into a second set of LUTs,
where you have to AND the data from the carry chain with the original
number in order to get a one-hot bus with only the leading 1 set. Now
you have to encode that into a number which you can use for your
shifter. You may be able to do this with the same set of LUTs; I can't
remember.

What Xilinx part?

The Altera Stratix 10 (I think that's the one) uses paired DSP blocks
that are designed with a bit of extra logic so that you can use the pair
of them as a floating-point block, or each one as a fixed-point block.
(I'm not using their terminology).

Apparently there's enough stuff going on at the really high end that
floating point is better.

I'm not sure what "high end" means. Floating point has some advantages
and it has some disadvantages. Fixed point is the same. Neither is
perfect for all uses or even *any* uses actually. You always need to
analyze the problem you are solving and consider the sources of
computational errors. They are different but always potentially present
with either approach.
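
A small numerical sketch of how the error sources differ. It assumes a 16-bit Q1.15 fixed-point format and IEEE-754 single precision purely as examples; the helper names `to_q15`, `to_f32`, and `rel_err` are hypothetical.

```python
import struct


def to_q15(x: float) -> float:
    """Quantize x in [-1, 1) to 16-bit Q1.15 fixed point and convert back."""
    q = max(-32768, min(32767, round(x * 32768)))
    return q / 32768


def to_f32(x: float) -> float:
    """Round x to IEEE-754 single precision and convert back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def rel_err(approx: float, exact: float) -> float:
    return abs(approx - exact) / abs(exact)


# Fixed point: absolute error is bounded by half an LSB (2**-16) at any
# magnitude, so small signals lose relative accuracy or vanish entirely.
small = 0.00001
fixed_small = to_q15(small)      # quantizes to zero in Q1.15

# Floating point: relative error is bounded (2**-24 for normal values),
# but absolute error grows with magnitude.
float_small = to_f32(small)
big = 16777217.0                 # 2**24 + 1, exact in a 32-bit integer
float_big = to_f32(big)          # not representable in single precision
```

Neither format is free of error; the fixed-point case loses the small value, while the floating-point case loses the low bit of the large one.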

--

Rick C
 
On Mon, 30 Jan 2017 16:32:51 -0500, rickman wrote:

On 1/30/2017 3:54 PM, Tim Wescott wrote:
On Mon, 30 Jan 2017 10:40:39 -0800, Kevin Neilson wrote:

Normalisation of FP results also requires a "find first 1"
operation.
Again, dedicated hardware is going to be a lot smaller and more
efficient than using LUTs.

Find first 1 can be done using a carry chain which is quite fast.
It is the same function as used in Gray code operations.


It is not something I have looked into, but I'll happily take your
word for it. However, like pretty much /any/ function, it will be
smaller and faster in dedicated hardware than in logic blocks.


I've done it in a Xilinx, and it's not fast. First you have to go
across the routing fabric and go through a set of LUTs to get onto the
carry chain. The carry chain is pretty fast; getting on and off the
carry chain is slow. After you get off the carry chain, you have to
go through the general routing fabric again. This is where most of
your clock cycle gets eaten up. Remember, if you had dedicated
hardware, this would be a dedicated route. Now you get into a second
set of LUTs,
where you have to AND the data from the carry chain with the original
number in order to get a one-hot bus with only the leading 1 set. Now
you have to encode that into a number which you can use for your
shifter. You may be able to do this with the same set of LUTs; I
can't remember.

What Xilinx part?

The Altera Stratix 10 (I think that's the one) uses paired DSP blocks
that are designed with a bit of extra logic so that you can use the
pair of them as a floating-point block, or each one as a fixed-point
block. (I'm not using their terminology).

Apparently there's enough stuff going on at the really high end that
floating point is better.

I'm not sure what "high end" means. Floating point has some advantages
and it has some disadvantages. Fixed point is the same. Neither is
perfect for all uses or even *any* uses actually. You always need to
analyze the problem you are solving and consider the sources of
computational errors. They are different but always potentially present
with either approach.

Yes, you are correct.

I tend to mostly work with stuff that comes out of an ADC, goes through
some processing (usually for me it's a processor and not an FPGA, but
it's still DSP), and then goes out a DAC. In that case, fixed-point
processing for the signal itself is usually the way to go because the ADC
and DAC between them pretty much set the ranges, which means that
floating point is just a waste of silicon.

HOWEVER: that's just what I mostly run into. I'm currently working on a
project where, by its nature, the sensible numerical format is
double-precision floating point (not FPGA -- it's _slow_ data reception on a
PC-class processor, where double-precision floating point is almost as fast
as integer math unless you use the DSP extensions).

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

I'm looking for work -- see my website!
 
