extracting D from 1 / D*D

B

Bert_Paris

Guest
Hi Folks,
Incredibly busy summer here, so before burning my brain cells, Googling
or -worst- digging in my very dusty math courses, I submit this
question to the DSP experts who usually float around, hoping some
aren't at the beaches ;-)

How would you extract D from a X = K / (D * D) value (16 bits) ?
- in a small FPGA indeed, which has no embedded mult (but a mult could
be synthesized)
- please don't answer to synthesize sqrt(1/x) !
- I have lots of clock cycles available to compensate the lack of FPGA
resources.
- I would like to avoid a Piece-Wise Linear estimator if possible
(though it would allow for coding sensor non-linearities).

In the past, I remember implementing an sqrt operator using a suite
(the suite didn't converge as rapidly as it might, but the
implementation was easy), so a similar technique would be nice.
But I keep an open mind : any idea is most welcome :)

Thanks,
Bert
 
glen herrmannsfeldt said :

There is an iterative sqrt algorithm that only does subtraction.
For only 16 bits, it shouldn't take a lot of LUTs, and not
so many cycles, either. I haven't thought about coding it
in logic lately, though. What about the K?

-- glen
As mentioned, I have no problem regarding sqrt, but it is 1 / Sqr(D)
that I have to transform.
The solution I had in mind before posting was to use a non-restore
division (for 1/X) followed by the iterative sqrt. But maybe some one
has a better idea.

Bert
 
On 17/08/2011 13:11, Bert_Paris wrote:
Hi Folks,
Incredibly busy summer here, so before burning my brain cells, Googling
or -worst- digging in my very dusty math courses, I submit this question
to the DSP experts who usually float around, hoping some aren't at the
beaches ;-)

How would you extract D from a X = K / (D * D) value (16 bits) ?
- in a small FPGA indeed, which has no embedded mult (but a mult could
be synthesized)
- please don't answer to synthesize sqrt(1/x) !
- I have lots of clock cycles available to compensate the lack of FPGA
resources.
- I would like to avoid a Piece-Wise Linear estimator if possible
(though it would allow for coding sensor non-linearities).
Have you figured out what accuracy you need? It will make a /big/
difference to the ease of implementation.

Implementing a linear interpolation means having a table of
coefficients. Have you got space to allow lookup in a table of some sort?

You can reduce the coefficients a bit by first left-shifting X by two
bits repeatedly until you have a 1 in either of the two MSB's. Once you
have calculated D, right-shift it by 1 bit for each 2 bits you
left-shifted X. This will save you from having to have high-precision
lookups for small X.

In the past, I remember implementing an sqrt operator using a suite (the
suite didn't converge as rapidly as it might, but the implementation was
easy), so a similar technique would be nice.
But I keep an open mind : any idea is most welcome :)
You don't just have to do a square root - you also have to do a
reciprocal. These are both relatively hard to implement accurately
(unlike multiplication, which is fairly easy - and quite small if you
have lots of clock cycles and can do it bit for bit).

If you are looking for a series of estimators, Newton-Raphson generally
gives fast convergence. The sequence for your problem would be:

D_{n+1} = (3/2).D_n - (X/2K).D_n^3

Unfortunately, that will need a number of multiplies per cycle. You
would also need to test how well it worked with limited bit precision.
 
Bert_Paris <do_not_spam@me.com> wrote:

(snip)
How would you extract D from a X = K / (D * D) value (16 bits) ?
- in a small FPGA indeed, which has no embedded mult (but a mult could
be synthesized)
- please don't answer to synthesize sqrt(1/x) !
- I have lots of clock cycles available to compensate the lack of FPGA
resources.
- I would like to avoid a Piece-Wise Linear estimator if possible
(though it would allow for coding sensor non-linearities).
There is an iterative sqrt algorithm that only does subtraction.
For only 16 bits, it shouldn't take a lot of LUTs, and not
so many cycles, either. I haven't thought about coding it
in logic lately, though. What about the K?

-- glen
 
Thanks for your feedback !

David Brown a formulé la demande :
On 17/08/2011 13:11, Bert_Paris wrote:

How would you extract D from a X = K / (D * D) value (16 bits) ?

- in a small FPGA indeed, which has no embedded mult (but a mult could
be synthesized)
- please don't answer to synthesize sqrt(1/x) !
- I have lots of clock cycles available to compensate the lack of FPGA
resources.
- I would like to avoid a Piece-Wise Linear estimator if possible
(though it would allow for coding sensor non-linearities).

Have you figured out what accuracy you need? It will make a /big/ difference
to the ease of implementation.
10 bits looks good enough.

Implementing a linear interpolation means having a table of coefficients.
Have you got space to allow lookup in a table of some sort?
Not ideal in the context (uninitialized RAMs) :-( but that's doable.

You can reduce the coefficients a bit by first left-shifting X by two bits
repeatedly until you have a 1 in either of the two MSB's. Once you have
calculated D, right-shift it by 1 bit for each 2 bits you left-shifted X.
This will save you from having to have high-precision lookups for small X.
If I calculate the sqrt first, I may not need this.

In the past, I remember implementing an sqrt operator using a suite (the
suite didn't converge as rapidly as it might, but the implementation was
easy), so a similar technique would be nice.
But I keep an open mind : any idea is most welcome :)


You don't just have to do a square root - you also have to do a reciprocal.
Absolutely !

These are both relatively hard to implement accurately (unlike
multiplication, which is fairly easy - and quite small if you have lots of
clock cycles and can do it bit for bit).

If you are looking for a series of estimators, Newton-Raphson generally gives
fast convergence. The sequence for your problem would be:

D_{n+1} = (3/2).D_n - (X/2K).D_n^3

Unfortunately, that will need a number of multiplies per cycle. You would
also need to test how well it worked with limited bit precision.
While googling, I found a "magical" estimator based on manipulating
an IEEE floating point as an integer (which they refine with
an iteration of the NR above). Not useful here I guess.

After some thinking I will probably try this :
- calculate sqrt in format 8.2 or 8.3
- use a 1024 entries LUT for the reciprocal
or (due to lack of ROM) :
- use a non-restore division 2**N / X

Thanks !
 
Bert_Paris <do_not_spam@me.com> wrote:

(snip, I wrote)
There is an iterative sqrt algorithm that only does subtraction.
For only 16 bits, it shouldn't take a lot of LUTs, and not
so many cycles, either. I haven't thought about coding it
in logic lately, though. What about the K?

As mentioned, I have no problem regarding sqrt, but it is 1 / Sqr(D)
that I have to transform.
The solution I had in mind before posting was to use a non-restore
division (for 1/X) followed by the iterative sqrt. But maybe some one
has a better idea.
In Newton-Raphson, reciprocal square root is easier than just
square root, but that might not be true for the subtract based
version. Non-restore divide is pretty small, or even restore divide,
if you have the clock cycles to spare. How many cycles do you have?

-- glen
 
Of course you'll have to deal with the special case 1/sqrt(0) =
infinity. Leaving that aside:

You can reduce the coefficients a bit by first left-shifting X by two
bits repeatedly until you have a 1 in either of the two MSB's. Once
you have calculated D, right-shift it by 1 bit for each 2 bits you
left-shifted X. This will save you from having to have high-precision
lookups for small X.
This is very good advice, it makes the problem *much* easier.

This approach works because 1/sqrt(4D) = 1/sqrt(D) >> 1

The HW required is a leading-1 detector that selects whether to left
shift the input 16 bits by 0/2/4/8/10/12/14 bits and later to right
shift the output by 0/1/2/3/4/5/6/7 bits.

This approach means that you now only have to deal with the input range
16384...65535. Why is that an advantage ? Because the function
1/sqrt(D) is highly nonlinear over the range 1...65535 but only mildly
nonlinear over the range 16384...65535. Linear interpolation will work
*much* better over the reduced range.

Some numbers to back this up:

derivative of 1/sqrt(D) is -1/2Dsqrt(D)

derivative @ 1 = -0.5

derivative @ 16384 = -2.38e-7

derivative @ 65535 = -2.98e-8

The /gradient/ of the curve changes by a factor of roughly 17,000,000
times between 1 and 65535 but only by a factor of 12.5 between 16384 and
65535. In other words, the curve is *much* more linear over the reduced
range.

Applying piecewise linear interpolation over the reduced range should
work well.

As an example, using just 1 linear segment will give you around a 20%
worst case error (at the middle of the range ie 40960).

Using 64 linear segments will give you about 12 bit accuracy.

Hope that helps.

Stephen Ecob
Silicon On Inspiration
Sydney Australia
www.sioi.com.au
 
Steve <theecobs@gmail.com> wrote:

(snip, someone wrote)
You can reduce the coefficients a bit by first left-shifting X by two
bits repeatedly until you have a 1 in either of the two MSB's. Once
you have calculated D, right-shift it by 1 bit for each 2 bits you
left-shifted X. This will save you from having to have high-precision
lookups for small X.

This is very good advice, it makes the problem *much* easier.

This approach works because 1/sqrt(4D) = 1/sqrt(D) >> 1
Usually this would be done with a barrel shifter, but they are
not very LUT efficient in many FPGAs. If you aren't so restricted
in cycles, though, you can use an ordinary shift register.

I am not sure sure how much that saves in cycles or cells.

-- glen
 
Usually this would be done with a barrel shifter, but they are
not very LUT efficient in many FPGAs. If you aren't so restricted
in cycles, though, you can use an ordinary shift register.

You can use a Multiplier as barrel shifter too (Xilinx xapp195)
Given that there are plenty of clock cycles available and not much HW,
I'd spend the first 8 clock cycles looking at the two MSBs and left
shifting two positions when those MSBs are 00. The detection of the
zero input special case can also be folded into this part very efficiently.

Stephen Ecob
Silicon On Inspiration
Sydney Australia
www.sioi.com.au
 
glen herrmannsfeldt <gah@ugcs.caltech.edu> wrote:
Steve <theecobs@gmail.com> wrote:

(snip, someone wrote)
You can reduce the coefficients a bit by first left-shifting X by two
bits repeatedly until you have a 1 in either of the two MSB's. Once
you have calculated D, right-shift it by 1 bit for each 2 bits you
left-shifted X. This will save you from having to have high-precision
lookups for small X.

This is very good advice, it makes the problem *much* easier.

This approach works because 1/sqrt(4D) = 1/sqrt(D) >> 1

Usually this would be done with a barrel shifter, but they are
not very LUT efficient in many FPGAs. If you aren't so restricted
in cycles, though, you can use an ordinary shift register.
You can use a Multiplier as barrel shifter too (Xilinx xapp195)

--
Uwe Bonnes bon@elektron.ikp.physik.tu-darmstadt.de

Institut fuer Kernphysik Schlossgartenstrasse 9 64289 Darmstadt
--------- Tel. 06151 162516 -------- Fax. 06151 164321 ----------
 
Uwe Bonnes <bon@elektron.ikp.physik.tu-darmstadt.de> wrote:

(snip, I wrote)
Usually this would be done with a barrel shifter, but they are
not very LUT efficient in many FPGAs. If you aren't so restricted
in cycles, though, you can use an ordinary shift register.

You can use a Multiplier as barrel shifter too (Xilinx xapp195)
Usually the barrel shifter problem comes up doing floating point
in an FPGA. If you use the hardware multipliers for multiplying,
you might not have them avaialble. But it takes two or three for
floating point add, which usually makes add bigger than multiply!

For IEEE double, 53 bits is a big shifter!

-- glen
 
On Aug 18, 11:43 am, Steve <theec...@gmail.com> wrote:
Usually this would be done with a barrel shifter, but they are
not very LUT efficient in many FPGAs.  If you aren't so restricted
in cycles, though, you can use an ordinary shift register.

You can use a Multiplier as barrel shifter too (Xilinx xapp195)

Given that there are plenty of clock cycles available and not much HW,
I'd spend the first 8 clock cycles looking at the two MSBs and left
shifting two positions when those MSBs are 00.  The detection of the
zero input special case can also be folded into this part very efficiently.
How much clock cycles is plenty?

You can:

a) start with some value for D
b) compute T1=D*D
c) compute T2=Y*T1
d) check whether T2 > K
e) use nested intervals to find the exact value of D in 16 iterations.

Using 16 bit iterative multipliers you are in the range of 80 luts for
the datapath and 16*32 clock cycles.
Using bit serial implementation you will be using even less luts for
the data path and about 16*16*32 clock cycles.

An even simpler solution is not to use nested intervals but making use
of the fact that:
Y*(D+1)*(D+1) - Y*D*D = 2*Y*D + 1
and
2*Y*(D+1) - 2*Y*D = 2*Y

This means that you can scan all values of D using two additions per
try:
YDD_next = YDD + 2YD
2YD_next = 2YD + 2Y

Note that this version uses no multipliers.

resulting in
48 LUTs and 64k clock cycles (parallel)
or
6 LUTs and 2M clock cycles (bit serial)

10 bit accuracy is 64 times faster at identical hardware cost.

Kolja
cronologic.de
 
On Thu, 18 Aug 2011 17:21:39 -0700 (PDT), Kolja Sulimma wrote:

Y*(D+1)*(D+1) - Y*D*D = 2*Y*D + 1
and
2*Y*(D+1) - 2*Y*D = 2*Y

This means that you can scan all values of D using two additions per
try:
YDD_next = YDD + 2YD
2YD_next = 2YD + 2Y

Note that this version uses no multipliers.

resulting in
48 LUTs and 64k clock cycles (parallel)
or
6 LUTs and 2M clock cycles (bit serial)
Very nice use of second differences.

I believe there is a successive-approximation variant of this
which will calculate an N-bit result in N cycles without
multipliers. But it requires some ingenious juggling with
shifts, and the margin of this post, etc, etc (meaning:
I'm not smart enough to write down the algorithm nicely).
I haven't yet done a LUT count. It will be a little larger
than Kolja's numbers, but not massively bigger (I think).
Bert, we can discuss offline if you wish.

cheers
--
Jonathan
replace spam with jonathan in the published email address.
 
Hello all !

Sorry, I had to stay away from the forum for a few days.
*Thanks* a lot to all contributors for all their ideas.

I haven't made my mind yet since the sensor still
needs some validation, but there is a lot of food
for thougt here :)

For the record, "many" means 100 clock cycles would
be fast enough, more (1000 !) clock cycles would still
be possible. That's why in theory, very compact
implementations are possible.
Large accumulators are possible since they can be
pipelined / multi-cycled in this context.
The "best" implementation is a compact enough one
that is easy enough to code :)

I'll let you know if the project is going to use
this sensor, and which implementation I used.

Bert.
 

Welcome to EDABoard.com

Sponsor

Back
Top