What to do with an improved algorithm?

Mike Field

Hi,

I think I've got a really good way to improve a commonly used, well-established algorithm that is often used in FPGAs, and it all checks out. The implementation completes the same tasks in two-thirds the cycles, using two-thirds the resources of a standard Xilinx IP block, with comparable timing.

I've verified that the output is correct over the entire range of 32-bit input values.

I can't find any similar designs in a Google patent search, or by looking through journal articles. Once you are familiar with the original algorithm and the optimization is explained, it becomes pretty self-evident in retrospect. It just seems the right way to do things.

What should I do?

Should I just throw the implementation on a website somewhere as a curiosity?

Publish it in an article?

Pass it to a local student to make a paper from it? (I'm not studying at all)

Attempt to patent and then commercialize it?

Thanks!

Mike
 
I think the best option is to write an article -- or a patent.

Simply because it's an extra opportunity to verify that your approach is
correct.

If there's a hidden mistake, a student might be unable to see it.

Gene


On 03.09.2018 13:17, Mike Field wrote:
> [original post snipped]
 
I agree with Gene, plus you might consider publishing the IP as open source code on a website of your own or opencores.org.

--Mike

On Monday, September 3, 2018 at 3:41:02 AM UTC-7, Gene Filatov wrote:
> [quoted text snipped]
 
On 03/09/2018 11:17, Mike Field wrote:
> [original post snipped]
I'd publish - since you are not already in the IP licensing/patenting
groove I doubt you would make any money from it, but you might gain
kudos which may help your career and business. Xilinx might want to
publish it - which could give it a lot more visibility.

If you have a web site you could put it on that.

MK
 
On Monday, September 3, 2018 at 6:17:54 AM UTC-4, Mike Field wrote:
> [original post snipped]

Licensing and selling IP comes with a bit of a learning curve and requires an investment on your part. As Michael mentions, without some of that framework already in place, a license vetted by an IP attorney, and a good marketing plan, you might not see a return on that investment.

If you want your name more prominently attached to it, I'd suggest posting it on a personal GitHub account rather than opencores.org, which makes you conform to their requirements (such as the Wishbone interface, etc.).

Xilinx always welcomes guest articles on their blogs (although those have been in flux since the recent reorg), and their e-magazine Xcell Journal (again, that seems to have been discontinued, and the Xcell Daily Blog archived):

https://forums.xilinx.com/t5/Xilinx-Xclusive-Blog/bg-p/xilinx_xclusive
https://forums.xilinx.com/t5/Adaptable-Advantage-Blog/bg-p/tech_blog

https://www.xilinx.com/about/xcell-publications/xcell-journal.html


--Kris
 
On 9/3/2018 6:17 AM, Mike Field wrote:
> [original post snipped]

I am serious in this reply. You may already know this, if so then I
post for other people to know as well.

I would recommend that if you have means to get by in your life as
you are today without receiving income from this invention / discovery,
then not only give it away to people, but to do so in the name of Jesus
Christ. Tell the people around you that because of what Jesus has done
for you, you are desiring to give to the people here the fruit of what
He first gave you (your intellect, abilities, opportunities to learn,
guiding your path through life, etc.).

By giving Him the credit for your invention / discovery, and not doing
it for money or some other thing here, then you will receive a reward
in Heaven for your labor / knowledge gift to other people, and that re-
ward in Heaven is like the burning bush, or the fishes and the loaves,
meaning it is able to be used without being consumed. In short: re-
wards in Heaven do not wear out or diminish with use. They remain in
their full gift for all time, even under maximum / heavy use.

Remember the Lord in your life, and give things to people citing Him
as the reason why you are doing so. You will receive the maximum re-
turn on your efforts, and it sounds like you will also help out many
people here on this Earth by your gift.

It would be a win-win for everyone in your giving.

--
Rick C. Hodgin
 
On Tuesday, September 4, 2018 at 7:48:41 AM UTC-7, kkoorndyk wrote:
If you want your name more prominently attached to it, I'd suggest posting up on a personal Github account rather than opencores.org which makes you conform to their requirements (such as wishbone interface, etc.).

OpenCores encourages use of the Wishbone interface for SoC components, and they do offer coding guidelines, but there are no requirements for either. For example, in the entire DSP core section there are 38 entries, none of which is marked as Wishbone-compliant.
 
On Wednesday, 5 September 2018 02:48:41 UTC+12, kkoorndyk wrote:
> [quoted text snipped]

I never thought I would agree with Rick, but....

It all sounds like too much work. So here is a quick summary with C-like pseudo-code. I'll put the HDL code up somewhere soon, once I am happy with it - I am still removing the last rounding errors.

I've been playing with CORDIC, and have come up with what looks to be an overlooked optimization. I've done a bit of googling, and haven't found anything - maybe it is a novel approach?

I've tested it with 32-bit inputs and outputs, and it is within +/-2, with an average error of around 0.6. I am sure that with a bit more analysis of where the errors are coming from I can get it more accurate.

This has two parts to it; both seem quite trivial by themselves, but they complement each other quite nicely.

Scaling Z
---------
1. The 'z' value in CORDIC becomes smaller and smaller as the stages progress.

The core of CORDIC for SIN() and COS() is:

x = INITIAL;
y = INITIAL;
z = target_angle;
for(i = 0; i < CORDIC_REPS; i++ ) {
    int64_t tx, ty;
    // divide to scale the current vector
    tx = x >> (i+1);
    ty = y >> (i+1);

    // Either add or subtract at right angles to the current vector
    x -= (z > 0 ? ty : -ty);
    y += (z > 0 ? tx : -tx);
    z -= (z > 0 ? angle[i] : -angle[i]);
}


The values in angle[] are all-important, for example:

angle[0] = 1238021
angle[1] = 654136
angle[2] = 332050
angle[3] = 166670
angle[4] = 83415
angle[5] = 41718
angle[6] = 20860
angle[7] = 10430
angle[8] = 5215
angle[9] = 2607
angle[10] = 1303
angle[11] = 652
angle[12] = 326
angle[13] = 163
angle[14] = 81
angle[15] = 41
angle[16] = 20
angle[17] = 10
angle[18] = 5
angle[19] = 3
angle[20] = 1

If you make the following change:

for(i = 0; i < CORDIC_REPS; i++ ) {
    int64_t tx, ty;
    // divide to scale the current vector
    tx = x >> (i+1);
    ty = y >> (i+1);

    // Either add or subtract at right angles
    x -= (z > 0 ? ty : -ty);
    y += (z > 0 ? tx : -tx);
    z -= (z > 0 ? angle[i] : -angle[i]);

    //!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    z <<= 1; // Double the value of 'z'
    //!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
}

Then you can use all the bits in angle[], because each entry is scaled by 2^i (this data is from a different set of parameters, hence the values and count differ):
angle[0] = 1238021
angle[1] = 1308273
angle[2] = 1328199
angle[3] = 1333354
angle[4] = 1334654
angle[5] = 1334980
angle[6] = 1335061
angle[7] = 1335082
angle[8] = 1335087
angle[9] = 1335088
angle[10] = 1335088
angle[11] = 1335088
angle[12] = 1335088
angle[13] = 1335088
angle[14] = 1335088
angle[15] = 1335088
angle[16] = 1335088
angle[17] = 1335088
angle[18] = 1335088
angle[19] = 1335088
angle[20] = 1335088
angle[21] = 1335088
angle[22] = 1335088
angle[23] = 1335088
angle[24] = 1335088
angle[25] = 1335088
angle[26] = 1335088
angle[27] = 1335088
angle[28] = 1335088
angle[29] = 1335088

...and angle[] rapidly becomes a constant value after the first 9 or 10 iterations. This is what you would expect: the arctan values roughly halve each stage while the 2^i scale doubles.


Part 2: Add a lookup table
==========================

If you split the input into:

[2 MSB] quadrant
[next 9 bits] a lookup table index
[the rest] The starting CORDIC Z value, offset by 1<<(num_of_bits-1)

And have a lookup table of 512 x 36-bit values (i.e. a block RAM) which holds the SIN/COS values at the center of each range, e.g. initial[i] = scale_factor * sin(PI/2.0/1024*(2*i+1));

Because you need both the SIN() and COS() starting points, you can get them both from the same table (which screams "dual-port memory!" to me).

You can then do a standard lookup to get the starting points, effectively jumping 9 cycles into the CORDIC:

/* Use Dual Port memory for this */
if(quadrant & 1) {
    x = initial[index];
    y = initial[TABLE_SIZE-1-index];
} else {
    x = initial[TABLE_SIZE-1-index];
    y = initial[index];
}

/* Subtract half the sector angle from Z */
z -= 1 << (CORDIC_BITS-1);

/* Now do standard CORDIC, with a lot of work already done */
...

This removes ~8 cycles of latency.

The end result
==============

If you combine both of these you can get rid of the angle[] table completely - it is now a constant.

/* Use Dual Port memory for this */
if(quadrant & 1) {
    x = initial[index];
    y = initial[TABLE_SIZE-1-index];
} else {
    x = initial[TABLE_SIZE-1-index];
    y = initial[index];
}

/* Subtract half the sector angle from Z */
z -= 1 << (CORDIC_BITS-1);

/* Now do standard CORDIC, with a lot of work already done,
   so fewer repetitions are needed for the same accuracy */

for(i = 0; i < CORDIC_REPS; i++ ) {
    int64_t tx, ty;
    // Add rounding and divide to scale the current vector
    tx = x >> (INDEX_BITS+i);
    ty = y >> (INDEX_BITS+i);

    // Either add or subtract at right angles
    x -= (z > 0 ? ty : -ty);
    y += (z > 0 ? tx : -tx);
    z -= (z > 0 ? ANGLE_CONSTANT : -ANGLE_CONSTANT);
    z <<= 1;
}

Advantages of this method
=========================

If you have fully unrolled it to generate a full value per cycle, you end up with:
- 1 BRAM block used (bad)
- 9 fewer CORDIC stages (good)
- 8 or 9 cycles less latency (good)

For 16-bit values this may only need 5 stages, rather than 14.

If you are trying to minimize area, generating an n-bit value every ~n cycles, you end up with:

- 1 BRAM block used (bad)
- 8 or 9 cycles less latency (good)
- no need for the angles[] table (good)
- Fewer levels of logic, for a faster FMAX (good)

For 16-bit values, this could double the number of calculations you can compute at a given clock rate.

You can also tune things somewhat - you can always throw more BRAM blocks at it to reduce the number of CORDIC stages/iterations required, if you have blocks to spare - but one block to remove 9 stages is pretty good.

What do you think?
 
On 05.09.2018 8:40, Mike Field wrote:
> [quoted text snipped]

As far as your "revised" angle converging to a constant is concerned,
there's a simple explanation using the first two terms of the Taylor
series for the arctan function:

arctan(x) = x - 1/3 * x^3 + o(x^3)

so that

angle = arctan(2^-i) / pi * 2^i = (1/pi) * (1 - 1/3 * 2^-(2*i)) + o(2^-(2*i))

Based on this, you can easily say how many stages of the conventional
CORDIC algorithm you need to skip (i.e. store the outputs in a lookup
table) for a given bit precision.

I don't know the literature well, but I think it would be cool if you
actually wrote an article detailing your approach!

Gene
 
On Monday, September 3, 2018 at 6:17:54 AM UTC-4, Mike Field wrote:
> The implementation completes the same tasks in 2/3rds
> the cycles and using 2/3rds the resources of a
> standard Xilinx IP block, with comparable timing.
If perchance this is related to your recent CORDIC rotator code,
I've seen a number of CORDIC optimization schemes over the years
to reduce the number of rotation stages, IIRC typically either
by a 'jump start' or merging/optimizing rotation stages.

If I ever manage to find my folder of CORDIC papers, I'll post some links...

-------
some notes on your CORDIC implementation
http://hamsterworks.co.nz/mediawiki/index.php/CORDIC

- instead of quadrant folding, given I & Q you can do octant folding
(0-45 degrees) using the top three bits of the phase

- if you pass in the bit widths and stages as generics, you can
initialize the constant arctan table on-the-fly in a function
using VHDL reals

-Brian
 
earlier, I wrote:
If perchance this is related to your recent CORDIC rotator code,
I've seen a number of CORDIC optimization schemes over the years
to reduce the number of rotation stages, IIRC typically either
by a 'jump start' or merging/optimizing rotation stages.
oops, for some reason, when first reading this thread I didn't
see the later posts with the explanation... I'd swear they weren't
there, but maybe I was just scroll-impaired.

-Brian
 
On Wednesday, 5 September 2018 23:43:28 UTC+12, Gene Filatov wrote:
I don't know the literature well, but I think it would be cool if you
actually write an article detailing your approach!

Gene

I'll work on writing one up over the next few days, as well as posting a sample implementation.

Mike
 
> What do you think?

That's good stuff. I wonder why you think using a BRAM is bad? It's good to use the hard cores in an FPGA instead of the fabric--it's less power and deterministic routing times.

Is CORDIC still advantageous in a modern FPGA? The last time I needed to find a sine, I used a coarse BRAM lookup that output sine on one port and cosine on the other. Those were used as derivatives for a 2nd-order Taylor interpolation. Two multipliers (more hard cores) are used (via Horner's rule) for the 1st- and 2nd-order terms. I don't remember how many digits of accuracy this yields, but the latency is low.
 
On Wednesday, October 10, 2018 at 9:52:06 PM UTC-4, Kevin Neilson wrote:
> [quoted text snipped]

CORDIC is useful to compute high-precision atan2. Otherwise, for two 16-bit inputs you'd need a RAM with 2^32 addresses (maybe 16 times fewer if you take advantage of the symmetries).
 
> Cordic is useful to compute high-precision atan2. Otherwise for 2 16-bit inputs, you'd need a ram with 2^32 addresses (maybe 16 times less if you take advantage of the symmetries).

I think you missed the part about the Taylor interpolation.
 
On Thursday, October 11, 2018 at 7:37:37 AM UTC-6, Benjamin Couillard wrote:

> Cordic is useful to compute high-precision atan2. Otherwise for 2 16-bit inputs, you'd need a ram with 2^32 addresses (maybe 16 times less if you take advantage of the symmetries).

Sorry; I didn't see that you were talking about arctan. It's not quite as easy as sin/cos, but there is still the question as to whether a Farrow-type architecture using a coarse lookup and Taylor interpolations would be better than a CORDIC, and I am guessing that with the BRAMs and multipliers in present-day FPGAs, the answer would be yes.
 
On Friday, October 12, 2018 at 12:42:18 AM UTC-4, Kevin Neilson wrote:
> [quoted text snipped]

Yes, I think you're right, it could work for atan or atan2. You could implement a divider for atan2 (y/x), a sign lookup (to get the whole 360 degrees), and a BRAM + Taylor interpolation.

For atan, you'd simply skip the divider and the sign lookup.

On the other hand, Xilinx and Altera offer plug-and-play Cordic cores for atan/atan2.
 
> [quoted text snipped]

It's probably better to multiply y/x by x to get a normalized ratio rather than use a divider, which requires a lot of resources.

I recalled that I had to implement the atan2 function once for a QAM carrier/symbol recovery circuit. I didn't need great precision, so I split one quadrant into a grid and put the angle of the center of each grid square into a 2-dimensional lookup ROM. Then I could put X,Y into the ROM and get the coarse angle (which was then adjusted for the quadrant) and could use that for carrier recovery.
 
On Tuesday, October 16, 2018 at 10:36:56 PM UTC-4, Kevin Neilson wrote:
> [quoted text snipped]

One case that comes to mind: you have 2 quadrature signals

x = A(t) * cos(wt + some phase) + noise_x
y = A(t) * sin(wt + some phase) + noise_y

atan2(y, x) = wt + some phase. The variations of A(t) (as long as they are slow-ish) cancel out.

You can filter or average "wt + some phase" to extract the phase, or differentiate "wt + some phase" and then filter to get the filtered frequency.

So multiplying "y/x by x" would not make much sense in this case.
 
On Wednesday, October 17, 2018 at 9:26:38 AM UTC-6, Benjamin Couillard wrote:
> [quoted text snipped]

No, multiplying by x doesn't make sense. Perhaps using a ROM for 1/x and a multiplier would be better than a full divider.
 