Question about partial multiplication result in transposed FIR

fl (Guest) wrote:
Hi,

When I read a tutorial on FIR implementation on FPGAs, I am not clear
about the statement "partial results can be used for many multiplications
(regardless of symmetry)". That slide may be based on multipliers built
from FPGA logic cells, not on a dedicated MAC in the FPGA. Anyhow, I
don't know why partial results can be used for many multiplications
regardless of symmetry; I only see how to save 50% of the multipliers by
taking advantage of the symmetry of the FIR coefficients.

Could you tell me how to understand the partial results?

Thanks,
 
On Friday, September 25, 2015 at 4:40:18 PM UTC-4, kaz wrote:
Hi,

When I read a tutorial on FIR implementation on FPGAs, I am not clear
about the statement "partial results can be used for many multiplications
(regardless of symmetry)". That slide may be based on multipliers built
from FPGA logic cells, not on a dedicated MAC in the FPGA. Anyhow, I
don't know why partial results can be used for many multiplications
regardless of symmetry; I only see how to save 50% of the multipliers by
taking advantage of the symmetry of the FIR coefficients.

Could you tell me how to understand the partial results?

Thanks,

can we read that tutorial?

Kaz
---------------------------------------
Posted through http://www.FPGARelated.com

Here is the link:
https://www.google.ca/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CB0QFjAAahUKEwi_vKXaoZPIAhVRB5IKHZbHBTk&url=http%3A%2F%2Fcct.cnes.fr%2Fsystem%2Ffiles%2Fcnes_cct%2F459-mce%2Fpublic%2F06_MVD_%2520FIR_Design.pdf&usg=AFQjCNHDrIXK_J6WMErALOhKYrGsxLFg6w

Thanks,
 
Hi,

When I read a tutorial on FIR implementation on FPGAs, I am not clear
about the statement "partial results can be used for many multiplications
(regardless of symmetry)". That slide may be based on multipliers built
from FPGA logic cells, not on a dedicated MAC in the FPGA. Anyhow, I
don't know why partial results can be used for many multiplications
regardless of symmetry; I only see how to save 50% of the multipliers by
taking advantage of the symmetry of the FIR coefficients.

Could you tell me how to understand the partial results?

Thanks,

can we read that tutorial?

Kaz
---------------------------------------
Posted through http://www.FPGARelated.com
 
On 9/25/2015 4:10 PM, fl wrote:
Hi,

When I read a tutorial on FIR implementation on FPGAs, I am not clear
about the statement "partial results can be used for many multiplications
(regardless of symmetry)". That slide may be based on multipliers built
from FPGA logic cells, not on a dedicated MAC in the FPGA. Anyhow, I
don't know why partial results can be used for many multiplications
regardless of symmetry; I only see how to save 50% of the multipliers by
taking advantage of the symmetry of the FIR coefficients.

Could you tell me how to understand the partial results?

They are talking about an extreme level of optimization: sharing partial
products between multiplies. The trouble is, each multiply is by a
different coefficient *and* a different data value. But on each
successive clock cycle the data moves to the next coefficient, so if any
bits of the coefficients match, the result of the previous partial
product can simply be shifted into the appropriate location in the
adjacent product calculation. It would be a bit tortuous to code and
would nullify the utility of the hard multipliers available in many
FPGAs. It might be worthwhile if you are designing an ASIC, though.

--

Rick
 
On 9/26/2015 12:06 AM, rickman wrote:
On 9/25/2015 4:10 PM, fl wrote:
Hi,

When I read a tutorial on FIR implementation on FPGAs, I am not clear
about the statement "partial results can be used for many multiplications
(regardless of symmetry)". That slide may be based on multipliers built
from FPGA logic cells, not on a dedicated MAC in the FPGA. Anyhow, I
don't know why partial results can be used for many multiplications
regardless of symmetry; I only see how to save 50% of the multipliers by
taking advantage of the symmetry of the FIR coefficients.

Could you tell me how to understand the partial results?

They are talking about an extreme level of optimization: sharing partial
products between multiplies. The trouble is, each multiply is by a
different coefficient *and* a different data value. But on each
successive clock cycle the data moves to the next coefficient, so if any
bits of the coefficients match, the result of the previous partial
product can simply be shifted into the appropriate location in the
adjacent product calculation. It would be a bit tortuous to code and
would nullify the utility of the hard multipliers available in many
FPGAs. It might be worthwhile if you are designing an ASIC, though.

I posted this before I read your link. My assumption was right, but I
hadn't seen the block diagram, which shows all the multiplies happening
on the same data at the same time. I've written FIR filters before; I
should have remembered this. So the individual partial products can be
shared across all the multiplies and added appropriately. I expect this
assumes fixed coefficients, which naturally make the multipliers simpler.
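A toy software sketch of that sharing (my own illustration, not from the
slides; the function name is mine): with fixed coefficients, every product
c*x is a sum of shifted copies of x, and each distinct shift term only has
to be formed once for all the taps.

```python
# Shared partial products for fixed coefficients: each needed (x << k) term
# is formed once and reused by every tap whose coefficient has bit k set.
def products_via_shared_shifts(x, coeffs):
    # Collect every bit position used by any coefficient.
    positions = {k for c in coeffs for k in range(c.bit_length()) if (c >> k) & 1}
    shifted = {k: x << k for k in positions}        # computed once, shared
    # Each tap just sums its own subset of the shared terms.
    return [sum(shifted[k] for k in range(c.bit_length()) if (c >> k) & 1)
            for c in coeffs]

print(products_via_shared_shifts(57, [0b1011, 0b0110, 0b1101]))  # → [627, 342, 741]
```

The result matches [c * x for c in coeffs]; the point is that only five
distinct shift terms are formed for the three products.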

--

Rick
 
On 9/26/15 12:06 AM, rickman wrote:
On 9/25/2015 4:10 PM, fl wrote:
Hi,

When I read a tutorial on FIR implementation on FPGAs, I am not clear
about the statement "partial results can be used for many multiplications
(regardless of symmetry)". That slide may be based on multipliers built
from FPGA logic cells, not on a dedicated MAC in the FPGA. Anyhow, I
don't know why partial results can be used for many multiplications
regardless of symmetry; I only see how to save 50% of the multipliers by
taking advantage of the symmetry of the FIR coefficients.

Could you tell me how to understand the partial results?

They are talking about an extreme level of optimization: sharing partial
products between multiplies. The trouble is, each multiply is by a
different coefficient *and* a different data value. But on each
successive clock cycle the data moves to the next coefficient, so if any
bits of the coefficients match, the result of the previous partial
product can simply be shifted into the appropriate location in the
adjacent product calculation. It would be a bit tortuous to code and
would nullify the utility of the hard multipliers available in many
FPGAs. It might be worthwhile if you are designing an ASIC, though.

A simple, and useful, transformation of a FIR or IIR filter for FPGAs is
to switch from one big summing node, fed by a series of delays with
tap-offs and multiplies, to a single node that feeds forward (or back)
into a chain of nodes with simple adders. Since the registers at the
logic-cell outputs are free in an FPGA, this is the most efficient
format. It also means that if the coefficients are constants, you have
the possibility of optimizing some of the partial products when building
explicit multipliers.
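That restructuring is the transposed FIR form; a behavioural sketch (my
own illustration, with Python standing in for HDL) of the feed-forward
version:

```python
# Transposed-form FIR sketch: the current input sample is broadcast to ALL
# coefficient multipliers at once, and each adder's result is registered and
# fed to the next adder, so there is no single wide summing node.
def transposed_fir(samples, coeffs):
    regs = [0] * len(coeffs)                # one register after each adder
    out = []
    for x in samples:                       # one clock cycle per input sample
        products = [c * x for c in coeffs]  # every multiply sees the same x
        regs = [products[i] + (regs[i + 1] if i + 1 < len(coeffs) else 0)
                for i in range(len(coeffs))]
        out.append(regs[0])                 # filter output for this cycle
    return out

print(transposed_fir([1, 0, 0, 0], [3, 5, 7]))   # impulse response → [3, 5, 7, 0]
```

Because all multipliers see the same sample on the same cycle, fixed
coefficients let their partial products be shared, as discussed upthread.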
 
On Fri, 25 Sep 2015 15:47:29 -0700, fl wrote:

On Friday, September 25, 2015 at 4:40:18 PM UTC-4, kaz wrote:
Hi,

When I read a tutorial on FIR implementation on FPGAs, I am not clear
about the statement "partial results can be used for many multiplications
(regardless of symmetry)". That slide may be based on multipliers built
from FPGA logic cells, not on a dedicated MAC in the FPGA. Anyhow, I
don't know why partial results can be used for many multiplications
regardless of symmetry; I only see how to save 50% of the multipliers by
taking advantage of the symmetry of the FIR coefficients.

Could you tell me how to understand the partial results?

Thanks,

can we read that tutorial?

Kaz
---------------------------------------
Posted through http://www.FPGARelated.com

Here is the link:
https://www.google.ca/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CB0QFjAAahUKEwi_vKXaoZPIAhVRB5IKHZbHBTk&url=http%3A%2F%2Fcct.cnes.fr%2Fsystem%2Ffiles%2Fcnes_cct%2F459-mce%2Fpublic%2F06_MVD_%2520FIR_Design.pdf&usg=AFQjCNHDrIXK_J6WMErALOhKYrGsxLFg6w

Well, the guy throws out one unfounded statement and then never supports
it. In a live presentation you could raise your hand and ask about it.
Can you send him an email?

I can see ways that, given a predetermined set of coefficients, you may
be able to get the gate count down, but that's not really within the
scope of the talk.

I suspect it's an editing turd -- he had something brilliant in a prior
version of the presentation which he either found out was unfounded, or
which he didn't have time to present for this particular talk, so he set
out to edit all of it out but he left this one little bit.

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
 
On Sun, 27 Sep 2015 15:08:02 -0500, Tim Wescott wrote:

On Fri, 25 Sep 2015 15:47:29 -0700, fl wrote:

On Friday, September 25, 2015 at 4:40:18 PM UTC-4, kaz wrote:
Hi,

When I read a tutorial on FIR implementation on FPGAs, I am not clear
about the statement "partial results can be used for many multiplications
(regardless of symmetry)". That slide may be based on multipliers built
from FPGA logic cells, not on a dedicated MAC in the FPGA. Anyhow, I
don't know why partial results can be used for many multiplications
regardless of symmetry; I only see how to save 50% of the multipliers by
taking advantage of the symmetry of the FIR coefficients.

Could you tell me how to understand the partial results?

Thanks,

can we read that tutorial?

Kaz
---------------------------------------
Posted through http://www.FPGARelated.com

Here is the link:
https://www.google.ca/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CB0QFjAAahUKEwi_vKXaoZPIAhVRB5IKHZbHBTk&url=http%3A%2F%2Fcct.cnes.fr%2Fsystem%2Ffiles%2Fcnes_cct%2F459-mce%2Fpublic%2F06_MVD_%2520FIR_Design.pdf&usg=AFQjCNHDrIXK_J6WMErALOhKYrGsxLFg6w

Thanks,

Well, the guy throws out one unfounded statement and then never supports
it. In a live presentation you could raise your hand and ask about it.
Can you send him an email?

I can see ways that, given a predetermined set of coefficients, you may
be able to get the gate count down, but that's not really within the
scope of the talk.

I suspect it's an editing turd -- he had something brilliant in a prior
version of the presentation which he either found out was unfounded, or
which he didn't have time to present for this particular talk, so he set
out to edit all of it out but he left this one little bit.

Just a note: since nearly all FIR filters are symmetrical around some
point, it would be interesting to see how much the area would increase or
decrease if you insisted on that, then reduced the multipliers by a
factor of two at the cost of having one more adder per multiplier.

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
 
On 9/27/2015 4:08 PM, Tim Wescott wrote:
On Fri, 25 Sep 2015 15:47:29 -0700, fl wrote:

On Friday, September 25, 2015 at 4:40:18 PM UTC-4, kaz wrote:
Hi,

When I read a tutorial on FIR implementation on FPGAs, I am not clear
about the statement "partial results can be used for many multiplications
(regardless of symmetry)". That slide may be based on multipliers built
from FPGA logic cells, not on a dedicated MAC in the FPGA. Anyhow, I
don't know why partial results can be used for many multiplications
regardless of symmetry; I only see how to save 50% of the multipliers by
taking advantage of the symmetry of the FIR coefficients.

Could you tell me how to understand the partial results?

Thanks,

can we read that tutorial?

Kaz
---------------------------------------
Posted through http://www.FPGARelated.com

Here is the link:
https://www.google.ca/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CB0QFjAAahUKEwi_vKXaoZPIAhVRB5IKHZbHBTk&url=http%3A%2F%2Fcct.cnes.fr%2Fsystem%2Ffiles%2Fcnes_cct%2F459-mce%2Fpublic%2F06_MVD_%2520FIR_Design.pdf&usg=AFQjCNHDrIXK_J6WMErALOhKYrGsxLFg6w

Thanks,

Well, the guy throws out one unfounded statement and then never supports
it. In a live presentation you could raise your hand and ask about it.
Can you send him an email?

I can see ways that, given a predetermined set of coefficients, you may
be able to get the gate count down, but that's not really within the
scope of the talk.

I think it is clear that he is talking about a hard coded set of
coefficients. If the coefficients are variables in registers, there is
no efficient way to optimize the multipliers. With fixed coefficients
and multipliers in the fabric, this would be an important optimization.
He discusses using fabric for multipliers in later slides. It is not
unreasonable to expect the tools to do this optimization automatically.

--

Rick
 
On Sun, 27 Sep 2015 18:28:12 -0400, rickman wrote:

On 9/27/2015 4:08 PM, Tim Wescott wrote:
On Fri, 25 Sep 2015 15:47:29 -0700, fl wrote:

On Friday, September 25, 2015 at 4:40:18 PM UTC-4, kaz wrote:
Hi,

When I read a tutorial on FIR implementation on FPGAs, I am not clear
about the statement "partial results can be used for many multiplications
(regardless of symmetry)". That slide may be based on multipliers built
from FPGA logic cells, not on a dedicated MAC in the FPGA. Anyhow, I
don't know why partial results can be used for many multiplications
regardless of symmetry; I only see how to save 50% of the multipliers by
taking advantage of the symmetry of the FIR coefficients.

Could you tell me how to understand the partial results?

Thanks,

can we read that tutorial?

Kaz
---------------------------------------
Posted through http://www.FPGARelated.com

Here is the link:
https://www.google.ca/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CB0QFjAAahUKEwi_vKXaoZPIAhVRB5IKHZbHBTk&url=http%3A%2F%2Fcct.cnes.fr%2Fsystem%2Ffiles%2Fcnes_cct%2F459-mce%2Fpublic%2F06_MVD_%2520FIR_Design.pdf&usg=AFQjCNHDrIXK_J6WMErALOhKYrGsxLFg6w

Thanks,

Well, the guy throws out one unfounded statement and then never
supports it. In a live presentation you could raise your hand and ask
about it.
Can you send him an email?

I can see ways that, given a predetermined set of coefficients, you may
be able to get the gate count down, but that's not really within the
scope of the talk.

I think it is clear that he is talking about a hard coded set of
coefficients. If the coefficients are variables in registers, there is
no efficient way to optimize the multipliers. With fixed coefficients
and multipliers in the fabric, this would be an important optimization.
He discusses using fabric for multipliers in later slides. It is not
unreasonable to expect the tools to do this optimization automatically.

I totally missed that -- yes, one would hope that in 2015 the tools would
be able to figure out how to optimize fixed multipliers.

This is a tangent, but it makes me wonder -- I saw a paper ages ago that
basically said that if addition and subtraction are equally costly, you
can optimize a multiplication by using both -- i.e., if you're
multiplying x by 11110001111b, you can either sum eight shifted copies
of x, or compute (x << 11) - (x << 7) + (x << 4) - x, cutting the term
count from eight to four.

So -- do modern optimizers do this when multiplying by fixed
coefficients, or not?

--
www.wescottdesign.com
 
On 9/27/2015 6:44 PM, Tim Wescott wrote:
On Sun, 27 Sep 2015 18:28:12 -0400, rickman wrote:

On 9/27/2015 4:08 PM, Tim Wescott wrote:
On Fri, 25 Sep 2015 15:47:29 -0700, fl wrote:

On Friday, September 25, 2015 at 4:40:18 PM UTC-4, kaz wrote:
Hi,

When I read a tutorial on FIR implementation on FPGAs, I am not clear
about the statement "partial results can be used for many multiplications
(regardless of symmetry)". That slide may be based on multipliers built
from FPGA logic cells, not on a dedicated MAC in the FPGA. Anyhow, I
don't know why partial results can be used for many multiplications
regardless of symmetry; I only see how to save 50% of the multipliers by
taking advantage of the symmetry of the FIR coefficients.

Could you tell me how to understand the partial results?

Thanks,

can we read that tutorial?

Kaz
---------------------------------------
Posted through http://www.FPGARelated.com

Here is the link:
https://www.google.ca/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&cad=rja&uact=8&ved=0CB0QFjAAahUKEwi_vKXaoZPIAhVRB5IKHZbHBTk&url=http%3A%2F%2Fcct.cnes.fr%2Fsystem%2Ffiles%2Fcnes_cct%2F459-mce%2Fpublic%2F06_MVD_%2520FIR_Design.pdf&usg=AFQjCNHDrIXK_J6WMErALOhKYrGsxLFg6w

Thanks,

Well, the guy throws out one unfounded statement and then never
supports it. In a live presentation you could raise your hand and ask
about it.
Can you send him an email?

I can see ways that, given a predetermined set of coefficients, you may
be able to get the gate count down, but that's not really within the
scope of the talk.

I think it is clear that he is talking about a hard coded set of
coefficients. If the coefficients are variables in registers, there is
no efficient way to optimize the multipliers. With fixed coefficients
and multipliers in the fabric, this would be an important optimization.
He discusses using fabric for multipliers in later slides. It is not
unreasonable to expect the tools to do this optimization automatically.

I totally missed that -- yes, one would hope that in 2015 the tools would
be able to figure out how to optimize fixed multipliers.

This is a tangent, but it makes me wonder -- I saw a paper ages ago that
basically said that if addition and subtraction are equally costly, you
can optimize a multiplication by using both -- i.e., if you're
multiplying x by 11110001111b, you can either sum eight shifted copies
of x, or compute (x << 11) - (x << 7) + (x << 4) - x, cutting the term
count from eight to four.

So -- do modern optimizers do this when multiplying by fixed
coefficients, or not?

I can't imagine they wouldn't. But mostly, multiplications of any type
are done using hardware multipliers available in the vast majority of
FPGAs.

--

Rick
 
You might be referring to the technique of expressing a number in CSD (Canonical Signed Digits), which reduces the number of nonzero bits in a number.
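That recoding is usually computed as the non-adjacent form (NAF), one
standard way to get a CSD-style representation; a short sketch (the
function name is mine):

```python
# Non-adjacent form (NAF): a minimal signed-digit (CSD-style) recoding with
# digits in {-1, 0, +1} and no two adjacent nonzero digits.
def signed_digits(n):
    """Return (digit, shift) pairs with n == sum(d << s for d, s in result)."""
    digits, shift = [], 0
    while n:
        if n & 1:
            d = 2 - (n % 4)         # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
            digits.append((d, shift))
            n -= d                  # clears the low bits so a run collapses
        n >>= 1
        shift += 1
    return digits

# 11110001111b (= 1935) has eight 1 bits but only four signed digits:
print(signed_digits(0b11110001111))   # → [(-1, 0), (1, 4), (-1, 7), (1, 11)]
```

Those four digits correspond to 2^11 - 2^7 + 2^4 - 1 = 1935, so a
constant multiplier needs four add/subtract terms instead of eight adds.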
 
I just built an optimized Galois Field vector multiplier, which multiplies a vector by a scalar. As an experiment, I split it into two parts, one that was common to all elements of the vector, and then the parts that were unique to each element of the vector. I had assumed this is something the synthesizer would do anyway, but I was surprised to find that writing it the way I did cut down the number of LUTs by a big margin.
 
On 10/3/2015 3:37 PM, Kevin Neilson wrote:
> I just built an optimized Galois Field vector multiplier, which multiplies a vector by a scalar. As an experiment, I split it into two parts, one that was common to all elements of the vector, and then the parts that were unique to each element of the vector. I had assumed this is something the synthesizer would do anyway, but I was surprised to find that writing it the way I did cut down the number of LUTs by a big margin.

Why is it using LUTs instead of multipliers? Are these numbers too
small and too many to use the multipliers efficiently?

--

Rick
 
On Sat, 03 Oct 2015 15:51:34 -0400, rickman wrote:

On 10/3/2015 3:37 PM, Kevin Neilson wrote:
I just built an optimized Galois Field vector multiplier, which
multiplies a vector by a scalar. As an experiment, I split it into two
parts, one that was common to all elements of the vector, and then the
parts that were unique to each element of the vector. I had assumed
this is something the synthesizer would do anyway, but I was surprised
to find that writing it the way I did cut down the number of LUTs by a
big margin.

Why is it using LUTs instead of multipliers? Are these numbers too
small and too many to use the multipliers efficiently?

Kevin is talking about Galois Field multipliers. The integer multiplier
blocks are useless for that.

I occasionally implement cryptographic primitives in FPGAs. These often
use a combination of linear mixing functions and (other stuff which
provides nonlinearity, which I don't need to talk about here). The
linear mixing functions often contain GF multiplications (sometimes by
constants).

It's possible to express the linear mixing functions in HDL behaviourally
(e.g. including things that are recognisably GF multipliers), and it's
also possible to express them as a sea of XOR gates.

My experience using the latest tools from Xilinx is that for a particular
128-bit mixing function I was getting three times as many levels of logic
from the behavioural source as from the sea-of-XOR-gates source, even
though both described the same function.


BTW, one runs into similar problems when calculating CRCs of wide buses.
The CRCs also simplify to XOR trees, but in this case we are calculating
the remainder after a division, rather than a multiplication. (And yes,
the integer DSP blocks are useless for this too.)
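The "remainder after division" view reduces to shifts and XORs, which is
why it flattens to XOR trees; a sketch (my own illustration, with a
made-up 6-bit message and a small CRC-3 generator):

```python
# The core of a CRC is the remainder after carry-less (GF(2)) division:
# "subtract" is XOR, so the whole computation is shifts and XORs -- exactly
# the XOR trees a synthesizer builds when the bus is processed in parallel.
def gf2_mod(dividend, poly):
    """Remainder of dividend divided by poly, with XOR as subtraction."""
    while dividend.bit_length() >= poly.bit_length():
        shift = dividend.bit_length() - poly.bit_length()
        dividend ^= poly << shift           # cancel the current leading bit
    return dividend

# CRC of message m with a degree-3 generator g: append 3 zero bits, then
# take the carry-less remainder.
msg, g = 0b110101, 0b1011
print(bin(gf2_mod(msg << 3, g)))            # → 0b111
```

No step carries into a neighbouring bit, so adders, carry chains, and DSP
blocks offer nothing here.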

Regards,
Allan
 
On 10/4/2015 2:01 AM, Allan Herriman wrote:
On Sat, 03 Oct 2015 15:51:34 -0400, rickman wrote:

On 10/3/2015 3:37 PM, Kevin Neilson wrote:
I just built an optimized Galois Field vector multiplier, which
multiplies a vector by a scalar. As an experiment, I split it into two
parts, one that was common to all elements of the vector, and then the
parts that were unique to each element of the vector. I had assumed
this is something the synthesizer would do anyway, but I was surprised
to find that writing it the way I did cut down the number of LUTs by a
big margin.

Why is it using LUTs instead of multipliers? Are these numbers too
small and too many to use the multipliers efficiently?


Kevin is talking about Galois Field multipliers. The integer multiplier
blocks are useless for that.

We are talking modulo 2 multiplies at every bit, otherwise known as AND
gates with no carry? I'm a bit fuzzy on this.

Now I'm confused by Kevin's description. If the vector is multiplied by
a scalar, what parts are common and what parts are unique? What parts
of this are fixed vs. variable? The only parts a tool can optimize are
the fixed operands. Or I am totally missing the concept.


I occasionally implement cryptographic primitives in FPGAs. These often
use a combination of linear mixing functions and (other stuff which
provides nonlinearity, which I don't need to talk about here). The
linear mixing functions often contain GF multiplications (sometimes by
constants).

It's possible to express the linear mixing functions in HDL behaviourally
(e.g. including things that are recognisably GF multipliers), and it's
also possible to express them as a sea of XOR gates.

My experience using the latest tools from Xilinx is that for a particular
128 bit mixing function I was getting three times as many levels of logic
from from the behavioural source as I was from the sea of XOR gates
source, even though both described the same function.

I don't know enough of how the tools work to say what is going on. I
just know that when I dig into the output of the tools for poorly
synthesized code, I find the problems are that my code doesn't specify
the simple structure I had imagined it did. So I fix my code. :)

Mostly this has to do with trying to use the carry out of the top of
adders. Sometimes I get two adders.


BTW, one runs into similar problems when calculating CRCs of wide buses.
The CRCs also simplify to XOR trees, but in this case we are calculating
the remainder after a division, rather than a multiplication. (And yes,
the integer DSP blocks are useless for this too.)

Yep, you don't want a carry, so don't even think about using adders or
multipliers. They are not adders and multipliers in every type of algebra.

--

Rick
 
Not only can I not use the DSP blocks, but the carry chains don't work either. The only way to do a big XOR is with LUTs, and at my clock speeds, I can only do about 3 LUT levels. At least there are plenty of BRAMs in Virtex parts now, so it's easy to do logs, inverses, and exponentiation.
 
We are talking modulo 2 multiplies at every bit, otherwise known as AND
gates with no carry? I'm a bit fuzzy on this.

Now I'm confused by Kevin's description. If the vector is multiplied by
a scalar, what parts are common and what parts are unique? What parts
of this are fixed vs. variable? The only parts a tool can optimize are
the fixed operands. Or I am totally missing the concept.

Say you're multiplying a by a vector [b c d]. Let's say we're using the field GF(8), so a is 3 bits. Now a can be thought of as ( a0*alpha^0 + a1*alpha^1 + a2*alpha^2 ), where a0 is bit 0 of a and alpha is the primitive element of the field. Then a*b, a*c, or a*d is just a sum of some combination of those 3 values in the parentheses, depending on the locations of the 1s in b, c, or d. So you can premultiply the three values in the parentheses (the common part) and then take sums of subsets of those three (the individual parts). It's all a bunch of XORs at the end. This is just a complicated way of saying that by writing the HDL at a more explicit level, the synthesizer is better able to find common factors and use a lot fewer gates.
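A runnable sketch of that factoring (my own illustration, picking the
defining polynomial x^3 + x + 1 for GF(8); names are mine): the values
a*alpha^i are computed once (the common part), and each element of the
result is an XOR of a subset of them (the individual part).

```python
# Reference GF(8) multiply (shift-and-reduce), defining polynomial x^3 + x + 1.
def gf8_mul(a, b):
    p = 0
    for i in range(3):
        if (b >> i) & 1:
            p ^= a                  # accumulate a * alpha^i
        a <<= 1                     # a := a * alpha
        if a & 0b1000:
            a ^= 0b1011             # reduce: x^3 = x + 1
    return p

# Kevin's factoring: a*alpha^0, a*alpha^1, a*alpha^2 are common to every
# element of the vector; each element then XORs together the subset of them
# selected by the 1 bits of b, c, d, ...
def gf8_vector_scale(a, vec):
    common = [gf8_mul(a, 1 << i) for i in range(3)]   # shared partial products
    out = []
    for b in vec:
        acc = 0
        for i in range(3):
            if (b >> i) & 1:
                acc ^= common[i]    # per-element part: just an XOR subset
        out.append(acc)
    return out

print(gf8_vector_scale(3, [1, 2, 7]))    # → [3, 6, 2]
```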
 
On 10/5/2015 1:29 PM, Kevin Neilson wrote:
We are talking modulo 2 multiplies at every bit, otherwise known as
AND gates with no carry? I'm a bit fuzzy on this.

Now I'm confused by Kevin's description. If the vector is
multiplied by a scalar, what parts are common and what parts are
unique? What parts of this are fixed vs. variable? The only parts
a tool can optimize are the fixed operands. Or I am totally
missing the concept.

Say you're multiplying a by a vector [b c d]. Let's say we're using
the field GF(8) so a is 3 bits. Now a can be thought of as (
a0*alpha^0 + a1*alpha^1 + a2*alpha^2 ), where a0 is bit 0 of a, and
alpha is the primitive element of the field. Then a*b or a*c or a*d
is just a sum of some combination of those 3 values in the
parentheses, depending upon the locations of the 1s in b, c, or d.
So you can premultiply the three values in the parentheses (the
common part) and then take sums of subsets of those three (the
individual parts). It's all a bunch of XORs at the end. This is
just a complicated way of saying that by writing the HDL at a more
explicit level, the synthesizer is better able to find common factors
and use a lot fewer gates..

Ok, I'm not at all familiar with GFs. I see now a bit of what you are
saying. But to be honest, I don't know that the tools would have any
trouble with the example you give. The tools are pretty durn good at
optimizing... *but*... there are two things to optimize for, size and
performance. They are sometimes mutually exclusive, sometimes not. If
you ask the tool to give you the optimum size, I don't think you will do
better by coding it differently while describing *exactly* the same
behavior.

If you ask the tool to optimize for speed, the tool will feel free to
duplicate logic if it allows higher performance, for example, by
combining terms in different ways. Or less logic may require a longer
chain of LUTs which will be slower. LUT sizes in FPGAs don't always
match the logical breakdown so that speed or size can vary a lot
depending on the partitioning.

--

Rick
 
On 07/10/15 05:02, rickman wrote:
On 10/5/2015 1:29 PM, Kevin Neilson wrote:

We are talking modulo 2 multiplies at every bit, otherwise known as
AND gates with no carry? I'm a bit fuzzy on this.

Now I'm confused by Kevin's description. If the vector is
multiplied by a scalar, what parts are common and what parts are
unique? What parts of this are fixed vs. variable? The only parts
a tool can optimize are the fixed operands. Or I am totally
missing the concept.

Say you're multiplying a by a vector [b c d]. Let's say we're using
the field GF(8) so a is 3 bits. Now a can be thought of as (
a0*alpha^0 + a1*alpha^1 + a2*alpha^2 ), where a0 is bit 0 of a, and
alpha is the primitive element of the field. Then a*b or a*c or a*d
is just a sum of some combination of those 3 values in the
parentheses, depending upon the locations of the 1s in b, c, or d.
So you can premultiply the three values in the parentheses (the
common part) and then take sums of subsets of those three (the
individual parts). It's all a bunch of XORs at the end. This is
just a complicated way of saying that by writing the HDL at a more
explicit level, the synthesizer is better able to find common factors
and use a lot fewer gates..

Ok, I'm not at all familiar with GFs. I see now a bit of what you are
saying. But to be honest, I don't know that the tools would have any
trouble with the example you give. The tools are pretty durn good at
optimizing... *but*... there are two things to optimize for, size and
performance. They are sometimes mutually exclusive, sometimes not. If
you ask the tool to give you the optimum size, I don't think you will do
better by coding it differently while describing *exactly* the same
behavior.

If you ask the tool to optimize for speed, the tool will feel free to
duplicate logic if it allows higher performance, for example, by
combining terms in different ways. Or less logic may require a longer
chain of LUTs which will be slower. LUT sizes in FPGAs don't always
match the logical breakdown so that speed or size can vary a lot
depending on the partitioning.

GF(8) is a Galois Field with 8 elements. These can be written in
different ways. Sometimes people like to be completely abstract and
write 0, 1, a, b, c, d, e, f.

<http://www.wolframalpha.com/input/?i=GF%288%29>

Others will want to use the polynomial forms (which are used in the
definition of GF(8)) and write: 0, 1, x, x+1, x², x²+1, x²+x, x²+x+1.

<http://www.math.uic.edu/~leon/mcs425-s08/handouts/field.pdf>

That form is easily written in binary: 000, 001, 010, 011, 100, etc.

One key point about a GF (or any field) is that you can combine elements
through addition, subtraction, multiplication, and division (except by
0) and get another element of the field. To make this work, however,
addition and multiplication (and therefore their inverses) are very
different from how you normally define them.

A GF always has size p^n - in this case, p is 2 and n is 3. Your
elements are n-tuples of values from Z/p (the integers modulo p).
Addition is just pairwise addition of elements, modulo p. For the case
p = 2 (which is commonly used for practical applications, because it
suits computing so neatly), this means you can hold your elements as a
simple binary string of n digits, and addition is just xor.

Multiplication is a little more complex. Treat your elements as
polynomials (as shown earlier) and multiply them, reducing the
coefficients modulo p. Then the whole thing is reduced modulo the
field's defining irreducible degree-n polynomial (in this case,
x³ + x + 1). This always gets you back to another element in the field.

The real magic here is that the multiplication table (excluding 0) forms
a Latin square - every element appears exactly once in each row and
column. This means that division "works".

It's easy to get a finite field like this for size p, where p is prime -
it's just the integers modulo p, and you can use normal addition and
multiplication modulo p. But the beauty of GF's is that they give you
fields of different sizes - in particular, size 2^n.


Back to the task in hand - implementing GF(8) on an FPGA. Addition is
just xor - the tools should not have trouble optimising that!
Multiplication in GF is usually done with lookup tables. For a field as
small as GF(8), you would have a table with all the products. This
could be handled in a 64x3-bit "ROM" (6 address bits in, 3 bits out), or
the tools could generate logic and then reduce it to simple gates (for
example, with the defining polynomial x³ + x + 1, the LSB of the product
is a0·b0 XOR a1·b2 XOR a2·b1). For larger fields, such as the very
useful GF(2^8), you will probably use lookup tables for the logs and
antilogs, and use them for multiplication and division.




A key area of practical application for GF's is in error detection and
correction. A good example is for RAID6 (two redundant disks in a RAID
array) - there is an excellent paper on the subject by the guy who
implemented RAID 6 in Linux, using GF(2^8). It gives a good
introduction to Galois fields.

<https://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf>
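The recipe above (carry-less polynomial product, then reduction modulo
x^3 + x + 1) and the Latin-square claim can both be checked mechanically;
a small sketch (my own illustration):

```python
# GF(8) multiply as described: carry-less polynomial product, then reduce
# modulo the defining polynomial x^3 + x + 1 (binary 1011).
def gf8_mul(a, b):
    prod = 0
    for i in range(3):              # "multiply": shift and XOR, no carries
        if (b >> i) & 1:
            prod ^= a << i
    for i in (4, 3):                # reduce any x^4 and x^3 terms
        if prod & (1 << i):
            prod ^= 0b1011 << (i - 3)
    return prod

# The nonzero multiplication table is a Latin square, so division "works":
rows = [{gf8_mul(a, b) for b in range(1, 8)} for a in range(1, 8)]
print(all(row == set(range(1, 8)) for row in rows))   # → True
```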
 
