Unique uses for the DSP48

K

Kevin Neilson

Guest
I've tried to figure out how to use the Xilinx DSP48s for Galois arithmetic, but they really aren't that useful for that. The new ones can do a 96-bit unary XOR, which can be used for GF(2) matrix multiplication, but the multipliers themselves aren't of much use for Galois math. I wondered what unusual uses (besides FIR filters or integer matrix multipliers) people have used these for. Here are some of mine:

- Transposers (shifting rows up a DSP column in A/B, latching into P, and shifting columsn serially out of P using the pattern matcher)
- Barrel shifters (Not that good for wide buses, though)
- Modulo by a constant (using Barrett's Reduction)
- GF(2) bit-by-vector multiply-accumulate (using the ALU as an XOR)
 
On Thursday, June 27, 2019 at 8:27:03 PM UTC-4, Kevin Neilson wrote:
I've tried to figure out how to use the Xilinx DSP48s for Galois arithmetic, but they really aren't that useful for that. The new ones can do a 96-bit unary XOR, which can be used for GF(2) matrix multiplication, but the multipliers themselves aren't of much use for Galois math. I wondered what unusual uses (besides FIR filters or integer matrix multipliers) people have used these for. Here are some of mine:

- Transposers (shifting rows up a DSP column in A/B, latching into P, and shifting columsn serially out of P using the pattern matcher)
- Barrel shifters (Not that good for wide buses, though)
- Modulo by a constant (using Barrett's Reduction)
- GF(2) bit-by-vector multiply-accumulate (using the ALU as an XOR)

I'm no expert, but I believe Galois filters use modulo 2 arithmetic without carries, so multipliers are not what you want. Since there is no carry, there is no need to use any special features. The typical fabric logic will do the job quite nicely.

Certainly barrel shifters would be useful. Don't know what Barrett's Reduction is.

I don't think GF(2) would be useful since you can't use a multiplier without the carry as far as I know. Am I mistaken?

--

Rick C.

- Get 1,000 miles of free Supercharging
- Tesla referral code - https://ts.la/richard11209
 
In article <c159d7b2-da4b-4852-b90f-aa619fc9a1b4@googlegroups.com>,
Kevin Neilson <kevin.neilson@xilinx.com> wrote:
I've tried to figure out how to use the Xilinx DSP48s for Galois arithmetic, but they really aren't that useful for that. The new ones can do a 96-bit unary XOR, which can be used for GF(2) matrix
multiplication, but the multipliers themselves aren't of much use for Galois math. I wondered what unusual uses (besides FIR filters or integer matrix multipliers) people have used these for. Here are some
of mine:

- Transposers (shifting rows up a DSP column in A/B, latching into P, and shifting columsn serially out of P using the pattern matcher)
- Barrel shifters (Not that good for wide buses, though)
- Modulo by a constant (using Barrett's Reduction)
- GF(2) bit-by-vector multiply-accumulate (using the ALU as an XOR)

I've used one to implement a 3-input median filter (12-bit max inputs).
Used the SIMD modes to evaluate each comparison between the three
inputs.

Regards,

Mark
 
I have to do a lot of GF(2) multiplications, which end up being a vector times a matrix, with the matrix being sometimes 128x128 bits or bigger. There are no carries, like you say. Mostly it's an AND of the vector and a column of the matrix and then an XOR of that, and that is done for each column.. It can use up a lot of LUTs and the routing can get congested, especially when the synthesizer tries to share terms. The XOR is a tree with 3-4 levels of logic.

I think some processors can do a "carryless" multiply for this, so the columns are added but no carries are taken to the next column. You end up with a result that is wider than the field, so you have to do a reduction stage, but it's still advantageous. If you could disable the carries in the DSP48s, you could use them in the GF(2) multipliers, but unfortunately there's no such option.
 
On 29/06/2019 00:44, Kevin Neilson wrote:
I have to do a lot of GF(2) multiplications, which end up being a vector times a matrix, with the matrix being sometimes 128x128 bits or bigger. There are no carries, like you say. Mostly it's an AND of the vector and a column of the matrix and then an XOR of that, and that is done for each column. It can use up a lot of LUTs and the routing can get congested, especially when the synthesizer tries to share terms. The XOR is a tree with 3-4 levels of logic.

I think some processors can do a "carryless" multiply for this, so the columns are added but no carries are taken to the next column. You end up with a result that is wider than the field, so you have to do a reduction stage, but it's still advantageous. If you could disable the carries in the DSP48s, you could use them in the GF(2) multipliers, but unfortunately there's no such option.

Do you really mean GF(2) ? That is just single bits. Addition is XOR,
multiplication is AND.

If you meant something like GF(2^8), which is popular in RAID and other
forward error correction systems on byte-wide data, then I can't think
of any smart way to handle multiplication in an FPGA. Multiplying by 2
is easy enough, but arbitrary multiplication is done using log tables in
software. If there is a nice hardware method, I'd love to know.
 
On Thu, 04 Jul 2019 11:16:07 +0200, David Brown wrote:

On 29/06/2019 00:44, Kevin Neilson wrote:
I have to do a lot of GF(2) multiplications, which end up being a
vector times a matrix, with the matrix being sometimes 128x128 bits or
bigger. There are no carries, like you say. Mostly it's an AND of the
vector and a column of the matrix and then an XOR of that, and that is
done for each column. It can use up a lot of LUTs and the routing can
get congested, especially when the synthesizer tries to share terms.
The XOR is a tree with 3-4 levels of logic.

I think some processors can do a "carryless" multiply for this, so the
columns are added but no carries are taken to the next column. You end
up with a result that is wider than the field, so you have to do a
reduction stage, but it's still advantageous. If you could disable the
carries in the DSP48s, you could use them in the GF(2) multipliers, but
unfortunately there's no such option.


Do you really mean GF(2) ? That is just single bits. Addition is XOR,
multiplication is AND.

If you meant something like GF(2^8), which is popular in RAID and other
forward error correction systems on byte-wide data, then I can't think
of any smart way to handle multiplication in an FPGA. Multiplying by 2
is easy enough, but arbitrary multiplication is done using log tables in
software. If there is a nice hardware method, I'd love to know.

I think he does mean GF(2).
Here's the user guide:
<https://www.xilinx.com/support/documentation/user_guides/ug579-ultrascale-dsp.pdf>

Allan
 
I normally work in fields like GF(2^8) or GF(2^11) (or, for AES, GF(2^128)) but all the operations decompose into GF(2) operations. For example, to do a x b = c in GF(2^8), all these values are 8-bit vectors. You can expand b into an 8x8-bit matrix B in which b is the bottom row, b*alpha is the next row up, and b*alpha**7 is the top row, where alpha is the primitive element of the field. Then you multiply a x B using GF(2) to get the bits of c. So all the operations of a characteristic-2 field (based on a power of 2) can be broken down into GF(2) operations.

I always break down matrices in larger fields to GF(2) matrices to make it easier on the synthesizer. It can get bogged down otherwise.
 
On 06/07/2019 17:38, Kevin Neilson wrote:
I normally work in fields like GF(2^8) or GF(2^11) (or, for AES,
GF(2^128)) but all the operations decompose into GF(2) operations.
For example, to do a x b = c in GF(2^8), all these values are 8-bit
vectors. You can expand b into an 8x8-bit matrix B in which b is the
bottom row, b*alpha is the next row up, and b*alpha**7 is the top
row, where alpha is the primitive element of the field. Then you
multiply a x B using GF(2) to get the bits of c. So all the
operations of a characteristic-2 field (based on a power of 2) can be
broken down into GF(2) operations.

I always break down matrices in larger fields to GF(2) matrices to
make it easier on the synthesizer. It can get bogged down
otherwise.

Thanks for that information. I did not know GF(2^8) calculations could
be done by matrices over GF(2) like that, but it makes sense. I have
only used the field in software - it is key to RAID6 and higher RAID
versions - and there you do the multiplication and division using log
tables.
 
I could see how the log tables might be better in software, but in hardware a pair of log/antilog lookup tables (and a mod 2**m-1 adder) will normally be bigger and slower than a hardware multiplier, especially as m gets bigger (where the field is GF(2**m)). I do use lookup tables for reciprocals, but above some m, tables become inefficient for that as well.
 

Welcome to EDABoard.com

Sponsor

Back
Top