FIR Filter cores for Virtex-][

Jim George

Hello All,
I'd like to know if the generally held advice that Distributed
Arithmetic (DA) filters are the "best" way to implement FIR filters on
an FPGA still holds good when one uses a Virtex-][.
In my application, I require a 256-tap filter which takes in
18-bit samples at 50 MSPS and decimates them down to 10 MSPS (coeffs are
16-bit). I use an XC2V3000. Currently, we don't have the hardware
required to synthesize the complete design (not enough memory), so
I've synthesized just the filter with a simple testbench. It turns out
that MAC FIR filters require far fewer resources than a DA-FIR one with
an equivalent spec. Could this be due to Virtex-]['s multipliers or is
this some quirk I'm not taking into account?
Thanks in advance.
-Jim.
 
You should definitely use the multipliers if you are using a V2. Why burn
up the fabric with DA logic if you have unused, fast, embedded multipliers?
If your output sample rate is 10 Msps, then you should only need 256 * 10/200
(about 13) multipliers, where 200 MHz is the estimated pipelined multiplier speed. You
will need about the same number of adders and you will accumulate over
200/10 = 20 cycles. You will need to store 20 coeffs for each multiplier,
which you can do with LUT RAMs or with blockRAMs if you have extra. If the
coeffs are symmetric, you can halve the number of multipliers required (if
you are short) by adding symmetry adders in the CLB fabric.
-Kevin
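
For anyone who wants to redo the arithmetic above with their own numbers, here is a rough Python sketch. The function name and its parameters are made up for illustration, and it only counts multipliers and coefficient storage; it assumes each multiplier produces one product per clock and ignores control logic and adders.

```python
import math

def mac_fir_resources(taps, out_rate_msps, clk_mhz, symmetric=False):
    """Rough resource estimate for a time-shared (MAC) decimating FIR.

    Only the output rate matters for a decimator, because products are
    only computed for the outputs that are actually kept.
    """
    # Clock cycles available to each multiplier per output sample.
    cycles_per_output = clk_mhz // out_rate_msps
    # Symmetric coefficients let a pre-adder fold two taps into one product.
    products = math.ceil(taps / 2) if symmetric else taps
    multipliers = math.ceil(products / cycles_per_output)
    # Coefficients each multiplier has to cycle through.
    coeffs_per_multiplier = math.ceil(products / multipliers)
    return {
        "cycles_per_output": cycles_per_output,
        "multipliers": multipliers,
        "coeffs_per_multiplier": coeffs_per_multiplier,
    }

if __name__ == "__main__":
    print(mac_fir_resources(256, 10, 200))
    print(mac_fir_resources(256, 10, 200, symmetric=True))
```

With 256 taps, 10 Msps out and a 200 MHz clock this gives 13 multipliers, each accumulating over 20 cycles; with symmetric coefficients the count drops to 7. A slower clock raises the multiplier count accordingly.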
 
It really depends on your specific design. I've used DA as well as MAC-based
filters in V2.
The multipliers will often be smaller realizations, provided you have enough
multipliers on the chip; however, for larger filters, you may find it is more
efficient to use DA to keep stuff together. Multipliers also have a fixed
width, so if your data or coefficients are greater than 17 bits it will often
pay to use DA instead; likewise if the coefficients are only a couple of bits.
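
As background on what DA actually does, here is a minimal bit-serial sketch in Python. It is a software model only, not any particular core: the function names are invented for illustration, it assumes two's complement samples, and it uses a single lookup table over all taps. That table grows as 2^taps, so a real design would split the taps into small groups and add the partial results.

```python
def build_da_lut(coeffs):
    """Precompute every possible sum of coefficients.

    Bit k of the address selects whether coeffs[k] is included.
    """
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << len(coeffs))
    ]

def da_dot(samples, lut, bits):
    """Dot product of coeffs and samples, one sample bit-plane at a time.

    No multiplier is used: each step is a table lookup, a shift and an add.
    Samples must fit in `bits`-bit two's complement.
    """
    acc = 0
    for j in range(bits):
        # Gather bit j of every sample into one LUT address.
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> j) & 1) << k
        term = lut[addr] << j
        # The top bit of a two's complement number has negative weight.
        acc += -term if j == bits - 1 else term
    return acc

if __name__ == "__main__":
    coeffs = [3, -1, 4, 2]
    samples = [5, -7, 100, -128]
    lut = build_da_lut(coeffs)
    direct = sum(c * x for c, x in zip(coeffs, samples))
    print(da_dot(samples, lut, bits=8), direct)
```

The cost scales with the sample width (one lookup per bit) rather than with a fixed multiplier width, which is one way to see why wide data or very narrow coefficients can tip the balance toward DA.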

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759
 
Thanks for the quick answers. From what I've seen, the DA filter (with
the default P&R) spreads out across 3/4 of the chip. I need to make
this a parallel-DA filter since my clock rate is limited (I've got -5
grade devices, and taking the synthesizer's advice, clock rate should
not exceed 150 MHz.) At the same rate, MAC FIR filters use just a
small part of the chip. The MAC-FIR core, when used as a decimator,
has a quirk due to which it does not achieve full throughput (this is
documented in the datasheet), so I compensate by using a FIFO and a
DCM to raise the filter's clock rate. It seems to work fine in
simulation, but is there something I should watch out for when I go
forward with the design? As you can tell, I'm quite new to this area
:)
-Jim
 
Ah yes, without placement, the DA filters, especially those with parallel bits, don't fare well with the place and route
tools. I've got a design I'm putting the finishing touches on right now that has DA filters implemented in a 2V3000-4
(stepping 0). It has 30-bit coefficients, and in some places up to 40-bit arithmetic. I have no problem getting it
to run at a 160 MHz clock in the -4 part. The areas where I have had timing difficulties are in routing to and from the brams
that are used as delay queues, mostly because I was too lazy to place them and the placer does a lousy job placing
brams. Anyway, in this case, using multipliers would have required a bigger part. The multipliers in the stepping 0
devices can't be clocked at 160 MHz, plus due to the data widths I'd need to use four multiplies to complete each
multiplication. In this design, the DA approach was a clear winner.
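
To illustrate why wide operands cost four multiplies, here is a small Python model. It assumes both operands fit in 35-bit two's complement and that each hardware multiplier is 18x18 signed; the helper names are hypothetical. Each operand is split into a signed upper part and an unsigned 17-bit lower part, and the four partial products are shifted and summed.

```python
CHUNK = 17  # unsigned bits kept in the low half of each operand

def mult18(a, b):
    """Model of one 18x18 signed multiplier; rejects operands that do not fit."""
    for v in (a, b):
        if not -(1 << 17) <= v < (1 << 17):
            raise ValueError("operand does not fit in 18-bit two's complement")
    return a * b

def split(v):
    """Split v into (signed high part, unsigned 17-bit low part)."""
    lo = v & ((1 << CHUNK) - 1)
    hi = v >> CHUNK  # arithmetic shift, so the sign stays in the high part
    return hi, lo

def wide_multiply(a, b):
    """Multiply two values of up to 35 bits using four 18x18 products."""
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    return (
        (mult18(a_hi, b_hi) << (2 * CHUNK))
        + ((mult18(a_hi, b_lo) + mult18(a_lo, b_hi)) << CHUNK)
        + mult18(a_lo, b_lo)
    )

if __name__ == "__main__":
    a = -(1 << 29)      # most negative 30-bit value
    b = (1 << 29) - 1   # most positive 30-bit value
    print(wide_multiply(a, b) == a * b)
```

The low halves are limited to 17 bits because they are treated as non-negative values going into a signed 18-bit port. In hardware the three extra adds and the shifts also take fabric and pipeline stages, on top of the four multipliers.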

There is a data ordering quirk with the decimating MAC filter. You have a similar quirk with a DA filter if you are
sending multiple channels through the filter. Nothing a bit of ingenuity won't fix.

--
--Ray Andraka, P.E.
 
This may be sheer laziness, but how do I find out the stepping of a
V2?
I had worked out a design where I ran a MAC-based filter at 130 MHz, I
hope my V2's multipliers (-5 grade) can take it. I'm not looking at
very large precision, so I won't need multiple passes through the
multipliers.
-Jim.

 
Jim George wrote:

I had worked out a design where I ran a MAC-based filter at 130 MHz, I
hope my V2's multipliers (-5 grade) can take it.

You can do well over 180 MHz with the multipliers (registered version) in
a -5. You have to lay out the inputs and outputs precisely and force very specific
routing via timing constraints. Check out XAPP636:
http://direct.xilinx.com/bvdocs/appnotes/xapp636.pdf


--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Martin Euredjian

To send private email:
0_0_0_0_@pacbell.net
where
"0_0_0_0_" = "martineu"
 
The stepping number is a function of the part number; as I recall, there is a decode sheet somewhere on the Xilinx website.
I think only the -4's have stepping 0, which equates to something like 130 MHz on a fully pipelined multiplier...if you
also add pipeline registers immediately before and after the multiplier and place them properly. I don't think it will be
an issue with the -5 parts.

--
--Ray Andraka, P.E.
 
