Compact FIR filters with multiplier blocks?

  • Thread starter Michael Spencer
  • Start date
M

Michael Spencer

Guest
Hello,

Has anyone compared FPGA implementations of full-rate digital FIR filters based on the use of Multiplier Blocks vs. traditional FIRs with constant coefficient multipliers? By full rate, I mean: one output result per clock cycle and no interpolation or decimation.

For anyone not familiar, a multiplier block is a network of shifters and adders that performs multiplications by several coefficients efficiently by exploiting common sub-expressions. The multiplier block can be exploited in FIR filters by transposing the standard filter so that the products of all the coefficients with the current input-sample are required simultaneously.

Also, by representing the coefficients in the Canonical-Signed-Digit number system (a small number of +1 and -1's) along common sub-expression sharing the multiplier block can get even smaller.

For example, the multiplier block for a 100 tap FIR filter (fp=0.10 and fs=0.12) can be realized with only 61 adds (zero explicit multiplications). See filter example #4 in "FIR Filter Synthesis Algorithms for Minimizing the Delay and the Number of Adders," http://ics.kaist.ac.kr/~dk/papers/TCAD2001.pdf

If the adder depth is constrained to a maximum of four, then the authors' algorithm can do the multiplier block in 69 additions.

It would seem that this approach would be very efficient in a target such as the Xilinx Spartan-IIE (with no dedicated multipliers).

Another question: If we only need one result per K clock periods (K ~= 1000 for audio applications), could a multiplier block approach realized with, say, bit-serial addition be more efficient than some other approach such as distributed arithmetic?

Comments welcome. Thanks.

-Michael

______________________
Michael E. Spencer, Ph.D.
President
Signal Processing Solutions, Inc.
Web: http://www.spsolutions.com
 
Ray,

I sent this to Michael via email and he suggested the group would be
interested also...

My PhD (now drawing to the end) has been on implementing full-parallel
Transpose FIR filters using multiplier blocks that you mention (I use
techniques/algorithms that exceed the efficiency of CSD in terms of FPGA
area).

The upshot of my work is that I have written a C++ program that will
generate RTL VHDL given the quantised filter coefficients, the type of
filter required (singlerate, interpolation, decimation etc.) and the
appropriate parameters (input width, signed/unsigned input, number of
channels, rate-change factor etc.)

The VHDL my program generates exceeds the functionality (at a lower
cost) of that provided by Xilinx's Distributed Arithmetic core and Altera's
FIR Compiler (also DA). In fact, my program allows interpolation and
decimation factors up to the number of filter coefficients and any number of
data channels (for interpolation/decimation filters also).

The main point is that, once synthesised and mapped to a specific FPGA, the
filters my program generates require far less FPGA area (slices/logic cells)
than those generated using Distributed Arithmetic. The critical path in my
filters is just the longest adder carry chain so very high speeds are
possible. E.g. 154MHz for a singlerate filter (25 bit output) in a Xilinx
xc2v3000-fg676-5 - obviously the speed will depend on the device
family/speed grade and the longest carry chain. The facility for multiple
channels in interpolation/decimation filters (not supported by Xilinx)
allows lower than full-parallel sampling rates to be efficiently processed
in one filter.

As Michael points out in his post, this technique would be very suitable for
a
Xilinx Spartan-IIE and indeed any FPGA - there are many cases where these
filters would be useful even on devices with dedicated multipliers (when
they are all in use for example! ;-) ).

You can find out more at http://www.dspec.org/rsg.asp - there are also
datasheets here that provide comparisons with Xilinx and Altera and
demonstrate the output of another application (written in java) that
generates schematic representations of the filters for use in reports,
meetings and thesises! :)

I hope this information is of use to you - please contact me if you have any
questions,

Thanks for your time,

Ken


--
To reply by email, please remove the _MENOWANTSPAM from my email address.


"Ray Andraka" <ray@andraka.com> wrote in message
news:3F54F936.5E694FD1@andraka.com...
The problem with the multiplier block approach is that the
construction is predicated on the specific coefficients. As
a result it is considerably harder to use for an arbitrary
set of coefficients. It may reduce area over a straight FIR
filter running at the same clocks per sample, but at a
considerable cost in design time and flexibility. You also
give up regularity in the structure, which may reduce the
overall performance. Essentially what the block multiplier
and distributed arithmetic approaches are is a rearrangement
of the bitwise product terms. The mutliplier block takes
advantage of duplicate terms by adding the inputs before
they are multiplied by the term.

Michael Spencer wrote:

Hello,

Has anyone compared FPGA implementations of full-rate
digital FIR filters based on the use of Multiplier Blocks
vs. traditional FIRs with constant coefficient
multipliers? By full rate, I mean: one output result per
clock cycle and no interpolation or decimation.

For anyone not familiar, a multiplier block is a network
of shifters and adders that performs multiplications by
several coefficients efficiently by exploiting common
sub-expressions. The multiplier block can be exploited in
FIR filters by transposing the standard filter so that the
products of all the coefficients with the current
input-sample are required simultaneously.

Also, by representing the coefficients in the
Canonical-Signed-Digit number system (a small number of
+1 and -1's) along common sub-expression sharing the
multiplier block can get even smaller.

For example, the multiplier block for a 100 tap FIR filter
(fp=0.10 and fs=0.12) can be realized with only 61 adds
(zero explicit multiplications). See filter example #4 in
"FIR Filter Synthesis Algorithms for Minimizing the Delay
and the Number of Adders,"
http://ics.kaist.ac.kr/~dk/papers/TCAD2001.pdf
If the adder depth is constrained to a maximum of four,
then the authors' algorithm can do the multiplier block in
69 additions.

It would seem that this approach would be very efficient
in a target such as the Xilinx Spartan-IIE (with no
dedicated multipliers).

Another question: If we only need one result per K clock
periods (K ~= 1000 for audio applications), could a
multiplier block approach realized with, say, bit-serial
addition be more efficient than some other approach such
as distributed arithmetic?

Comments welcome. Thanks.

-Michael
______________________
Michael E. Spencer, Ph.D.
President
Signal Processing Solutions, Inc.
Web: http://www.spsolutions.com

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin
Franklin, 1759
 
I agree the multiplier block style filters are more efficient area-wise. It
sounds like you have addressed the irregularity issues by using a program
to do the generation, which I think is pretty much a necessity. As I thought
I alluded to, the biggest problem with multiplier block filters is that the
layout/size is not a constant if you change the coefficients. This means that
the fiter coefficients have to be constant and known earlier in the design
cycle, and necessitates a rerun of synthesis, place and route for any filter
changes. Depending on the implementation, it may also mean a change in the
filter's pipeline latency. These factors can make them difficult to use on
some projects. The filters typically used in my projects often need to be
adjusted by the customer or late in the project to accommodate minor
requirements changes. I prefer to use a filter with reloadable coefficients
for that reason.



Ken wrote:

Ray,

I sent this to Michael via email and he suggested the group would be
interested also...

My PhD (now drawing to the end) has been on implementing full-parallel
Transpose FIR filters using multiplier blocks that you mention (I use
techniques/algorithms that exceed the efficiency of CSD in terms of FPGA
area).

The upshot of my work is that I have written a C++ program that will
generate RTL VHDL given the quantised filter coefficients, the type of
filter required (singlerate, interpolation, decimation etc.) and the
appropriate parameters (input width, signed/unsigned input, number of
channels, rate-change factor etc.)

The VHDL my program generates exceeds the functionality (at a lower
cost) of that provided by Xilinx's Distributed Arithmetic core and Altera's
FIR Compiler (also DA). In fact, my program allows interpolation and
decimation factors up to the number of filter coefficients and any number of
data channels (for interpolation/decimation filters also).

The main point is that, once synthesised and mapped to a specific FPGA, the
filters my program generates require far less FPGA area (slices/logic cells)
than those generated using Distributed Arithmetic. The critical path in my
filters is just the longest adder carry chain so very high speeds are
possible. E.g. 154MHz for a singlerate filter (25 bit output) in a Xilinx
xc2v3000-fg676-5 - obviously the speed will depend on the device
family/speed grade and the longest carry chain. The facility for multiple
channels in interpolation/decimation filters (not supported by Xilinx)
allows lower than full-parallel sampling rates to be efficiently processed
in one filter.

As Michael points out in his post, this technique would be very suitable for
a
Xilinx Spartan-IIE and indeed any FPGA - there are many cases where these
filters would be useful even on devices with dedicated multipliers (when
they are all in use for example! ;-) ).

You can find out more at http://www.dspec.org/rsg.asp - there are also
datasheets here that provide comparisons with Xilinx and Altera and
demonstrate the output of another application (written in java) that
generates schematic representations of the filters for use in reports,
meetings and thesises! :)

I hope this information is of use to you - please contact me if you have any
questions,

Thanks for your time,

Ken

--
To reply by email, please remove the _MENOWANTSPAM from my email address.

"Ray Andraka" <ray@andraka.com> wrote in message
news:3F54F936.5E694FD1@andraka.com...
The problem with the multiplier block approach is that the
construction is predicated on the specific coefficients. As
a result it is considerably harder to use for an arbitrary
set of coefficients. It may reduce area over a straight FIR
filter running at the same clocks per sample, but at a
considerable cost in design time and flexibility. You also
give up regularity in the structure, which may reduce the
overall performance. Essentially what the block multiplier
and distributed arithmetic approaches are is a rearrangement
of the bitwise product terms. The mutliplier block takes
advantage of duplicate terms by adding the inputs before
they are multiplied by the term.

Michael Spencer wrote:

Hello,

Has anyone compared FPGA implementations of full-rate
digital FIR filters based on the use of Multiplier Blocks
vs. traditional FIRs with constant coefficient
multipliers? By full rate, I mean: one output result per
clock cycle and no interpolation or decimation.

For anyone not familiar, a multiplier block is a network
of shifters and adders that performs multiplications by
several coefficients efficiently by exploiting common
sub-expressions. The multiplier block can be exploited in
FIR filters by transposing the standard filter so that the
products of all the coefficients with the current
input-sample are required simultaneously.

Also, by representing the coefficients in the
Canonical-Signed-Digit number system (a small number of
+1 and -1's) along common sub-expression sharing the
multiplier block can get even smaller.

For example, the multiplier block for a 100 tap FIR filter
(fp=0.10 and fs=0.12) can be realized with only 61 adds
(zero explicit multiplications). See filter example #4 in
"FIR Filter Synthesis Algorithms for Minimizing the Delay
and the Number of Adders,"
http://ics.kaist.ac.kr/~dk/papers/TCAD2001.pdf
If the adder depth is constrained to a maximum of four,
then the authors' algorithm can do the multiplier block in
69 additions.

It would seem that this approach would be very efficient
in a target such as the Xilinx Spartan-IIE (with no
dedicated multipliers).

Another question: If we only need one result per K clock
periods (K ~= 1000 for audio applications), could a
multiplier block approach realized with, say, bit-serial
addition be more efficient than some other approach such
as distributed arithmetic?

Comments welcome. Thanks.

-Michael
______________________
Michael E. Spencer, Ph.D.
President
Signal Processing Solutions, Inc.
Web: http://www.spsolutions.com

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin
Franklin, 1759
--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759
 
Ray,

Thanks for your quick response.

You raise some excellent points clearly derived from experience.

Hopefully my program will still be useful when filters can be fixed early on
in the project or when a synthesis/place and route run is an acceptable cost
for the area efficiency provided.

Also, for devices being used in consumer applications, perhaps the area
saved using a multiplier block filter would allow a smaller and cheaper
device within the family to be used - reducing production costs depending on
the number of units being produced.

Cheers,

Ken



"Ray Andraka" <ray@andraka.com> wrote in message
news:3F55E8A3.70936FBD@andraka.com...
I agree the multiplier block style filters are more efficient area-wise.
It
sounds like you have addressed the irregularity issues by using a program
to do the generation, which I think is pretty much a necessity. As I
thought
I alluded to, the biggest problem with multiplier block filters is that
the
layout/size is not a constant if you change the coefficients. This means
that
the fiter coefficients have to be constant and known earlier in the design
cycle, and necessitates a rerun of synthesis, place and route for any
filter
changes. Depending on the implementation, it may also mean a change in
the
filter's pipeline latency. These factors can make them difficult to use
on
some projects. The filters typically used in my projects often need to be
adjusted by the customer or late in the project to accommodate minor
requirements changes. I prefer to use a filter with reloadable
coefficients
for that reason.



Ken wrote:

Ray,

I sent this to Michael via email and he suggested the group would be
interested also...

My PhD (now drawing to the end) has been on implementing full-parallel
Transpose FIR filters using multiplier blocks that you mention (I use
techniques/algorithms that exceed the efficiency of CSD in terms of FPGA
area).

The upshot of my work is that I have written a C++ program that will
generate RTL VHDL given the quantised filter coefficients, the type of
filter required (singlerate, interpolation, decimation etc.) and the
appropriate parameters (input width, signed/unsigned input, number of
channels, rate-change factor etc.)

The VHDL my program generates exceeds the functionality (at a lower
cost) of that provided by Xilinx's Distributed Arithmetic core and
Altera's
FIR Compiler (also DA). In fact, my program allows interpolation and
decimation factors up to the number of filter coefficients and any
number of
data channels (for interpolation/decimation filters also).

The main point is that, once synthesised and mapped to a specific FPGA,
the
filters my program generates require far less FPGA area (slices/logic
cells)
than those generated using Distributed Arithmetic. The critical path in
my
filters is just the longest adder carry chain so very high speeds are
possible. E.g. 154MHz for a singlerate filter (25 bit output) in a
Xilinx
xc2v3000-fg676-5 - obviously the speed will depend on the device
family/speed grade and the longest carry chain. The facility for
multiple
channels in interpolation/decimation filters (not supported by Xilinx)
allows lower than full-parallel sampling rates to be efficiently
processed
in one filter.

As Michael points out in his post, this technique would be very suitable
for
a
Xilinx Spartan-IIE and indeed any FPGA - there are many cases where
these
filters would be useful even on devices with dedicated multipliers (when
they are all in use for example! ;-) ).

You can find out more at http://www.dspec.org/rsg.asp - there are also
datasheets here that provide comparisons with Xilinx and Altera and
demonstrate the output of another application (written in java) that
generates schematic representations of the filters for use in reports,
meetings and thesises! :)

I hope this information is of use to you - please contact me if you have
any
questions,

Thanks for your time,

Ken

--
To reply by email, please remove the _MENOWANTSPAM from my email
address.

"Ray Andraka" <ray@andraka.com> wrote in message
news:3F54F936.5E694FD1@andraka.com...
The problem with the multiplier block approach is that the
construction is predicated on the specific coefficients. As
a result it is considerably harder to use for an arbitrary
set of coefficients. It may reduce area over a straight FIR
filter running at the same clocks per sample, but at a
considerable cost in design time and flexibility. You also
give up regularity in the structure, which may reduce the
overall performance. Essentially what the block multiplier
and distributed arithmetic approaches are is a rearrangement
of the bitwise product terms. The mutliplier block takes
advantage of duplicate terms by adding the inputs before
they are multiplied by the term.

Michael Spencer wrote:

Hello,

Has anyone compared FPGA implementations of full-rate
digital FIR filters based on the use of Multiplier Blocks
vs. traditional FIRs with constant coefficient
multipliers? By full rate, I mean: one output result per
clock cycle and no interpolation or decimation.

For anyone not familiar, a multiplier block is a network
of shifters and adders that performs multiplications by
several coefficients efficiently by exploiting common
sub-expressions. The multiplier block can be exploited in
FIR filters by transposing the standard filter so that the
products of all the coefficients with the current
input-sample are required simultaneously.

Also, by representing the coefficients in the
Canonical-Signed-Digit number system (a small number of
+1 and -1's) along common sub-expression sharing the
multiplier block can get even smaller.

For example, the multiplier block for a 100 tap FIR filter
(fp=0.10 and fs=0.12) can be realized with only 61 adds
(zero explicit multiplications). See filter example #4 in
"FIR Filter Synthesis Algorithms for Minimizing the Delay
and the Number of Adders,"
http://ics.kaist.ac.kr/~dk/papers/TCAD2001.pdf
If the adder depth is constrained to a maximum of four,
then the authors' algorithm can do the multiplier block in
69 additions.

It would seem that this approach would be very efficient
in a target such as the Xilinx Spartan-IIE (with no
dedicated multipliers).

Another question: If we only need one result per K clock
periods (K ~= 1000 for audio applications), could a
multiplier block approach realized with, say, bit-serial
addition be more efficient than some other approach such
as distributed arithmetic?

Comments welcome. Thanks.

-Michael
______________________
Michael E. Spencer, Ph.D.
President
Signal Processing Solutions, Inc.
Web: http://www.spsolutions.com

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin
Franklin, 1759



--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759
 
I went around the irregularity issue by having sub-multiplier
block architecture that has have fixed interface to the routing
and have fixed (yet reasonable) area. Therefore, when the
coefficients are changed, no place and route is required and
the latency remains the same (unless you change the number taps).
The generation of coefficients can done at reconfiguration time
thanks to symmetry in the FPGA used (Atmel 40K40). Naturally,
there is the problem of hassling with run-time reconfiguration
and everything that comes with that...

As part of this work we looked also into common subexpression
sharing in that particular FPGA family and found it very unlikely
that benefits could be obtained with similar multiplier-block
architecture. This is mainly due the fact that it is different
story to be able to generate the most useful common subexpressions
that it is to really use them before the routing becomes congested.

http://www.doc.ic.ac.uk/~tpr/papers/rissa_FPT02.pdf

T.Rissa


Ray Andraka <ray@andraka.com> wrote:
I agree the multiplier block style filters are more efficient area-wise. It
sounds like you have addressed the irregularity issues by using a program
to do the generation, which I think is pretty much a necessity. As I thought
I alluded to, the biggest problem with multiplier block filters is that the
layout/size is not a constant if you change the coefficients. This means that
the fiter coefficients have to be constant and known earlier in the design
cycle, and necessitates a rerun of synthesis, place and route for any filter
changes. Depending on the implementation, it may also mean a change in the
filter's pipeline latency. These factors can make them difficult to use on
some projects. The filters typically used in my projects often need to be
adjusted by the customer or late in the project to accommodate minor
requirements changes. I prefer to use a filter with reloadable coefficients
for that reason.


Ken wrote:

Ray,

I sent this to Michael via email and he suggested the group would be
interested also...

My PhD (now drawing to the end) has been on implementing full-parallel
Transpose FIR filters using multiplier blocks that you mention (I use
techniques/algorithms that exceed the efficiency of CSD in terms of FPGA
area).

The upshot of my work is that I have written a C++ program that will
generate RTL VHDL given the quantised filter coefficients, the type of
filter required (singlerate, interpolation, decimation etc.) and the
appropriate parameters (input width, signed/unsigned input, number of
channels, rate-change factor etc.)

The VHDL my program generates exceeds the functionality (at a lower
cost) of that provided by Xilinx's Distributed Arithmetic core and Altera's
FIR Compiler (also DA). In fact, my program allows interpolation and
decimation factors up to the number of filter coefficients and any number of
data channels (for interpolation/decimation filters also).

The main point is that, once synthesised and mapped to a specific FPGA, the
filters my program generates require far less FPGA area (slices/logic cells)
than those generated using Distributed Arithmetic. The critical path in my
filters is just the longest adder carry chain so very high speeds are
possible. E.g. 154MHz for a singlerate filter (25 bit output) in a Xilinx
xc2v3000-fg676-5 - obviously the speed will depend on the device
family/speed grade and the longest carry chain. The facility for multiple
channels in interpolation/decimation filters (not supported by Xilinx)
allows lower than full-parallel sampling rates to be efficiently processed
in one filter.

As Michael points out in his post, this technique would be very suitable for
a
Xilinx Spartan-IIE and indeed any FPGA - there are many cases where these
filters would be useful even on devices with dedicated multipliers (when
they are all in use for example! ;-) ).

You can find out more at http://www.dspec.org/rsg.asp - there are also
datasheets here that provide comparisons with Xilinx and Altera and
demonstrate the output of another application (written in java) that
generates schematic representations of the filters for use in reports,
meetings and thesises! :)

I hope this information is of use to you - please contact me if you have any
questions,

Thanks for your time,

Ken

--
To reply by email, please remove the _MENOWANTSPAM from my email address.

"Ray Andraka" <ray@andraka.com> wrote in message
news:3F54F936.5E694FD1@andraka.com...
The problem with the multiplier block approach is that the
construction is predicated on the specific coefficients. As
a result it is considerably harder to use for an arbitrary
set of coefficients. It may reduce area over a straight FIR
filter running at the same clocks per sample, but at a
considerable cost in design time and flexibility. You also
give up regularity in the structure, which may reduce the
overall performance. Essentially what the block multiplier
and distributed arithmetic approaches are is a rearrangement
of the bitwise product terms. The mutliplier block takes
advantage of duplicate terms by adding the inputs before
they are multiplied by the term.

Michael Spencer wrote:

Hello,

Has anyone compared FPGA implementations of full-rate
digital FIR filters based on the use of Multiplier Blocks
vs. traditional FIRs with constant coefficient
multipliers? By full rate, I mean: one output result per
clock cycle and no interpolation or decimation.

For anyone not familiar, a multiplier block is a network
of shifters and adders that performs multiplications by
several coefficients efficiently by exploiting common
sub-expressions. The multiplier block can be exploited in
FIR filters by transposing the standard filter so that the
products of all the coefficients with the current
input-sample are required simultaneously.

Also, by representing the coefficients in the
Canonical-Signed-Digit number system (a small number of
+1 and -1's) along common sub-expression sharing the
multiplier block can get even smaller.

For example, the multiplier block for a 100 tap FIR filter
(fp=0.10 and fs=0.12) can be realized with only 61 adds
(zero explicit multiplications). See filter example #4 in
"FIR Filter Synthesis Algorithms for Minimizing the Delay
and the Number of Adders,"
http://ics.kaist.ac.kr/~dk/papers/TCAD2001.pdf
If the adder depth is constrained to a maximum of four,
then the authors' algorithm can do the multiplier block in
69 additions.

It would seem that this approach would be very efficient
in a target such as the Xilinx Spartan-IIE (with no
dedicated multipliers).

Another question: If we only need one result per K clock
periods (K ~= 1000 for audio applications), could a
multiplier block approach realized with, say, bit-serial
addition be more efficient than some other approach such
as distributed arithmetic?

Comments welcome. Thanks.

-Michael
______________________
Michael E. Spencer, Ph.D.
President
Signal Processing Solutions, Inc.
Web: http://www.spsolutions.com

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin
Franklin, 1759



--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759
 
Ken,
While the RSG solution may yield smaller designs for specific cases,
the Altera FIR Compiler gives you more flexibility in terms of
optimizing area vs.speed.

For instance, the numbers presented in the RSG datasheet is based on a
pipeline=2 setting for the Altera FIR Compiler. Using the FIR
Compiler, the design yields an fmax of 322MHz (single rate, single
channel). This is much higher than the 154MHz cited for the filter
using the RSG approach. This is the classic speed/area trade-off
scenario.

If indeed area is the critical factor, it is possible to reduce the
pipeline to 1 in the FIR Compiler. In the single rate cases, the
logic cell count comparison would show that the RSG approach would be
beneficial for the single and 2 channel FIR designs (58% and 80%
respectively). In the 4 and 8 channel FIR designs, the distributed
arithmetic approach employed by the Altera FIR Compiler yields better
area compared to the RSG generated filter (106% and 133%
respectively). Reducing the number of pipeline stage to 1 yields fmax
of 237MHz (single rate, single channel), still well beyond the
performance requirement in most cases. This, I believe, is a more
accurate
comparison for the RSG datasheet.

Regards,
HS

Tero Rissa <tpr@doc.ic.ac.uk> wrote in message news:<bjhr0m$7f7$1@harrier.doc.ic.ac.uk>...
I went around the irregularity issue by having sub-multiplier
block architecture that has have fixed interface to the routing
and have fixed (yet reasonable) area. Therefore, when the
coefficients are changed, no place and route is required and
the latency remains the same (unless you change the number taps).
The generation of coefficients can done at reconfiguration time
thanks to symmetry in the FPGA used (Atmel 40K40). Naturally,
there is the problem of hassling with run-time reconfiguration
and everything that comes with that...

As part of this work we looked also into common subexpression
sharing in that particular FPGA family and found it very unlikely
that benefits could be obtained with similar multiplier-block
architecture. This is mainly due the fact that it is different
story to be able to generate the most useful common subexpressions
that it is to really use them before the routing becomes congested.

http://www.doc.ic.ac.uk/~tpr/papers/rissa_FPT02.pdf

T.Rissa


Ray Andraka <ray@andraka.com> wrote:
I agree the multiplier block style filters are more efficient area-wise. It
sounds like you have addressed the irregularity issues by using a program
to do the generation, which I think is pretty much a necessity. As I thought
I alluded to, the biggest problem with multiplier block filters is that the
layout/size is not a constant if you change the coefficients. This means that
the fiter coefficients have to be constant and known earlier in the design
cycle, and necessitates a rerun of synthesis, place and route for any filter
changes. Depending on the implementation, it may also mean a change in the
filter's pipeline latency. These factors can make them difficult to use on
some projects. The filters typically used in my projects often need to be
adjusted by the customer or late in the project to accommodate minor
requirements changes. I prefer to use a filter with reloadable coefficients
for that reason.



Ken wrote:

Ray,

I sent this to Michael via email and he suggested the group would be
interested also...

My PhD (now drawing to the end) has been on implementing full-parallel
Transpose FIR filters using multiplier blocks that you mention (I use
techniques/algorithms that exceed the efficiency of CSD in terms of FPGA
area).

The upshot of my work is that I have written a C++ program that will
generate RTL VHDL given the quantised filter coefficients, the type of
filter required (singlerate, interpolation, decimation etc.) and the
appropriate parameters (input width, signed/unsigned input, number of
channels, rate-change factor etc.)

The VHDL my program generates exceeds the functionality (at a lower
cost) of that provided by Xilinx's Distributed Arithmetic core and Altera's
FIR Compiler (also DA). In fact, my program allows interpolation and
decimation factors up to the number of filter coefficients and any number of
data channels (for interpolation/decimation filters also).

The main point is that, once synthesised and mapped to a specific FPGA, the
filters my program generates require far less FPGA area (slices/logic cells)
than those generated using Distributed Arithmetic. The critical path in my
filters is just the longest adder carry chain so very high speeds are
possible. E.g. 154MHz for a singlerate filter (25 bit output) in a Xilinx
xc2v3000-fg676-5 - obviously the speed will depend on the device
family/speed grade and the longest carry chain. The facility for multiple
channels in interpolation/decimation filters (not supported by Xilinx)
allows lower than full-parallel sampling rates to be efficiently processed
in one filter.

As Michael points out in his post, this technique would be very suitable for
a
Xilinx Spartan-IIE and indeed any FPGA - there are many cases where these
filters would be useful even on devices with dedicated multipliers (when
they are all in use for example! ;-) ).

You can find out more at http://www.dspec.org/rsg.asp - there are also
datasheets here that provide comparisons with Xilinx and Altera and
demonstrate the output of another application (written in java) that
generates schematic representations of the filters for use in reports,
meetings and thesises! :)

I hope this information is of use to you - please contact me if you have any
questions,

Thanks for your time,

Ken

--
To reply by email, please remove the _MENOWANTSPAM from my email address.

"Ray Andraka" <ray@andraka.com> wrote in message
news:3F54F936.5E694FD1@andraka.com...
The problem with the multiplier block approach is that the
construction is predicated on the specific coefficients. As
a result it is considerably harder to use for an arbitrary
set of coefficients. It may reduce area over a straight FIR
filter running at the same clocks per sample, but at a
considerable cost in design time and flexibility. You also
give up regularity in the structure, which may reduce the
overall performance. Essentially what the block multiplier
and distributed arithmetic approaches are is a rearrangement
of the bitwise product terms. The mutliplier block takes
advantage of duplicate terms by adding the inputs before
they are multiplied by the term.

Michael Spencer wrote:

Hello,

Has anyone compared FPGA implementations of full-rate
digital FIR filters based on the use of Multiplier Blocks
vs. traditional FIRs with constant coefficient
multipliers? By full rate, I mean: one output result per
clock cycle and no interpolation or decimation.

For anyone not familiar, a multiplier block is a network
of shifters and adders that performs multiplications by
several coefficients efficiently by exploiting common
sub-expressions. The multiplier block can be exploited in
FIR filters by transposing the standard filter so that the
products of all the coefficients with the current
input-sample are required simultaneously.

Also, by representing the coefficients in the
Canonical-Signed-Digit number system (a small number of
+1 and -1's) along common sub-expression sharing the
multiplier block can get even smaller.

For example, the multiplier block for a 100 tap FIR filter
(fp=0.10 and fs=0.12) can be realized with only 61 adds
(zero explicit multiplications). See filter example #4 in
"FIR Filter Synthesis Algorithms for Minimizing the Delay
and the Number of Adders,"
http://ics.kaist.ac.kr/~dk/papers/TCAD2001.pdf
If the adder depth is constrained to a maximum of four,
then the authors' algorithm can do the multiplier block in
69 additions.

It would seem that this approach would be very efficient
in a target such as the Xilinx Spartan-IIE (with no
dedicated multipliers).

Another question: If we only need one result per K clock
periods (K ~= 1000 for audio applications), could a
multiplier block approach realized with, say, bit-serial
addition be more efficient than some other approach such
as distributed arithmetic?

Comments welcome. Thanks.

-Michael
______________________
Michael E. Spencer, Ph.D.
President
Signal Processing Solutions, Inc.
Web: http://www.spsolutions.com

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin
Franklin, 1759



--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759
 
Hong,

Firstly, my apologies for the delay in replying.

I have inserted my responses to your points below:


"Hong Shan Neoh" <hsneoh@netscape.net> wrote in message
news:2ff2f33d.0309111656.12ef44cd@posting.google.com...
Ken,
While the RSG solution may yield smaller designs for specific cases,
the Altera FIR Compiler gives you more flexibility in terms of
optimizing area vs.speed.
Agreed - RSG only produces full-parallel filters. If multiple clocks per
output can be used (depending on data rates and avaible clock/power
consumption requirements of course) then I would be the first to say use
Distributed Arithmetic (DA) in multi-cycle mode.

For instance, the numbers presented in the RSG datasheet is based on a
pipeline=2 setting for the Altera FIR Compiler. Using the FIR
Compiler, the design yields an fmax of 322MHz (single rate, single
channel). This is much higher than the 154MHz cited for the filter
using the RSG approach. This is the classic speed/area trade-off
scenario.
The figure of 154MHz is not in the datasheet you are referring to (available
from http://www.dspec.org/rsg/downloads/datasheets/RSG_Overview.pdf) - I do
not know where you got this number from (I seem to remember it being in an
old version [which did not include Altera results] but I removed it because
the 154MHz was Xilinx specific for a particular filter on a particular
device and has no bearing at all to Altera devices/filter implementations).

The test filters I did came in at various speeds between [244-283MHz] for
RSG and [217-293MHZ] for Altera FIR Compiler (on exactly the same device and
with the same constraints). This is why I used pipeline level 2 for fir
compiler because pipeline level 1 made the fir compiler filters too slow and
pipeline level 3 made the fir compiler filters ridiculously large and not
that much faster.

Hence, pipeline level 2 does give a fair comparison with RSG.

For pipeline level 1, the FIR compiler clock rates were in most cases slower
than for RSG and the FIR compiler implementations were also larger. The
8-channel singlerate FIR was the only one that was the same size but the RSG
filter ran at 251MHz and FIR Comp at 206MHz.

Cheers,

Ken


Regards,
HS

Tero Rissa <tpr@doc.ic.ac.uk> wrote in message
news:<bjhr0m$7f7$1@harrier.doc.ic.ac.uk>...
I went around the irregularity issue by having sub-multiplier
block architecture that has have fixed interface to the routing
and have fixed (yet reasonable) area. Therefore, when the
coefficients are changed, no place and route is required and
the latency remains the same (unless you change the number taps).
The generation of coefficients can done at reconfiguration time
thanks to symmetry in the FPGA used (Atmel 40K40). Naturally,
there is the problem of hassling with run-time reconfiguration
and everything that comes with that...

As part of this work we looked also into common subexpression
sharing in that particular FPGA family and found it very unlikely
that benefits could be obtained with similar multiplier-block
architecture. This is mainly due the fact that it is different
story to be able to generate the most useful common subexpressions
that it is to really use them before the routing becomes congested.

http://www.doc.ic.ac.uk/~tpr/papers/rissa_FPT02.pdf

T.Rissa


Ray Andraka <ray@andraka.com> wrote:
I agree the multiplier block style filters are more efficient
area-wise. It
sounds like you have addressed the irregularity issues by using a
program
to do the generation, which I think is pretty much a necessity. As I
thought
I alluded to, the biggest problem with multiplier block filters is
that the
layout/size is not a constant if you change the coefficients. This
means that
the fiter coefficients have to be constant and known earlier in the
design
cycle, and necessitates a rerun of synthesis, place and route for any
filter
changes. Depending on the implementation, it may also mean a change
in the
filter's pipeline latency. These factors can make them difficult to
use on
some projects. The filters typically used in my projects often need
to be
adjusted by the customer or late in the project to accommodate minor
requirements changes. I prefer to use a filter with reloadable
coefficients
for that reason.



Ken wrote:

Ray,

I sent this to Michael via email and he suggested the group would be
interested also...

My PhD (now drawing to the end) has been on implementing
full-parallel
Transpose FIR filters using multiplier blocks that you mention (I use
techniques/algorithms that exceed the efficiency of CSD in terms of
FPGA
area).

The upshot of my work is that I have written a C++ program that will
generate RTL VHDL given the quantised filter coefficients, the type
of
filter required (singlerate, interpolation, decimation etc.) and the
appropriate parameters (input width, signed/unsigned input, number of
channels, rate-change factor etc.)

The VHDL my program generates exceeds the functionality (at a lower
cost) of that provided by Xilinx's Distributed Arithmetic core and
Altera's
FIR Compiler (also DA). In fact, my program allows interpolation and
decimation factors up to the number of filter coefficients and any
number of
data channels (for interpolation/decimation filters also).

The main point is that, once synthesised and mapped to a specific
FPGA, the
filters my program generates require far less FPGA area (slices/logic
cells)
than those generated using Distributed Arithmetic. The critical path
in my
filters is just the longest adder carry chain so very high speeds are
possible. E.g. 154MHz for a singlerate filter (25 bit output) in a
Xilinx
xc2v3000-fg676-5 - obviously the speed will depend on the device
family/speed grade and the longest carry chain. The facility for
multiple
channels in interpolation/decimation filters (not supported by
Xilinx)
allows lower than full-parallel sampling rates to be efficiently
processed
in one filter.

As Michael points out in his post, this technique would be very
suitable for
a
Xilinx Spartan-IIE and indeed any FPGA - there are many cases where
these
filters would be useful even on devices with dedicated multipliers
(when
they are all in use for example! ;-) ).

You can find out more at http://www.dspec.org/rsg.asp - there are
also
datasheets here that provide comparisons with Xilinx and Altera and
demonstrate the output of another application (written in java) that
generates schematic representations of the filters for use in
reports,
meetings and thesises! :)

I hope this information is of use to you - please contact me if you
have any
questions,

Thanks for your time,

Ken

--
To reply by email, please remove the _MENOWANTSPAM from my email
address.

"Ray Andraka" <ray@andraka.com> wrote in message
news:3F54F936.5E694FD1@andraka.com...
The problem with the multiplier block approach is that the
construction is predicated on the specific coefficients. As
a result it is considerably harder to use for an arbitrary
set of coefficients. It may reduce area over a straight FIR
filter running at the same clocks per sample, but at a
considerable cost in design time and flexibility. You also
give up regularity in the structure, which may reduce the
overall performance. Essentially what the block multiplier
and distributed arithmetic approaches are is a rearrangement
of the bitwise product terms. The mutliplier block takes
advantage of duplicate terms by adding the inputs before
they are multiplied by the term.

Michael Spencer wrote:

Hello,

Has anyone compared FPGA implementations of full-rate
digital FIR filters based on the use of Multiplier Blocks
vs. traditional FIRs with constant coefficient
multipliers? By full rate, I mean: one output result per
clock cycle and no interpolation or decimation.

For anyone not familiar, a multiplier block is a network
of shifters and adders that performs multiplications by
several coefficients efficiently by exploiting common
sub-expressions. The multiplier block can be exploited in
FIR filters by transposing the standard filter so that the
products of all the coefficients with the current
input-sample are required simultaneously.

Also, by representing the coefficients in the
Canonical-Signed-Digit number system (a small number of
+1 and -1's) along common sub-expression sharing the
multiplier block can get even smaller.

For example, the multiplier block for a 100 tap FIR filter
(fp=0.10 and fs=0.12) can be realized with only 61 adds
(zero explicit multiplications). See filter example #4 in
"FIR Filter Synthesis Algorithms for Minimizing the Delay
and the Number of Adders,"
http://ics.kaist.ac.kr/~dk/papers/TCAD2001.pdf
If the adder depth is constrained to a maximum of four,
then the authors' algorithm can do the multiplier block in
69 additions.

It would seem that this approach would be very efficient
in a target such as the Xilinx Spartan-IIE (with no
dedicated multipliers).

Another question: If we only need one result per K clock
periods (K ~= 1000 for audio applications), could a
multiplier block approach realized with, say, bit-serial
addition be more efficient than some other approach such
as distributed arithmetic?

Comments welcome. Thanks.

-Michael
______________________
Michael E. Spencer, Ph.D.
President
Signal Processing Solutions, Inc.
Web: http://www.spsolutions.com

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin
Franklin, 1759



--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759
 
I went around the irregularity issue by having sub-multiplier
block architecture that has have fixed interface to the routing
and have fixed (yet reasonable) area. Therefore, when the
coefficients are changed, no place and route is required and
the latency remains the same (unless you change the number taps).
The generation of coefficients can done at reconfiguration time
thanks to symmetry in the FPGA used (Atmel 40K40). Naturally,
there is the problem of hassling with run-time reconfiguration
and everything that comes with that...

As part of this work we looked also into common subexpression
sharing in that particular FPGA family and found it very unlikely
that benefits could be obtained with similar multiplier-block
architecture. This is mainly due the fact that it is different
story to be able to generate the most useful common subexpressions
that it is to really use them before the routing becomes congested.

http://www.doc.ic.ac.uk/~tpr/papers/rissa_FPT02.pdf

T.Rissa

tpr at doc ic ac uk

Ray Andraka <ray@andraka.com> wrote:
I agree the multiplier block style filters are more efficient area-wise. It
sounds like you have addressed the irregularity issues by using a program
to do the generation, which I think is pretty much a necessity. As I thought
I alluded to, the biggest problem with multiplier block filters is that the
layout/size is not a constant if you change the coefficients. This means that
the fiter coefficients have to be constant and known earlier in the design
cycle, and necessitates a rerun of synthesis, place and route for any filter
changes. Depending on the implementation, it may also mean a change in the
filter's pipeline latency. These factors can make them difficult to use on
some projects. The filters typically used in my projects often need to be
adjusted by the customer or late in the project to accommodate minor
requirements changes. I prefer to use a filter with reloadable coefficients
for that reason.


Ken wrote:

Ray,

I sent this to Michael via email and he suggested the group would be
interested also...

My PhD (now drawing to the end) has been on implementing full-parallel
Transpose FIR filters using multiplier blocks that you mention (I use
techniques/algorithms that exceed the efficiency of CSD in terms of FPGA
area).

The upshot of my work is that I have written a C++ program that will
generate RTL VHDL given the quantised filter coefficients, the type of
filter required (singlerate, interpolation, decimation etc.) and the
appropriate parameters (input width, signed/unsigned input, number of
channels, rate-change factor etc.)

The VHDL my program generates exceeds the functionality (at a lower
cost) of that provided by Xilinx's Distributed Arithmetic core and Altera's
FIR Compiler (also DA). In fact, my program allows interpolation and
decimation factors up to the number of filter coefficients and any number of
data channels (for interpolation/decimation filters also).

The main point is that, once synthesised and mapped to a specific FPGA, the
filters my program generates require far less FPGA area (slices/logic cells)
than those generated using Distributed Arithmetic. The critical path in my
filters is just the longest adder carry chain so very high speeds are
possible. E.g. 154MHz for a singlerate filter (25 bit output) in a Xilinx
xc2v3000-fg676-5 - obviously the speed will depend on the device
family/speed grade and the longest carry chain. The facility for multiple
channels in interpolation/decimation filters (not supported by Xilinx)
allows lower than full-parallel sampling rates to be efficiently processed
in one filter.

As Michael points out in his post, this technique would be very suitable for
a
Xilinx Spartan-IIE and indeed any FPGA - there are many cases where these
filters would be useful even on devices with dedicated multipliers (when
they are all in use for example! ;-) ).

You can find out more at http://www.dspec.org/rsg.asp - there are also
datasheets here that provide comparisons with Xilinx and Altera and
demonstrate the output of another application (written in java) that
generates schematic representations of the filters for use in reports,
meetings and thesises! :)

I hope this information is of use to you - please contact me if you have any
questions,

Thanks for your time,

Ken

--
To reply by email, please remove the _MENOWANTSPAM from my email address.

"Ray Andraka" <ray@andraka.com> wrote in message
news:3F54F936.5E694FD1@andraka.com...
The problem with the multiplier block approach is that the
construction is predicated on the specific coefficients. As
a result it is considerably harder to use for an arbitrary
set of coefficients. It may reduce area over a straight FIR
filter running at the same clocks per sample, but at a
considerable cost in design time and flexibility. You also
give up regularity in the structure, which may reduce the
overall performance. Essentially what the block multiplier
and distributed arithmetic approaches are is a rearrangement
of the bitwise product terms. The mutliplier block takes
advantage of duplicate terms by adding the inputs before
they are multiplied by the term.

Michael Spencer wrote:

Hello,

Has anyone compared FPGA implementations of full-rate
digital FIR filters based on the use of Multiplier Blocks
vs. traditional FIRs with constant coefficient
multipliers? By full rate, I mean: one output result per
clock cycle and no interpolation or decimation.

For anyone not familiar, a multiplier block is a network
of shifters and adders that performs multiplications by
several coefficients efficiently by exploiting common
sub-expressions. The multiplier block can be exploited in
FIR filters by transposing the standard filter so that the
products of all the coefficients with the current
input-sample are required simultaneously.

Also, by representing the coefficients in the
Canonical-Signed-Digit number system (a small number of
+1 and -1's) along common sub-expression sharing the
multiplier block can get even smaller.

For example, the multiplier block for a 100 tap FIR filter
(fp=0.10 and fs=0.12) can be realized with only 61 adds
(zero explicit multiplications). See filter example #4 in
"FIR Filter Synthesis Algorithms for Minimizing the Delay
and the Number of Adders,"
http://ics.kaist.ac.kr/~dk/papers/TCAD2001.pdf
If the adder depth is constrained to a maximum of four,
then the authors' algorithm can do the multiplier block in
69 additions.

It would seem that this approach would be very efficient
in a target such as the Xilinx Spartan-IIE (with no
dedicated multipliers).

Another question: If we only need one result per K clock
periods (K ~= 1000 for audio applications), could a
multiplier block approach realized with, say, bit-serial
addition be more efficient than some other approach such
as distributed arithmetic?

Comments welcome. Thanks.

-Michael
______________________
Michael E. Spencer, Ph.D.
President
Signal Processing Solutions, Inc.
Web: http://www.spsolutions.com

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin
Franklin, 1759



--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759
 

Welcome to EDABoard.com

Sponsor

Back
Top