FPGA acceleration vs. GPU acceleration

vcar · Sep 14, 2011

I used to be an FPGA engineer, and I think FPGA-based high-performance
computing has a bright future. However, through my recent projects I
have found that a GPU is often more appropriate when acceleration is
needed.

For embedded systems, the FPGA co-processing option is:
Intel E6x5C

and the GPU co-processing option is:
AMD APU (with OpenCL support)

For desktop systems, the FPGA co-processing option is:
a fully custom design, most likely based on a PCIe fabric

and the GPU co-processing option is:
NVIDIA CUDA (with basic OpenCV support)

If I choose FPGA co-processing, the algorithm has to be specifically
optimized and the R&D time will be considerable. If I choose the GPU
option, migrating the algorithm takes little time (even if the
original is Matlab code), and the speedup is also quite good.

In conclusion, FPGA acceleration only suits certain fixed
applications. In the real world, however, many projects and
algorithms are very uncertain and arbitrary. For the same power
consumption, the GPU option may give better results. For a concrete
project, I would consider a GPU or DSP first, and an FPGA last.

Does everybody agree?

glen herrmannsfeldt · Sep 14, 2011

vcar <hitsx@163.com> wrote:

(snip)

As a conclusion, the FPGA acceleration only suits some certain and
fixed application. However in the real world , many projects and many
algorithms are very uncertain and arbitrary. With same power
consumption, GPU plan may lead better results. For a concrete
project, I will consider GPU or DSP, and FPGA at last.

There have been companies over the years selling FPGA-based
accelerator hardware, but none have done very well.

GPU acceleration takes advantage of the economy of scale for
graphical uses.

FPGAs have not been very good for floating point, especially for
floating point addition.

Some fixed-point algorithms, such as dynamic programming and
convolution of very large data sets, can possibly take advantage
of FPGA technology. One problem that I know of requires 5e19
fixed-point adds per day, or about 6e14 per second.
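
As a quick check of that arithmetic (a sketch only; the variable names are illustrative and not from the post), dividing the daily figure by the number of seconds in a day gives roughly the quoted per-second rate:

```python
# Sanity check of the rate quoted above: 5e19 fixed-point adds per day.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

adds_per_day = 5e19
adds_per_second = adds_per_day / SECONDS_PER_DAY

# Roughly 5.8e14, which rounds to the "about 6e14" in the post.
print(f"{adds_per_second:.2e} adds per second")
```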

Do everybody agree?

I agree.

-- glen

RCIngham · Sep 14, 2011

If what you need is a computation off-load engine for a standard CPU, with
that CPU handling all the I/O tasks, then using a GPU would probably be the
most appropriate implementation methodology.

However, the phrase "horses for courses" always applies.

---------------------------------------
Posted through http://www.FPGARelated.com

fpga_me · Sep 15, 2011

If I choose FPGA co-processing, the algorithm will be specifically
optimized and R&D time will be very noticeable. If I choose GPU plan,
algorithm migration will cost little time(even the original one is
Matlab code), and the acceleration performance will also be quite
well.

I like the way you are looking at this issue, that is, from a
cost-efficiency point of view.

As a conclusion, the FPGA acceleration only suits some certain and
fixed application. However in the real world , many projects and many
algorithms are very uncertain and arbitrary. With same power
consumption, GPU plan may lead better results. For a concrete
project, I will consider GPU or DSP, and FPGA at last.

Again, here it boils down to cost efficiency and how you choose to measure
it. Can we devote the time to optimize the algorithm for the FPGA? Can we
afford to make it massively parallel to reduce latency? Does it even matter
if the latency is worse than the GPU? Is the cost performance ratio better
than that of the GPU? And so on...

The way you use "suit" and "certain" is extremely subjective. If, as the
system architect, I see that my system is not optimized for latency, for
example, it may still be acceptable to use, depending on system
requirements and cost efficiency. Rarely does one get a system spec that
states "Run as fast as possible."

Do everybody agree?

Cost efficiency rules here. How can you measure it? Power, latency, area,
throughput, NRE, etc. Simply saying that FPGAs can't implement a function
as well as a GPU because of fabric differences is not the best way to say
"it is worse than a GPU." In short, the system specs and ultimately the
cost efficiency say use the GPU or FPGA or CPU...and they will tell you
which is better for *your* application.


Paul Colin Gloster · Oct 4, 2011

Someone sent on September 13th, 2011:
|---------------------------------------------------------------------|
|"[..] |
| |
|As a conclusion, the FPGA acceleration only suits some certain and |
|fixed application. However in the real world , many projects and many|
|algorithms are very uncertain and arbitrary. With same power |
|consumption, GPU plan may lead better results. For a concrete |
|project, I will consider GPU or DSP, and FPGA at last. |
| |
|Do everybody agree?" |
|---------------------------------------------------------------------|

GPUs can outperform CPUs, but CPUs can outperform GPUs. It depends.

Tim Wescott · Oct 4, 2011

On Tue, 04 Oct 2011 18:13:50 +0100, Paul Colin Gloster wrote:

Someone sent on September 13th, 2011:
|---------------------------------------------------------------------|
|"[..] |
| |
|As a conclusion, the FPGA acceleration only suits some certain and |
|fixed application. However in the real world , many projects and many|
|algorithms are very uncertain and arbitrary. With same power |
|consumption, GPU plan may lead better results. For a concrete |
|project, I will consider GPU or DSP, and FPGA at last. |
| |
|Do everybody agree?" |
|---------------------------------------------------------------------|

GPUs can outperform CPUs, but CPUs can outperform GPUs. It depends.

It depends on the application -- a lot, and the company a little.

I've done a lot of work around (and sometimes even on) a system that does
a lot of per-pixel video processing. The actual algorithm is quite
simple, but it needs to happen at video pixel rates, and the power
dissipation needs to be low. For that app, an FPGA doing the pixel-level
work made lots of sense.

For the version that I worked on, having a processor working hand-in-hand
with the FPGA handling management tasks at the video line rate also made
oodles of technical sense -- but ran afoul of some company political
decisions (mostly a decision to maintain the illusion that a software guy
who could handle "big box" GUI and communications interface stuff was the
right guy to work on software that implemented a PLL at the video line
rate).

For decisions that are even close to even-steven, being able to hire and
manage a crew that can do the work becomes an important part of the mix
-- which means that if you're trying to do this sort of thing in an
all-software company, a GPU solution may make oodles more sense than an FPGA
solution, even if the FPGA solution is technically better. Similarly, if
the hard part of the algorithm needs to have a lot of interaction with
the hardware, and if management is composed of circuit designers, then an
FPGA solution may be a better choice even if the better technical
solution would have been to use a GPU.

--
www.wescottdesign.com

glen herrmannsfeldt · Oct 4, 2011

Tim Wescott <tim@seemywebsite.com> wrote:

On Tue, 04 Oct 2011 18:13:50 +0100, Paul Colin Gloster wrote:
Someone sent on September 13th, 2011:
|
|As a conclusion, the FPGA acceleration only suits some certain and |
|fixed application. However in the real world , many projects and many|
|algorithms are very uncertain and arbitrary.

GPUs can outperform CPUs, but CPUs can outperform GPUs. It depends.

It depends on the application -- a lot, and the company a little.

I've done a lot of work around (and sometimes even on) a system that does
a lot of per-pixel video processing. The actual algorithm is quite
simple, but it needs to happen at video pixel rates, and the power
dissipation needs to be low. For that app, an FPGA doing the pixel-level
work made lots of sense.

FPGAs work best when you need to do a huge number of small fixed
point operations, especially add/subtract/compare and some, but
not a huge number, of multiplies and divides. The shifter needed
for floating point addition and subtraction is big, and limits the
use of FPGA for floating point work.
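
To illustrate where that shifter comes from, here is a deliberately simplified sketch (not any vendor's floating-point core). It assumes a toy format of non-negative values `mant * 2**exp`; the function names and the `width` parameter are hypothetical. The fixed-point add is a single integer addition, while the floating-point add first needs a data-dependent alignment shift and then a renormalization step:

```python
def fixed_add(a: int, b: int) -> int:
    """Fixed-point add: both operands share one scale, so an integer adder suffices."""
    return a + b


def float_add(mant_a: int, exp_a: int, mant_b: int, exp_b: int, width: int = 24):
    """Toy add of two values mant * 2**exp with non-negative `width`-bit mantissas.

    Simplified on purpose: no signs, rounding modes, subnormals, or special values.
    """
    # Step 1: alignment. Make operand A the one with the larger exponent, then
    # shift B's mantissa right by the exponent difference. The shift amount
    # depends on the data, which is what calls for a variable shifter.
    if exp_a < exp_b:
        mant_a, exp_a, mant_b, exp_b = mant_b, exp_b, mant_a, exp_a
    mant_b >>= exp_a - exp_b

    # Step 2: the integer addition itself, the only step fixed point needs.
    mant = mant_a + mant_b
    exp = exp_a

    # Step 3: renormalize if the sum no longer fits in `width` bits
    # (low-order bits are simply truncated in this sketch).
    while mant >= (1 << width):
        mant >>= 1
        exp += 1
    return mant, exp
```

For example, with a 4-bit mantissa, `float_add(8, 0, 8, -3, width=4)` aligns the second operand by three bit positions before adding. Subtraction in a real design additionally needs a normalizing left shift after cancellation, which this sketch leaves out.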

GPUs traditionally are designed to do a lot of single precision
floating point. The use of the GPU for numerical processing takes
advantage of the economy of scale of building them for display use.

I have heard that there is discussion toward building GPUs to do
double precision, just for this purpose, though.

For the version that I worked on, having a processor working hand-in-hand
with the FPGA handling management tasks at the video line rate also made
oodles of technical sense -- but ran afoul of some company political
decisions (mostly a decision to maintain the illusion that a software guy
who could handle "big box" GUI and communications interface stuff was the
right guy to work on software that implemented a PLL at the video line
rate).

For decisions that are even close to even-steven, being able to hire and
manage a crew that can do the work becomes an important part of the mix
-- which means that if you're trying to do this sort of thing in an all-
software company, a GPU solution may make oodles more sense than an FPGA
solution, even if the FPGA solution is technically better. Similarly, if
the hard part of the algorithm needs to have a lot of interaction with
the hardware, and if management is composed of circuit designers, then an
FPGA solution may be a better choice even if the better technical
solution would have been to use a GPU.

-- glen

Dr. Beau Webber · Apr 28, 2012

On Wednesday, 14 September 2011 04:50:39 UTC+1, vcar wrote:

If I choose FPGA co-processing, the algorithm will be specifically
optimized and R&D time will be very noticeable. If I choose GPU plan,
algorithm migration will cost little time(even the original one is
Matlab code), and the acceleration performance will also be quite
well.

As a conclusion, the FPGA acceleration only suits some certain and
fixed application. However in the real world , many projects and many
algorithms are very uncertain and arbitrary. With same power
consumption, GPU plan may lead better results. For a concrete
project, I will consider GPU or DSP, and FPGA at last.

Do everybody agree?

As an exercise I have recently written a USB-interfaced, FPGA-based accelerator for a simple scientific algorithm (binning). The original code was written in APL, and for large data sets (5 million) the FPGA did better than APL or compiled C. It still needs testing with more of the complete algorithm moved into the FPGA, so that there is less USB overhead. I have not yet done a comparison with a GPU.
There is a video of this working and being tested on YouTube: http://www.youtube.com/user/LabToolsInstruments
and a fuller description of the FPGA techniques used on the Farnell Element14 site:
FPGA Modular Firmware Skeleton for multiple instruments - Morph-IC-II, YouTube videos.
http://www.element14.com/community/groups/fpga-group/blog/2012/04/28/fpga-modular-firmware-skeleton-for-multiple-instruments--morph-ic-ii-youtube-videos

Frank Buss · Apr 29, 2012

Dr. Beau Webber wrote:

As an exercise I have recently written a USB interfaced FPGA based accelerator for a simple scientific algorithm (Binning) - the original code was written in Apl, and for large data sets (5 million) the FPGA did better than Apl or compiled C. It also needs testing with more of the complete algorithm moved into the FPGA, so there is less USB overhead. I have not yet done a comparison with a GPU.
There is a video of this working and being tested on YouTube : http://www.youtube.com/user/LabToolsInstruments
and a fuller description of the FPGA techniques used on the Farnell Element14 site :
FPGA Modular Firmware Skeleton for multiple instruments - Morph-IC-II, YouTube videos.
http://www.element14.com/community/groups/fpga-group/blog/2012/04/28/fpga-modular-firmware-skeleton-for-multiple-instruments--morph-ic-ii-youtube-videos

I'm sure such a binning algorithm can be many times faster with a GPU,
because it is easy to parallelize, e.g. by partitioning the input data
into as many blocks as you have parallel processing units, each with its
own result sum array, and finally summing all result arrays.
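
The partitioning scheme described above can be sketched in plain Python. This is only an illustration of the structure, not the original APL or FPGA code: it assumes "binning" means counting values in equal-width bins, and the names `bin_block`, `partitioned_binning`, and `num_units` are hypothetical. The loop over blocks runs serially here; on a GPU each block would be handled by its own processing unit.

```python
def bin_block(block, num_bins, lo, hi):
    """Count the values of one block into a private array of equal-width bins."""
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for x in block:
        if lo <= x < hi:
            # Clamp guards against floating-point rounding at the upper edge.
            index = min(int((x - lo) / width), num_bins - 1)
            counts[index] += 1
    return counts


def partitioned_binning(data, num_bins, lo, hi, num_units):
    """Split data into at most num_units blocks, bin each block, then sum the results."""
    # Ceiling division so that every element lands in exactly one block.
    block_size = max(1, (len(data) + num_units - 1) // num_units)
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]

    # Each block gets its own result array, so no two "units" share a counter.
    partials = [bin_block(block, num_bins, lo, hi) for block in blocks]

    # Final reduction: add the per-block arrays element by element.
    total = [0] * num_bins
    for partial in partials:
        for i, count in enumerate(partial):
            total[i] += count
    return total


samples = [0.5, 1.5, 1.7, 2.2, 3.9, 3.1, 0.1]
histogram = partitioned_binning(samples, num_bins=4, lo=0.0, hi=4.0, num_units=3)
print(histogram)
```

Because every block writes only to its own array, the blocks need no synchronization until the final summation, which is what makes the approach map well onto many parallel units.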

I've used CUDA for magnetic field calculation and it was at least 20 times
faster (depending on the graphics card):

http://www.frank-buss.de/magnetfeld/

--
Frank Buss, http://www.frank-buss.de
electronics and more: http://www.youtube.com/user/frankbuss
