iterative algorithms + tightly coupled CPU with cloud of logic

wallge wrote:
I was wondering if anyone had experience with using combinations of
FPGA-based CPUs and surrounding logic to perform iterative algorithms.
For instance, if we want to implement different types of more complex
computer vision algorithms in an embedded system, we may wish to use
the parallelism of an FPGA to do multiple parts of a 2D convolution or
matrix operation in parallel.
While the FPGA may be able to handle the number-crunching requirements
of a given algorithm, it seems to me to be ill suited to handle the
iterative (often non-systolic) nature of many advanced image processing
algorithms. More advanced computer vision algorithms often seem too
complex to be handled solely by FPGA-based logic.
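To make the number-crunching side concrete, here is a minimal VHDL
sketch of the kind of block I have in mind: a 3x3 convolution core
where all nine multiplies and the adder tree run in parallel every
clock. The port names, widths, and the assumption that line buffers
elsewhere in the design supply the window are all mine, just for
illustration.

library ieee;
use ieee.numeric_std.all;

package conv_types is
  type pix_win_t   is array (0 to 8) of unsigned(7 downto 0);  -- 3x3 pixel window
  type coeff_win_t is array (0 to 8) of signed(7 downto 0);    -- 3x3 kernel
end package;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.conv_types.all;

-- 3x3 convolution core: nine multiplies in parallel, one result per clock.
entity conv3x3 is
  port (
    clk    : in  std_logic;
    window : in  pix_win_t;      -- current 3x3 window (from line buffers)
    coeff  : in  coeff_win_t;    -- kernel coefficients
    result : out signed(19 downto 0)
  );
end entity;

architecture rtl of conv3x3 is
begin
  process(clk)
    variable acc : signed(19 downto 0);
  begin
    if rising_edge(clk) then
      acc := (others => '0');
      -- the loop is unrolled by synthesis into nine parallel multipliers
      -- feeding an adder tree
      for i in 0 to 8 loop
        acc := acc + resize(signed('0' & window(i)) * coeff(i), 20);
      end loop;
      result <= acc;
    end if;
  end process;
end architecture;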

I was thinking of the case where we have an FPGA connected directly to
a video source, and data is flowing into the system at some fixed rate.
We may wish to process this data at several scales, and iteratively
search from the low scales up to the higher ones until we have found
features of interest in the video stream. Perhaps we wish to mark those
features by altering pixels in their local neighborhood.

We may need to iteratively process multiple scales of image data and
buffer the original video frame in off-FPGA DRAM, since there will not
be enough on-FPGA BRAM to store full images. Once we find the region of
interest, we may then wish to retrieve the original frame, mark it, and
send it out as output video. A good example of this process might be,
say, face detection.

It seems to me that the iterative nature of these kinds of algorithms
needs to be handled by a combination of CPU and FPGA logic: the FPGA
handling the number crunching and parallel data paths, and the CPU
handling when to iterate, when to stop, or in general what decision to
take next based on the results of the FPGA's number crunching. The CPU
could be built from programmable logic, or placed off-FPGA.
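To make the coupling itself concrete, here is a rough sketch of the
kind of control/status register block I imagine sitting between the
CPU and the FPGA datapath: the CPU writes a scale and a go bit, the
accelerator reports done/found and coordinates. The register map, bus,
and port names are placeholders of my own, not any particular vendor
interface.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Control/status registers between a CPU and an FPGA accelerator.
entity ctrl_regs is
  port (
    clk, rst  : in  std_logic;
    -- simple CPU-side register port (assumed interface)
    wr_en     : in  std_logic;
    addr      : in  unsigned(1 downto 0);
    wr_data   : in  std_logic_vector(31 downto 0);
    rd_data   : out std_logic_vector(31 downto 0);
    -- accelerator side
    go        : out std_logic;                 -- start one search pass
    scale     : out unsigned(3 downto 0);      -- pyramid level to search
    acc_done  : in  std_logic;
    acc_found : in  std_logic;
    acc_x     : in  unsigned(15 downto 0);     -- feature coordinates
    acc_y     : in  unsigned(15 downto 0)
  );
end entity;

architecture rtl of ctrl_regs is
  signal scale_r         : unsigned(3 downto 0) := (others => '0');
  signal done_r, found_r : std_logic := '0';
  signal x_r, y_r        : unsigned(15 downto 0) := (others => '0');
  signal status_word     : std_logic_vector(31 downto 0);
begin
  scale <= scale_r;

  process(clk)
  begin
    if rising_edge(clk) then
      go <= '0';                               -- 'go' is a one-cycle pulse
      if rst = '1' then
        scale_r <= (others => '0'); done_r <= '0'; found_r <= '0';
      else
        -- CPU writes: reg 0 = scale, reg 1 = go (self-clearing)
        if wr_en = '1' then
          if addr = 0 then
            scale_r <= unsigned(wr_data(3 downto 0));
          elsif addr = 1 then
            go     <= wr_data(0);
            done_r <= '0';                     -- clear status when a pass starts
          end if;
        end if;
        -- accelerator reports back
        if acc_done = '1' then
          done_r  <= '1';
          found_r <= acc_found;
          x_r     <= acc_x;
          y_r     <= acc_y;
        end if;
      end if;
    end if;
  end process;

  -- CPU reads: reg 2 = status, reg 3 = coordinates
  status_word(0)           <= done_r;
  status_word(1)           <= found_r;
  status_word(31 downto 2) <= (others => '0');

  rd_data <= status_word                 when addr = 2 else
             std_logic_vector(y_r & x_r) when addr = 3 else
             (others => '0');
end architecture;

Whether the CPU sits on-chip as a soft core or off-FPGA, the software
side of the loop then reduces to: write scale, set go, wait for done,
read found/x/y, and decide whether to move on to the next scale.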

Does anyone have experience with this kind of thing, or know of
somewhere I might be able to find more information about optimal ways
of coupling heterogeneous processors?

I am aware of Altera's C2H compiler, but have not used it, and don't
know how optimally it combines FPGA/CPU resources.
I might be in the market to hire a consultant, if one were
knowledgeable in this area.
 
Some ideas:
If you have the dosh, you might consider using the Opteron server
boards with the second socket used for an FPGA plug-in module; there is
one product for Virtex and another for Stratix, you will need to google
for those. They were discussed in this group a year ago when they first
came out, but I forget the vendor names. One issue here is that the
Opterons communicate with the FPGAs through the HT bus, and the
Opterons are running at compute speeds of 2 GHz and up while the FPGA
may be grunting at 300 MHz or less, but massively parallel. The Opteron
had better be smart about partitioning the problem and not get into the
FPGA at too fine a grain, otherwise the HT bus will be the bottleneck
and either the CPU or the FPGA may be idle.

The other idea is to consider the soft-core processor as a unit you can
either customize at the instruction level by adding your own
bit-twiddling opcodes, or extend with a coprocessor for more complex
processing. Adding opcodes usually slows down the CPU, since it has
already been architected without your new opcodes in mind. The copro
route should work better, since this support is usually included in the
architecture definition.
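A minimal sketch of what I mean by the copro route: the soft core
writes operands, pulses start, and watches done. The port names and
the single-cycle multiply-accumulate are my own simplifications, not
any particular vendor's custom-instruction or coprocessor interface.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Generic coprocessor with a start/done handshake: here, a MAC unit
-- the soft core can feed one operand pair at a time.
entity mac_copro is
  port (
    clk    : in  std_logic;
    rst    : in  std_logic;
    start  : in  std_logic;              -- pulse: accumulate dataa*datab
    clear  : in  std_logic;              -- pulse: reset the accumulator
    dataa  : in  signed(15 downto 0);
    datab  : in  signed(15 downto 0);
    result : out signed(39 downto 0);    -- wide accumulator
    done   : out std_logic
  );
end entity;

architecture rtl of mac_copro is
  signal acc : signed(39 downto 0) := (others => '0');
begin
  result <= acc;

  process(clk)
  begin
    if rising_edge(clk) then
      done <= '0';
      if rst = '1' or clear = '1' then
        acc <= (others => '0');
      elsif start = '1' then
        acc  <= acc + resize(dataa * datab, 40);
        done <= '1';    -- single-cycle op; a longer op would hold done low
      end if;
    end if;
  end process;
end architecture;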

If soft cores can perform most of their workload from BRAM with little
need to go to external DRAM for code or data, then quite a few of these
cores might be placed in the bigger FPGAs, and you might then be able
to mix and match hardware engines under the software control of a local
soft CPU, with the two much closer in clock speed. You could think of
an FFT butterfly box as a specialized CPU engine that has its
instruction set in wired logic; generalize this into a DSP engine and
there are many options.
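For instance, a radix-2 butterfly with its "instruction" wired in
might look roughly like this. The 16-bit widths and Q1.15 twiddle
format are assumptions for the sketch; a real FFT would also scale
per stage to keep the sums from overflowing.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Radix-2 DIT butterfly: x = a + b*w, y = a - b*w, with 16-bit signed
-- data and a Q1.15 twiddle factor, fully registered in one clock.
entity butterfly is
  port (
    clk    : in  std_logic;
    ar, ai : in  signed(15 downto 0);   -- complex input A
    br, bi : in  signed(15 downto 0);   -- complex input B
    wr, wi : in  signed(15 downto 0);   -- twiddle factor W (Q1.15)
    xr, xi : out signed(15 downto 0);   -- A + B*W
    yr, yi : out signed(15 downto 0)    -- A - B*W
  );
end entity;

architecture rtl of butterfly is
begin
  process(clk)
    variable pr, pi : signed(31 downto 0);
    variable tr, ti : signed(15 downto 0);
  begin
    if rising_edge(clk) then
      -- complex multiply B*W: four multipliers, two adders
      pr := br * wr - bi * wi;
      pi := br * wi + bi * wr;
      -- drop the 15 fractional twiddle bits (no per-stage scaling here)
      tr := resize(shift_right(pr, 15), 16);
      ti := resize(shift_right(pi, 15), 16);
      -- sum and difference outputs
      xr <= ar + tr;  xi <= ai + ti;
      yr <= ar - tr;  yi <= ai - ti;
    end if;
  end process;
end architecture;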

Also consider using real TI/ADI DSP chips with an FPGA as a possible
accelerator, and also look at nVidia GPUs as a PC accelerator. I
haven't been there, but some folks claim impressive speedups, and you
probably already have the hardware.


John Jakson
transputer guy
 
It sounds like you have a software solution that you want to implement
in hardware. I don't think you should be too hung up on the iterative
nature of your algorithm; instead you need to rewrite your algorithm
targeting hardware, taking advantage of what VHDL offers. You should be
looking for things that can be done in parallel. Pipelining can reduce
the need for memory.

An example from your description: as areas of interest are identified,
instead of marking them, pass them on to the next stage in the
solution.
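For instance, the region information can ride alongside the pixels as
a sideband flag, so nothing downstream has to re-find it and no pixel
has to be altered. The record below is just my own naming for the
idea.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package stream_pkg is
  -- pixel stream with a sideband region-of-interest flag
  type pix_stream_t is record
    data  : unsigned(7 downto 0);   -- pixel value, untouched
    valid : std_logic;
    roi   : std_logic;              -- '1' while inside a detected region
  end record;
end package;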

I evaluated a couple of different C-to-HDL compilers. Most times they
require you to rewrite code to work within their environment. To some
extent it is like learning another language. Now don't get me wrong,
there are still advantages to using these tools, but I found the VHDL
that they produce wasn't optimal, just 'safe'. I instead decided to
spend my time becoming a better VHDL programmer.

My advice is to start by implementing your solution as a state
machine. In your algorithm, break up compound statements into simple
steps. Each step becomes a state. I developed a technique for
implementing for-next loops that I could easily manage. What I found
was that, while my solutions required a lot of cycles, I could achieve
higher clock frequencies. After I was able to get a working solution
that matched my software implementation, I went back and identified
things that I could implement as parallel units and pipeline.
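A sketch of the for-next-loop-as-state-machine technique, summing 16
values at one iteration per clock. The port names are mine, and a
same-cycle (distributed-RAM style) read from the memory is assumed.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- A for-next loop recast as a state machine: each simple step is a
-- state, one loop iteration per clock.
entity sum_loop is
  port (
    clk, rst : in  std_logic;
    start    : in  std_logic;
    rd_addr  : out unsigned(3 downto 0);    -- "i" presented to the memory
    rd_data  : in  unsigned(7 downto 0);    -- data(i), same-cycle read assumed
    total    : out unsigned(11 downto 0);
    done     : out std_logic
  );
end entity;

architecture rtl of sum_loop is
  type state_t is (IDLE, LOOP_BODY, FINISH);
  signal state : state_t := IDLE;
  signal i     : integer range 0 to 15 := 0;
  signal acc   : unsigned(11 downto 0) := (others => '0');
begin
  rd_addr <= to_unsigned(i, 4);
  total   <= acc;
  done    <= '1' when state = FINISH else '0';

  process(clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        state <= IDLE;
      else
        case state is
          when IDLE =>                  -- wait for the "call"
            if start = '1' then
              acc   <= (others => '0');
              i     <= 0;
              state <= LOOP_BODY;
            end if;
          when LOOP_BODY =>             -- for i = 0 to 15: acc = acc + data(i)
            acc <= acc + rd_data;
            if i = 15 then
              state <= FINISH;
            else
              i <= i + 1;
            end if;
          when FINISH =>                -- hold the result until start drops
            if start = '0' then
              state <= IDLE;
            end if;
        end case;
      end if;
    end if;
  end process;
end architecture;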

The other thing I wanted to do was reduce the number of multipliers I
was using, so I recoded them as shared hardware instead. The amount of
combinational logic I was using went up, but I was able to reduce the
number of multipliers from 36 to 4.
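Roughly, the sharing looks like this: a counter steers operand pairs
through muxes into one multiplier and demultiplexes the products back
out. This is a simplified sketch (four products, inputs held stable
while it runs), not the actual design.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- One multiplier shared across four products.
entity shared_mult is
  port (
    clk             : in  std_logic;
    start           : in  std_logic;           -- run for 4 cycles
    a0, a1, a2, a3  : in  signed(15 downto 0); -- must stay stable while busy
    b0, b1, b2, b3  : in  signed(15 downto 0);
    p0, p1, p2, p3  : out signed(31 downto 0);
    done            : out std_logic
  );
end entity;

architecture rtl of shared_mult is
  signal idx    : unsigned(1 downto 0) := "00";
  signal busy   : std_logic := '0';
  signal ma, mb : signed(15 downto 0);
begin
  -- operand muxes in front of the single multiplier
  ma <= a0 when idx = 0 else a1 when idx = 1 else a2 when idx = 2 else a3;
  mb <= b0 when idx = 0 else b1 when idx = 1 else b2 when idx = 2 else b3;

  process(clk)
  begin
    if rising_edge(clk) then
      done <= '0';
      if busy = '0' then
        if start = '1' then
          busy <= '1';
          idx  <= "00";
        end if;
      else
        -- one product per clock through the shared multiplier
        case to_integer(idx) is
          when 0      => p0 <= ma * mb;
          when 1      => p1 <= ma * mb;
          when 2      => p2 <= ma * mb;
          when others => p3 <= ma * mb;
        end case;
        if idx = 3 then
          busy <= '0';
          done <= '1';
        end if;
        idx <= idx + 1;
      end if;
    end if;
  end process;
end architecture;

The trade is exactly the one above: the muxes cost extra combinational
logic and the results take four cycles instead of one, but only one
multiplier is consumed.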
 
