Image rotation

Tomas D.

Hello,
I've run into a problem: I need to rotate the incoming video stream
image by +/-5 degrees in 0.5 degree steps. I'm trying to identify the
approach that uses the fewest resources and makes the most efficient
use of memory, because I need to design a new PCB and select the
components.
The FPGA will be Altera Cyclone V with one hard memory controller (5CEFA2
device). I am trying to check if it will be sufficient to use one DDR3
memory chip or it's better to use two devices with 32bit memory bus, thus
increasing the bandwidth.

The incoming video stream is from the camera, which has a separate clock,
thus the frame buffer is a requirement.

I've come across two options so far:
1) Image rotation by shearing:
https://www.ocf.berkeley.edu/~fricke/projects/israel/paeth/rotation_by_shearing.html

This seems like a fairly easy approach, but it will require at least
three memory accesses. Combined with a regular three-frame frame buffer,
I could end up doing 5 memory read/write cycles.

2) Image rotation using a lookup table with an entry for each pixel. If
the lookup table is placed in memory, this will require one access to
read the location and another access to read the pixels and write them
to the moved location.

I am not sure which method is most commonly used in FPGA video
processing. Maybe you experts know of good resources to read about this?

Thank you.

Regards
Tomas D.
 
Some time ago I did image rotation for a check scanner that used
a line-scan camera. My issue was mostly the general lack of BRAM
in the small (XCV50) FPGAs I used and I had to come up with an
algorithm that only read small groups of pixels at a time. My
suggestion is to try to find a part that has as much internal
RAM as you can reasonably afford. Then remember that when reading
you want to keep as much of the data you actually read (full bursts)
so you don't have to re-read during the same rotation pass. I would
not think that going to a two-pass shearing algorithm will really
save much in terms of logic. I didn't need to go that way and I
used relatively small parts with no internal hardware multipliers.
The algorithm I used simply started with the location of the first
destination pixel of the first output line, which may be located
at some point outside the actual input image. Remember that free
rotation usually requires a larger "canvas" than the input image.
In my case I didn't really need the whole input image since the
output image was calculated by the detected corners of the check.
Then the algorithm simply walked a pixel at a time by adding a
delta to the starting location. You need to read pixels surrounding
each computed X,Y location and interpolate. My interpolation was
simply linear and used only the 4 nearest neighbors, but a more
robust algorithm would either use more neighboring pixels or do
some filtering on the input image before rotation. When you get
to the end of the first output line, you go back to the original
pixel location plus an orthogonal delta to get the first pixel
of the second output line and so on.
My algorithm used reads of 4 adjacent pixels in each of three
adjacent rows to fill internal memory in 3 x 4 blocks. The starting
point of these 3 x 4 blocks depended on the direction of rotation,
but it allowed me to do +/- 14 degrees max. I did not need to
do sine / cosine in my design because there was a processor that
looked at the incoming raw image to find the check corners and
directly programmed the starting pixel location and X,Y deltas.
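
As a rough software model of the walk described above (not the original HDL), the following Python sketch steps through the source image with a per-pixel delta and a per-line orthogonal delta, interpolating linearly over the 4 nearest neighbors. The function name `rotate_walk`, its parameters, and the choice to read out-of-canvas pixels as 0 are assumptions made for illustration only.

```python
import math

def rotate_walk(src, out_w, out_h, x0, y0, dx, dy):
    """Resample src by walking it with fixed deltas.

    src      : list of rows of grey values (hypothetical test image)
    (x0, y0) : source location of the first output pixel; may lie
               outside the source image
    (dx, dy) : source step per output pixel along one output line
    The step from one output line to the next is the orthogonal
    delta (-dy, dx).
    """
    h, w = len(src), len(src[0])

    def px(cx, cy):
        # Assumption: anything outside the source canvas reads as 0.
        return src[cy][cx] if 0 <= cx < w and 0 <= cy < h else 0

    def sample(x, y):
        # Linear interpolation over the 4 nearest neighbours.
        xi, yi = math.floor(x), math.floor(y)
        fx, fy = x - xi, y - yi
        top = px(xi, yi) * (1 - fx) + px(xi + 1, yi) * fx
        bot = px(xi, yi + 1) * (1 - fx) + px(xi + 1, yi + 1) * fx
        return top * (1 - fy) + bot * fy

    out = []
    line_x, line_y = x0, y0
    for _ in range(out_h):
        x, y = line_x, line_y
        row = []
        for _ in range(out_w):
            row.append(sample(x, y))
            x += dx            # walk one output pixel along the line
            y += dy
        out.append(row)
        line_x -= dy           # orthogonal delta to the next output line
        line_y += dx
    return out

# Example: 5 degree rotation, deltas precomputed as a processor might do.
theta = math.radians(5.0)
img = [[0, 0, 0], [0, 100, 0], [0, 0, 0]]
rotated = rotate_walk(img, 3, 3, 0.0, 0.0, math.cos(theta), math.sin(theta))
```

With `dx = 1, dy = 0` the walk reproduces the source image unchanged, which is a handy sanity check before trying real angles.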

--
Gabor
 
On Thursday, 8 January 2015 08:51:55 UTC+13, Tomas D. wrote:
The FPGA will be Altera Cyclone V with one hard memory controller (5CEFA2
device). I am trying to check if it will be sufficient to use one DDR3
memory chip or it's better to use two devices with 32bit memory bus, thus
increasing the bandwidth.

The most efficient method that uses a frame buffer external to the FPGA will result in one write and one read per pixel. If you use a DDR module you will need a bit more than 2x the bandwidth of the video stream, so a bit over 900 MB/s for 24-bit 1080p @ 60 Hz. You will need to carefully plan how memory is accessed to maximise the available memory bandwidth.
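
A quick Python check of that bandwidth estimate. It assumes the figure is based on the standard 1080p60 pixel clock of 148.5 MHz (which includes blanking); that basis is an assumption, not something stated in the post. The active-pixel-only figure is shown for comparison.

```python
PIXEL_CLOCK_HZ = 148_500_000   # standard 1080p60 pixel clock, blanking included
BYTES_PER_PIXEL = 3            # 24-bit colour
ACCESSES_PER_PIXEL = 2         # one write into the frame buffer, one read out

# Peak rate if memory must keep up with the raw pixel clock.
peak = PIXEL_CLOCK_HZ * BYTES_PER_PIXEL * ACCESSES_PER_PIXEL

# Average rate counting only the active 1920x1080 pixels at 60 Hz.
active = 1920 * 1080 * 60 * BYTES_PER_PIXEL * ACCESSES_PER_PIXEL

print(f"peak:   {peak / 1e6:.1f} MB/s")    # 891.0 MB/s
print(f"active: {active / 1e6:.1f} MB/s")  # 746.5 MB/s
```

Neither number includes DDR overhead such as refresh, bank activation, or read/write turnaround, which is why the required bandwidth ends up "a bit more than 2x".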

For 1080p video, if you can hold 180 rows of pixel data inside your FPGA you don't actually need external memory to buffer the frames at all, and you can achieve lower latency too (approximately the time for 181 lines). The idea is to use a rolling buffer of 180 rows that you sample/extract your output pixels from. The cost of the larger FPGA might be offset by the savings from not requiring the external memory, a smaller PCB and so on.

There is a sweet spot for 720p video, where you can get away with holding just 128 rows for +/- 5 degrees of rotation, requiring only half a MB of block RAM. This assumes that you are not interpolating between pixels.
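
The row counts can be sanity-checked with a little trigonometry. This Python sketch assumes one output line spans about width x sin(angle) source rows and 3 bytes per pixel; the helper names are hypothetical. The 180 and 128 row figures quoted above are larger than the computed minimum, presumably to leave some headroom.

```python
import math

def rows_spanned(width, max_angle_deg):
    # Source rows crossed by one output line at the maximum angle.
    return math.ceil(width * math.sin(math.radians(max_angle_deg)))

def buffer_bytes(width, rows, bytes_per_pixel=3):
    # Size of a rolling line buffer holding `rows` full lines.
    return width * rows * bytes_per_pixel

print(rows_spanned(1920, 5))     # 168 rows minimum for 1080p
print(rows_spanned(1280, 5))     # 112 rows minimum for 720p
print(buffer_bytes(1920, 180))   # 1036800 bytes for 180 rows of 1080p
print(buffer_bytes(1280, 128))   # 491520 bytes, roughly half a MB
```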

If you are performing interpolation, then you might need to be really cunning and use the spare cycles in the blanking interval for the extra memory accesses needed when you walk through the pixels. Your access pattern might be something like

1234......
...5678...
......9ABC
..........

In this case it takes 12 cycles to access the data needed for interpolating 10 pixels (because of the additional cycle required for access 5 & 9 when it jumps lines). You will then need something like a FIFO to remove the gaps in the output pixel stream. For 1080p, you have about 280 cycles in the horizontal blanking interval, a little more than what you will need for a +/- 5 degree rotation, where you will have at most 167 changes between lines.
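
The same counting can be done for a full 1080p line. This is a minimal Python sketch, assuming one extra cycle per line jump and a 2200-clock total line (1920 active plus 280 blanking); the function names are made up for illustration.

```python
import math

def line_jumps(width, angle_deg):
    # Only the count matters here, so the sign of the angle is dropped.
    s = abs(math.sin(math.radians(angle_deg)))
    rows = [math.floor(x * s) for x in range(width)]
    # A jump happens whenever consecutive pixels come from different rows.
    return sum(1 for a, b in zip(rows, rows[1:]) if a != b)

def access_cycles(width, angle_deg):
    # One cycle per output pixel plus one extra cycle per line jump.
    return width + line_jumps(width, angle_deg)

H_ACTIVE, H_TOTAL = 1920, 2200
jumps = line_jumps(H_ACTIVE, 5.0)
cycles = access_cycles(H_ACTIVE, 5.0)
print(jumps, cycles, H_TOTAL - cycles)   # 167 2087 113
```

So at 5 degrees the 167 extra accesses fit inside the 280 blanking cycles with roughly 113 cycles to spare per line.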


Mike
 
On 1/7/15 11:23 PM, Mike Field wrote:
There is a sweet spot for 720p video, where you can get away with
holding just 128 rows for +/- 5 degrees of rotation, requiring only
half a MB of block RAM. This assumes that you are not interpolating
between pixels.

If you are performing interpolation, then you might need to be really
cunning and use the extra cycles found in the blanking interval to
give additional cycles required for the extra memory accesses needed
when you walk through the pixels. Your access pattern might be
something like

1234......
...5678...
......9ABC
..........

In this case it takes 12 cycles to access the data needed for
interpolating 10 pixels (because of the additional cycle required for
access 5 & 9 when it jumps lines). You will then need something like
a FIFO to remove the gaps in the output pixel stream. For 1080p, you
have about 280 cycles in the horizontal blanking interval, a little
more than what you will need for a +/- 5 degree rotation, where you
will have at most 167 changes between lines.


Mike

On the need for 12 cycles here: my experience is that FPGAs tend NOT to
have giant blocks of memory, but a lot of "smaller" blocks (perhaps of
differing sizes). This 1/2 MB memory is likely made of smaller blocks and
could be defined as 2 separate memories, one for even lines and one for
odd, which means that accesses 4,5 and 8,9 could be done simultaneously.
(Actually, if you are interpolating the pixels, you are almost always
going to want two lines of data, the lines above and below your fractional
position, and the points before and after, so you may want to 4-way
interleave your memory.)
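
One way to picture the interleaving is as a bank-select function on the pixel coordinates. The mapping below (bank chosen from the low bits of x and y) is an assumed illustration in Python, not a description of any particular design; it shows that every 2 x 2 interpolation neighborhood then touches four different banks, so all four pixels can be read in the same cycle.

```python
def bank_2way(y):
    # Even lines in one memory, odd lines in the other.
    return y % 2

def bank_4way(x, y):
    # Assumed mapping: low bit of y and low bit of x select one of 4 banks.
    return (y % 2) * 2 + (x % 2)

def neighbourhood_banks(x, y):
    # Banks touched by the 2 x 2 block whose top-left pixel is (x, y).
    return {bank_4way(x + i, y + j) for j in (0, 1) for i in (0, 1)}

# Vertically adjacent pixels always land in different 2-way banks.
assert all(bank_2way(y) != bank_2way(y + 1) for y in range(16))

# Every 2 x 2 neighbourhood hits all four 4-way banks exactly once.
assert all(len(neighbourhood_banks(x, y)) == 4
           for x in range(16) for y in range(16))
```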
 
