video buffering scheme, nonsequential access (no spatial locality)


wallge

I am doing some embedded video processing, where I store an incoming
frame of video, then based on some calculations in another part of the
system, I warp that buffered frame of video. Now when the frame goes
into the buffer (an off-FPGA SDRAM chip), it is simply written in one
pixel at a time in row-major ordering.

The problem with this is that I will not be accessing it in this way.
I may want to do some arbitrary image rotation. This means the first
pixel I want to access is not the first one I put in the buffer; it
might actually be the last one in the buffer. If I am doing full page
reads, or even burst reads, I will get a bunch of pixels that I will
not need to determine the output pixel value. If I just do single
reads, this wastes a bunch of clock cycles setting up the SDRAM,
telling it which row to activate and which column to read from. After
the read is done, you then have to issue the precharge command to
close the row. There is a high degree of inefficiency to this. It
takes 5, maybe 10 clock cycles just to retrieve one pixel value.

Does anyone know a good way to organize a frame buffer to be more
friendly (and more optimal) to nonsequential access (like the kind we
might need if we wanted to warp the input image via some
linear/nonlinear transformation)?
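[A back-of-envelope sketch of that per-pixel overhead. The timing
numbers here are illustrative PC100-class values, not from any
specific datasheet, so treat this as an estimate only:]

```c
#include <assert.h>

/* Back-of-envelope cost of SDRAM reads at 100 MHz. Timing values are
 * illustrative PC100-class numbers, not from a specific part:
 * tRCD = 2 (ACTIVATE to READ), CL = 2 (CAS latency), tRP = 2
 * (PRECHARGE before the next ACTIVATE). */
static int cycles_per_access(int burst_len)
{
    const int tRCD = 2, CL = 2, tRP = 2;
    /* one pixel transfers per data cycle of the burst */
    return tRCD + CL + burst_len + tRP;
}

/* Cycles spent per pixel the warp actually uses, given how many
 * pixels of each burst turn out to be useful. */
static double cycles_per_useful_pixel(int burst_len, int pixels_used)
{
    return (double)cycles_per_access(burst_len) / pixels_used;
}
```

[With these numbers a single read costs 7 cycles for one pixel, in
line with the "5, maybe 10" estimate above; an 8-beat burst costs 14
cycles, which only beats 2 cycles/pixel if most of the burst is
actually consumed.]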
 
I have somewhat the same problem and I'm using ram that provides fast
random access, i.e. ZBT ram. You can get ZBT ram that runs at 200 MHz,
so that you can effectively process 100 Mpixels/s. ZBT ram is very
small compared to SDRAM, but if you only need to store a few frames,
that shouldn't be a problem.

Adding ZBT might not be an option on your system however... Maybe
someone can suggest a clever algorithm for your particular problem.


Patrick Dubois

On Jan 24, 2:36 pm, "wallge" <wal...@gmail.com> wrote:
> [original post snipped]
 
"wallge" <wallge@gmail.com> writes:

> [original post snipped]
If you are doing truly arbitrary warping, then is it not right that
you can never get an optimal organisation for all warps?

> Does anyone know a good way to organize a frame buffer to be more
> friendly (and more optimal) to nonsequential access (like the kind we
> might need if we wanted to warp the input image via some
> linear/nonlinear transformation)?
Could you do some kind of caching scheme where you read an entire DRAM
row in at a time, and "hope it comes in handy" later?

Failing that, can you use SSRAM for your frame buffer?

Or, can you parallelise your task so that it operates on (eg) 4 wildly
different areas of input data at a time, which means you can use the
banking mechanism of the DRAMs to hide the latency?

Those are my initial thoughts (whilst waiting for a very loooooong
simulation to run :)

Cheers,
Martin

--
martin.j.thompson@trw.com
TRW Conekt - Consultancy in Engineering, Knowledge and Technology
http://www.conekt.net/electronics.html
 
wallge wrote:

> Does anyone know a good way to organize a frame buffer to be more
> friendly (and more optimal) to nonsequential access
Sounds like a RAM.
If it didn't fit in fpga block ram
I would use an external device.

-- Mike Treseler
 
I should have been more specific in my question.

I have to use a small (64 Mbit) mobile SDRAM. I can't choose to use a
different storage element in the system (other than *some* FPGA
buffering, though not a full frame).

I have heard some discussion of the way in which graphics accelerator
boards do memory transactions, storing pixels in blocks of neighboring
pixels (instead of being organized row major). In other words, the
spatial locality in the SDRAM buffer might look like:

Image pixels:
N2 N3 N4
N1 P N5
N8 N7 N6

Memory organization:
ADDR DATA
0x0000 P
0x0001 N1
0x0002 N2
0x0003 N3
0x0004 N4
0x0005 N5
0x0006 N6
0x0007 N7
0x0008 N8


Where P is the central pixel of interest, and the N's are its
neighbors. We organize the pixels in the SDRAM buffer not by rows, but
by regions of interest. This way, if we are doing some kind of image
warp and want more bang for the buck in terms of read latency, we are
more likely to reuse pixels in the neighborhood of the currently
accessed pixel than if they were arranged in row- or column-major
ordering (consider the case where we wanted to rotate an image by 47.2
degrees from input to output).

Has anyone seen something like this or know of any resources online
with regard to memory buffer organization schemes for graphics or image
processing?
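[The block-of-neighbors idea can be sketched as a tiled address
mapping. The tile size and image width below are illustrative choices,
not from the post; in practice you would match the tile to the SDRAM
page size:]

```c
#include <assert.h>

/* Sketch: map (x, y) pixel coordinates to a tiled frame-buffer
 * address so that a small neighborhood lands in one SDRAM row.
 * TILE_W, TILE_H and IMG_W are illustrative values. */
enum { TILE_W = 8, TILE_H = 8, IMG_W = 640 };

static unsigned tiled_addr(unsigned x, unsigned y)
{
    unsigned tiles_per_row = IMG_W / TILE_W;           /* 80 tiles per image row */
    unsigned tile  = (y / TILE_H) * tiles_per_row + (x / TILE_W);
    unsigned inner = (y % TILE_H) * TILE_W + (x % TILE_W);
    return tile * (TILE_W * TILE_H) + inner;           /* tiles stored contiguously */
}
```

[With this mapping, pixels (10,10) and (11,11) are 9 addresses apart
instead of 641 in row-major order, and a 4 x 4 interpolation kernel
touches at most four tiles, often just one.]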



 
"wallge" <wallge@gmail.com> wrote in message
news:1169747314.537493.237140@l53g2000cwa.googlegroups.com...

> Image pixels:
> N2 N3 N4
> N1 P N5
> N8 N7 N6
Have you thought about what order of filtering you'll
need to use?
 
I am not doing any image filtering. This is not a filtering operation.
It is an interpolation operation (typically bilinear or bicubic) to do
image transformations.

On Jan 25, 1:00 pm, "Pete Fraser" <pfra...@covad.net> wrote:
> [snip]
 
"wallge" <wallge@gmail.com> wrote in message
news:1169753051.309748.114380@q2g2000cwa.googlegroups.com...
> I am not doing any image filtering.
Yes you are.

> This is not a filtering operation.
Yes it is.

> It is an interpolation operation
> typically bilinear or bicubic
> to do image transformations.
And that's a filtering operation. So the maximum kernel size is 4 x 4,
though you might use 2 x 2. The kernel size could have a substantial
bearing on the traffic to/from on-chip RAM.

I'm still not sure of your limitations on off-chip RAM.
You have a buffer on the input or output (or both?)
Do you have enough bandwidth to have an
intermediate buffer for a two-pass operation?
 
Can you write out the FIR filter coeffs for a bilinear interpolation
"filter kernel"? How about a bicubic interpolator filter kernel; what
are its filter coeffs?

Arguing semantics was not the purpose of my post.

I will probably wind up doing bilinear interpolation or "filtering",
which means I need 4 pixels of the input frame to determine 1 pixel
of the output warped frame.

By the way, what is the freq response of the bilinear interpolation
"filter"?
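[The 4-pixels-in, 1-pixel-out bilinear step can be sketched as
follows. Floating point is used only for clarity; an FPGA version
would use fixed-point fractional phases:]

```c
#include <assert.h>

/* Minimal bilinear interpolation sketch: given the four input pixels
 * surrounding a non-integer source coordinate, and the fractional
 * offsets fx, fy in [0,1), compute one output pixel.
 * p00 = top-left, p10 = top-right, p01 = bottom-left, p11 = bottom-right. */
static double bilinear(double p00, double p10, double p01, double p11,
                       double fx, double fy)
{
    double top = p00 + fx * (p10 - p00);  /* blend along x, upper pair */
    double bot = p01 + fx * (p11 - p01);  /* blend along x, lower pair */
    return top + fy * (bot - top);        /* blend along y */
}
```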



On Jan 25, 5:16 pm, "Pete Fraser" <pfra...@covad.net> wrote:
> [snip]
 
"wallge" <wallge@gmail.com> wrote in message
news:1169766685.898182.155950@v45g2000cwv.googlegroups.com...
> Can you write out the FIR filter coeffs for
> a bilinear interpolation "filter kernel"?
> How about a bicubic interpolator filter kernel
> what are its filter coeffs?
I'm happy to, but we're getting away from FPGA stuff,
so let's do that off line. Let me know how many phases you
need, and the coefficient format you'd like. I usually
use a minor 4x4 variation on cubic, but it's all set up in
Mathematica, so I could do cubic also.

> arguing semantics was not the purpose of my post.
>
> I will probably wind up doing bilinear interpolation or
> "filtering". Which means I need 4 pixels of the input frame to
> determine 1 pixel of output warped frame.
So you don't really need coefficient tables for this.
You can just use the fractional phase directly.

> By the way what is the Freq response of the bilinear interpolation
> "filter"?
It depends on the position of the output relative to the input pixel,
but for a central output pixel the frequency response would be
cosinusoidal.

Getting back to FPGA stuff though, what are your off-chip
RAM bandwidth limitations, and could you consider a two-pass approach?

 
I am not sure what you mean by a two-pass approach. The max
(theoretical) bandwidth I have available to/from the SDRAM is about

16 bits * 100 MHz = 1.6 Gbit/sec

This is not an achievable estimate of course, even if I only did full
page reads and writes, since there is overhead associated with each. I
also have to refresh periodically.

My pixel bit width could be brought down to 8 bits. That way I could
store 2 pixels per address if need be.
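[The two-pixels-per-word packing mentioned above might look like this;
the byte order within the word is an arbitrary convention, not
anything the poster specified:]

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 8-bit pixels into one 16-bit SDRAM word so each access
 * moves two pixels. Low byte = even pixel is just one convention. */
static uint16_t pack2(uint8_t even, uint8_t odd)
{
    return (uint16_t)(even | ((uint16_t)odd << 8));
}

static uint8_t unpack_even(uint16_t w) { return (uint8_t)(w & 0xFF); }
static uint8_t unpack_odd(uint16_t w)  { return (uint8_t)(w >> 8); }
```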



On Jan 25, 7:23 pm, "Pete Fraser" <pfra...@covad.net> wrote:
> [snip]
 
"wallge" <wallge@gmail.com> writes:

> [snip]
> I also have to refresh periodically.
Don't forget that for video apps, you often don't need to refresh, as
you are reading and writing the SDRAM rows in a regular fashion which
means you can guarantee that each gets touched often enough.

Indeed for some video applications, like output framebuffers, all you
need to do is ensure that you read the row out for display soon enough
after the write, which is often easy to achieve.

Cheers,
Martin

--
martin.j.thompson@trw.com
TRW Conekt - Consultancy in Engineering, Knowledge and Technology
http://www.conekt.net/electronics.html
 
Gabor,

Are you saying that I don't need to activate/precharge the bank when
switching to another? I am kind of unclear on this. When do activate
and precharge commands need to be issued? I thought when switching to
a new row or bank you had to precharge (close) the previously active
one, then activate the new row/bank before actually reading from or
writing to it. Where am I going wrong here?

Also, to the notion that I don't need to refresh since I am doing
video buffering: I am actually buffering multiple frames of video and
then reading them out several frames later. In other words, there may
be a significant fraction of a second (say 1/8~1/4 sec) of delay
between writing data into a particular page of memory and actually
reading it back out. Is this too much time to expect my pixel data to
still be valid without refreshing?



On Jan 26, 6:03 pm, "Gabor" <g...@alacron.com> wrote:
On Jan 26, 3:15 pm, "wallge" <wal...@gmail.com> wrote:

> [snip]
You may be missing an important feature of SDRAM. You don't need to
use full-page reads or writes to keep data streaming at 100% of the
available bandwidth (if you don't change direction), or very nearly
100% (if you switch from read to write infrequently). This is due to
the ability to set up another block operation on one bank while
another bank is transferring data. When I use SDRAM for relatively
random operations like this, I like to think of the minimum data unit
as one minimal burst (two words in a single-data-rate SDRAM) to each
of the four banks. Any number of these data units can be strung one
after another with no break in the data flow. Then if you wanted to
internally buffer a square section of the image in internal blockRAM,
the width of the minimum block (allowing 100% data rate) would only be
16 8-bit pixels or 8 16-bit pixels in your case. If the area can cover
the required computational core (4 x 4?) for several pixels at a time,
you can reduce overall bandwidth. This was the point of suggesting an
internal cache memory.

HTH,
Gabor
 
"wallge" <wallge@gmail.com> writes:

> Are you saying that I don't need to activate/precharge the bank when
> switching to another?
Not necessarily.

> I am kind of unclear on this. When do activate and precharge
> commands need to be issued? I thought when switching to a new row or
> bank you had to precharge (close) the previously active one, then
> activate the new row/bank before actually reading from or writing to
> it. Where am I going wrong here?
You have to precharge a bank only when you switch to another row
within that bank.

> Also to the notion that I don't need to refresh since I am doing
> video buffering: I am actually buffering multiple frames of video
> and then reading out several frames later. In other words, there may
> be a significant fraction of a second (say 1/8~1/4 sec) of delay
> between writing data into a particular page of memory and actually
> reading it back out. Is this too much time to expect my pixel data
> to still be valid without refreshing?
That very much depends on the access patterns. The fact that you are
going to implement a frame buffer alone doesn't automatically mean
that you won't need a refresh. Double- or triple-check your specs. If
in doubt, I'd definitely recommend putting in a refresh as a
low-priority task.
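[As a sanity check on the numbers: a typical 64 Mbit SDRAM datasheet
(not necessarily the poster's part) requires every row to be refreshed
within 64 ms, e.g. 4096 AUTO REFRESH commands per 64 ms period. A 1/8
to 1/4 second hold therefore clearly exceeds the retention spec:]

```c
#include <assert.h>

/* Does holding data untouched for hold_seconds exceed a typical SDRAM
 * retention/refresh period? 64 ms is a common datasheet tREF value;
 * check the actual part's datasheet for the real number. */
static int needs_refresh(double hold_seconds)
{
    const double retention_s = 0.064; /* typical 64 ms refresh period */
    return hold_seconds > retention_s;
}
```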

Regards,
Marcus

--
note that "property" can also be used as syntaxtic sugar to reference
a property, breaking the clean design of verilog; [...]

-- Michael McNamara
(http://www.veripool.com/verilog-mode_news.html)
 
I just wanted to say thanks to everyone for responding
with a lot of helpful answers and feedback in this post.
Really great forum.

On Jan 30, 10:32 am, "Gabor" <g...@alacron.com> wrote:
On Jan 29, 10:50 am, "wallge" <wal...@gmail.com> wrote:

> Gabor,
>
> Are you saying that I don't need to activate/precharge the bank
> when switching to another?
First of all, you don't "switch" banks. There are four banks that can
all potentially be active at a given time. Only the external interface
works on one bank at a time. That being said, realise that the
control interface (address, ras, cas, we) is somewhat independent
of the data interface (dq).

> I am kind of unclear on this. When do activate and precharge commands
> need to be issued? I thought when switching to a new row or bank you
> had to precharge (close) the previously active one, then activate the
> new row/bank before actually reading from or writing to it. Where am
> I going wrong here?
You need to precharge a bank before opening a new row in _THAT_
bank. Other banks may remain open while this happens. When
doing single burst accesses, I generally precharge using the
read or write command with auto-precharge (A10 high during CAS).

> Also to the notion that I don't need to refresh since I am doing
> video buffering: I am actually buffering multiple frames of video and
> then reading out several frames later. In other words, there may be a
> significant fraction of a second (say 1/8~1/4 sec) of delay between
> writing data into a particular page of memory and actually reading it
> back out.
What's a page? These RAMs have rows. Each row must be accessed
using row activate or else refreshed within the refresh period. If you
store data in successive rows / banks first, and then successive
columns (i.e. row/bank form the LSBs of your address), you will
usually refresh the entire part without accessing a large portion of
the entire memory.

Here's a typical sequence I use for writing streaming data into
an SDRAM:

Cycle  Command  Bank  Addr  Data
(startup sequence has unused cycles, NOPs)
  1    ACT      0     row0  x
  2    NOP      x     x     x
  3    ACT      1     row0  x
  4    NOP      x     x     x
  5    ACT      2     row0  x
(full streaming starts here, burst size = 2)
  6    WRITEA   0     col0  data0
  7    ACT      3     row0  data0
  8    WRITEA   1     col0  data1
  9    ACT      0     row1  data1
 10    WRITEA   2     col0  data2
 11    ACT      1     row1  data2
 12    WRITEA   3     col0  data3
 13    ACT      2     row1  data3
 14    WRITEA   0     col0  data4
 15    ACT      3     row1  data4
 16    WRITEA   1     col0  data5
(above streaming sequence can be repeated ad nauseam)
(end sequence has unused cycles, NOPs)
 17    NOP      x     x     data5
 18    WRITEA   2     col0  data6
 19    NOP      x     x     data6
 20    WRITEA   3     col0  data7
 21    NOP      x     x     data7
WRITEA is write command with autoprecharge (A10 = 1)

Reading is similar except there are pipeline delays on the data bus
due to CAS read access time.
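[The steady-state part of the sequence above can be generated by a
small model; this is a sketch for checking the interleave pattern, not
synthesizable code:]

```c
#include <assert.h>

/* Model of the steady-state command stream (burst = 2, four banks):
 * even cycles issue WRITEA to banks 0,1,2,3 round-robin; odd cycles
 * use the otherwise idle command slot to ACT the bank that will be
 * written three bursts later, giving 5 cycles of ACT-to-WRITE
 * headroom while data flows on every cycle. */
enum { WRITEA = 0, ACT = 1 };

static int cmd_at(int cycle) { return (cycle % 2 == 0) ? WRITEA : ACT; }

static int bank_at(int cycle)
{
    int burst = cycle / 2;        /* one 2-word burst every 2 cycles */
    if (cmd_at(cycle) == WRITEA)
        return burst % 4;         /* data-carrying command            */
    return (burst + 3) % 4;       /* open the next row early          */
}
```

[Cycle 0 of this model corresponds to cycle 6 in the table, where full
streaming starts: WRITEA bank 0, ACT bank 3, WRITEA bank 1, ACT bank 0
(row1), and so on.]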

Regards,
Gabor
 
