EDK : FSL macros defined by Xilinx are wrong

On Jan 24, 11:36 am, "wallge" <wal...@gmail.com> wrote:
I am doing some embedded video processing, where I store an incoming
frame of video, then, based on some calculations in another part of the
system, I warp that buffered frame of video. The frame goes into the
buffer (an off-FPGA SDRAM chip) written one pixel at a time in
row-major order.

The problem with this is that I will not be accessing it that way. I
may want to do some arbitrary image rotation, which means the first
pixel I want to access is not the first one I put in the buffer; it
might actually be the last one. If I do full-page reads, or even burst
reads, I will get a bunch of pixels that I do not need to determine the
output pixel value. If I just do single reads, that wastes a bunch of
clock cycles setting up the SDRAM, telling it which row to activate and
which column to read from; after the read is done, you then have to
issue the precharge command to close the row. This is highly
inefficient: it takes 5, maybe 10 clock cycles just to retrieve one
pixel value.

Does anyone know a good way to organize a frame buffer so that it is
friendlier (and more efficient) for non-sequential access, like the
kind we might need to warp the input image via some linear or nonlinear
transformation?
A fairly simple technique, reasonably well known among video system
designers, is to use what is sometimes called tiling the image into the
((DDR)S)DRAM columns.

E.g., assume a 1Kx1K image with vertical and horizontal address bits
(V9..V0) and (H9..H0), and a DRAM with row and column address bits
(R9..R0) and (C9..C0). Do _not_ use the straight mapping of:

(V9..V0) <=> (R9..R0) and (H9..H0) <=> (C9..C0)

Instead, map the H/V LSBs into the DRAM column address, and the H/V
MSBs into the DRAM row address:

(V4..V0,H4..H0) <=> (C9..C0) and (V9..V5,H9..H5) <=> (R9..R0)

When warping, the image sample addresses are pipelined out to the DRAM,
with time designed into the pipeline to examine the addresses for DRAM
row-boundary crossings, and the pipeline stalls only when it is
necessary to re-RAS. Stalls then occur only when the sampling area
overlaps the edge of a tile, instead of with every 2x2 or 3x3 fetch.
(You posed your question so well, I'll bet this already occurred to
you.)
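To make that mapping concrete, here is a minimal VHDL sketch of the bit
rearrangement (the entity and signal names are mine and purely
illustrative, not from any particular design; it assumes the 10-bit
coordinates and the 32x32-pixel tiles implied by the mapping above):

library ieee;
use ieee.std_logic_1164.all;

entity tile_addr_map is
  port (
    h_addr   : in  std_logic_vector(9 downto 0);  -- horizontal pixel index H9..H0
    v_addr   : in  std_logic_vector(9 downto 0);  -- vertical pixel index   V9..V0
    row_addr : out std_logic_vector(9 downto 0);  -- DRAM row address       R9..R0
    col_addr : out std_logic_vector(9 downto 0)   -- DRAM column address    C9..C0
  );
end entity tile_addr_map;

architecture rtl of tile_addr_map is
begin
  -- MSBs select the 32x32 tile (held in one DRAM row),
  -- LSBs select the pixel inside the tile (DRAM column).
  row_addr <= v_addr(9 downto 5) & h_addr(9 downto 5);
  col_addr <= v_addr(4 downto 0) & h_addr(4 downto 0);
end architecture rtl;

With this layout a 2x2 or 3x3 sample neighbourhood almost always falls
inside one open row; only samples that straddle a tile edge force a
precharge and a new activate.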

Caching can also be used to bypass the external RAM access pipeline
when the required pixels are already in the FPGA. There are lots of
different caching techniques; I haven't looked at that area in a while.

Block processing is a variant of caching: reading a tile from external
DRAM into BRAM, warping from that BRAM into another BRAM, then sending
the results back out; but the border calculations get messy for
completely arbitrary warps.

HTH
Just John
 
On Jan 25, 10:37 pm, "Shenli" <zhushe...@gmail.com> wrote:
Hi all,

I am reading "Coding Guidelines for Datapath Synthesis" from Synopsys.

It says "The most important technique to improve the performance of a
datapath is to avoid expensive carry-propagations and to make use of
redundant representations instead (like carry-save or partial-product)
wherever possible."

1. Is there an article that explains what "carry-propagation" is and
how to avoid it?
2. What is meant by "redundant representations"?

Please recommend some reading on this; thanks in advance!

Best regards,
Davy
http://www.ecs.umass.edu/ece/koren/arith/
Happy reading,
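As a concrete illustration of what a "redundant representation" buys
you (a sketch of my own, not taken from the Synopsys guide or the link
above): a carry-save stage, i.e. a 3:2 compressor, adds three numbers
into a sum/carry pair with no carry chain at all, so only the final
conversion back to ordinary binary needs a carry-propagate adder.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity csa_3to2 is
  generic (W : positive := 16);
  port (
    a, b, c   : in  unsigned(W-1 downto 0);
    sum_vec   : out unsigned(W-1 downto 0);  -- bitwise sums (redundant form)
    carry_vec : out unsigned(W-1 downto 0)   -- per-bit carries (redundant form)
  );
end entity csa_3to2;

architecture rtl of csa_3to2 is
begin
  -- Every bit position is an independent full adder, so there is no
  -- carry chain and the delay does not grow with the word width.
  sum_vec   <= a xor b xor c;
  carry_vec <= (a and b) or (a and c) or (b and c);
  -- The ordinary binary result is sum_vec + (carry_vec shifted left by
  -- one), computed once at the very end with a single carry-propagate
  -- adder.
end architecture rtl;

In a multiplier, the partial products are kept in this kind of
redundant form through the whole reduction tree, which is essentially
the "partial-product" representation the guide also mentions.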
 
On Jan 26, 3:15 pm, "wallge" <wal...@gmail.com> wrote:
I am not sure what you mean by a two-pass approach.
The max (theoretical) bandwidth I have available to/from the SDRAM is
about 16 bits * 100 MHz = 1.6 Gbit/s.

This is not an achievable figure, of course, even if I only did
full-page reads and writes, since there is overhead associated with
each, and I also have to refresh periodically.

My pixel bit width could be brought down to 8 bits. That way I could
store 2 pixels per address if need be.
You may be missing an important feature of SDRAM. You don't need to
use full-page reads or writes to keep data streaming at 100% of the
available bandwidth (if you don't change direction), or very nearly
100% (if you switch between read and write infrequently). This is due
to the ability to set up another block operation on one bank while a
different bank is transferring data. When I use SDRAM for relatively
random operations like this, I like to think of the minimum data unit
as one minimal burst (two words in a single-data-rate SDRAM) to each of
the four banks. Any number of these data units can be strung one after
another with no break in the data flow. Then, if you wanted to
internally buffer a square section of the image in internal block RAM,
the width of the minimum block (allowing 100% data rate) would be only
16 8-bit pixels or 8 16-bit pixels in your case. If that area can cover
the required computational core (4 x 4?) for several output pixels at a
time, you can reduce the overall bandwidth. This was the point of
suggesting an internal cache memory.

HTH,
Gabor
 
You may be missing an important feature of SDRAM. You don't need to
use full-page reads or writes to keep data streaming at 100% of the
available bandwidth (if you don't change direction), or very nearly
100% (if you switch between read and write infrequently). This is due
to the ability to set up another block operation on one bank while a
different bank is transferring data.
Hi Gabor,

I've missed this too. What happens at the end of the burst? Do you
issue another CAS? Otherwise the burst will stop (unless it's a
full-page burst). Anyway, if it works like that, then we can save a
bunch of RAS-to-CAS cycles and the data transfer can be seamless.

I didn't know it was possible to activate another bank while one bank
is being read/written; please clarify this for me, it must be a big
miss on my part :)
Thanks,
 
On Jan 29, 7:50 am, "wallge" <wal...@gmail.com> wrote:
Are you saying that I don't need to activate/precharge the bank
when switching to another?
I am kind of unclear on this. When do activate and precharge commands
need to be issued? I thought when switching to a new row or bank you
had to precharge (close) the previously active one, then activate the
new row/bank before actually reading from or writing to it. Where am I
going wrong here?
Each bank is either closed or open on a given row. As long as your
accesses go to an already-open row, you needn't precharge. Beware,
though: you must close the bank after tRAS(MAX) ~= 70 us.

AFAICT, people here have been suggesting data layout schemes that will
increase your likelihood of hitting an open row.

Also, to the notion that I don't need to refresh since I am doing
video buffering: I am actually buffering multiple frames of video and
then reading them out several frames later. In other words, there may
be a significant fraction of a second (say 1/8 to 1/4 s) of delay
between writing data into a particular page of memory and actually
reading it back out. Is this too much time to expect my pixel data to
still be valid without refreshing?
Yes, that's too much. tREFI (average periodic refresh interval) is
~7 us.

Tommy
 
Also, to the notion that I don't need to refresh since I am doing
video buffering: I am actually buffering multiple frames of video and
then reading them out several frames later. In other words, there may
be a significant fraction of a second (say 1/8 to 1/4 s) of delay
between writing data into a particular page of memory and actually
reading it back out. Is this too much time to expect my pixel data to
still be valid without refreshing?
Yes, that's too much. tREFI (average periodic refresh interval) is
~7 us.
Let me try that again. Quoting from Micron's SDRAM (SDR, not DDR) data
sheet: "The addressing is generated by the internal refresh controller.
This makes the address bits 'Don't Care' during an AUTO REFRESH
command. The 64Mb SDRAM requires 4,096 AUTO REFRESH cycles every 64ms
(EF), regardless of width option."

This suggests that for this SDR SDRAM, a frame rate of 15 Hz or higher
is enough to keep all displayed pixels refreshed. DDR SDRAM is probably
very similar. Your 1/8 to 1/4 s is much too slow.
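A quick worked check of that figure (my own arithmetic, not from the
post): every row must be refreshed, or activated by an access, within
the 64 ms window, so

\[
\frac{64\ \mathrm{ms}}{4096\ \text{refresh cycles}} \approx 15.6\ \mu\mathrm{s}\ \text{average per row},
\qquad
\frac{1}{64\ \mathrm{ms}} \approx 15.6\ \mathrm{Hz},
\qquad
125\text{--}250\ \mathrm{ms} \gg 64\ \mathrm{ms} .
\]

So a buffer that is scanned completely (activating every row) roughly
15-16 times per second stays inside the retention window, while a
1/8 to 1/4 s gap overshoots it by a factor of 2 to 4.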

Tommy
 
Geronimo Stempovski wrote:

I recently heard about 60 GHz in the mobile communication sector and
10 Gbit Ethernet, but as far as I know there are multi-level modulation
methods (like QAM, for example) that are able to provide 10 Gbit of
bandwidth with a bit rate of some Mbps (is that correct?).
Last I heard, 10GbaseT runs at 800Mbaud with the gloriously named
128-DoubleSquare line code. And lots of other clever stuff, such as
Tomlinson-Harashima precoding, which others on c.a.f. will jump in to
explain ;-)

So the baud rate (i.e. symbol rate or changes on the line) is 800
million per second.

Tim
 
On Jan 29, 10:50 am, "wallge" <wal...@gmail.com> wrote:
Gabor,

Are you saying that I don't need to activate/precharge the bank
when switching to another?
First of all, you don't "switch" banks. There are four banks that can
all potentially be active at a given time; only the external interface
works on one bank at a time. That being said, realise that the control
interface (address, ras, cas, we) is somewhat independent of the data
interface (dq).

I am kind of unclear on this. When do activate and precharge commands
need to be issued? I thought when switching to a new row or bank you
had to precharge (close) the previously active one, then activate the
new row/bank before actually reading from or writing to it. Where am I
going wrong here?
You need to precharge a bank before opening a new row in _THAT_ bank.
Other banks may remain open while this happens. When doing single-burst
accesses, I generally precharge using the read or write command with
auto-precharge (A10 high during CAS).

Also, to the notion that I don't need to refresh since I am doing
video buffering: I am actually buffering multiple frames of video and
then reading them out several frames later. In other words, there may
be a significant fraction of a second (say 1/8 to 1/4 s) of delay
between writing data into a particular page of memory and actually
reading it back out.
What's a page? These RAMs have rows. Each row must either be accessed
using row activate or be refreshed within the refresh period. If you
store data in successive rows/banks first, and then successive columns
(i.e. row/bank form the LSBs of your address), you will usually refresh
the entire part without having to access a large portion of the memory.

Here's a typical sequence I use for writing streaming data into
an SDRAM:

Cycle  Command  Bank  Addr  Data
startup sequence has unused cycles (NOPs)
  1    ACT      0     row0  x
  2    NOP      x     x     x
  3    ACT      1     row0  x
  4    NOP      x     x     x
  5    ACT      2     row0  x
full streaming starts here (burst size = 2)
  6    WRITEA   0     col0  data0
  7    ACT      3     row0  data0
  8    WRITEA   1     col0  data1
  9    ACT      0     row1  data1
 10    WRITEA   2     col0  data2
 11    ACT      1     row1  data2
 12    WRITEA   3     col0  data3
 13    ACT      2     row1  data3
 14    WRITEA   0     col0  data4
 15    ACT      3     row1  data4
 16    WRITEA   1     col0  data5
the above streaming section can be repeated ad nauseam
end sequence has unused cycles (NOPs)
 17    NOP      x     x     data5
 18    WRITEA   2     col0  data6
 19    NOP      x     x     data6
 20    WRITEA   3     col0  data7
 21    NOP      x     x     data7

WRITEA is write command with autoprecharge (A10 = 1)

Reading is similar, except there are pipeline delays on the data bus
due to the CAS read access time.
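As a sketch of the address layout that produces this bank rotation
(my own illustrative VHDL, not Gabor's code; the widths are arbitrary
and assume a burst length of 2, four banks, 9 column bits and 10 row
bits), the bank bits sit just above the burst offset so consecutive
bursts walk through all four banks, and the row advances next, exactly
as in the sequence above:

library ieee;
use ieee.std_logic_1164.all;

entity sdram_addr_split is
  port (
    lin_addr : in  std_logic_vector(20 downto 0);  -- linear word address
    bank     : out std_logic_vector(1 downto 0);
    row      : out std_logic_vector(9 downto 0);
    col      : out std_logic_vector(8 downto 0)
  );
end entity sdram_addr_split;

architecture rtl of sdram_addr_split is
begin
  -- lin_addr(0)           : word within the 2-word burst (column LSB)
  -- lin_addr(2 downto 1)  : bank -> consecutive bursts rotate through the banks
  -- lin_addr(12 downto 3) : row  -> rows advance next (covers refresh as a side effect)
  -- lin_addr(20 downto 13): remaining column bits (change least often)
  bank <= lin_addr(2 downto 1);
  row  <= lin_addr(12 downto 3);
  col  <= lin_addr(20 downto 13) & lin_addr(0);
end architecture rtl;

With this split each burst lands in a different bank from the previous
one, so the ACT for the next bank can be issued while the current bank
is still transferring data, which is what keeps the data bus fully
occupied in the sequence above.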

Regards,
Gabor
 
PeteS wrote:
In terms of bitrates, then I designed a board with serial links at 5Gb/s
*per pair* a couple of years ago. Before that I designed some switches
and gateways with 2.5Gb/s pairs (lots and lots of them). PCI Express has
just released the 5Gb/s signalling revision (within the last month or
so, I believe).
Latest boasting on this is here
http://www.altera.com/corporate/news_room/releases/products/nr-pcie2.html?f=hp&k=wn2

and here

http://www.xilinx.com/prs_rls/2007/silicon_vir/0706_V5PCIe-compliance.htm

I presume there _is_ some important BER threshold to set the distance,
and that this fluff is marketing's effort:

".. reliably transmit data over an unprecedented 30 m "

- how "reliably", exactly? Giving a distance without any error rate is
getting a bit slack...

I also see they spec thus: "5-GT/s PCIe 2.0 specification"

(Transitions per second), which seems more sensible than Gbps.

-jg
 
Jim Granville wrote:
PeteS wrote:

In terms of bitrates, then I designed a board with serial links at
5Gb/s *per pair* a couple of years ago. Before that I designed some
switches and gateways with 2.5Gb/s pairs (lots and lots of them). PCI
Express has just released the 5Gb/s signalling revision (within the
last month or so, I believe).

Latest boasting on this is here
http://www.altera.com/corporate/news_room/releases/products/nr-pcie2.html?f=hp&k=wn2


and here

http://www.xilinx.com/prs_rls/2007/silicon_vir/0706_V5PCIe-compliance.htm

I presume there _is_ some important BER threshold to set the distance,
and that this fluff is marketing's effort:

".. reliably transmit data over an unprecedented 30 m "

- how "reliably", exactly? Giving a distance without any error rate is
getting a bit slack...

I also see they spec thus: "5-GT/s PCIe 2.0 specification"

(Transitions per second), which seems more sensible than Gbps.
Actually, I think it's Transfers per second (which for NRZ data is the
same as the bit rate). That's commonly used in HT (formerly LDT) for the
data rate.


Well, I have seen 2.5Gb/s go 17 metres with 10dB of loss, and that was
5 years ago; the new distance looks to be a single lane, which might
sound good to the marketdroids but isn't much use or very practical.

Note that the later versions of the specifications don't put a limit on
loss, but rather refer to eye closure (at least that's true for IB).

To claim compliance, the link has to exhibit <1E-12 BER, at least for
earlier versions. As I don't have a copy of the PCI-e 2.0 spec, I'll
assume (as a SWAG) that it's still true.

Cheers

PeteS
 
They are in business. They are associated with the university in
Spokane, Washington State, and I think they are as solid as the Rock
of Gibraltar.
Send them an e-mail if you prefer written English.
Or do you need help in German?
Peter Alfke


On Feb 5, 2:00 pm, Kosta Xonis <ChaosKo...@web.de> wrote:
John_H wrote:

They have a phone number, too. Look at the bottom of the page:

I already saw this, but it's overseas (I'm from Germany), and my spoken
English is worse than my written English.

Thanks
[...]

--
XAKiChaos
 
Yes.

I ordered and received a board from them two weeks ago. When I called, they
answered the phone (business hours are posted on the Contacts page), and
answered all the questions I had.
Regards,
SH



"John_H" <newsgroup@johnhandwork.com> wrote in message
news:12sfcq68j64a513@corp.supernews.com...
I haven't talked to them myself in the last 60 days, but they appear to
be going strong. The Spartan-3A starter board just entering the market
is a Digilent product, though not necessarily (initially) sold directly
by them. It's a small outfit that was started (I believe) by a college
professor or two to help supply students with decent hardware. A slow
or missing response may mean that the one person you're trying to
contact is stuck under a deluge of other email or new-product issues.
It's also possible that strange spam filtering had problems with the
email (false positives are a problem sometimes).

If you reply to me directly with quick questions (and where/when you sent
them before) I would happily devote 15 minutes of my morning trying to get
the right people connected. I'd reply directly back to your email 20
hours from now if you'd like to take me up on the offer.


"Kosta Xonis" <ChaosKosta@web.de> wrote in message
news:eq84us$3ij$00$1@news.t-online.com...
Hi !!

I sent an inquiry to Digilent about 3 weeks ago, but have had neither
an answer nor a reply yet.

Same for the 2nd and 3rd tries...

Are they still in business?



THX !

--
XAKiChaos
 
On Feb 5, 9:44 pm, Kosta Xonis <ChaosKo...@web.de> wrote:
...
Same for the 2nd & 3rd try...

Are they still in business ??
...
I sent an e-mail to Digilent support on 20/01/2007 (yes, a Saturday)
and received an answer on 22/01/2007...
I suggest you retry.

Sandro
 
"Weng Tianxiang" <wtxwtx@gmail.com> wrote in message
news:1173140726.435538.84620@30g2000cwc.googlegroups.com...
Hi,
I am very confused by latch generation in VHDL.

1. I have been using VHDL for 7 years and I have never met a situation
where I needed a latch.

Just thought I'd add to this thread that, all other things being equal,
latches generally have less setup time than FFs. This can be important for
some circuits.
HTH, Syms.
 
On Mar 7, 5:57 am, "Symon" <symon_bre...@hotmail.com> wrote:
Just thought I'd add to this thread that, all other things being equal,
latches generally have less setup time than FFs. This can be important for
some circuits.
HTH, Syms.
Symon, I disagree.
The set-up time of a flip-flop is really the set-up time of the master
latch (the last chance for input data to become locked up in the latch
when the clock is going High). The flip-flop's slave latch has nothing
to do with the set-up time. Therefore I see no reason for the flip-flop
to have a longer set-up time than a simple latch. The evidence seems to
be anecdotal...
Peter Alfke
 
"Peter Alfke" <peter@xilinx.com> wrote in message
news:1173287136.833622.29220@v33g2000cwv.googlegroups.com...
On Mar 7, 5:57 am, "Symon" <symon_bre...@hotmail.com> wrote:
Just thought I'd add to this thread that, all other things being equal,
latches generally have less setup time than FFs. This can be important
for
some circuits.
HTH, Syms.

Symon, I disagree.
The set-up time of a flip-flop is really the set-up time of the master
latch (the last chance for input data to become locked up in the latch
when the clock is going High). The flip-flop's slave latch has nothing
to do with the set-up time. Therefore I see no reason for the flip-flop
to have a longer set-up time than a simple latch. The evidence seems to
be anecdotal...
Peter Alfke


Hi Peter,
Yes, thinking about it again, you're absolutely right. I dredged back
through my memory, and the situation I so badly recalled was a
double-data-rate circuit near the toggle limit of the part. I needed to
synchronise the phase alignment between two signals, one in the
rising-edge and one in the falling-edge clock domain. To do this I had
the output of a falling-edge FF feeding the input of a rising-edge FF.
By changing the falling-edge FF to a latch (transparent when the clock
is high), I gained extra set-up time for the rising-edge FF, as the
latch passed data through a little earlier.
Many thanks for correcting me, and putting the record straight.
Best regards, Symon.
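A minimal VHDL sketch of the structure Symon describes (the entity and
signal names are mine, purely illustrative): the latch is transparent
while the clock is high, so data launched in the falling-edge domain
reaches the rising-edge capture FF earlier than it would from a
falling-edge FF's clock-to-Q.

library ieee;
use ieee.std_logic_1164.all;

entity ddr_handoff is
  port (
    clk : in  std_logic;
    d   : in  std_logic;  -- data produced in the falling-edge clock domain
    q   : out std_logic   -- data re-registered in the rising-edge clock domain
  );
end entity ddr_handoff;

architecture rtl of ddr_handoff is
  signal lat : std_logic;
begin
  -- Transparent-high latch: follows d while clk is high, holds it while
  -- clk is low, so its output settles during the high phase instead of
  -- only after the falling clock edge.
  latch_p : process (clk, d)
  begin
    if clk = '1' then
      lat <= d;
    end if;
  end process latch_p;

  -- Rising-edge capture flip-flop in the destination domain.
  capture_p : process (clk)
  begin
    if rising_edge(clk) then
      q <= lat;
    end if;
  end process capture_p;
end architecture rtl;

John_H notes below that the Xilinx timing tools may reference the latch
path to the wrong clock edge, so the margin gained this way needs to be
checked by hand.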
 
"Symon" <symon_brewer@hotmail.com> wrote in message
news:esmu00$mnc$1@aioe.org...
Hi Peter,
Yes, thinking about it again, you're absolutely right. I dredged back
through my memory, and the situation I so badly recalled was a
double-data-rate circuit near the toggle limit of the part. I needed to
synchronise the phase alignment between two signals, one in the
rising-edge and one in the falling-edge clock domain. To do this I had
the output of a falling-edge FF feeding the input of a rising-edge FF.
By changing the falling-edge FF to a latch (transparent when the clock
is high), I gained extra set-up time for the rising-edge FF, as the
latch passed data through a little earlier.
Many thanks for correcting me, and putting the record straight.
Best regards, Symon.

I've come up with this approach myself, subsequently seen it mentioned
in an app note, and now find you've concluded the same circuit is
useful in this situation.

More engineers should be exposed to this one application where a latch
is indispensable. Without it, DDR domains can get messy.

The one sad thing about it, in my opinion, is that the Xilinx timing
tools don't treat this case well. My recollection is that the setup is
referenced to the wrong edge, giving me no chance to get clean numbers
by only including latch_d_q path tracing.

- John_H
 
On Mar 7, 10:41 am, "John_H" <newsgr...@johnhandwork.com> wrote:
"Symon" <symon_bre...@hotmail.com> wrote in message

news:esmu00$mnc$1@aioe.org...

Hi Peter,
Yes, thinking about it again, you're absolutely right. I dredged back
though my memory and the situation I so badly recalled was a double data
rate circuit near the toggle limit of the part. I needed to synchronise
phase alignment between two signals, one in the rising and one in the
falling edge clock domains. To do this I had the output of a falling edge
FF feeding the input of a rising edge FF. By changing the falling edge FF
to a latch (transparent when clock is high), I gained extra set-up time
for the rising edge FF, as the latch passed data through a little earlier.
Many thanks for correcting me, and putting the record straight.
Best regards, Symon.

I've come up with this approach myself, subsequently seen it mentioned in an
app note, and now find you've concluded the same circuit is useful in this
situation.

More engineers should be exposed to this one application where a latch is
indespensible. Without it, DDR domains can get messy.

The one sad thing about it, in my opinion, is that the Xilinx timing tools
don't treat this case well. My recollection is that the setup is referenced
to the wrong edge giving me no chance to get clean numbers by only including
latch_d_q path tracing.

- John_H
Hi John,
Could you please tell me which application note you are talking about?

Thank you.

Weng
 
"Weng Tianxiang" <wtxwtx@gmail.com> wrote in message
news:1173293293.597472.321940@p10g2000cwp.googlegroups.com...
[snip]

Hi John,
Could you please tell me which application note you are talking about?

Thank you.

Weng
I thought it was XAPP250

http://www.xilinx.com/bvdocs/appnotes/xapp250.pdf

but a quick scan of the document and search for "latch" came up empty. Some
of the techniques I've been using lately are mentioned in that article as
are the ones in XAPP671. Ah, there it is. Mentioned on page 4 and
elaborated on page 11:

http://www.xilinx.com/bvdocs/appnotes/xapp671.pdf

Fun with silicon!

- John_H
 
On Mar 7, 12:16 pm, "John_H" <newsgr...@johnhandwork.com> wrote:
"Weng Tianxiang" <wtx...@gmail.com> wrote in message

news:1173293293.597472.321940@p10g2000cwp.googlegroups.com...





On Mar 7, 10:41 am, "John_H" <newsgr...@johnhandwork.com> wrote:
"Symon" <symon_bre...@hotmail.com> wrote in message

news:esmu00$mnc$1@aioe.org...

Hi Peter,
Yes, thinking about it again, you're absolutely right. I dredged back
though my memory and the situation I so badly recalled was a double
data
rate circuit near the toggle limit of the part. I needed to synchronise
phase alignment between two signals, one in the rising and one in the
falling edge clock domains. To do this I had the output of a falling
edge
FF feeding the input of a rising edge FF. By changing the falling edge
FF
to a latch (transparent when clock is high), I gained extra set-up time
for the rising edge FF, as the latch passed data through a little
earlier.
Many thanks for correcting me, and putting the record straight.
Best regards, Symon.

I've come up with this approach myself, subsequently seen it mentioned in
an
app note, and now find you've concluded the same circuit is useful in
this
situation.

More engineers should be exposed to this one application where a latch is
indespensible. Without it, DDR domains can get messy.

The one sad thing about it, in my opinion, is that the Xilinx timing
tools
don't treat this case well. My recollection is that the setup is
referenced
to the wrong edge giving me no chance to get clean numbers by only
including
latch_d_q path tracing.

- John_H

Hi John,
Could you please tell which application note you are talking about.

Thank you.

Weng

I thought it was XAPP250

http://www.xilinx.com/bvdocs/appnotes/xapp250.pdf

but a quick scan of the document and search for "latch" came up empty. Some
of the techniques I've been using lately are mentioned in that article as
are the ones in XAPP671. Ah, there it is. Mentioned on page 4 and
elaborated on page 11:

http://www.xilinx.com/bvdocs/appnotes/xapp671.pdf

Fun with silicon!

- John_H- Hide quoted text -

- Show quoted text -
Hi John,
Thank you for your suggestion.

Silicon is my living and my money.

Weng
 
