EDK : FSL macros defined by Xilinx are wrong

Aurelian Lazarut · Apr 21, 2006

Martin Schoeberl wrote:

I would like to code the on-chip memory in vendor neutral VHDL.
I got it running for a dual-port memory with single clock and
same port sizes for the read and write port.

However, I need a memory with a 32-bit write port and an 8-bit
read port. So far I was not able to code it in VHDL in a way
that the synthesizer infers the correct block RAM without
an extra read MUX.

I'll give up on this vendor-independent block RAM project. For
the 32-bit write data, 8-bit read data with registered address,
in data and unregistered out data RAM coded in VHDL I got:

On the Altera Cyclone: generates a 32-bit dual port RAM with an
external 4:1 MUX. This MUX hurts fmax (from 94MHz down to 84MHz)!

On the Xilinx Spartan-3: The RAM gets implemented as distributed
RAM! Uses a lot of LCs and the fmax goes from 65MHz down to
50MHz.

So I will bite the bullet and use two vendor specific VHDL files.
However, there is one open issue: I want the memory size be
configurable via a generic. This is possible with Altera's
altsyncram.

For Xilinx I only know those RAMB16_S9_S36 components where
the memory size is part of the component name. Is there
a Xilinx block RAM component where I can specify the size?

NO, but you can use GENERATE (assuming VHDL) to switch between different
bram geometries
Aurelian

Thanks,
Martin

--
__
/ /\/\ Aurelian Lazarut
\ \ / System Verification Engineer
/ / \ Xilinx Ireland
\_\/\/

phone: 353 01 4032639
fax: 353 01 4640324

Jochen · Apr 21, 2006

Brad Smallridge wrote:

Hi,

this task, one using the FX output to go from 40 to 140 MHz (7/2), and
another DCM using 2X output to go from 140 to 280. The FX output locks OK
and the divided output looks OK on the scope. The 2X lock output seems to
lock intermittently, trying to go high, and the 2X divided output is
"fuzzy".

See DS099 for Spartan-3, or the jitter calculator for Virtex-II:
the FX output of the first DCM will have a worst-case jitter of
740 ps
(http://www.xilinx.com/applications/web_ds_v2/jitter_calc.htm),
whereas only 300 ps is allowed on the CLKIN input of the second DCM!

Cheers
Jochen

Marco · Apr 21, 2006

Thanks Tim, it works now, I was not drunk, but I did not see it!
Marco

Martin Schoeberl · Apr 21, 2006

The RAMB16 elements are the raw RAM macros. The Sx part of the name indicates the port width. You can build up bigger memories in
Coregen which is a bit like the Altera Megawizard tool or build them up yourself using generic statements using the raw macros.

Can you describe this a little more specifically, please? Are there
other components available to describe Xilinx block RAMs?

BTW: With the web edition I don't have Coregen, and I also don't
use the MegaWizard in Quartus. The ideal setup is a single generic
parameter that gives the memory size (as the width of the address).

If you are looking at switching between vendors, one trick is to hide a RAM inside a wrapper file. If you use the wrapper level as the
RAM component for instantiation then you will only have to change the technology based memory element in one place, i.e. the
wrapper file.

That's the way I do it. I switch between technologies with different
files in the project. I also use these different VHDL files in projects
for other customization: primitive, but efficient.

Some synthesisers are capable of inferring RAM usually using an indexed array of something like VHDL's "std_logic_vector". I can't
tell you much about the results as it isn't my own preferred method but a non-vendor synthesiser may do better than one offered by
the silicon vendors.

The Xilinx tool inferred distributed RAM from the VHDL description.
A thing I definitely don't want. Quartus had problems with the
different port sizes, but single port sizes work very well.

Martin

Martin Schoeberl · Apr 21, 2006

For Xilinx I only know those RAMB16_S9_S36 components where
the memory size is part of the component name. Is there
a Xilinx block RAM component where I can specify the size?

NO, but you can use GENERATE (assuming VHDL) to switch between different bram geometries
Aurelian

Really, that's it? Not very convenient. A plus for Quartus.

Perhaps someone in this group has already done this coding effort
and can provide the VHDL file?

Martin
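
For illustration, the GENERATE approach Aurelian describes might be sketched as below for the 32-bit write / 8-bit read case. This is a rough, unverified sketch and not a tested library file: the entity name xram_w32_r8 and the generic addr_width are made up, only byte-address widths up to 12 are covered, and the nibble split in the 4K branch assumes that the lowest narrow-port address maps to the least significant bits of the wide port. Check that mapping and the primitive port lists against the Xilinx libraries guide before relying on it.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

library unisim;                 -- Xilinx primitive library
use unisim.vcomponents.all;

-- Hypothetical wrapper: 32-bit write port, 8-bit read port,
-- size selected by the width of the byte (read) address.
entity xram_w32_r8 is
    generic (addr_width : integer := 11);
    port (
        clk     : in  std_logic;
        data    : in  std_logic_vector(31 downto 0);
        rd_addr : in  std_logic_vector(addr_width-1 downto 0);
        wr_addr : in  std_logic_vector(addr_width-3 downto 0);
        wr_en   : in  std_logic;
        q       : out std_logic_vector(7 downto 0)
    );
end xram_w32_r8;

architecture xilinx of xram_w32_r8 is
    signal rda : std_logic_vector(11 downto 0);
    signal wra : std_logic_vector(9 downto 0);
begin
    assert addr_width >= 3 and addr_width <= 12
        report "xram_w32_r8: only address widths 3..12 are covered"
        severity failure;

    -- Zero-extend both addresses to the widest geometry used below.
    process(rd_addr, wr_addr)
    begin
        rda <= (others => '0');
        wra <= (others => '0');
        rda(addr_width-1 downto 0) <= rd_addr;
        wra(addr_width-3 downto 0) <= wr_addr;
    end process;

    -- Up to 2K bytes: one block RAM, 2K x 8 read side, 512 x 32 write side.
    g2k : if addr_width <= 11 generate
        ram : RAMB16_S9_S36
            port map (
                DOA => q, DOPA => open, ADDRA => rda(10 downto 0),
                CLKA => clk, DIA => x"00", DIPA => "0",
                ENA => '1', SSRA => '0', WEA => '0',
                DOB => open, DOPB => open, ADDRB => wra(8 downto 0),
                CLKB => clk, DIB => data, DIPB => "0000",
                ENB => '1', SSRB => '0', WEB => wr_en);
    end generate;

    -- 4K bytes: two block RAMs, each 4K x 4 on the read side and
    -- 1K x 16 on the write side; one holds the low nibble of every
    -- byte, the other the high nibble (assumed little-endian mapping).
    g4k : if addr_width = 12 generate
        signal lo, hi : std_logic_vector(15 downto 0);
    begin
        split : for i in 0 to 3 generate
            lo(4*i+3 downto 4*i) <= data(8*i+3 downto 8*i);
            hi(4*i+3 downto 4*i) <= data(8*i+7 downto 8*i+4);
        end generate;

        ram_lo : RAMB16_S4_S18
            port map (
                DOA => q(3 downto 0), ADDRA => rda, CLKA => clk,
                DIA => "0000", ENA => '1', SSRA => '0', WEA => '0',
                DOB => open, DOPB => open, ADDRB => wra, CLKB => clk,
                DIB => lo, DIPB => "00",
                ENB => '1', SSRB => '0', WEB => wr_en);

        ram_hi : RAMB16_S4_S18
            port map (
                DOA => q(7 downto 4), ADDRA => rda, CLKA => clk,
                DIA => "0000", ENA => '1', SSRA => '0', WEA => '0',
                DOB => open, DOPB => open, ADDRB => wra, CLKB => clk,
                DIB => hi, DIPB => "00",
                ENB => '1', SSRB => '0', WEB => wr_en);
    end generate;
end xilinx;
```

Larger memories would need further branches (for example more, narrower block RAMs), which is exactly the coding effort asked about above.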

Martin Schoeberl · Apr 21, 2006

Hi Olaf,

I'll give up on this vendor-independent block RAM project. For
the 32-bit write data, 8-bit read data with registered address,

Don't give up ;-)

Maybe the code attached from my project will help you. Using configurations you can choose the architecture.

Not so bad! The inferring architecture generates a block RAM
for the Spartan-3. However, with Quartus it generates registers.
A little step forward ;-)

The next step is to generate a memory with different port sizes.
The attached code instantiates a block RAM with a MUX in Quartus
and distributed memory with the Xilinx tool.

--
--  gen_mem.vhd
--
--  VHDL memory experiments
--
--  address, data in are registered
--  data out is unregistered
--
--

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity jbc is
    generic (jpc_width : integer := 10);
    port (
        clk     : in  std_logic;
        data    : in  std_logic_vector(31 downto 0);
        rd_addr : in  std_logic_vector(jpc_width-1 downto 0);
        wr_addr : in  std_logic_vector(jpc_width-3 downto 0);
        wr_en   : in  std_logic;
        q       : out std_logic_vector(7 downto 0)
    );
end jbc;

--
--  registered wraddress, wren
--  registered din
--  registered rdaddress
--  unregistered dout
--
architecture rtl of jbc is

    constant nwords : integer := 2**(jpc_width-2);
    type mem is array(0 to nwords-1) of std_logic_vector(31 downto 0);
    signal ram_block : mem;

    signal d : std_logic_vector(31 downto 0);

    signal rda_reg : std_logic_vector(jpc_width-1 downto 0);

begin

    d <= ram_block(to_integer(unsigned(rda_reg(jpc_width-1 downto 2))));

    process(clk)
    begin

        if rising_edge(clk) then
            if wr_en='1' then
                ram_block(to_integer(unsigned(wr_addr))) <= data;
            end if;

            rda_reg <= rd_addr;

        end if;
    end process;

    process(rda_reg, d)
    begin
        case rda_reg(1 downto 0) is
            when "11" =>
                q <= d(31 downto 24);
            when "10" =>
                q <= d(23 downto 16);
            when "01" =>
                q <= d(15 downto 8);
            when "00" =>
                q <= d(7 downto 0);
            when others =>
                null;
        end case;
    end process;

end rtl;

Adrian Knoth · Apr 21, 2006

JustJohn <john.l.smith@titan.com> wrote:

Hi Adrian,

Hi!

There is a lot wrong here, and I can't teach you all about H/W design
vs. S/W coding, but will try to touch on some key points. I had a look

Thank you so much, these were exactly the points I needed. They've
guided me on the (more or less complete) rewrite.

All the problems here stem from a common source, thinking that a H/W
process is like a S/W procedure or function, but it is not. Concentrate
on the basic way process works in synthesizable code: Whenever an event
occurs on a member of the sensitivity list, the process is entered,
statements run 'in an instant', and the process exits.

This was necessary to understand and it works now. I'll ask
my professor to slightly change the basic vhdl course[0];
we do not do a lot, just combine some flip flops, but never
realize that there are statements which compile but won't
work (and yes, I thought the VHDL compiler is as strict
as an Ada compiler)

Thank you (and the others) for your patience.

[0] You cannot call this a course, it is only a laboratory with
let's say five hours for VHDL in total

--
mail: adi@thur.de http://adi.thur.de PGP: v2-key via keyserver

Alt-F4 ist die Grundlage jeden vernünftigen Arbeitens mit Windows.

Brad Smallridge · Apr 21, 2006

Wow. That's it exactly. I inverted the first DCM lock and fed it to the
second DCM reset and all the clocks scope good now. I'll play around with
adding a delay, as you suggested, later.

Thank you ever so much.

Brad Smallridge
aivison.com

Therefore, I think you need to hold the second DCM in reset until the
first DCM locks (this is mentioned in several app-notes).
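
A minimal sketch of that reset scheme, including the extra delay mentioned above, could look like this. The entity name, port names and the stages generic are hypothetical, and the delay length is an arbitrary placeholder, not a value taken from any app-note. It assumes clk_in is a free-running clock (for example the 40 MHz input) and treats the LOCKED signal as synchronous enough to sample directly, which a real design may want to double-register.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical helper: keeps the second DCM in reset until the first
-- DCM has reported lock for a few cycles of the input clock.
entity dcm_cascade_reset is
    generic (stages : integer range 2 to 64 := 8);  -- delay in clk_in cycles
    port (
        clk_in      : in  std_logic;   -- free-running input clock
        dcm1_locked : in  std_logic;   -- LOCKED output of the first DCM
        dcm2_rst    : out std_logic    -- to the RST input of the second DCM
    );
end dcm_cascade_reset;

architecture rtl of dcm_cascade_reset is
    -- All ones after configuration, so the second DCM starts in reset.
    signal sr : std_logic_vector(stages-1 downto 0) := (others => '1');
begin
    process(clk_in)
    begin
        if rising_edge(clk_in) then
            if dcm1_locked = '0' then
                sr <= (others => '1');              -- first DCM lost lock
            else
                sr <= sr(stages-2 downto 0) & '0';  -- shift the release through
            end if;
        end if;
    end process;

    dcm2_rst <= sr(stages-1);
end rtl;
```

With the shift register removed this reduces to dcm2_rst <= not dcm1_locked, which is the plain inversion described in the post.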

John Adair · Apr 21, 2006

Abhishek

I don't know what country you want but it might be worth looking at the
Institute for System Level Integration in Scotland which is associated with
Alba Centre in Livingston. There are courses and post-grad opportunities
that are certainly near what you want. Alba centre website is here
http://www.scottish-enterprise.com/albacentre.htm .

John Adair
Enterpoint Ltd. - Home of MINI-CAN. The FPGA CAN Bus Development Board.
http://www.enterpoint.co.uk

"ABS" <abhishekbedi@gmail.com> wrote in message
news:1133082999.335991.288870@g44g2000cwa.googlegroups.com...

Hi All

I want to do research in "Processor Core Development". Can anyone
suggest any current topics for a year-long Master's-level
research project?
I would be grateful!
Otherwise I would like to do research in developing "Applications for FPGAs
/ CPLDs".

Cheers

Abhishek Bedi

Peter Alfke · Apr 21, 2006

Al, Philip gave you good advice:
For each input pin, you can specify a delay from the pad to the O. The
granularity (given a 200 MHz calibration frequency) is 78.125 ps, but
each tap has its own non-cumulative error of about 15 ps.
I would improve your accuracy by using 16 inputs, each having a
different IDELAY value, so that you divide the 5 ns into 16 steps of
312 ps each (give or take a 15 ps non-accumulative error). The tap
delays are unaffected by any jitter of the 200 MHz clock.
You interconnect all 16 inputs. When an edge comes in, it will be
delayed differently in each IDELAY, and you use your 200 MHz clock to
register a 16-bit input word which has ones on one end, and zeros on
the other.
It's then your job to find the transition point (look-up-tables are
good for that), and that 4-bit binary value identifies the time as a
fraction of your 5 ns timing (200 MHz)
This means you have an absolute time for the rising as well as for the
falling edge, and the difference is your pulse width. Worst-case error
is thus +/- one tap.
Peter Alfke, from home
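
As a rough illustration of the "find the transition point" step: with 64 taps of 78.125 ps spanning the 5 ns period, 16 evenly spaced inputs sit 4 taps apart. The decoder below is only a sketch under assumptions; the entity name edge_locate and its ports are made up, and it assumes sample(0) comes from the input with the smallest IDELAY setting and sample(15) from the largest.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical decoder: locates the 1/0 boundary in a 16-bit word
-- captured from 16 differently delayed copies of one input.
entity edge_locate is
    port (
        sample : in  std_logic_vector(15 downto 0);
        found  : out std_logic;             -- '1' if the word contains an edge
        pos    : out unsigned(3 downto 0)   -- lowest i with sample(i) /= sample(i+1)
    );
end edge_locate;

architecture rtl of edge_locate is
begin
    process(sample)
    begin
        found <= '0';
        pos   <= (others => '0');
        -- Walk from high to low so the lowest boundary is assigned last
        -- and therefore wins.
        for i in 14 downto 0 loop
            if sample(i) /= sample(i+1) then
                found <= '1';
                pos   <= to_unsigned(i, 4);
            end if;
        end loop;
    end process;
end rtl;
```

The 4-bit pos value is the fraction of the 5 ns period in sixteenths; a coarse counter running on the same 200 MHz clock would supply the upper bits of the timestamp.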

Jim Granville · Apr 21, 2006

Peter Alfke wrote:

Al, Philip gave you good advice:
For each input pin, you can specify a delay from the pad to the O. The
granularity (given a 200 MHz calibration frequency) is 78.125 ps, but
each tap has its own non-cumulative error of about 15 ps.
I would improve your accuracy by using 16 inputs, each having a
different IDELAY value, so that you divide the 5 ns into 16 steps of
312 ps each (give or take a 15 ps non-accumulative error).

Are the pin-captures within this 15ps window, or is that just the
error of the delay elements themselves ?

The tap delays are unaffected by any jitter of the 200 MHz clock.
You interconnect all 16 inputs. When an edge comes in, it will be
delayed differently in each IDELAY, and you use your 200 MHz clock to
register a 16-bit input word which has ones on one end, and zeros on
the other.
It's then your job to find the transition point (look-up-tables are
good for that), and that 4-bit binary value identifies the time as a
fraction of your 5 ns timing (200 MHz)
This means you have an absolute time for the rising as well as for the
falling edge, and the difference is your pulse width. Worst-case error
is thus +/- one tap.
Peter Alfke, from home

This sounds like a good app-note...., Peter ?

Such an app note could also cover :
a) If you use just one FPGA pin (eg existing PCB design), what are the
alternatives ?

b) Trickiest portion of this, I can see, will be crossing the 'phase
boundary' between the delay line capture, and the counter-capture.
Edge detect flag could be as simple as Sample.0 <> Sample.15.

For the Calibrate Philip mentions, and this ease of edge detect, the
delay block should be toleranced to be always greater than the clock -
ie maybe 6ns for 200MHz.

The Clock can be scaled, to match the FPGAs ability to count/capture
the edges - which will be related a little to the max time between edges
- longer counters are slower

c) Pattern detect might need to be single sample error tolerant.
ie a pattern of 111110100000000 might occur ?

-jg
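
Points (b) and (c) could be sketched as follows. Counting the ones in the captured word, instead of searching for the first boundary, gives the same result whether or not two neighbouring samples are swapped, so a single-sample error like the pattern in (c) does not move the answer. This is an illustration only; the entity name and ports are invented, and it assumes the ones sit at the sample(0) end for the edge polarity of interest (invert the word first for the other polarity).

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical decoder that tolerates a single misplaced sample.
entity bubble_tolerant_decode is
    port (
        sample : in  std_logic_vector(15 downto 0);
        edge   : out std_logic;             -- Sample.0 <> Sample.15
        ones   : out unsigned(4 downto 0)   -- number of '1' samples, 0..16
    );
end bubble_tolerant_decode;

architecture rtl of bubble_tolerant_decode is
begin
    process(sample)
        variable n : unsigned(4 downto 0);
    begin
        n := (others => '0');
        for i in sample'range loop
            if sample(i) = '1' then
                n := n + 1;
            end if;
        end loop;
        ones <= n;
    end process;

    -- An edge fell inside the window if the two ends of the line differ.
    edge <= sample(0) xor sample(15);
end rtl;
```

The edge flag relies on the delay line being at least as long as one clock period, as suggested above; otherwise an edge could fall between two capture windows.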

Martin Schoeberl · Apr 21, 2006

I have done some, although I haven't covered all the various options -
unfortunately within work time, so I can't post them :-(

That's fine, not everything can be open-source ;-)
And this work should be done by Xilinx (as you wrote
in the other post).

I based it on the ideas in David Kessner's Free-IP RAM library if that
helps... I can't seem to find it out there on the web anymore.

The wayback machine has it though...
http://web.archive.org/web/20040519060445/http://www.free-ip.com/
http://web.archive.org/web/20040605072636/www.free-ip.com/ramlib/index.html

But the .zip files are not archived :-(

Martin

Peter Alfke · Apr 21, 2006

The 15 ps are what I remember as the difference between the ideal delay
from pin to O vs the measured delay, because the taps are not perfectly
equal. As a difference between further non-adjacent taps, this
statistical error actually gets smaller. The total delay over the 64
bits is exactly 5 ns = one period of the 200 MHz clock. It is
servo-controlled. The 200 MHz are allowed to vary by +/-10%, (causing
of course an inversely proportional change in tap delay) although that
is not described in the data sheet.
I could imagine calibrating this with a variable frequency input of
<<200 MHz, effectively measuring the half-period of the incoming
signal. Any discontinuities could be attributed to wrong tap-settings
and/or different pc-board-to-chip (package) delays. This can of course
be remedied by changing individual tap settings (The design in question
uses only 25% of the available tap settings). Sampling errors as Jim
showed should be impossible, once the design is properly adjusted.

I think IDELAY is one of the most exciting innovations in Virtex-4
(together with the FIFO controller).
Peter Alfke, from home.

Philip Freidin · Apr 21, 2006

On Mon, 28 Nov 2005 03:15:36 -0600, alastairlynch@blueyonder.co-dot-uk.no-spam.invalid (al99999) wrote:

Thanks for all your help. One quick last question, is it possible to
internally connect the pins, or do I need to physically wire them up
external to the fpga? Thanks again,

Alastair

You could bring the signal in on 1 pin, and then setup 8 other
I/Os as bi directional, and send the signal out on all 8, and
then bring it back in on those 8, with the IDELAY stuff. Doing
this will make the external pins wiggle, so they would all have
to be "no connection" externally.

Overall, I would not recommend this structure, as you will not
have good control of the delay to each of the output circuits,
and this would therefore add to the error in timing.

I think it is best to distribute the signal on your PCB.

Philip

Mike Treseler · Apr 21, 2006

Monica wrote:

We are confused how the FPGA drives logic on the pins that we do not use in
the design. Are we missing something?

By default, some versions of Quartus route signals through unused pins to
save internal resources.

Can anybody please give us a hint how to solve this problem?

One way is to declare and describe exactly what you expect these pins to do:
a <= '0'; b <= '0'; c <= '0'; ...
Or
c <= a and b and ...;

The other way is to change the synthesis
default setting for unused pins.

-- Mike Treseler
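
The first option might look like this in a complete design. The entity, the spare port and the toggling LED are invented for the example; only pins that are declared in the top-level entity can be driven this way, and pins left out of the entity are still governed by the tool's unused-pin setting.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical top level: led is used by the design, the spare pins are not.
entity top is
    port (
        clk   : in  std_logic;
        led   : out std_logic;
        spare : out std_logic_vector(3 downto 0)
    );
end top;

architecture rtl of top is
    signal led_reg : std_logic := '0';
begin
    process(clk)
    begin
        if rising_edge(clk) then
            led_reg <= not led_reg;
        end if;
    end process;

    led   <= led_reg;
    spare <= (others => '0');   -- say exactly what the spare pins should do
end rtl;
```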

henn_xxx@trispel.org · Apr 21, 2006

Hi Monica,

In the Quartus II software, open the Assignments menu and select
"Device". A settings window pops up; press the
"Device & Pin Options..." button. Another window pops up. One of its
tabs is called "Unused Pins". That is what you are probably looking
for.

Alternatively, another good way is of course Mike's hint of
assigning unused pins directly to a specific value, e.g. 0.

HTH
Henning

Michael Dreschmann · Apr 21, 2006

Hi,

thanks for your code. I think I've found a solution:
The read and write pointers will be implemented in Gray code. Then
I'll decode them to binary code and multiply by 18. The multiplication
should be a simple addition, so it should not use many resources. (*18 =
*16 + *2)
But I have one last question:
If I compare the two Gray-coded pointers, can I be sure that no glitch
appears on the full or empty signals? Or do I have to consider something else?

Thanks,
Michael
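
The two conversion steps described here can be sketched as a small package. The package and function names are made up for the illustration; it only shows the Gray-to-binary decode and the shift-and-add multiplication, and says nothing about whether the full/empty comparison is glitch-free.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical helper package for the pointer arithmetic.
package fifo_ptr_pkg is
    function gray_to_bin(g : std_logic_vector) return std_logic_vector;
    function times18(x : unsigned) return unsigned;
end package fifo_ptr_pkg;

package body fifo_ptr_pkg is

    -- Binary bit i is the XOR of all Gray bits from the MSB down to i.
    function gray_to_bin(g : std_logic_vector) return std_logic_vector is
        variable gv : std_logic_vector(g'length-1 downto 0) := g;
        variable b  : std_logic_vector(g'length-1 downto 0);
    begin
        b(b'high) := gv(gv'high);
        for i in b'high-1 downto 0 loop
            b(i) := b(i+1) xor gv(i);
        end loop;
        return b;
    end function gray_to_bin;

    -- x*18 = x*16 + x*2: two shifts and one adder, no multiplier.
    -- The result is 5 bits wider than x, so it cannot overflow.
    function times18(x : unsigned) return unsigned is
        constant w : natural := x'length + 5;
    begin
        return shift_left(resize(x, w), 4) + shift_left(resize(x, w), 1);
    end function times18;

end package body fifo_ptr_pkg;
```

For example, times18(unsigned(gray_to_bin(rd_ptr_gray))) would give the RAM offset for a Gray-coded read pointer, where rd_ptr_gray is a placeholder signal name.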

Hal Murray · Apr 21, 2006

There can be a problem, when DONE is released by the FPGA, if its risetime
is too slow. From what I understand, this is only a problem if there are
more than one FPGA's DONE pins tied together.

Who/what gets confused if DONE rises slowly? Slow relative to what?

--
The suespammers.org mail server is located in California. So are all my
other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's. I hate spam.

Bob · Apr 21, 2006

"Hal Murray" <hmurray@suespammers.org> wrote in message
news:k5udnR_VhaxAUxbenZ2dnUVZ_t2dnZ2d@megapath.net...

There can be a problem, when DONE is released by the FPGA, if its risetime
is too slow. From what I understand, this is only a problem if there are
more than one FPGA's DONE pins tied together.

Who/what gets confused if DONE rises slowly? Slow relative to what?

--

Hal,

DONE, after it's been released, becomes an input. An option in the bitstream
(via bitgen) can be setup so it looks to see if DONE's being held low, after
configuration, and if so, holds off the activation of its guts. I believe
that the reason for this is if there are other devices that are being
configured, and their DONE pins are tied together, that activation waits
until the last device has been configured before they start running.

The slowness of the risetime somehow affects a device's ability to sense the
release of DONE. This is what I was told. The effect is (at least what we
saw when we experienced this problem) that the device never starts up. Both
solutions worked, for us -- the stronger pullup worked and the push/pull
option worked. We use the push/pull option, now, since we never tie DONE
pins together.

I'm not a Xilinx employee, and I don't play one on tv, so if you're really
interested I'm sure that there is some documentation on their website. Where
are Peter/Austin when you need them? They've probably made so much damn
money that they only, now, answer the easy questions. Ingrates!

Bob

fjh-mailbox-38@galois.com · Apr 21, 2006

Bob Perlman wrote:

Fergus Henderson wrote:
Mike Treseler writes:
Problem 1.
There are ten times as many software designers
as digital hardware designers.
Solution 1:
Develop high-level languages for hardware design. Make these similar
enough to existing software development practices that developers only
need a general understanding of hardware optimization techniques (e.g.
pipelining, resource sharing, etc.), available hardware resources (e.g.
LUTs and BlockRAMs), and how high-level language constructs map onto those
hardware resources. Then one hardware engineer can easily train up 10
software engineers to the level of hardware knowledge that they need in
order to be able to productively develop efficient hardware using a
high-level language.

Would it be possible to do just the opposite, and create a high-level
language that lets a digital designer write efficient,
high-performance software the same way he'd design hardware? Because
I'd like to become an expert programmer without expending much effort.

I'm not suggesting that becoming an expert hardware designer isn't
going to take effort. But currently popular hardware design tools are
in the stone age in comparison to software design tools. The amount of
effort required to implement even very simple functionality in
synthesizable VHDL/Verilog is huge -- much higher than the effort
required to implement the same functionality in software.

Becoming an expert warrior certainly takes effort, regardless of
whether your weapon of choice is a sharpened stone axe or an AK47. But
that's not a good reason to stick with stone axes.

--
Fergus J. Henderson "I have always known that the pursuit
Galois Connections, Inc. of excellence is a lethal habit"
Phone: +1 503 626 6616 -- the last words of T. S. Garp.
