V2p block ram clock -> Q delay help

  • Thread starter Matthew E Rosenthal

Matthew E Rosenthal

Hi all,
I have a long combinational path in my FPGA design and I am looking for
ways to shorten it. One of the biggest contributors is the clock-to-Q
delay of the memory that drives some of the inputs to the path. The
memory (block RAM) is currently very wide and not deep.
Is there a way to optimize the size or any other parameters to decrease
the clock-to-Q time?

Thanks

Matt
 
Pipelining is the most obvious and most popular way to reduce long delays.
When it can be used, it is great...
Peter Alfke

From: Matthew E Rosenthal <mer2@andrew.cmu.edu>
Organization: Carnegie Mellon, Pittsburgh, PA
Newsgroups: comp.arch.fpga
Date: Wed, 5 May 2004 18:50:30 -0400 (EDT)
Subject: V2p block ram clock -> Q delay help

 
Unfortunately that cannot be implemented. I was hoping for something
specific to the BRAM clock-to-Q delay.


Matt
On Wed, 5 May 2004, Peter Alfke wrote:

Pipelining is the most obvious and most popular way to reduce long delays.
When it can be used, it is great...
Peter Alfke

 
Matthew E Rosenthal wrote:
Unfortunately that cannot be implemented. I was hoping for something
specific to the BRAM clock-to-Q delay.

The BlockRAM CLK -> Q can only be improved by going to a faster speed
grade. If that is out of the question and adding latency is not an
option, then you have two choices: either look elsewhere to reduce the
path delays, or do not use the BlockRAM. You mention that you do not
need the RAM very deep but do need it very wide. Have you considered
using LUT-based RAM (RAM16X1S)? You can configure LUT-based RAM fairly
easily in depths of 16, 32 and 64 words, and you will see a better
CLK -> Q than in the BlockRAMs. On top of that, you will likely see
better placement for wider buses, since the bits are not all tied
together the way they are in a BlockRAM.

Also, LUT-RAMs have asynchronous reads. If you want to keep that clock
cycle of read latency, you can either add a register to the output of
the RAM in the same slice, which gets the latency back and still gives a
good CLK -> Q, or push that register deeper into your critical path,
which may balance the registers in that path better and thus give much
better timing.

You can configure the LUT-RAMs to depths greater than 64 words, but you
start to consume a lot of LUT resources and the trade-off is not as
great. My suggestion: if you can get by with a depth of 64 words or
less, you might as well go to LUT-RAM. If you need deeper RAMs, stay in
the BlockRAM and look at reducing routing delays (you can try adding
placement constraints, replicating registers/logic, higher effort levels
in Map/Par, etc.) or logic levels for those critical paths (try harder
synthesis constraints/options, re-coding that section of the design,
etc.).
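
To get a feel for how quickly LUT-RAM consumes LUTs as the depth grows, here is a rough sketch in Python. It assumes one LUT holds 16 words x 1 bit (as in RAM16X1S) and counts storage LUTs only; the function name and the 72-bit example width are made up for illustration, so treat the result as a lower bound rather than a fitter report.

```python
import math

LUT_DEPTH = 16  # assumption: one RAM16X1S stores 16 words x 1 bit in a single LUT


def lut_ram_luts(width_bits: int, depth_words: int) -> int:
    """Rough storage-LUT count for a single-port LUT-RAM of the given shape.

    Multiplexing and write-enable decode for depths above 16 are ignored,
    so the real cost can be higher than the number returned here.
    """
    if width_bits <= 0 or depth_words <= 0:
        raise ValueError("width and depth must be positive")
    luts_per_bit = math.ceil(depth_words / LUT_DEPTH)
    return width_bits * luts_per_bit


if __name__ == "__main__":
    # Hypothetical 72-bit-wide memory at several depths.
    for depth in (16, 32, 64, 256):
        print(f"72 x {depth}: at least {lut_ram_luts(72, depth)} LUTs")
```

The count scales linearly with both width and depth, which is why the trade-off gets worse once the depth goes much past 64 words.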

Good Luck,

-- Brian

 
That sounds like trouble: a design with no room for pipelining?

One thing I want to point out (Ray mentioned it before?): you need to do manual placement so that the flip-flop sits next to the block RAM; automatic placement sometimes fails to do that.
 
Yup. Autoplace virtually never puts the flip-flop adjacent
to the block RAM. Unless it is in a critical feedback loop,
you should be able to pipeline.

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759
 
The clock to Q of the BRAMs is what it is, and it is longer than the clock to
Q of the flip-flops in the fabric. The best solution is to pipeline the BRAM
outputs by adding a register. For maximum performance, that register should
be placed immediately adjacent to the block RAMs to minimize the routing
delays out of the block RAM. The automatic placement does an exceptionally
poor job at placing pipeline registers on BRAM outputs, so in order to have
them be of much use you have to do a little floorplanning.

Now you mentioned you can't afford to pipeline the design, which I'll trust
you on for the moment. If that is the case, then you'll have to live with the
long clock to Q from the BRAM, although it doesn't mean you also have to live
with the routing delays to the logic you have connected to it, nor
necessarily the propagation delay through that logic.

First, look at the logic connected to the BRAM outputs. Is it designed for
minimum propagation delay to the next flip-flop? Is there anything you can do
to reduce the number of LUTs it passes through? Are you using the carry chain
(the carry chain can be expensive in terms of propagation delay)?

Next, look at your timing report. It enumerates how much of the delay between
the BRAM and the flip-flop is attributed to logic and how much to routing, and
gives you the delay for each net in the path. You need to reduce those delays
by placing the logic as close to the BRAMs as you can get it. If your design
is like many novice FPGA designs, your signal goes through several LUTs before
reaching a flip-flop. Each LUT has a flip-flop with it, so pipelining comes
for free if you can afford the latency, but I assume you know that.

Anyway, the automatic placer does all right with placing one level of logic
(levels of logic are the number of LUTs the signal passes through between
flip-flops), but when there are two or more levels of logic, the placer does
quite poorly, often placing the LUTs far away from the direct path between the
flip-flops. What you need to do is constrain the placement of the flip-flops
as well as all the logic between the flip-flops and the BRAM so that it is
kept as close to the BRAM as practical. An area constraint on that logic will
help, although the ultimate performance will come from hand-placing that
critical logic.
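
As a rough illustration of the logic-versus-routing split a timing report gives you, here is a small Python sketch. The class name and every delay value in it are made up for illustration and are not taken from any data sheet; it only shows how the BRAM clock-to-Q, logic, routing and setup components add up against the clock period.

```python
from dataclasses import dataclass


@dataclass
class PathDelay:
    """Components of one BRAM-to-flip-flop path, all in nanoseconds."""
    clk_to_q: float  # BRAM clock-to-output delay (fixed for a given speed grade)
    logic: float     # sum of LUT/carry delays along the path
    routing: float   # sum of net delays along the path
    setup: float     # setup time of the destination flip-flop

    def total(self) -> float:
        return self.clk_to_q + self.logic + self.routing + self.setup

    def slack(self, period_ns: float) -> float:
        """Positive slack means the path meets the clock period."""
        return period_ns - self.total()

    def routing_fraction(self) -> float:
        """Share of the total path delay that is spent in routing."""
        return self.routing / self.total()


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    path = PathDelay(clk_to_q=2.5, logic=1.5, routing=4.0, setup=0.5)
    period = 8.0
    print(f"total {path.total():.2f} ns, slack {path.slack(period):+.2f} ns")
    print(f"routing share {path.routing_fraction():.0%}")
```

When the routing share dominates, as in this invented example, placement constraints are the first thing to try; when the logic share dominates, reducing levels of logic is the better lever.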

Another consideration is that the automatic router in recent versions of ISE
has gotten lazy compared to the router in versions from 2 years ago. The
current router no longer finds the shortest route between well-placed logic;
rather, it stops optimizing each route as soon as the route is under the timing
constraint. The result is that every route winds up being a critical route, and
in dense high-performance designs you get congestion, so the router can't find
a solution that meets timing. Running the router multiple times in the
reentrant mode will sometimes improve the results, but usually will not achieve
the level of performance you can get with a hand route or, in the case of
VirtexI devices, what you could achieve with the version 3 sp8 tools. If
placement constraints alone don't get your timing to where it needs to be, you
can try doing some hand routing of that circuit using FPGA Editor. At the very
least, that will tell you how much performance is possible, and if the level of
performance you seek is possible with your circuit, it may be the only way to
reach it given the current state of the tools without further changes to your
design.


Matthew E Rosenthal wrote:

What the heck are you talking about?

--
--Ray Andraka, P.E.
 
What the heck are you talking about?

On Thu, 13 May 2004, thangkho wrote:

That sounds like trouble: a design with no room for pipelining?

One thing I want to point out (Ray mentioned it before?): you need to do
manual placement so that the flip-flop sits next to the block RAM; automatic
placement sometimes fails to do that.
 
