Area Optimization

Christopher Head

Hi all,
I have a design (written in VHDL) targeting the Spartan 6 series, and
it's oversubscribed for LUTs. Can anyone recommend good resources to
read? I've already spent a little time looking around the design in
ISE's schematic viewer, but with tens of thousands of LUTs it's not
exactly a fast process going from that angle, and if possible I'd
rather avoid getting into a lot of explicit instantiation of primitives.

I've already read the ISE documentation on how to write expressions
that the synthesizer can recognize as particular patterns, but
unfortunately most of my design is just brute-force combinational logic
(a lot of basic boolean operations and additions on fairly wide values)
arranged into a pipeline, so the special patterns don't really apply (I
don't have counters, or RAMs, or shift registers, or what have you).

This is with ISE 13.1, in case it matters, the most recent AFAIK.

I do have the option of moving to a larger chip if necessary, but would
strongly prefer not to as the one I'm using is the largest supported by
WebPack. I've looked at chips in other families, and WebPack seems to
top out at similar LUT counts in all the families.

Thanks!
Chris
 
One of the tricks, which I don't believe the tools will do
automatically, is to use the BRAMs in place of logic. That is, use
a BRAM as a big look-up table. Since BRAMs are synchronous, you
have to fit it in with your pipeline logic, but that shouldn't
be so hard to do.

-- glen
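
As a rough behavioural illustration of the BRAM-as-look-up-table idea (a Python model, not synthesizable HDL), the sketch below precomputes every output of a small combinational function and reads it back through a registered port. The function `stage_logic`, the 8-bit width, and the `SyncRom` class are assumptions made up for this example. A table needs 2^n entries for n input bits, so this only suits narrow slices of logic.

```python
def stage_logic(x: int) -> int:
    """Hypothetical 8-bit combinational function standing in for real logic."""
    return ((x ^ (x >> 3)) + 0x5A) & 0xFF


# Contents the block RAM would be initialised with: one word per input value.
ROM_CONTENTS = [stage_logic(addr) for addr in range(256)]


class SyncRom:
    """Model of a synchronous-read memory: data appears one clock after the address."""

    def __init__(self, contents):
        self.contents = list(contents)
        self.dout = 0  # output register, assumed to start at zero

    def clock(self, addr: int) -> None:
        # Rising clock edge: the addressed word is captured in the output register.
        self.dout = self.contents[addr]


def run(inputs):
    """Return the ROM output seen in each cycle, plus one final flush cycle."""
    rom = SyncRom(ROM_CONTENTS)
    seen = []
    for x in inputs:
        seen.append(rom.dout)  # value visible before this cycle's clock edge
        rom.clock(x)
    seen.append(rom.dout)
    return seen


if __name__ == "__main__":
    print(run([0x00, 0x3C, 0xFF]))
```

The one-cycle delay between address and data is the part that has to be absorbed into the surrounding pipeline, for example by letting the memory's output register take the place of an existing pipeline register.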
 
Chris

There are a pile of techniques that can reduce size, and a lot depends on the
original HDL design and coding style. We do this as one of our
services and I have seen designs reduced to 40% of the original size
in some extreme cases. Obtaining a 20% reduction, to 80% of the
original, is more typical.

As with any engineering problem, the first thing to do is to identify
where your problem might be. I would typically use Floorplanner to
identify which modules in your design are the largest. The largest
probably has the best chance of giving you the most savings.

On the simple level, try both speed-driven and area-driven synthesis. Area
mode does not always give the smallest result. You can also try different
synthesisers to get different results if you have those available to
you. Typically you might get 5-10% out of these techniques, but I have
seen some extreme synthesiser results giving a 3x variation on some
logic.

One other thing on synthesis that can make a reasonable difference is
the setting for your state machine encoding. Try playing with different
settings. If the XST switch isn't broken again, try anything but One
Hot encoding. XST programmers have a fixation for One Hot encoding and
it only gives the best results in less than 25% of designs.

Moving to the next level, and much more extreme, is to look at your HDL.
Here you can look for shift registers that can go to SRL16/32
technology in Xilinx parts. That can save a lot. Old techniques like
using illegal states in a state machine to reduce the logic decoded can
also be beneficial. Other techniques like using RAM for multiple
related registers may also get you a reduction.

John Adair
Enterpoint Ltd. - Home of Drigmorn4. The Spartan-6 FPGA Embedded
processor Board.

 
Have you turned on the area optimization control? Most synthesizers
have a trade off between speed and area. Most of the time they seem
to default to optimizing for speed. That can easily get you 10% in
most designs.

As to techniques, first you need to find out where your LUTs are being
used. Rather than using tools for that, compile your code one module
at a time or in smaller groups of modules. I usually code from the
bottom up and test every module in the simulator. So it is not hard
to also do a compile and see how large each one is. Then you might be
able to see which ones are larger than you expect and can look at how
to improve them.

Rick
 
Xilinx white paper WP231 is a good read. It is mainly for speed but shows
why doing things like using an asynchronous reset is a really bad plan for
both speed and area.

If you really don't care about speed then have you considered converting
your parallel data paths into serial? Serial adders are really really
small.


John Eaton
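
To make the serial-adder suggestion concrete, here is a small Python behavioural model (not HDL) of a bit-serial adder: one full adder plus a carry flip-flop, processing one bit per clock, least significant bit first. The function name and the `width` parameter are illustrative assumptions, not anything from the original design.

```python
def serial_add(a: int, b: int, width: int) -> int:
    """Add two unsigned values one bit per clock, LSB first, modulo 2**width."""
    carry = 0   # the single carry flip-flop, cleared before the first bit
    result = 0  # models the shift register that collects the sum bits
    for i in range(width):  # one loop iteration corresponds to one clock cycle
        a_bit = (a >> i) & 1
        b_bit = (b >> i) & 1
        sum_bit = a_bit ^ b_bit ^ carry
        carry = (a_bit & b_bit) | (carry & (a_bit ^ b_bit))
        result |= sum_bit << i
    return result


if __name__ == "__main__":
    print(serial_add(200, 100, 8))
```

The logic is a single full adder regardless of operand width; the price is that a W-bit addition takes W clocks instead of one, which is why this only helps when throughput is not the constraint.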






---------------------------------------
Posted through http://www.FPGARelated.com
 
Have you first established which parts of your design are responsible for the
most LUT usage?

If not, I wrote FPGAOptim when I was in a similar situation to help with just
that:

http://www.conekt.co.uk/capabilities/49-fpga-optim

Drop me an email via that webpage and I'll get a download link to you.

Alternatively, these days PlanAhead can provide a view on LUT usage, and the
logfiles also have some information.

Once you know which blocks to optimise, you've had good answers from others
already. In my most recent case (a video processing application) there are
sections of code which only have to update once per video line; they are prime
targets for resource sharing.

As John Adair said, reducing by 20% is usually easily doable. With deep
knowledge of what's going on and the tradeoffs that are acceptable, I've
achieved 40-50% in the past.

Cheers,
Martin

--
martin.j.thompson@trw.com
TRW Conekt - Consultancy in Engineering, Knowledge and Technology
http://www.conekt.co.uk/capabilities/39-electronic-hardware
 
Hi John,

Thanks for that pointer. I have always been a believer in using the
async reset and now I see that this may not always be the best way to
reset a design. But the devil is in the details. I wonder if this
still applies to non-Xilinx designs?

Rick
 
It applies to all designs. Designers who started their careers with
asynchronous logic carried it with them when Design for Synthesis and
synchronous design became a requirement, but it has never been the best
choice. Many designers make the mistake of thinking that because they need
an asynchronous reset system they must design it using asynchronous
logic. That is simply not true. We design synchronous systems that are
black box equivalent to asynchronous systems all the time. The main thing
that you need to realize about reset system design is that the purpose of
the reset system is not to reset the system when a trigger event occurs.
Its purpose is to NOT reset the system when a trigger event is NOT
occurring.

The same is true for airbag controllers. The job of an airbag controller is
not to deploy the bag when the car is in an accident; its job is to not
deploy the bag when the car is not having an accident. Any system where the
expected number of uses is small and the effect of each use is large will
follow this rule.

Remember the 1st Star Wars movie? They built the Death Star with an emergency
exhaust port that provided a direct path from the reactor core to the
surface. It was ray shielded but could not be particle shielded. Bad plan.

An asynchronous reset has a direct path from a pad into every flip-flop in
the entire chip. It is analog shielded but not digitally shielded. Bad
plan.

Resets in a real product (not a simulation) are really rare events. If a
reset is delayed by 20 microseconds then nobody will notice. If a product
that you are using suddenly resets itself then you will likely notice.
Spend a few hundred cycles on a digital filter before you do something
drastic.



John Eaton
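
The "few hundred cycles on a digital filter" idea can be sketched as a behavioural Python model (not HDL). The class name `ResetFilter`, the default threshold of 256 clocks, and the assumption that the raw reset pin has already been synchronised to the clock are all choices made for this illustration only.

```python
class ResetFilter:
    """Assert the filtered reset only after `threshold` consecutive asserted samples."""

    def __init__(self, threshold: int = 256):
        self.threshold = threshold
        self.count = 0  # saturating counter of consecutive asserted samples

    def clock(self, raw_reset: bool) -> bool:
        """Sample the (already synchronised) reset input once per clock."""
        if raw_reset:
            if self.count < self.threshold:
                self.count += 1
        else:
            self.count = 0  # any deasserted sample restarts the count
        return self.count >= self.threshold


if __name__ == "__main__":
    filt = ResetFilter(threshold=4)
    pattern = [1, 1, 1, 0, 1, 1, 1, 1, 1, 0]  # a 3-cycle glitch, then a real reset
    print([filt.clock(bool(v)) for v in pattern])
```

A glitch shorter than the threshold never reaches the rest of the design, while a genuine reset is only delayed by `threshold` clocks, which is the trade the post argues nobody will notice.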





 
John
the best is to design to never reset!
You can create a design that will work with no resets at all. The problem
is that the verification suite will take a few eons to finish.

John


 
Most FPGAs do an asynchronous reset on all FFs at the end of
configuration. I don't believe that is optional.

-- glen
 
Interesting philosophy.

Rick
 
I believe that is optional for any given FF. The GSR has to be
enabled on each FF and that is the point of the white paper. In
Xilinx devices, using the GSR uses one of the set/reset inputs on a FF
as an async input, which also configures the other input as async,
IIRC. The tools are capable of using the Set and Reset inputs as
synchronous inputs to reduce the LUT usage and improve the speed of
a design... in some cases.

As to the philosophical avoidance of async resets, I can't say I share
that belief. As you point out, there is one async reset on the chip
that you can't eliminate, the PROGRAM pin. Even if it doesn't reset
the FFs, it will stop the design from working and reload all the LUTs
and memory.

It has been a long time since I used a Xilinx part, so I may not
remember them 100% correctly.

Rick
 
Lots of interesting advice here! In particular I read the Xilinx
whitepaper with interest. Unfortunately, a lot of the advice seemed to
be inapplicable to my problem. I can't look for the individual
submodule that's taking up most of the area, because my application is
a single long pipeline with a large number of very similar stages: the
area isn't taken up by any one stage, but more by the number of stages.
And because the design is a pipeline with general logic (mostly
bitwise, plus a small bit of basic arithmetic) between registers, I
don't really see any opportunities for special primitives like SRLs,
DSPs, or the like that would reduce area. I can probably solve my
problem by building a smaller pipeline and reusing it; I preferred not
to do that as it will decrease system performance but it looks like I
don't have much choice now.

Thanks anyway!
Chris
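
The "smaller pipeline, reused" fallback described above can be modelled in a few lines of Python (a behavioural sketch, not HDL). The stage function, the 32-bit width, and the per-stage constants are hypothetical stand-ins, since the thread never shows the real stage logic; the point is only that folding gives the same result while trading area for passes.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit data path


def stage(x: int, k: int) -> int:
    """Hypothetical stage: bitwise logic plus a small addition."""
    return ((x ^ k) + (x >> 1)) & MASK


def full_pipeline(x: int, constants) -> int:
    """One physical stage per constant: a new result can emerge every clock."""
    for k in constants:
        x = stage(x, k)
    return x


def folded_pipeline(x: int, constants, hw_stages: int):
    """Only `hw_stages` physical stages; data recirculates until all are applied."""
    passes = 0
    for start in range(0, len(constants), hw_stages):
        for k in constants[start:start + hw_stages]:
            x = stage(x, k)
        passes += 1  # each pass occupies the hardware for one trip through it
    return x, passes


if __name__ == "__main__":
    consts = list(range(1, 17))
    print(full_pipeline(12345, consts), folded_pipeline(12345, consts, 4))
```

With 16 logical stages folded onto 4 physical ones, the stage logic shrinks to roughly a quarter, but each item needs 4 passes, so throughput drops by about the same factor (ignoring the extra muxing and control the folded version needs).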
 
Something to remember about Xilinx FPGAs, at least when designing in VHDL
and synthesizing with XST, is that you can specify the initial value of
registered signals (when declaring the signal in the declarative part of
the architecture). This is sometimes considered bad practice (bad coding
style) in other contexts, and may not be supported by other tool flows.


 
I would start by saying that the biggest opportunities for savings are
almost always by starting at the algorithm level. You'll only get so
far by playing with implementation.

One suggestion might be to look for places where you could do 'double
clocking', i.e. generate a 2x clock with the DCM and run a particular
piece of logic twice per cycle, muxing the inputs and distributing the
outputs. We have some designs that were multiplier limited, so we
used this trick, as our main pipeline was slow enough to use one
multiplier to do double duty per pipeline stage.

Some other tricks: use multipliers as shifters if you have them
spare. See if you can rejigger your pipeline stages. Some of the
older parts (Virtex-2 or so) have dedicated BUFT primitives that you
can use to reduce the number of logic elements in multiplexers.

Look at and understand the logic usage reports from the synthesizer.
If a module gets generated with more f/fs than you think it should,
it's good to dig in and figure out what got generated. For XST there
is a tool or option that will show a schematic of synthesized logic;
this can be handy.
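
The multiplier-as-shifter trick mentioned above rests on a simple identity: shifting left by n is the same as multiplying by 2^n, so a spare hardware multiplier fed with a one-hot constant can replace a barrel shifter built from LUTs. A minimal Python check of that identity follows; the function name and the 16-bit width are assumptions chosen for the example.

```python
def shift_left_via_multiply(x: int, shift: int, width: int = 16) -> int:
    """Left-shift `x` by `shift` bits using a multiplication, truncated to `width` bits."""
    mask = (1 << width) - 1
    one_hot = 1 << shift  # a small decoder would turn the shift amount into 2**shift
    return (x * one_hot) & mask


if __name__ == "__main__":
    print(hex(shift_left_via_multiply(0x0123, 4)))
```

Only the small decoder that produces the one-hot operand has to live in LUTs; the wide shifting itself happens inside the multiplier.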
 
I really like the fact that you can initialize rams as well. You no longer
need to think in terms of rams or roms, you have a universal read/writable
rom for everything.

Need a screen buffer for your display? Create a startup screen image file
and have that loaded as well. Need some boot/test code? Load it in at
startup and then reuse that memory later.

This stuff is great!!

John


 
For systolic arrays, which I will guess that the OP is working on,
that often doesn't help. You could speed up the whole thing by
a factor of two, though.

-- glen
 

You can't avoid 100% of all async reset flops but you can easily do the
99.999% where sync will give you a smaller, faster design and your design
is still a black box equivalent to using the async reset.

With Xilinx parts every flop with an async reset wastes 1 LUT over a sync
reset. In ASIC design every async reset flop doubles the number of
endpoints needing timing closure from 1 to 2. If you do a really lousy job
in designing your reset distribution then these async paths could become
critical paths and start taking routing resources away from your other more
important paths.

Async resets on flops are nothing but trouble.

John


 
