LUT6 FPGAs and Carry Logic

Jan Bruns

Hello.

Some questions about Xilinx LUT6 FPGAs (my WebPack Toolchain is
a little outdated, and the newer LUT6-FPGAs don't seem to show
up correctly in fpga_editor).

* Is there really no carry-bypass option in LUT6-paths like the
CYMUX(any,cin,1) living in LUT4 paths, apart from
constraining the LUTs?

* SLICEMs and SLICELs always have a carry chain, while SLICEXs
neither have a RAM nor a carry option?

* Within a CLB, SLICEMs are paired with SLICEX (if there are
SLICEXs in the device)? Sounds strange to me: If a LUT is
configured to be dynamic, it is probably very likely that
additional Carry logic isn't used, compared to static LUTs
(with LUT4s, one rare reason for this is using the carry chain
to implement a post-invert option for the RAM...). Have you ever
seen a dynamic LUT6 really gain something in also using carry?

* What about production? Does it look like Xilinx might stop selling
and developing new LUT4-FPGAs in the near future? I personally
don't have enough overview about these two FPGA classes, so I
can't see the detailed pros and cons.

Regards,

Jan Bruns

--
A few photos: http://abnuto.de/gal/
 
Jan Bruns <jansaccount@arcor.de> writes:

Hello.

Some questions about Xilinx LUT6 FPGAs (my WebPack Toolchain is
a little outdated, and the newer LUT6-FPGAs don't seem to show
up correctly in fpga_editor).
The datasheets and user manuals show everything you would need, I think,
though... see UG384 for example. Pages 9-11 show the various slices in
some detail.

* Is there really no carry-bypass option in LUT6-paths like the
CYMUX(any,cin,1) living in LUT4 paths, apart from
constraining the LUTs?
I'm not sure what you mean by constraining the LUTs. There are various
muxes shown in Fig 3,4,5 - can you achieve what you want with them?

* SLICEMs and SLICELs always have a carry chain, while SLICEXs
neither have a RAM nor a carry option?
Correct

* Within a CLB, SLICEMs are paired with SLICEX (if there are
SLICEXs in the device)? Sounds strange to me: If a LUT is
configured to be dynamic, it is probably very likely that
additional Carry logic isn't used, compared to static LUTs
(with LUT4s, one rare reason for this is using the carry chain
to implement a post-invert option for the RAM...). Have you ever
seen a dynamic LUT6 really gain something in also using carry?
It seems to me that there are "big slices", "medium slices" and "small
slices" - the silicon area taken up by the carry chain may well be
"free" compared to the rest of the big/medium slices.

Additionally, SLICEMs can be used for dynamic filter-coefficient
storage, the arithmetic logic is also useful then.

Xilinx will have pushed an awful lot of existing and potential designs
through this architecture and decided it's a win overall. Whether it's a
win for your particular designs and style is immaterial to them (unless
you are an *enormous* customer!)

* What about production? Does it look like Xilinx might stop selling
and developing new LUT4-FPGAs in the near future?
Selling... I doubt they'll stop selling Spartan 3 (for example) for a
very long time yet - Xilinx have a long history of keeping old families
going for many many years after it was sensible to design them into new
systems.

Developing... Spartan 3 (and E, A, ADSP) was the last LUT4 generation, so
yes, I think it's stopped!

My understanding of the Series 7 goal is to make as much of the
user-visible logic as possible identical across the three ranges (Artix,
Kintex and Virtex). There are differences in power/speed tradeoffs and
the mix of memory, DSP, gigabit IO, logic etc. But the fundamental
blocks are the same throughout. Unlike in the V5/S3 era when the LUTs,
DSPs, BRAMs, IOs were all different between the two families!

I personally don't have enough overview about these two FPGA classes,
so I can't see the detailed pros and cons.
I'm not sure there's much to care about pros and cons. LUT6 is here,
unless you want to design with relatively old chips.

Cheers,
Martin

--
martin.j.thompson@trw.com
TRW Conekt - Consultancy in Engineering, Knowledge and Technology
http://www.conekt.co.uk/capabilities/39-electronic-hardware
 
Martin Thompson:
Jan Bruns:

Some questions about Xilinx LUT6 FPGAs (my WebPack Toolchain is a
little outdated, and the newer LUT6-FPGAs don't seem to show up
correctly in fpga_editor).

The datasheets and user manuals show everything you would need, I think,
though... see UG384 for example. Pages 9-11 show the various slices in
some detail.
Thanks. I've been checking the wrong datasheets then.

* Is there really no carry-bypass option in LUT6-paths like the
CYMUX(any,cin,1) living in LUT4 paths, apart from constraining the
LUTs?

I'm not sure what you mean by constraining the LUTs. There are various
muxes shown in Fig 3,4,5 - can you achieve what you want with them?
The select-Input of the main Carry-Select Muxes is directly connected
to the LUT-output, without an option to put another signal on it.
If you make use of the Carry Logic, the function you put on the LUT
will always become part of the Carry calculation.

Xilinx LUT4 FPGAs had an option to make the main CarrySelect Mux always
forward the cin to cout, no matter what the LUT said. This was pretty
useful, because it was possible to make relatively huge logic feed
the Carry Chain, without ever crossing CLB boundaries.

For example, within a SLICE, it was possible to have one LUT act as
16-bit RAM, and have it added (or whatever) to some external value on
the other LUT. The RAM-LUTs output was not expected to directly
connect to the Carry Logic, but had relatively fast routes to
the arithmetic then.

However, I'd expect many reasons to use "partially populated"
carry chains to be gone with LUT6.
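
To make the difference concrete, here is a minimal behavioural sketch in Python of a carry chain built from carry muxes. The names `muxcy`, `add_with_chain`, `chain_carry_out` and the `forced` parameter are made up for illustration. `forced` stands in for the LUT4-era option of tying a stage's select to 1 so that cin is always forwarded. This is a model of the behaviour described above, not of any device's actual configuration bits.

```python
def muxcy(di, ci, sel):
    """Carry mux: forward the incoming carry when sel is 1, else take di."""
    return ci if sel else di


def add_with_chain(a, b, width):
    """Ripple adder where each stage's LUT output drives the mux select."""
    carry, total = 0, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        propagate = ai ^ bi                  # LUT output, wired to the select
        total |= (propagate ^ carry) << i    # sum bit (XOR with incoming carry)
        carry = muxcy(ai, carry, propagate)
    return total | (carry << width)


def chain_carry_out(lut_outputs, data_ins, cin, forced=frozenset()):
    """Carry out of a chain; stages listed in `forced` ignore their LUT
    output and always pass the incoming carry on (hypothetical bypass)."""
    carry = cin
    for i, (lut_out, di) in enumerate(zip(lut_outputs, data_ins)):
        sel = 1 if i in forced else lut_out
        carry = muxcy(di, carry, sel)
    return carry
```

Without `forced`, the only way to get the same pass-through in this model is for the LUT itself to output a constant 1, which is the "constraining the LUTs" workaround from the original question, and that LUT is then unavailable for other logic.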

* Within a CLB, SLICEMs are paired with SLICEX (if there are
SLICEXs in the device)? Sounds strange to me: If a LUT is configured
to be dynamic, it is probably very likely that additional Carry logic
isn't used, compared to static LUTs (with LUT4s, one rare reason for
this is using the carry chain to implement a post-invert option for
the RAM...). Have you ever seen a dynamic LUT6 really gain something
in also using carry?

It seems to me that there are "big slices", "medium slices" and "small
slices" - the silicon area taken up by the carry chain may well be
"free" compared to the rest of the big/medium slices.
Hmn, sounds like that's only one theory of yours.

Additionally, SLICEMs can be used for dynamic filter-coefficient
storage, the arithmetic logic is also useful then.
Hm, what about details, then?
A dynloadable LUT1 calculating "external signal xor stored bit"?

Xilinx will have pushed an awful lot of existing and potential designs
through this architecture and decided it's a win overall.
Whether it's a win for your particular designs and style is immaterial
to them (unless you are an *enormous* customer!)
Compared to what? LUT4 vs. LUT6, given the same silicon process?
What would you expect the term "win" to represent, then?

I don't believe there's no market for LUT4 FPGAs using current
silicon process.

* What about production? Does it look like Xilinx might stop selling
and developing new LUT4-FPGAs in the near future?

Selling... I doubt they'll stop selling Spartan 3 (for example) for a
very long time yet - Xilinx have a long history of keeping old families
going for many many years after it was sensible to design them into new
systems.

Developing... Spartan 3(and E,A,ADSP) was the last LUT4 generation, so
yes, I think it's stopped!

I personally don't have enough overview about these two FPGA classes,
so I can't see the detailed pros and cons.

I'm not sure there's much to care about pros and cons. LUT6 is here,
unless you want to design with relatively old chips.
Argh. So all these valuable customers have to rework all parts of
their highly optimized, huge module database, just because Xilinx
engineers thought it might be less work for them to ever put LUT6 in
silicon?

Regards,

Jan Bruns

--
A few photos: http://abnuto.de/gal/
 
On Feb 15, 10:21 pm, Jan Bruns <jansacco...@arcor.de> wrote:

I don't believe there's no market for LUT4 FPGAs using current
silicon process.
Market: Maybe.
Reasonable facts to support such an architecture: No.

The problem here is that users tend to evaluate the capabilities of an
FPGA mainly as logic, while really you pay mostly for routing. Logic is
a very small portion of the silicon area. Of course the vendors don't
publish the numbers, but university research suggests the area of LUT
and LUT configuration is only a few percent of total area.

Therefore when going from 4-LUT to 6-LUT you don't get a 4x area
increase (16 entries to 64 entries) but more like a 60% increase (going
from 4 inputs that must be routed to 6 inputs that must be routed in a
somewhat worse than linear routing area).
This is offset by the fact that routing now gets a lot simpler.
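
The arithmetic behind that estimate can be sketched in a few lines of Python. The power-law form and the function names are assumptions made for illustration. The exponent is simply back-calculated from the "60%" figure quoted above and is not a measured value.

```python
import math


def lut_bits(k):
    """Number of configuration bits in a k-input LUT."""
    return 2 ** k


def routed_area_ratio(k_old, k_new, exponent):
    """Area ratio if area follows (number of routed inputs) ** exponent.
    The power law is an assumed model, not vendor data."""
    return (k_new / k_old) ** exponent


# Table size alone: 64 entries versus 16 entries.
bits_ratio = lut_bits(6) / lut_bits(4)

# Exponent that would turn 4 -> 6 inputs into a 1.6x area increase.
implied_exponent = math.log(1.6) / math.log(6 / 4)

print(f"LUT bits grow by {bits_ratio:.0f}x")
print(f"implied routing exponent: {implied_exponent:.2f}")
```

Under this model an exponent only slightly above 1 already gives the quoted 60%, which fits the "somewhat worse than linear" wording.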

Routing increases faster than linear with the number of wires.
Therefore with bigger FPGAs the percentage of logic goes down. The
optimum LUT size therefore tends to go up with technology improvements.

Research shows that the efficiency curve for FPGA technologies is
relatively flat around the optimum. I.e., for a given technology there
are multiple LUT sizes that get you almost the same area efficiency.
Because performance tends to be better for the larger LUTs and because
the software runtimes go down for larger LUTs (mapping is polynomial
time, routing exponential), a typical design decision would be to choose
the largest LUT size within the flat region of the curve, expecting that
future implementations of the architecture would move the optimum spot
in that direction.

This is exactly what FPGA vendors did:
In the early 90s the sweet spot was consistently shown to be between
3-LUTs and 4-LUTs, so most vendors chose 4-LUTs.

Newer research shows the flat region to go from 4-LUTs to 6-LUTs.
While 4-LUTs probably would still be a good choice, it is clear that
there must be a switch to 6-LUTs at some time, and one might just as
well do the switch now, getting much better EDA software run times.

Kolja Sulimma
cronologic.de

Kolja Sulimma
cronologic.de
 
Jan Bruns <jansaccount@arcor.de> writes:

Martin Thompson:
Jan Bruns:

* Is there really no carry-bypass option in LUT6-paths like the
CYMUX(any,cin,1) living in LUT4 paths, apart from constraining the
LUTs?

I'm not sure what you mean by constraining the LUTs. There are various
muxes shown in Fig 3,4,5 - can you achieve what you want with them?

The select-Input of the main Carry-Select Muxes is directly connected
to the LUT-output, without an option to put another signal on it.
If you make use of the Carry Logic, the function you put on the LUT
will always become part of the Carry calculation.

Xilinx LUT4 FPGAs had an option to make the main CarrySelect Mux always
forward the cin to cout, no matter what the LUT said. This was pretty
useful, because it was possible to make relatively huge logic feed
the Carry Chain, without ever crossing CLB boundaries.

For example, within a SLICE, it was possible to have one LUT act as
16-bit RAM, and have it added (or whatever) to some external value on
the other LUT. The RAM-LUTs output was not expected to directly
connect to the Carry Logic, but had relatively fast routes to
the arithmetic then.

However, I'd expect many reasons to use "partially populated"
carry chains to be gone with LUT6.
Yes, I agree. No doubt there will be *some* designs which don't work out
so well in the newer architectures.

* Within a CLB, SLICEMs are paired with SLICEX (if there are
SLICEXs in the device)? Sounds strange to me: If a LUT is configured
to be dynamic, it is probably very likely that additional Carry logic
isn't used, compared to static LUTs (with LUT4s, one rare reason for
this is using the carry chain to implement a post-invert option for
the RAM...). Have you ever seen a dynamic LUT6 really gain something
in also using carry?

It seems to me that there are "big slices", "medium slices" and "small
slices" - the silicon area taken up by the carry chain may well be
"free" compared to the rest of the big/medium slices.

Hmn, sounds like that's only one theory of yours.
Well, yes, it is - you'll have to wait for someone from Xilinx for
anything better than that :)

Additionally, SLICEMs can be used for dynamic filter-coefficient
storage, the arithmetic logic is also useful then.

Hm, what about details, then?
Well, I only offer it as a possibility (haven't done an actual
comparison), but distributed arithmetic FIR filters were what I was
thinking of.
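
For readers who have not met distributed arithmetic, here is a minimal Python sketch of the idea: a small table of coefficient sums (the part that would live in LUT RAM) plus a shift-and-add accumulator (the part that would use the carry chain). The function names are made up, and samples are assumed to be unsigned to keep the example short. Signed inputs need an extra correction for the sign bit, which is left out here.

```python
def build_da_table(coeffs):
    """Entry `addr` holds the sum of the coefficients whose bit is set in addr."""
    n = len(coeffs)
    return [
        sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
        for addr in range(1 << n)
    ]


def da_dot(table, samples, bits):
    """Bit-serial dot product of unsigned `samples` with the table's coefficients."""
    acc = 0
    for b in range(bits):
        # Gather bit b of every sample into a table address.
        addr = 0
        for k, x in enumerate(samples):
            addr |= ((x >> b) & 1) << k
        # Weighted accumulate: this is the adder that sits on the carry chain.
        acc += table[addr] << b
    return acc


coeffs = [3, -1, 4, 2]
samples = [5, 7, 0, 9]
table = build_da_table(coeffs)
result = da_dot(table, samples, bits=4)
direct = sum(c * x for c, x in zip(coeffs, samples))
print(result, direct)
```

Reloading `table` changes the filter coefficients without touching the rest of the logic, which is where a writable LUT and an adder in the same slice would both be in use. A table addressed by six taps has 64 entries, the same depth as the 64x1 RAM mode discussed in this thread.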

Xilinx will have pushed an awful lot of existing and potential designs
through this architecture and decided it's a win overall.
Whether it's a win for your particular designs and style is immaterial
to them (unless you are an *enormous* customer!)

Compared to what? LUT4 vs. LUT6, given the same silicon process?
What would you expect the term "win" to represent, then?
Don't ask me - I'm not making the decisions. Ultimately, Xilinx
presumably decided it was a "win" in business terms: "We'll make the
most money doing it this way."

I don't believe there's no market for LUT4 FPGAs using current
silicon process.
No-one is saying there is not a market. Just that it's not big enough
for Xilinx to be targeting it.

* What about production? Does it look like Xilinx might stop selling
and developing new LUT4-FPGAs in the near future?

Selling... I doubt they'll stop selling Spartan 3 (for example) for a
very long time yet - Xilinx have a long history of keeping old families
going for many many years after it was sensible to design them into new
systems.

Developing... Spartan 3(and E,A,ADSP) was the last LUT4 generation, so
yes, I think it's stopped!

I personally don't have enough overview about these two FPGA classes,
so I can't see the detailed pros and cons.

I'm not sure there's much to care about pros and cons. LUT6 is here,
unless you want to design with relatively old chips.

Argh. So all these valuable customers have to rework all parts of
their highly optimized, huge module database,
That's progress :)

This is how bare-metal-assembly-language programmers felt as processors
developed and their highly-tuned routines needed to be rewritten. Of
course, the processors were faster and compilers were better, so the
smart ones just wrote straightforward, portable C-code which turned out
to be good-enough most of the time. And that code was much more
re-usable.

just because Xilinx engineers thought it might be less work for them
to ever put LUT6 in silicon?
I'm sure it wasn't done on a whim! There are sound business reasons for
how it's been done. Sounds like they just don't fit what you'd like :(

Cheers,
Martin

--
martin.j.thompson@trw.com
TRW Conekt - Consultancy in Engineering, Knowledge and Technology
http://www.conekt.co.uk/capabilities/39-electronic-hardware
 
Martin Thompson <martin.j.thompson@trw.com> wrote:

(snip)
Don't ask me - I'm not making the decisions. Ultimately, Xilinx
presumably decided it was a "win" in business terms: "We'll make the
most money doing it this way."
Well, they do have some competition. If they don't design
and build what works for their customers, they will lose out.

I don't believe there's no market for LUT4 FPGAs using current
silicon process.

No-one is saying there is not a market. Just that it's not
big enough for Xilinx to be targeting it.
As I understand it, 6LUT is better for larger chips.

For smaller ones, it likely doesn't make so much difference.
There is some advantage as far as synthesis software of
keeping a minimum number of different architectures.

Still, 4LUT chips should be around for a while.

-- glen
 
Kolja Sulimma:
The problem here is that users tend to evaluate the
capabilities of an FPGA mainly as logic, while really
you pay mostly for routing. Logic is a very small
portion of the silicon area. Of course the vendors
don't publish the numbers, but university research
suggests the area of LUT and LUT configuration is
only a few percent of total area.
That's what I expected. This becomes pretty obvious
if you imagine a LUT2 FPGA, where everyone should
intuitively understand that the entire silicon would
be filled up with routing resources. And LUT4 can't
be far off.

Therefore when going from 4-LUT to 6-LUT you don't
get a 4x area increase (16 entries to 64 entries)
but more like a 60% increase (going from 4 inputs
that must be routed to 6 inputs that must be routed
in a somewhat worse than linear routing area).
So let's compare Spartans:

Spartan6 LUT6: about 7 ins, about 3 outs = 10 ports
Spartan3 Slice: about 10 ins, about 6 outs = 16 ports

Where the port count for the Spartan3 Slice doesn't
include the FXMUX path, but the full XB/YB (I doubt
this path has/needs full routing caps, anyway).

So from what you said about area with taking routing
resources into account, the Spartan3 Slice might very
well consume a little more area, although it has only
about half the SRAM bits.

What do we get for that?

For SLICEL, I think of:

Function                          LUT4                     LUT6
2*any 4 inp-func                  yes                      no
2*any 4 inp-func, paired invert   yes                      no
any 5-inp func                    yes                      yes
any 6-inp func                    no                       yes
MUX4                              yes                      yes
half/partial populated Carry      yes                      no
2 Bit full Adder                  yes                      yes
2 Bits of long Adder              yes                      one, but 2?
2 Bits of long MulAdder           yes                      one, but 2?
1 Bit ALU (fast Carry)            maybe                    maybe
--with dual Ext-feedin            yes (paired with DPram)  no
Large Chain Logic                 8 Bit/Slice              6 Bit/LUT
DblLUTed Chain Logic              no, BX only              yes


For SLICEM, I also think of:
Function                          LUT4                     LUT6
64x1 RAM                          no                       yes
32x2 RAM                          no                       yes
32x1 RAM                          yes                      yes
16x2 RAM                          yes                      yes
16x1 RAM+Adder                    yes                      no

Well, for the SLICEM part, the LUT6 might be a better
choice, but for SLICEL, I'd still prefer the LUT4,
given 50% area overhead, although I do miss having a
few more static MUXes and FF paths
(independent clock inverters, or something).

Regards,

Jan Bruns

--
A few photos: http://abnuto.de/gal/
 
On Feb 16, 7:07 am, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
Martin Thompson <martin.j.thomp...@trw.com> wrote:

(snip)

Don't ask me - I'm not making the decisions. Ultimately, Xilinx
presumably decided it was a "win" in business terms: "We'll make the
most money doing it this way."

Well, they do have some competition. If they don't design
and build what works for their customers, they will lose out.

I don't believe there's no market for LUT4 FPGAs using current
silicon process.
No-one is saying there is not a market. Just that it's not
big enough for Xilinx to be targeting it.

As I understand it, 6LUT is better for larger chips.

For smaller ones, it likely doesn't make so much difference.
There is some advantage as far as synthesis software of
keeping a minimum number of different architectures.

Still, 4LUT chips should be around for a while.

-- glen
I believe that is what it comes down to. Given the fact that routing
is a huge percentage of the chip area (and so cost) this becomes a
more important factor as the chips get larger. After all, routing
does go up at a faster rate than linear. So minimizing routing is
more important in larger chips. The tradeoff provides for lower costs
with LUT6 in larger devices.

The other side of the coin is more "wasted" logic when larger LUTs are
underutilized. So it would seem that we have reached the point where
the LUT6 is optimal for many if not the vast majority of designs.

I don't know that there is a performance penalty in using LUT6. I
would expect that is minimal since the muxes in the LUTs are done with
transmission gates with very little delay, but I don't really know.
If so, the only issue then becomes cost. So if your design is one of
the minority designs that can indeed be done more efficiently in a
LUT4 architecture, then you will pay a bit more for a LUT6 based
part... but given the advantages of smaller feature size you will
likely get lower costs with the newer parts than sticking with an old
generation.

As to design reworks required to optimize a design for a newer part, I
expect that would be done for speed and/or cost. My experience is
that Xilinx is more than willing to help you with that, especially if
it means a design win over a competitor. But would anyone really
expect much lost ground from a LUT4 design to a current LUT6 design?
Software changes can greatly impact results, but I can't see needing
to touch a design from a Spartan 3 to get it to run well in a newer
device given the large improvements in the hardware from using a much
smaller process. I suppose if you have used hard constraints you may
have to remove them. But you knew the risk when you used those
features, no?

Rick
 
rickman <gnuarm@gmail.com> wrote:

(snip, I wrote)
For smaller ones, it likely doesn't make so much difference.
There is some advantage as far as synthesis software of
keeping a minimum number of different architectures.
(snip)
I believe that is what it comes down to. Given the fact that routing
is a huge percentage of the chip area (and so cost) this becomes a
more important factor as the chips get larger. After all, routing
does go up at a faster rate than linear. So minimizing routing is
more important in larger chips. The tradeoff provides for lower
costs with LUT6 in larger devices.

The other side of the coin is more "wasted" logic when larger LUTs are
underutilized. So it would seem that we have reached the point where
the LUT6 is optimal for many if not the vast majority of designs.
One that I am interested in, though, is that 6LUT should be
much better for building the MUX needed for barrel shifters.
A 4LUT makes a two input MUX, but 6LUT can make a 4 input
(and two select line) MUX. Other than that, I haven't thought
much about how useful different sizes are. The less logic
between FF's, the less advantage to larger ones.
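
A small Python sketch may help show why the 4-input mux fits exactly into one 6-input LUT. It builds the 64-bit truth table for a 4:1 mux and counts mux levels for a barrel shifter. The helper names and the input ordering (data on inputs 0 to 3, selects on inputs 4 and 5, table bit index equal to the input address) are assumptions for illustration. Check the vendor documentation for the real pin and INIT conventions.

```python
import math


def lut_init(func, n_inputs):
    """Truth table of `func` as an integer; bit `addr` is the output for
    the input combination whose bits spell out `addr` (input 0 = LSB)."""
    init = 0
    for addr in range(1 << n_inputs):
        inputs = [(addr >> i) & 1 for i in range(n_inputs)]
        if func(*inputs):
            init |= 1 << addr
    return init


def mux2(d0, d1, s):
    """2:1 mux: 3 inputs, so it fits a 4-input LUT with one input spare."""
    return d1 if s else d0


def mux4(d0, d1, d2, d3, s0, s1):
    """4:1 mux: 4 data + 2 select = 6 inputs, exactly one 6-input LUT."""
    return (d0, d1, d2, d3)[s0 | (s1 << 1)]


def shifter_levels(shift_amount_bits, select_bits_per_level):
    """Mux levels needed when each level consumes some shift-amount bits."""
    return math.ceil(shift_amount_bits / select_bits_per_level)


print(hex(lut_init(mux2, 3)))   # 3-input truth table for the 2:1 mux
print(hex(lut_init(mux4, 6)))   # 64-bit truth table for the 4:1 mux
print(shifter_levels(5, 1), shifter_levels(5, 2))  # 32-bit shifter: 2:1 vs 4:1 muxes
```

For a 32-bit shifter (5 shift-amount bits) this gives 5 levels of 2:1 muxes against 3 levels of 4:1 muxes, which is the kind of gain described above.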

I don't know that there is a performance penalty in using LUT6. I
would expect that is minimal since the muxes in the LUTs are done with
transmission gates with very little delay, but I don't really know.
If so, the only issue then becomes cost. So if your design is one of
the minority designs that can indeed be done more efficiently in a
LUT4 architecture, then you will pay a bit more for a LUT6 based
part... but given the advantages of smaller feature size you will
likely get lower costs with the newer parts than sticking with an old
generation.
Well, they have to be designed not to glitch when switching
between entries with the same output value. That doesn't
naturally happen with an SRAM. Also, with transmission gates
you can't go through too many without a buffer, but presumably
that is part of optimizing the cell.

-- glen
 
On Feb 16, 4:50 pm, Jan Bruns <jansacco...@arcor.de> wrote:

Spartan6  LUT6: about  7 ins, about 3 outs = 10 ports
Spartan3 Slice: about 10 ins, about 6 outs = 16 ports

So from what you said about area with taking routing
resources into account, the Spartan3 Slice might very
well consume a little more area, although it has only
about half the SRAM bits.
No. It will consume a lot more area if you include routing.
Routing grows faster than linear (look up "Rent exponent").
Of course it can cover more flexible circuit areas because
you can choose many more combinations of input signals
with two 4-LUTs compared to one 6-LUT (except if you have
high fanin random logic). But the area is much larger.
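
As a rough illustration of the Rent exponent, here is a Python sketch of Rent's rule, T = t * G^p, where G is the number of blocks in a region, t the average number of terminals per block and p the Rent exponent. The second function uses a commonly cited estimate (attributed to Donath) that average wire length grows roughly as G^(p - 0.5) for p above 0.5. That estimate, the function names and the example values of t and p are all assumptions picked for illustration, not data from this thread.

```python
def rent_terminals(blocks, terminals_per_block, exponent):
    """Rent's rule: external terminals of a region containing `blocks` blocks."""
    return terminals_per_block * blocks ** exponent


def wirelength_scale(blocks, exponent):
    """Relative total wire length: blocks times an average wire length that
    is assumed to grow as blocks ** (exponent - 0.5), valid for exponent > 0.5."""
    return blocks * blocks ** (exponent - 0.5)


p = 0.7  # illustrative value only
for g in (1_000, 4_000, 16_000):
    print(g, round(rent_terminals(g, 10, p)), round(wirelength_scale(g, p)))
```

With p = 0.7, quadrupling the block count multiplies the estimated total wire length by about 5.3 rather than 4, which is one way to read "routing grows faster than linear".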

The point is: It does not matter if a LUT-6 on average has
lower utilization, as LUT area is virtually free. What matters
is routing utilization.

There is research that clearly shows that from an efficiency
standpoint FPGAs are best that can't achieve 100% LUT utilization
because they have sparse routing.

The reasons why vendors choose to provide lots of routing anyway are:
a) Customers don't understand this and tend to start whining when they
don't get 100% LUT utilization instead of being happy that they get
better wire utilization. (Remember: Wires are the expensive part)

b) It gets hard to predict what can be implemented and what can't.

c) Software gets harder to do and slower with worse routing
resources.


So you pay a premium to be able to reliably plan your design and to
simplify marketing.

Back to LUT size: Have a look at figure 3.3 in this:
http://www.eecg.utoronto.ca/~jayar/pubs/theses/Ahmed/EliasAhmed.pdf

Area is virtually constant in that analysis for LUT sizes from 4 to 6.
But with LUT size 6 you get much better software runtimes.

Kolja
 
Kolja Sulimma:

Spartan6 LUT6: about 7 ins, about 3 outs = 10 ports
Spartan3 Slice: about 10 ins, about 6 outs = 16 ports

So from what you said about area with taking routing resources into
account, the Spartan3 Slice might very well consume a little more area,
although it has only about half the SRAM bits.

No. It will consume a lot more area if you include routing. Routing
grows faster than linear (look up "Rent exponent"). Of course it can
cover more flexible circuit areas because you can choose many more
combinations of input signals with two 4-LUTs compared to one 6-LUT
(except if you have high fanin random logic). But the area is much
larger.
Take some area A of silicon and put n_1 blocks of type T_1 into it.
Take another area A of silicon and put n_2 blocks of a similar type
T_2 into it.

If n_1*portcount(T_1) = n_2*portcount(T_2) then
portcount(A) won't depend on what blocktype was implemented,
and I don't see any reason why one or the other should consume
more routing overhead.
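
Plugging the rough port counts from earlier in the thread into this argument gives the following Python sketch. The 16 and 10 port figures are the estimates quoted above. The block counts (5 slices, 8 LUT6s) are chosen only so that the port totals match, and the dictionary layout is made up for the example.

```python
# Rough per-block figures taken from the estimates earlier in this thread.
S3_SLICE = {"ports": 16, "sram_bits": 2 * 16}  # two LUT4s per slice
S6_LUT6 = {"ports": 10, "sram_bits": 64}       # one LUT6


def totals(block, count):
    """Total (ports, LUT SRAM bits) for `count` copies of `block`."""
    return block["ports"] * count, block["sram_bits"] * count


# Block counts picked so that both regions expose the same number of ports.
print("Spartan3 slices:", totals(S3_SLICE, 5))
print("Spartan6 LUT6s: ", totals(S6_LUT6, 8))
```

Both regions come out at 80 ports, while the LUT6 region holds 512 LUT bits against 160. Whether equal port counts really mean equal routing cost is exactly the point under dispute here.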

The point is: It does not matter if a LUT-6 on average has lower
utilization, as LUT area is virtually free. What matters is routing
utilization.
If the utilization of a given LUT goes low, the routing will on average
become less "localized", so that wires become longer.

There is research that clearly shows that from an efficiency standpoint
FPGAs are best that can't achieve 100% LUT utilization because they have
sparse routing.

The reasons why vendors choose to provide lots of routing anyway are:
a) Customers don't understand this and tend to start whining when they
don't get 100% LUT utilization instead of being happy that they get
better wire utilization. (Remember: Wires are the expensive part)

b) It gets hard to predict what can be implemented and what can't.

c) Software gets harder to do and slower with worse routing resources.


So you pay a premium to be able to reliably plan your design and to
simplify marketing.

Back to LUT size: Have a look at figure 3.3 in this:
http://www.eecg.utoronto.ca/~jayar/pubs/theses/Ahmed/EliasAhmed.pdf
Thanks for sharing that link.

However, my understanding from that presentation is that LUT4..6 give
the same overall area utilization, LUT>6 would give shortest delays,
and LUT4..6 all give the same best area*delay product.

area is virtually constant in that analysis for LUT sizes from 4 to 6.
But with LUT size 6 you get much better software runtimes.
Overall area (including routing) doesn't significantly change from LUT4
to LUT6, and even the delay was similar from LUT4 to LUT6.

But these results don't reflect the fact that the Xilinx LUT4 design
is a very good fit for many practically relevant problems (for example,
adders and bus muxes are very frequently used). Even the
software-generated technology mapping makes heavy use of these
additional LUT4 features, which are almost free compared to the
theoretical, simple LUT4 design.


The technology mapping might become easier for synthesis software, if
the CLB design comes nearer to the bare LUT (with LUT6, the Carry seems
to become the only additional specialized circuit), but the Xilinx
software is already able to make good use of their LUT4 specials, it's
only that it doesn't always notice the ideal, obvious solution.

Regards,

Jan Bruns

--
A few photos: http://abnuto.de/gal/
 
glen herrmannsfeldt:

(snip)
There is research that clearly shows that from an efficiency
standpoint FPGAs are best that can't achieve 100% LUT utilization
because they have sparse routing.

I have done place and route on pipelined arrays with different numbers
of cells per chip, and found that speed goes fairly close to inversely
proportional to the number of cells, over a fairly wide range.
Some pipeline control signals crossing the data-path and getting slower
with wider fanouts?

Regards,

Jan Bruns

--
A few photos: http://abnuto.de/gal/
 
Kolja Sulimma <ksulimma@googlemail.com> wrote:

(snip)
There is research that clearly shows that from an efficiency
standpoint FPGAs are best that can't achieve 100% LUT utilization
because they have sparse routing.
I have done place and route on pipelined arrays with different
numbers of cells per chip, and found that speed goes fairly
close to inversely proportional to the number of cells, over
a fairly wide range.

-- glen
 
glen herrmannsfeldt:
Jan Bruns <jansaccount@arcor.de> wrote:

(snip, I wrote)
I have done place and route on pipelined arrays with different numbers
of cells per chip, and found that speed goes fairly close to inversely
proportional to the number of cells, over a fairly wide range.

Some pipeline control signals crossing the data-path and getting slower
with wider fanouts?

It is a linear array of fairly simple cells. I believe it is that the
routes get longer and slower as things get more tightly packed together.
Just some days ago, I had a similar problem.
There was a horizontal data flow, with the parallel data lines
vertically aligned.

The bottleneck was one CLB column using a couple of "control
signals" sourced elsewhere.
The timing scaled down heavily with bus size, and timing analysis
showed a couple of ns of routing delay, just for the control.

Luckily, the critical CLB row had some unused regs, so
I used them to replicate the most critical controls.

At first, this didn't work out as expected. It even got worse
than without the replication. This was caused by the way
I'd arranged the replicates, with more vertical direct lines
than available. So the router came up with solutions like routing
a critical, local CLB signal once around that CLB (a lot of hops
through a handful of neighbor switch matrices).

A simple rearrangement of the replicate usage however fully
solved that further problem (by halving the direct neighbor route
consumption).

Although there are now some more signals on the switches (remember
the original signals still need to go to the replicate regs), now
all the replicates have direct neighbor connects (or better) to
the LUTs. So timing doesn't scale anymore with bus width.

Regards,

Jan Bruns

--
A few photos: http://abnuto.de/gal/
 
Jan Bruns <jansaccount@arcor.de> wrote:

(snip, I wrote)
I have done place and route on pipelined arrays with different numbers
of cells per chip, and found that speed goes fairly close to inversely
proportional to the number of cells, over a fairly wide range.

Some pipeline control signals crossing the data-path and getting slower
with wider fanouts?
It is a linear array of fairly simple cells. I believe it is
that the routes get longer and slower as things get more
tightly packed together.

-- glen
 
On Feb 16, 4:06 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
rickman <gnu...@gmail.com> wrote:

(snip, I wrote)

For smaller ones, it likely doesn't make so much difference.
There is some advantage as far as synthesis software of
keeping a minimum number of different architectures.

(snip)

I believe that is what it comes down to. Given the fact that routing
is a huge percentage of the chip area (and so cost) this becomes a
more important factor as the chips get larger. After all, routing
does go up at a faster rate than linear. So minimizing routing is
more important in larger chips. The tradeoff provides for lower
costs with LUT6 in larger devices.
The other side of the coin is more "wasted" logic when larger LUTs are
underutilized. So it would seem that we have reached the point where
the LUT6 is optimal for many if not the vast majority of designs.

One that I am interested in, though, is that 6LUT should be
much better for building the MUX needed for barrel shifters.
A 4LUT makes a two input MUX, but 6LUT can make a 4 input
(and two select line) MUX. Other than that, I haven't thought
much about how useful different sizes are. The less logic
between FF's, the less advantage to larger ones.
Yes, the 4LUT can be finagled by using the fourth input as an enable
which is in essence the AND gate of the next mux stage, then you can
use all four inputs of a LUT as the OR gate to combine 8 inputs in two
levels. So the 4LUT is more like 1.5 2 input muxes.
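
One way to read that trick is sketched below in Python: four 4-input LUTs each act as a 2:1 mux with an enable, and a fifth 4-input LUT ORs their outputs, giving an 8:1 mux in two levels. The function names are made up, and the one-hot enables are assumed to be decoded from the upper select bits somewhere else (for a wide bus that decode could be shared across all bits). This is an interpretation of the description, not a confirmed netlist.

```python
def lut4_mux_en(a, b, sel, en):
    """First-level LUT4: 2:1 mux whose output is gated by an enable."""
    return (b if sel else a) & en


def lut4_or(w, x, y, z):
    """Second-level LUT4: plain 4-input OR."""
    return w | x | y | z


def mux8(data, sel):
    """8:1 mux from five 4-input LUTs in two levels.
    data: list of 8 bits, sel: integer 0..7."""
    s0 = sel & 1
    upper = sel >> 1
    # One-hot enables, assumed to come from a separate (shared) decoder.
    first_level = [
        lut4_mux_en(data[2 * i], data[2 * i + 1], s0, int(upper == i))
        for i in range(4)
    ]
    return lut4_or(*first_level)
```

A plain tree of 2:1 muxes would need seven LUTs in three levels for the same function, so the enable input buys one level, at the cost of the extra decode logic.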


I don't know that there is a performance penalty in using LUT6. I
would expect that is minimal since the muxes in the LUTs are done with
transmission gates with very little delay, but I don't really know.
If so, the only issue then becomes cost. So if your design is one of
the minority designs that can indeed be done more efficiently in a
LUT4 architecture, then you will pay a bit more for a LUT6 based
part... but given the advantages of smaller feature size you will
likely get lower costs with the newer parts than sticking with an old
generation.

Well, they have to be designed not to glitch when switching
between entries with the same output value. That doesn't
naturally happen with an SRAM. Also, with transmission gates
you can't go through too many without a buffer, but presumably
that is part of optimizing the cell.

-- glen
The glitching is from logic race conditions. Using transmission gates
pretty much eliminates that as long as you use break before make
connections. Then the capacitance of the line retains the last value
until the new value comes up.

Rick
 
