Stratix 2 ALUT architecture patented ?

Kenneth Land · Feb 27, 2004

Hi Ray,

As I said, (tried to say?) I'm sure many of the smart guys here like you
could rig the results at will - big deal. But what if you didn't go to the
trouble to make one or the other come out miles ahead? Would it not be
interesting to see the results?

I think it's clear people don't really care because they've made up their
minds which they prefer. Choosing A or X probably pales in comparison to
pushing more of your design onto *any* state of the art FPGA. That's where
I am - basking in the amazement that my hardware is configurable, my pins
are assignable and I can add as much parallel logic as required to meet
performance.

Frankly, it would take a lot to steer me away from Altera and Nios because
of the ease of use and the effort the company went to to get us going.

Ken

"Ray Andraka" <ray@andraka.com> wrote in message
news:403E8F68.9BDD0118@andraka.com...

I could make either family come out miles ahead of the other for a particular
design. The fact is, if you design to the part you'll get the best performance
from that part, but it will likely not port well to a dissimilar part. In my
cursory look at the new Altera architecture, I see structure there to support
wider functions, which is great for the naive user. It makes the synthesis job
a little harder since there are now even more ways to skin the cat. For heavily
pipelined designs, it generally isn't going to matter a whole lot if you've
already done the designs with 4-LUTs in mind. I'm not sure yet what it does to
the arithmetic functions. Altera had previously been a little weaker than
Xilinx for arithmetic just because of the structure of the LEs and LAB when used
in arithmetic mode, and in the families prior to Stratix there were row routing
congestion problems if you put a lot of arithmetic in one row. For typical
designs, either will do fine. Select on the feel-good parameters instead:
things like best comfort with tools, best price, best relationship with vendor,
best give-aways at trade shows. If your application is pushing the
envelope, however, you'll want to carefully evaluate each architecture to
determine which is going to best fit your design, and then design to that
architecture.

Kenneth Land wrote:

Hi Peter,

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759

Hal Murray · Feb 27, 2004

I know you could trust my designs, because I don't know enough to tweak them
for X or A. I'll bet there are many others who don't tweak designs to that
level either. (probably for better reasons)

That cuts both ways. I'm willing to tweak my designs to take advantage
of the hardware. That can make a big difference.

Perhaps what we want is a nice clean reference design, and then see what
happens if people try to tweak it to look better on a particular device.
(That might make a good open cores project.)

But that puts us back to discussing what makes a good reference design. See
Peter's comments about PREP. But maybe the FPGA world has matured enough
by now that we could build complete (interesting) designs rather than
replicating simple counters.

Would LUTs/MIPS for an x bit CPU be an interesting measure? Both vendors
have mature CPU technology.

I like Ray's offer. Maybe one of the trade rags will take him up on it.

--
The suespammers.org mail server is located in California. So are all my
other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's. I hate spam.

Jesse Kempa · Feb 27, 2004

nweaver@ribbit.CS.Berkeley.EDU (Nicholas C. Weaver) wrote in message news:<c1lb97$5uc$1@agate.berkeley.edu>...

In article <103s244a4ugb899@news.supernews.com>,
Kenneth Land <kland1@neuralog1.com1> wrote:
Austin,

I agree that performance claims will have to wait, but Jesse Kempa posted
the LE results for a Nios softcore on Stratix I vs. II. The numbers showed
just north of 30% fewer LE's used.

Considering the impressive design mapping required in the NIOS 2 (the
FPGA talk on the subject was VERY-cool), how much redesign was
done/needed for Stratix II? Or is this Nios 1.1?

Just to prevent any confusion (on the Nios stuff): I am assuming by
"NIOS 2" you mean "Nios II", which is a new product that has been
announced as part of Altera's 2004 new product line-up, but has not
yet been released... all I can say about this is to please be patient.

Nios "I" versions up to this point (1.0, 1.1, 2.0... up to the
just-released v3.2) have been more-or-less an evolution on the
original processor & instruction set introduced with Nios >3 years
ago. Probably the most significant amount of architecture-specific
optimization went into Nios targeting the Apex FPGA family.

While there are architecture tweaks here and there, the CPU remains
largely unchanged between Stratix & Stratix II... but don't take my
word for it, generate the respective CPUs and diff the HDL! Really the
magic behind all of this is in Quartus with synthesis & fitting
optimizations for each device architecture.

Jesse Kempa
Altera Corp.
jkempa at altera dot com

john jakson · Feb 27, 2004

Ray Andraka <ray@andraka.com> wrote in message news:<403E90F1.E251EC30@andraka.com>...

If someone will pay for my time doing it (and be willing to fit in my
schedule), I'd be happy to do an app or two targeted to each of the
families, pulling out all the stops for each one to see which does better
for that app after all the tricks in my bag are applied. I'll guarantee the
internals of the two functionally equivalent designs would be rather
different. Of course that would only be a valid comparison for the pieces
of that application.

I suspect if one does a dozen apps, not all DSP, but a general
variety, and for clock freqs all over the range from 50 to 300 MHz, you
would get wildly varying results for which is better even if X & A
agreed on 2 roughly equal technologies. Just doing same in X devices
produces variations due to different strengths of different families
even if closely derived.

I am willing to put up my 300 MHz CPU when it is complete but I suspect
I already know the answer, just look up in specs for who has fastest
DP BlockRam cycle and N (say 12,16) bit adder or 3 4bLUT equiv levels.
But I'd be curious to know A results too.

johnjakson_usa_com

Nicholas C. Weaver · Feb 27, 2004

In article <95776079.0402261921.4d7bab99@posting.google.com>,
Jesse Kempa <kempaj@yahoo.com> wrote:

Just to prevent any confusion (on the Nios stuff): I am assuming by
"NIOS 2" you mean "Nios II", which is a new product that has been
announced as part of Altera's 2004 new product line-up, but has not
yet been released... all I can say about this is to please be patient.

Nios "I" versions up to this point (1.0, 1.1, 2.0... up to the
just-released v3.2) have been more-or-less an evolution on the
original processor & instruction set introduced with Nios >3 years
ago. Probably the most significant amount of architecture-specific
optimization went into Nios targeting the Apex FPGA family.

While there are architecture tweaks here and there, the CPU remains
largely unchanged between Stratix & Stratix II... but don't take my
word for it, generate the respective CPUs and diff the HDL! Really the
magic behind all of this is in Quartus with synthesis & fitting
optimizations for each device architecture.

I'm specifically referring to the very HEAVY NIOS optimizations
between 1.1 and 2.0 described in the talk at FPGA this year, on
"A High Performance 32-bit ALU for Programmable Logic" by Peter
Metzgen of Altera. (Sorry, don't have it online).

The talk/paper was all ABOUT architectural tweaks/modifications to get
the ALU and other structures to be not only much smaller, but only 2
LUT stages per pipeline stage, and the reason for the large size
reduction between 1.1 and 2.0. It really was programmed down at the
LE level, understanding tricks with the carry chain etc.

How much reoptimization was needed for Stratix II?
--
Nicholas C. Weaver nweaver@cs.berkeley.edu

Nicholas C. Weaver · Feb 27, 2004

In article <403E90F1.E251EC30@andraka.com>,
Ray Andraka <ray@andraka.com> wrote:

If someone will pay for my time doing it (and be willing to fit in my
schedule), I'd be happy to do an app or two targeted to each of the
families, pulling out all the stops for each one to see which does better
for that app after all the tricks in my bag are applied. I'll guarantee the
internals of the two functionally equivalent designs would be rather
different. Of course that would only be a valid comparison for the pieces
of that application.

I'll argue that 4 apps can be fairly good (heck, I did in the past)
for this comparison, if you choose them well.

AES: Memory/S-boxes, delay chains, LUTs

Smith/Waterman: 16 bit ALU systolic operations

Synthetic datapath: What does a generic microprocessor-ish datapath
look like

Synthesized processor core: How do pushbutton hairballs behave.

But doing the first three right is still probably 1-3 weeks of
billable time for Ray Andraka (and 2-6 months for lazy grad students
like I was).

--
Nicholas C. Weaver nweaver@cs.berkeley.edu

Austin Lesea · Feb 27, 2004

This is clearly not offering anyone anything of any import.

I hope my explanations have been clear and understood by the others out
there.

Have a nice day.

Austin

Irwin Kennedy · Feb 27, 2004

mrather@altera.com (Michael) wrote in message news:<250d58c1.0402251837.32d4855c@posting.google.com>...

<snip>

Logic Structure Comparison Between Stratix & Virtex-Based
Architectures
http://www.altera.com/literature/wp/wpstxiixlnx.pdf

<snip>

The idea of an architecture comparison using "real" designs is of
great interest, however the choice of comparison metric used in the
white paper above is woolly. A much better metric for comparison would
be the silicon area required (normalised to the same process
technology). "Normalized Relative Logic Capacity" in terms of the
"ALUT" has little meaning.

rickman · Feb 27, 2004

Irwin Kennedy wrote:

mrather@altera.com (Michael) wrote in message news:<250d58c1.0402251837.32d4855c@posting.google.com>...

snip

Logic Structure Comparison Between Stratix & Virtex-Based
Architectures
http://www.altera.com/literature/wp/wpstxiixlnx.pdf

snip

The idea of an architecture comparison using "real" designs is of
great interest, however the choice of comparison metric used in the
white paper above is woolly. A much better metric for comparison would
be the silicon area required (normalised to the same process
technology). "Normalized Relative Logic Capacity" in terms of the
"ALUT" has little meaning.

Likewise, logic capacity per silicon area has no real meaning. I have
never bought a chip because it had a given area. I care about the
cost. But that brings in another coefficient/variable that would have
to be measured. In the real world manufacturers don't charge according
to their costs. They charge according to the market making as much
profit as they can squeeze out.

--

Rick "rickman" Collins

rick.collins@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design URL http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX

Nicholas C. Weaver · Feb 27, 2004

In article <a0f83dd1.0402271208.375ef764@posting.google.com>,
Irwin Kennedy <iokennedy@hotmail.com> wrote:

The idea of an architecture comparison using "real" designs is of
great interest, however the choice of comparison metric used in the
white paper above is woolly. A much better metric for comparison would
be the silicon area required (normalised to the same process
technology). "Normalized Relative Logic Capacity" in terms of the
"ALUT" has little meaning.

It depends. If your metric is throughput
(parallel/pipeline/multitask), then it has to be normalized to area or
cost. LEs is a funny metric, it has to be silicon area.

But if the metric is latency, then actually area is secondary
altogether, and it's just the clock cycle.

--
Nicholas C. Weaver nweaver@cs.berkeley.edu

Peter Alfke · Feb 27, 2004

I have to agree with rickman (in spite of his harsh wording).
The issue is not square millimeters, the issues are:
Capacity, performance, and price (and power, familiarity and software
support)
The connection between price and silicon area is very tenuous:
Defect density, process maturity, manufacturing volume, package cost, and
market conditions are equally important factors.
Thank God we are not (yet) selling FPGAs by the square millimeter.

And BMW, Lexus and Cadillac are still not selling their cars by the pound,
or even the cubic inch. And those products have more than a hundred-year
evolution behind them...
Peter Alfke

Likewise, logic capacity per silicon area has no real meaning. I have
never bought a chip because it had a given area. I care about the
cost. But that brings in another coefficient/variable that would have
to be measured. In the real world manufacturers don't charge according
to their costs. They charge according to the market making as much
profit as they can squeeze out.

--

Rick "rickman" Collins

rick.collins@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design URL http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX

rickman · Feb 27, 2004

Peter Alfke wrote:

I have to agree with rickman ( in spite of his harsh wording)

Geeze Peter, I don't know what I said that you thought was harsh... or
are you responding to my statement about companies charging as much as
the market will allow? That was not meant as an insult, just a simple
statement of fact. If companies did not make a profit, they would not
exist. That is the nature of our system, in order for companies to form
there has to be a profit motive. No insult intended, I just wanted to
clarify that there is only an indirect relationship between a product's
cost and its price. Likewise there is only an indirect relationship
between the silicon area and the price.

Heck, I am in business to make money. I don't price my products by
their cost, I price them by how useful they are to my customers.

--

Rick "rickman" Collins

rick.collins@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design URL http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX

google_guy · Feb 28, 2004

Austin Lesea <austin@xilinx.com> wrote in message news:<c1nvd9$lnm1@cliff.xsj.xilinx.com>...

This is clearly not offering anyone anything of any import.

I hope my explanations have been clear and understood by the others out
there.

No, you never answered the question. How many logic cells does the
XC3S1000 have? The data sheet says 17,280. Is that correct?

My understanding is that this number is meaningless and that I have to
figure it out for myself. If I know that there are 8 LCs per CLB,
then I can multiply 1,920 * 8 = 15,360. Funny, that is not the same
as 17,280, is it?

Can you explain? Perhaps I don't understand what a logic cell is...

Philip Freidin · Feb 28, 2004

On 27 Feb 2004 19:27:47 -0800, sense_1909S_VDB@yahoo.com (google_guy) wrote:

Austin Lesea <austin@xilinx.com> wrote in message news:<c1nvd9$lnm1@cliff.xsj.xilinx.com>...
This is clearly not offering anyone anything of any import.

I hope my explanations have been clear and understood by the others out
there.

No, you never answered the question. How many logic cells does the
XC3S1000 have? The data sheet says 17,280. Is that correct?

My understanding is that this number is meaningless and that I have to
figure it out for myself. If I know that there are 8 LCs per CLB,
then I can multiply 1,920 * 8 = 15,360. Funny, that is not the same
as 17,280, is it?

Can you explain? Perhaps I don't understand what a logic cell is...

Well google_guy, maybe you should try Google

Google is your friend (if you talk to it nicely).
Google search "counting logic cells" and you get
a link (without much effort) to

http://www.fpga-faq.com/archives/65200.html#65218

and also:

http://www.nalanda.nitc.ac.in/industry/appnotes/xilinx/documents/xbrf/xbrf011.pdf

But one of my favorites is this:

http://www.fpgacpu.org/log/jan01.html#marketing_gates

The short story is that the marketing weenies got to gate
counts and butchered it to the point of meaninglessness, so
engineers turned to LUT/LC counting, and now it is being
abused too. Although I have not looked recently, in the past
I found that Actel's gate-count claims were fairly honest,
as were Xilinx's back around the XC4000 family (See the table
in the "marketing_gates" link).

I particularly hate the new meaningless term "System Gates".

When I do estimates, I use E-Gates, which are gates that an
engineer may actually get to use.
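
The mismatch in the question quoted above can be checked directly. This is a small sketch using only the numbers from that post (1,920 CLBs, 8 logic cells per CLB, and the data-sheet figure of 17,280). Reading the resulting ratio as a marketing weighting factor rather than extra hardware is an assumption, although it fits the links above.

```python
from fractions import Fraction

CLBS = 1920                     # CLB count quoted for the XC3S1000
LCS_PER_CLB = 8                 # LUT/flip-flop pairs per CLB, as assumed in the post
DATASHEET_LOGIC_CELLS = 17280   # "logic cells" figure quoted from the data sheet

physical_cells = CLBS * LCS_PER_CLB
ratio = Fraction(DATASHEET_LOGIC_CELLS, physical_cells)

print(physical_cells)   # 15360
print(ratio)            # 9/8
print(float(ratio))     # 1.125
```

The data-sheet number is exactly 9/8 of the LUT/flip-flop count, so under that assumption the difference is a scaling factor and not 1,920 additional physical cells.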

Philip

===================
Philip Freidin
philip@fliptronics.com
Host for WWW.FPGA-FAQ.COM

Paul Leventis (at home) · Mar 1, 2004

Hi Ray,

I'm not sure yet what it does to the arithmetic functions.

There are two specific ways in which Stratix II improves on the arithmetic
capabilities of our previous architectures. First, a single ALM can
implement the sum of two 4-LUTs provided they share two inputs f(a, b, c,
d) + g(a, b, e, f). Second, it can implement a 3-input adder allowing you
to reduce the number of ALMs and logic levels required for adder trees.

Please see
http://www.altera.com/products/devices/stratix2/features/architecture/st2-adder.html
and Figures 2-11, 2-12, and 2-13 of the Stratix II databook for further
details.
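
To make the adder-tree remark concrete, here is a rough sketch that counts logic levels and adders when summing N operands with 2-input versus 3-input adders. The function name `reduce_tree` and the greedy grouping are illustrative assumptions, and the count is in word-wide adders, not ALMs (each ALM handles two bits).

```python
def reduce_tree(n_operands: int, fan_in: int) -> tuple[int, int]:
    """Return (levels, adders) needed to sum n_operands with fan_in-input adders."""
    levels = adders = 0
    remaining = n_operands
    while remaining > 1:
        full, leftover = divmod(remaining, fan_in)
        # A leftover group of two or more operands still needs an adder;
        # a single leftover operand passes to the next level unchanged.
        extra_adder = 1 if leftover >= 2 else 0
        passthrough = 1 if leftover == 1 else 0
        adders += full + extra_adder
        remaining = full + extra_adder + passthrough
        levels += 1
    return levels, adders


print(reduce_tree(9, 2))   # (4, 8)
print(reduce_tree(9, 3))   # (2, 4)
```

For nine operands the 3-input version halves both the number of levels and the number of adders in this simple model.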

It makes the synthesis job a little harder since there are now even more
ways to skin the cat

In architecting Stratix II, we took into account the increased challenges
for synthesis and place & route. We worked with and continue to work with
our 3rd party synthesis providers to improve the quality of synthesis for
this architecture. Using today's synthesis tools, Stratix II achieves a 25%
logic density advantage over Stratix. And it's not hard to imagine that
this will only get better with time.

Regards,

Paul Leventis
Altera Corp.

Paul Leventis (at home) · Mar 1, 2004

There are two specific ways in which Stratix II improves on the arithmetic
capabilities of our previous architectures. First, a single ALM can
implement the sum of two 4-LUTs provided they share two inputs f(a, b, c,
d) + g(a, b, e, f). Second, it can implement a 3-input adder allowing you
to reduce the number of ALMs and logic levels required for adder trees.

A quick correction -- I'm suffering from Vacation-Fried Brain Syndrome:

Each ALM implements _two_ bits of arithmetic:

sum0 = f0(a, b, c, e0) + g0(a, b, c, f0)
sum1 = f1(a, b, d, e1) + g1(a, b, d, f1)

Where f0, f1, g0, and g1 are each four-input functions. The simplest use of
this is to set e[0..1] = A[0..1] and f[0..1] = B[0..1], make a, b, c, d all
don't cares, and set f0, f1, g0, g1 to be "wire LUTs". This just gives you
Sum[0..1] = A[0..1] + B[0..1]. There are other more powerful uses, such as
user-controllable adder/subtractor, conditional operators, etc.
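
The "wire LUT" configuration can be sketched as a small behavioural model. Everything here is illustrative: the helper names (`lut4`, `alm_arith`, `add`) are made up, the data inputs are passed as the lists `e` and `f` to keep them apart from the LUT functions f0/g0/f1/g1, and the ripple carry between the two bits is an assumption, since the equations above do not show the carry chain.

```python
def lut4(mask: int, i3: int, i2: int, i1: int, i0: int) -> int:
    """Evaluate a 4-input LUT: the mask bit at index (i3 i2 i1 i0) is the output."""
    index = (i3 << 3) | (i2 << 2) | (i1 << 1) | i0
    return (mask >> index) & 1


# "Wire LUT": the output equals input i0 and ignores the other three inputs.
WIRE = 0xAAAA


def alm_arith(masks, a, b, c, d, e, f, carry_in=0):
    """Two bits of arithmetic per ALM, following the two equations in the post.

    masks = (f0, g0, f1, g1) LUT masks; e and f are 2-element lists of data bits.
    The ripple carry between the bits is an assumption of this sketch.
    """
    f0, g0, f1, g1 = masks
    x0, y0 = lut4(f0, a, b, c, e[0]), lut4(g0, a, b, c, f[0])
    x1, y1 = lut4(f1, a, b, d, e[1]), lut4(g1, a, b, d, f[1])
    sum0 = x0 ^ y0 ^ carry_in
    carry = (x0 & y0) | (carry_in & (x0 ^ y0))
    sum1 = x1 ^ y1 ^ carry
    carry_out = (x1 & y1) | (carry & (x1 ^ y1))
    return sum0, sum1, carry_out


def add(A: int, B: int, width: int = 8) -> int:
    """Chain width/2 modelled ALMs, all LUTs set to WIRE, to compute A + B."""
    result, carry = 0, 0
    for bit in range(0, width, 2):
        e = [(A >> bit) & 1, (A >> (bit + 1)) & 1]
        f = [(B >> bit) & 1, (B >> (bit + 1)) & 1]
        s0, s1, carry = alm_arith((WIRE,) * 4, 0, 0, 0, 0, e, f, carry)
        result |= (s0 << bit) | (s1 << (bit + 1))
    return result | (carry << width)


print(add(100, 57))   # 157
```

Swapping the WIRE masks for other 4-input functions of (a, b, c/d, data) is what would give the conditional and add/subtract variants mentioned above.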

An ALM can also implement _two bits_ of a 3-input adder.

Regards,

Paul Leventis
Altera Corp.

Paul Leventis (at home) · Mar 1, 2004

Hi Austin,

[To answer your technical/architectural questions]

I was unclear on just how an ALM is any different from drawing the box
differently around the components. I am still puzzled, but the block
diagrams appear to have 3, 4, 5 and 6 LUTS with muxes, and maybe if it
was actually designed this way then that is simply what it is. A true 6
LUT has 64 memory cells and the associated logic, and two of these seems
a bit excessive and would not require any other logic or muxes at all.
Combining existing 4 LUTs to deliver some of the possible terms of a 6
LUT is a completely different matter.

I would highly recommend looking at Figure 2-6 of the Stratix II databook to
gain a better understanding of exactly what hardware there is in the ALM.

The Stratix II ALM can implement all functions of 6 inputs, since it has 64
bits of LUT memory. It can also implement two independent 4-LUTs, a 5-LUT
and 3-LUT, two 6-LUTs that share a LUT mask and 4 inputs, two 5-LUTs that
share part of their LUT mask plus 2 inputs, a subset of 7-input LUTs, etc.
Plus there are a variety of ways to combine this functionality with
registers before, after, or independent of the logic, and some gunk for
powerful arithmetic.

First the simple question: How does a 6-LUT differ from 4 4-LUTs + 3 2:1
muxes (ala 2 slices, 2 f5 muxes and an f6 mux)? It is not just where you
draw the boxes. The silicon area per logic function (or logic efficiency)
is much better with a 6-LUT, and this is largely due to area for
user-programmable routing.

In a 4-LUT architecture, the LUTs are designed to be independent, thus there
are 4 independently routable signals to each LUT. The fx muxes also require
a control input, which for now we will assume is independently routed. Thus
to implement a 6-input function using a 4-LUT architecture and fx muxes
requires a total of 19 independently routed signals. This implies 19 routing
multiplexers which burn area and power. With a 6-LUT, obviously only 6
routing inputs would be required. So the potential area savings of a 6-LUT
come not from a reduction in LUT mask RAM bits (both require 64) but from a
reduction in user-configurable routing multiplexers.
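
As a quick check of the argument above, the following sketch builds a 6-input function from four 4-LUTs plus two F5 muxes and one F6 mux, compares it with a direct 64-bit table lookup, and counts routed signals for both. The helper names are hypothetical and the model is purely behavioural; it says nothing about real silicon area.

```python
import random
from itertools import product


def lut(mask: int, bits) -> int:
    """Return the mask bit selected by bits (first bit is the MSB of the index)."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return (mask >> index) & 1


def mux2(sel: int, d0: int, d1: int) -> int:
    """2:1 multiplexer: d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0


def six_input_from_4luts(mask64: int, x) -> int:
    """Evaluate a 6-input function with four 4-LUTs, two F5 muxes and one F6 mux."""
    x5, x4, low = x[0], x[1], x[2:]
    # Split the 64-bit table into four 16-bit tables, one per (x5, x4) value.
    quarters = [(mask64 >> (16 * q)) & 0xFFFF for q in range(4)]
    outs = [lut(m, low) for m in quarters]
    f5_a = mux2(x4, outs[0], outs[1])   # half of the table where x5 = 0
    f5_b = mux2(x4, outs[2], outs[3])   # half of the table where x5 = 1
    return mux2(x5, f5_a, f5_b)


random.seed(0)
mask64 = random.getrandbits(64)   # an arbitrary 6-input function
same = all(
    six_input_from_4luts(mask64, x) == lut(mask64, x)
    for x in product((0, 1), repeat=6)
)

routed_4lut = 4 * 4 + 3   # four LUTs with 4 inputs each, plus 3 mux selects
routed_6lut = 6           # one routed signal per LUT input

print(same, routed_4lut, routed_6lut)   # True 19 6
```

Both structures hold the same 64 mask bits and compute the same function; the difference shows up only in the number of independently routed inputs, which is the point being made.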

Of course, you can't take this argument to extremes. Working against larger
LUTs is your ability to map designs into these larger functions. If most of
your design maps into 4-input functions and you have a 6-LUT architecture,
you'll be wasting a lot of silicon and a 4-LUT based product will be more
efficient. For these reasons, there is a bottom to the curve -- a 25-LUT
architecture would not be more area efficient than a 4-LUT architecture!
Where that bottom is... well, there's lots of academic studies and we've got
our own data.

But the Stratix II ALM is more than a 6-LUT architecture. It targets the
routing area efficiency gains of larger LUTs, while attempting to minimize
the wastage that occurs when you need to implement small logic functions.
It provides a few extra inputs (8 instead of 6) and one extra output (2
instead of 1), and is thus slightly less efficient than a true 6-LUT
architecture for implementing 6-input functions. However, these inputs and
outputs plus a few internal 2:1 muxes allow us to make use of the full ALM
under a wide range of function sizes by allowing us to fracture the ALM into
independent/semi-dependent functions. This allows us to greatly reduce the
number of LUT mask bits that go unused, and allows us to highly utilize the
available inputs and outputs of the ALM, resulting in little wasted silicon
area for input/output routing.
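
One way to see why 8 inputs goes a long way is to count the distinct inputs each of the modes listed earlier in this post needs. This is only a back-of-the-envelope sketch: the mode list is taken from the text, the helper `distinct_inputs` is made up for illustration, and LUT-mask sharing constraints are ignored.

```python
ALM_INPUTS = 8   # inputs per ALM, from the post

# (description, input count of each function, inputs shared between the two)
MODES = [
    ("one 6-LUT", (6,), 0),
    ("two independent 4-LUTs", (4, 4), 0),
    ("5-LUT + 3-LUT", (5, 3), 0),
    ("two 6-LUTs sharing 4 inputs", (6, 6), 4),
    ("two 5-LUTs sharing 2 inputs", (5, 5), 2),
]


def distinct_inputs(sizes, shared):
    """Distinct ALM inputs needed when shared inputs are counted only once."""
    return sum(sizes) - shared * (len(sizes) - 1)


for name, sizes, shared in MODES:
    used = distinct_inputs(sizes, shared)
    print(f"{name:30s} {used} of {ALM_INPUTS} inputs")
```

Every two-function mode in the list lands on exactly 8 distinct inputs, and the single 6-LUT uses 6, which is consistent with the "few extra inputs" description.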

Why 8 inputs, 2 outputs, and all the little 2:1 muxes? Because our
experiments in the end showed that this resulted in the best combination of
area and performance, and I can assure you we believed there to be a
substantial benefit over the Stratix ALE in order to commit the resources
required to support a completely new logic fabric.

On a performance front, larger input LUTs confer a benefit in terms of
critical path delay by reducing the number of levels of logic and thus
routing hops required to implement a given cone of logic. But is a 6-LUT
based ALM faster than 4-LUT based slices + fx muxes? A paper analysis will
not answer this, since both implement 6-input functions (albeit at different
area efficiencies). I could start arguing that smaller area turns
into/gives area to be spent on better speed, or start counting
transistors/gates in the path, but then we'd be getting into a very fuzzy
realm full of a gazillion assumptions!

Regardless, it is enjoyable to hear about any radical or innovative new
architecture, as there are so many that now dot the landscape as dead
skeletons of past FPGAs.

And I must say it was enjoyable to have worked on a radical, innovative
architecture such as Stratix II. And given its enhancements over the
successful Stratix architecture, I expect it to be flesh-covered and alive
for a long while.

Regards,

Paul Leventis
Altera Corp.

Kolja Sulimma · Mar 1, 2004

"Paul Leventis $at home$" <paul.leventis@utoronto.ca> wrote in message news:<j2B0c.53313$ah.15341@twister01.bloor.is.net.cable.rogers.com>...

Of course, you can't take this argument to extremes. Working against larger
LUTs is your ability to map designs into these larger functions. If most of
your design maps into 4-input functions and you have a 6-LUT architecture,
you'll be wasting a lot of silicon and a 4-LUT based product will be more
efficient. For these reasons, there is a bottom to the curve -- a 25-LUT
architecture would not be more area efficient than a 4-LUT architecture!
Where that bottom is... well, there's lots of academic studies and we've got
our own data.

The older academic data that I have seen suggests that the optimum is
somewhere between 4-LUTs and 5-LUTs for non-arithmetic circuits.
It makes sense that these numbers shift to larger LUTs when the number
of routing levels increases and also when the transistors shrink
faster than the routing density. (Which seems to be the case for
modern technologies).

But silicon area is not everything.
Because of the large area that is dedicated to routing in FPGAs, it
generally makes sense (area wise) to have an architecture that is a
little short on routing. (Better waste unused LUTs than waste unused
routing).
But it turned out that both the CAD tool developers and the customers
did not like this because the flow (in the tools and in the heads) is
to place first and route second.
This flow becomes a lot easier when the routing is guaranteed to
succeed.

Therefore most commercial FPGAs we see today target a 100% LUT
utilization.
This is expensive. But it really helps time to market wise.

Kolja Sulimma

Paul Leventis (at home) · Mar 1, 2004

Hi Kolja,

Because of the large area that is dedicated to routing in FPGAs, it
generally makes sense (area wise) to have an architecture that is a
little short on routing. (Better waste unused LUTs than waste unused
routing).
[Snip]
This flow becomes a lot easier when the routing is guaranteed to
succeed.
Therefore most commercial FPGA we see today target a 100% LUT
utilization.
This is expensive. But it really helps time to market wise.

Yes, the academic literature suggesting this is interesting (Andre DeHon
had a paper at FPGA a few years back, I think). In our own experimentation,
we see that different designs (obviously) have varying amounts of routing
demand per logic element. If you try to build a chip that allows all
designs to route, you'll have way too much routing for most designs.

But as you point out, customers aren't too happy if they have a 90% full
device that fails to route. And it's mostly a problem when they've done
most of their design (and it fits), only to find that late in the game when
they add a few more LEs, suddenly they can't route anymore. And when
customers run into problems, it costs us in support time, customer loyalty,
lost business, etc. So there are cost pressures that push us in the
direction of being slightly over-routed.

That said, our devices do make use of the less-than-100% observation, but in
more local ways. For example, a LAB in Stratix has 30 general routing
inputs (lab lines), and has 10 4-input LEs. Obviously, you could construct
a LAB that would be (deterministically) unroutable. It would just not be
efficient to build a LAB architecture with 40+ inputs, since Quartus can
almost always find LEs that share input signals or feedback to one another
in order to reduce input demands, and thus most LABs would have a lot of
wasted input muxes. When it can't, it automatically will leave some LEs in
a LAB unused in order to cap the number of inputs in use. This is like a
localized version of the "don't hit 100%" approach. There is a large body
of good academic research on the optimal # of cluster inputs for a given #
of BLEs (Rose, Betz, E. Ahmed, Singh, Kouloheris, D. Hill, etc.). There is
also research that shows that you should aim to never use the full # of LAB
inputs, as this is more efficient than trying to make a fully utilized set
of inputs routable (Guy Lemieux).
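
The input-cap idea can be illustrated with a toy packer. The 30-input and 10-LE figures come from the paragraph above; the greedy strategy and the function `pack_lab` are hypothetical and are not a description of what Quartus actually does.

```python
LAB_INPUTS = 30    # general routing inputs per Stratix LAB (from the post)
LES_PER_LAB = 10   # 4-input LEs per LAB (from the post)


def pack_lab(les, max_inputs=LAB_INPUTS, max_les=LES_PER_LAB):
    """Greedily fill one LAB.

    les: list of (name, set of input signal names); an LE's output signal is
    assumed to carry the LE's own name. Signals driven from inside the LAB
    count as local feedback and do not use a LAB input.
    Returns (packed names, names left for other LABs, LAB inputs used).
    """
    packed, leftover = [], []
    external, local = set(), set()
    for name, inputs in les:
        new_local = local | {name}
        new_external = (external | inputs) - new_local
        if len(packed) < max_les and len(new_external) <= max_inputs:
            packed.append(name)
            local, external = new_local, new_external
        else:
            leftover.append(name)   # leave this LE for another LAB
    return packed, leftover, len(external)


# Worst case: ten LEs with four unique inputs each would need 40 inputs.
disjoint = [(f"le{i}", {f"s{i}_{k}" for k in range(4)}) for i in range(10)]
# More typical: every LE shares two control signals with its neighbours.
shared = [(f"le{i}", {"clk_en", "sel", f"a{i}", f"b{i}"}) for i in range(10)]

for case in (disjoint, shared):
    packed, leftover, used = pack_lab(case)
    print(len(packed), len(leftover), used)
# 7 3 28
# 10 0 22
```

In the disjoint case three LEs are left unused to stay under the cap, while a modest amount of input sharing lets all ten fit with inputs to spare.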

Regards,

Paul Leventis
Altera Corp.

Austin Lesea · Mar 1, 2004

Paul,

You answered my questions,

It will be an interesting springtime,

Thanks,

Austin

