Phrasing!

Kevin Neilson · Nov 19, 2016

Here's an interesting synthesis result. I synthesized this with Vivado for Virtex-7:

reg [68:0] x;
reg x_neq_0;
always@(posedge clk) x_neq_0 <= x!=0; // version 1

Then I rephrased the logic:

reg [68:0] x;
reg x_neq_0;
always@(posedge clk) x_neq_0 <= |x; // version 2

These should be the same, right?

Version 1 uses 23 3-input LUTs on the first level followed by a 23-long carry chain (6 CARRY4 blocks). This is twice as big as it should be.

Version 2 is 3 levels of LUTs, 12 6-input LUTs on the first level, 15 total.

Neither is optimal. What I really want is a combination, 12 6-input LUTs followed by 3 CARRY4s.
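
The counts above can be checked with a little arithmetic plus a bit-level model of the combination being asked for. The following is a minimal Python sketch, not tool output. It assumes each first-level LUT6 flags whether its 6-bit slice of x is all zero, and that each carry stage behaves like a MUXCY (pass the incoming carry when the slice is zero, otherwise force a 1). The names resource_counts and neq0_lut_carry are hypothetical.

```python
import math
import random

WIDTH = 69  # width of x in the original post


def resource_counts(width=WIDTH):
    """Back-of-envelope primitive counts for the three structures discussed."""
    # Version 1: 3 bits per LUT, one carry stage per LUT, 4 stages per CARRY4.
    v1_luts = math.ceil(width / 3)
    v1_carry4 = math.ceil(v1_luts / 4)
    # Version 2: a pure LUT6 OR-tree, reduced level by level.
    v2_levels, v2_luts, signals = 0, 0, width
    while signals > 1:
        signals = math.ceil(signals / 6)
        v2_luts += signals
        v2_levels += 1
    # Desired: 6 bits per LUT, one carry stage per LUT.
    want_luts = math.ceil(width / 6)
    want_carry4 = math.ceil(want_luts / 4)
    return {
        "v1": (v1_luts, v1_carry4),       # (LUTs, CARRY4s)
        "v2": (v2_luts, v2_levels),       # (LUTs, LUT levels)
        "desired": (want_luts, want_carry4),
    }


def neq0_lut_carry(x, width=WIDTH):
    """Model x != 0 as LUT6 zero-detectors feeding a MUXCY-style carry chain."""
    x &= (1 << width) - 1
    carry = 0  # carry-in to the first stage (assumed tied low)
    for lo in range(0, width, 6):
        slice_bits = (x >> lo) & 0x3F
        select = int(slice_bits == 0)   # LUT output: 1 when the slice is all zero
        carry = carry if select else 1  # MUXCY: propagate, or inject a 1
    return carry


print(resource_counts())
# {'v1': (23, 6), 'v2': (15, 3), 'desired': (12, 3)}

# The model agrees with x != 0 on random 69-bit values.
for _ in range(1000):
    x = random.getrandbits(WIDTH)
    assert neq0_lut_carry(x) == int(x != 0)
```

The model only checks the Boolean function and the primitive counts; it says nothing about how Vivado would actually map or place the logic.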

This is supposed to be the era of high-level synthesis...

Tim Wescott · Nov 21, 2016

On Sat, 19 Nov 2016 14:15:18 -0800, Kevin Neilson wrote:

Here's an interesting synthesis result. I synthesized this with Vivado
for Virtex-7:

reg [68:0] x;
reg x_neq_0;
always@(posedge clk) x_neq_0 <= x!=0; // version 1

Then I rephrased the logic:

reg [68:0] x;
reg x_neq_0;
always@(posedge clk) x_neq_0 <= |x; // version 2

These should be the same, right?

Version 1 uses 23 3-input LUTs on the first level followed by a 23-long
carry chain (6 CARRY4 blocks). This is twice as big as it should be.

Version 2 is 3 levels of LUTs, 12 6-input LUTs on the first level, 15
total.

Neither is optimal. What I really want is a combination, 12 6-input
LUTs followed by 3 CARRY4s.

This is supposed to be the era of high-level synthesis...

I'm not enough of an FPGA guy to make really deep comments, but this
looks like the state of C compilers about 20 or so years ago. When I
started coding in C one had to write the code with an eye to the assembly
that the thing was spitting out. Now, if you've got a good optimizer
(and the gnu C optimizer is better than I am on all but a very few of the
processors I've worked with recently), you just express your intent and
the compiler makes it happen most efficiently.

Clearly, that's not yet the case, at least for that particular synthesis
tool. It's a pity.

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

I'm looking for work -- see my website!

rickman · Nov 21, 2016

On 11/20/2016 5:43 PM, Tim Wescott wrote:

On Sat, 19 Nov 2016 14:15:18 -0800, Kevin Neilson wrote:

Here's an interesting synthesis result. I synthesized this with Vivado
for Virtex-7:

reg [68:0] x;
reg x_neq_0;
always@(posedge clk) x_neq_0 <= x!=0; // version 1

Then I rephrased the logic:

reg [68:0] x;
reg x_neq_0;
always@(posedge clk) x_neq_0 <= |x; // version 2

These should be the same, right?

Version 1 uses 23 3-input LUTs on the first level followed by a 23-long
carry chain (6 CARRY4 blocks). This is twice as big as it should be.

Version 2 is 3 levels of LUTs, 12 6-input LUTs on the first level, 15
total.

Neither is optimal. What I really want is a combination, 12 6-input
LUTs followed by 3 CARRY4s.

This is supposed to be the era of high-level synthesis...

I'm not enough of an FPGA guy to make really deep comments, but this
looks like the state of C compilers about 20 or so years ago. When I
started coding in C one had to write the code with an eye to the assembly
that the thing was spitting out. Now, if you've got a good optimizer
(and the gnu C optimizer is better than I am on all but a very few of the
processors I've worked with recently), you just express your intent and
the compiler makes it happen most efficiently.

Clearly, that's not yet the case, at least for that particular synthesis
tool. It's a pity.

'tis true, 'tis pity, And pity 'tis 'tis true

--

Rick C

Tom Gardner · Nov 21, 2016

On 20/11/16 22:43, Tim Wescott wrote:

On Sat, 19 Nov 2016 14:15:18 -0800, Kevin Neilson wrote:

Here's an interesting synthesis result. I synthesized this with Vivado
for Virtex-7:

reg [68:0] x;
reg x_neq_0;
always@(posedge clk) x_neq_0 <= x!=0; // version 1

Then I rephrased the logic:

reg [68:0] x;
reg x_neq_0;
always@(posedge clk) x_neq_0 <= |x; // version 2

These should be the same, right?

Version 1 uses 23 3-input LUTs on the first level followed by a 23-long
carry chain (6 CARRY4 blocks). This is twice as big as it should be.

Version 2 is 3 levels of LUTs, 12 6-input LUTs on the first level, 15
total.

Neither is optimal. What I really want is a combination, 12 6-input
LUTs followed by 3 CARRY4s.

This is supposed to be the era of high-level synthesis...

I'm not enough of an FPGA guy to make really deep comments, but this
looks like the state of C compilers about 20 or so years ago. When I
started coding in C one had to write the code with an eye to the assembly
that the thing was spitting out. Now, if you've got a good optimizer
(and the gnu C optimizer is better than I am on all but a very few of the
processors I've worked with recently), you just express your intent and
the compiler makes it happen most efficiently.

Clearly, that's not yet the case, at least for that particular synthesis
tool. It's a pity.

Of course sometimes you don't want optimisation.
Consider, for example, bridging terms in an asynchronous
circuit.

Kevin Neilson · Nov 21, 2016

I'm not enough of an FPGA guy to make really deep comments, but this
looks like the state of C compilers about 20 or so years ago. When I
started coding in C one had to write the code with an eye to the assembly
that the thing was spitting out. Now, if you've got a good optimizer
(and the gnu C optimizer is better than I am on all but a very few of the
processors I've worked with recently), you just express your intent and
the compiler makes it happen most efficiently.

I know! I often feel like I'm a software guy, but stuck in the 80s, poring over every line generated by the assembler to make sure it's optimized.

Mark Curry · Nov 21, 2016

In article <9ae86fdc-dc6a-4d3f-b201-594fe2f6a3cd@googlegroups.com>,
Kevin Neilson <kevin.neilson@xilinx.com> wrote:

I'm not enough of an FPGA guy to make really deep comments, but this
looks like the state of C compilers about 20 or so years ago. When I
started coding in C one had to write the code with an eye to the assembly
that the thing was spitting out. Now, if you've got a good optimizer
(and the gnu C optimizer is better than I am on all but a very few of the
processors I've worked with recently), you just express your intent and
the compiler makes it happen most efficiently.

I know! I often feel like I'm a software guy, but stuck in the 80s, poring over every line generated by the assembler to make sure it's optimized.

But, but "HLS", and "IP Integrator"...

I actually came back a bit let down from a recent Xilinx user's meeting at just how
much focus Xilinx is putting on their 'high level' tools. I'm of the opinion that
Xilinx is sinking a ton of resources into something that only a small minority will
ever use. (And will probably not last long either). To Xilinx, RTL design is
dead...

--Mark

Kevin Neilson · Nov 21, 2016

I actually came back a bit let down from a recent Xilinx user's meeting at just how
much focus Xilinx is putting on their 'high level' tools. I'm of the opinion that
Xilinx is sinking a ton of resources into something that a small minority will
ever use. (And will probably not last long either). To Xilinx, RTL design is
dead...

--Mark

I wish they would just focus all their effort on the synthesizer and placer. The chips get better and better, but the software seems stuck. I think the high-level tools are not for serious users. You can only use them if you don't care about clock speed, and if you don't care about clock speed, you should be using a processor or something.

Mark Curry · Nov 22, 2016

In article <c5206719-b91e-43e5-94ef-dfc84a49d62a@googlegroups.com>,
Kevin Neilson <kevin.neilson@xilinx.com> wrote:

I actually came back a bit let down from a recent Xilinx user's meeting at just how
much focus Xilinx is putting on their 'high level' tools. I'm of the opinion that
Xilinx is sinking a ton of resources into something that a small minority will
ever use. (And will probably not last long either). To Xilinx, RTL design is
dead...

--Mark

I wish they would just focus all their effort on the synthesizer and placer. The chips
get better and better, but the software seems stuck. I think the high-level tools are
not for serious users. You can only use them if you don't care about clock speed, and
if you don't care about clock speed, you should be using a processor or something.

Agreed. Add value where you add value - in your core competencies. Xilinx
adds value here - they design some kick ass technologies, in some very tough
geometries. They have some excellent experts in a wide
breadth of technologies, who can help you design and debug some of the most
advanced designs. They add value in their software back end tools
which must map to this technology. They have great reference designs and documentation.

They don't add value in the front end. They're trying to solve a difficult
problem that's been around for 20 years, that's vexed an entire EDA
software industry. Learn from the ASIC guys here. ASIC companies
punted on their "special sauce" in-house SW 20 years ago, before they got wise and
let the EDA industry do its job. FPGA needs to do the same now.

I'm actually of the opinion that they should punt on synthesis too. Focus on the back
end. I doubt it'll happen - folks are too used to the idea of "free" EDA tools from
the FPGA vendors.

Regards,

Mark

Tim Wescott · Nov 22, 2016

On Mon, 21 Nov 2016 10:07:41 +0000, Tom Gardner wrote:

On 20/11/16 22:43, Tim Wescott wrote:
On Sat, 19 Nov 2016 14:15:18 -0800, Kevin Neilson wrote:

Here's an interesting synthesis result. I synthesized this with
Vivado for Virtex-7:

reg [68:0] x;
reg x_neq_0;
always@(posedge clk) x_neq_0 <= x!=0; // version 1

Then I rephrased the logic:

reg [68:0] x;
reg x_neq_0;
always@(posedge clk) x_neq_0 <= |x; // version 2

These should be the same, right?

Version 1 uses 23 3-input LUTs on the first level followed by a
23-long carry chain (6 CARRY4 blocks). This is twice as big as it
should be.

Version 2 is 3 levels of LUTs, 12 6-input LUTs on the first level, 15
total.

Neither is optimal. What I really want is a combination, 12 6-input
LUTs followed by 3 CARRY4s.

This is supposed to be the era of high-level synthesis...

I'm not enough of an FPGA guy to make really deep comments, but this
looks like the state of C compilers about 20 or so years ago. When I
started coding in C one had to write the code with an eye to the
assembly that the thing was spitting out. Now, if you've got a good
optimizer (and the gnu C optimizer is better than I am on all but a
very few of the processors I've worked with recently), you just express
your intent and the compiler makes it happen most efficiently.

Clearly, that's not yet the case, at least for that particular
synthesis tool. It's a pity.

Of course sometimes you don't want optimisation. Consider, for example,
bridging terms in an asynchronous circuit.

OK. I give up -- what do you mean by "bridging terms"?

In general, I would say that if this is an issue, then (as with the
'volatile' and 'mutable' keywords in C++), there should be a way in the
language to express your intent to the synthesizer -- either a way to say
"don't optimize this section", or a way to say "keep this signal no
matter what", or a syntax that lets you lay down literal hardware, etc.

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

I'm looking for work -- see my website!

GaborSzakacs · Nov 22, 2016

Tim Wescott wrote:

On Mon, 21 Nov 2016 10:07:41 +0000, Tom Gardner wrote:

On 20/11/16 22:43, Tim Wescott wrote:
On Sat, 19 Nov 2016 14:15:18 -0800, Kevin Neilson wrote:

Here's an interesting synthesis result. I synthesized this with
Vivado for Virtex-7:

reg [68:0] x;
reg x_neq_0;
always@(posedge clk) x_neq_0 <= x!=0; // version 1

Then I rephrased the logic:

reg [68:0] x;
reg x_neq_0;
always@(posedge clk) x_neq_0 <= |x; // version 2

These should be the same, right?

Version 1 uses 23 3-input LUTs on the first level followed by a
23-long carry chain (6 CARRY4 blocks). This is twice as big as it
should be.

Version 2 is 3 levels of LUTs, 12 6-input LUTs on the first level, 15
total.

Neither is optimal. What I really want is a combination, 12 6-input
LUTs followed by 3 CARRY4s.

This is supposed to be the era of high-level synthesis...
I'm not enough of an FPGA guy to make really deep comments, but this
looks like the state of C compilers about 20 or so years ago. When I
started coding in C one had to write the code with an eye to the
assembly that the thing was spitting out. Now, if you've got a good
optimizer (and the gnu C optimizer is better than I am on all but a
very few of the processors I've worked with recently), you just express
your intent and the compiler makes it happen most efficiently.

Clearly, that's not yet the case, at least for that particular
synthesis tool. It's a pity.
Of course sometimes you don't want optimisation. Consider, for example,
bridging terms in an asynchronous circuit.

OK. I give up -- what do you mean by "bridging terms"?

In general, I would say that if this is an issue, then (as with the
'volatile' and 'mutable' keywords in C++), there should be a way in the
language to express your intent to the synthesizer -- either a way to say
"don't optimize this section", or a way to say "keep this signal no
matter what", or a syntax that lets you lay down literal hardware, etc.

Bridging terms refers to terms that cover transitions in an asynchronous
sequential circuit. Xilinx tools specifically do not honor this sort of
logic and it really has no business in their FPGA's. However, if you
insist on generating asynchronous sequential logic in a Xilinx FPGA, you
will need to instantiate LUTs to get the coverage you're looking for.

--
Gabor

Tim Wescott · Nov 22, 2016

On Mon, 21 Nov 2016 21:19:50 +0000, Mark Curry wrote:

In article <9ae86fdc-dc6a-4d3f-b201-594fe2f6a3cd@googlegroups.com>,
Kevin Neilson <kevin.neilson@xilinx.com> wrote:
I'm not enough of an FPGA guy to make really deep comments, but this
looks like the state of C compilers about 20 or so years ago. When I
started coding in C one had to write the code with an eye to the
assembly that the thing was spitting out. Now, if you've got a good
optimizer (and the gnu C optimizer is better than I am on all but a
very few of the processors I've worked with recently), you just
express your intent and the compiler makes it happen most efficiently.

I know! I often feel like I'm a software guy, but stuck in the 80s,
poring over every line generated by the assembler to make sure it's
optimized.

But, but "HLS", and "IP Integrator"...

I actually came back a bit let down from a recent Xilinx user's meeting
at just how much focus Xilinx is putting on their 'high level' tools.
I'm of the opinion that Xilinx is sinking a ton of resources into
something that a small minority will ever use. (And will probably not
last long either). To Xilinx, RTL design is dead...

--Mark

If that small minority is the one with the most dollars behind it, then
they win. Dunno if that's the case or not, but it seems like there's a
lot of design of high-volume, cost-sensitive stuff that's done mostly by
applications engineers these days.

Or, Xilinx is wrong, and they'll spend a lot of money on uselessness.
That's never happened before in the history of semiconductors, now has
it?

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

I'm looking for work -- see my website!

Tim Wescott · Nov 22, 2016

On Mon, 21 Nov 2016 14:51:13 -0800, Kevin Neilson wrote:

I actually came back a bit let down from a recent Xilinx user's meeting
at just how much focus Xilinx is putting on their 'high level' tools.
I'm of the opinion that Xilinx is sinking a ton of resources into
something that a small minority will ever use. (And will probably not
last long either). To Xilinx, RTL design is dead...

--Mark

I wish they would just focus all their effort on the synthesizer and
placer. The chips get better and better, but the software seems stuck.
I think the high-level tools are not for serious users. You can only
use them if you don't care about clock speed, and if you don't care
about clock speed, you should be using a processor or something.

Maybe if the synthesizer got better the demand for hugely fast chips
would go down, and thus they'd shoot themselves in the foot -- at least
from their perspective.

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

I'm looking for work -- see my website!

Tom Gardner · Nov 22, 2016

On 21/11/16 20:19, Tim Wescott wrote:

On Mon, 21 Nov 2016 10:07:41 +0000, Tom Gardner wrote:

On 20/11/16 22:43, Tim Wescott wrote:
On Sat, 19 Nov 2016 14:15:18 -0800, Kevin Neilson wrote:

Here's an interesting synthesis result. I synthesized this with
Vivado for Virtex-7:

reg [68:0] x;
reg x_neq_0;
always@(posedge clk) x_neq_0 <= x!=0; // version 1

Then I rephrased the logic:

reg [68:0] x;
reg x_neq_0;
always@(posedge clk) x_neq_0 <= |x; // version 2

These should be the same, right?

Version 1 uses 23 3-input LUTs on the first level followed by a
23-long carry chain (6 CARRY4 blocks). This is twice as big as it
should be.

Version 2 is 3 levels of LUTs, 12 6-input LUTs on the first level, 15
total.

Neither is optimal. What I really want is a combination, 12 6-input
LUTs followed by 3 CARRY4s.

This is supposed to be the era of high-level synthesis...

I'm not enough of an FPGA guy to make really deep comments, but this
looks like the state of C compilers about 20 or so years ago. When I
started coding in C one had to write the code with an eye to the
assembly that the thing was spitting out. Now, if you've got a good
optimizer (and the gnu C optimizer is better than I am on all but a
very few of the processors I've worked with recently), you just express
your intent and the compiler makes it happen most efficiently.

Clearly, that's not yet the case, at least for that particular
synthesis tool. It's a pity.

Of course sometimes you don't want optimisation. Consider, for example,
bridging terms in an asynchronous circuit.

OK. I give up -- what do you mean by "bridging terms"?

https://en.wikipedia.org/wiki/Karnaugh_map#Race_hazards

It is called a bridging term since it is a logically
redundant term that straddles two required minterms.
Its purpose is to remove static hazards (glitches) that
can occur when inputs change, typically when there
are unequal propagation delays inside the implementation.
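
A static hazard of this kind can be sketched in a few lines of Python. The example below is an illustration under stated assumptions, not something taken from the thread: it uses the textbook 2:1 multiplexer f = a·s' + b·s, models the inverter on s as lagging by one time step, and treats a·b as the bridging term. The function simulate_mux and its unit-delay model are hypothetical.

```python
def simulate_mux(s_wave, a=1, b=1, bridging=False):
    """Evaluate f = a*~s + b*s over time, with the inverted s one step late."""
    out = []
    prev_s = s_wave[0]
    for s in s_wave:
        not_s = 1 - prev_s          # inverter output lags s by one time step
        f = (a & not_s) | (b & s)   # the two required product terms
        if bridging:
            f |= a & b              # logically redundant term covering the transition
        out.append(f)
        prev_s = s
    return out


s_wave = [1, 1, 0, 0]  # s falls while a = b = 1, so f should stay at 1
print(simulate_mux(s_wave))                 # [1, 1, 0, 1]  (glitch at the transition)
print(simulate_mux(s_wave, bridging=True))  # [1, 1, 1, 1]  (glitch removed)
```

With a = b = 1 the output should not depend on s at all; the momentary 0 appears only because both product terms are off while the inverter catches up.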

In general, I would say that if this is an issue, then (as with the
'volatile' and 'mutable' keywords in C++), there should be a way in the
language to express your intent to the synthesizer -- either a way to say
"don't optimize this section", or a way to say "keep this signal no
matter what", or a syntax that lets you lay down literal hardware, etc.

It only occurs in asynchronous circuits; the <ahem>
workaround is to only have synchronous designs and
implementations.

Tom Gardner · Nov 22, 2016

On 21/11/16 20:47, GaborSzakacs wrote:

Tim Wescott wrote:
On Mon, 21 Nov 2016 10:07:41 +0000, Tom Gardner wrote:

On 20/11/16 22:43, Tim Wescott wrote:
On Sat, 19 Nov 2016 14:15:18 -0800, Kevin Neilson wrote:

Here's an interesting synthesis result. I synthesized this with
Vivado for Virtex-7:

reg [68:0] x;
reg x_neq_0;
always@(posedge clk) x_neq_0 <= x!=0; // version 1

Then I rephrased the logic:

reg [68:0] x;
reg x_neq_0;
always@(posedge clk) x_neq_0 <= |x; // version 2

These should be the same, right?

Version 1 uses 23 3-input LUTs on the first level followed by a
23-long carry chain (6 CARRY4 blocks). This is twice as big as it
should be.

Version 2 is 3 levels of LUTs, 12 6-input LUTs on the first level, 15
total.

Neither is optimal. What I really want is a combination, 12 6-input
LUTs followed by 3 CARRY4s.

This is supposed to be the era of high-level synthesis...
I'm not enough of an FPGA guy to make really deep comments, but this
looks like the state of C compilers about 20 or so years ago. When I
started coding in C one had to write the code with an eye to the
assembly that the thing was spitting out. Now, if you've got a good
optimizer (and the gnu C optimizer is better than I am on all but a
very few of the processors I've worked with recently), you just express
your intent and the compiler makes it happen most efficiently.

Clearly, that's not yet the case, at least for that particular
synthesis tool. It's a pity.
Of course sometimes you don't want optimisation. Consider, for example,
bridging terms in an asynchronous circuit.

OK. I give up -- what do you mean by "bridging terms"?

In general, I would say that if this is an issue, then (as with the 'volatile'
and 'mutable' keywords in C++), there should be a way in the language to
express your intent to the synthesizer -- either a way to say "don't optimize
this section", or a way to say "keep this signal no matter what", or a syntax
that lets you lay down literal hardware, etc.

Bridging terms refers to terms that cover transitions in an asynchronous
sequential circuit. Xilinx tools specifically do not honor this sort of
logic and it really has no business in their FPGA's. However, if you
insist on generating asynchronous sequential logic in a Xilinx FPGA, you
will need to instantiate LUTs to get the coverage you're looking for.

Agreed. You will probably also have to nail
down the LUTs and the signal routing.

I suspect that, since Xilinx has a very good range
of I/O primitives, there really isn't any benefit
to full async design in their FPGAs.

Tom Gardner · Nov 22, 2016

On 22/11/16 00:33, Tim Wescott wrote:

On Mon, 21 Nov 2016 14:51:13 -0800, Kevin Neilson wrote:

I actually came back a bit let down from a recent Xilinx user's meeting
at just how much focus Xilinx is putting on their 'high level' tools.
I'm of the opinion that Xilinx is sinking a ton of resources into
something that a small minority will ever use. (And will probably not
last long either). To Xilinx, RTL design is dead...

--Mark

I wish they would just focus all their effort on the synthesizer and
placer. The chips get better and better, but the software seems stuck.
I think the high-level tools are not for serious users. You can only
use them if you don't care about clock speed, and if you don't care
about clock speed, you should be using a processor or something.

Maybe if the synthesizer got better the demand for hugely fast chips
would go down, and thus they'd shoot themselves in the foot -- at least
from their perspective.

Synthesis is easy. Place and route is hard.
A big question is how to either decouple or
integrate them.

Particularly when you see the size of the
big Xilinx chips and consider the relative
time taken to get across the chip and through
a single LUT (and then through the integrated
ARM cores).

But I suspect I'm close to teaching you how
to suck eggs.

Tim Wescott · Nov 22, 2016

On Tue, 22 Nov 2016 01:33:12 +0000, Tom Gardner wrote:

On 22/11/16 00:33, Tim Wescott wrote:
On Mon, 21 Nov 2016 14:51:13 -0800, Kevin Neilson wrote:

I actually came back a bit let down from a recent Xilinx user's
meeting at just how much focus Xilinx is putting on their 'high
level' tools. I'm of the opinion that Xilinx is sinking a ton of
resources into something that a small minority will ever use. (And
will probably not last long either). To Xilinx, RTL design is
dead...

--Mark

I wish they would just focus all their effort on the synthesizer and
placer. The chips get better and better, but the software seems
stuck.
I think the high-level tools are not for serious users. You can only
use them if you don't care about clock speed, and if you don't care
about clock speed, you should be using a processor or something.

Maybe if the synthesizer got better the demand for hugely fast chips
would go down, and thus they'd shoot themselves in the foot -- at least
from their perspective.

Synthesis is easy. Place and route is hard. A big question is how to
either decouple or integrate them.

Particularly when you see the size of the big Xilinx chips and consider
the relative time taken to get across the chip and through a single LUT
(and then through the integrated ARM cores).

But I suspect I'm close to teaching you how to suck eggs.

Nah -- about the teaching me to suck eggs part, at least. I understand
the principles involved, but it's not something I've ever done.

Assuming that people know what the hell they're doing it can't be an easy
problem, because it hasn't been fully solved. At least -- to my
knowledge the process is still an iterative one that's at least partially
based on some sort of a pseudo-random process (presumably simulated
annealing).

--

Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

I'm looking for work -- see my website!

rickman · Nov 22, 2016

On 11/21/2016 3:47 PM, GaborSzakacs wrote:

Tim Wescott wrote:
On Mon, 21 Nov 2016 10:07:41 +0000, Tom Gardner wrote:

On 20/11/16 22:43, Tim Wescott wrote:
On Sat, 19 Nov 2016 14:15:18 -0800, Kevin Neilson wrote:

Here's an interesting synthesis result. I synthesized this with
Vivado for Virtex-7:

reg [68:0] x;
reg x_neq_0;
always@(posedge clk) x_neq_0 <= x!=0; // version 1

Then I rephrased the logic:

reg [68:0] x;
reg x_neq_0;
always@(posedge clk) x_neq_0 <= |x; // version 2

These should be the same, right?

Version 1 uses 23 3-input LUTs on the first level followed by a
23-long carry chain (6 CARRY4 blocks). This is twice as big as it
should be.

Version 2 is 3 levels of LUTs, 12 6-input LUTs on the first level, 15
total.

Neither is optimal. What I really want is a combination, 12 6-input
LUTs followed by 3 CARRY4s.

This is supposed to be the era of high-level synthesis...
I'm not enough of an FPGA guy to make really deep comments, but this
looks like the state of C compilers about 20 or so years ago. When I
started coding in C one had to write the code with an eye to the
assembly that the thing was spitting out. Now, if you've got a good
optimizer (and the gnu C optimizer is better than I am on all but a
very few of the processors I've worked with recently), you just express
your intent and the compiler makes it happen most efficiently.

Clearly, that's not yet the case, at least for that particular
synthesis tool. It's a pity.
Of course sometimes you don't want optimisation. Consider, for example,
bridging terms in an asynchronous circuit.

OK. I give up -- what do you mean by "bridging terms"?

In general, I would say that if this is an issue, then (as with the
'volatile' and 'mutable' keywords in C++), there should be a way in
the language to express your intent to the synthesizer -- either a way
to say "don't optimize this section", or a way to say "keep this
signal no matter what", or a syntax that lets you lay down literal
hardware, etc.

Bridging terms refers to terms that cover transitions in an asynchronous
sequential circuit. Xilinx tools specifically do not honor this sort of
logic and it really has no business in their FPGA's. However, if you
insist on generating asynchronous sequential logic in a Xilinx FPGA, you
will need to instantiate LUTs to get the coverage you're looking for.

Xilinx parts do not require bridging terms. If two canonical terms,
adjacent in the Karnaugh map, are set to the same value in the LUT there
is no glitch if a single input transitions from one term to another.
This is because they use transmission gates for the multiplexer and
there is enough capacitance to hold a signal on the output if neither
signal is driving the output as the switches transition.
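
The logical half of this argument can be seen by modelling a LUT as a stored truth table. The sketch below is a hedged illustration only: it reuses the textbook multiplexer f = a·s' + b·s as an assumed example, and it does not model the electrical behaviour of the transmission-gate multiplexer described above. The helper lut_init is hypothetical and is not a vendor API.

```python
from itertools import product


def lut_init(func, n_inputs):
    """Return the truth table a LUT would store: one bit per input combination."""
    return [func(*bits) & 1 for bits in product((0, 1), repeat=n_inputs)]


def mux(a, b, s):
    return (a & (1 - s)) | (b & s)


def mux_with_bridge(a, b, s):
    # Same function with the redundant a*b bridging term added.
    return mux(a, b, s) | (a & b)


plain = lut_init(mux, 3)
bridged = lut_init(mux_with_bridge, 3)
print(plain)             # [0, 0, 0, 1, 1, 0, 1, 1]
print(plain == bridged)  # True
```

Both expressions produce the same table, so the LUT contents are identical with or without the bridging term. The two entries for a = b = 1 (s = 0 and s = 1) both hold 1, which is the "adjacent terms set to the same value" case.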

If you think about it just a bit, you will realize most FPGA LUTs only
have canonical product terms and so can't have "cover terms" or
"bridging terms".

--

Rick C

Tom Gardner · Nov 22, 2016

On 22/11/16 01:50, Tim Wescott wrote:

On Tue, 22 Nov 2016 01:33:12 +0000, Tom Gardner wrote:

On 22/11/16 00:33, Tim Wescott wrote:
On Mon, 21 Nov 2016 14:51:13 -0800, Kevin Neilson wrote:

I actually came back a bit let down from a recent Xilinx user's
meeting at just how much focus Xilinx is putting on their 'high
level' tools. I'm of the opinion that Xilinx is sinking a ton of
resources into something that a small minority will ever use. (And
will probably not last long either). To Xilinx, RTL design is
dead...

--Mark

I wish they would just focus all their effort on the synthesizer and
placer. The chips get better and better, but the software seems
stuck.
I think the high-level tools are not for serious users. You can only
use them if you don't care about clock speed, and if you don't care
about clock speed, you should be using a processor or something.

Maybe if the synthesizer got better the demand for hugely fast chips
would go down, and thus they'd shoot themselves in the foot -- at least
from their perspective.

Synthesis is easy. Place and route is hard. A big question is how to
either decouple or integrate them.

Particularly when you see the size of the big Xilinx chips and consider
the relative time taken to get across the chip and through a single LUT
(and then through the integrated ARM cores).

But I suspect I'm close to teaching you how to suck eggs.

Nah -- about the teaching me to suck eggs part, at least. I understand
the principles involved, but it's not something I've ever done.

Assuming that people know what the hell they're doing it can't be an easy
problem, because it hasn't been fully solved. At least -- to my
knowledge the process is still an iterative one that's at least partially
based on some sort of a pseudo-random process (presumably simulated
annealing).

I'm sure heuristics are involved, of course, but even
they will only get you so far.

From memory, a CLB "gate" delay is of the order of 100ps
and it can take ~1ns for a logic signal to cross the chip
(clocks can be a bit faster due to dedicated drivers
and tracks). Even a "global reset" becomes a heretical
concept.

Now, what delay should you guess a particular gate+track
will have, and where should you place it? Ditto the
100,000 others - to maximise the clock rate of the
ensemble.

As you might guess, the workflow is
1 design
2 synthesise (from RTL/behavioural/system design)
3 simulate, to get an idea of speed
4 place and route
5 simulate, with "actual" delays
6 utter expletive deleteds
7 goto 1

Yes, there are many means to constrain the designs and
help the place and route, from specifying which timings
matter to nailing down functions in individual LUT/CLBs.
But they only go so far.

Kevin Neilson · Nov 22, 2016

I'm sure heuristics are involved, of course, but even
they will only get you so far.

From memory, a CLB "gate" delay is of the order of 100ps
and it can take ~1ns for a logic signal to cross the chip
(clocks can be a bit faster due to dedicated drivers
and tracks). Even a "global reset" becomes a heretical
concept.

In the part I'm using, LUT delays are 43 ps and net delays between them can easily be 1 ns. I'm looking at a net segment now that is 950ps and it looks like it only goes about 3% the width of the die. It's short. (It does go across an IOB column, which is probably part of the problem.) The heuristics in the synthesizer seem to dislike using MUXF7s and MUXCYs, even though they have dedicated routing, because the LUT delay is only 43ps and that makes it look good. But when the route to it is >500ps, the advantage is lost.
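
Plugging those figures into a rough path estimate shows why routing dominates. The sketch below is only arithmetic on the numbers quoted above (43 ps per LUT, 500 ps as a routed-net figure). It assumes a 3-level LUT tree has two LUT-to-LUT nets, and it ignores clock-to-out, setup, and carry-chain delays, none of which are given here. The helper path_delay_ps is hypothetical.

```python
LUT_DELAY_PS = 43   # LUT delay quoted above
NET_DELAY_PS = 500  # routed-net delay used as the threshold above


def path_delay_ps(lut_levels, lut_ps=LUT_DELAY_PS, net_ps=NET_DELAY_PS):
    """Return (logic, routing) delay for a chain of LUTs joined by general routing."""
    logic = lut_levels * lut_ps
    routing = (lut_levels - 1) * net_ps  # one net between each pair of LUT levels
    return logic, routing


logic, routing = path_delay_ps(3)
print(f"logic {logic} ps, routing {routing} ps")  # logic 129 ps, routing 1000 ps
```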

These are nice chips, but the synthesizer is still weak. And it seems odd that a slight rephrasing resulting in an equivalent Boolean expression would yield an entirely different synthesis result.

Richard Damon · Nov 23, 2016

On 11/21/16 5:07 AM, Tom Gardner wrote:

On 20/11/16 22:43, Tim Wescott wrote:
On Sat, 19 Nov 2016 14:15:18 -0800, Kevin Neilson wrote:

Here's an interesting synthesis result. I synthesized this with Vivado
for Virtex-7:

reg [68:0] x;
reg x_neq_0;
always@(posedge clk) x_neq_0 <= x!=0; // version 1

Then I rephrased the logic:

reg [68:0] x;
reg x_neq_0;
always@(posedge clk) x_neq_0 <= |x; // version 2

These should be the same, right?

Version 1 uses 23 3-input LUTs on the first level followed by a 23-long
carry chain (6 CARRY4 blocks). This is twice as big as it should be.

Version 2 is 3 levels of LUTs, 12 6-input LUTs on the first level, 15
total.

Neither is optimal. What I really want is a combination, 12 6-input
LUTs followed by 3 CARRY4s.

This is supposed to be the era of high-level synthesis...

I'm not enough of an FPGA guy to make really deep comments, but this
looks like the state of C compilers about 20 or so years ago. When I
started coding in C one had to write the code with an eye to the assembly
that the thing was spitting out. Now, if you've got a good optimizer
(and the gnu C optimizer is better than I am on all but a very few of the
processors I've worked with recently), you just express your intent and
the compiler makes it happen most efficiently.

Clearly, that's not yet the case, at least for that particular synthesis
tool. It's a pity.

Of course sometimes you don't want optimisation.
Consider, for example, bridging terms in an asynchronous
circuit.

If you are thinking in terms of an AND-OR tree for the typical LUT-based
FPGA, you aren't going to get it right. Most FPGAs now use LUTs,
which, at least for a single LUT, are normally guaranteed to be glitch
free for single-input transitions (so no need for the bridging terms). If
you need more inputs than a single LUT provides, and you need the
glitch-free performance, then trying to force a massive AND-OR tree is
normally going to be very inefficient, and I find it worth building the
exact structure I need with the low-level, vendor-provided fundamental
LUT/carry primitives.
