EDK : FSL macros defined by Xilinx are wrong

Ray Andraka wrote:
Which reconfigurable FPGAs would those be with the non-volatile
bitstreams? I'm not aware of any.
What are XC18V04's? Magic ROMs?
What are the platform flash parts? Magic ROMs?
They are CERTAINLY non-volatile every time I've checked.

In fact, nonvolatile includes disks, optical, and just about any other
medium that doesn't go poof when you turn the power off.

and now this assertion that all the parts
have non-volatile storage sure makes it sound like you don't have the
hands on experience with FPGAs you'd like us to believe you have.
Ok Wizard God of FPGAs ... just how do you configure your FPGAs
without having some form of non-volatile storage handy? Whatever the
configuration bit stream source is, if it is reprogrammable ... i.e.
ignore 17xx PROMs ... you can store the defect list?

UNDERSTAND?

Now, the insults are NOT -- I REPEAT NOT - being civil.


What are you doing different in the RC design then?
With RC there is an operating system, complete with disk based
filesystem. The intent is to do fast (VERY FAST) place and route on the
fly.

From my
perspective, the only ways to be able to tolerate changes in
the PAR solution and still make timing are to either be leaving a
considerable amount of excess performance margin (i.e., not running the
parts at the high performance/high density corner), or spending an
inordinate amount of time looking for a suitable PAR solution for each
defect map, regardless of how coarse the map might be.
You are finally getting warm. Several times in this forum I discussed
what I call "clock binning", where the FPGA accel board has several
fixed clocks arranged as integer powers. The dynamic runtime linker
(very fast place and route) places, routes, and assigns the next
slowest clock that matches the code block just linked. The concept is
to use the fastest available clock that the code block meets timing
at. NOT change the clocks to fix the code.
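To make the clock-binning idea concrete, here is a minimal C sketch, assuming a hypothetical board with four fixed clocks arranged as integer powers of two and an fmax figure coming out of the fast place-and-route step; the bin values and function names are illustrative, not from any real board or tool:

#include <stdio.h>

/* Hypothetical fixed clock bins, fastest first, arranged as integer powers of two. */
static const double clock_bins_mhz[] = { 200.0, 100.0, 50.0, 25.0 };
static const int num_bins = sizeof(clock_bins_mhz) / sizeof(clock_bins_mhz[0]);

/* Pick the fastest fixed clock that the just-linked code block meets timing at.
 * fmax_mhz comes from the fast place-and-route/timing step for that block.
 * Returns the bin index, or -1 if the block cannot run at even the slowest bin. */
static int assign_clock_bin(double fmax_mhz)
{
    for (int i = 0; i < num_bins; i++) {
        if (fmax_mhz >= clock_bins_mhz[i])
            return i;            /* fastest bin the block meets timing at */
    }
    return -1;                   /* too slow for any bin: re-link, don't retune clocks */
}

int main(void)
{
    double fmax = 67.0;          /* example timing result for one code block */
    int bin = assign_clock_bin(fmax);
    if (bin >= 0)
        printf("block (fmax %.1f MHz) assigned to %.0f MHz clock bin\n",
               fmax, clock_bins_mhz[bin]);
    return 0;
}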

From your previous posts regarding open tools and use of HLLs, I
suspect it is more on the leaving lots of performance on the table side
of things.
Certainly ... it may not be hardware optimized to the picosecond. Some
will be, but that is a different problem. Shall we discuss every
project you have done in 12 years as though it was the SAME problem
with identical requirements? I think not. So why do you do it for mine?

In my own experience, the advantage offered by FPGAs is
rapidly eroded when you don't take advantage of the available
performance.
The performance gains are measured against single-threaded CPUs with
serial memory systems. The performance gains come from high degrees of
parallelism in the FPGA. Giving up a little of the best case
performance is NOT a problem. AND if it was, for a large dedicated
application, then by all means, use traditional PAR and fit the best
case clock to the code body.

If you are leaving enough margin in the design so that it is
tolerant to fortuitous routing changes to work around unique defects,
then I sincerely doubt you are going to run into the runaway thermal
problems you were concerned with.
This is a completely different problem set than that particular
question was addressing. That problem case was about hand packed
serial-parallel MACs doing Red-Black ordered simulations with kernel
sizes between 80-200 LUTs, tiled in tight, running at best case clock
rate. 97% active logic. VERY high transition rates. About the only
thing worse would be purposefully toggling everything.

A COMPLETELY DIFFERENT PROBLEM is compiling arbitrary C code and
executing it with a compile, link, and go strategy. An example is a
student iteratively testing a piece of code in an edit, compile and
run sequence. In that case, getting the netlist bound to a reasonable
set of LUTs quickly and running the test is much more important than
extracting the last bit of performance from it.

Like it or not .... that is what we mean by using the FPGA to EXECUTE
netlists. We are not designing highly optimized hardware. The FPGA is
simply a CPU -- a very parallel CPU.

Show me that my intuition is wrong.
First you have taken and merged several different concepts, as though
they were somehow the same problem .... from various posting topics
over the last several months.

Surely we can distort anything you might want to present by taking your
posts out of context and arguing them in the worst possible combination
against you.

Let's try - ONE topic, one discussion.

Seems that you have made up your mind. As you have been openly
insulting and mocking ... have a good day. When you are really
interested, maybe we can have a respectful discussion. You are pretty
clueless today.
 
John, last time I checked, FPGAs did not get delivered from Xilinx with
the config prom. Sure, you can store a defect map on the config prom,
or on your disk drive, or battery backed sram or whatever, but the point
is that defect map has to get into your system somehow. Earlier in this
thread you were asking/begging Xilinx to provide the defect map, even if
just to one of 16 quadrants for each non-zero-defect part delivered.
That leads to the administration nightmare I was talking about.

In the absence of a defect map provided by Xilinx (which you were
lobbying hard for a few days ago), the only other option is for the end
user to run a large set of test configurations on each device while in
system to map the defects. Writing that set of test configurations
requires knowledge of the device at a level of detail that is not
available publicly, or getting hold of the Xilinx test configurations
and expanding on them to obtain fault isolation. I'm not sure you
realize the number of routing permutations that need to be run just to
get fault coverage of all the routing, switchboxes, LUTs, etc. in the
device, much less achieve fault isolation. Your posts regarding that
seem to support this observation.

With RC there is an operating system, complete with disk based
filesystem. The intent is to do fast (VERY FAST) place and route on the
fly.
Now see, that is the fly in the ointment. The piece that is missing is
the "very fast place and route". There is and has been a lot of
research into improving place and route, but the fact of the matter is
that getting performance that will make the FPGA compete favorably
against a microprocessor is going to require a time to completion that
is orders of magnitude faster than what we have now, without giving up
much in the way of performance. Sure, I can slow a clock down (by bin
steps or using a programmable clock) to match the clock to the timing
analysis for the current design, but that doesn't help you much for
many real-world problems where you have a set time to complete the
task. (Yes, I know that many RC apps are not explicitly time
constrained, but they do have to finish enough ahead of other
approaches to make them economically justifiable.) Remember also that
the RC FPGA
starts out with a sizable handicap against a microprocessor with the
time to load a configuration, plus if the configuration is generated on
the fly the time to perform place and route. Once that hurdle is
crossed, you still need enough of a performance boost over the
microprocessor to amortize that set-up cost over the processing interval
to come out ahead. Obviously, you gain from the parallelism in the
FPGA, but if you don't also mind the performance angle, it is quite easy
to wind up with designs that can only be clocked at a few tens of MHz,
and that often use up so much area that you don't have room for enough
parallelism to make up for the much lower clock rate. So that puts the
dynamically configured RC in a box, where problems that aren't
repetitive and complex enough to overcome the PAR and configuration
times are better done on a microprocessor, and problems that take long
enough to make the PAR time insignificant may be better served by a more
optimized design than what has been discussed, and we're talking not
only about PAR results, but also architecturally optimizing the design
to get the highest clock rates and density. In my experience, FPGAs can
do roughly 100x the performance of similar generation microprocessors,
give or take an order of magnitude depending on the exact application
and provided the FPGA design is done well. It is very easy to lose the
advantage by sub-optimal design. If I had a dollar for every time I've
gotten remarks that 100x performance is not possible, or that so and so
did an FPGA design expecting only 10x and it turned out slower than a
microprocessor because it wouldn't meet timing etc, I'd be retired.
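As a rough, hedged illustration of the amortization argument above, a back-of-the-envelope C sketch; every number in it (PAR time, configuration time, per-item CPU and FPGA times) is made up for the example:

#include <stdio.h>

/* Back-of-the-envelope break-even for dynamically configured RC:
 * the FPGA only wins once its per-item advantage has paid back the
 * one-time place-and-route plus configuration overhead. */
int main(void)
{
    double par_s       = 30.0;     /* assumed on-the-fly place-and-route time, seconds */
    double config_s    = 0.1;      /* assumed bitstream load time, seconds */
    double cpu_item_s  = 1.0e-3;   /* assumed CPU time per work item */
    double fpga_item_s = 1.0e-5;   /* assumed FPGA time per item (100x faster) */

    double setup = par_s + config_s;
    double gain_per_item = cpu_item_s - fpga_item_s;
    double breakeven_items = setup / gain_per_item;

    printf("setup overhead: %.1f s\n", setup);
    printf("break-even at about %.0f work items (%.1f s of CPU-equivalent work)\n",
           breakeven_items, breakeven_items * cpu_item_s);
    return 0;
}

Jobs shorter than that break-even point stay on the microprocessor; jobs much longer than it fall into the other side of the box described above.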

I guess I owe you an apology for merging your separate projects. I was
under the impression (and glancing back over your posts can still
interpret it this way) that these different topics were all addressing
facets of the same RC project. I assumed (apparently erroneously) that
this was all towards the same RC system. I also apologize for the
insults, as I didn't mean to insult you or mock you; rather I was trying
to point out that, taking all your posts together, I thought you were
trying to hit all the corners of the design space at once, and at the
same time do it on the cheap with defect ridden parts. I am still not
convinced you aren't trying to hit everything at once ... you know that
old good, fast, cheap, pick any two thing. Rereading my post, I see that
I let my tone get out of hand, and for that I ask your forgiveness.

In any event, truly dynamic RC remains a tough nut to crack because of
the PAR and configuration time issues. By adding the desire to use
defect ridden parts, you are only making an already tough job much
harder. I respectfully suggest you try first to get the system together
using perfect FPGAs, as I believe you will find you already have an
enormous task in front of you between the HLL to gates, the need for
fast PAR, partitioning the problem over multiple FPGAs and between FPGAs
and software, making a usable user interface and libraries etc, without
exponentially compounding the problem by throwing defect tolerance into
the mix. Baby steps are necessary to get through something as complex
as this.

 
Ray Andraka wrote:
John, last time I checked, FPGAs did not get delivered from Xilinx with
the config prom. Sure, you can store a defect map on the config prom,
or on your disk drive, or battery backed sram or whatever, but the point
is that defect map has to get into your system somehow. Earlier in this
thread you were asking/begging Xilinx to provide the defect map, even if
just to one of 16 quadrants for each non-zero-defect part delivered.
That leads to the administration nightmare I was talking about.
Since NOTHING exists today, I've offered several IDEAS, including the
board mfg taking responsibility for the testing and passing the results
on to the end user .... as well as the testing being done at the end
user's site using a variety of options including triple redundancy and
scrubbing. Multiple ideas have been presented to provide options and
room for discussion. Maybe you missed that.

Not yet discussed is a proposal that the FPGA vendor could provide
maybe subquadrant-level defect bin sorting .... which could be
transmitted via markings on the package, or by order selection, or even
by using 4 balls on the package to specify the subquadrant.
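Purely for illustration (this is not any real Xilinx marking scheme), four package balls tied high or low could name one of 16 subquadrants; a trivial C sketch of the decode:

#include <stdio.h>

/* Hypothetical defect-bin decode: four dedicated package balls, each tied
 * high (1) or low (0), could name one of 16 subquadrants containing the defect. */
static unsigned defect_subquadrant(unsigned b3, unsigned b2, unsigned b1, unsigned b0)
{
    return (b3 << 3) | (b2 << 2) | (b1 << 1) | b0;   /* 0..15 */
}

int main(void)
{
    printf("balls 1,0,1,1 -> subquadrant %u\n", defect_subquadrant(1, 0, 1, 1));
    return 0;
}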

For someone interested in finding solutions, there is generally the
intellectual capacity to connect the dots and finish a proposal with
alternate ideas.

For someone being obstructionist, there are no end to the objections
that can be raised.


I'm not sure you realize
the number of routing permutations that need to be run just to get fault
coverage of all the routing, switchboxes, LUTs, etc in the device, and
much less achieve fault isolation. Your posts regarding that seem to
support this observation.
I'm not sure that you understand: where there is a will, it certainly
can and will be done. After all, when it comes to routers for FPGAs
there are many independent implementations .... it's not some
Christ-delivered-on-the-mount technology; software guys can do these
things.

With RC there is an operating system, complete with disk based
filesystem. The intent is to do fast (VERY FAST) place and route on the
fly.


but the fact of the matter is
that in order to get performance that will make the FPGA compete
favorably against a microprocessor is going to require a fast time to
completion that is orders of magnitude faster than what we have now
without giving up much in the way of performance.
Ray, the problem is that you clearly have lost sight that sometimes the
expensive and critical resource to optimize for is people. Sometimes
it's the machine.

I know that many RC apps are not explicitly time
constrained, but they do have to finish enough ahead of other approaches
to make them economically justifiable).
Ray .... stop lecturing ... I understand, and you are worried about
YOUR problems here, and clearly lack the mind reading ability to
understand where I am coming from or where I am going.

There are a set of problems, very similar to DSP filters, which are
VERY parallel and scale very nicely in FPGAs. For those problems,
FPGAs are a couple orders of magnitude faster. Others, that are
truly sequential with limited parallelism, are much better done on a
traditional ISA. It's useful to mate an FPGA system with a
complementary traditional CPU. This is true in each of the prototypes I
built in the first couple years of my research. More recently I've
also looked at FPGA centric designs for a different class of problems.

Remember also, that the RC FPGA
starts out with a sizable handicap against a microprocessor with the
time to load a configuration, plus if the configuration is generated on
the fly the time to perform place and route. Once that hurdle is
crossed, you still need enough of a performance boost over the
microprocessor to amortize that set-up cost over the processing interval
to come out ahead. Obviously, you gain from the parallelism in the
FPGA, but if you don't also mind the performance angle, it is quite easy
to wind up with designs that can only be clocked at a few tens of MHz,
and often that use up so much area that you don't have room for enough
parallelism to make up for the much lower clock rate.
So? What's the point? .... most of these applications run for hours,
even days. I would like a future generation FPGA that has parallel,
memory-like access to the configuration space with high bandwidth ...
that is not today, and I've said so.

You are lecturing again, totally clueless about the issues I've
considered over the last 5 years, the architectures I've explored, the
applications I find interesting, or even what my long term intent is.
There are a lot of things I will not discuss without a purchase
order and under NDA.

So that puts the
dynamically configured RC in a box, where problems that aren't
repetitive and complex enough to overcome the PAR and configuration
times are better done on a microprocessor, and problems that take long
enough to make the PAR time insignificant may be better served by a more
optimized design than what has been discussed, and we're talking not
only about PAR results, but also architecturally optimizing the design
to get the highest clock rates and density.
So, what's your point? You think I haven't gone down that path? ....
there is a big reason I want ADB and the related interfaces that were
done for JHDLBits and several other university projects. Your obsession
with "highest clock rates" leaves you totally blind to other tradeoffs.

In my experience, FPGAs can
do roughly 100x the performance of similar generation microprocessors,
give or take an order of magnitude depending on the exact application
and provided the FPGA design is done well. It is very easy to lose the
advantage by sub-optimal design. If I had a dollar for every time I've
gotten remarks that 100x performance is not possible, or that so and so
did an FPGA design expecting only 10x and it turned out slower than a
microprocessor because it wouldn't meet timing etc, I'd be retired.
With hand layout, I've done certain very small test kernels which,
replicated to fill a dozen 2V6000s, pull three orders of magnitude over
the reference SMP cluster for some important applications I wish to
target ... you don't get to a design that can reach petaflops, which is
my goal, by being conservative. I've used live tests on the Dini
boards to confirm the basic processing rate and data transfers between
packages for a number of benchmarks and test kernels, and they seem to
scale at this point. I've also done similar numbers with a 2V6000
array. Later this year my goal is to get a few hundred LX200s, and
see if the scaling predictions are where I expect.

So, I agree, or I wouldn't be doing this.

Rereading my post, I
see that I let my tone get out of hand, and for that I ask your forgiveness.
Accepted. And I do have nearly six different competitive market
requirements to either fill concurrently, or with overlapping
solutions. It is six projects at a time at this point, and will later
settle into several clearly defined roles/solutions.

In any event, truely dynamic RC remains a tough nut to crack because of
the PAR and configuration time issues.
It's there in education project form .... getting the IP released, or
redoing it, is a necessary part of optimizing the human element for
programming and testing. Production is another set of problems and
solutions.

By adding the desire to use
defect ridden parts, you are only making an already tough job much
harder.
Actually, I do not believe so. I'm 75% systems software engineer and
25% hardware designer, and very good at problem definition and
architecture issues. I've spent 35 years knocking off man-year-plus
software projects by myself in 3-4 months, and 5-8 man-year projects
with a small team of 5-7 in similar time frames with a VERY strong KISS
discipline.

I see defect parts as a gold mine that brings volumes up and prices
down, to make RC systems very competitive for general work, as well as
highly optimized work where they will shine big time.

I'm used to designing for defect management ... in disks, in memories,
and do not see this as ANY concern.

I respectfully suggest you try first to get the system together
using perfect FPGAs,
I've built several, and have several more ready to fab.

as I believe you will find you already have an
enormous task in front of you between the HLL to gates,
FpgaC I've been using for just over 2-1/2 years, even with its current
faults, which impact density by between 2-20%. Enough to know where it
needs to go, and to have that road map in place. There is a slowly
growing user base and developer group for that project. The project
will mature during 2006, and in some ways I've yet to talk about.

the need for fast PAR,
This is a deal breaker, and why I've put my head up after a couple
years and started pushing when JHDLBits with ADB was not released.
There is similar code in several other sources that will take more
work. I've a good handle on that.

partitioning the problem over multiple FPGAs and between FPGAs
and software, making a usable user interface and libraries etc, without
exponentially compounding the problem by throwing defect tolerance into
the mix. Baby steps are necessary to get through something as complex
as this.
I've done systems level design for 35 years ... operating systems,
drivers, diagnostics, hardware design, and large applications. I do
everything with baby steps and KISS, but by tackling the tough problems
as early in a design as possible for risk management.

Again ... defect management may be scary to you because of how it
impacts YOUR projects; in this project it is NOT a problem. Reserving
defective resources is very similar to having the same resources
already allocated. OK?
 
Jim Granville wrote:
FPGAs are great at distributed fabric, but not that good at memory
bandwidth, especially at bandwidth/$.
A traditional multiprocessor shares one or more moderately wide DRAM
systems, which are inherently sequential from a performance
perspective, even when shared/interleaved. Caches for some applications
create N memory systems, but can also become an even worse bottleneck.

The basic building block with FPGAs is lots of 16x1 memories, each with
a FF .... with FULL PARALLELISM. The trick is to avoid serialization
with FSMs, and with bulk memories (such as BRAM and external memories)
which are serial.
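To put rough numbers on that claim, a hedged C sketch comparing aggregate LUT-RAM bandwidth against one shared DRAM channel; the device counts and clock rates are representative guesses, not datasheet values:

#include <stdio.h>

/* Rough aggregate-bandwidth comparison: thousands of 16x1 LUT RAMs read in
 * parallel every cycle versus one shared DRAM channel serving a CPU.
 * The counts and clocks below are representative guesses only. */
int main(void)
{
    double lutram_count = 50000;      /* assumed usable 16x1 memories in a large part */
    double fpga_clk_hz  = 100e6;      /* assumed modest fabric clock */
    double fpga_bits_s  = lutram_count * 1.0 * fpga_clk_hz;   /* 1 bit each per cycle */

    double dram_width_bits = 64;      /* one shared memory channel */
    double dram_clk_hz     = 400e6;   /* effective data rate guess */
    double cpu_bits_s      = dram_width_bits * dram_clk_hz;

    printf("FPGA distributed fabric: %.1f Tbit/s aggregate\n", fpga_bits_s / 1e12);
    printf("single DRAM channel:     %.1f Gbit/s, and it is serial\n", cpu_bits_s / 1e9);
    return 0;
}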

DSP and numerical applications are very similar data flow problems.
Ditto for certain classes of streaming problems, including wire speed
network servers.
 
Phil Hays wrote:
Why test at die level at all? Economics. Packaging costs money.
Why test at package level at all? Full testing at wafer sort isn't
realistic, and die damage during packaging happens.
And for some, such a damned-if-you-do, damned-if-you-don't is a
perfectly good excuse to do nothing. Life isn't perfect. Finding
solutions I find more valuable than finding restrictions and excuses.

Something quite like this was tried. Some very good reasons not to do
it were found, the hard way. "Human beings, who are almost unique in
having the ability to learn from the experience of others, are also
remarkable for their apparent disinclination to do so." (Douglas
Adams)
One of the most remarkable forms of success is the difficult
challenges offered by failures. The cost of chipping away at this
problem could be relatively small, one or two engineers for a few years
adding a very small amount of complexity to production die. When
success materializes, the savings are substantial.

Power supply measurement requires an ammeter per power supply per die,
or some way to switch an ammeter between measurement points, like
relays. I'd love to hear your plan.
It always comes down to V=IR, and there are plenty of designs/products
that do current sensing well, even if an external reference standard is
required. Maybe one of the ATE functions is to calibrate on-die
standards, and pass that to the rack manager.
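A trivial C sketch of the V=IR sense arithmetic being referred to; the shunt value and reading are made-up numbers:

#include <stdio.h>

/* V = I * R: with a known (or ATE-calibrated) shunt resistance, the per-die
 * supply current falls out of a simple differential voltage reading. */
static double die_current_amps(double v_shunt_volts, double r_shunt_ohms)
{
    return v_shunt_volts / r_shunt_ohms;
}

int main(void)
{
    /* made-up numbers: 2.5 mV across a 1 milliohm shunt -> 2.5 A */
    printf("%.2f A\n", die_current_amps(2.5e-3, 1.0e-3));
    return 0;
}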

Some things can't be implemented on wafers. Disk drives, relays,
precision resistors, ...
None of which are needed on die for self testing.
 
John Bass fpga_toys@yahoo.com wrote:

Phil Hays wrote:
Why test at die level at all? Economics. Packaging costs money.
Why test at package level at all? Full testing at wafer sort isn't
realistic, and die damage during packaging happens.

And for some, such a damned-if-you-do, damned-if-you-don't is a
perfectly good excuse to do nothing. Life isn't perfect. Finding
solutions I find more valuable than finding restrictions and excuses.
If you don't understand the problem, you are not very likely to come
up with a solution.


Something quite like this was tried. Some very good reasons not to do
it were found, the hard way. "Human beings, who are almost unique in
having the ability to learn from the experience of others, are also
remarkable for their apparent disinclination to do so." (Douglas
Adams)

One of the most remarkable forms of success is the difficult
challenges offered by failures.
I'm sure Douglas Adams would agree. But you wouldn't like it.


Power supply measurement requires an ammeter per power supply per die,
or some way to switch an ammeter between measurement points, like
relays. I'd love to hear your plan.

It always comes down to V=IR
Ever figure out what current a wafer full of die would draw? Now for
the fun part. How to get all that current to all the die without too
much voltage drop? Oh, and what if one die is in latchup?


and there are plenty of designs/products
that do current sensing well, even if an external reference standard is
required. Maybe one of the ATE functions is to calibrate on-die
standards, and pass that to the rack manager.
I thought you were not going to use an ATE.


Some things can't be implemented on wafers. Disk drives, relays,
precision resistors, ...

None of which are needed on die for self testing.
As long as test coverage is way less than 50%, sure.


--
Phil Hays
 
Take two ... bad google day ...

Phil Hays wrote:
If you don't understand the problem, you are not very likely to come
up with a solution.
Quite true.

Something quite like this was tried. Some very good reasons not to do
it were found, the hard way. "Human beings, who are almost unique in
having the ability to learn from the experience of others, are also
remarkable for their apparent disinclination to do so." (Douglas
Adams)

One of the most remarkable forms of success is the difficult
challenges offered by failures.

I'm sure Douglas Adams would agree. But you wouldn't like it.
You probably will not like the contradiction that it poses either:

a) Experienced team A works diligently, ending in a heroic failure

b) Team B offers regular help, which is turned down

c) after the failure is complete, Team B completes the project
quickly.

Should Team B have accepted the failure as hard fact that the problem
had no viable solution, and also failed by failing to try? (i.e.,
learning from Team A's failure)

In 30 years of being self-employed I've made about 20% of my income
from taking over failed projects with low-bid, no-risk, flat-fee
proposals to management ... no delivery, no payment. All I have at risk
is my time and my reputation to always succeed on those projects.
Several of those projects were taken over from experienced teams that I
had offered friendly help on a regular basis, and was turned down.
Others I took after one or more other companies failed to deliver what
the client needed, often with sharp advice that I would be doomed to
repeat the cycle.

Ever figure out what current a wafer full of die would draw? Now for
the fun part. How to get all that current to all the die without too
much voltage drop? Oh, and what if one die is in latchup?
Yep ... and did you notice the part of the proposal about using
on-wafer power control for each die?

I thought you were not going to use an ATE.
Did you notice the part of the proposal about using ATE for screening
dangerous defects, like shorted power nets?

Some things can't be implemented on wafers. Disk drives, relays,
precision resisters, ...

None of which are needed on die for self testing.

As long as test coverage is way less than 50%, sure.
You have already given up if you think that. The explicit idea behind
defect management is functional isolation by designing for 100% test
coverage at some level of detail. Either a route, FF, LUT, buffer, or
other resource fails testing, or it is presumed operational, and is to
be screened if necessary by using redundant logic in the system level
design initially.
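A minimal C sketch of what "fails testing or is presumed operational" could look like as a data structure for a fast placer/router to consult; the resource classes, sizes, and function names are placeholders, not from any real device database or tool:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Hypothetical per-device defect map: one bit per testable resource
 * (LUT, FF, route segment, buffer, ...). A set bit marks the resource
 * defective; the runtime placer/router then treats it exactly like a
 * resource that is already allocated. */
enum resource_class { RES_LUT, RES_FF, RES_ROUTE, RES_BUFFER, RES_CLASSES };

#define MAX_RESOURCES 65536   /* placeholder count per class */

struct defect_map {
    uint8_t bad[RES_CLASSES][MAX_RESOURCES / 8];
};

static void defect_map_init(struct defect_map *m)
{
    memset(m, 0, sizeof(*m));                          /* all presumed operational */
}

static void mark_defective(struct defect_map *m, enum resource_class c, unsigned idx)
{
    m->bad[c][idx / 8] |= (uint8_t)(1u << (idx % 8));  /* failed testing */
}

static int is_usable(const struct defect_map *m, enum resource_class c, unsigned idx)
{
    return !(m->bad[c][idx / 8] & (1u << (idx % 8)));
}

int main(void)
{
    static struct defect_map map;
    defect_map_init(&map);
    mark_defective(&map, RES_LUT, 1234);               /* example failing LUT */
    printf("LUT 1234 usable: %d, LUT 1235 usable: %d\n",
           is_usable(&map, RES_LUT, 1234), is_usable(&map, RES_LUT, 1235));
    return 0;
}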

I suspect that this will be an iterative process of incremental
refinement over a long period of time, maybe at first only saving
40-60% of the reject yield, and possibly progressing to nearly all. I
suspect one of the most important parts of the process will be design
refinements to prevent/isolate the failure impacts in future designs,
increasing both the primary and secondary yields in the long term.

One interesting form of "success" includes not reaching the entire
objective, but leaving a carefully documented road map of the
challenges, assumptions, and proposed solutions along the way so that
those that follow have a better defined path to chip away at.

Now, I don't know how much of Xilinx's yield is scrap today, or would
be scrap at the end of 6 months, a year or two years. I do suspect the
number will steadily decrease using design for defect management
strategies.

I do know the "cost" to Xilinx to sell scrap die and packaged product
is pretty low, if it comes with a long term partnership to provide
engineering input to increase yields for both zero defect, and managed
defect segments. The long term promise of such a program is for each
to act in good faith to increase revenues for both partners as the
process matures.

I believe that I can create products which are defect aware using the
largest Xilinx parts, that presumably also have the largest percentage
of rejects. I'm willing to invest the engineering into developing a
recovery process, if Xilinx is willing to provide scrap material, and
include in that partnership an agreement to share data and design
suggestions to improve yields. As the recovery process becomes
profitable, there are certainly incentives on both parties' part to
share that windfall. That's a pretty low risk deal for Xilinx if they
are crushing scrap in die and packaged form today.
 
fpga_toys@yahoo.com wrote:
fpga_toys@yahoo.com wrote:

I'm willing to invest the engineering into developing a
recovery process, if Xilinx is willing to provide scrap material, and
include in that partnership an agreement to share data and design
suggestions to improve yields. As the recovery process becomes
profitable, there are certainly incentives on both parties' part to share
that windfall. That's a pretty low risk deal for Xilinx if they are
crushing scrap in die and packaged form today.


I'm willing to consider the same for other FPGA vendors as well.
Sounds simple - become an EasyPath customer!

You can make your own bit streams ( IIRC, two are allowed ? ),
and thus get die that are in a 'possibly faulty, but partially proven'
bin, and expand from there.....

-jg
 
Austin Lesea wrote:
??

I wonder if there is any reason why it would be useful to compile the
Verilog for an FPGA?

Austin

Pablo Bleyer Kocik wrote:

For those who are interested, SUN released Open SPARC today:

http://opensparc-t1.sunsource.net/download_hw.html

Verilog RTL, verification and simulation tools included.

Cheers.

--
PabloBleyerKocik /"Person who say it cannot be done
pablo / should not interrupt person doing it."
@bleyer.org / -- Chinese proverb
I can imagine no practical use. But it sure is fun to do :).

-Isaac
 
Errr... To start developing and testing a SoC based on OpenSPARC
interfaced to a custom digital block in an FPGA? CPU cores + FPGA
blocks seem to be resurging now. Also, a bunch of companies are
working hard on FPAAs and other analog configurable architectures
(this time done right). If we have 8051/PSoC/ARM7/PowerPC embedded
cores now, why can't we dream of having devices based on a state of the
art and truly open platform (GPL) in the next few years? And unlike
the other proprietary solutions, anyone can share ownership and
help in the development of OpenSPARC...

--
PabloBleyerKocik /"But what... is it good for?"
pablo / -- 1968 Engineer at IBM's Advanced Computing
@bleyer.org / Systems Division, commenting on the microchip
 
Is it not possible to use DC and target an FPGA library, say Altera's
Stratix family? I was thinking that's possible.

Or does one need to use DC FPGA?

Thanks,
Shyam
 
"John Larkin" <jjlarkin@highNOTlandTHIStechnologyPART.com> schrieb im
Newsbeitrag news:0gkg22t80t0838e2odqlfet8h3pror7np4@4ax.com...
On Mon, 27 Mar 2006 22:35:31 +0200, "Antti Lukats"
<antti@openchip.org> wrote:

"John Larkin" <jjlarkin@highNOTlandTHIStechnologyPART.com> schrieb im
Newsbeitrag news:6jgg221p6iuffrbbb6dtml39fn3u9sdu4k@4ax.com...
We have a perfect-storm clock problem. A stock 16 MHz crystal
oscillator drives a CPU and two Spartan3 FPGAs. The chips are arranged
linearly in that order (xo, cpu, Fpga1, Fpga2), spaced about 1.5"
apart. The clock trace is 8 mils wide, mostly on layer 6 of the board,
the bottom layer. We did put footprints for a series RC at the end (at
Fpga2) as terminators, just in case.

Now it gets nasty: for other reasons, the ground plane was moved to
layer 5, so we have about 7 mils of dielectric under the clock
microstrip, which calcs to roughly 60 ohms. Add the chips, a couple of
tiny stubs, and a couple of vias, and we're at 50 ohms, or likely
less.

And the crystal oscillator turns out to be both fast and weak. On its
rise, it puts a step into the line of about 1.2 volts in well under 1
ns, and doesn't drive to the Vcc rail until many ns later. At Fpga1,
the clock has a nasty flat spot on its rising edge, just about halfway
up. And it screws up, of course. The last FPGA, at the termination, is
fine, and the CPU is ancient 99-micron technology or something and
couldn't care less.

Adding termination at Fpga2 helps a little, but Fpga1 still glitches
now and then. If it's not truly double-clocking then the noise margin
must be zilch during the plateau, and the termination can't help that.

One fix is to replace the xo with something slower, or kluge a series
inductor, 150 nH works, just at the xo output pin, to slow the rise.
Unappealing, as some boards are in the field, tested fine but we're
concerned they may be marginal.

So we want to deglitch the clock edges *in* the FPGAs, so we can just
send the customers an upgrade rom chip, and not have to kluge any
boards.

Some ideas:

1. Use the DCM to multiply the clock by, say, 8. Run the 16 MHz clock
as data through a dual-rank d-flop resynchronizer, clocked at 128 MHz
maybe, and use the second flop's output as the new clock source. A
Xilinx FAE claims this won't work. As far as we can interpret his
English, the DCM is not a true PLL (ok, then what is it?) and will
propagate the glitches, too. He claims there *is* no solution inside
the chip.

2. Run the clock in as a regular logic pin. That drives a delay chain,
a string of buffers, maybe 4 or 5 ns worth; call the input and output
of the string A and B. Next, set up an RS flipflop; set it if A and B
are both high, and clear it if both are low. Drive the new clock net
from that flop. Maybe include a midpoint tap or two in the logic, just
for fun.

3. Program the clock logic threshold to be lower. It's not clear to us
if this is possible without changing Vccio on the FPGAs. Marginal at
best.


Any other thoughts/ideas? Has anybody else fixed clock glitches inside
an FPGA?

John


You can run a genlocked NCO clocked from an in-fabric on-chip ring
oscillator. Your internal recovered clock will have jitter of +/-1 clock
period of the ring oscillator (which could be as high as about 370 MHz
in S3). You might need some sync logic that will ensure the 16 MHz
clock edges are only used to adjust the NCO.

Nice idea. But I do need the 16 MHz to be long-term correct, although
duty cycle and edges could jitter a bit and not cause problems. So I
could build an internal ring oscillator and use that to resync the
incoming 16 MHz clock (dual-rank d-flops again) on the theory that the
input glitches will never last anything like the 300-ish MHz resync
clock period. And that's even easier.

Thanks for the input,

John

The genlocked NCO would be cycle accurate too; OK, I meant you would
fine-adjust the NCO to track the 16 MHz. But given a higher clock you
can build different deglitch circuits without the need for a full
digital PLL technique.

antti
 
"John Larkin" <jjlarkin@highNOTlandTHIStechnologyPART.com> wrote in message
news:eukg22d156r3lq8ke0l63nt7mdvv6fh818@4ax.com...
Since the issue is 'local', I'd fix it locally, and 2. sounds
preferable. You know the CLK freq, so can choose the delay banding.

That's looking promising; we're testing that one now. Gotta figure how
many cells it takes to delay 5 ns. (We'll just xor the ends and bring
that out to a test point.)

John,
I've done option 2 before on a sine wave oscillator fed to an FPGA. I used
the delay element within some unbonded IOBs to get the delay. (Drive signal
out on O/P, get it back through I/P delay. NB. turn off DRC check in
bitgen.) I figured it would be more likely to carry on working with newer
versions of software. Each new version comes up with more fiendish ways to
remove delay chains!
Good luck, Syms.
 
javaguy11111@gmail.com wrote:
I have been playing around with opensparc some using xst. I did manage
to get a build without errors, but since I did not have a clock defined
everything got optimized away. So no gate counts. One thing I did
learn is to not import the files using Project Navigator. It just locks
it up.

A build using an xst script worked. There was one file that had a
function defined that was causing xst to fail. However when I removed
it I got no more complaints.

It may not be possible to synthesize with 8 cores, but I thought I saw
something in the docs about being able to specify fewer cores.

I will probably play around with it some more as free time allows and
see how far I can get with it.
OK, Thanks very much for the information. I also tried to synthesize
using Altera's Quartus II but gave up when I got many errors. Thought I
would try it out with DC first and see what I get.

-Shyam
 
On Mon, 27 Mar 2006 14:31:13 -0800, "Symon" <symon_brewer@hotmail.com>
wrote:

"John Larkin" <jjlarkin@highNOTlandTHIStechnologyPART.com> wrote in message
news:eukg22d156r3lq8ke0l63nt7mdvv6fh818@4ax.com...

Since the issue is 'local', I'd fix it locally, and 2. sounds
preferable. You know the CLK freq, so can choose the delay banding.

That's looking promising; we're testing that one now. Gotta figure how
many cells it takes to delay 5 ns. (We'll just xor the ends and bring
that out to a test point.)

John,
I've done option 2 before on a sine wave oscillator fed to an FPGA. I used
the delay element within some unbonded IOBs to get the delay. (Drive signal
out on O/P, get it back through I/P delay. NB. turn off DRC check in
bitgen.) I figured it would be more likely to carry on working with newer
versions of software. Each new version comes up with more fiendish ways to
remove delay chains!
Good luck, Syms.
The unbonded pad thing sounds slick. I argued to use a real pin
in-and-out as the delay element, but certain stingy engineers around
here are unwilling to give up one of their two available test points.

John
 
John Larkin wrote:
On Tue, 28 Mar 2006 08:55:50 +1200, Jim Granville
Enable the schmitt option on the pin :)


Don't I wish! There is a programmable delay element in the IO block,
but it's probably a string of inverters, not an honest R-C delay, so
it likely can't be used to lowpass the edge. We're not sure.

I wish they'd tell us a little more about the actual electrical
behavior of the i/o bits. I mean, Altera and Actel and everybody else
has snooped all this out already.


Since the issue is 'local', I'd fix it locally, and 2. sounds
preferable. You know the CLK freq, so can choose the delay banding.


That's looking promising; we're testing that one now. Gotta figure how
many cells it takes to delay 5 ns. (We'll just xor the ends and bring
that out to a test point.)
Yes, your main challenge will then be to persuade the tools to keep
your delay elements...
What is the pin-delay on the part - you could use that feature,
enable it on your pin, drive another nearby pin(s) (non bonded?)
and then use those as the S/R time-shutters.
-jg
 
"John Larkin" <jjlarkin@highNOTlandTHIStechnologyPART.com> wrote in message
The unbonded pad thing sounds slick. I argued to use a real pin
in-and-out as the delay element, but certain stingy engineers around
here are unwilling to give up one of their two available test points.

John

I sent you some stuff from my Hotmail account. If your spam filter blocks
it, let me know.
Best, Syms.
 
I got a little further into poking around with the opensparc code. I
realized that I did not have any defines enabled for any of the cores
to synthesize. I enabled 1 core and the xst sucked up all the memory my
system had and then failed for lack of memory.

So I guess if I am going to play with this anymore I need to up my
system memory from 1G to at least 4G. For now I will just have
to stick to openrisc experimenting.


 
John Larkin wrote:
On 29 Mar 2006 01:00:59 -0800, bill.sloman@ieee.org wrote:


John Larkin wrote:

On 28 Mar 2006 12:45:23 -0800, bill.sloman@ieee.org wrote:



So the Fpga to Fpga routing worked - good.

That's not what we did. We designed a clock deglitcher to go inside
the FPGA.

Enough propagation delays to cover the dwell at the switching
threshold, and a state machine to make sure that the clock only changes
state once in that interval?



We did my original #2 suggestion, a tapped delay line driven from the
pin, driving an r-s flipflop. Set the flop if all the taps are 1s,
clear it if all are 0s. Sort of a poor man's 1-bit FIR lowpass filter.
The delay line is a string of eight buffers, about 10 ns overall.
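A small behavioral C model of that poor man's 1-bit FIR filter (the real design is fabric logic, not C); the tap count follows the eight-buffer description above, and the sample-based timing is only an approximation:

#include <stdio.h>

#define TAPS 8   /* string of eight buffers, ~10 ns overall in the real design */

/* Behavioral model of the deglitcher: shift the raw clock through a delay
 * line; set the output flop only when every tap is 1, clear it only when
 * every tap is 0, otherwise hold. A brief glitch never fills the line, so
 * the recovered clock edge stays clean. */
static int taps[TAPS];
static int clk_out;

static int deglitch(int clk_in)
{
    int all_ones = 1, all_zeros = 1;

    for (int i = TAPS - 1; i > 0; i--)   /* advance the delay line */
        taps[i] = taps[i - 1];
    taps[0] = clk_in;

    for (int i = 0; i < TAPS; i++) {
        all_ones  &= (taps[i] == 1);
        all_zeros &= (taps[i] == 0);
    }
    if (all_ones)  clk_out = 1;          /* set */
    if (all_zeros) clk_out = 0;          /* clear */
    return clk_out;                      /* otherwise hold */
}

int main(void)
{
    /* rising edge with a one-sample glitch halfway up */
    int raw[] = {0,0,0,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1};
    for (unsigned i = 0; i < sizeof(raw)/sizeof(raw[0]); i++)
        printf("%d", deglitch(raw[i]));
    printf("\n");
    return 0;
}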

We'd have done Peter's circuit if we'd learned of it sooner.

It's interesting that my post evoked two classes of response:

1. It can't be done, don't do it, kluge the boards (also the official
Xilinx response!)

2. Yes, and here are my ideas on how you could do it/how I've already
done it/interesting asides.
A couple of questions for those who are in 1):

Would you use the Pin-Delay feature on an FPGA to de-skew clock lines
from other devices?

If you are OK with that, suppose John now uses the same Pin-Delay
feature (chained) on his 'toggle rate governor' - is that then OK?

-jg
 
Yes, the networking, communications, and DSP industries are thoroughly
into the simple ideas of timesharing, latency hiding, etc., and that's
where I have been since leaving Inmos 20 years ago. I also use those
SRL16s to keep 4 sets of instruction fetch state over 8 clocks so that
variable length opcodes can interleave without confusion. Without
those, I'd be looking at 50% more LUTs per PE.

As a Transputer person I want as many threads as possible for a thread
pool, and these can then be effectively allocated on demand to the
concurrent language threads. Of course for idle threads, I would push
all threads onto busy PEs and shut down fully idle PEs, so power
consumption follows work done. In the Niagara case, the goals are
different: continuous threaded server loads. One thing I did realize is
that after doing all the MTA, MMU work, any old instruction set
could be used, even that damned x86, but since this is an FPGA design,
it's still better to tune for that and go KISS at speed.

It's really a question of trading the single threaded Memory Wall
problem for a Thread Wall problem. A problem for single threaded C
guys, not so for the CSP people out there. Amdahl's law has done more to
set back parallel computing than who knows what; if it's serial, then
it's serial, but there is usually room to mix seq & par at many levels.
Even if a task has only 2 or 3 threads, MTA still comes out ahead on
hardware cost.
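For reference, the Amdahl's law being complained about, as a quick C calculation; the parallel fractions and thread counts are example values only:

#include <stdio.h>

/* Amdahl's law: speedup = 1 / ((1 - p) + p / n)
 * where p is the parallelizable fraction and n the number of PEs/threads. */
static double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    /* even a task that looks "mostly serial" from single-threaded C land
     * still gains something from a few hardware threads */
    printf("p=0.50, n=2:   %.2fx\n", amdahl_speedup(0.50, 2));    /* 1.33x */
    printf("p=0.90, n=8:   %.2fx\n", amdahl_speedup(0.90, 8));    /* 4.71x */
    printf("p=0.99, n=128: %.2fx\n", amdahl_speedup(0.99, 128));  /* ~56x  */
    return 0;
}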

The paper I gave on this Transputer design at CPA2005 last September is
finally available at wotug.org for anyone that's interested.


Regards

John Jakson

PS Perhaps one day somebody out there could make a nano CC size FPGA
card, but with some RLDRAM on it, for a TRAM replacement.
 
