pcb&bitstream

glen herrmannsfeldt · Mar 15, 2011

Ed McGettigan <ed.mcgettigan@xilinx.com> wrote:

On Mar 14, 12:49 pm, geobsd <geobsd...@gmail.com> wrote:
On 12 Mrz., 05:27, Ed McGettigan <ed.mcgetti...@xilinx.com> wrote:
"companies have been started and failed over the years
trying to mate FPGAs with CPUs"
http://www.xilinx.com/technology/roadmap/processing-platform.htm

When I wrote my original statement I was referring to the system
architecture that you were trying to achieve, that is a high
performance CPU connected directly with a FPGA for dynamic algorithm
acceleration.

There is a whole conference for people interested in FPGA based
computing, that is FCCM. FPGAs for Custom Computing Machines.

FPGA based co-processors for algorithm acceleration work fairly
well in fixed point, not so well in floating point. There are a
few problems that need large amounts of computation.

Not so long ago, I was considering a project for genomics that
needs about 1e18 six bit add/subtract operations per day.
I figured out that it could be done with about 2000 S3E devices,
though so far no interest in building one.

-- glen

glen herrmannsfeldt · Mar 15, 2011

Kolja Sulimma <ksulimma@googlemail.com> wrote:

On 13 Mrz., 01:46, rickman <gnu...@gmail.com> wrote:
since a bad bitstream has potential of frying an FPGA.

This argument is invalid.

You can fry an FPGA with VHDL and vendor synthesis software.
This was demonstrated at the FPL conference a decade ago.

(snip)

I believe that documenting LUT content locations in the bitstream
would be a good compromise. It is relatively easy to document
and use and not much can go wrong

They used to do that. It was necessary to verify on readback when
RAM was in use, and the bits would change, such that you could mask
them out.

and it has a decent amount of applications where it is useful.
OTOH: The introduction of SRL16 made it easy to support LUT-reloading
explicitly in the HDL.

How much routing does that take?

-- glen

Ed McGettigan · Mar 15, 2011

On Mar 14, 7:02 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:

Kolja Sulimma <ksuli...@googlemail.com> wrote:
On 13 Mrz., 01:46, rickman <gnu...@gmail.com> wrote:
since a bad bitstream has potential of frying an FPGA.
This argument is invalid.
You can fry an FPGA with VHDL and vendor synthesis software.
This was demonstrated at the FPL conference a decade ago.

(snip)

I believe that documenting LUT content locations in the bitstream
would be a good compromise. It is relatively easy to document
and use and not much can go wrong

They used to do that. It was necessary to verify on readback when
RAM was in use, and the bits would change, such that you could mask
them out.

and it has a decent amount of applications where it is useful.
OTOH: The introduction of SRL16 made it easy to support LUT-reloading
explicitly in the HDL.

How much routing does that take?

-- glen

Nothing to speak of, as the only nets are DIN, CLK and CE. The
SRLC16E, SRLC32E and SRLC64E primitives have two outputs, Q and Q15/Q31/
Q63, so that the Q15/Q31/Q63 can be connected to the DIN of the next
SRLCxxE while the Q output is the result of the inputs.

The ChipScope ILA core uses this method to load the trigger controls.

Ed McGettigan
--
Xilinx Inc.
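
The cascade described above can be sketched as a small behavioral model. This is Python rather than HDL, and only an illustration: the class name Srl16 and the helper load_chain are made up for this sketch, and the model assumes the usual addressable shift register behavior (new data enters at address 0, the oldest bit appears on Q15).

```python
class Srl16:
    """Behavioral model of one 16-bit addressable shift register (SRLC16E-like)."""

    def __init__(self):
        # bits[0] holds the most recently shifted-in bit, bits[15] the oldest.
        self.bits = [0] * 16

    def clock(self, din, ce=1):
        """Apply one rising clock edge; return the Q15 value from before the edge."""
        q15 = self.bits[15]
        if ce:
            self.bits = [din & 1] + self.bits[:15]
        return q15

    def q(self, addr):
        """Addressable output: the LUT result for the 4-bit input 'addr'."""
        return self.bits[addr & 0xF]


def load_chain(chain, tables):
    """Shift tables[k][a] into chain[k] at address a; return the clocks used."""
    # The first bit in travels farthest, so send the last SRL's address 15 first.
    stream = [tables[k][a]
              for k in reversed(range(len(chain)))
              for a in reversed(range(16))]
    for bit in stream:
        carry = bit
        for srl in chain:
            # Each stage takes the previous stage's pre-edge Q15 as its DIN.
            carry = srl.clock(carry)
    return len(stream)
```

Only DIN, CE and the clock are driven from outside the chain; every other connection is a Q15-to-DIN hop between neighbours, which is why the routing cost stays small.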

glen herrmannsfeldt · Mar 15, 2011

Ed McGettigan <ed.mcgettigan@xilinx.com> wrote:

(snip, someone wrote)

and it has a decent amount of applications where it is useful.
OTOH: The introduction of SRL16 made it easy to support LUT-reloading
explicitly in the HDL.

(then I wrote)

How much routing does that take?

Nothing to speak of, as the only nets are DIN, CLK and CE. The
SRLC16E, SRLC32E and SRLC64E primitives have two outputs, Q and Q15/Q31/
Q63, so that the Q15/Q31/Q63 can be connected to the DIN of the next
SRLCxxE while the Q output is the result of the inputs.

But they have to route everywhere, and for ROMs I don't need
any such nets at all.

So, say a design has 300 ROMs that are 16x2. (16 words of two bits.)
I need to get the CE to all the ROM/SRL16's, chain the DIN to
the DOUT of the previous one, and get a CLK to them. I suppose
the CLK comes from the global clock tree, so doesn't need any
other routing resources. Also, this doesn't have to be at the
full speed of the rest of the design, so it can use slower (longer)
routes without any penalty.

To be more specific, the idea is to get a search pattern in
for a systolic array search processor. Without this, I would just
find the bits in the bitstream and stuff the appropriate data
into them.

The ChipScope ILA core uses this method to load the trigger controls.

-- glen
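
A back-of-the-envelope sketch of what the 300-ROM chain costs, in Python. The function name chain_load_cost is hypothetical, and the 1 MHz load clock is an assumption chosen only to show that a slow clock is tolerable; the 300 ROMs of 16x2 are the figures from the post.

```python
def chain_load_cost(num_roms, words, width, load_clock_hz):
    """Estimate SRL16 count, cascade links and load time for a ROM chain."""
    # One SRL16 per output bit per 16 words (ceiling division on the depth).
    srls_per_rom = width * -(-words // 16)
    srls = num_roms * srls_per_rom
    bits = 16 * srls  # one clock per bit shifted in
    return {
        "srl16s": srls,
        "cascade_links": srls - 1,  # Q15 -> DIN hops between neighbours
        "load_clocks": bits,
        "load_seconds": bits / load_clock_hz,
    }


# 300 ROMs of 16 words x 2 bits, loaded with an assumed 1 MHz clock.
cost = chain_load_cost(num_roms=300, words=16, width=2, load_clock_hz=1_000_000)
```

Under these assumptions the chain is 600 SRL16s and takes 9600 clocks to fill, so even a slow, loosely routed load path finishes in about 10 ms.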

whygee · Mar 15, 2011

Ed McGettigan wrote:
<snip>

In each of these cases the FPGA fabric and the processor, whether hard
or soft, was treated as a single entity to create a custom set of
peripherals around the processor to meet the needs that an off-the-
shelf processor couldn't.

Thank you Ed for yet another post which could become the basis for
a wikipedia article

though yes, i know, "references needed"...

the new Actel/Microsemi solution with the ARM hard core
has many benefits compared to the usual FPGA vendors,
but it makes most sense in the industrial/automation/robotic
world, where the direct analog I/O cuts component count
for room/power/security sensitive applications (aerospace, for example)
but not for cost sensitive applications... the same old problem !

I love their original Fusion devices though i use the cost-conscious,
less integrated ProASIC3 parts, which give me just what I want, and I can
add a cheap peripheral (external ADC or Flash storage for example)
if needed. I suppose that everybody does the same...

Ed McGettigan
yg

--
http://ygdes.com / http://yasep.org

whygee · Mar 15, 2011

glen herrmannsfeldt wrote:

Not so long ago, I was considering a project for genomics that
needs about 1e18 six bit add/subtract operations per day.
I figured out that it could be done with about 2000 S3E devices,
though so far no interest in building one.

at this level, maybe a finely tuned and optimised MMX/SSE/SSE2/whatever
code would do the trick, using 8-bit chunks in 64, 128 or 256-bits
wide words... and it would be future-proof, as the x86 (or whatever
arch of the day is) could be upgraded to faster/lower power/more core
hardware.

Yeah i know, it's software and it's not sexy...
but if it's sufficiently simple, it can be coded
with GCC intrinsics such as in this file :
http://ygdes.com/sources/lm98/def64.h

just my 0,02 bits of entropy

-- glen
yg

--
http://ygdes.com / http://yasep.org
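
The "8-bit chunks in wide words" idea can be sketched without any intrinsics. The Python below is not the code from def64.h and not MMX/SSE; it is a plain SIMD-within-a-register illustration on a 64-bit word, and the names pack, unpack and packed_add8 are invented for the example.

```python
LANES = 8
MASK64 = (1 << 64) - 1
LOW7 = 0x7F7F7F7F7F7F7F7F   # low seven bits of every byte lane
HIGH = 0x8080808080808080   # top bit of every byte lane


def pack(values):
    """Pack eight small integers into one 64-bit word, lane 0 in the low byte."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack(word):
    """Split a 64-bit word back into its eight byte lanes."""
    return [(word >> (8 * i)) & 0xFF for i in range(LANES)]


def packed_add8(a, b):
    """Add eight byte lanes at once, with no carry leaking between lanes."""
    # Adding only the low 7 bits keeps each lane's carry inside that lane.
    low = (a & LOW7) + (b & LOW7)
    # The top bit of each lane is then fixed up with a carry-free XOR.
    return (low ^ ((a ^ b) & HIGH)) & MASK64
```

Six-bit operands fit in these lanes with two bits to spare. A real SSE2 register does the same thing on sixteen byte lanes with a single packed-add instruction.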

geobsd · Mar 15, 2011

the thread has turned interesting
thanks all !

whygee · Mar 15, 2011

Thomas Womack wrote:

No, but getting hold of a hundred PCs (which will do four SSE
operations per clock cycle, so about sixty six-bit add/subtracts, and
run at 2.5GHz with four cores) is not a difficult task. If you can
partition the job over 2000 FPGAs, you can probably partition it over
100 PCs.

And the PC farm can be easily repurposed, upgraded, etc.
(even though i don't like them, but i see the business case)

Tom
yg

--
http://ygdes.com / http://yasep.org

glen herrmannsfeldt · Mar 15, 2011

whygee <yg@yg.yg> wrote:

(after I wrote)

Not so long ago, I was considering a project for genomics that
needs about 1e18 six bit add/subtract operations per day.
I figured out that it could be done with about 2000 S3E devices,
though so far no interest in building one.

at this level, maybe a finely tuned and optimised MMX/SSE/SSE2/whatever
code would do the trick, using 8-bit chunks in 64, 128 or 256-bits
wide words... and it would be future-proof, as the x86 (or whatever
arch of the day is) could be upgraded to faster/lower power/more core
hardware.

At (about) 1e5 seconds/day, I need 1e13/second. At 1GHz,
that is 1e4 per clock cycle. No, there are no 80000 bit
registers in MMX or SSE.

-- glen
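
The register-width arithmetic above can be checked with a few lines of Python. The 8-bit lane width is an assumption taken from the "8-bit chunks" suggestion being answered, and the function name ops_per_cycle is made up for this sketch.

```python
def ops_per_cycle(ops_per_day, seconds_per_day, clock_hz):
    """Operations that must complete every clock cycle to meet the daily total."""
    return ops_per_day / seconds_per_day / clock_hz


LANE_BITS = 8  # assumption: each six-bit operand sits in an 8-bit lane

# Rounded figures from the post: about 1e5 seconds per day, 1 GHz clock.
rounded = ops_per_cycle(1e18, 1e5, 1e9)
rounded_width_bits = rounded * LANE_BITS

# Same estimate with the exact 86400 seconds in a day.
exact = ops_per_cycle(1e18, 86_400, 1e9)
exact_width_bits = exact * LANE_BITS
```

With the rounded day this gives 1e4 operations per cycle and an 80000-bit register; with 86400 seconds it comes out slightly higher, at roughly 11600 operations per cycle.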

Thomas Womack · Mar 15, 2011

In article <ilnebt$tm9$1@news.eternal-september.org>,
glen herrmannsfeldt <gah@ugcs.caltech.edu> wrote:

whygee <yg@yg.yg> wrote:

(after I wrote)
Not so long ago, I was considering a project for genomics that
needs about 1e18 six bit add/subtract operations per day.
I figured out that it could be done with about 2000 S3E devices,
though so far no interest in building one.

at this level, maybe a finely tuned and optimised MMX/SSE/SSE2/whatever
code would do the trick, using 8-bit chunks in 64, 128 or 256-bits
wide words... and it would be future-proof, as the x86 (or whatever
arch of the day is) could be upgraded to faster/lower power/more core
hardware.

At (about) 1e5 seconds/day, I need 1e13/second. At 1GHz,
that is 1e4 per clock cycle. No, there are no 80000 bit
registers in MMX or SSE.

No, but getting hold of a hundred PCs (which will do four SSE
operations per clock cycle, so about sixty six-bit add/subtracts, and
run at 2.5GHz with four cores) is not a difficult task. If you can
partition the job over 2000 FPGAs, you can probably partition it over
100 PCs.

Tom

rickman · Mar 16, 2011

On Mar 14, 8:46 pm, Kolja Sulimma <ksuli...@googlemail.com> wrote:

On 13 Mrz., 01:46, rickman <gnu...@gmail.com> wrote:

since a bad bitstream has potential of frying an FPGA.

This argument is invalid.

You can fry an FPGA with VHDL and vendor synthesis software.
This was demonstrated at the FPL conference a decade ago.

That is a silly statement. It doesn't matter if there are other ways
to fry a part. The point is that the vendors exert control over the
design software so that they have control over this sort of problem.
It doesn't matter if they prevent you 100% from doing damage to the
chips. They take responsibility if you are using their tools.

I guess the truth is closer to this argument: documenting the
bitstream format is a lot of work and is likely to create very little
additional revenue, from customers who are rather support-intensive,
so it simply isn't worthwhile for the vendors.

I believe that documenting LUT content locations in the bitstream
would be a good compromise. It is relatively easy to document and use
and not much can go wrong
and it has a decent amount of applications where it is useful.
OTOH: The introduction of SRL16 made it easy to support LUT-reloading
explicitly in the HDL.

LUT content is also very easy to reverse engineer. Everything about
the CLB is easy to reverse engineer since the structure is regular and
repeated. The only hard part is all the little bits and pieces around
the edges.

BTW, SRL16 is a Xilinx specific feature. I have been assuming we are
talking about the process in general.

Rick

rickman · Mar 16, 2011

On Mar 14, 9:28 pm, Ed McGettigan <ed.mcgetti...@xilinx.com> wrote:

On Mar 14, 5:46 pm, Kolja Sulimma <ksuli...@googlemail.com> wrote:

On 13 Mrz., 01:46, rickman <gnu...@gmail.com> wrote:

since a bad bitstream has potential of frying an FPGA.

This argument is invalid.

You can fry an FPGA with VHDL and vendor synthesis software.
This was demonstrated at the FPL conference a decade ago.

I guess the truth is closer to this argument: documenting the
bitstream format is a lot of work and is likely to create very little
additional revenue, from customers who are rather support-intensive,
so it simply isn't worthwhile for the vendors.

I believe that documenting LUT content locations in the bitstream
would be a good compromise. It is relatively easy to document and use
and not much can go wrong
and it has a decent amount of applications where it is useful.
OTOH: The introduction of SRL16 made it easy to support LUT-reloading
explicitly in the HDL.

Kolja
You can fry an FPGA with VHDL and vendor synthesis software.
This was demonstrated at the FPL conference a decade ago.

I am quite surprised about this. Can you provide any additional
material on how this was achieved?

There aren't any scenarios, other than internal tri-state contention,
that I can come up with to make this happen with a proven tool chain.

Ed McGettigan
--
Xilinx Inc.

Since when are tool chains "proven"??? I thought, like nearly all
software, they were tested to verify correctness which means they
aren't correct at all.

Rick

rickman · Mar 16, 2011

On Mar 14, 11:59 am, geobsd <geobsd...@gmail.com> wrote:

i wanted to use chunks of bit-stream assembled in my model conformance
to process in place of the cpu

Uhhh... you can do that. It's called partial reconfiguration. You
use Xilinx tools to design a framework which defines the I/O of your
design along with any parts that won't be changing. Then you design
the pieces that plug into the framework. At any time you can load in
any of the pluggable pieces.

I was going to use this once, a long time ago, with the Spartan
family. But I don't think this ever materialized for the Spartans.
But you can do this with the Virtex parts. The partial bitstreams are
stored in a file or ROM and a controlling CPU sends them to the FPGA.
I think you can even build the CPU into the static part of the FPGA
design and it can reload the partial bitstreams itself!

Rick

rickman · Mar 16, 2011

On Mar 14, 1:03 pm, NeedCleverHandle <d_s_kl...@yahoo.com> wrote:

On Mar 14, 7:52 am, "Nial Stewart"
<nial*REMOVE_TH...@nialstewartdevelopments.co.uk> wrote:
If you think that you will get farther in your goal by using FPGAs
from Altera, Lattice, Microsemi (aka Actel), or QuickLogic then you
should absolutely switch over.

Ed, I read this as "Please use someone else's devices so I can stop
answering these ****ing stupid questions".

Nial.

My experience with the brand-X sales force leads me to believe that
they have significant training in politeness and political
correctness. Many times I have watched in amazement as impossible
demands were deflected without insult.

I ain't made of that stuff.

If that is your impression of the X-team then you haven't been posting
here but for a short time. There used to be some very interesting
"conversations" here which were not very "polite" or PC. But someone
at X high up the food chain put an end to that by decree.

Rick

rickman · Mar 16, 2011

On Mar 15, 6:27 am, Thomas Womack <twom...@chiark.greenend.org.uk>
wrote:

In article <ilnebt$tm...@news.eternal-september.org>,
glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:

whygee <y...@yg.yg> wrote:

(after I wrote)
Not so long ago, I was considering a project for genomics that
needs about 1e18 six bit add/subtract operations per day.
I figured out that it could be done with about 2000 S3E devices,
though so far no interest in building one.

at this level, maybe a finely tuned and optimised MMX/SSE/SSE2/whatever
code would do the trick, using 8-bit chunks in 64, 128 or 256-bits
wide words... and it would be future-proof, as the x86 (or whatever
arch of the day is) could be upgraded to faster/lower power/more core
hardware.

At (about) 1e5 seconds/day, I need 1e13/second. At 1GHz,
that is 1e4 per clock cycle. No, there are no 80000 bit
registers in MMX or SSE.

No, but getting hold of a hundred PCs (which will do four SSE
operations per clock cycle, so about sixty six-bit add/subtracts, and
run at 2.5GHz with four cores) is not a difficult task. If you can
partition the job over 2000 FPGAs, you can probably partition it over
100 PCs.

Tom

I don't follow. Why would it take 2000 FPGAs to do what you can do
with 100 PCs? I would think the ratio would be far in the other
direction. The PC may have a 10x performance advantage in terms of
clock speed (or less), but the FPGA has the advantage of 100x to 1000x
hardware resources!

We are only talking about 6 bit adders, right? No mention of memory
or other resources required. But the add/sub units are trivial.

Rick

Thomas Womack · Mar 16, 2011

In article <0a389d39-3e86-46a1-8f20-b291384bac49@34g2000pru.googlegroups.com>,
rickman <gnuarm@gmail.com> wrote:

On Mar 15, 6:27 am, Thomas Womack <twom...@chiark.greenend.org.uk>
wrote:
In article <ilnebt$tm...@news.eternal-september.org>,
glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:

whygee <y...@yg.yg> wrote:

(after I wrote)
Not so long ago, I was considering a project for genomics that
needs about 1e18 six bit add/subtract operations per day.
I figured out that it could be done with about 2000 S3E devices,
though so far no interest in building one.

at this level, maybe a finely tuned and optimised MMX/SSE/SSE2/whatever
code would do the trick, using 8-bit chunks in 64, 128 or 256-bits
wide words... and it would be future-proof, as the x86 (or whatever
arch of the day is) could be upgraded to faster/lower power/more core
hardware.

At (about) 1e5 seconds/day, I need 1e13/second. At 1GHz,
that is 1e4 per clock cycle. No, there are no 80000 bit
registers in MMX or SSE.

No, but getting hold of a hundred PCs (which will do four SSE
operations per clock cycle, so about sixty six-bit add/subtracts, and
run at 2.5GHz with four cores) is not a difficult task. If you can
partition the job over 2000 FPGAs, you can probably partition it over
100 PCs.

Tom

I don't follow. Why would it take 2000 FPGAs to do what you can do
with 100 PCs?

10^18 per day = 10^13 per second = 10^9.7 per FPGA-second, according
to the figures he's using. Which might be a 100MHz FPGA clock and 80
units on the FPGA, or 25MHz and 300 units.

The PCs are 2.5GHz quad-cores, so there's the factor 100; SSE gets you
sixteen units rather than eighty, but the much faster clocks make up
for it.

(this is the problem I run into whenever considering how to do
number-theory really fast on FPGAs: a Spartan 3 has a hundred 17x17
multipliers running at 200MHz, a cheap AMD CPU has four 64x64
multipliers running at 2500MHz and an expensive one has twelve)

Tom
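
The comparison above is easy to re-run in Python. All inputs are the figures quoted in this post (2000 FPGAs, 100 quad-core 2.5GHz PCs, sixteen SSE lanes per core); the variable names are just for illustration, and no overhead or memory traffic is modelled.

```python
import math

# 1e18 operations per day over roughly 1e5 seconds per day.
TOTAL_OPS_PER_SECOND = 1e18 / 1e5

# Share of the load carried by each of the 2000 FPGAs.
per_fpga = TOTAL_OPS_PER_SECOND / 2000
per_fpga_log10 = math.log10(per_fpga)  # the "10^9.7" in the post

# Two FPGA configurations that would cover that share.
fpga_100mhz_80_units = 100e6 * 80
fpga_25mhz_300_units = 25e6 * 300

# One PC: 2.5 GHz, four cores, sixteen add/subtract lanes per core.
per_pc = 2.5e9 * 4 * 16
farm_of_100 = 100 * per_pc
```

On these numbers one PC delivers about 32 times the per-FPGA share, which is where the 2000-versus-100 ratio comes from.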

NeedCleverHandle · Mar 16, 2011

On Mar 16, 8:49 am, rickman <gnu...@gmail.com> wrote:

On Mar 14, 1:03 pm, NeedCleverHandle <d_s_kl...@yahoo.com> wrote:

On Mar 14, 7:52 am, "Nial Stewart"
<nial*REMOVE_TH...@nialstewartdevelopments.co.uk> wrote:
If you think that you will get farther in your goal by using FPGAs
from Altera, Lattice, Microsemi (aka Actel), or QuickLogic then you
should absolutely switch over.

Ed, I read this as "Please use someone else's devices so I can stop
answering these ****ing stupid questions".

Nial.

My experience with the brand-X sales force leads me to believe that
they have significant training in politeness and political
correctness. Many times I have watched in amazement as impossible
demands were deflected without insult.

I ain't made of that stuff.

If that is your impression of the X-team then you haven't been posting
here but for a short time. There used to be some very interesting
"conversations" here which were not very "polite" or PC. But someone
at X high up the food chain put an end to that by decree.

Rick

I was posting when it was called usenix.

The experiences I was talking about were in the Real World, not The
Internet.

Cheers,
RK

Ed McGettigan · Mar 16, 2011

On Mar 16, 8:49 am, rickman <gnu...@gmail.com> wrote:

On Mar 14, 1:03 pm, NeedCleverHandle <d_s_kl...@yahoo.com> wrote:

On Mar 14, 7:52 am, "Nial Stewart"
<nial*REMOVE_TH...@nialstewartdevelopments.co.uk> wrote:
If you think that you will get farther in your goal by using FPGAs
from Altera, Lattice, Microsemi (aka Actel), or QuickLogic then you
should absolutely switch over.

Ed, I read this as "Please use someone else's devices so I can stop
answering these ****ing stupid questions".

Nial.

My experience with the brand-X sales force leads me to believe that
they have significant training in politeness and political
correctness. Many times I have watched in amazement as impossible
demands were deflected without insult.

I ain't made of that stuff.

If that is your impression of the X-team then you haven't been posting
here but for a short time. There used to be some very interesting
"conversations" here which were not very "polite" or PC. But someone
at X high up the food chain put an end to that by decree.

Rick

Rick, I'm not sure where you got that idea. There have been a number
of influences that have affected the Xilinx participation here, but
there has never been a decree.

Over the last 15 or so years there's only been a handful of Xilinx
people posting to comp.arch.fpga, with myself, Peter
Alfke and Austin Lesea as the most prolific posters. There were a
number of others that posted occasionally, but when AT&T dropped all
USENET groups a few years ago the internal comp.arch.fpga newsgroup
feed was no longer available and the casual Xilinx poster became a
rarity.

The recent rise in SPAM postings that are not filtered out by Google
Groups almost made me retreat completely to the Xilinx Forums, but I'm
still here for now.

Ed McGettigan
--
Xilinx Inc.

glen herrmannsfeldt · Mar 16, 2011

rickman <gnuarm@gmail.com> wrote:
(snip)

I was going to use this once, a long time ago, with the Spartan
family. But I don't think this ever materialized for the Spartans.
But you can do this with the Virtex parts. The partial bitstreams are
stored in a file or ROM and a controlling CPU sends them to the FPGA.
I think you can even build the CPU into the static part of the FPGA
design and it can reload the partial bitstreams itself!

Yes. I heard a talk some years ago from someone who had Linux
running on the PPC in the FPGA, then using that to control the
reconfiguration.

-- glen

glen herrmannsfeldt · Mar 16, 2011

rickman <gnuarm@gmail.com> wrote:
(snip)

I don't follow. Why would it take 2000 FPGAs to do what you can do
with 100 PCs? I would think the ratio would be far in the other
direction. The PC may have a 10x performance advantage in terms of
clock speed (or less), but the FPGA has the advantage of 100x to 1000x
hardware resources!

We are only talking about 6 bit adders, right? No mention of memory
or other resources required. But the add/sub units are trivial.

The memory requirements are very small. Maybe 300 16x2 ROMs
(or, as previously suggested, SRL16s). Maybe also some buffering
of results on the way out, probably using the BRAMs.

So, take 1e18 add/sub per day, 86400 seconds per day, say 3GHz and,
with SSEn, 32 add/sub per cycle. (There is probably some overhead
in there.) Then about 120 PCs would do it. I am pretty sure it doesn't
work such that every cycle uses the appropriate SSEn registers, so it is
likely somewhat worse.

Even so, the choice is between a box full of FPGAs connected to
one computer (which might fit in a small room), and, say 300 PCs,
appropriate power and network wiring. I suppose they could net boot
so that you didn't need to install an OS, but you would still have to
write custom software for them. They take up a lot of space, use a lot
of power, and otherwise aren't that much more useful.

Data is coming in at the appropriate rate to keep the system
busy for years.

-- glen
