Spartan 3 - available in small quantities?

Thomas Heller · Feb 20, 2004

I have read several posts here about the difficulty of getting Spartan 3
parts in small quantities.

Is it realistic to start a project using Spartan 3 (actually the XC3S50
in the VQ100 package is what I probably want to use) when I only need
small quantities - starting with some 10 samples and, in full
production, let's say 100 to 500 pieces per year?

Thanks for opinions,

Thomas

Simon Peacock · Feb 21, 2004

I would suggest taking Xilinx's past history into account here: you won't
get any for a year after the announcement!
There aren't any unless you have a spare million, so don't bother asking.

"Thomas Heller" <theller@python.net> wrote in message
news:65e1twtj.fsf@python.net...

I have read several posts here about the difficulties to get Spartan 3
parts in small quantities.

Is it realistic to start a project using spartan 3 (actually the XC3S50
in the VQ100 package is what I probably want to use) when I only need
small quantities - starting with getting some 10 samples, in full
production lets say 100 to 500 pieces per year?

Thanks for opinions,

Thomas

Hal Murray · Feb 21, 2004

Is it realistic to start a project using spartan 3 (actually the XC3S50
in the VQ100 package is what I probably want to use) when I only need
small quantities - starting with getting some 10 samples, in full
production lets say 100 to 500 pieces per year?

Where would you buy them? What do they say? Can you get
samples now?

Are there any features on the Spartan 3 that you absolutely need?
(Can you use some other chip?)

What are the costs of alternatives? What are the costs of
not being able to get the chips when you need them?

How long is it going to take you to do the design? (When
do you absolutely need the samples?) Can you work on the
design with two plans in mind and make the choice a month
or two from now?

My rule of thumb is to not design in a chip unless I have parts
in hand or a distributor has stock that I'm sure I can get.

If an interesting chip has some features that would make a
project a lot better (or even possible), then you have to decide
if you want to stick your neck out. Do you like fighting with
not-quite-debugged tools? Do you have good contacts at the vendor?

--
The suespammers.org mail server is located in California. So are all my
other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's. I hate spam.

B. Joshua Rosen · Feb 21, 2004

On Fri, 20 Feb 2004 19:44:40 +0100, Thomas Heller wrote:

I have read several posts here about the difficulties to get Spartan 3
parts in small quantities.

Is it realistic to start a project using spartan 3 (actually the XC3S50
in the VQ100 package is what I probably want to use) when I only need
small quantities - starting with getting some 10 samples, in full
production lets say 100 to 500 pieces per year?

Thanks for opinions,

Thomas

Spartan 3s are hard to get, but they are available. I'm using XC3S400s in a
new design and we were able to get sample quantities. If you are just
starting the project, put your sample order in now; the lead times are
long. My client waited until a week before the boards showed up, and they
ended up having to buy the parts from a broker on the other side of the
world. I wouldn't worry about production quantities; Xilinx claims their
yields are good. The problem is that demand unexpectedly spiked, so there
is a shortage this quarter.

rickman · Feb 21, 2004

Hal Murray wrote:

Is it realistic to start a project using spartan 3 (actually the XC3S50
in the VQ100 package is what I probably want to use) when I only need
small quantities - starting with getting some 10 samples, in full
production lets say 100 to 500 pieces per year?

Where would you buy them? What do they say? Can you get
samples now?

Are there any features on the Spartan 3 that you absolutely need?
(Can you use some other chip?)

What are the costs of alternatives? What are the costs of
not being able to get the chips when you need them?

How long is it going to take you to do the design? (When
do you absolutely need the samples?) Can you work on the
design with two plans in mind and make the choice a month
or two from now?

My rule of thumb is to not design in a chip unless I have parts
in hand or a distributor has stock that I'm sure I can get.

If an interesting chip has some features that would make a
project a lot better (or even possible), then you have to decide
if you want to stick your neck out. Do you like fighting with
not-quite-debugged tools? Do you have good contacts at the vendor?

I don't know that the Spartan 3 parts are a major step forward in
FPGAs. From what I can see, the main difference is the elimination of
the huge startup currents on power up. The marketing claim is that
these will be much cheaper parts because of the small die. But so far,
I don't think anyone has seen the results of this.

If you design in a Spartan 3 based on quoted pricing today, you are not
likely to see that price drop at any time through the life cycle of the
part.

--

Rick "rickman" Collins

rick.collins@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design URL http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX

john jakson · Feb 22, 2004

rickman wrote:

I don't know that the Spartan 3 parts are a major step forward in
FPGAs. From what I can see, the main difference is the elimination of
the huge startup currents on power up. The marketing claim is that
these will be much cheaper parts because of the small die. But so far,
I don't think anyone has seen the results of this.

--

I don't know about that. Xilinx initially set expectations low except
on price. I heard MicroBlaze only ran at 85MHz on it compared to
120MHz or more on the bigger Virtex.

But on a CPU project I am working on I am seeing synth reports of
311MHz on sp3-5 with the latest speed file vs. 320MHz for v2pro-8, and
the -7s seem to be the same speed as sp3-5 IIRC. The sp2-4s are way down
at 120MHz. Seems as if it's V2 made dirt cheap (if and when we get
them) with a small cut in speed. Also the LUT counts are similar to
sp2 but the block RAMs are 4x bigger.

I can still port back to sp2(e) with almost the same floorplan but with
much smaller RAM instances. Lots of 4Ks could still be more
useful than an equivalent number of 16/18Ks, but the speed cut would hurt.

For an old-time VLSI guy, I couldn't imagine getting such performance
on an ASIC flow without 100x the design resources.

johnjakson_usa_com

B. Joshua Rosen · Feb 22, 2004

On Sun, 22 Feb 2004 07:14:16 -0800, john jakson wrote:

rickman wrote:

I don't know that the Spartan 3 parts are a major step forward in
FPGAs. From what I can see, the main difference is the elimination of
the huge startup currents on power up. The marketing claim is that
these will be much cheaper parts because of the small die. But so far,
I don't think anyone has seen the results of this.

--

I don't know about that, Xilinx initially set expectations low except
on price. I heard Microblaze only ran at 85MHz on it compared to
120MHz or more on bigger Virtex.

But on a cpu project I am working on I am seeing synth reports of
311MHz on sp3-5 with the latest speed file v 320Mhz for v2pro-8 and
the -7s seem to be same speed as sp3-5 IIRC. The sp2-4s are way down
to 120MHz. Seems as if its v2 made dirt cheap (if and when we get
them) with a small cut in speed. Also the LUT counts are similar to
sp2 but the blockrams are 4x bigger.

I can still port back to sp2(e) with almost same floor plan but with
much smaller ram instances although lots of 4ks could still be more
useful than equiv no of 16/18Ks but the speed cut would hurt.

For an oldtime VLSI guy, I couldn't imagine getting such performance
on an ASIC flow without 100x the design resources.

johnjakson_usa_com

John,

I'd check your report files closely if I were you. If you are seeing
311MHz on a Spartan 3, something is very wrong. I suspect that your
synthesizer discarded most of your design. My experience with Spartan
XC3S400-4s is that they are much slower than Virtex2Ps (-5 is the V2P that
I'm comparing it to). I'm able to get the Spartan 3s to meet 140MHz timing
but that is with very few logic levels between pipeline stages. I'm sure
that with lots of floorplanning it would be possible to push it higher
than that but certainly not to 300MHz, especially not on something as
complex as a CPU.

rickman · Feb 22, 2004

john jakson wrote:

rickman wrote:

I don't know that the Spartan 3 parts are a major step forward in
FPGAs. From what I can see, the main difference is the elimination of
the huge startup currents on power up. The marketing claim is that
these will be much cheaper parts because of the small die. But so far,
I don't think anyone has seen the results of this.

--

I don't know about that, Xilinx initially set expectations low except
on price. I heard Microblaze only ran at 85MHz on it compared to
120MHz or more on bigger Virtex.

But on a cpu project I am working on I am seeing synth reports of
311MHz on sp3-5 with the latest speed file v 320Mhz for v2pro-8 and
the -7s seem to be same speed as sp3-5 IIRC. The sp2-4s are way down
to 120MHz. Seems as if its v2 made dirt cheap (if and when we get
them) with a small cut in speed. Also the LUT counts are similar to
sp2 but the blockrams are 4x bigger.

I don't know why you would not expect the XC3S parts to be faster than
the XC2S parts. Certainly going with a 2x reduction in feature size (or
close to it) *should* give you a huge increase in speed. In fact, they
should outrun everything Xilinx makes given the feature size. But they
cut a lot of corners to make the parts cheap so they don't follow the
curve. So far, I have not seen the prices beat the older Spartan parts
either. Sure, they are an improvement, but in this industry,
improvement is normal and part of the game. But the XC3S parts seem to
be just the next new chip, not anything really special.

If the XC3S parts were both faster than the Virtex line and cheaper than
the older Spartan line, *that* would be something to crow about. But
they are *neither* at the moment. They are just the standard improved
line that combines both (more or less).

I can still port back to sp2(e) with almost same floor plan but with
much smaller ram instances although lots of 4ks could still be more
useful than equiv no of 16/18Ks but the speed cut would hurt.

For an oldtime VLSI guy, I couldn't imagine getting such performance
on an ASIC flow without 100x the design resources.

--

Rick "rickman" Collins

rick.collins@XYarius.com
Ignore the reply address. To email me use the above address with the XY
removed.

Arius - A Signal Processing Solutions Company
Specializing in DSP and FPGA design URL http://www.arius.com
4 King Ave 301-682-7772 Voice
Frederick, MD 21701-3110 301-682-7666 FAX

john jakson · Feb 23, 2004

"B. Joshua Rosen" <bjrosen@polybus.com> wrote in message news:<pan.2004.02.22.16.31.34.651568@polybus.com>...

On Sun, 22 Feb 2004 07:14:16 -0800, john jakson wrote:

rickman wrote:

I don't know that the Spartan 3 parts are a major step forward in
FPGAs. From what I can see, the main difference is the elimination of
the huge startup currents on power up. The marketing claim is that
these will be much cheaper parts because of the small die. But so far,
I don't think anyone has seen the results of this.

--

I don't know about that, Xilinx initially set expectations low except
on price. I heard Microblaze only ran at 85MHz on it compared to
120MHz or more on bigger Virtex.

But on a cpu project I am working on I am seeing synth reports of
311MHz on sp3-5 with the latest speed file v 320Mhz for v2pro-8 and
the -7s seem to be same speed as sp3-5 IIRC. The sp2-4s are way down
to 120MHz. Seems as if its v2 made dirt cheap (if and when we get
them) with a small cut in speed. Also the LUT counts are similar to
sp2 but the blockrams are 4x bigger.

I can still port back to sp2(e) with almost same floor plan but with
much smaller ram instances although lots of 4ks could still be more
useful than equiv no of 16/18Ks but the speed cut would hurt.

For an oldtime VLSI guy, I couldn't imagine getting such performance
on an ASIC flow without 100x the design resources.

johnjakson_usa_com

John,

I'd check your report files closely if I were you. If you are seeing
311MHz on a Spartan 3, something is very wrong. I suspect that your
synthesizer discarded most of your design. My experience with Spartan
XC3S400-4s is that they are much slower than Virtex2Ps (-5 is the V2P that
I'm comparing it to). I'm able to get the Spartan 3s to meet 140MHz timing
but that is with very few logic levels between pipeline stages. I'm sure
that with lots of floorplanning it would be possible to push it higher
than that but certainly not to 300MHz, especially not on something as
complex as a CPU.

Hi Rick

I know what you are saying. When I first presented my paper CPU
architecture to XST, the situation looked hopeless. I backed off and
built a number of test projects that each included only 1 object,
pushed to the max, bringing all IOs to the pads. The synth reports are
then crystal clear, even for someone with little prior experience of the
tool. I also look at the layout and placement to see if it looks
kosher. It did. From that I had a feel for what each Xilinx device

john jakson · Feb 23, 2004

Joshua replied:

johnjakson_usa_com

John,

I'd check your report files closely if I were you. If you are seeing
311MHz on a Spartan 3, something is very wrong. I suspect that your
synthesizer discarded most of your design. My experience with Spartan
XC3S400-4s is that they are much slower than Virtex2Ps (-5 is the V2P that
I'm comparing it to). I'm able to get the Spartan 3s to meet 140MHz timing
but that is with very few logic levels between pipeline stages. I'm sure
that with lots of floorplanning it would be possible to push it higher
than that but certainly not to 300MHz, especially not on something as
complex as a CPU.

Hi Joshua, Rick

Hopefully 4th time lucky; my girls are helping me way too much. With
Google I don't know what happened for several hours; I am sure a
couple of half posts are in front. Apologies. Long reply warning.

I know what you are saying. My 1st paper CPU arch, when presented to
XST, gave me little clue where to start. I always used to work on
ASICs in teams where I wrote Verilog & C models and someone else (far
less speed/area motivated) banged on the FPGA tool. With Virtex800 experience
only at <30MHz, I never had that great an expectation to start with. I
always had way too much logic in each pipeline stage, but we only needed
30MHz. There was no time to explore a speediac style and reduce logic, as
it was ASIC prototyping.

Ray Andraka's work on superpipelining everything DSP left me wondering
if a CPU could also go as fast. Usually not, because there are way
too many random blocks of logic covering many adjacent pipelines. This
is why MicroBlaze is stuck in the 120MHz zone; I could probably guess
(reverse engineer) the code used for the datapath if I really studied
the ISA.

But the Alpha chip and of course now the x86s are also deeply
superpipelined, though more complex than can fit in any FPGA (or maybe
not). Now I am free to explore the boundaries and see what can be done
on a clean sheet at max frequency.

I am also following very late after Philip's and Jan's work on FPGA CPUs
from the 4000 days, but even Jan got 30MHz on 4000s a long time ago.
Since I am coming from a CPU & DSP background, I wanted Alpha speed but
on a better architecture for parallel programming, i.e. a modern Transputer.

I built a number of test projects that each included only 1 instance of a real
pipelined block RAM, or adders of varying widths, and so on. I also
play through the device type list and try sp2s through to v2pro with
varying speed grades and even different packages, since the reports
only take 20s for such simple models. The last speed file posted by
Austin made a huge difference, bringing sp3 close enough to v2pro that
the difference is marginal; only the -8 pulled ahead by another 5%. The sp2s
remain at the lower end, at 100-200MHz, which is what I expect for these
simple pipes.

I always study the report and generate the layout. Everything looks
kosher but the layout always looks haphazard. So I learned to use the
floorplanner and write C code to make the .ucf file for FF placement.
On occasion a stupid typo would whip the speed up to 700MHz or
something, and voila, most of the top level would be missing, but then
the report usually says as much in bright red or yellow. I only allow
a few yellow marks for known issues beyond my control, like the unused
parity bits of a block RAM instance. Any more than that requires
immediate fixing.
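As an illustration of the scripted FF placement described above, here is a minimal sketch of a .ucf generator. It is written in Python rather than the C the post mentions, and the instance name `acc_reg`, the `name<bit>` bit naming and the slice coordinates are all hypothetical; real instance names come from the synthesized netlist. It assumes two flip-flops per slice and simply stacks the bits up one slice column.

```python
def slice_loc_constraints(prefix, width, x0, y0, bits_per_slice=2):
    """Return UCF lines that pin each flip-flop of a register to a slice.

    prefix          -- hypothetical instance name of the register
    width           -- number of flip-flops (bits) to place
    x0, y0          -- slice coordinates of bit 0 (hypothetical)
    bits_per_slice  -- flip-flops packed into one slice (assumed 2)
    """
    lines = []
    for bit in range(width):
        y = y0 + bit // bits_per_slice  # move up one slice every 2 bits
        lines.append(f'INST "{prefix}<{bit}>" LOC = SLICE_X{x0}Y{y};')
    return lines


if __name__ == "__main__":
    # Place a 12-bit register in slice column X3, starting at row Y10.
    print("\n".join(slice_loc_constraints("acc_reg", 12, 3, 10)))
```

Generating the file from a script keeps the datapath columns regular, which is hard to get from automatic placement alone. Picking the exact flip-flop inside a slice would need further constraints that this sketch leaves out.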

Now that I have my expectations set right, I know that a block RAM can
cycle at around 320MHz on various sp3 -5 devices. In fact ds99.pdf
IIRC says as much. A plain 32b adder is 250MHz; that needs pipelining
work to get to 300MHz plus. I ended up with a 3-stage 32b add with
slice widths of 12, 10 and 10 bits (LSB to MSB). I really wanted to do a
faster 2-stage carry-select design but XST always seems to hack it into
something less. Trivial things like generating CVNZ flags become trouble
at that speed; I end up pipelining that as well, since between registers
you can only do 3 LUT layers of logic, or a 12b registered add, or a 12b
logic fn(), or a BRAM cycle, and ZERO combinations of these.
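To make the 12/10/10 split concrete, here is a small cycle-level model of a 3-stage pipelined 32-bit add. It is only a behavioural sketch, not the poster's Verilog: the class and function names are invented, and it assumes the 12-bit slice is the least significant one, with each stage adding one slice plus the carry registered by the previous stage.

```python
MASK32 = 0xFFFFFFFF
# (lowest bit, width) of the slice handled by each pipeline stage.
SLICES = ((0, 12), (12, 10), (22, 10))


def add_slice(state, stage):
    """Add one slice of a and b plus the incoming carry; return new state."""
    a, b, partial, carry = state
    lo, width = SLICES[stage]
    mask = (1 << width) - 1
    total = ((a >> lo) & mask) + ((b >> lo) & mask) + carry
    return (a, b, partial | ((total & mask) << lo), total >> width)


class PipelinedAdder32:
    """Accepts one operand pair per clock; a sum emerges on the 3rd clock."""

    def __init__(self):
        self.regs = [None, None, None]  # pipeline registers, None = bubble

    def clock(self, operands=None):
        r0, r1, _ = self.regs
        n0 = None
        if operands is not None:
            a, b = operands
            n0 = add_slice((a & MASK32, b & MASK32, 0, 0), 0)
        n1 = None if r0 is None else add_slice(r0, 1)
        n2 = None if r1 is None else add_slice(r1, 2)
        self.regs = [n0, n1, n2]
        # The carry out of the top slice is dropped (32-bit wraparound).
        return None if n2 is None else n2[2]


if __name__ == "__main__":
    adder = PipelinedAdder32()
    for pair in [(0xFFF, 1), (0xFFFFFFFF, 1), None, None]:
        print(adder.clock(pair))
```

Each stage only ever sees a carry chain of at most 12 bits, which matches the per-stage budget quoted above. The cost is two extra clocks of latency per add.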

This is only possible because the CPU design is 4-way hyperthreaded
with 1 nice hazard path, so that all the datapath pipes are as
decoupled as they would be in any DSP engine. Only the instruction
decode has some local coupling, but again it has no wide adds or big
RAMs, so it's looking doable, and it is also N-way threaded. I have more
work to do, but I never add more logic in series with my critical
blocks. If I get to 4 LUT/mux levels I immediately drop out of warp
speed back to 250MHz or even way less, and that makes the other stuff
that is fully pipelined redundant. Any time my speed drops below
311MHz, I know I just added a 4th LUT level, so I track it down and redo it
till it's 3 or less. This usually requires working on that module in
isolation, keeping its speed as far as possible over my target.
Further, I cannot allow any module to have unregistered IOs, however
painful that is, without tracking that at a global level. The 3 levels
of LUT logic are almost always in one place inside a module, between 2
pipes. The Verilog code is a mix of structural & RTL style: assigns
for wiring and always @ for the FFing.
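The reason interleaved threads decouple the pipes can be shown with a toy schedule. This is a sketch under assumptions, not a model of the actual design: it assumes strict round-robin issue and a single hypothetical `depth` figure for the number of cycles between an instruction issuing and its result being written back.

```python
def issue_schedule(n_threads, n_cycles):
    """Round-robin issue: thread t owns cycles t, t + n, t + 2n, ..."""
    return [cycle % n_threads for cycle in range(n_cycles)]


def has_same_thread_overlap(n_threads, depth, n_cycles=64):
    """True if a thread issues while its previous instruction is in flight."""
    busy_until = {}
    for cycle, thread in enumerate(issue_schedule(n_threads, n_cycles)):
        if busy_until.get(thread, 0) > cycle:
            return True  # would need a stall or forwarding path
        busy_until[thread] = cycle + depth
    return False


def per_thread_mhz(core_mhz, n_threads):
    """Clock rate each thread sees when it owns every n-th cycle."""
    return core_mhz / n_threads


if __name__ == "__main__":
    print(has_same_thread_overlap(4, 4))  # depth fits within the thread count
    print(has_same_thread_overlap(4, 5))  # one stage too deep
    print(per_thread_mhz(311, 4))
```

As long as the pipeline is no deeper than the number of threads, an instruction never meets its own thread's predecessor in the pipe, so no interlock logic is needed within a thread. The price is that each thread runs at the core clock divided by the thread count.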

This is really the same deal as with the fastest VLSI CPUs, which are
limited to 10 levels of low-fanout gate-level logic. Seymour was doing
this in ECL 40 years ago. A LUT counts as about 3 levels of gate logic, so
3 LUT levels are close enough to 10 gates.

I will report on the work as it gets closer to live results. I know I
can download to an sp2e dev board for about 200MHz or way less but by
the time the cpu C & Verilog models can run code and I have the lcc
compiler done, gee I might have a sp3 -5 dev board to play with. The
intended market is licensing to high end users for embedded & par
computing. I am even tempted to max the datapath to 64b as it only
adds 3-4 pipestages and not much to the control.

The LUT count is still below 500 and is mostly going to control; a
64b Alpha path would balance it more towards computing, but that's another
story. My only concern is how much power 1 CPU of <800 LUTs or FFs will
dump. I use 2 BRAMs per CPU instance, so I am just about to lose having
2 in an sp 50. The bigger sp's though are more on the LUT side.

Regards all

johnjakson_usa_com

Jim Granville · Feb 23, 2004

john jakson wrote:
<interesting stuff snipped>

If I get to 4 LUT/mux levels I immediately drop out of warp
speed back to 250MHz or even way less and that makes the other stuff
that is fully pipelined redundant. Any time my speed drops below
311MHz, I know I just added a 4th LUT level, track it down and redo it
till its 3 or less. This usually requires working on that module in
isolation, keeping its speed as much as possible over my target.
Further I can not allow any module to have unregistered IOs however
painful that is with out tracking that at a global level. The 3 levels
of LUT logic is almost always in one place inside a module between 2
pipes. The Verilog code is a mix of structural & RTL style, assigns
for wiring and always @ for the FFing.

This is really the same deal with the fastest VLSI cpus that are
limited to 10 levels of low fanout gate level logic. Seymour was doing
this in ECL 40yrs ago. A LUT counts as 3 levels of gate logic so close
enough 10gates.

I will report on the work as it gets closer to live results.

Sounds to me like something you could negotiate
a job at Xilinx doing

Their marketing dept would just LOVE to boast about 300+ MHz
CPU cores, even if that is 'very peaky'. (after all, so are the
alternatives)

Key question is what code size is this working from ?

-jg

Jon Beniston · Feb 23, 2004

I am even tempted to max the datapath to 64b as it only
adds 3-4 pipestages and not much to the control.

Sure, but the more pipeline stages you add, the longer the latency is
for each instruction. How many cycles of latency will there be for a
single add instruction? Do you intend to make sure that the number of
threads is equal to this latency, so that the latency as perceived by the
thread executing the instruction is 0?

What's your cache / memory architecture? Handling lots of threads
could be tricky.

Cheers,
JonB

john jakson · Feb 23, 2004

This is really the same deal with the fastest VLSI cpus that are
limited to 10 levels of low fanout gate level logic. Seymour was doing
this in ECL 40yrs ago. A LUT counts as 3 levels of gate logic so close
enough 10gates.

I will report on the work as it gets closer to live results.

Sounds to me like something you could negotiate
a job at Xilinx doing

Their marketing dept would just LOVE to boast about 300+ MHz
CPU cores, even if that is 'very peaky'. (after all, so are the
alternatives)

Key question is what code size is this working from ?

-jg

I am sure anyone would love to get a CPU at 300MHz in an FPGA, but the
arch will be on my terms. The code base is remarkably small vs. previous
projects I have worked on; the Verilog is <4K IIRC so far. It will get
bigger for control logic. The 1st pass will defer some opcode complexity
to xops, as the TI9900 once called them, i.e. low-overhead low-address
subroutines. That will reduce performance of OS message-passing and
scheduling-specific code by 4x or so, but it's easier to write asm than
design HW. Later, FPGA space permitting, most of that will get hardened.

Note there is almost no HW needed for hazard detection, no branch
prediction, no pipeline flushing. Just like a DSP really. Other than
that the CPU looks more like 4 78MHz classic load/store RISC CPUs
timesharing the HW (and the cache unfortunately). Actually the
performance on paper should compare well to an x86 at 1.5x the clock, i.e.
a 500MHz x86, but the cache size is a real limit here. I still have to
design the cache & TLB HW. Associative HW really costs.

Note that all conditional branches (ccbra) take 0 cycles, as they group
onto non-branch opcodes. So it may well run smaller loops at an effective
400MHz if every 4th op is a branch. It's also a joy to count cycles based
on the bandwidth the opcode actually uses, so a ccbra really uses 1/8 or
1/4 cycles, but from slack time. And an add a<-b+c would actually use 9/8,
since the opcode fetch is another 1/8 from slack time. But a cmp a,b would
be 5/8, since the unused write back gives back 4/8 cycles. Of course the ops
really run in integer cycles, but there are queues to be filled and that uses
slack cache memory ports. The actual non-Transputer ISA is actually
quite soft; I can mess with opcode encodings at will since, as we all
know, CPUs only do movs these days (yeah right).
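The fractional cycle counts above can be reproduced with a few lines of arithmetic. This is a reading of the figures in the post, not a statement of how the design actually accounts for bandwidth: the 1/8 opcode fetch and 4/8 write-back are quoted, while treating the remaining 4/8 as a combined read-and-execute share is an assumption made so that the add comes out at 9/8.

```python
from fractions import Fraction

EIGHTH = Fraction(1, 8)
OPCODE_FETCH = 1 * EIGHTH  # quoted: "another 1/8 from slack time"
WRITE_BACK = 4 * EIGHTH    # quoted: unused write back "gives back 4/8"
READ_EXECUTE = 4 * EIGHTH  # assumption: the rest of the 9/8 for an add


def bandwidth_cost(writes_result):
    """Cycles of memory-port bandwidth one opcode consumes."""
    cost = OPCODE_FETCH + READ_EXECUTE
    if writes_result:
        cost += WRITE_BACK
    return cost


def effective_mhz(core_mhz, ops_per_group):
    """Effective rate when 1 op in every group is a zero-cycle branch.

    A group of k ops then retires in k - 1 issue slots.
    """
    k = ops_per_group
    return Fraction(core_mhz) * k / (k - 1)


if __name__ == "__main__":
    print("add a<-b+c:", bandwidth_cost(writes_result=True))
    print("cmp a,b   :", bandwidth_cost(writes_result=False))
    print("1 branch in 4 at 311MHz:", float(effective_mhz(311, 4)))
```

With every 4th op a free branch, 4 ops complete in 3 slots, so a core clock of around 300MHz gives the "effective 400MHz" figure.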

The arch should port to any FPGA that supports true dual-port
2WW/2RR/2RW block RAMs; it does not really use any other special features.
SRL16 is nice if available. This also means it can be ASICed, where I
would expect it to run at least 3x faster, as long as the libs include
prebuilt DP RAM, as that is always the 1st limit. Other adder width
limits can be worked around. Some time I will get around to trying
free Quartus but I wish they would drop the IP node nonsense.

I am really pushing to get the Transputer arch back in front since
that allows many cpus to work in harmony with the message passing
scheme. It worked well before but Inmos folded up for bad engineering
& business reasons not because the basic premise was unsound. At one
time before 486 came along it was the dominant 32b arch esp in Europe
and very popular with HW embedded & extreme computing types. Occam was
a killer though, most SW types didn't get it although in hindsight I
see it now as a Forthy/Lispy HDL language.

I address that issue by suggesting it be programmed in V++ a language
which just combines std C with Verilog event driven language. It also
includes the Occam primitives Par, Alt, Seq, and !? operators to round
out but using C syntax. HandelC does the same thing and is a SW & HW
language too. I am about half way done on that using lcc as the base
technology. Have to get back to tree generation and code emit. Std lcc
can't just be hacked the way Jan did on XSOC because of the need to
include the Par support. The runtime is really a tiny OS with
scheduler, basic memory management in SW etc. The compiler is actually
90% of the project effort and the HW is almost the easy part,
certainly the fun part. I would like to transfer the compiler workload
pronto but few compiler writers know about Verilog internals etc.

The big kicker here, as I keep saying, is that end user code can be
written in any of the 3 styles: maybe start in C and rewrite parts in
Occam-style message passing to get more parallelism. For real speed-ups,
rewrite in HDL style and voila, the SW can be synthed with any free
FPGA synth tool into something like a coprocessor. This fits in very well
with the good article the Altera guy linked here a few days ago on another
thread.

Better get back to work

johnjakson_usa_com

508-4800777

john jakson · Feb 23, 2004

jon@beniston.com (Jon Beniston) wrote in message news:<e87b9ce8.0402230332.3e1160e@posting.google.com>...

I am even tempted to max the datapath to 64b as it only
adds 3-4 pipestages and not much to the control.

Sure, but the more pipeline stages you add, the longer the latency is
for each instruction. How many cycles latency will there be for a
single add instruction? Do you intend to make sure that the number of
threads is equal to this latency, so that the latency as perceived by the
thread executing the instruction is 0?

What's your cache / memory architecture? Handling lots of threads
could be tricky.

Cheers,
JonB

I just posted a very long reply but the server just xxxxed it so I
will write it again later offline.

Quick answer yes, HT must match 4 or 8 etc. Cache architecture is
currently 1 way set associative, but more Blockrams would allow more
ways. Question of whether the FPGA should hold lots of lite cpus or 1
monster cpu or maybe combinations of both!
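A 1-way set-associative cache is a direct-mapped one: each address can live in exactly one line. The sketch below shows the lookup under assumed parameters. The line count and line size are hypothetical, not figures from the design, and only the tag check is modelled, not the data array.

```python
class DirectMappedCache:
    """Tag store of a direct-mapped (1-way set-associative) cache."""

    def __init__(self, n_lines=128, line_bytes=16):
        self.n_lines = n_lines        # hypothetical number of cache lines
        self.line_bytes = line_bytes  # hypothetical line size in bytes
        self.tags = [None] * n_lines  # None marks an empty line

    def split(self, addr):
        """Split a byte address into (tag, index)."""
        line_addr = addr // self.line_bytes
        return line_addr // self.n_lines, line_addr % self.n_lines

    def access(self, addr):
        """Return True on a hit; on a miss, fill the line and return False."""
        tag, index = self.split(addr)
        hit = self.tags[index] == tag
        self.tags[index] = tag
        return hit


if __name__ == "__main__":
    cache = DirectMappedCache(n_lines=4, line_bytes=16)
    for addr in (0, 4, 64, 0):
        print(hex(addr), "hit" if cache.access(addr) else "miss")
```

Addresses 0 and 64 map to the same line in this small configuration, so they keep evicting each other. With several threads sharing one such cache, conflicts of this kind are the usual argument for adding more ways when block RAMs allow.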

Regards

johnjakson_usa_com

508 4800777 EST after 8pm

Ray Andraka · Feb 23, 2004

John,

In my experience, the stumbling block for custom CPUs is not so much the hardware as it is the compiler for
it. I did a small microcontroller for an XC4036E design several years back that ran at 66 MHz. It was a
pretty simple machine that was sort of a cross between a PIC and an RCA1802 in that it used a 16 deep
register file like the 1802, and it was a Harvard architecture like the PIC. Like the 1802, the operands
for the ALU were fetched from the register file and results returned to the register file. The beauty of it
was that for control applications, you often did not even need any memory beyond the register file. The
processor size was about 80 CLBs (translates to 80 slices in current architectures). I'm not a compiler
person, so the big difficulty I had with it was the compiler.

I suspect that the difficulty for just about any home grown processor is going to be the tools to compile the
code for it, although folks who are more savvy than I on the software side might argue that the high speed
hardware design is the hard part.

john jakson wrote:

jon@beniston.com (Jon Beniston) wrote in message news:<e87b9ce8.0402230332.3e1160e@posting.google.com>...
I am even tempted to max the datapath to 64b as it only
adds 3-4 pipestages and not much to the control.

Sure, but the more pipeline stages you add, the longer the latency is
for each instruction. How many cycles latency will there be for a
single add instruction? Do you intend to make sure that the number of
threads is equal to this latency, so that the latency as perceived by the
thread executing the instruction is 0?

What's your cache / memory architecture? Handling lots of threads
could be tricky.

Cheers,
JonB

I just posted a very long reply but the server just xxxxed it so I
will write it again later offline.

Quick answer yes, HT must match 4 or 8 etc. Cache architecture is
currently 1 way set associative, but more Blockrams would allow more
ways. Question of whether the FPGA should hold lots of lite cpus or 1
monster cpu or maybe combinations of both!

Regards

johnjakson_usa_com

508 4800777 EST after 8pm

--
--Ray Andraka, P.E.
President, the Andraka Consulting Group, Inc.
401/884-7930 Fax 401/884-7950
email ray@andraka.com
http://www.andraka.com

"They that give up essential liberty to obtain a little
temporary safety deserve neither liberty nor safety."
-Benjamin Franklin, 1759

Jim Granville · Feb 23, 2004

Ray Andraka wrote:

John,

In my experience, the stumbling block for custom CPUs is not so much the hardware as it is the compiler for
it. I did a small microcontroller for an XC4036E design several years back that ran at 66 MHz. It was a
pretty simple machine that was sort of a cross between a PIC and an RCA1802 in that it used a 16 deep
register file like the 1802, and it was a Harvard architecture like the PIC. Like the 1802, the operands
for the ALU were fetched from the register file and results returned to the register file. The beauty of it
was that for control applications, you often did not even need any memory beyond the register file. The
processor size was about 80 CLBs (translates to 80 slices in current architectures). I'm not a compiler
person, so the big difficulty I had with it was the compiler.

I suspect that the difficulty for just about any home grown processor is going to be the tools to compile the
code for it, although folks who are more savvy than I on the software side might argue that the high speed
hardware design is the hard part.

This is right, and John admits this in another reply.
You should also add DEBUG support, as that's more important as the CPU
targets bigger applications.
Once you have a compiler, users will want to do more and more, and
then debug becomes very important.

It depends a lot on the target use.
Something that runs from a Block RAM inside the FPGA can be very
small and very fast, but is probably best coded in some form of assembler.
The best example of 'Advanced Assembler Art' is Randy Hyde's HLA (High
Level Assembly), but that currently targets only x86
- though I'm sure that's not hard to fix.

This HLA allows IF..THEN..ELSIF etc, and handles the labels needed,
as well as giving local scope (so is a big step-up from vanilla ASM).
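
As a hedged illustration of what "handles the labels needed" involves, the Python sketch below lowers an IF..ELSIF..ELSE chain into flat assembly with generated labels. It does not reproduce HLA's syntax or implementation; the mnemonics `jump_if_not` and `jump`, the label naming scheme, and the function name `lower_if` are all invented for the example.

```python
import itertools


def lower_if(arms, else_body=None, labels=None):
    """Lower [(condition, body_lines), ...] plus optional else into flat lines.

    Each arm becomes: a conditional skip, the body, a jump to the common
    end label, then the skip target. Labels come from a shared counter so
    nested or repeated constructs never collide.
    """
    if labels is None:
        labels = itertools.count()
    end = f"L{next(labels)}_end"
    out = []
    for cond, body in arms:
        nxt = f"L{next(labels)}_next"
        out.append(f"    jump_if_not {cond}, {nxt}")   # hypothetical mnemonic
        out.extend(f"    {line}" for line in body)
        out.append(f"    jump {end}")                  # hypothetical mnemonic
        out.append(f"{nxt}:")
    if else_body:
        out.extend(f"    {line}" for line in else_body)
    out.append(f"{end}:")
    return out


lines = lower_if(
    [("r0 == 0", ["add r1, r1, r2"]),
     ("r0 == 1", ["sub r1, r1, r2"])],
    else_body=["ldi r1, 0"],
)
print("\n".join(lines))
```

The programmer writes the structured form once; the tool invents and places every label, which is most of the bookkeeping that makes plain assembler tedious.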

-jg

john jakson · Feb 23, 2004

Ray Andraka <ray@andraka.com> wrote in message news:<403A43E8.6338D0C1@andraka.com>...

John,

In my experience, the stumbling block for custom CPUs is not so much the hardware as it is the compiler for
it. I did a small microcontroller for an XC4036E design several years back that ran at 66 MHz. It was a
pretty simple machine that was sort of a cross between a PIC and an RCA1802 in that it used a 16 deep
register file like the 1802, and it was a Harvard architecture like the PIC. Like the 1802, the operands
for the ALU were fetched from the register file and results returned to the register file. The beauty of it
was that for control applications, you often did not even need any memory beyond the register file. The
processor size was about 80 CLBs (translates to 80 slices in current architectures). I'm not a compiler
person, so the big difficulty I had with it was the compiler.

I suspect that the difficulty for just about any home grown processor is going to be the tools to compile the
code for it, although folks who are more savvy than I on the software side might argue that the high speed
hardware design is the hard part.

Hi Ray

Half agreed. As Jan has shown, any standard RISC CPU project can grab lcc
and do the task quite quickly by messing with the emit tables. If this
were just another standard RISC project I'd probably do the same, but then
it wouldn't be anywhere near 300 MHz either, more like MicroBlaze.

Only hyperthreading allows max speed, but if the processes don't
communicate with each other then lcc could still be used as is,
ignoring the HT stuff.

Some of my background is in compilers and other tools, but I never
worked for anybody doing that. The lcc compiler (Hanson & Fraser) is
possibly the best documented C compiler writing textbook around and
highly recommended, as it explains thoroughly just how horrible C really
is, where most C books gloss over its complexity. The complexity for
me comes because I am essentially combining 3 languages together and
putting in a mini OS runtime. The Transputer did it before, but chose
an unfriendly syntax and supported C only as an afterthought.

I will probably get through it OK. I would love to pass that part
on, but then that person would be knee deep in it instead.

The HW part is more fun though. The 1802 takes me back, not bad in a
twisted sort of way, it certainly used very little logic, I had it
under a scope at Inmos.

Regards

johnjakson_usa_com

Hal Murray · Feb 24, 2004

I suspect that the difficulty for just about any home grown processor is going to be the tools to compile the
code for it, although folks who are more savvy than I on the software side might argue that the high speed
hardware design is the hard part.

How much code are you writing? Would you be willing/happy to do it in assembler?

Assemblers can be pretty simple, especially if the target is raw binary
loaded and running at address 0 rather than something needing linkers and
libraries. It also helps if the target is RISC and doesn't have messy
addressing modes.

How much would a reasonably clean sample assembler help? There should be
a good example from the academic world. Just type in the new opcode table.
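
For a sense of scale, here is a minimal table-driven two-pass assembler in Python that emits raw words for loading at address 0, with no linker or libraries. Everything about the target is an assumption made for the sketch: the mnemonics, the 16-bit instruction word (4-bit opcode, 4-bit register fields, 8-bit immediate or address), and the `;` comment syntax. Retargeting it amounts to editing the `OPCODES` table.

```python
# mnemonic -> (opcode, operand kinds): r = register, i = 8-bit immediate,
# a = 8-bit absolute address (label). All values here are hypothetical.
OPCODES = {
    "add":  (0x1, "rrr"),
    "sub":  (0x2, "rrr"),
    "ldi":  (0x3, "ri"),
    "jnz":  (0x4, "ra"),
    "halt": (0xF, ""),
}


def assemble(source):
    """Return a list of 16-bit words for a program loaded at address 0."""
    # Pass 1: strip comments and blank lines, record label addresses.
    lines, labels = [], {}
    for raw in source.splitlines():
        text = raw.split(";")[0].strip()
        if not text:
            continue
        if text.endswith(":"):
            labels[text[:-1]] = len(lines)      # address of next instruction
            continue
        lines.append(text)

    # Pass 2: encode each instruction from the opcode table.
    words = []
    for text in lines:
        parts = text.split(None, 1)
        mnemonic = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        opcode, kinds = OPCODES[mnemonic]
        operands = [s.strip() for s in rest.split(",") if s.strip()]
        if len(operands) != len(kinds):
            raise ValueError(f"bad operand count: {text!r}")
        word = opcode << 12
        shift = 8                                # next free register field
        for kind, operand in zip(kinds, operands):
            if kind == "r":
                word |= (int(operand[1:]) & 0xF) << shift   # "r7" -> 7
                shift -= 4
            elif kind == "i":
                word |= int(operand, 0) & 0xFF
            else:                                # "a": resolve label
                word |= labels[operand] & 0xFF
        words.append(word)
    return words


SOURCE = """
        ldi r0, 5        ; loop counter
        ldi r1, 0        ; accumulator
        ldi r2, 1
loop:
        add r1, r1, r0
        sub r0, r0, r2
        jnz r0, loop
        halt
"""

print([f"{w:04X}" for w in assemble(SOURCE)])
```

Because labels are collected in the first pass, forward references work without any extra machinery.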

--
The suespammers.org mail server is located in California. So are all my
other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's. I hate spam.

Ray Andraka · Feb 24, 2004

The low complexity is why I chose the architecture I did. Unfortunately, I did that design in schematics, before
I started using VHDL, so resurrecting it at this point involves more time than I can devote to it.

john jakson wrote:

The HW part is more fun though. The 1802 takes me back, not bad in a
twisted sort of way, it certainly used very little logic, I had it
under a scope at Inmos.

Regards

johnjakson_usa_com

