Spartan 3 - available in small quantities?

"Hal Murray wrote:
I suspect that the difficulty for just about any home grown processor is going to be the tools to compile the
code for it, although folks who are more saavy than I on the software side might argue that the high speed
hardware design is the hard part.


How much code are you writing? Would you be willing/happy to do it in assembler?

Assemblers can be pretty simple, especially if the target is raw binary loaded
at address 0 rather than something needing linkers and libraries. It also helps
if the target is RISC and doesn't have messy addressing modes.

How much would a reasonably clean sample assembler help? There should be
a good example from the academic world. Just type in the new opcode table.

"AS" from Alfred Arnold is a good wide-cores assembler, with a choice of
Pascal or C sources :

http://john.ccac.rwth-aachen.de:8000/as/download.html

And HLA (High Level Assembler) is currently x86-only, but the front
end and approach are much closer to higher-level languages (minus
the bloat). V2 will allow different back ends for opcode output.
Worth watching.

http://webster.cs.ucr.edu/AsmTools/HLA/index.html

It is able to support quite large code efforts while remaining
close to the iron.

A benefit of working from the 'best assembler' end is the ease of
supporting multiple/tiny core instances - which is one of the
advantages of such soft cores.

-jg
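
To make the "just type in the new opcode table" point concrete, here is a
minimal sketch of a table-driven assembler core. The mnemonics, the 16b
encoding and the register fields are invented for illustration, not taken
from any of the cores discussed here.

/* Minimal sketch of a table-driven assembler: retargeting means editing
 * the opcode table.  The 16-bit encoding below is hypothetical. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>

struct opcode { const char *mnemonic; uint16_t base; int nregs; };

static const struct opcode table[] = {
    { "add", 0x1000, 3 },   /* add rd, ra, rb */
    { "ld",  0x2000, 2 },   /* ld  rd, ra     */
    { "bra", 0x3000, 0 },   /* bra            */
    { NULL,  0,      0 }
};

/* Assemble one line like "add r1, r2, r3" into a 16-bit word. */
static int assemble(const char *line, uint16_t *out)
{
    char mnem[16];
    int r[3] = { 0, 0, 0 };
    int n = sscanf(line, "%15s r%d, r%d, r%d", mnem, &r[0], &r[1], &r[2]);
    if (n < 1)
        return -1;
    for (const struct opcode *op = table; op->mnemonic; op++) {
        if (strcmp(mnem, op->mnemonic) == 0 && n - 1 >= op->nregs) {
            /* Pack registers into assumed 4-bit fields. */
            *out = op->base | (r[0] << 8) | (r[1] << 4) | r[2];
            return 0;
        }
    }
    return -1;  /* unknown mnemonic */
}

int main(void)
{
    uint16_t word;
    if (assemble("add r1, r2, r3", &word) == 0)
        printf("%04x\n", word);  /* raw binary running at address 0, no linker */
    return 0;
}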
 
Cache architecture is
currently 1 way set associative, but more Blockrams would allow more
ways.
Do you not think that the number of ways has to be at least as great
as the number of threads? I would expect a significant number of
conflict misses (particularly in the I-Cache) if this is not the case.
Hit-under-miss is a must.
Otherwise all those impressive Mega-Hurtz will just be thrown away
stalling for cache refills.

Cheers,
JonB
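
(To put the conflict-miss concern in concrete terms, here is a rough
sketch assuming a hypothetical direct-mapped cache of 512 lines of 32
bytes: two threads whose code sits exactly one cache-size apart land on
the same indices and evict each other on every fetch.)

/* Rough illustration of inter-thread conflict misses in a 1-way
 * (direct-mapped) cache.  Cache geometry is hypothetical: 512 lines
 * of 32 bytes (16 KB total). */
#include <stdio.h>
#include <stdint.h>

#define LINE_BYTES 32
#define NUM_LINES  512

static unsigned index_of(uint32_t addr)
{
    return (addr / LINE_BYTES) % NUM_LINES;
}

int main(void)
{
    uint32_t thread_a = 0x00000000;
    uint32_t thread_b = 0x00004000;   /* 16 KB higher: same indices as A */

    for (int i = 0; i < 4; i++) {
        uint32_t a = thread_a + i * LINE_BYTES;
        uint32_t b = thread_b + i * LINE_BYTES;
        printf("A:%08x -> index %3u   B:%08x -> index %3u%s\n",
               (unsigned)a, index_of(a), (unsigned)b, index_of(b),
               index_of(a) == index_of(b) ? "   (conflict)" : "");
    }
    /* With 2 or more ways (or hit-under-miss plus thread switching),
     * both threads' lines could coexist instead of forcing refills. */
    return 0;
}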
 
Jim Granville <no.spam@designtools.co.nz> wrote in message news:<RDx_b.28109$ws.3170985@news02.tsnz.net>...
"Hal Murray wrote:
I suspect that the difficulty for just about any home grown processor is going to be the tools to compile the
code for it, although folks who are more saavy than I on the software side might argue that the high speed
hardware design is the hard part.


How much code are you writing? Would you be willing/happy to do it in asembler?

Assemblers can be pretty simple, especially if the target is raw binary running
at loaded at 0 rather than something needing linkers and libraries. Also helps
if the target is RISC and doesn't have messy addressing modes.

How much would a reasonably clean sample assembler help? There should be
a good example from the academic world. Just type in the new opcode table.


"AS" from Alfred Arnold is a good wide-cores assembler, with a choice of
Pascal or C sources :

http://john.ccac.rwth-aachen.de:8000/as/download.html

And HLA (High Level Assembler) is currently x86 only, but the front
end, and approach is much closer to higher level languages (but minus
the bloat). V2 will allow different back ends, for opcode outputs.
Worth watching.

http://webster.cs.ucr.edu/AsmTools/HLA/index.html

This is able to support quite large code efforts, and remain
close to the iron..

A benefit of working from the 'best assembler' end, is the ease of
support multiple/tiny core instances - which is one of the
advantages of such soft cores.

-jg
Although an assembler is only a tiny fraction of the effort of a C
compiler, once done it only opens the door just enough to bootstrap up
slowly. For a processor to have much wider appeal, the full effort is
needed, either to port a compiler or to write one from scratch.

I will probably set the hard type semantics of C aside for a while and
just add a very quick and dirty codegen that handles C-style assembler and
simple single-size expressions with none of the usual optimizations, and
just play dumb. Then baseline C/Verilog/Occam/inline asm can be
written that might violate some proper rules. The compiler wouldn't be
able to compile itself, but I could get on with testbench and
verification. Right now it can analyze itself but doesn't emit
anything. It does have a nice #preprocessor built into the lexer that
allows C++-like use of definitions with the same name but a varying
number of parameters, something not described in the lcc book.
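
(The custom preprocessor is described as allowing that directly. Standard
cpp cannot, but the usual variadic-macro argument-counting trick gives a
similar effect, which may help picture the feature; the WIRE name and the
printf bodies below are just for illustration.)

#include <stdio.h>

#define COUNT_ARGS(...)              COUNT_ARGS_(__VA_ARGS__, 3, 2, 1, 0)
#define COUNT_ARGS_(a, b, c, n, ...) n

#define WIRE(...)              WIRE_DISPATCH(COUNT_ARGS(__VA_ARGS__), __VA_ARGS__)
#define WIRE_DISPATCH(n, ...)  WIRE_DISPATCH_(n, __VA_ARGS__)
#define WIRE_DISPATCH_(n, ...) WIRE_##n(__VA_ARGS__)

/* Same name, different number of parameters, dispatched by arity. */
#define WIRE_1(name)              printf("wire %s, width 1\n", #name)
#define WIRE_2(name, width)       printf("wire %s, width %d\n", #name, width)
#define WIRE_3(name, width, init) printf("wire %s, width %d, init %d\n", #name, width, init)

int main(void)
{
    WIRE(clk);           /* one-parameter form   */
    WIRE(addr, 16);      /* two-parameter form   */
    WIRE(ctrl, 4, 0);    /* three-parameter form */
    return 0;
}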

The usual way in the past was to define subsets of the target language
and compile for that, with the compiler itself also restricted to that
level. The first pass might be an assembler. The compiler could then
operate at some level on the target, and as the language subset is
raised, the compiler gets to use the new features and tests them on
the next round. I don't think people do that anymore unless the
language is brand new and no compiler exists yet. Once one does exist,
it's usually easier to cross-port.

This brings up a point: can a new compiler be sold or distributed
commercially if the design is largely based on previous open code? I
will have to go check the license on lcc.

johnjakson_usa_com
 
jon@beniston.com (Jon Beniston) wrote in message news:<e87b9ce8.0402240132.7a92aa17@posting.google.com>...
Do you not think that the number of ways has to be at least as great
as the number of threads? I would expect a significant number of
conflict misses (particularly in the I-Cache) if this is not the case.

Hi Jon

Not necessarily. On a conventional HT cpu, the threads would all be
independent and would likely fight over the cache set size, so 2-way would
probably be a minimum. Since these threads are supposed to be cooperating
as Occam processes would, their opcodes would be local, but that
assumes sibling processes run close to each other in time space. No
guarantee of that. In the HW event-driven case, it's much easier to
speculate about what will likely happen, as the scheduling model is so
much simpler. Even if there are lots of conflicts, what will happen is
that the threads will just keep delaying.

In the HW time wheel, there are actually 16 threads waiting to go (or
null Ps if fewer are available). These 16 represent the front of the proper
P queue, stored as a linked list out in memory space (only some of which
might be in cache at any time). The HW only allows the front 4 of
those to queue up in the Iop queue. The fetcher steals or forces
available cache read slots to keep this full, rotating between the 4
queues, which live inside distributed 16b DP rams, 64 wide. Hence
each running thread can buffer up to 16 small ops, or 4 extended 64b
ops, or some mix.
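
(For what it's worth, a rough behavioural C sketch of that arrangement;
the sizes follow the description above, but the names and the fill policy
are guesses, not the actual HDL.)

/* Behavioural sketch of the 16-slot time wheel feeding 4 per-thread
 * instruction queues: 16 waiting Ps, front 4 with Iop queues, each
 * buffering 16 x 16b ops (or 4 extended 64b ops). */
#include <stdint.h>
#include <string.h>

#define WHEEL_SLOTS   16   /* threads waiting to go (null Ps fill gaps)  */
#define ACTIVE_QUEUES 4    /* front of the wheel, each with an Iop queue */
#define QUEUE_OPS     16   /* 16 small 16b ops, or 4 extended 64b ops    */

struct iop_queue {
    uint16_t ops[QUEUE_OPS];
    int      head, tail, count;
    int      pid;                          /* which P these ops belong to */
};

struct time_wheel {
    int              waiting[WHEEL_SLOTS]; /* Pids at the front of the memory-resident P list */
    struct iop_queue active[ACTIVE_QUEUES];
    int              fill_rr;              /* round-robin fill pointer */
};

/* The fetcher steals a cache read slot and pushes one 64b word
 * (four 16b ops) into whichever active queue it is rotating over. */
static void fetch_one_word(struct time_wheel *w, const uint16_t word[4])
{
    struct iop_queue *q = &w->active[w->fill_rr];
    for (int i = 0; i < 4 && q->count < QUEUE_OPS; i++) {
        q->ops[q->tail] = word[i];
        q->tail = (q->tail + 1) % QUEUE_OPS;
        q->count++;
    }
    w->fill_rr = (w->fill_rr + 1) % ACTIVE_QUEUES;  /* rotate between the 4 queues */
}

int main(void)
{
    struct time_wheel w;
    memset(&w, 0, sizeof w);
    const uint16_t word[4] = { 0x1123, 0x1456, 0x2789, 0x3abc };  /* four 16b ops */
    fetch_one_word(&w, word);          /* fills one of the 4 active queues */
    return 0;
}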

On a side note, if the cpu were 64b wide, the HT would have to be
8-way, but then the Iqueue HW would be twice as wide too, so that still
allows each P to buffer the same number of ops. I would have to tweak the
HDL code to group the rams for height v width, keeping the output 64b
wide always. Wider data ops don't really change the opcode fetch
rate, since fetch now looks half as busy as before. The fractional cost
of executing ops changes from 9/8 to 17/16 cycles for ALU a<-b OP
c, so a slight speed up. Putting large literals or actual addresses in
code space would wipe that out.


The fetcher also writes the Pid with the opcodes, and it does a
superficial check to see whether any 16b ops are bra codes or not. If it
pushes a bra, it will then keep pushing just a few more words until
it's past, and then rotate that Pid out and take the next one from the
other 12 waiting. The other side of the Iop queue just reads the 1-4
wide opcodes with the Pid and decodes/executes the ops. It tracks the
opcode size and uses any bra codes as just another control field. By
the time the bra decision is available, the Iop box will have executed
4 ops, but then it will have been P switched already. The bra
decision, when it does arrive, will post back the modified ip into the
Pid-selected ip field. Pid rides along the datapath pipeline too. Bra
pts may be used to do the outer timesharing, but I may leave that to a
SW kernel.
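
(The "push a few more words past the bra, then rotate that Pid out" fetch
policy might look roughly like this as a behavioural sketch; the bra
encoding and the "few more words" count are assumptions for illustration.)

/* Sketch of the fetch-side branch policy: the fetcher does only a
 * superficial check for bra codes in each 16b op; once it has pushed a
 * word containing a bra, it pushes just a few more words, then stops and
 * rotates this Pid out in favour of one of the other waiting Ps. */
#include <stdio.h>
#include <stdint.h>

#define EXTRA_WORDS_AFTER_BRA 2           /* "just a few more words" - guessed */

static int is_bra(uint16_t op)            /* superficial check, no real decode */
{
    return (op & 0xF000) == 0x3000;       /* hypothetical bra opcode field */
}

/* Returns how many 64b words (4 x 16b ops each) to push for this Pid
 * before rotating it out. */
static int words_to_fetch(const uint16_t (*code)[4], int words_available)
{
    for (int i = 0; i < words_available; i++)
        for (int j = 0; j < 4; j++)
            if (is_bra(code[i][j])) {
                int n = i + 1 + EXTRA_WORDS_AFTER_BRA;
                return n < words_available ? n : words_available;
            }
    return words_available;               /* no bra seen: keep fetching for this P */
}

int main(void)
{
    const uint16_t code[5][4] = {
        { 0x1123, 0x1456, 0x1789, 0x1abc },   /* plain ALU ops      */
        { 0x1def, 0x3008, 0x1111, 0x1222 },   /* second op is a bra */
        { 0x1333, 0x1444, 0x1555, 0x1666 },
        { 0x1777, 0x1888, 0x1999, 0x1aaa },
        { 0x1bbb, 0x1ccc, 0x1ddd, 0x1eee },
    };
    printf("fetch %d words for this Pid before rotating it out\n",
           words_to_fetch(code, 5));
    return 0;
}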

Cache misses will probably be treated the same way: if the miss is going
to be long, switch to the next P in the side queue. You can imagine a
little railway track, a figure of 8, made up of selective pipelines &
muxes holding minimal P state, with something like Johnson logic or a
one-hot coded state engine in charge.

One huge difference between this HT processor and the ones you hear
about, x86, Alpha etc.: I expect to use RLDRAM as 2nd-level cache, which
RAS cycles in 20ns, about 1.5 effective cpu cycles of 13.3ns. It
is 8-way banked and can support 2.5ns data rates and control. I will
probably be limited to the 311MHz rate, and DDR is limited to 622MHz in
the specs (a convenient 2x); this is right on the edge of what FPGAs can
do and below the RLDRAM2 800MHz std.

Remember x86 in particular has to be designed to work with very slow
RAS el-cheapo DDR RAMs, which can be several 100x slower than cpu
cycles. Intel can't do a special tweak for RLDRAM since the difference
is still very large, maybe 50x or more.

In this cpu I could almost throw the cache out and go direct to RLDRAM as
main memory, which is why I am not too concerned about the tiny cache. I
will be building an RLDRAM model soon by faking a bunch of 8 Blockrams
together with delays and muxes/demuxes. This will let me test out 1-8
cpu models running with faked RLDRAM all inside a sp-400 part. Further,
a 64b 8-way HT cpu would actually cycle slower than RLDRAM, i.e. 26ns.
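
(A rough behavioural sketch of the 8-bank timing idea: each bank can start
a new access only once its previous one has aged past the RAS cycle, but
accesses to different banks can be issued back to back. The 20ns bank
cycle and 2.5ns issue rate follow the figures above; the bookkeeping is
an illustration, not the planned BlockRAM/HDL model.)

#include <stdio.h>

#define NUM_BANKS       8
#define BANK_CYCLE_NS   20.0   /* RAS cycle per bank */
#define ISSUE_STEP_NS    2.5   /* command/data rate  */

static double bank_ready_at[NUM_BANKS];   /* earliest time each bank is free */

/* Try to issue an access to 'bank' at time 'now'; returns the time the
 * access actually starts (it may have to wait for its bank). */
static double issue(int bank, double now)
{
    double start = now > bank_ready_at[bank] ? now : bank_ready_at[bank];
    bank_ready_at[bank] = start + BANK_CYCLE_NS;
    return start;
}

int main(void)
{
    double t = 0.0;
    /* Round-robin over banks: 8 banks x 2.5ns exactly covers the 20ns
     * bank cycle, so every access starts on time and the part streams
     * at the issue rate despite the per-bank RAS cycle. */
    for (int i = 0; i < 16; i++) {
        int bank = i % NUM_BANKS;
        printf("access %2d bank %d starts at %5.1f ns\n", i, bank, issue(bank, t));
        t += ISSUE_STEP_NS;
    }
    return 0;
}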

The real purpose of the cache, which is a unified
data-instruction-workspace cache, is to satisfy the enormous bandwidth
requirement of the workspace operations. Register cpus have 1 or more
register files separate from the d/i cache, but they carry the burden of
very expensive context swaps. R3 keeps many workspaces in the unified
cache and provides 3 ports to the datapath, 2 reads and 2 joined writes,
using a pair of DP rams. The instruction and data fetch requirements
could be met by fast RLDRAM without cache, though some buffering would
still be needed. The T9000-style workspace caching is what makes this
all work, and it lets the cpus run close to RLDRAM speed. If R3 ever
went ASIC and n x faster, of course the cache would go full custom and
be bigger by far.

Hope that helps

johnjakson_usa_com
 
Ray Andraka wrote:
In my experience, the stumbling block for custom CPUs is not so much
the hardware as it is the compiler for it.
Jan Gray did an interesting article on this for Circuit Cellar
a few years back, targeting the lcc compiler. The article
will still be on www.fpgacpu.org
 
