Building the 'uber processor'

Goran Bilski <goran@xilinx.com> wrote in message news:<bocq73$bgh1@cliff.xsj.xilinx.com>...
Hi,

I have been following this thread with great interest.

If you need a processor with links to/from the processor register file
then MicroBlaze could be the answer.

MicroBlaze has 18 direct links (in the current version, the ISA allows
up to 2048) and 8 new instructions for sending
or receiving data to/from the register file.

The connection is called LocalLink (or FSL) and has these features:
- Unshared non-arbitrated communication
- Control and Data support
- Uni-directional point-to-point
- FIFO based
- 600 MHz standalone operation

-

Hi Goran

What I am really after is a speedy Transputer, better still many of
them distributed inside & across FPGAs. Not the original with funny
8-bit opcodes (partly my fault) but a modern design that is RISC &
targeted to FPGA using MicroBlaze as a HW/performance reference. I
would budget for about 2x the cost before thinking of FPU, still
pretty cheap.

The important part is that the ISA supports process communication
transparently, with the scheduler in HW. The physical links, internal or
external, are only a part of it. Since many cpus now have these links,
and serial speeds can be far in excess of cycle speed, that's nice, but
no use if the programmer has to drive them by hand. With an
improved event wheel scheduler in HW too, HW simulation becomes
possible for HW that might be "hard" or "soft", but then HW in FPGAs
is not strictly "hardware" either (see old thread). So if HW & SW can
be somewhat interchanged, it becomes easier to migrate large sequential C
problems gradually into C-Occam par/seq, then into more HDL par, all
from inside one (maybe ugly) language. It would be even nicer to start
over with a new, leaner language that can cover HDL & SW, but it's more
practical to fuse together the languages people actually use.

Who is the potential customer for this? Any SW-HW person interested in
speeding up SW, like the original poster, or any embedded engineer who
wants to customize a cpu with his own HW addons, using Occam-style
channels to link them. I could go on, but much work to do.


John

johnjaksonATusaDOTcom
 
Hi John,

The new instructions in MicroBlaze for handling these LocalLinks are
simple, but there is no HW scheduler in MicroBlaze. I have done a processor
before with a complete Ada RTOS in HW, but it would be overkill in an FPGA.

The LocalLinks for MicroBlaze are 32 bits wide, so they are not serial.
They can handle a new word every clock cycle.

<rant mode on>

MicroBlaze has 8 new instructions for handling the LocalLinks (or as I
call them, FSL).
FSL = Fast Simplex Links
There are mainly two instructions, each with four options.
The instruction for reading from an FSL is called GET:
GET rD, FSL #n
This will get a word from FSL number #n and move the data into
register rD.
The instruction for writing a value to an FSL is:
PUT rA, FSL #n
This will move the value in register rA to FSL #n.

Each FSL has a FIFO which controls whether there is data available on
the FSL or whether the FSL is full.
If you try to do a GET when the FSL is empty, MicroBlaze will stall
until the data is there, and if you try to do a PUT on an FSL which is
full, MicroBlaze will stall until space is available on the FSL.

So the normal GET and PUT are blocking, which is great when you
communicate with HW since there is no "read status bits, mask status
bits, branch back if not ready" loop to eat into the bandwidth to the
HW. HW tends to be much faster than SW, so in general the blocking
instructions will never actually block.

One option on the instructions is a nonblocking version of GET and PUT,
called nGET and nPUT.
These will try to perform the operation; the carry bit is set if they
were successful and cleared if they failed. This makes it possible to
poll for communication on the FSL.
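A rough software model of those semantics, for illustration (the names, the FIFO depth, and the use of a C struct are my own assumptions, not the actual MicroBlaze primitives; the real nGET/nPUT report success in the carry flag, modeled here as a bool return):

```c
#include <stdbool.h>
#include <stdint.h>

#define FSL_DEPTH 4                /* hypothetical FIFO depth */

typedef struct {
    uint32_t buf[FSL_DEPTH];
    int head, tail, count;
} fsl_t;

/* nPUT: like PUT but never stalls; returns true ("carry set") on
   success, false if the FSL FIFO is full */
static bool fsl_nput(fsl_t *f, uint32_t word)
{
    if (f->count == FSL_DEPTH)
        return false;
    f->buf[f->tail] = word;
    f->tail = (f->tail + 1) % FSL_DEPTH;
    f->count++;
    return true;
}

/* nGET: like GET but never stalls; returns true on success,
   false if the FSL FIFO is empty */
static bool fsl_nget(fsl_t *f, uint32_t *word)
{
    if (f->count == 0)
        return false;
    *word = f->buf[f->head];
    f->head = (f->head + 1) % FSL_DEPTH;
    f->count--;
    return true;
}
```

The blocking GET/PUT would simply spin (or stall the pipeline, in HW) on the same full/empty conditions instead of returning false.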

The FSL also has an extra bit, intended for sending more than 32 bits of
data; you can consider it tag information. We call it "control".
A normal PUT will set this signal to '0', but cPUT will set it to '1'.
The control signal is stored along with the data in the FIFO, so the
FIFOs are really 33 bits wide.
This extra bit permits more synchronization and control/data framing
between two FSL units.
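One way to picture the 33-bit entry, as a sketch (the type and helper names here are mine; one common use of such a tag is framing, i.e. marking the last word of a message):

```c
#include <stdbool.h>
#include <stdint.h>

/* One FIFO entry: 32 data bits plus the "control" tag bit,
   i.e. the 33-bit word described above. */
typedef struct {
    uint32_t data;
    bool     control;              /* '0' from PUT, '1' from cPUT */
} fsl_word_t;

static fsl_word_t fsl_put_word(uint32_t d)  { return (fsl_word_t){ d, false }; }
static fsl_word_t fsl_cput_word(uint32_t d) { return (fsl_word_t){ d, true  }; }

/* a receiver can find a message boundary without counting words:
   the sender uses cPUT only for the final word */
static bool fsl_is_last_word(fsl_word_t w) { return w.control; }
```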

The normal usage of FSL is to profile your C code and find a function
where MicroBlaze spends most of its cycles. Move this function into
HW (there are more and more C-to-HW tools).
Create a wrapper in C with the same function name as the HW function.
This wrapper will just take all the parameters and PUT them onto an FSL,
and then do a GET for the result.
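The wrapper idea might look roughly like this (everything here is a stand-in: `fsl_put`/`fsl_get` model the PUT/GET instructions with plain arrays, and `hw_mulacc_step` fakes a HW block that would really run concurrently in fabric):

```c
#include <stdint.h>

/* Two hypothetical FSL channels, modeled as plain arrays: one from the
   CPU to the HW block, one back. */
static uint32_t to_hw[16], from_hw[16];
static int to_wr, to_rd, from_wr, from_rd;

static void     fsl_put(uint32_t v) { to_hw[to_wr++] = v; }
static uint32_t fsl_get(void)       { return from_hw[from_rd++]; }

/* Stand-in for the HW block: consumes three words, produces a*b + c. */
static void hw_mulacc_step(void)
{
    uint32_t a = to_hw[to_rd++];
    uint32_t b = to_hw[to_rd++];
    uint32_t c = to_hw[to_rd++];
    from_hw[from_wr++] = a * b + c;
}

/* The wrapper keeps the original function's name and signature, so the
   rest of the C code never notices the function moved into HW. */
static uint32_t mulacc(uint32_t a, uint32_t b, uint32_t c)
{
    fsl_put(a);
    fsl_put(b);
    fsl_put(c);
    hw_mulacc_step();              /* in real life: the fabric, in parallel */
    return fsl_get();
}
```

The point is that only the wrapper knows about the FSL; callers of `mulacc` are untouched.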

You could also connect up a massive array of MicroBlazes over FSL à la
Transputer, but I think that using the FPGA logic as SW accelerators
will be the more popular way, since FPGA logic can be orders of
magnitude faster than any processor, and with the ease of interconnect
that the FSL provides, that will be the most common use case.

<rant mode off>

Göran

 
john jakson wrote:

What I am really after is a speedy Transputer, better still many of
them distributed inside & across FPGAs.
< snip >
Hi John,

do you know about this nice stuff developed by Cradle
(http://www.cradle.com) ?

They have developed something like an FPGA. But the PFUs
do not consist of generic logic blocks but small processors.
That's perhaps something you would like :)

Regards,
Mario
 
Hi Goran

The new instructions in MicroBlaze for handling these LocalLinks are
simple, but there is no HW scheduler in MicroBlaze. I have done a processor
before with a complete Ada RTOS in HW, but it would be overkill in an FPGA.
... now that sounds like something we could chat about for some time.
An Ada RTOS in HW certainly would be heavy, but the Occam model is
very light. The Burns book on Occam compares them, the gist being that
Ada has something for everybody, and Occam is maybe too light. Anyway,
they both rendezvous. At the beginning of my Inmos days we were
following Ada and the iAPX 432 very closely to see where concurrency on
other cpus might go (or not, as the case turned out). Inmos went for
simplicity, Ada went for complexity.

Thanks for all the gory details.

The LocalLinks for MicroBlaze are 32 bits wide, so they are not serial.
They can handle a new word every clock cycle.

You could also connect up a massive array of MicroBlazes over FSL à la
Transputer, but I think that using the FPGA logic as SW accelerators
will be the more popular way, since FPGA logic can be orders of
magnitude faster than any processor, and with the ease of interconnect
that the FSL provides, that will be the most common use case.
I am curious what the typical power usage of MicroBlaze is per node, and
whether anybody has actually tried to hook any number of them up. If I
wanted a large number of cpus to work on some project that weren't
Transputers, I might also look at PicoTurbo, ClearSpeed or some other
BOPSy cpu array, but they would all be hard to program and I wouldn't be
able to customize them. Having lots of cpus in an FPGA brings up the
issue of how to organize the memory hierarchy. Most US architects seem
to favor the complexity of shared memory and complicated coherent
caches; Europeans seem to favor strict message passing (as I do).

We agree that if SW can be turned into HW engines quickly and obviously,
then for the kernels, sure, they should be mapped right onto FPGA fabric
for whatever speedup. That brings up some points: 1st, a P4 outruns a
typical FPGA app maybe 50x on clock speed. 2nd, converting C code to
FPGA is likely to be a few times less efficient than an EE-designed
engine, I guess 5x. IO bandwidth to an FPGA engine from a PC is a
killer. It means FPGAs are best suited to continuous streaming engines
like real-time DSP. When hooked to a PC, the FPGA would need to be doing
between 50-250x more work in parallel just to break even. But then I
think most PCs run far slower than Intel/AMD would have us believe,
because they too have been turned into streaming engines that stall on
cache misses all too often.
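Spelling out that napkin arithmetic (these are John's rough ratios from the paragraph above, not measured data): the FPGA has to make up both the clock-speed gap and the tool-inefficiency gap in parallelism.

```c
/* required parallelism = clock_ratio * tool_penalty
   clock_ratio  ~50 (P4 vs typical FPGA app clock)
   tool_penalty  1 for hand-written HDL, ~5 for C-to-gates output */
static int fpga_breakeven(int clock_ratio, int tool_penalty)
{
    return clock_ratio * tool_penalty;
}
```

With those assumed ratios, the two ends of the range come out as 50x (hand HDL) and 250x (C-to-gates), which is where the 50-250x figure comes from.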

But SW tends to follow the 80/20 (or whatever xx/yy) rule: some little
piece of code takes up most of the time. What about the rest of it? It
will still be sequential code that interacts with the engine(s). We
would still be forced to rewrite the code, cut it with an axe, and
keep one side in C and one part in HDL. If C is used as an HDL, we know
that's already very inefficient compared to EE HDL code.

The Transputer & mixed-language approach allows a middle road between
the PC cluster and the raw FPGA accelerator. It uses fewer resources
than the cluster but more than the dedicated accelerator. Being more
general means that code running on an array of cpus can leave the
decision to commit to HW for later, or never. The less efficient
approach also sells more FPGAs or Transputer nodes than one committed
engine. In the Bioinformatics case, a whole family of algorithms needs
to be implemented, all in C, and some need FP. An accelerator board that
suits one problem may not suit others, so does the Bio guy get another
board? Probably not. TimeLogic is an interesting case study, the only
commercial FPGA solution left for Bio.

My favourite candidate for acceleration is in our own backyard, EDA,
esp P&R; I used to spend days waiting for it to finish on much smaller
ASICs and FPGAs. I don't see how it can get better, as designs are
getting bigger much faster than the Pentium can fake up its speed. One
thing EDA SW must do is use ever more complex algorithms to make up the
shortfall, but that then becomes a roadblock to turning it into HW, so
it protects itself in clutter. Not as important as the Bio problem
(growing at 3x Moore's law), but it's in my backyard.

rant_mode_off

Regards

johnjakson_usa_com
 
Mario Trams <Mario.Trams@informatik.tu-chemnitz.de> wrote in message >
Hi John,

do you know about this nice stuff developed by Cradle
(http://www.cradle.com) ?

< snip >
Thanks for pointer, I hadn't seen it yet, will take a peek.
 
do you know about this nice stuff developed by Cradle
(http://www.cradle.com) ?

They have developed something like an FPGA. But the PFUs
do not consist of generic logic blocks but small processors.
This reminds me of the PACT XPP, which is an array of ALUs with
reconfigurable interconnect. Basically you replace the LUT
with a math component.

New ideas have a hard time unless there's a real advantage
over traditional technology. PACT tries to find its niche
by offering DSP IP, e.g. for the upcoming UMTS cellular market.

Here's their URL: http://www.pactcorp.com/

Marc
 
Hi John,

john jakson wrote:

... now that sounds like something we could chat about for some time.
An Ada RTOS in HW certainly would be heavy, but the Occam model is
very light. The Burns book on Occam compares them, the gist being that
Ada has something for everybody, and Occam is maybe too light. Anyway,
they both rendezvous. At the beginning of my Inmos days we were
following Ada and the iAPX 432 very closely to see where concurrency on
other cpus might go (or not, as the case turned out). Inmos went for
simplicity, Ada went for complexity.

Thanks for all the gory details.


Actually the Ada RTOS was not that large. The whole processor, THOR, was
created to run Ada as efficiently as possible
(http://www.space.se/node3066.asp?product={7EA5439E-962C-11D5-B730-00508B63C9B4}&category=148).
The processor only supported 16 tasks in HW, but you could have more
tasks that had to be swapped in and out.
The other thing was that the processor didn't have interrupts, only
external rendezvous.
There were also other things implemented to handle Ada much better: HW
stack control, HW-handled exception handling, ...

 
"mikegw" <mikegw20@hotmail.spammers.must.die.com> wrote in message news:bo4na0$5qk$1@tomahawk.unsw.edu.au...
Hello all,

I have seen about the place add-on
boards for PCs that act as co-processors. This is the interesting bit
to me. Our research group is looking into building a computer (cluster
perhaps) for calculation of particle dynamics, similar to CFD in
application.
< snip >

Hi,

Been reading this thread. I wonder if, instead of using an FPGA, DSP, general-purpose processor or niche product, it
would be possible to use a graphics processor like the ones developed for 3D graphics boards.

It seems that this particular application requires a lot of 3D processing (distance between particles, direction of
interacting force, ...) and similar matrix calculations. Graphics processors are good at this. They also have huge memory
bandwidth, because they have a 64-bit or 128-bit bus width and DDR, and they support something like 64 MByte or 128 MByte of
memory, so they should be able to handle large data sets very well.

But I'm not sure if you can program these like a 'normal' processor, or whether it would be feasible for Mike's research
group to design a system around such a graphics processor.

Maybe it would be possible to 'hack' a graphics board and change its firmware to run simulations ? But are there any
development environments available for these chips ? If so, they'd probably be all assembler. And they'd be very
expensive I guess.

Regards,
Marc
http://users.skynet.be/vanriet.marc/
 
In article <3fb407d3$0$13522$ba620e4c@reader1.news.skynet.be>,
Marc Van Riet <marcvanriet@yahoo.com> wrote:

Been reading this thread. I wonder if instead of using an FPGA,
DSP, general purpose processor or niche product it would be possible
to use a graphics processor like the ones developed for 3D graphics
boards.
It's conceivable -- you can do a lot with them nowadays -- see
www.gpgpu.org. Depends on details of the problem; from the description
I'd be surprised if it fit in 128 MBytes. And the GPUs are only single-
precision at best; the fastest use a custom FP representation
which might have only 16 bits in the mantissa.

Tom
 
