New soft processor core paper publisher?

Andrew Haley <andrew29@littlepinkcloud.invalid> writes:
I think most Java runs on SIM cards.
Err, what is this belief based on? I mean, you might be right, but I
never heard that before.
http://www.oracle.com/us/technologies/java/embedded/card/overview/index.html :

Overview:

Currently shipping on more than 2 billion devices/year

Deployed on more than 9 billion devices around the world since 1998

More than 50% of SIM cards deployed in 2011 run Java Card

...

Included in billions of SIM cards, payment cards, ID cards,
e-passports, and more
 
In comp.lang.forth Paul Rubin <no.email@nospam.invalid> wrote:
Andrew Haley <andrew29@littlepinkcloud.invalid> writes:
I think most Java runs on SIM cards.
<snip>

More than 50% of SIM cards deployed in 2011 run Java Card
OK, but that's hardly "most Java", unless you're just counting the
number of virtual machines that might run at some point.

Andrew.
 
Andrew Haley <andrew29@littlepinkcloud.invalid> writes:
OK, but that's hardly "most Java", unless you're just counting the
number of virtual machines that might run at some point.
Well there's all sorts of ways to calculate it. If you want total LOC,
Android phones may be past servers by now.
 
rickman wrote:
On 6/25/2013 7:55 PM, glen herrmannsfeldt wrote:
Eric Wallin<tammie.eric@gmail.com> wrote:

(snip)
It talks about separate threads writing to the same location,
which I understand can be a problem with interrupts and without
atomic read-modify-write. All I can do is repeat that if you
don't program this way, it won't happen. A subroutine can be
written so that threads can share a common instance of it,
but without using a common memory location to store data
associated with the execution of that subroutine (unless
the location is memory mapped HW).

Sure, that is pretty common. It is usually related to being
reentrant, but not always exactly the same.

In Hive, there is a register that when read returns the thread
ID, which is unique for each thread. This could be used as an
offset for subroutine data locations.
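In C terms, a minimal sketch of that idea (names invented, eight
threads assumed; on the real core the ID comes from a register read):

#include <stdint.h>

#define THREADS 8

/* Placeholder: on the real core this is a register read; here we fake
   it so the sketch compiles standalone.  Returns 0..THREADS-1. */
static uint32_t read_thread_id(void) { return 0; }

/* A shared subroutine indexes its working storage by thread ID,
   so threads share the code but never the data locations. */
static uint32_t scratch[THREADS];

uint32_t shared_subroutine(uint32_t x)
{
    uint32_t id = read_thread_id();
    scratch[id] = x * x;      /* private slot, no locking needed */
    return scratch[id] + id;
}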

Yes, but usually once in a while there needs to be communication
between threads. If no other time, to get data through an
I/O device, such as separate threads writing to the same
user console. (Screen, terminal, serial port, etc.)

Why would communications be a problem? Just let one processor control the I/O device and let the other processors talk to that one.
Oh dear. It looks like you have vanishingly little experience
writing software. That is supported by your statement in another
post.

On 26/06/13 01:06, rickman wrote:
I never understood the difference between thread and
process until I read the link you provided.
 
On Tuesday, June 25, 2013 7:39:51 PM UTC-4, thomas....@gmail.com wrote:

I just want to point out that there is a lot of competition out there...
I'm just putting it out there, people can use it if they want to, or not.

Thomas, with your experience with the ERIC5 series, do you see anything obviously missing from the Hive instruction set? What do you think of the literal sizing?
 
Question to the programming types:

Ever seen a signed logical or arithmetic shift distance before? Hive shift distances are signed, which works out quite nicely (the basic shift is shift left, with negative shift distances performing right shifts). This is something I haven't encountered in any opcode listings I've had the pleasure to peruse, so I'm wondering if it is kind of new-ish.
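In portable C, the semantics would look something like this (a sketch
only, not Hive's actual logic; 32-bit width assumed):

#include <stdint.h>

/* Positive d shifts left, negative d shifts right. */
uint32_t lsh(uint32_t x, int d)           /* logical */
{
    if (d >= 32 || d <= -32) return 0;    /* everything shifted out */
    return d >= 0 ? x << d : x >> -d;
}

int32_t ash(int32_t x, int d)             /* arithmetic */
{
    if (d >= 32) return 0;
    if (d >= 0) return (int32_t)((uint32_t)x << d);  /* avoid signed-overflow UB */
    if (d <= -32) d = -31;                /* result is all sign bits */
    return x >> -d;  /* '>>' of a negative int is arithmetic on most compilers */
}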
 
Eric Wallin <tammie.eric@gmail.com> wrote:
Question to the programming types:

Ever seen a signed logical or arithmetic shift distance before?
Hive shift distances are signed, which works out quite nicely
(the basic shift is shift left, with negative shift distances
performing right shifts).

This is something I haven't encountered in any opcode listings
I've had the pleasure to peruse, so I'm wondering if it is
kind of new-ish.
PDP-10 has signed shifts. The manual is available on bitsavers,
such as:

AA-H391A-TK_DECsystem-10_DECSYSTEM-20_Processor_Reference_Jun1982.pdf

Shifts use a signed 9-bit value from the computed effective
address.

-- glen
 
Thomas, with your experience with the ERIC5 series, do you see anything obviously missing from the Hive instruction set? What do you think of the literal sizing?
I just took a quick look at your document (time is limited...). What I like is the concept of "in-line" literals. A good extension would be to have the same concept also for calls and jumps (i.e. so you do not have to load the destination address into a register first), and maybe also for other instructions that can work with literals. I also think that you leave some bits unused: e.g. the byt instruction does not use register B, so you would have 3 additional bits in the opcode, making it possible to have an 11b literal instead of an 8b one (or you could use these 3 bits for other purposes, e.g. A = A + lit8).
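To make the bit budget concrete, a hypothetical decode in C (field
positions invented, not Hive's actual encoding):

#include <stdint.h>

/* Hypothetical 16-bit instruction: bits [7:0] hold a signed literal,
   bits [10:8] hold the (here unused) B index.  Folding B's 3 bits into
   the literal widens its range from -128..127 to -1024..1023. */
static int32_t lit8(uint16_t op)
{
    return (int8_t)(op & 0xFF);            /* sign-extend 8 bits */
}

static int32_t lit11(uint16_t op)
{
    int32_t v = op & 0x7FF;
    return (v & 0x400) ? v - 0x800 : v;    /* sign-extend 11 bits */
}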

What others already mentioned is the restricted code-space, but without a C compiler this will never become a real issue ;-)

For your desired application, you could maybe think of options to reduce the resource usage. BTW: The bad habit of Quartus to replace flip-flop chains with memories (you mentioned this somewhere in your document) can be disabled by turning off "auto replace shift registers" somewhere in the synthesis settings of Quartus.

Regards,

Thomas
www.entner-electronics.com
 
On Wednesday, June 26, 2013 5:44:39 PM UTC-4, thomas....@gmail.com wrote:

I just took a quick look at your document (time is limited...). What I like is the concept of "in-line" literals. A good extension would be to have the same concept also for calls and jumps (i.e. so you do not have to load the destination address into a register first), and maybe also for other instructions that can work with literals. <snip>
Oooh, very nice idea, thanks so much! I gave this some thought and even found some space to shoehorn some opcodes in, but the lit has to come from the data memory port and go back into the control ring to offset / replace the PC, and this would require some combinatorial logic in front of the program memory address port, which could slow the entire thing down. I'll definitely give it a try though.

I'm kind of against invading the B stack index/pop for other things, having it always present allows for concurrent stack cleanup.

What others already mentioned is the restricted code-space, but without a C compiler this will never become a real issue ;-)
Hive could be easily edited to have 32 bit addresses, but the use of BRAM for small processor main memory is likely an even stronger restriction on code-space, which is why I don't feel the need for anything beyond 16 bits.

For your desired application, you could maybe think of options to reduce the resource usage. BTW: The bad habit of Quartus to replace flip-flop chains with memories (you mentioned this somewhere in your document) can be disabled by turning off "auto replace shift registers" somewhere in the synthesis settings of Quartus.
Using the "speed" optimization technique for analysis and synthesis avoids this as well.
 
In comp.arch.fpga Andrew Haley <andrew29@littlepinkcloud.invalid> wrote:
In comp.lang.forth Paul Rubin <no.email@nospam.invalid> wrote:
More than 50% of SIM cards deployed in 2011 run Java Card

OK, but that's hardly "most Java", unless you're just counting the
number of virtual machines that might run at some point.
Java Card isn't the JVM - it's Java compiled down to whatever CPU is on the
card.

Theo
 
<snip>

Mind you, I'd *love* to see a radical overhaul of traditional
multicore processors so they took the form of
- a large number of processors
- each with completely independent memory
- connected by message passing fifos

In the long term that'll be the only way we can continue
to scale individual machines: SMP scales for a while, but
then cache coherence requirements kill performance.
Transputer?
http://en.wikipedia.org/wiki/Transputer


 
On 28/06/13 10:09, RCIngham wrote:
<snip>

Transputer?
http://en.wikipedia.org/wiki/Transputer
It had a lot going for it, but was too dogmatic about
the development environment. At the time it was respectably
fast, but that wasn't sufficient -- particularly since there
was so much scope for increasing speed of uniprocessor
machines.

Given that uniprocessors have hit a wall, transputer
*concepts* embodied in a completely different form
might begin to be fashionable again.

It would also help if people can decide that reliability
is important, and that bucketfuls of salt should be
on hand when listening to salesman's protestations that
"the software/hardware framework takes care of all of
that so you don't have to worry".
 
On 6/28/2013 5:33 AM, Tom Gardner wrote:
<snip>

It had a lot going for it, but was too dogmatic about
the development environment.
You mean 'C'? I worked on a large transputer-oriented project and they
used ANSI 'C' rather than Occam. It got the job done... or should I say
"jobs"?


At the time it was respectably
fast, but that wasn't sufficient -- particularly since there
was so much scope for increasing speed of uniprocessor
machines.

Given that uniprocessors have hit a wall, transputer
*concepts* embodied in a completely different form
might begin to be fashionable again.
You mean like 144 transputers on a single chip? I'm not sure where
processing is headed. I actually just see confusion ahead as all of the
existing methods seem to have come to a steep incline if not a brick
wall. It may be time for something completely different.


It would also help if people can decide that reliability
is important, and that bucketfuls of salt should be
on hand when listening to salesman's protestations that
"the software/hardware framework takes care of all of
that so you don't have to worry".
What? Since when did engineers listen to salesmen?

--

Rick
 
On Wednesday, June 26, 2013 5:44:39 PM UTC-4, thomas....@gmail.com wrote:

I just took a quick look at your document (time is limited...). What I like is the concept of "in-line" literals. A good extension would be to have the same concept also for calls and jumps (i.e. so you do not have to load the destination address into a register first), and maybe also for other instructions that can work with literals. <snip>
After looking into this yesterday I don't think I'll do it. The in-line value has to be retrieved before it can be used to offset or replace the PC, which is one clock too late for the way the pipeline is currently configured. Using it in other ways, like adding, wouldn't work unless I used a separate adder, as the ALU add/subtract happens fairly early in the pipe. But I really appreciate this excellent suggestion, Thomas, and the time you took to read my paper!
 
On 28/06/13 15:52, rickman wrote:
On 6/28/2013 5:33 AM, Tom Gardner wrote:
<snip>

It had a lot going for it, but was too dogmatic about
the development environment.

You mean 'C'? I worked on a large transputer-oriented project and they used ANSI 'C' rather than Occam. It got the job done... or should I say "jobs"?
I only looked at the Transputer when it was Occam only.
I liked Occam as an academic language, but at that time
it would have been a bit of a pain to do any serious
engineering; ISTR anything other than primitive types
weren't supported in the language. IIRC that was
ameliorated later, but by then the opportunity for
me (and Inmos) had passed.

I don't know how C fitted onto the Transputer, but
I'd only have been interested if "multithreaded"
(to use the term loosely) code could have been
expressed reasonably easily.

Shame, I'd have loved to use it.

<snip>

Given that uniprocessors have hit a wall, transputer
*concepts* embodied in a completely different form
might begin to be fashionable again.

You mean like 144 transputers on a single chip?
Or Intel's 80-core chip :)

I"m not sure where processing is headed.
Not that way! Memory bandwidth and latency are
key issues - but you knew that!

I actually just see confusion ahead as all of the existing methods seem to have come to a steep incline if not a brick wall. It may be time for something completely different.
Precisely. My bet is that message passing between
independent processor+memory systems has the
biggest potential. It matches nicely onto many
forms of event-driven industrial and financial
applications and, I am told, onto significant
parts of HPC. It is also relatively easy to
comprehend and debug.

The trick will be to get the sizes of the
processor + memory + computation "just right".
And desktop/GUI doesn't match that.



<snip>

What? Since when did engineers listen to salesmen?
Since their PHBs get taken out to the golf course
to chat about sport by the salesmen :(
 
On 6/28/2013 12:23 PM, Tom Gardner wrote:
On 28/06/13 15:52, rickman wrote:
<snip>
I"m not sure where processing is headed.

Not that way! Memory bandwidth and latency are
key issues - but you knew that!
Yeah, but I think the current programming paradigm is the problem. I
think something else needs to come along. The current methods are all
based on one massive von Neumann design and that is what has hit the
wall... duh!

Time to think in terms of much smaller entities not totally different
from what is found in FPGAs, just processors rather than logic.

An 80 core chip will just be a starting point, but the hard part will
*be* getting started.


I actually just see confusion ahead as all of the existing methods
seem to have come to a steep incline if
not a brick wall. It may be time for something completely different.

Precisely. My bet is that message passing between
independent processor+memory systems has the
biggest potential. It matches nicely onto many
forms of event-driven industrial and financial
applications and, I am told, onto significant
parts of HPC. It is also relatively easy to
comprehend and debug.

The trick will be to get the sizes of the
processor + memory + computation "just right".
And desktop/GUI doesn't match that.
I think the trick will be in finding ways of dividing up the programs so
they can meld to the hardware rather than trying to optimize everything.

Consider a chip where you have literally a trillion operations per
second available all the time. Do you really care if half go to waste?
I don't! I design FPGAs and I have never felt obliged (not since the
early days anyway) to optimize the utility of each LUT and FF. No, it
turns out the precious resource in FPGAs is routing and you can't do
much but let the tools manage that anyway.

So a fine grained processor array could be very effective if the
programming can be divided down to suit. Maybe it takes 10 of these
cores to handle 100 Mbps Ethernet, so what? Something like a browser
might need to harness a couple of dozen. If the load slacks off and
they are idling, so what?


<snip>

What? Since when did engineers listen to salesmen?

Since their PHBs get taken out to the golf course
to chat about sport by the salesmen :(
It's a bit different with me. I am my own PHB and I kayak, not golf. I
have one disti person who I really enjoy talking to. She tries to help
me from time to time, but often she can't do a lot because I'm not
buying 1000's of chips. But my quantities have gone up a bit lately,
we'll see where it goes.

--

Rick
 
On 6/28/13 2:33 AM, Tom Gardner wrote:
<snip>

It had a lot going for it, but was too dogmatic about
the development environment. At the time it was respectably
fast, but that wasn't sufficient -- particularly since there
was so much scope for increasing speed of uniprocessor
machines.
Have you looked at Tilera's TILEpro64 or Adapteva's Epiphany
64-core processors?

Given that uniprocessors have hit a wall, transputer
*concepts* embodied in a completely different form
might begin to be fashionable again.
Languages like Erlang and Go use similar concepts (as
did Occam on the transputer). But I think the problem
is that /in general/ we still don't know how to write
parallel or distributed programs. Most of the concepts
are from ~40 years back (CSP, guarded commands etc.).
We still don't have decent tools. Turning serial programs
into parallel versions is manual, laborious, error prone
and not very successful.
 
On 28/06/13 20:55, Bakul Shah wrote:
<snip>

Have you looked at Tilera's TILEpro64 or Adapteva's Epiphany
64-core processors?
No I haven't.

I've been constrained by getting high-availability
software to market quickly, on hardware that is
demonstrably supported all over the world.

Given that uniprocessors have hit a wall, transputer
*concepts* embodied in a completely different form
might begin to be fashionable again.

Languages like Erlang and Go use similar concepts (as
did Occam on the transputer). But I think the problem
is that /in general/ we still don't know how to write
parallel or distributed programs. Most of the concepts
are from ~40 years back (CSP, guarded commands etc.).
We still don't have decent tools. Turning serial programs
into parallel versions is manual, laborious, error prone
and not very successful.
Erlang is certainly interesting from this point of view.

I'm not interested in turning existing serial programs
into parallel ones; that way lies madness and failure.

What is more interestingly tractable are "embarrassingly
parallel" problems (e.g. massive event processing systems),
and completely new approaches (currently typified by
big data and map-reduce, but that's just the beginning).
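As a toy illustration of that shape: two POSIX threads standing in for
cores with private state, coupled only by a bounded FIFO (all names
invented, plain C):

#include <pthread.h>
#include <stdio.h>

#define DEPTH 4

typedef struct {
    int buf[DEPTH];
    int head, tail, count;
    pthread_mutex_t m;
    pthread_cond_t not_empty, not_full;
} fifo_t;

static void fifo_put(fifo_t *f, int v)    /* blocks while the link is full */
{
    pthread_mutex_lock(&f->m);
    while (f->count == DEPTH) pthread_cond_wait(&f->not_full, &f->m);
    f->buf[f->tail] = v; f->tail = (f->tail + 1) % DEPTH; f->count++;
    pthread_cond_signal(&f->not_empty);
    pthread_mutex_unlock(&f->m);
}

static int fifo_get(fifo_t *f)            /* blocks while the link is empty */
{
    pthread_mutex_lock(&f->m);
    while (f->count == 0) pthread_cond_wait(&f->not_empty, &f->m);
    int v = f->buf[f->head]; f->head = (f->head + 1) % DEPTH; f->count--;
    pthread_cond_signal(&f->not_full);
    pthread_mutex_unlock(&f->m);
    return v;
}

static fifo_t link0 = { .m = PTHREAD_MUTEX_INITIALIZER,
                        .not_empty = PTHREAD_COND_INITIALIZER,
                        .not_full  = PTHREAD_COND_INITIALIZER };

static void *producer(void *arg)          /* "core" with private memory */
{
    (void)arg;
    for (int i = 0; i < 10; i++) fifo_put(&link0, i * i);
    fifo_put(&link0, -1);                 /* end-of-stream message */
    return NULL;
}

int main(void)                            /* the consuming "core" */
{
    pthread_t p;
    pthread_create(&p, NULL, producer, NULL);
    for (int v; (v = fifo_get(&link0)) != -1; )
        printf("%d\n", v);
    pthread_join(p, NULL);
    return 0;
}

No datum is ever shared; everything crosses the link as a message,
which is what makes such systems comparatively easy to reason about
and debug.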
 
On 28/06/13 20:06, rickman wrote:
<snip>
I think the trick will be in finding ways of dividing up the programs so they can meld to the hardware rather than trying to optimize everything.
My suspicion is that, except for compute-bound
problems that only require "local" data, the
granularity will be too small.

Examples where it will work, e.g. protein folding,
will rapidly migrate to CUDA and graphics processors.

Consider a chip where you have literally a trillion operations per second available all the time. Do you really care if half go to waste? I don't! I design FPGAs and I have never felt obliged (not
since the early days anyway) to optimize the utility of each LUT and FF. No, it turns out the precious resource in FPGAs is routing and you can't do much but let the tools manage that anyway.
Those internal FPGA constraints also have analogues at
a larger scale, e.g. IC pinout, backplanes, networks...


So a fine grained processor array could be very effective if the programming can be divided down to suit. Maybe it takes 10 of these cores to handle 100 Mbps Ethernet, so what? Something like a
browser might need to harness a couple of dozen. If the load slacks off and they are idling, so what?
The fundamental problem is that in general as you make the
granularity smaller, the communications requirements
get larger. And vice versa :(
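A textbook back-of-the-envelope makes that concrete: split an n x n
stencil computation across N cores, so each core owns an
(n/sqrt(N)) x (n/sqrt(N)) tile. Compute per core scales as n^2/N, but
the halo each core must exchange scales as 4n/sqrt(N), so the
communication-to-compute ratio grows as sqrt(N)/n: finer granularity
means proportionally more traffic per unit of work.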


<snip>

It's a bit different with me. I am my own PHB and I kayak, not golf. I have one disti person who I really enjoy talking to. She tries to help me from time to time, but often she can't do a lot
because I'm not buying 1000's of chips. But my quantities have gone up a bit lately, we'll see where it goes.
I'm sort-of retired (I got sick of corporate in-fighting,
and I have my "drop dead money", so...)

I regard golf as silly, despite having two courses in
walking distance. My equivalent of kayaking is flying
gliders.
 
