New soft processor core paper publisher?

On 6/24/13 5:03 AM, Eric Wallin wrote:
On Monday, June 24, 2013 3:24:44 AM UTC-4, Tom Gardner wrote:

Consider trying to pass a message consisting of one
integer from one thread to another such that the
receiving thread is guaranteed to be able to pick
it up exactly once.

Thread A works on the integer value and when it is done it writes it to location Z. It then reads a value at location X, increments it, and writes it back to location X.

Thread B has been repeatedly reading location X and notices it has been incremented. It reads the integer value at Z, performs some function on it, and writes it back to location Z. It then reads a value at Y, increments it, and writes it back to location Y to let thread A know it took, worked on, and replaced the integer at Z.

The above seems airtight to me if reads and writes to memory are not cached or otherwise delayed, and I don't see how interrupts are germane, but perhaps I haven't taken everything into account.
Consider a case where *both* thread A and B want to increment
a counter at location X? A reads X and finds it contains 10. But
before it can write back 11, B reads X and finds 10 and it too
writes back 11. Now you've lost a count. Can this happen in your
design? If so you need some sort of atomic update instruction.
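As an illustration of that race (a standalone C sketch, nothing here is taken from the Hive design or document):

#include <pthread.h>
#include <stdio.h>

static int counter = 0;                 /* the shared count at "location X" */

static void *bump(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++)
        counter = counter + 1;          /* read, add, write: A and B can both read 10 and both write back 11 */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, bump, NULL);
    pthread_create(&b, NULL, bump, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %d\n", counter);  /* expected 200000; without an atomic update it is usually less */
    return 0;
}

Note the race is between one thread's read and its own later write, not between two accesses on the same clock, so interleaved scheduling alone does not remove it.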
 
On 6/24/13 3:23 PM, Eric Wallin wrote:
On Monday, June 24, 2013 6:03:38 PM UTC-4, Bakul Shah wrote:

Consider a case where *both* thread A and B want to increment
a counter at location X? A reads X and finds it contains 10. But
before it can write back 11, B reads X and finds 10 and it too
writes back 11. Now you've lost a count. Can this happen in your
design? If so you need some sort of atomic update instruction.

It can happen if the programmer is crazy enough to do it, otherwise not.
Concurrent threads need to communicate with each other to cooperate
on some common task. Consider two threads adding an item to a linked
list or keeping statistics on some events or many such things. You
are pretty much required to be "crazy enough"! Any support for mutex
would simplify things quite a bit. Without atomic update you have to
use some complicated, inefficient algorithm to implement mutexes.
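For what it's worth, here is a minimal C sketch of how a single atomic test-and-set primitive makes a mutex trivial (my illustration, not anything from the Hive design):

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

static void mutex_acquire(void)
{
    /* test-and-set: returns the old value and sets the flag in one indivisible step */
    while (atomic_flag_test_and_set(&lock))
        ;                               /* spin until the previous owner releases it */
}

static void mutex_release(void)
{
    atomic_flag_clear(&lock);
}

Without such a primitive, the same guarantee needs something like Dekker's or Peterson's algorithm: longer, slower, and only pairwise.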
 
Bakul Shah wrote:
On 6/24/13 3:23 PM, Eric Wallin wrote:
On Monday, June 24, 2013 6:03:38 PM UTC-4, Bakul Shah wrote:

Consider a case where *both* thread A and B want to increment
a counter at location X? A reads X and finds it contains 10. But
before it can write back 11, B reads X and finds 10 and it too
writes back 11. Now you've lost a count. Can this happen in your
design? If so you need some sort of atomic update instruction.

It can happen if the programmer is crazy enough to do it, otherwise not.

Concurrent threads need to communicate with each other to cooperate
on some common task. Consider two threads adding an item to a linked
list or keeping statistics on some events or many such things. You
are pretty much required to be "crazy enough"! Any support for mutex
would simplify things quite a bit. Without atomic update you have to
use some complicated, inefficient algorithm to implement mutexes.
Just so.

A programmer that doesn't understand that is the equivalent
of a hardware engineer that doesn't understand metastability.
(When I started out most people denied the possibility of
synchronisation failure due to metastability!)

Mind you, I'd *love* to see a radical overhaul of traditional
multicore processors so they took the form of
- a large number of processors
- each with completely independent memory
- connected by message passing fifos

In the long term that'll be the only way we can continue
to scale individual machines: SMP scales for a while, but
then cache coherence requirements kill performance.
 
Verilog code for my Hive processor is now up:

http://opencores.org/project,hive

(Took me most of the freaking day to figure out SVN.)
 
On Monday, June 24, 2013 6:03:38 PM UTC-4, Bakul Shah wrote:

Consider a case where *both* thread A and B want to increment
a counter at location X? A reads X and finds it contains 10. But
before it can write back 11, B reads X and finds 10 and it too
writes back 11. Now you've lost a count. Can this happen in your
design? If so you need some sort of atomic update instruction.
It can happen if the programmer is crazy enough to do it, otherwise not.

Anyone have comments on my paper or the verilog?
 
Bakul Shah <usenet@bitblocks.com> wrote:

(snip)
Consider a case where *both* thread A and B want to increment
a counter at location X? A reads X and finds it contains 10. But
before it can write back 11, B reads X and finds 10 and it too
writes back 11. Now you've lost a count. Can this happen in your
design? If so you need some sort of atomic update instruction.
In the core memory days, there was a special solution.
Core read is destructive, so after reading the value out it has
to be restored. For read-modify-write instructions, one can avoid
the restore, and instead rewrite the new value. That assumes that
the instruction set has a read-modify-write instruction, a favorite
for DEC machines being increment and decrement.

DRAM also has destructive read, but except for the very early days,
I don't believe it has been used in that way.

If the architecture does have a read-modify-write instruction,
such as increment, it can be designed such that no other thread
or I/O can come in between.
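In software terms an uninterruptible increment is what a fetch-and-add primitive exposes. A hedged C sketch, purely for illustration and not a claim about any particular instruction set:

#include <stdatomic.h>

static atomic_int event_count;

void count_event(void)
{
    /* one indivisible read-modify-write of the memory location: no other
       thread or I/O access can fall between the read and the write */
    atomic_fetch_add(&event_count, 1);
}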

-- glen
 
On 25/06/13 01:17, Tom Gardner wrote:
Bakul Shah wrote:
On 6/24/13 3:23 PM, Eric Wallin wrote:
On Monday, June 24, 2013 6:03:38 PM UTC-4, Bakul Shah wrote:

Consider a case where *both* thread A and B want to increment
a counter at location X? A reads X and finds it contains 10. But
before it can write back 11, B reads X and finds 10 and it too
writes back 11. Now you've lost a count. Can this happen in your
design? If so you need some sort of atomic update instruction.

It can happen if the programmer is crazy enough to do it, otherwise not.

Concurrent threads need to communicate with each other to cooperate
on some common task. Consider two threads adding an item to a linked
list or keeping statistics on some events or many such things. You
are pretty much required to be "crazy enough"! Any support for mutex
would simplify things quite a bit. Without atomic update you have to
use some complicated, inefficient algorithm to implement mutexes.

Just so.

A programmer that doesn't understand that is the equivalent
of a hardware engineer that doesn't understand metastability.
(When I started out most people denied the possibility of
synchronisation failure due to metastability!)

Mind you, I'd *love* to see a radical overhaul of traditional
multicore processors so they took the form of
- a large number of processors
- each with completely independent memory
- connected by message passing fifos
This sounds nice in theory, but in practice there can be problems.
Scaling with number of processors can quickly become an issue here -
lock-free algorithms and fifos work well between two processors, but
scale badly with many processors. Independent memory for each processor
sounds nice, and can work well for some purposes, but is a poor
structure for general-purpose computing.

If you want to scale well, you want hardware support for semaphores.
And you don't want to divide things up by processor - you want to be
able to divide them up by process or thread. Threads should have
independent memory areas, which they can access safely and quickly
regardless of which cpu they are running on. Otherwise you spend much
of your bandwidth just moving data around between your cpu-dependent
memory blocks (replacing the cache coherence problems with new memory
movement bottlenecks), or your threads have to have very strong affinity
to particular cpus and you lose your scaling.

In the long term that'll be the only way we can continue
to scale individual machines: SMP scales for a while, but
then cache coherence requirements kill performance.
 
On 6/24/2013 12:56 PM, Tom Gardner wrote:
rickman wrote:
On 6/24/2013 11:57 AM, Eric Wallin wrote:
On Monday, June 24, 2013 9:47:28 AM UTC-4, Tom Gardner wrote:

Please explain why your processor does not need test and set or
compare and exchange operations. What theoretical advance have you
made?

I'm not exactly sure why we're having this generalized, theoretical
discussion when a simple reading of the design document I've provided
would probably answer your questions. If it doesn't then
perhaps you could tell me what I left out, and I might include that
info in the next rev. Not trying to be gruff or anything, I'd very
much like the document (and processor) to be on as solid a
footing as possible.

Eric, I think you have explained properly how your design will deal
with synchronization. I'm not sure what Tom is going on about. Clearly
he doesn't understand your design.

Correct.
I'm glad you understand that.


If it is of any help, Eric's design is more like 8 cores running in
parallel, time-sharing the memory and, in fact, the same processor hardware
on a machine-cycle basis
(so no 8-ported memory is required).

Fair enough; sounds like it is in the same area as the Propeller chip.
No point in making such a comparison. If you want to understand Eric's
chip, then learn about Eric's chip. I certainly don't know enough about
the Propeller chip to compare in a meaningful manner.

Just think of each processor executing one instruction every 8 clocks,
but all processors are out of phase, so no one completes on the same
clock.


Is there anything to prevent multiple cores reading/writing the
same memory location in the same machine cycle? What is the
result when that happens?
Not sure what you mean by "machine cycle". As I said above, there are 8
clocks to the processor machine cycle, but they are all out of phase.
So on any given clock cycle only one processor will be updating
registers or memory.
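As a toy model of that scheduling (my own C sketch, not Eric's Verilog): eight thread contexts step round-robin through the shared pipeline, so exactly one of them completes, and thus writes registers or memory, on any given clock.

#include <stdio.h>

#define THREADS 8

int main(void)
{
    /* each thread issues once per 8-clock machine cycle, offset in phase */
    for (int clock = 0; clock < 24; clock++) {
        int thread = clock % THREADS;
        printf("clock %2d: thread %d writes back its result\n", clock, thread);
    }
    return 0;
}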

I believe Eric's point is that the thing that prevents more than one
processor from accessing the same memory location is the programmer. Is
that not a good enough method?


If an interrupt occurs it doesn't cause one of the other 7 tasks to
run (they are already running); it simply invokes the interrupt
handler. I believe Eric is not envisioning multiple tasks on a
single processor.

Such presumptions would be useful to have in the white paper.
Have you read the paper? How do you know it's not there?


As others have pointed out, test and set instructions are not required
to support concurrency and communications. They are certainly nice to
have, but are not essential.

Agreed. I'm perfectly prepared to accept alternative techniques,
e.g. disable interrupts.
Ok, so is this discussion over?


In your case they would be superfluous.

Not proven to me.

The trouble is I've seen too many hardware designs that
leave the awkward problems to software - especially first
efforts by small teams.

And too often those problems can be very difficult to solve
in software. Nowadays it is hard to find people that have
sufficient experience across the whole hardware/firmware/system
software spectrum to enable them to avoid such traps.

I don't know whether Eric is such a person, but I'm afraid
his answers have raised orange flags in my mind.

As a point of reference, I had similar misgivings when I
first heard about the Itanium's architecture in, IIRC,
1994. I suppressed them because the people involved were
undoubtedly more skilled in the area than I, and had been
working for 5 years. Much later I regrettably came to the
conclusion the orange flags were too optimistic.
If you still have reservations, then learn about the design. If you
don't want to invest the time to learn about the design, why are you
bothering to object to it?

--

Rick
 
David Brown wrote:
On 25/06/13 01:17, Tom Gardner wrote:
Bakul Shah wrote:
On 6/24/13 3:23 PM, Eric Wallin wrote:
On Monday, June 24, 2013 6:03:38 PM UTC-4, Bakul Shah wrote:

Consider a case where *both* thread A and B want to increment
a counter at location X? A reads X and finds it contains 10. But
before it can write back 11, B reads X and finds 10 and it too
writes back 11. Now you've lost a count. Can this happen in your
design? If so you need some sort of atomic update instruction.

It can happen if the programmer is crazy enough to do it, otherwise not.

Concurrent threads need to communicate with each other to cooperate
on some common task. Consider two threads adding an item to a linked
list or keeping statistics on some events or many such things. You
are pretty much required to be "crazy enough"! Any support for mutex
would simplify things quite a bit. Without atomic update you have to
use some complicated, inefficient algorithm to implement mutexes.

Just so.

A programmer that doesn't understand that is the equivalent
of a hardware engineer that doesn't understand metastability.
(When I started out most people denied the possibility of
synchronisation failure due to metastability!)

Mind you, I'd *love* to see a radical overhaul of traditional
multicore processors so they took the form of
- a large number of processors
- each with completely independent memory
- connected by message passing fifos

This sounds nice in theory, but in practice there can be problems.
Scaling with number of processors can quickly become an issue here -
lock-free algorithms and fifos work well between two processors, but
scale badly with many processors. Independent memory for each processor
sounds nice, and can work well for some purposes, but is a poor
structure for general-purpose computing.
I agree with all your points. Unfortunately they are equally
applicable to the current batch of SMP/NUMA architectures :(

A key point is the granularity of the computation and message
passing, and that varies radically between applications.

There are a large number of commercially important workloads
that would work well on such a system, ranging from embarrassingly
parallel problems such as soft real-time event processing to some
HPC and big data (think map-reduce).

But I agree it wouldn't be a significant benefit for
bog-standard desktop processing - current machines
are more than sufficient for that anyway!

If you want to scale well, you want hardware support for semaphores.
And you don't want to divide things up by processor - you want to be
able to divide them up by process or thread. Threads should have
independent memory areas, which they can access safely and quickly
regardless of which cpu they are running on. Otherwise you spend much
of your bandwidth just moving data around between your cpu-dependent
memory blocks (replacing the cache coherence problems with new memory
movement bottlenecks), or your threads have to have very strong affinity
to particular cpus and you lose your scaling.
I agree with all those points too.
 
On 6/24/2013 3:25 PM, Eric Wallin wrote:
On Monday, June 24, 2013 12:07:23 AM UTC-4, rickman wrote:

I'm glad you can take (hopefully) constructive criticism. I was
concerned when I wrote the above that it might be a bit too blunt.

I apologize to everyone here, I kind of barged in and have behaved somewhat brashly.

... part of the utility
of a design is the ease of programming efficiently. I haven't looked at
yours yet, but just picturing the four stacks makes it seem pretty
simple... so far. :^)

Writing a conventional stack machine in an HDL isn't too daunting, but programming it afterward, for me anyway, was just too much.

I have to say I'm not crazy about the large instruction word. That is
one of the appealing things about MISC to me. I work in very small
FPGAs and 16 bit instructions are better avoided if possible, but that
may be a red herring. What matters is how many bytes a given program
uses, not how many bits are in an instruction.

Yes. Opcode space obviously expands exponentially with bit count, so one can get a lot more with a small size increase. I think a 32 bit opcode is pushing it for a small FPGA implementation, but a 16 bit opcode gives one a couple of small operand indices, and some reasonably sized immediate instructions (data, conditional jumps, shifts, add) that I find I'm using quite a bit during the testing and verification phase. Data plus operation in a single opcode is hard to beat for efficiency but it has to earn its keep in the expanded opcode space. With the operand indices you get a free copy/move with most single operand operations, which is another efficiency win.

I am supposed to present to the SVFIG and I think your design would be a
very interesting part of the presentation unless you think you would
rather present yourself. I'm sure they would like to hear about it and
they likely would be interested in your opinions on MISC. I know I am.

I'm on the other coast so I most likely can't attend, but I would be most honored if you were to present it to SVFIG.
I was going to talk about the CPU design I had been working on, but I
think it is going to be more of a survey of CPU designs for FPGAs ending
with my spin on how to optimize a design. Your implementation is very
different from mine, but the hybrid register/stack approach is similar
in intent and results from a similar line of thought.

Turns out I am busier in July than expected, so I will not be able to
present at the July meeting. I'll shoot for August. I've been looking
at their stuff on the web and they do a pretty good job. I was thinking
it was a local group and it would be a small audience, but I think it
may be a lot bigger when the web is considered.

--

Rick
 
On 6/24/2013 7:17 PM, Tom Gardner wrote:
Bakul Shah wrote:
On 6/24/13 3:23 PM, Eric Wallin wrote:
On Monday, June 24, 2013 6:03:38 PM UTC-4, Bakul Shah wrote:

Consider a case where *both* thread A and B want to increment
a counter at location X? A reads X and finds it contains 10. But
before it can write back 11, B reads X and finds 10 and it too
writes back 11. Now you've lost a count. Can this happen in your
design? If so you need some sort of atomic update instruction.

It can happen if the programmer is crazy enough to do it, otherwise not.

Concurrent threads need to communicate with each other to cooperate
on some common task. Consider two threads adding an item to a linked
list or keeping statistics on some events or many such things. You
are pretty much required to be "crazy enough"! Any support for mutex
would simplify things quite a bit. Without atomic update you have to
use some complicated, inefficient algorithm to implement mutexes.

Just so.

A programmer that doesn't understand that is the equivalent
of a hardware engineer that doesn't understand metastability.
(When I started out most people denied the possibility of
synchronisation failure due to metastability!)

Mind you, I'd *love* to see a radical overhaul of traditional
multicore processors so they took the form of
- a large number of processors
- each with completely independent memory
- connected by message passing fifos

In the long term that'll be the only way we can continue
to scale individual machines: SMP scales for a while, but
then cache coherence requirements kill performance.
The *only* way? lol You think like a programmer. The big assumption
you are making is that the processor itself is a
precious resource that must be optimized. That is no longer valid.
When x86 and ARM machines put four cores on a chip with one memory
interface they are choking the CPU's airway. Those designs are no
longer efficient and the processor is underused. So clearly it is not
the precious resource anymore.

Rather than trying to optimize the utilization of the CPU, design needs
to proceed with the recognition of the limits of multiprocessors. Treat
processors the same way you treat peripheral functions. Dedicate them
to tasks. Let them have a job to do and not worry if they are idle part
of the time. This results in totally different designs and can result
in faster, lower cost and lower power systems.

--

Rick
 
On 6/24/2013 7:00 PM, Bakul Shah wrote:
On 6/24/13 3:23 PM, Eric Wallin wrote:
On Monday, June 24, 2013 6:03:38 PM UTC-4, Bakul Shah wrote:

Consider a case where *both* thread A and B want to increment
a counter at location X? A reads X and finds it contains 10. But
before it can write back 11, B reads X and finds 10 and it too
writes back 11. Now you've lost a count. Can this happen in your
design? If so you need some sort of atomic update instruction.

It can happen if the programmer is crazy enough to do it, otherwise not.

Concurrent threads need to communicate with each other to cooperate
on some common task. Consider two threads adding an item to a linked
list or keeping statistics on some events or many such things. You
are pretty much required to be "crazy enough"! Any support for mutex
would simplify things quite a bit. Without atomic update you have to
use some complicated, inefficient algorithm to implement mutexes.
What assumptions is this based on? Do you know?

What are the alternatives to "mutexes"? How inefficient are they? When
do you need to use a mutex?

Have you looked at Eric's design in the least? Do you have any idea of
the applications it is targeted to?

--

Rick
 
rickman wrote:

Not sure what you mean by "machine cycle".
I mean it in the same sense as it was used in the posting
that I replied to.

I believe Eric's point is that the thing that prevents more than one processor from accessing the same memory location is the programmer. Is that not a good enough method?
I'd prefer it if Eric gave the correct answer rather than
someone else's possibly correct answer.

It is a good enough method for some things, and not for others.


If you still have reservations, then learn about the design. If you don't want to invest the time to learn about the design, why are you bothering to object to it?
There are *many* new designs which might be interesting.
Nobody has time to look at them all, so they make fast
decisions as to whether the design and designer are credible.

I'm not objecting to it, but I am giving the designer the
opportunity to pass the "elevator pitch" test.
 
On 6/24/2013 5:30 PM, Eric Wallin wrote:
Verilog code for my Hive processor is now up:

http://opencores.org/project,hive

(Took me most of the freaking day to figure out SVN.)
You mean you actually figured it out?

--

Rick
 
rickman wrote:
On 6/24/2013 7:17 PM, Tom Gardner wrote:
Bakul Shah wrote:
On 6/24/13 3:23 PM, Eric Wallin wrote:
On Monday, June 24, 2013 6:03:38 PM UTC-4, Bakul Shah wrote:

Consider a case where *both* thread A and B want to increment
a counter at location X? A reads X and finds it contains 10. But
before it can write back 11, B reads X and finds 10 and it too
writes back 11. Now you've lost a count. Can this happen in your
design? If so you need some sort of atomic update instruction.

It can happen if the programmer is crazy enough to do it, otherwise not.

Concurrent threads need to communicate with each other to cooperate
on some common task. Consider two threads adding an item to a linked
list or keeping statistics on some events or many such things. You
are pretty much required to be "crazy enough"! Any support for mutex
would simplify things quite a bit. Without atomic update you have to
use some complicated, inefficient algorithm to implement mutexes.

Just so.

A programmer that doesn't understand that is the equivalent
of a hardware engineer that doesn't understand metastability.
(When I started out most people denied the possibility of
synchronisation failure due to metastability!)

Mind you, I'd *love* to see a radical overhaul of traditional
multicore processors so they took the form of
- a large number of processors
- each with completely independent memory
- connected by message passing fifos

In the long term that'll be the only way we can continue
to scale individual machines: SMP scales for a while, but
then cache coherence requirements kill performance.

The *only* way? lol You think like a programmer. The big assumption you are making is that the processor itself is a precious resource that must be optimized. That is no
longer valid. When x86 and ARM machines put four cores on a chip with one memory interface they are choking the CPU's airway. Those designs are no longer efficient and the processor is underused. So
clearly it is not the precious resource anymore.
I don't think that, and your statements don't follow from my comments.


Rather than trying to optimize the utilization of the CPU, design needs to proceed with the recognition of the limits of multiprocessors. Treat processors the same way you treat peripheral
functions. Dedicate them to tasks. Let them have a job to do and not worry if they are idle part of the time. This results in totally different designs and can result in faster, lower cost and
lower power systems.
That approach is valuable when and where it works, but can
be impractical for many workloads.
 
On 6/25/2013 1:14 PM, glen herrmannsfeldt wrote:
rickman<gnuarm@gmail.com> wrote:
On 6/24/2013 12:56 PM, Tom Gardner wrote:

(snip)
Is there anything to prevent multiple cores reading/writing the
same memory location in the same machine cycle? What is the
result when that happens?

Not sure what you mean by "machine cycle". As I said above, there are 8
clocks to the processor machine cycle, but they are all out of phase.
So on any given clock cycle only one processor will be updating
registers or memory.

If there are 8 processors that never communicate, it would be better
to have 8 separate RAM units.
Why is that? What would be "better" about it?


I believe Eric's point is that the thing that prevents more than one
processor from accessing the same memory location is the programmer. Is
that not a good enough method?

So no thread ever communicates with another one?

Well, read the wikipedia article on spinlock and the linked-to
article Peterson's_Algorithm.

It is more efficient if you have an interlocked write, but can be
done with spinlocks, if there is no reordering of writes to memory.

As many processors now do reorder writes, there is need for special
instructions.
Are we talking about the same thing here? We were talking about the
Hive processor.


Otherwise, spinlocks might be good enough.
So your point is?

What would the critical section of code be doing that is critical?
Simple interprocess communication is not necessarily "critical".

--

Rick
 
rickman <gnuarm@gmail.com> wrote:
On 6/24/2013 12:56 PM, Tom Gardner wrote:
(snip)
Is there anything to prevent multiple cores reading/writing the
same memory location in the same machine cycle? What is the
result when that happens?

Not sure what you mean by "machine cycle". As I said above, there are 8
clocks to the processor machine cycle, but they are all out of phase.
So on any given clock cycle only one processor will be updating
registers or memory.
If there are 8 processors that never communicate, it would be better
to have 8 separate RAM units.

I believe Eric's point is that the thing that prevents more than one
processor from accessing the same memory location is the programmer. Is
that not a good enough method?
So no thread ever communicates with another one?

Well, read the wikipedia article on spinlock and the linked-to
article Peterson's_Algorithm.

It is more efficient if you have an interlocked write, but can be
done with spinlocks, if there is no reordering of writes to memory.

As many processors now do reorder writes, there is need for special
instructions.

Otherwise, spinlocks might be good enough.
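For reference, a C sketch of Peterson's algorithm for two threads, following the Wikipedia article mentioned above; the sequentially consistent C11 atomics are what rule out the store reordering warned about (assumptions: two threads numbered 0 and 1, nothing Hive-specific):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool flag[2];     /* flag[i]: thread i wants to enter    */
static atomic_int  turn;        /* whose turn it is to yield and wait  */

void peterson_lock(int self)    /* self is 0 or 1 */
{
    int other = 1 - self;
    atomic_store(&flag[self], true);
    atomic_store(&turn, other);
    while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
        ;                       /* spin while the other thread is inside and it is its turn */
}

void peterson_unlock(int self)
{
    atomic_store(&flag[self], false);
}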

-- glen
 
On Tuesday, June 25, 2013 1:14:57 PM UTC-4, glen herrmannsfeldt wrote:

So no thread ever communicates with another one?
All threads share the same Von Neumann memory, so of course they can communicate with each other.

If only there were a paper somewhere, written by the designer, freely available to anyone on the web...
 
On Tuesday, June 25, 2013 11:14:52 AM UTC-4, Tom Gardner wrote:

I believe Eric's point is that the thing that prevents more than one processor from accessing the same memory location is the programmer. Is that not a good enough method?

I'd prefer it if Eric gave the correct answer rather than
someone else's possibly correct answer.
If Rick says anything wrong I'll correct him.

I'm not objecting to it, but I am giving the designer the
opportunity to pass the "elevator pitch" test.
The paper has a bulleted feature list at the very front and a bulleted downsides list at the very back. I tried to write it in an accessible manner for the widest audience. We all like to think aloud now and then, but I'd think a comprehensive design paper would sidestep all of this wild speculation and unnecessary third degree.

http://opencores.org/usercontent,doc,1371986749
 
On 6/25/13 11:18 AM, Eric Wallin wrote:
On Tuesday, June 25, 2013 11:14:52 AM UTC-4, Tom Gardner wrote:

I believe Eric's point is that the thing that prevents more than one processor from accessing the same memory location is the programmer. Is that not a good enough method?
This is not good enough in general. I gave some examples where threads
have to read/write the same memory location.

I agree with you that if threads communicate just through fifos
and there is exactly one reader and one writer there is no problem.
The reader updates the read ptr & watches but doesn't update the
write ptr. The writer updates the write ptr & watches but doesn't
update the read ptr. You can use fifos like these to implement a
mutex but this is a very expensive way to implement mutexes and
doesn't scale.
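A minimal single-producer/single-consumer FIFO along those lines, as a C sketch (index names and the size are mine, not Hive's): the writer only ever advances wr, the reader only ever advances rd, and each merely observes the other's index, so no atomic read-modify-write is needed.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define FIFO_SIZE 16                            /* power of two so the indices wrap cleanly */

struct spsc_fifo {
    int32_t     buf[FIFO_SIZE];
    atomic_uint wr;                             /* written only by the producer */
    atomic_uint rd;                             /* written only by the consumer */
};

bool fifo_put(struct spsc_fifo *f, int32_t v)   /* producer side */
{
    unsigned wr = atomic_load(&f->wr);
    if (wr - atomic_load(&f->rd) == FIFO_SIZE)
        return false;                           /* full */
    f->buf[wr % FIFO_SIZE] = v;
    atomic_store(&f->wr, wr + 1);               /* publish only after the data is in place */
    return true;
}

bool fifo_get(struct spsc_fifo *f, int32_t *v)  /* consumer side */
{
    unsigned rd = atomic_load(&f->rd);
    if (atomic_load(&f->wr) == rd)
        return false;                           /* empty */
    *v = f->buf[rd % FIFO_SIZE];
    atomic_store(&f->rd, rd + 1);
    return true;
}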

The paper has a bulleted feature list at the very front and a bulleted downsides list at the very back. I tried to write it in an accessible manner for the widest audience. We all like to think aloud now and then, but I'd think a comprehensive design paper would sidestep all of this wild speculation and unnecessary third degree.
I don't think it is a question of "third degree". You did invite
feedback!

Adding compare-and-swap or load-linked & store-conditional would
make your processor more useful for parallel programming. I am not
motivated enough to go through 4500+ lines of verilog to know how
hard that is but you must already have some bus arbitration logic
since all 8 threads can access memory.
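As an illustration of what that buys (a sketch with the standard C11 compare-and-swap, not a claim about how it would look in Hive): a lock-free increment becomes a short retry loop, and load-linked/store-conditional gives the same shape.

#include <stdatomic.h>

void cas_increment(atomic_int *p)
{
    int old = atomic_load(p);
    /* retry until no other thread slipped a write in between our read and our
       write; on failure, 'old' is reloaded with the current value */
    while (!atomic_compare_exchange_weak(p, &old, old + 1))
        ;
}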

http://opencores.org/usercontent,doc,1371986749
I missed this link before. A nicely done document! A top level
diagram would be helpful. 64K address space seems too small.
 
