New soft processor core paper published!

Eric Wallin wrote:
On Monday, June 24, 2013 3:24:44 AM UTC-4, Tom Gardner wrote:

Consider trying to pass a message consisting of one
integer from one thread to another such that the
receiving thread is guaranteed to be able to pick
it up exactly once.

Thread A works on the integer value and when it is done it writes it to location Z. It then reads a value at location X, increments it, and writes it back to location X.

Thread B has been repeatedly reading location X and notices it has been incremented. It reads the integer value at Z, performs some function on it, and writes it back to location Z. It then reads a value at Y, increments it, and writes it back to location Y to let thread A know it took, worked on, and replaced the integer at Z.

The above seems airtight to me if reads and writes to memory are not cached or otherwise delayed, and I don't see how interrupts are germane, but perhaps I haven't taken everything into account.
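
In C, that scheme might look like the following minimal sketch, assuming
aligned 32-bit reads and writes are atomic, uncached, and never reordered,
which is what is being claimed for Hive (f() and all other names here are
illustrative, not part of Hive):

    extern int f(int);          /* B's work on the integer (illustrative) */

    volatile int Z;             /* the message                  */
    volatile unsigned X;        /* A-to-B "data ready" counter  */
    volatile unsigned Y;        /* B-to-A "data taken" counter  */

    void thread_A_send(int value)
    {
        unsigned y0 = Y;
        Z = value;              /* 1. write the integer to Z    */
        X = X + 1;              /* 2. signal thread B via X     */
        while (Y == y0)         /* 3. spin until B bumps Y      */
            ;
    }

    void thread_B_serve(void)
    {
        unsigned x0 = X;
        while (X == x0)         /* spin until A bumps X         */
            ;
        Z = f(Z);               /* work on it, write it back    */
        Y = Y + 1;              /* tell A it was taken          */
    }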
Have a look at
http://pages.cs.wisc.edu/~remzi/Classes/537/Fall2011/Book/threads-intro.pdf
section 25.3 et al for one exposition of the kinds of problem that arise.

That exposition is in x86 terms but it applies equally to the 11 other major
processor families I've examined over the past 35 years.
If there is a reason your processor cannot experience these issues, let us know.

Subsequent chapters on the solutions can be found at
http://pages.cs.wisc.edu/~remzi/Classes/537/Fall2011/
 
On 6/25/13 10:14 AM, glen herrmannsfeldt wrote:
Well, read the wikipedia article on spinlock and the linked-to
article Peterson's_Algorithm.

It is more efficient if you have an interlocked write, but can be
done with spinlocks, if there is no reordering of writes to memory.

As many processors now do reorder writes, there is a need for special
instructions.

Otherwise, spinlocks might be good enough.
Spinlock is not good enough without special instructions. That
is why Peterson's, Dekker's, and Szymanski's algorithms exist. Now
most processors provide some h/w support for mutexes. Most papers
on implementing mutexes with just shared memory are 25+ years old.
Now this is just an interesting puzzle!
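
For reference, Peterson's algorithm for two threads, sketched in C. It is
correct only when loads and stores become visible in program order and are
not cached, which is exactly the assumption at issue here; on a processor
that reorders writes it breaks without barriers or interlocked instructions:

    volatile int flag[2];   /* flag[i] set when thread i wants the lock */
    volatile int turn;      /* which thread yields when both want it    */

    void lock(int i)        /* i is 0 or 1 */
    {
        int other = 1 - i;
        flag[i] = 1;        /* announce intent          */
        turn = other;       /* defer to the other side  */
        while (flag[other] && turn == other)
            ;               /* spin until safe to enter */
    }

    void unlock(int i)
    {
        flag[i] = 0;
    }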
 
rickman <gnuarm@gmail.com> wrote:

(snip, someone wrote)
Not sure what you mean by "machine cycle". As I said above, there are 8
clocks to the processor machine cycle, but they are all out of phase.
So on any given clock cycle only one processor will be updating
registers or memory.
(then I wrote)
If there are 8 processors that never communicate, it would be better
to have 8 separate RAM units.

Why is that? What would be "better" about it?
Well, if the RAM really is fast enough not to be in the
critical path, then maybe not, but separate RAM means no access
limitations.

I believe Eric's point is that the thing that prevents more than one
processor from accessing the same memory location is the programmer. Is
that not a good enough method?

So no thread ever communicates with another one?

Well, read the wikipedia article on spinlock and the linked-to
article Peterson's_Algorithm.

It is more efficient if you have an interlocked write, but can be
done with spinlocks, if there is no reordering of writes to memory.

As many processors now do reorder writes, there is a need for special
instructions.

Are we talking about the same thing here? We were talking about the
Hive processor.
I was mentioning it for context. For processors that do reorder
writes, you can't use Peterson's algorithm.

Otherwise, spinlocks might be good enough.

So your point is?
Without write reordering, it is possible, though maybe not
efficient, to communicate without interlocked writes.

What would the critical section of code be doing that is critical?
Simple interprocess communications is not necessarily "critical".
"Critical" means that the messages won't get lost due to other
threads writing at about the same time. Now, much of networking
is based on unreliable "best effort" protocols, and that may also
work for communications to threads. But that involves delays and
retransmission after timers expire.

-- glen
 
On 6/25/2013 4:23 PM, Bakul Shah wrote:
On 6/25/13 11:18 AM, Eric Wallin wrote:
On Tuesday, June 25, 2013 11:14:52 AM UTC-4, Tom Gardner wrote:

I believe Eric's point is that the thing that prevents more than one
processor from accessing the same memory location is the programmer.
Is that not a good enough method?

This is not good enough in general. I gave some examples where threads
have to read/write the same memory location.
I didn't see any examples that were essential. You talked about two
processes accessing the same data. Why do you need to do that? Just
have one process send the data to the other process so only one updates
the list.


I agree with you that if threads communicate just through fifos
and there is exactly one reader and one writer there is no problem.
The reader updates the read ptr & watches but doesn't update the
write ptr. The writer updates the write ptr & watches but doesn't
update the read ptr. You can use fifos like these to implement a
mutex but this is a very expensive way to implement mutexes and
doesn't scale.
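
Sketched in C, such a fifo needs no lock at all, because each index has
exactly one writer; this again assumes word reads/writes are atomic, in
order, and uncached (the size and names are illustrative):

    #define N 16u                 /* capacity, a power of two          */

    volatile int buf[N];
    volatile unsigned rd;         /* advanced only by the reader       */
    volatile unsigned wr;         /* advanced only by the writer       */

    int fifo_put(int x)           /* writer side; returns 0 if full    */
    {
        if (wr - rd == N)
            return 0;
        buf[wr % N] = x;          /* store the data first...           */
        wr = wr + 1;              /* ...then publish it                */
        return 1;
    }

    int fifo_get(int *x)          /* reader side; returns 0 if empty   */
    {
        if (wr == rd)
            return 0;
        *x = buf[rd % N];
        rd = rd + 1;              /* free the slot                     */
        return 1;
    }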
Doesn't scale? Can you explain?


The paper has a bulleted feature list at the very front and a bulleted
list of downsides at the very back. I tried to write it in an accessible
manner for the widest audience. We all like to think aloud now and
then, but I'd think a comprehensive design paper would sidestep all of
this wild speculation and unnecessary third degree.

I don't think it is a question of "third degree". You did invite
feedback!

Adding compare-and-swap or load-linked & store-conditional would
make your processor more useful for parallel programming. I am not
motivated enough to go through 4500+ lines of verilog to know how
hard that is but you must already have some bus arbitration logic
since all 8 threads can access memory.
You don't understand even the most basic concept of how the device
works. There is no arbitration logic because there is only one
processor that is time shared between 8 processes on a clock cycle basis
to match the 8 deep pipeline.
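
One way to picture why no arbiter is needed: the active thread is a fixed
function of the clock cycle, so no two threads ever compete for the same
cycle. A toy C model of that round-robin, where step_pipeline() is a
hypothetical stand-in for one clock of the real pipeline:

    extern void step_pipeline(unsigned thread_id);  /* hypothetical */

    void barrel(void)
    {
        for (unsigned cycle = 0; ; cycle++)
            step_pipeline(cycle % 8);  /* thread slot = cycle mod 8 */
    }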

I'm not trying to be snarky, but there are a lot of people posting here
who really don't get the idea behind this design.


http://opencores.org/usercontent,doc,1371986749

I missed this link before. A nicely done document! A top level
diagram would be helpful. 64K address space seems too small.
Too small for what? That is part of what people aren't getting. This
is not intended to be even remotely comparable to an ARM or an x86
processor. This is intended to replace a MicroBlaze or a B16 type FPGA
core.

--

Rick
 
Eric Wallin wrote:
On Tuesday, June 25, 2013 4:26:14 PM UTC-4, Tom Gardner wrote:

Have a look at
http://pages.cs.wisc.edu/~remzi/Classes/537/Fall2011/Book/threads-intro.pdf
section 25.3 et al for one exposition of the kinds of problem that arise.

It talks about separate threads writing to the same location, which I
understand can be a problem with interrupts and without atomic
read-modify-write. All I can do is repeat that if you don't program
this way it won't happen.
If that is a constraint on the permissible programming style then it
would be good to state that explicitly - to save other people's time,
to save you questions, and to save everybody unpleasant
surprises later.

That is a very common programming paradigm that people will expect
to employ to solve problems they expect to encounter. It would be
beneficial for you to demonstrate the coding techniques that you
expect to be used to solve their problems. Think of it as an
application note :)

A subroutine can be written so that threads can share a common
instance of it,
I presume "it" = code.

but without using a common memory location to store data
associated with the execution of that subroutine (unless
the location is memory mapped HW).
Sounds equivalent to keeping all the data on the thread's
stack in most of the other processors I've used.

Works for data that isn't shared between threads.

In Hive, there is a register that when read returns the
thread ID, which is unique for each thread. This could
be used as an offset for subroutine data locations.
But what about data that has, of necessity, to be shared
between threads? For example a flag indicating whether or
not a non-sharable global resource (e.g. some i/o device,
or some data structure) is in use or is free to be used.

None of these situations are unique to your processor.
They first became a pain point in the 1960s and
necessitated development of techniques to resolve the
problem. If you've found a way to avoid such problems,
write it up and become famous.
 
On 6/25/2013 3:02 PM, glen herrmannsfeldt wrote:
rickman<gnuarm@gmail.com> wrote:

(snip, someone wrote)
Not sure what you mean by "machine cycle". As I said above, there are 8
clocks to the processor machine cycle, but they are all out of phase.
So on any given clock cycle only one processor will be updating
registers or memory.

(then I wrote)
If there are 8 processors that never communicate, it would be better
to have 8 separate RAM units.

Why is that? What would be "better" about it?

Well, if the RAM really is fast enough not to be in the
critical path, then maybe not, but separate RAM means no access
limitations.
I don't follow your logic, but I bet that is because your logic doesn't
apply to this design. Do you understand that there is really only one
processor? So what advantage could there be having 8 RAMs?


I believe Eric's point is that the thing that prevents more than one
processor from accessing the same memory location is the programmer. Is
that not a good enough method?

So no thread ever communicates with another one?

Well, read the wikipedia article on spinlock and the linked-to
article Peterson's_Algorithm.

It is more efficient if you have an interlocked write, but can be
done with spinlocks, if there is no reordering of writes to memory.

As many processors now do reorder writes, there is a need for special
instructions.

Are we talking about the same thing here? We were talking about the
Hive processor.

I was mentioning it for context. For processors that do reorder
writes, you can't use Peterson's algorithm.
Ok, so this does not apply to the processor at hand, right?

Your quotes are a bit hard to read. They turn the quoted blank lines
into new unquoted lines. Are you using Google by any chance and ripping
out all the double spacing or something?


Otherwise, spinlocks might be good enough.

So your point is?

Without write reordering, it is possible, though maybe not
efficient, to communicate without interlocked writes.
Since this processor doesn't do write reordering, Bob's your uncle!


What would the critical section of code be doing that is critical?
Simple interprocess communications is not necessarily "critical".

"Critical" means that the messages won't get lost due to other
threads writing at about the same time. Now, much of networking
is based on unreliable "best effort" protocols, and that may also
work for communications to threads. But that involves delays and
retransmission after timers expire.
You are talking very generally here and I don't see how it applies to this
discussion, which is specific to this processor.

--

Rick
 
On Wednesday, June 12, 2013 11:17:18 PM UTC+2, Eric Wallin wrote:
I have a general purpose soft processor core that I developed in verilog. The processor is unusual in that it uses four indexed LIFO stacks with explicit stack pointer controls in the opcode. It is 32 bit, 2 operand, fully pipelined, 8 threads, and produces an aggregate 200 MIPs in bargain basement Altera Cyclone 3 and 4 speed grade 8 parts while consuming ~1800 LEs. The design is relatively simple (as these things go) yet powerful enough to do real work.
Hi Eric,

first of all: I like your name, I have designed a soft-core CPU called ERIC5 ;-)

I have read your paper quickly and would like to give you some feedback:
- What is the target application of your processor? Barrel processors can make sense for special (highly parallel) applications but will have the problem that most programmers prefer high single thread performance simply because it is much easier to program.
- If you target general purpose applications in FPGAs, your core will be compared with e.g. Nios II or MICO32 (open source). They are about the same size, are fully 32bit, have high single thread performance and a full design suite. What are the benefits of your core?
- If you want the core to be really used by others, a C-compiler is a MUST. (I learned this with ERIC5 quickly.) This will most likely be much more effort than the core itself...

I know that designing a CPU is a lot of fun and I assume that this was the real motivation (which is perfectly valid, of course). Also it will give you experience in this field and maybe also reputation with future employers or others. However, if you want to make it a commercially successful product (or even more widely used than other CPUs on opencores), it will be a long hard way against Nios II, etc.

Regards,

Thomas
www.entner-electronics.com
 
On 6/25/2013 7:07 PM, Tom Gardner wrote:
Eric Wallin wrote:
On Tuesday, June 25, 2013 4:26:14 PM UTC-4, Tom Gardner wrote:

Have a look at
http://pages.cs.wisc.edu/~remzi/Classes/537/Fall2011/Book/threads-intro.pdf

section 25.3 et al for one exposition of the kinds of problem that
arise.

It talks about separate threads writing to the same location, which I
understand can be a problem with interrupts and without atomic
read-modify-write. All I can do is repeat that if you don't program
this way it won't happen.

If that is a constraint on the permissible programming style then it
would be good to state that explicitly - to save other people's time,
to save you questions, and to save everybody unpleasant
surprises later.
If you care to go back through the discussion, I believe he did exactly
that and said that two threads should not write to the same address. And we
have already discussed that this can be worked around.


That is a very common programming paradigm that people will expect
to employ to solve problems they expect to encounter. It would be
beneficial for you to demonstrate the coding techniques that you
expect to be used to solve their problems. Think of it as an
application note :)

A subroutine can be written so that threads can share a common
instance of it,

I presume "it" = code.

but without using a common memory location to store data
associated with the execution of that subroutine (unless
the location is memory mapped HW).

Sounds equivalent to keeping all the data on the thread's
stack in most of the other processors I've used.
Or you just don't share data...

One issue is the use of the word "thread". I never understood the
difference between thread and process until I read the link you
provided. We don't have to be talking about threads here. I expect the
processors will be much more likely to be running separate processes
using separate memory. Does that make you happier? Then we can just
say they don't share memory other than for communications that are well
defined and preclude the conditions that cause problems.


Works for data that isn't shared between threads.
Yes, or more specifically, it works as long as two threads (or
processes) don't write to the same locations.


In Hive, there is a register that when read returns the
thread ID, which is unique for each thread. This could
be used as an offset for subroutine data locations.

But what about data that has, of necessity, to be shared
between threads? For example a flag indicating whether or
not a non-sharable global resource (e.g. some i/o device,
or some data structure) is in use or is free to be used.
That's easy, don't have *global* I/O devices... let one processor
control that I/O device and everyone else asks that processor for I/O
support. In fact, that is one of the few ways to actually get benefit
from this processor design. It is not all that much better than a
single threaded processor in an FPGA. The J1 runs at 100 MIPS and this
runs at 200 MIPS. But no one processor does more than 25. So how do
you use that? You can assign tasks to processors and let them do
separate jobs.
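
One way to structure that, sketched in C on the same no-reordering
assumptions as before: the I/O thread owns the device and polls one
single-writer mailbox per client thread (write_io_device() and all other
names are illustrative, not part of Hive):

    #define NTHREADS 8

    extern void write_io_device(int data);   /* hypothetical device op */

    struct mailbox {
        volatile unsigned req;    /* bumped only by the client */
        volatile unsigned ack;    /* bumped only by the server */
        volatile int data;
    };
    static struct mailbox mbox[NTHREADS];

    void io_server_thread(void)   /* runs on the one I/O thread */
    {
        for (;;)
            for (int i = 0; i < NTHREADS; i++)
                if (mbox[i].req != mbox[i].ack) {     /* pending?   */
                    write_io_device(mbox[i].data);
                    mbox[i].ack = mbox[i].ack + 1;    /* release it */
                }
    }

    void client_write(int t, int value)   /* t = caller's thread ID */
    {
        mbox[t].data = value;
        mbox[t].req = mbox[t].req + 1;
        while (mbox[t].ack != mbox[t].req)
            ;                              /* wait for completion   */
    }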


None of these situations are unique to your processor.
They first became a pain point in the 1960s and
necessitated development of techniques to resolve the
problem. If you've found a way to avoid such problems,
write it up and become famous.
Yes, but none of these apply if you just read his paper...

--

Rick
 
On 6/25/13 4:06 PM, rickman wrote:
On 6/25/2013 4:23 PM, Bakul Shah wrote:
This is not good enough in general. I gave some examples where threads
have to read/write the same memory location.

I didn't see any examples that were essential. You talked about two processes accessing the same
data. Why do you need to do that? Just have one process send the data to the other process so only
one updates the list.
Thereby reducing things to single threading.

I agree with you that if threads communicate just through fifos
and there is exactly one reader and one writer there is no problem.
The reader updates the read ptr & watches but doesn't update the
write ptr. The writer updates the write ptr & watches but doesn't
update the read ptr. You can use fifos like these to implement a
mutex but this is a very expensive way to implement mutexes and
doesn't scale.

Doesn't scale? Can you explain?
Scale to more than two threads. For that it may be better to use
one of the other algorithms mentioned in my last article. Still pretty
complicated and inefficient.

Adding compare-and-swap or load-linked & store-conditional would
make your processor more useful for parallel programming. I am not
motivated enough to go through 4500+ lines of verilog to know how
hard that is but you must already have some bus arbitration logic
since all 8 threads can access memory.

You don't understand even the most basic concept of how the device works. There is no arbitration
logic because there is only one processor that is time shared between 8 processes on a clock cycle
basis to match the 8 deep pipeline.
In this case load-linked, store conditional may be possible?
load-linked records the loaded address & thread id in a special
register. If any other thread tries to *write* to the same
address, a subsequent store-conditional fails & the next instruction
can test that. You could simplify this further at some loss of
efficiency: fail the store if there is *any* store by any other
thread!
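
In software the pair would be used like this, sketched in C with two
hypothetical intrinsics ll() and sc() standing in for the proposed
instructions (neither exists in Hive today); the loop retries whenever
another thread's store invalidates the reservation:

    extern int ll(volatile int *addr);             /* load-linked: load  */
                                                   /* and set reservation */
    extern int sc(volatile int *addr, int value);  /* store-conditional: */
                                                   /* 1 on success, 0 if */
                                                   /* reservation lost   */
    void atomic_inc(volatile int *counter)
    {
        int v;
        do {
            v = ll(counter);
        } while (!sc(counter, v + 1));
    }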
 
On 6/25/2013 7:55 PM, glen herrmannsfeldt wrote:
Eric Wallin<tammie.eric@gmail.com> wrote:

(snip)
It talks about separate threads writing to the same location,
which I understand can be a problem with interrupts and without
atomic read-modify-write. All I can do is repeat that if you
don't program this way it won't happen. A subroutine can be
written so that threads can share a common instance of it,
but without using a common memory location to store data
associated with the execution of that subroutine (unless
the location is memory mapped HW).

Sure, that is pretty common. It is usually related to being
reentrant, but not always exactly the same.

In Hive, there is a register that when read returns the thread
ID, which is unique for each thread. This could be used as an
offset for subroutine data locations.

Yes, but usually once in a while there needs to be communication
between threads. If no other time, to get data through an
I/O device, such as separate threads writing to the same
user console. (Screen, terminal, serial port, etc.)
Why would communications be a problem? Just let one processor control
the I/O device and let the other processors talk to that one.

--

Rick
 
On 6/25/2013 8:10 PM, Bakul Shah wrote:
On 6/25/13 4:06 PM, rickman wrote:
On 6/25/2013 4:23 PM, Bakul Shah wrote:
This is not good enough in general. I gave some examples where threads
have to read/write the same memory location.

I didn't see any examples that were essential. You talked about two
processes accessing the same
data. Why do you need to do that? Just have one process send the data
to the other process so only
one updates the list.

Thereby reducing things to single threading.
Maybe you need to define what you mean by thread and process...


I agree with you that if threads communicate just through fifos
and there is exactly one reader and one writer there is no problem.
The reader updates the read ptr & watches but doesn't update the
write ptr. The writer updates the write ptr & watches but doesn't
update the read ptr. You can use fifos like these to implement a
mutex but this is a very expensive way to implement mutexes and
doesn't scale.

Doesn't scale? Can you explain?

Scale to more than two threads. For that it may be better to use
one of the other algorithms mentioned in my last article. Still pretty
complicated and inefficient.
I didn't mean explain what "more" means, explain *why* it doesn't scale.


Adding compare-and-swap or load-linked & store-conditional would
make your processor more useful for parallel programming. I am not
motivated enough to go through 4500+ lines of verilog to know how
hard that is but you must already have some bus arbitration logic
since all 8 threads can access memory.

You don't understand even the most basic concept of how the device
works. There is no arbitration
logic because there is only one processor that is time shared between
8 processes on a clock cycle
basis to match the 8 deep pipeline.

In this case load-linked, store conditional may be possible?
load-linked records the loaded address & thread id in a special
register. If any other thread tries to *write* to the same
address, a subsequent store-conditional fails & the next instruction
can test that. You could simplify this further at some loss of
efficiency: fail the store if there is *any* store by any other
thread!
Do you understand the processor design?

--

Rick
 
On Tuesday, June 25, 2013 4:26:14 PM UTC-4, Tom Gardner wrote:

Have a look at
http://pages.cs.wisc.edu/~remzi/Classes/537/Fall2011/Book/threads-intro.pdf
section 25.3 et al for one exposition of the kinds of problem that arise.
It talks about separate threads writing to the same location, which I understand can be a problem with interrupts and without atomic read-modify-write. All I can do is repeat that if you don't program this way it won't happen. A subroutine can be written so that threads can share a common instance of it, but without using a common memory location to store data associated with the execution of that subroutine (unless the location is memory mapped HW). In Hive, there is a register that when read returns the thread ID, which is unique for each thread. This could be used as an offset for subroutine data locations.
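
In C terms the technique looks like the following sketch, with THREAD_ID()
as a hypothetical stand-in for a read of that register (in Hive itself it
would be a register read in assembly, not a call):

    #define NTHREADS 8

    extern unsigned THREAD_ID(void);   /* hypothetical: returns 0..7   */

    static int scratch[NTHREADS];      /* one private slot per thread  */

    int shared_subroutine(int x)
    {
        int *mine = &scratch[THREAD_ID()];  /* this thread's slot only */
        *mine = x * x;
        return *mine;
    }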
 
On Tuesday, June 25, 2013 5:56:51 PM UTC-4, thomas....@gmail.com wrote:

- What is the target application of your processor? Barrel processors can make sense for special (highly parallel) applications but will have the problem that most programmers prefer high single thread performance simply because it is much easier to program.
The target application is for an FPGA logic designer who needs processor functionality but doesn't want or need anything too complex. There is no need for a toolchain for instance, and operation has been kept as simple as possible.

- If you target general purpose applications in FPGAs, your core will be compared with e.g. Nios II or MICO32 (open source). They are about the same size, are fully 32bit, have high single thread performance and a full design suite. What are the benefits of your core?
The benefit is it really is free, so you aren't legally bound to vendor silicon (not that all are). And if you hate having yet another toolset between you and what is going on, you're probably SOL with most soft processors as they are quite complex (overly so for many low level applications, IMO). No one will be running Linux on Hive, for instance. But running Linux on any soft core seems kind of dumb to me; when you need that much processor you might as well buy an ASIC, which is cheaper, faster, etc., and not a blob of logic.

- If you want the core to be really used by others, a C-compiler is a MUST. (I learned this with ERIC5 quickly.) This will most likely be much more effort than the core itself...
Nope, ain't gonna do it, and you can't make me! :) A compiler for something this low level is overkill and kind of asking for it IMO.

I know that designing a CPU is a lot of fun and I assume that this was the real motivation (which is perfectly valid, of course). Also it will give you experience in this field and maybe also reputation with future employers or others. However, if you want to make it a commercially successful product (or even more widely used than other CPUs on opencores), it will be a long hard way against Nios II, etc.
I have like zero interest in Nios et al. Hive is mainly for my own use, for serializing low bandwidth FPGA applications that would otherwise underutilize the fast FPGA fabric. But after all the work that went into it I wanted to get it out there for others to use, or perhaps to employ one or more aspects of Hive in their own processor cores.

I hope to use Hive in a digital Theremin I've been working on for about a year now. Too soon to really know, but one thread will probably handle the user interface (LCD, rotary encoder, LEDs, etc.) another will probably handle linearization and scaling of the pitch side, another the wavetable and filtering stuff, etc. so I believe I can keep the threads busy. My main fear at this point is that heat from the FPGA will disturb the exquisitely sensitive electronics (there's only about 1 pF difference over the entire playable pitch range). The open project is described in a forum thread over at http://www.thereminworld.com if anyone is interested (I'm "dewster").
 
On Tuesday, June 25, 2013 7:07:59 PM UTC-4, Tom Gardner wrote:

But what about data that has, of necessity, to be shared
between threads? For example a flag indicating whether or
not a non-sharable global resource (e.g. some i/o device,
or some data structure) is in use or is free to be used.
I plan to have one and only one thread handling I/O and passing the data on as needed via memory space to one or more other threads. I promise to be careful and not blow up space-time when I write the code. ;-)
 
On Tuesday, June 25, 2013 5:56:51 PM UTC-4, thomas....@gmail.com wrote:

Barrel processors...
Hive is a barrel processor! Thanks for that term Thomas! I knew the idea wasn't original with me, but I had no idea the concept was so old (the CDC 6000 series peripheral processors, designed by Cray in 1964) and has been implemented many times since:

http://en.wikipedia.org/wiki/Barrel_processor
 
On Wednesday, June 26, 2013 1:07:28 AM UTC+2, Eric Wallin wrote:
I hope to use Hive in a digital Theremin I've been working on for about a year now. Too soon to really know, but one thread will probably handle the user interface (LCD, rotary encoder, LEDs, etc.) another will probably handle linearization and scaling of the pitch side, another the wavetable and filtering stuff, etc. so I believe I can keep the threads busy.
OK, I understand the idea behind your processor better now. But I think you are targeting applications that could also be realized with PicoBlaze / Mico8 / ERIC5, which are all MUCH smaller than your design. Of course your design has the benefit of 32b operations.

I guess it makes sense and will be fun for you to use it in your own projects. However, if other people compare Hive with e.g. Nios, most of them will choose Nios because (for them) it looks less painful (both processors are new for them anyway, but one can be programmed in C while for the other they have to learn a new assembly language; one is supported by a large company and a large community, the other not; etc.).

I just want to point out that there is a lot of competition out there...

Regards,

Thomas
www.entner-electronics.com
 
Eric Wallin <tammie.eric@gmail.com> wrote:

(snip)
It talks about separate threads writing to the same location,
which I understand can be a problem with interrupts and without
atomic read-modify-write. All I can do is repeat that if you
don't program this way it won't happen. A subroutine can be
written so that threads can share a common instance of it,
but without using a common memory location to store data
associated with the execution of that subroutine (unless
the location is memory mapped HW).
Sure, that is pretty common. It is usually related to being
reentrant, but not always exactly the same.

In Hive, there is a register that when read returns the thread
ID, which is unique for each thread. This could be used as an
offset for subroutine data locations.
Yes, but usually once in a while there needs to be communication
between threads. If no other time, to get data through an
I/O device, such as separate threads writing to the same
user console. (Screen, terminal, serial port, etc.)

-- glen
 
On Wednesday, June 26, 2013 1:30:48 AM UTC+2, Eric Wallin wrote:
On Tuesday, June 25, 2013 5:56:51 PM UTC-4, thomas....@gmail.com wrote:

Barrel processors...

Hive is a barrel processor! Thanks for that term Thomas! I knew the idea wasn't original with me, but I had no idea the concept was so old (the CDC 6000 series peripheral processors, designed by Cray in 1964) and has been implemented many times since:

http://en.wikipedia.org/wiki/Barrel_processor
Yes, the old heroes of supercomputers and mainframes invented almost everything... E.g. I long assumed that Intel invented all this fancy out-of-order execution stuff, etc., only to learn recently that it was all there long ago, e.g.:
http://en.wikipedia.org/wiki/Tomasulo_algorithm

Regards,

Thomas
www.entner-electronics.com
 
Eric Wallin <tammie.eric@gmail.com> wrote:

(snip)
I plan to have one and only one thread handling I/O and passing
the data on as needed via memory space to one or more other
threads. I promise to be careful and not blow up space-time
when I write the code. ;-)
OK, but you need a way to tell the other thread that its data
is ready, and a way for that thread to tell the I/O thread that
it got the data and is ready for more. And you want to do all
that without too much overhead.

-- glen
 
thomas.entner99@gmail.com wrote:

(snip)

Yes, the old heroes of supercomputers and mainframes invented
almost everything... E.g. I long assumed that Intel invented all
this fancy out-of-order execution stuff, etc., only to learn
recently that it was all there long ago, e.g.:

http://en.wikipedia.org/wiki/Tomasulo_algorithm
The 360/91 is much more fun, though. Intel has out-of-order
execution, but in-order retirement. The results of instructions
are committed in order. That takes memory to keep things around for
a while.

The 360/91 does out-of-order retirement. It helps that S/360
(except for the 67) doesn't have virtual memory.

When an interrupt comes through, the pipelines have to
be flushed of all instructions, at least up to the last one
retired. The result is imprecise interrupts, where the address
reported isn't that of the instruction at fault. (It is where to resume
execution after the interrupt, as usual.) Even more, there can be
multiple imprecise interrupts, as more can occur before the pipeline
is empty.

Much of that went away when VS came in, so that page faults
could be serviced appropriately.

The 360/91 was for many years, and maybe still is, a favorite
example for books on pipelined architecture.

-- glen
 
