Which is the most beautiful and memorable hardware structure

On 3/30/2010 9:15 PM, glen herrmannsfeldt wrote:
In comp.arch.fpga "Andy \"Krazy\" Glew"<ag-news@patten-glew.net> wrote:
The two hardware datastructures supporting out of order execution:

Reservation stations.

And, less beautifully, the register renaming map.

Both from the IBM 360/91, as far as I know.

S/360 has only four floating point registers, so register
renaming was pretty important for out-of-order execution.

OK, how about imprecise interrupts?

-- glen

I never really knew how the 360/91 did register renaming. I don't think it used a RAM style map. I think it used CAMs.

I actually asked Tomasulo this, but he never really answered the question.
 
In comp.arch.fpga "Andy \"Krazy\" Glew" <ag-news@patten-glew.net> wrote:
(snip)

I never really knew how the 360/91 did register renaming.
I don't think it used a RAM style map. I think it used CAMs.

I actually asked Tomasulo this, but he never really answered
the question.
Never having had anyone to ask, but only read about it in books,
that sounds about right.

The explanation I have seen for the CDB, common data bus, was
that results come out broadcast to all possible destinations.
Those destinations expecting a result from that source accept it.
Possible destinations are registers, reservation stations
(for adders or multiply/divide), or to be written to main memory.
Sources are results from arithmetic units, or data read from
(750 ns, 16 way interleaved) main memory.
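
To make the broadcast-and-match idea concrete, here is a minimal Python sketch. It is only an illustration of the scheme as described above, not the 360/91's actual logic; the `Operand` class, the `broadcast` helper, the tag numbers, and the listener names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Operand:
    """One destination slot: holds a value, or waits on a producer tag."""
    tag: Optional[int] = None      # tag of the unit expected to produce the value
    value: Optional[float] = None  # filled in when the matching result arrives

    def snoop(self, tag: int, value: float) -> None:
        # Associative (CAM-like) compare of the stored tag with the bus tag.
        if self.tag == tag:
            self.value = value
            self.tag = None

    @property
    def ready(self) -> bool:
        return self.tag is None


def broadcast(tag: int, value: float, listeners: list) -> None:
    """Put one result on the bus; every listener does its own compare."""
    for slot in listeners:
        slot.snoop(tag, value)


# Hypothetical setup: the adder (tag 1) has three waiting destinations.
fp_reg2 = Operand(tag=1)       # a floating point register
rs_operand = Operand(tag=1)    # an operand in a reservation station
store_data = Operand(tag=1)    # data waiting to be written to memory
other = Operand(tag=2)         # waits on a different unit, must not capture

broadcast(1, 3.5, [fp_reg2, rs_operand, store_data, other])
print(fp_reg2.value, rs_operand.value, store_data.value, other.ready)
```

Every listener compares on every broadcast, which is where the cost in comparators and wire length comes from.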

Among the not so obvious ones, if you store to memory and then
refetch, register renaming will detect the same address is
being used and go directly to the source. (No cache on the
360/91, it originated on the 360/85.)
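
As a rough illustration of the store-then-refetch case, the Python sketch below forwards a pending store's data to a later load of the same address. This is a generic sketch of the idea only, assuming a simple list of pending stores; the `StoreBuffer` class and its method names are hypothetical and are not taken from any 360/91 documentation.

```python
class StoreBuffer:
    """Pending stores, oldest first, that have not yet reached memory."""

    def __init__(self):
        self.pending = []  # list of (address, value)

    def store(self, addr, value):
        self.pending.append((addr, value))

    def load(self, addr, memory):
        # Search youngest-first so the most recent matching store wins.
        for a, v in reversed(self.pending):
            if a == addr:
                return v, "forwarded"
        # No pending store to this address: read memory (0.0 if unset).
        return memory.get(addr, 0.0), "memory"

    def drain(self, memory):
        # Retire pending stores to memory in program order.
        for a, v in self.pending:
            memory[a] = v
        self.pending.clear()


memory = {0x100: 1.0}
sb = StoreBuffer()
sb.store(0x100, 2.0)
hit = sb.load(0x100, memory)    # same address as the pending store
miss = sb.load(0x200, memory)   # no pending store, goes to memory
sb.drain(memory)
print(hit, miss, memory[0x100])
```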

-- glen
 
On Apr 1, 1:07 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
The explanation I have seen for the CDB, common data bus, was
that results come out broadcast to all possible destinations.
Do you realize that the length of this bus and the number of
destination nodes were one of the reasons the IBM machine topped out at
60ns while the CDC machines topped out at 27.5ns and could deliver 4
results (and one load) per cycle (as catch-up bandwidth)?

Mitch
 
On 4/1/2010 11:07 AM, glen herrmannsfeldt wrote:
In comp.arch.fpga "Andy \"Krazy\" Glew"<ag-news@patten-glew.net> wrote:
(snip)

I never really knew how the 360/91 did register renaming.
I don't think it used a RAM style map. I think it used CAMs.

I actually asked Tomasulo this, but he never really answered
the question.

Never having had anyone to ask, but only read about it in books,
that sounds about right.
All I know is that I proposed having a separate pipestage to rename registers, using a RAM (SRAM) table indexed by
logical register number returning physical register number, in 1986 or 1987 - in Wen-mei Hwu's microprocessor design
class - after he had taken us through Tomasulo and HPSm.

I.e. I proposed eliminating the CAMs, replacing them by a RAM and an additional pipestage.
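
A minimal Python sketch of that kind of RAM-style map, for illustration only. It assumes a simple free list and leaves out freeing and recovery entirely; the `RenameTable` class, its method names, and the register counts are hypothetical.

```python
class RenameTable:
    """RAM-style map: indexed by logical register, returns physical register."""

    def __init__(self, n_logical, n_physical):
        # Initially logical register i lives in physical register i.
        self.map = list(range(n_logical))
        # Remaining physical registers are free (freeing is not modeled).
        self.free = list(range(n_logical, n_physical))

    def rename(self, dst, srcs):
        # Sources are read by index, with no associative search.
        phys_srcs = [self.map[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register; would stall")
        # The destination gets a fresh physical register.
        phys_dst = self.free.pop(0)
        self.map[dst] = phys_dst
        return phys_dst, phys_srcs


rt = RenameTable(n_logical=4, n_physical=8)
first = rt.rename(dst=0, srcs=[1, 2])   # r0 = r1 op r2
second = rt.rename(dst=0, srcs=[0, 3])  # r0 = r0 op r3, reads the new r0
print(first, second)
```

The second instruction picks up the first one's new physical register for r0 by a plain table read, which is the point of the extra pipestage.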

The idea seemed new to everyone who encountered it. It was not universally accepted as good. Indeed, I remember arguing
with Tom Olson of AMD (if memory serves), who said that spending an extra pipestage was not a good idea.

I also talked to Mitch about it at around that time, although he was preoccupied with spreadsheets for the



The explanation I have seen for the CDB, common data bus, was
that results come out broadcast to all possible destinations.
Those destinations expecting a result from that source accept it.
Possible destinations are registers, reservation stations
(for adders or multiply/divide), or to be written to main memory.
Sources are results from arithmetic units, or data read from
(750 ns, 16 way interleaved) main memory.
Many people say that the CDB was an important invention. I think it was a bad idea - long wires, CAMs.

Conceptually it is elegant, but implementation wise it is a bad idea.

The important thing is taking that conceptually elegant CAM-ful idea, and implementing it in an efficient non-CAM manner.

The modern style of register renaming accomplishes this - certainly for the registers, but also, depending on the
system, for the reservation stations (if those are still being used).




Among the not so obvious ones, if you store to memory and then
refetch, register renaming will detect the same address is
being used and go directly to the source. (No cache on the
360/91, it originated on the 360/85.)
I'd love to see a reference for this.

I believe that a UWisc patent on this was one of the things that resulted in a big payment from Intel to UWisc.

Myself, I thought it was obvious.
 
On Apr 1, 10:05 pm, "Andy \"Krazy\" Glew" <ag-n...@patten-glew.net>
wrote:

Myself, I thought it was obvious.
There's your problem right there, Andy. Everyone else will say:

1. It's already been done (heard way too many times in this forum).

2. It was obvious (emphasis on the past tense).

People answering either (1) or (2) assume that everything that can be
thought of is already in textbooks. That's how they got to where they
are.

Robert.
 
On Apr 1, 9:05 pm, "Andy \"Krazy\" Glew" <ag-n...@patten-glew.net>
wrote:
I also talked to Mitch about it at around that time, although he was preoccupied with spreadsheets for the
Any chance you could complete this sentence?

Perhaps from {88100, 88110, 88120, crazy, insane, Asilomar
participants, Hot Chips participants, all of the preceding?}

Mitch
 
On Apr 1, 10:20 pm, Robert Myers <rbmyers...@gmail.com> wrote:
On Apr 1, 10:05 pm, "Andy \"Krazy\" Glew" <ag-n...@patten-glew.net
wrote:

Myself, I thought it was obvious.

There's your problem right there, Andy.  Everyone else will say:

1. It's already been done (heard way too many times in this forum).

2. It was obvious (emphasis on the past tense).

People answering either (1) or (2) assume that everything that can be
thought of is already in textbooks.  That's how they got to where they
are.
Textbooks? Didn't you get the memo? Everything that can be thought
of is on Google...or should I say Topeka ;)

KJ
 
In comp.arch.fpga "Andy \"Krazy\" Glew" <ag-news@patten-glew.net> wrote:
(snip)

All I know is that I proposed having a separate pipestage
to rename registers, using a RAM (SRAM) table indexed by
logical register number returning physical register number,
in 1986 or 1987 - in Wen-mei Hwu's microprocessor design
class - after he had taken us through Tomasulo and HPSm.

I.e. I proposed eliminating the CAMs, replacing them by a
RAM and an additional pipestage.
With the 360/91 system, though, values can easily have more than
one destination. I suppose that could be done other ways,
too, but it is especially convenient that way.

The idea seemed new to everyone who encountered it. It was
not universally accepted as good. Indeed, I remember arguing
with Tom Olson of AMD (if memory serves), who said that
spending an extra pipestage was not a good idea.

Many people say that the CDB was an important invention.
I think it was a bad idea - long wires, CAMs.
If the wires are too long, then add more pipeline stages along
the way. With 750ns 16way interleaved core, though, the 91
wasn't going to get much faster than 60ns.

Conceptually it is elegant, but implementation wise it is a bad idea.

The important thing is taking that conceptually elegant
CAM-ful idea, and implementing it in an efficient non-CAM manner.

The modern style of register renaming accomplishes this -
certainly for the registers, but also, depending on the
system, for the reservation stations (if those are still
being used).
Logic was much more expensive then than now, so the
tradeoffs are likely different. If you used RAM tables
with more than one entry for each source, you could do
multiple destinations easily.

Among the not so obvious ones, if you store to memory and then
refetch, register renaming will detect the same address is
being used and go directly to the source. (No cache on the
360/91, it originated on the 360/85.)

I'd love to see a reference for this.
There is an issue of the IBM Journal of Research and
Development pretty much devoted to the 91. I believe
it is in there. The 91 is pretty much a favorite for
books on pipelined processor design, mostly referencing
that journal issue.

I believe that a UWisc patent on this was one of the things
that resulted in a big payment from Intel to UWisc.

Myself, I thought it was obvious.
-- glen
 
On Apr 1, 7:05 pm, "Andy \"Krazy\" Glew" <ag-n...@patten-glew.net>
wrote:
On 4/1/2010 11:07 AM, glen herrmannsfeldt wrote:

In comp.arch.fpga "Andy \"Krazy\" Glew"<ag-n...@patten-glew.net>  wrote:
(snip)

I never really knew how the 360/91 did register renaming.
I don't think it used a RAM style map.  I think it used CAMs.

I actually asked Tomasulo this, but he never really answered
the question.

Never having had anyone to ask, but only read about it in books,
that sounds about right.

All I know is that I proposed having a separate pipestage to rename registers, using a RAM (SRAM) table indexed by
logical register number returning physical register number, in 1986 or 1987 - in Wen-mei Hwu's microprocessor design
class - after he had taken us through Tomasulo and HPSm.

I.e. I proposed eliminating the CAMs, replacing them by a RAM and an additional pipestage.

The idea seemed new to everyone who encountered it. It was not universally accepted as good.  Indeed, I remember arguing
with Tom Olson of AMD (if memory serves), who said that spending an extra pipestage was not a good idea.

I also talked to Mitch about it at around that time, although he was preoccupied with spreadsheets for the

The explanation I have seen for the CDB, common data bus, was
that results come out broadcast to all possible destinations.
Those destinations expecting a result from that source accept it.
Possible destinations are registers, reservation stations
(for adders or multiply/divide), or to be written to main memory.
Sources are results from arithmetic units, or data read from
(750 ns, 16 way interleaved) main memory.

Many people say that the CDB was an important invention.  I think it was a bad idea - long wires, CAMs.

Conceptually it is elegant, but implementation wise it is a bad idea.

The important thing is taking that conceptually elegant CAM-ful idea, and implementing it in an efficient non-CAM manner.

The modern style of register renaming accomplishes this - certainly for the registers, but also, depending on the
system, for the reservation stations (if those are still being used).

Among the not so obvious ones, if you store to memory and then
refetch, register renaming will detect the same address is
being used and go directly to the source.  (No cache on the
360/91, it originated on the 360/85.)

I'd love to see a reference for this.

I believe that a UWisc patent on this was one of the things that resulted in a big payment from Intel to UWisc.

Myself, I thought it was obvious.
Hi Andy,
Your opinion is bright.

Can you tell me UWisc patent number or its title?

I have a design that is expected to work in a core of a modern
multiprocessor running at more than 3GHz,
and the output drives one target.

The design can have two implementations:
1. One source always drives the one target and it uses a lot of power;
2. 16 sources can selectively use a common output bus to drive the
target with much less power.

The output must be finished within 1 clock cycle.

Which implementation is wiser in the real world?

In other words, is having 16 sources selectively drive a common output bus
with one target
a wise implementation at more than 3GHz?

Thank you.

Weng




 
On 4/1/2010 7:48 PM, MitchAlsup wrote:
On Apr 1, 9:05 pm, "Andy \"Krazy\" Glew"<ag-n...@patten-glew.net
wrote:
I also talked to Mitch about it at around that time, although he was preoccupied with spreadsheets for the

Any chance you could complete this sentence?

Perhaps from {88100, 88110, 88120, crazy, insane, Asilomar
participants, Hot Chips participants, all of the preceding?}
Got distracted, forgot to finish. Wasn't exactly sure I remembered what you were working on.

Remember the first time I met you, Mitch, and Willie Anderson? What were you working on? Memory bandwidth spreadsheets
for the 88110? SIMD vectors? I remember we talked about DRAM bank structure, and you made your usual "If DRAMs were
designed the way I want them to be designed..." speech. I remember that you were interested in Linpack, while I was
interested in OOO and GCC.
 
On 4/1/2010 9:31 PM, glen herrmannsfeldt wrote:
In comp.arch.fpga "Andy \"Krazy\" Glew"<ag-news@patten-glew.net> wrote:
(snip)

All I know is that I proposed having a separate pipestage
to rename registers, using a RAM (SRAM) table indexed by
logical register number returning physical register number,
in 1986 or 1987 - in Wen-mei Hwu's microprocessor design
class - after he had taken us through Tomasulo and HPSm.

I.e. I proposed eliminating the CAMs, replacing them by a
RAM and an additional pipestage.

With the 360/91 system, though, values can easily have more than
one destination. I suppose that could be done other ways,
too, but it is especially convenient that way.
That's basically why P6 both renamed to physical registers, and had an RS with CAMs.

RAM style indexing for the big data structure.

CAMs for the relatively smaller RS, broadcast.

I've always regretted not totally eliminating the CAMs in the RS. Always meant to get around to it in P6 v2.0, but that
never happened.

(BTW, no, Willamette did not eliminate the CAMs. The bitmap scheduler is CAMs, but decoded CAMs rather than encoded CAMs.
Many people think that the term "CAM" only applies to encoded CAMs, but don't really have a name for the decoded CAMs,
e.g. 1-hots. Me, I think encoded vs. decoded is just a circuit trick.)
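
A small Python sketch of the encoded vs. decoded distinction, for illustration only. It models the matching behavior, not any real scheduler circuit; the function names and the example tags are hypothetical.

```python
def encoded_match(entry_tags, result_tag):
    """Encoded CAM: each entry holds a binary tag and compares it for equality."""
    return [t == result_tag for t in entry_tags]


def decoded_match(entry_masks, result_tag):
    """Decoded (1-hot) form: each entry holds a bit vector, one bit per producer."""
    one_hot = 1 << result_tag          # the broadcast drives a single line
    return [(m & one_hot) != 0 for m in entry_masks]


tags = [3, 5, 3, 0]                    # producer tag each entry waits on
masks = [1 << t for t in tags]         # same information, decoded to 1-hot

enc = encoded_match(tags, 3)
dec = decoded_match(masks, 3)
print(enc, dec)
```

Both forms wake the same entries for the same broadcast; the difference is whether the comparison is done on an encoded tag or on an already decoded bit.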




The modern style of register renaming accomplishes this -
certainly for the registers, but also, depending on the
system, for the reservation stations (if those are still
being used).

Logic was much more expensive then than now, so the
tradeoffs are likely different. If you used RAM tables
with more than one entry for each source, you could do
multiple destinations easily.
Right. The problem then has always been "how many destinations", and "how do you handle exceeding the number of
destinations without (a) falling off a cliff, and (b) complexity".
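
To illustrate the "how many destinations" problem, here is a minimal Python sketch of a RAM table with a fixed number of destination slots per source. It is only a sketch under stated assumptions: the `ConsumerTable` class is hypothetical, and refusing the subscription on overflow (so the caller has to stall or use some other path) is just one simple policy chosen for the example.

```python
class ConsumerTable:
    """Per-producer list of waiting destinations, with a fixed slot count."""

    def __init__(self, n_producers, slots_per_producer):
        self.slots = slots_per_producer
        self.table = [[] for _ in range(n_producers)]

    def subscribe(self, producer, consumer):
        row = self.table[producer]
        if len(row) >= self.slots:
            # Overflow: this sketch just refuses; the caller must cope.
            return False
        row.append(consumer)
        return True

    def complete(self, producer):
        # Indexed read of the destinations to wake, with no broadcast.
        woken = self.table[producer]
        self.table[producer] = []
        return woken


ct = ConsumerTable(n_producers=4, slots_per_producer=2)
ok1 = ct.subscribe(1, "rs0")
ok2 = ct.subscribe(1, "rs1")
ok3 = ct.subscribe(1, "rs2")   # third destination does not fit
woken = ct.complete(1)
print(ok1, ok2, ok3, woken)
```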



Among the not so obvious ones, if you store to memory and then
refetch, register renaming will detect the same address is
being used and go directly to the source. (No cache on the
360/91, it originated on the 360/85.)

I'd love to see a reference for this.

There is an issue of the IBM Journal of Research and
Development pretty much devoted to the 91. I believe
it is in there. The 91 is pretty much a favorite for
books on pipelined processor design, mostly referencing
that journal issue.
I practically memorized that issue. Not there that I remember. Likely we are talking about different things.
 
On Apr 3, 12:19 pm, "Andy \"Krazy\" Glew" <ag-n...@patten-glew.net>
wrote:
On 4/1/2010 7:48 PM, MitchAlsup wrote:

On Apr 1, 9:05 pm, "Andy \"Krazy\" Glew"<ag-n...@patten-glew.net
wrote:
I also talked to Mitch about it at around that time, although he was preoccupied with spreadsheets for the

Any chance you could complete this sentence?

Perhaps from {88100, 88110, 88120, crazy, insane, Asilomar
participants, Hot Chips participants, all of the preceding?}

Got distracted, forgot to finish.  Wasn't exactly sure I remembered what you were working on.

Remember the first time I met you, Mitch, and Willie Anderson? What were you working on?  Memory bandwidth spreadsheets
for the 88110? SIMD vectors?  I remember we talked about DRAM bank structure, and you made your usual "If DRAMs were
designed the way I want them to be designed..." speech.  I remember that you were interested in Linpack, while I was
interested in OOO and GCC.
Willie was on 88110
Sounds like I was already on 88120
As to DRAM see USPTO 5367494

It was not so much that I was concentrating on Linpack. We (Shebanow
and I) were trying to build a machine that could perform as if it were
a vector machine on vectorizable codes (without vector instructions:
i.e. native 88100 instructions at 6 per cycle) and also perform well
on GCC-like spaghetti codes. Linpack (Matrix 300) was simply the
vector code example.

Mitch
 
In comp.arch.fpga MitchAlsup <MitchAlsup@aol.com> wrote:
(snip)

It was not so much that I was concentrating on Linpack. We (Shebanow
and I) were trying to build a machine that could perform as if it were
a vector machine on vectorizable codes (without vector instructions:
i.e. native 88100 instructions at 6 per cycle) and also perform well
on GCC-like spaghetti codes. Linpack (Matrix 300) was simply the
vector code example.
The 360/91 was also designed to perform well on non-vectorized code.
Well, on the code generated for other 360's. Among others is
loop mode where for a small enough loop it stops fetching
instructions from memory (they are in a special cache).
The goal was one instruction per cycle. (With 750ns core it
wasn't likely to do more than that.)

The 360/91 even had to handle self-modifying code, including
instructions that might have already been fetched. The IBM
Fortran library for OS/360 did use some self-modifying code.
(No recursion in Fortran 66 so it wasn't so hard to do.)

-- glen
 
In article <hp87mm$s7m$1@naig.caltech.edu>, gah@ugcs.caltech.edu says...
In comp.arch.fpga MitchAlsup <MitchAlsup@aol.com> wrote:
(snip)

It was not so much that I was concentrating on Linpack. We (Shebanow
and I) were trying to build a machine that could perform as if it were
a vector machine on vectorizable codes (without vector instructions:
i.e. native 88100 instructions at 6 per cycle) and also perform well
on GCC-like spaghetti codes. Linpack (Matrix 300) was simply the
vector code example.

The 360/91 was also designed to perform well on non-vectorized code.
Well, on the code generated for other 360's. Among others is
loop mode where for a small enough loop it stops fetching
instructions from memory (they are in a special cache).
The goal was one instruction per cycle. (With 750ns core it
wasn't likely to do more than that.)
Must have got that idea from the CDC 6600.

The 360/91 even had to handle self-modifying code, including
instructions that might have already been fetched. The IBM
Fortran library for OS/360 did use some self-modifying code.
(No recursion in Fortran 66 so it wasn't so hard to do.)
SMC was not allowed in the CDC instruction stack (i.e. non-coherent cache).

- Tim
 
