MISC - Stack Based vs. Register Based

On Apr 3, 5:34 pm, "Rod Pemberton" <do_not_h...@notemailnotq.cpm>
wrote:
 CISC was
typically little-endian to reduce the space needed for integer
encodings.
As long as you discount IBM mainframes. They are big endian. Or
Burroughs/Unisys; they were big endian too. Or the Motorola 68K; it
was big endian. For little-endian CISC, only the VAX and x86 come to
mind. Of those only the x86 survives.
 
In comp.arch.fpga Alex McDonald <blog@rivadpm.com> wrote:

(snip, someone wrote)
But, of course, this is a fallacy. The same goal is accomplished by
macros, and better. Code density is the only valid reason.

Albert, do you have a reference about this?

Let's take two commonly used S/360 opcodes as an example of CISC: some
move operations. MVC (move 0 to 255 bytes) and MVCL (move 0 to 16M bytes).
MVC moves 1 to 256 bytes, conveniently. (Unless you want 0.)

MVC does no padding or truncation. MVCL can pad and truncate, but
unlike MVC will do nothing and report overflow if the operands
overlap. MVC appears to other processors as a single indivisible
operation; every processor (including IO processors) sees storage as
either before the MVC or after it;
I haven't looked recently, but I didn't think it locked out I/O.
Seems that one of the favorite tricks for S/360 was modifying
channel programs while they are running. (Not to mention self-
modifying channel programs.) Seems that MVC would be convenient
for that. It might be that MVC interlocks on CCW fetch such that
only whole CCWs are fetched, though.

it's not interruptible. MVCL is
interruptible, and partial results can be observed by other
processors. MVCL requires 4 registers, and their contents are updated
after completion of the operation; MVC requires 1 register for
variable-length moves, 0 for fixed-length, and its contents are
preserved. MVCL has a high code setup cost; MVC has none.
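To make that setup cost concrete, here is a hedged sketch of the classic
MVCL zero-memory idiom; the register choices and the BUF/BUFLEN names
(BUFLEN an EQUated length here) are illustrative, not from any IBM source:

* Zero BUFLEN bytes at BUF: with a source length of zero,
* MVCL fills the entire destination with the pad byte.
         LA    2,BUF              destination address
         LA    3,BUFLEN           destination length
         SR    4,4                source address (unused here)
         SR    5,5                source length 0, pad byte X'00'
         MVCL  2,4                four registers just to say "zero it"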

Writing a macro to do multiple MVCs and mimic the behaviour of MVCL?
Why not? It's possible, if a little tricky. And by all accounts, MVC
in a loop is faster than MVCL too. IBM even provided a macro; $MVCL.
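A hedged sketch of the sort of expansion such a macro might generate
(illustrative only, not IBM's actual $MVCL; assume the length is in
register 4, the source address in register 2, and the target in register 3):

         B     MOVELOOP           skip over the EX target
MOVEMVC  MVC   0(0,3),0(2)        length supplied by EX from reg 4
MOVELOOP CH    4,=H'256'          a full 256-byte chunk left?
         BL    MOVELAST           no, go handle the remainder
         MVC   0(256,3),0(2)      move one chunk
         LA    2,256(,2)          advance the source address
         LA    3,256(,3)          advance the target address
         SH    4,=H'256'          count off what we moved
         B     MOVELOOP           and test again
MOVELAST LTR   4,4                anything left over?
         BZ    MOVEDONE           no, all done
         BCTR  4,0                MVC wants length minus one
         EX    4,MOVEMVC          move the last 1 to 256 bytes
MOVEDONE DS    0H                 execution continues here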

But then, when you look at MVCL usage closely, there are a few
defining characteristics that are very useful. It can zero memory, and
the millicode (IBM's word for microcode) recognizes 4K boundaries for
4K lengths and optimises it; it's faster than 16 MVCs.
As far as I understand, millicode isn't exactly like microcode,
but does allow for more complicated new instructions to be more
easily implemented.

There's even a MVPG instruction for moving 4K aligned pages! What are
those crazy instruction set designers thinking?

The answer's a bit more than just code density; it never really was
about that. In all the years I wrote IBM BAL, I never gave code
density a serious thought -- with one exception. That was the 4K base
address limit; a base register could only span 4K, so for code bigger
than that you needed either fancy register footwork or extra
registers wasted on multiple bases.
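The usual footwork looked something like this hedged sketch (register
numbers are illustrative; the two LAs build entry+4096 because LA's
displacement tops out at 4095):

BIGPROG  STM   14,12,12(13)       standard entry linkage
         LR    12,15              first base: the entry address
         LA    11,2048(,12)       halfway there...
         LA    11,2048(,11)       second base: entry + 4096
         USING BIGPROG,12,11      two bases now cover 8K of code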
Compared to VAX, S/360 is somewhat RISCy. Note only three different
instruction lengths and, for much of the instruction set, only two
address modes. If processors fast-path the more popular instructions,
like L and even MVC, it isn't so far from RISC.

It was more about giving assembler programmers choice and variety to
get the best out of the box before the advent of optimising compilers;
Though stories are that even the old Fortran H could come close to
good assembly programmers, and likely do better than the average
assembly programmer.

a way, if you like, of exposing the potential of the micro/millicode
through the instruction set. "Here I want you to zero memory" meant an
MVCL. "Here I am moving 8 bytes from A to B" meant using MVC. A
knowledgeable assembler programmer could out-perform a compiler.
(Nowadays quality compilers do a much better job of instruction
selection than humans, especially for pipelined processors that
stall.)
For many processors, MVC was much faster on appropriately aligned
data, such as the 8 bytes from A to B. Then again, some might
use LD and STD.
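Both spellings of that 8-byte move, as a hedged sketch with hypothetical
operand names A and B:

* Storage to storage, one instruction:
         MVC   B(8),A             move 8 bytes from A to B
* Or through a floating-point register, on machines where two
* RX instructions beat the storage-to-storage path:
         LD    0,A                load 8 bytes into FP register 0
         STD   0,B                store them at B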

Hence CISC instruction sets (at least, IMHO and for IBM). They were
there for people and performance, not for code density.
I noticed some time ago that the hex opcodes for add instructions
end in A, and for divide in D. (That leaves B for subtract and
C for multiply, but not so hard to remember.)
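Spelled out from the RR and RX arithmetic opcodes, the pattern is:

*  RR format:  1A = AR (add)       1B = SR (subtract)
*              1C = MR (multiply)  1D = DR (divide)
*  RX format:  5A = A  (add)       5B = S  (subtract)
*              5C = M  (multiply)  5D = D  (divide)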

If they really wanted to reduce code size, they should have added
a load indirect register instruction. (RR format.) A good
fraction of L (load) instructions have both base and offset
zero (or, equivalently, index and offset).
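As a hedged illustration (LIND is a made-up mnemonic, not a real
S/360 instruction):

* Today, the common case costs a 4-byte RX instruction:
         L     1,0(,2)            load the word addressed by reg 2
* A hypothetical 2-byte RR "load indirect" would halve that:
*        LIND  1,2                not a real instruction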

-- glen
 
In article <kjkpnp$qdp$1@dont-email.me>, rickman <gnuarm@gmail.com> wrote:
On 4/4/2013 8:44 AM, glen herrmannsfeldt wrote:
In comp.arch.fpga Arlet Ottens<usenet+5@c-scape.nl> wrote:
On 04/04/2013 01:16 PM, Albert van der Horst wrote:

You mentioned code density. AISI, code density is purely a CISC
concept. They go together and are effectively inseparable.

They do go together, but I am not so sure that they are inseparable.

CISC began when much coding was done in pure assembler, and anything
that made that easier was useful. (One should figure out the relative
costs, but at least it was in the right direction.)

But, of course, this is a fallacy. The same goal is accomplished by
macros, and better. Code density is the only valid reason.

Albert, do you have a reference about this?
Not in a wikipedia sense where you're not allowed to mention original
research, and are only quoting what is in the books. It is more experience
than school knowledge.

If you want to see what can be done with a good macro processor like m4
study the one source of the 16/32/64 bit ciforth x86 for linux/Windows/Apple.
See my site below.

The existence of an XLAT instruction (to name an example) OTOH does virtually
nothing to make the life of an assembler programmer better.

Groetjes Albert
--

Rick
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
 
On Apr 4, 4:15 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
In comp.arch.fpga Alex McDonald <b...@rivadpm.com> wrote:

(snip, someone wrote)

But, of course, this is a fallacy. The same goal is accomplished by
macros, and better. Code density is the only valid reason.
Albert, do you have a reference about this?
Let's take two commonly used S/360 opcodes as an example of CISC: some
move operations. MVC (move 0 to 255 bytes) and MVCL (move 0 to 16M bytes).

MVC moves 1 to 256 bytes, conveniently. (Unless you want 0.)
My bad; the encoding is 0 to 255 but it's interpreted as +1.

MVC does no padding or truncation. MVCL can pad and truncate, but
unlike MVC will do nothing and report overflow if the operands
overlap. MVC appears to other processors as a single indivisible
operation; every processor (including IO processors) sees storage as
either before the MVC or after it;

I haven't looked recently, but I didn't think it locked out I/O.
Seems that one of the favorite tricks for S/360 was modifying
channel programs while they are running. (Not to mention self-
modifying channel programs.) Seems that MVC would be convenient
for that. It might be that MVC interlocks on CCW fetch such that
only whole CCWs are fetched, though.
Certainly on the S/360 whole CCWs, since it had a precise interrupt
model and MVC wasn't (and still isn't) interruptible. The S/370
allowed interrupts on page faults for the target and source, but that
is done before the instruction is executed.

IPL operates just like that; it issues a fixed CCW that reads in data
that's a PSW and some CCWs, and away she goes...


it's not interruptible. MVCL is
interruptible, and partial results can be observed by other
processors. MVCL requires 4 registers, and their contents are updated
after completion of the operation; MVC requires 1 register for
variable-length moves, 0 for fixed-length, and its contents are
preserved. MVCL has a high code setup cost; MVC has none.
Writing a macro to do multiple MVCs and mimic the behaviour of MVCL?
Why not? It's possible, if a little tricky. And by all accounts, MVC
in a loop is faster than MVCL too. IBM even provided a macro; $MVCL.
But then, when you look at MVCL usage closely, there are a few
defining characteristics that are very useful. It can zero memory, and
the millicode (IBM's word for microcode) recognizes 4K boundaries for
4K lengths and optimises it; it's faster than 16 MVCs.

As far as I understand, millicode isn't exactly like microcode,
but does allow for more complicated new instructions to be more
easily implemented.

There's even a MVPG instruction for moving 4K aligned pages! What are
those crazy instruction set designers thinking?
The answer's a bit more than just code density; it never really was
about that. In all the years I wrote IBM BAL, I never gave code
density a serious thought -- with one exception. That was the 4K base
address limit; a base register could only span 4K, so for code bigger
than that you needed either fancy register footwork or extra
registers wasted on multiple bases.

Compared to VAX, S/360 is somewhat RISCy. Note only three different
instruction lengths and, for much of the instruction set, only two
address modes. If processors fast-path the more popular instructions,
like L and even MVC, it isn't so far from RISC.
A modern Z series has more instructions than the average Forth has
words; it's in the high hundreds.

It was more about giving assembler programmers choice and variety to
get the best out of the box before the advent of optimising compilers;

Though stories are that even the old Fortran H could come close to
good assembly programmers, and likely do better than the average
assembly programmer.
Fortran/H was a good compiler. The early PL/I was horrible, and there
was a move to use it for systems programming work. I never did so due
to its incredibly bad performance.

a way, if you like, of exposing the potential of the micro/millicode
through the instruction set. "Here I want you to zero memory" meant an
MVCL. "Here I am moving 8 bytes from A to B" meant using MVC. A
knowledgeable assembler programmer could out-perform a compiler.
(Nowadays quality compilers do a much better job of instruction
selection than humans, especially for pipelined processors that
stall.)

For many processors, MVC was much faster on appropriately aligned
data, such as the 8 bytes from A to B. Then again, some might
use LD and STD.
OK, 9 bytes. :)

Hence CISC instruction sets (at least, IMHO and for IBM). They were
there for people and performance, not for code density.

I noticed some time ago that the hex opcodes for add instructions
end in A, and for divide in D. (That leaves B for subtract and
C for multiply, but not so hard to remember.)

If they really wanted to reduce code size, they should have added
a load indirect register instruction. (RR format.) A good
fraction of L (load) instructions have both base and offset
zero (or, equivalently, index and offset).
Agreed. And had they wanted to, there could have been a single opcode
for the standard prolog & epilog; for example:

STM 14,12,12(13)
LR 12,15
LA 15,SAVE
ST 15,8(13)
ST 13,4(15)
LR 13,15

could have been the single op

ENTRY SAVE

It was macros every time, which is in direct opposition to Albert's
assertion.
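A hedged sketch of such a macro (the PROLOG name and layout are
illustrative, not an actual IBM macro):

         MACRO
&NAME    PROLOG &SAVE             hypothetical macro definition
&NAME    STM   14,12,12(13)       save caller's registers
         LR    12,15              establish the base register
         LA    15,&SAVE           address of our save area
         ST    15,8(13)           forward-chain it from caller's
         ST    13,4(15)           back-chain the caller's area
         LR    13,15              make ours the current save area
         MEND

With that in the macro library, every routine entry collapses to one
line with no new hardware opcode needed.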

 
In comp.arch.fpga Alex McDonald <blog@rivadpm.com> wrote:
On Apr 3, 5:34 pm, "Rod Pemberton" <do_not_h...@notemailnotq.cpm
wrote:
 CISC was
typically little-endian to reduce the space needed for integer
encodings.

As long as you discount IBM mainframes. They are big endian. Or
Burroughs/Unisys; they were big endian too. Or the Motorola 68K; it
was big endian. For little-endian CISC, only the VAX and x86 come to
mind. Of those only the x86 survives.
To me, the only one where little endian seems reasonable is the 6502.
They did amazingly well with a small number of gates. Note, for one,
that on subroutine call the 6502 doesn't push the address of the next
instruction on the stack. That would have taken too much logic.
It pushes the address minus one, as, it seems, that is what is
in the register at the time. RTS adds one after the pop.

Two-byte addition is slightly easier in little endian order, but
only slightly. It doesn't help at all for multiply and divide.

VAX was little endian because the PDP-11 was, though I am not sure
that there was a good reason for that.

-- glen
 
On Apr 4, 9:10 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
(snip)
VAX was little endian because the PDP-11 was, though I am not sure
that there was a good reason for that.

-- glen
John Savard's take on it: http://www.plex86.org/Computer_Folklore/Little-Endian-1326.html
 
On Apr 5, 12:30 am, rickman <gnu...@gmail.com> wrote:
On 4/4/2013 5:34 PM, glen herrmannsfeldt wrote:

In comp.arch.fpga rickman<gnu...@gmail.com>  wrote:

(snip, then I wrote)

Then POLY evaluates a whole polynomial, such as is used to approximate
many mathematical functions, but again, as I understand it, too slow.

Both the PDP-10 and S/360 have the option for an index register on
many instructions, where when register 0 is selected no indexing is
done. VAX instead has indexed as a separate address mode selected by
the address mode byte. Is that the most efficient use for those bits?
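To make the S/360 half of that concrete, a hedged sketch (names are
illustrative): the RX index field simply names a register, and naming
register 0 contributes zero rather than R0's contents.

* RX format: effective address = displacement + base + index
         L     1,TABLE(5)         indexed: address of TABLE + (reg 5)
         L     1,TABLE            index field 0: plain, un-indexed load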

I think you have just described the CISC instruction development
concept.  Build a new machine, add some new instructions.  No big
rationale, no "CISC" concept, just "let's make it better, why not add
some instructions?"

Yes, but remember that there is competition and each has to have
some reason why someone should buy their product. Adding new instructions
was one way to do that.

I believe if you check you will find the term CISC was not even coined
until after RISC was invented.  So CISC really just means, "what we used
to do".

Well, yes, but why did "we used to do that"? For S/360, a lot of
software was still written in pure assembler, for one reason to make
it faster, and for another to make it smaller. And people were just
starting to learn that people (writing software) are more expensive
than machines (hardware). Well, that is about the point when it became
true. For earlier machines you were lucky to get one compiler
enough system to run it.

Sure, none of this stuff was done without some purpose.  My point is
that there was no *common* theme to the various CISC instruction sets.
Everybody was doing their own thing until RISC came along with a basic
philosophy.  Someone felt the need to give a name to the previous way of
doing things and CISC seemed appropriate.  No special meaning in the
name actually, just a contrast to the "Reduced" in RISC.

I don't think this is a very interesting topic really.  It started in
response to a comment by Rod.

--

Rick
Exactly. The CISC moniker was simply applied to an entire *generation*
of processors by the RISC guys. The term RISC was coined by David
Patterson at the University of California between 1980 and 1984.
http://en.wikipedia.org/wiki/Berkeley_RISC

*Until* that time, there was no need for the term "CISC", because
there was no RISC concept that required a differentiation!

I always thought of the Z80 as a CISC 8-bitter; it has some useful
memory move instructions (LDIR etc).
 
On Apr 5, 1:07 am, rickman <gnu...@gmail.com> wrote:
On 4/4/2013 7:16 AM, Albert van der Horst wrote:

In article<kjin8q$so...@speranza.aioe.org>,
glen herrmannsfeldt<g...@ugcs.caltech.edu>  wrote:
In comp.arch.fpga Rod Pemberton<do_not_h...@notemailnotq.cpm>  wrote:
"rickman"<gnu...@gmail.com>  wrote in message
news:kjf48e$5qu$1@dont-email.me...
Weren't you the person who brought CISC into this discussion?

Yes.

Why are you asking this question about CISC?

You mentioned code density.  AISI, code density is purely a CISC
concept.  They go together and are effectively inseparable.

They do go together, but I am not so sure that they are inseparable.

CISC began when much coding was done in pure assembler, and anything
that made that easier was useful. (One should figure out the relative
costs, but at least it was in the right direction.)

But, of course, this is a fallacy. The same goal is accomplished by
macros, and better. Code density is the only valid reason.

I'm pretty sure that conclusion is not correct.  If you have an
instruction that does two or three memory accesses in one instruction
and you replace it with three instructions that do one memory access
each, you end up with two extra memory accesses.  How is this faster?

That is one of the reasons why I want to increase code density, in my
machine it automatically improves execution time as well as reducing the
amount of storage needed.

--

Rick
I think you're on the right track. With FPGAs it's really quite simple
to execute all instructions in a single cycle. It's no big deal at all
- with MPY and DIV being exceptions. In the 'Forth CPU world' even
literals can be loaded in a single cycle. It then comes down to
careful selection of your instruction set. With a small enough
instruction set one can pack more than one instruction in a word - and
there's your code density. If you can pack more than one instruction
in a word, you can execute them in a single clock cycle. With added
complexity, you may even be able to execute them in parallel rather
than as a process.
 
On 04/05/2013 09:51 AM, Mark Wills wrote:

I'm pretty sure that conclusion is not correct. If you have an
instruction that does two or three memory accesses in one instruction
and you replace it with three instructions that do one memory access
each, you end up with two extra memory accesses. How is this faster?

That is one of the reasons why I want to increase code density, in my
machine it automatically improves execution time as well as reducing the
amount of storage needed.

I think you're on the right track. With FPGAs it's really quite simple
to execute all instructions in a single cycle. It's no big deal at all
- with MPY and DIV being exceptions. In the 'Forth CPU world' even
literals can be loaded in a single cycle. It then comes down to
careful selection of your instruction set. With a small enough
instruction set one can pack more than one instruction in a word - and
there's your code density. If you can pack more than one instruction
in a word, you can execute them in a single clock cycle. With added
complexity, you may even be able to execute them in parallel rather
than as a process.
Multiple instructions per word sounds like a bad idea. It requires
instructions that are so small that they can't do very much, so you need
more of them. And if you need 2 or more small instructions to do
whatever 1 big instruction does, it's better to use 1 big instruction
since it makes instruction decoding more efficient and simpler.
 
On 04/05/2013 04:33 PM, Mark Wills wrote:

That is one of the reasons why I want to increase code density, in my
machine it automatically improves execution time as well as reducing the
amount of storage needed.
I think you're on the right track. With FPGAs it's really quite simple
to execute all instructions in a single cycle. It's no big deal at all
- with MPY and DIV being exceptions. In the 'Forth CPU world' even
literals can be loaded in a single cycle. It then comes down to
careful selection of your instruction set. With a small enough
instruction set one can pack more than one instruction in a word - and
there's your code density. If you can pack more than one instruction
in a word, you can execute them in a single clock cycle. With added
complexity, you may even be able to execute them in parallel rather
than as a process.

Multiple instructions per word sounds like a bad idea. It requires
instructions that are so small that they can't do very much, so you need
more of them. And if you need 2 or more small instructions to do
whatever 1 big instruction does, it's better to use 1 big instruction
since it makes instruction decoding more efficient and simpler.

If you're referring to general purpose CPUs I'm inclined to agree.
When commenting earlier, I had in mind a Forth CPU, which executes
Forth words as native CPU instructions. That is, the instruction set
is Forth.

Since there is no need for things like addressing modes in (shall we
say) classical Forth processors then you don't actually need things
like bit-fields in which to encode registers and/or addressing modes*.
All you're left with is the instructions themselves. And you don't
need that many bits for that.

A Forth chip that I am collaborating on right now has two 6-bit
instruction slots per 16-bit word, and a 4-bit 'special' field for
other stuff. We haven't allocated all 64 instructions yet.

* Even though it's not strictly necessary in a classical registerless
Forth CPU, bit-fields can be useful. We're using a couple of bits to
tell the ALU if a word pushes results or pops arguments, for example.
Well, I was thinking about general purpose applications. A Forth CPU may
map well to a Forth program, but you also have to take into account how
well the problem you want to solve maps to a Forth program.

A minimal stack based CPU can be efficient if the values you need can be
kept on top of the stack. But if you need 5 or 6 intermediate values,
you'll need to store them in memory, resulting in expensive access to
them. Even getting the address of a memory location where a value is
stored can be expensive.

Compare that to a register based machine with 8 registers. You need 3
more opcode bits, but you get immediate access to a pool of 8
intermediate values. And, with some clever encoding (like rickman
suggested) some operations can be restricted to a subset of the
registers, reducing the number of encoding bits required.

It would be interesting to see a comparison using non-trivial
applications, and see how much code is required for one of those minimal
stack CPUs compared to a simple register based CPU.
 
On Apr 5, 1:31 pm, Arlet Ottens <usene...@c-scape.nl> wrote:
On 04/05/2013 09:51 AM, Mark Wills wrote:

I'm pretty sure that conclusion is not correct.  If you have an
instruction that does two or three memory accesses in one instruction
and you replace it with three instructions that do one memory access
each, you end up with two extra memory accesses.  How is this faster?

That is one of the reasons why I want to increase code density, in my
machine it automatically improves execution time as well as reducing the
amount of storage needed.
I think you're on the right track. With FPGAs it's really quite simple
to execute all instructions in a single cycle. It's no big deal at all
- with MPY and DIV being exceptions. In the 'Forth CPU world' even
literals can be loaded in a single cycle. It then comes down to
careful selection of your instruction set. With a small enough
instruction set one can pack more than one instruction in a word - and
there's your code density. If you can pack more than one instruction
in a word, you can execute them in a single clock cycle. With added
complexity, you may even be able to execute them in parallel rather
than as a process.

Multiple instructions per word sounds like a bad idea. It requires
instructions that are so small that they can't do very much, so you need
more of them. And if you need 2 or more small instructions to do
whatever 1 big instruction does, it's better to use 1 big instruction
since it makes instruction decoding more efficient and simpler.
If you're referring to general purpose CPUs I'm inclined to agree.
When commenting earlier, I had in mind a Forth CPU, which executes
Forth words as native CPU instructions. That is, the instruction set
is Forth.

Since there is no need for things like addressing modes in (shall we
say) classical Forth processors then you don't actually need things
like bit-fields in which to encode registers and/or addressing modes*.
All you're left with is the instructions themselves. And you don't
need that many bits for that.

A Forth chip that I am collaborating on right now has two 6-bit
instruction slots per 16-bit word, and a 4-bit 'special' field for
other stuff. We haven't allocated all 64 instructions yet.

* Even though it's not strictly necessary in a classical registerless
Forth CPU, bit-fields can be useful. We're using a couple of bits to
tell the ALU if a word pushes results or pops arguments, for example.
 
In comp.arch.fpga Arlet Ottens <usenet+5@c-scape.nl> wrote:

(snip on MISC, RISC, and CISC)

Multiple instructions per word sounds like a bad idea. It requires
instructions that are so small that they can't do very much, so you need
more of them. And if you need 2 or more small instructions to do
whatever 1 big instruction does, it's better to use 1 big instruction
since it makes instruction decoding more efficient and simpler.
Well, it depends on your definition of instruction.

The CDC 60 bit computers have multiple instructions per 60 bit word,
but then the word is big enough.

Now, consider that on many processors an instruction can push or pop
only one register to/from the stack.

The 6809 push and pop instructions have a bit mask that allows up
to eight registers to be pushed or popped in one instruction.
(It takes longer for more, but not as long as for separate
instructions.)

For x87, there are instructions that do some math function and
then do, or don't, remove values from the stack. Some might count
those as two or three instructions.

-- glen
 
On 4/4/2013 9:17 PM, Albert van der Horst wrote:
In article<kjkpnp$qdp$1@dont-email.me>, rickman<gnuarm@gmail.com> wrote:

Albert, do you have a reference about this?

Not in a wikipedia sense where you're not allowed to mention original
research, and are only quoting what is in the books. It is more experience
than school knowledge.

If you want to see what can be done with a good macro processor like m4
study the one source of the 16/32/64 bit ciforth x86 for linux/Windows/Apple.
See my site below.

The existence of an XLAT instruction (to name an example) OTOH does virtually
nothing to make the life of an assembler programmer better.

Groetjes Albert
Ok, I'm a bit busy with a number of things to go into this area now, but
I appreciate the info.

I have used macro assemblers in the past and got quite good at them. I
even developed a micro programmed board which used a macro assembler and
added new opcodes for my design (it was a company boilerplate design
adapted to a new host) to facilitate some self test functions. I sorta
got yelled at because this meant you needed my assembler adaptations to
assemble the code for the board. I wasn't ordered to take it out, so I
didn't; it remained. I left within a year and the company eventually
folded. You can make a connection if you wish... lol

--

Rick
 
On 4/5/2013 3:51 AM, Mark Wills wrote:
On Apr 5, 1:07 am, rickman<gnu...@gmail.com> wrote:
On 4/4/2013 7:16 AM, Albert van der Horst wrote:

In article<kjin8q$so...@speranza.aioe.org>,
glen herrmannsfeldt<g...@ugcs.caltech.edu> wrote:
In comp.arch.fpga Rod Pemberton<do_not_h...@notemailnotq.cpm> wrote:
"rickman"<gnu...@gmail.com> wrote in message
news:kjf48e$5qu$1@dont-email.me...
Weren't you the person who brought CISC into this discussion?

Yes.

Why are you asking this question about CISC?

You mentioned code density. AISI, code density is purely a CISC
concept. They go together and are effectively inseparable.

They do go together, but I am not so sure that they are inseparable.

CISC began when much coding was done in pure assembler, and anything
that made that easier was useful. (One should figure out the relative
costs, but at least it was in the right direction.)

But, of course, this is a fallacy. The same goal is accomplished by
macros, and better. Code density is the only valid reason.

I'm pretty sure that conclusion is not correct. If you have an
instruction that does two or three memory accesses in one instruction
and you replace it with three instructions that do one memory access
each, you end up with two extra memory accesses. How is this faster?

That is one of the reasons why I want to increase code density, in my
machine it automatically improves execution time as well as reducing the
amount of storage needed.

--

Rick

I think you're on the right track. With FPGAs it's really quite simple
to execute all instructions in a single cycle. It's no big deal at all
- with MPY and DIV being exceptions. In the 'Forth CPU world' even
literals can be loaded in a single cycle. It then comes down to
careful selection of your instruction set. With a small enough
instruction set one can pack more than one instruction in a word - and
there's your code density. If you can pack more than one instruction
in a word, you can execute them in a single clock cycle. With added
complexity, you may even be able to execute them in parallel rather
than as a process.
I have looked at the multiple instruction in parallel thing and have not
made any conclusions yet. To do that you need a bigger instruction word
and smaller instruction opcodes. The opcodes essentially have to become
specific for the execution units. My design has three, the data stack,
the return stack and the instruction fetch. It is a lot of work to
consider this because there are so many tradeoffs to analyze.

One issue that has always bugged me is that allocating some four or five
bits for the instruction fetch instruction seems very wasteful when some
70-90% of the time the instruction is IP <- IP+1. Trying to Huffman
encode this is a bit tricky as what do you do with the unused bit??? I
gave up looking at this until after I master the Rubik's Cube. lol

It does clearly have potential, it's just a bear to unlock without
adding a lot of data pathways to the design.

--

Rick
 
On 4/5/2013 11:01 AM, Arlet Ottens wrote:
On 04/05/2013 04:33 PM, Mark Wills wrote:

That is one of the reasons why I want to increase code density, in my
machine it automatically improves execution time as well as
reducing the
amount of storage needed.
I think you're on the right track. With FPGAs it's really quite simple
to execute all instructions in a single cycle. It's no big deal at all
- with MPY and DIV being exceptions. In the 'Forth CPU world' even
literals can be loaded in a single cycle. It then comes down to
careful selection of your instruction set. With a small enough
instruction set one can pack more than one instruction in a word - and
there's your code density. If you can pack more than one instruction
in a word, you can execute them in a single clock cycle. With added
complexity, you may even be able to execute them in parallel rather
than as a process.

Multiple instructions per word sounds like a bad idea. It requires
instructions that are so small that they can't do very much, so you need
more of them. And if you need 2 or more small instructions to do
whatever 1 big instruction does, it's better to use 1 big instruction
since it makes instruction decoding more efficient and simpler.

If you're referring to general purpose CPUs I'm inclined to agree.
When commenting earlier, I had in mind a Forth CPU, which executes
Forth words as native CPU instructions. That is, the instruction set
is Forth.

Since there is no need for things like addressing modes in (shall we
say) classical Forth processors then you don't actually need things
like bit-fields in which to encode registers and/or addressing modes*.
All you're left with is the instructions themselves. And you don't
need that many bits for that.

A Forth chip that I am collaborating on right now has two 6-bit
instruction slots per 16-bit word, and a 4-bit 'special' field for
other stuff. We haven't allocated all 64 instructions yet.

* Even though it's not strictly necessary in a classical registerless
Forth CPU, bit-fields can be useful. We're using a couple of bits to
tell the ALU if a word pushes results or pops arguments, for example.

Well, I was thinking about general purpose applications. A Forth CPU may
map well to a Forth program, but you also have to take into account how
well the problem you want to solve maps to a Forth program.
Just to make a point, my original post wasn't really about Forth
processors specifically. It was about MISC processors which may or may
not be programmed in Forth.


A minimal stack based CPU can be efficient if the values you need can be
kept on top of the stack. But if you need 5 or 6 intermediate values,
you'll need to store them in memory, resulting in expensive access to
them. Even getting the address of a memory location where a value is
stored can be expensive.
There aren't many algorithms that can't be dealt with reasonably using
the data and return stacks. The module I was coding that made me want
to try a register approach has four input variables, two double
precision. These are in memory and have to be read in because this is
really an interrupt routine and there is no stack input. The main
process would have access to these parameters to update them.

The stack routine uses up to 9 levels on the data stack currently. But
that is because to optimize the execution time I was waiting until the
end when I could save them off more efficiently all at once. But - in
the process of analyzing the register processor I realized that I had an
unused opcode that would allow me to save the parameters in the same way
I am doing it in the register based code, reducing the stack usage to
five items max.

I understand that many stack based machines only have an 8 level data
stack, period. The GA144 is what, 10 words, 8 circular and two registers?


Compare that to a register based machine with 8 registers. You need 3
more opcode bits, but you get immediate access to a pool of 8
intermediate values. And, with some clever encoding (like rickman
suggested) some operations can be restricted to a subset of the
registers, reducing the number of encoding bits required.
That's the whole enchilada with a register MISC, figuring out how to
encode the registers in the opcodes. I think I've got a pretty good
trade off currently, but I have not looked at running any Forth code on
it yet. This may be much less efficient than a stack machine... duh!


It would be interesting to see a comparison using non-trivial
applications, and see how much code is required for one of those minimal
stack CPUs compared to a simple register based CPU.
I have been invited to give a presentation to the SVFIG on my design. I
need to work up some details and will let you know when that will be.
They are talking about doing a Google+ hangout which I suppose is a
video since it would be replayed for the meeting. I'm not sure I can be
ready in time for the next meeting, but I'll see. I don't think my
design is all that novel in the grand scheme of MISC. So I likely will
do this as a comparison between the two designs. I think I also need to
bone up on some of the other designs out there like the J1 and the B16.

--

Rick
 
On 4/5/2013 3:51 AM, Mark Wills wrote:
I think you're on the right track. With FPGAs it's really quite simple
to execute all instructions in a single cycle. It's no big deal at all
- with MPY and DIV being exceptions. In the 'Forth CPU world' even
literals can be loaded in a single cycle. It then comes down to
careful selection of your instruction set. With a small enough
instruction set one can pack more than one instruction in a word - and
there's your code density. If you can pack more than one instruction
in a word, you can execute them in a single clock cycle. With added
complexity, you may even be able to execute them in parallel rather
than as a process.
This morning I found one exception to the one cycle rule. Block RAM.
My original design was for the Altera ACEX part which is quite old now,
but had async read on the block rams for memory and stack. So the main
memory and stacks could be accessed for reading and writing in the same
clock cycle, read/modify/write. You can't do that with today's block
RAMs, they are totally synchronous.

I was looking at putting the registers in a block RAM so I could get two
read ports and two write ports. But this doesn't work the way an async
read RAM will. So I may consider using a multiphase clock. The
operations that really mess me up is the register indirect reads of
memory, like stack accesses... or any memory accesses really, that is
the only way to address memory, via register. So the address has to be
read from register RAM, then the main memory RAM is read, then the
result is written back to the register RAM. Wow! That is three clock
edges I'll need.

If I decide to go with using block RAM for registers it will give me N
sets of regs so I have a big motivation. It also has potential for
reducing the total amount of logic since a lot of the multiplexing ends
up inside the RAM.

The multiphase clocking won't be as complex as using multiple machine
cycles for more complex instructions. But it is more complex than the
good old simple clock I have worked with. It will also require some
tighter timing and more complex timing constraints which are always hard
to get correct.

--

Rick
 
On 4/5/2013 8:31 AM, Arlet Ottens wrote:
On 04/05/2013 09:51 AM, Mark Wills wrote:

I'm pretty sure that conclusion is not correct. If you have an
instruction that does two or three memory accesses in one instruction
and you replace it with three instructions that do one memory access
each, you end up with two extra memory accesses. How is this faster?

That is one of the reasons why I want to increase code density, in my
machine it automatically improves execution time as well as reducing the
amount of storage needed.

I think you're on the right track. With FPGAs it's really quite simple
to execute all instructions in a single cycle. It's no big deal at all
- with MPY and DIV being exceptions. In the 'Forth CPU world' even
literals can be loaded in a single cycle. It then comes down to
careful selection of your instruction set. With a small enough
instruction set one can pack more than one instruction in a word - and
there's your code density. If you can pack more than one instruction
in a word, you can execute them in a single clock cycle. With added
complexity, you may even be able to execute them in parallel rather
than as a process.

Multiple instructions per word sounds like a bad idea. It requires
instructions that are so small that they can't do very much, so you need
more of them. And if you need 2 or more small instructions to do
whatever 1 big instruction does, it's better to use 1 big instruction
since it makes instruction decoding more efficient and simpler.
I think multiple instructions per word is a good idea if you have a wide
disparity in speed between instruction memory and the CPU clock rate.
If not, why introduce the added complexity? Well, unless you plan to
execute them simultaneously... I took a look at a 16 bit VLIW idea
once and didn't care for the result. There are just too many control
points so that a proper VLIW design would need more than just 16 bits I
think. At least in the stack design.

An 18 bit instruction word might be worth looking at in the register
CPU. But getting more parallelism in the register design will require
more datapaths and there goes the "minimal" part of the MISC.

So many options, so little time...

--

Rick
 
On 4/5/2013 10:33 AM, Mark Wills wrote:
On Apr 5, 1:31 pm, Arlet Ottens<usene...@c-scape.nl> wrote:
On 04/05/2013 09:51 AM, Mark Wills wrote:

I'm pretty sure that conclusion is not correct. If you have an
instruction that does two or three memory accesses in one instruction
and you replace it with three instructions that do one memory access
each, you end up with two extra memory accesses. How is this faster?

That is one of the reasons why I want to increase code density, in my
machine it automatically improves execution time as well as reducing the
amount of storage needed.
I think you're on the right track. With FPGAs it's really quite simple
to execute all instructions in a single cycle. It's no big deal at all
- with MPY and DIV being exceptions. In the 'Forth CPU world' even
literals can be loaded in a single cycle. It then comes down to
careful selection of your instruction set. With a small enough
instruction set one can pack more than one instruction in a word - and
there's your code density. If you can pack more than one instruction
in a word, you can execute them in a single clock cycle. With added
complexity, you may even be able to execute them in parallel rather
than as a process.

Multiple instructions per word sounds like a bad idea. It requires
instructions that are so small that they can't do very much, so you need
more of them. And if you need 2 or more small instructions to do
whatever 1 big instruction does, it's better to use 1 big instruction
since it makes instruction decoding more efficient and simpler.

If you're referring to general purpose CPUs I'm inclined to agree.
When commenting earlier, I had in mind a Forth CPU, which executes
Forth words as native CPU instructions. That is, the instruction set
is Forth.

Since there is no need for things like addressing modes in (shall we
say) classical Forth processors then you don't actually need things
like bit-fields in which to encode registers and/or addressing modes*.
All you're left with is the instructions themselves. And you don't
need that many bits for that.

A Forth chip that I am collaborating on right now has two 6-bit
instruction slots per 16-bit word, and a 4-bit 'special' field for
other stuff. We haven't allocated all 64 instructions yet.

* Even though it's not strictly necessary in a classical registerless
Forth CPU, bit-fields can be useful. We're using a couple of bits to
tell the ALU if a word pushes results or pops arguments, for example.
That can be useful. The automatic pop of operands is sometimes
expensive by requiring a DUP beforehand. In my design the fetch/store
words have versions to drop the address from the return stack or
increment and hold on to it. In looking at the register design I
realized it would be useful to just make the plus and the drop both
options, so now there are three versions: fetch, fetch++ and fetchK (keep).

--

Rick
 
rickman wrote:
So the main memory and stacks could be accessed for reading and writing
in the same clock cycle, read/modify/write. You can't do that with today's
block RAMs, they are totally synchronous.

I had the same problem when I first moved my XC4000 based RISC
over to the newer parts with registered Block RAM.

I ended up using opposite edge clocking, with a dual port BRAM,
to get what appears to be single cycle access on the data and
instruction ports.

As this approach uses the same clock, the constraints are painless;
but you now have half a clock for address -> BRAM setup, and half
for the BRAM data <-> core data setup. The latter can cause some
some timing issues if the core is configured with a byte lane mux
so as to support 8/16/32 bit {sign extending} loads.

-Brian
 
On 4/7/2013 5:59 PM, Brian Davis wrote:
rickman wrote:

So the main memory and stacks could be accessed for reading and writing
in the same clock cycle, read/modify/write. You can't do that with today's
block RAMs, they are totally synchronous.

I had the same problem when I first moved my XC4000 based RISC
over to the newer parts with registered Block RAM.

I ended up using opposite edge clocking, with a dual port BRAM,
to get what appears to be single cycle access on the data and
instruction ports.

As this approach uses the same clock, the constraints are painless;
but you now have half a clock for address -> BRAM setup, and half
for the BRAM data <-> core data setup. The latter can cause
some timing issues if the core is configured with a byte lane mux
so as to support 8/16/32 bit {sign extending} loads.
Yes, that was one way to solve the problem. The other I considered was
to separate the read and write on the two ports. Then the read would be
triggered from the address that was at the input to the address
register... from the previous cycle. So the read would *always* be done
and the data presented whether you used it or not. I'm not sure how
much power this would waste, but the timing impact would be small.

I looked at making the register block RAM part of the main memory
address space. This would require a minimum of three clock cycles in a
machine cycle, read address or data from register, use address to read
or write data from/to memory and then write data to register. If it
helps timing, the memory write can be done at the same time as the
register write. I'm not crazy about this approach, but I'm considering
how useful it would be to have direct address capability of the multiple
register banks.

Some of the comments about register vs. stacks and what I have seen of
the J1 have made me think about a hybrid approach using stacks in memory,
but with offset access, so items further down in the stack can be
operands, not just TOS and NOS. This has potential for saving stack
operations. The J1 has a two-bit field controlling the stack pointer; I
assume that is +1 to -2, or 1 push to 2 pops. The author claims this
provides some ability to combine Forth functions into one instruction,
but doesn't provide details. I guess the compiler code would have to be
examined to find out what combinations would be useful.

The compiler end is not my strong suit, but I suppose I could figure out
how to take advantage of features like this.

--

Rick
 
