MISC - Stack Based vs. Register Based

rickman
I have been working with stack based MISC designs in FPGAs for some
years. All along I have been comparing my work to the work of others.
These others were the conventional RISC type processors supplied by the
FPGA vendors as well as the many processor designs done by individuals
or groups as open source.

So far my CPUs have always ranked reasonably well in terms of speed, but
more importantly to me, very well in terms of size and code density. My
efforts have shown it hard to improve on code density by a significant
degree while simultaneously minimizing the resources used by the design.
Careful selection of the instruction set can both improve code density
and minimize logic used if measured together, but there is always a
tradeoff. One can always be improved at the expense of the other.

The last couple of days I was looking at some code I plan to use and
realized that it could be a lot more efficient if I could find a way to
use more parallelism inside the CPU and use fewer instructions. So I
started looking at defining separate opcodes for the two primary
function units in the design, the data stack and the return stack. Each
has its own ALU. The data stack has a full complement of capabilities
while the return stack can only add, subtract and compare. The return
stack is actually intended to be an "address" processing unit.

While trying to figure out how to maximize the parallel capabilities of
these units, I realized that many operations were just stack
manipulations. Then I read the thread about the relative "cost" of
stack ops vs memory accesses and I realized these were what I needed to
optimize. I needed to find a way to not use an instruction and a clock
cycle for moving data around on the stack.

In the thread on stack ops it was pointed out repeatedly that very often
the stack operands would be optimized to register operands, meaning they
wouldn't need to do the stack ops at all really. So I took a look at a
register based MISC design. Guess what, I don't see the disadvantage!
I have pushed this around for a couple of days and although I haven't
done a detailed design, I think I have looked at it enough to realize
that I can design a register oriented MISC CPU that will run as fast, if
not faster than my stack based design and it will use fewer
instructions. I still need to add some features like support for a
stack in memory, in other words, pre-increment/post-decrement (or the
other way around...), but I don't see where this is a bad design. It
may end up using *less* logic as well. My stack design provides access
to the stack pointers which require logic for both the pointers and
muxing them into the data stack for reading.

I guess looking at other peoples designs (such as Chuck's) has changed
my perspective over the years so that I am willing and able to do
optimizations in ways I would not have wanted to do in the past. But I
am a bit surprised that there has been so much emphasis on stack
oriented MISC machines when it may well be that register based MISC
designs are also very efficient, at least if you aren't building them to
service a C compiler or trying to match some ideal RISC model.

--

Rick
 
In comp.arch.fpga rickman <gnuarm@gmail.com> wrote:
I have been working with stack based MISC designs in FPGAs for some
years. All along I have been comparing my work to the work of others.
These others were the conventional RISC type processors supplied by the
FPGA vendors as well as the many processor designs done by individuals
or groups as open source.
You might also look at some VLIW designs. Not that the designs
themselves will be useful, but maybe some of the ideas that were used
would help.

Well, much of the idea of RISC is that code density isn't very
important, and that many of the more complicated instructions made
assembly language programming easier, but compilers didn't use them.

So far my CPUs have always ranked reasonably well in terms of speed, but
more importantly to me, very well in terms of size and code density.
Do you mean the size of CPU (lines of verilog) or size of the programs
that run on it?

My
efforts have shown it hard to improve on code density by a significant
degree while simultaneously minimizing the resources used by the design.
Careful selection of the instruction set can both improve code density
and minimize logic used if measured together, but there is always a
tradeoff. One can always be improved at the expense of the other.
Seems to me that much of the design of VAX was to improve code
density when main memory was still a very significant part of the
cost of a machine. The large number of addressing modes allowed for
efficient use of bits. It also greatly complicated making an efficient
pipelined processor. With little overlap and a microprogrammed CPU,
it is easy to go sequentially through the instruction bytes and process
them in order. Most of the complication is in the microcode itself.

But every level of logic adds delay. Using a wide bus to fast memory
is more efficient than a complicated decoder. But sometimes RISC went
too far. In early RISC, there was the idea of one cycle per instruction.
They couldn't do that for multiply, so they added multiply-step, an
instruction that you execute many times for each multiply operation.
(And maybe no divide at all.)

For VLIW, a very wide instruction word allows for specifying many
different operations at the same time. It relies on complicated
compilers to optimally pack the multiple operations into the
instruction stream.

The last couple of days I was looking at some code I plan to use and
realized that it could be a lot more efficient if I could find a way to
use more parallelism inside the CPU and use fewer instructions. So I
started looking at defining separate opcodes for the two primary
function units in the design, the data stack and the return stack. Each
has its own ALU. The data stack has a full complement of capabilities
while the return stack can only add, subtract and compare. The return
stack is actually intended to be an "address" processing unit.
This kind of thought is what led to the branch delay slot on many
RISC processors. There is work to be done for branches, especially
for branch to or return from subroutine. Letting the processor know
early allows for more overlap. So, on many early (maybe not now) RISC
machines the instruction after a branch instruction is executed before
the branch is taken. (It might be a NOOP, though.) Again, reduced code
density (if it is a NOOP) in exchange for a faster instruction cycle.

While trying to figure out how to maximize the parallel capabilities of
these units, I realized that many operations were just stack
manipulations. Then I read the thread about the relative "cost" of
stack ops vs memory accesses and I realized these were what I needed to
optimize. I needed to find a way to not use an instruction and a clock
cycle for moving data around on the stack.
Well, consider the x87 stack instructions. Some operations exist in
forms that do and don't pop something off the stack. A small increase
in the number of opcodes used allows for avoiding many POP instructions.

In the thread on stack ops it was pointed out repeatedly that very often
the stack operands would be optimized to register operands, meaning they
wouldn't need to do the stack ops at all really. So I took a look at a
register based MISC design. Guess what, I don't see the disadvantage!
Well, actually the 8087 was designed to be either a register or stack
machine, as many instructions can index into the stack. If you keep
track of the current stack depth, you can find data in the stack like
in registers. Well, it was supposed to be able to do that.

I have pushed this around for a couple of days and although I haven't
done a detailed design, I think I have looked at it enough to realize
that I can design a register oriented MISC CPU that will run as fast, if
not faster than my stack based design and it will use fewer
instructions. I still need to add some features like support for a
stack in memory, in other words, pre-increment/post-decrement (or the
other way around...), but I don't see where this is a bad design. It
may end up using *less* logic as well. My stack design provides access
to the stack pointers which require logic for both the pointers and
muxing them into the data stack for reading.
Well, the stack design has the advantage that you can use instruction
bits either for a memory address or for the operation, allowing for much
smaller instructions. But that only works as long as everything is in
the right place on the stack.

I guess looking at other peoples designs (such as Chuck's) has changed
my perspective over the years so that I am willing and able to do
optimizations in ways I would not have wanted to do in the past. But I
am a bit surprised that there has been so much emphasis on stack
oriented MISC machines which it may well be that register based MISC
designs are also very efficient, at least if you aren't building them to
service a C compiler or trying to match some ideal RISC model.
Seems to me that there hasn't been much done in stack machines since
the B5500, the first computer I ever used when I was about nine.

-- glen
 
On 03/29/2013 10:00 PM, rickman wrote:
I have been working with stack based MISC designs in FPGAs for some
years. All along I have been comparing my work to the work of others.
These others were the conventional RISC type processors supplied by the
FPGA vendors as well as the many processor designs done by individuals
or groups as open source.

So far my CPUs have always ranked reasonably well in terms of speed, but
more importantly to me, very well in terms of size and code density. My
efforts have shown it hard to improve on code density by a significant
degree while simultaneously minimizing the resources used by the design.
Careful selection of the instruction set can both improve code density
and minimize logic used if measured together, but there is always a
tradeoff. One can always be improved at the expense of the other.
I once made a CPU design for an FPGA that had multiple stacks. There was
a general purpose stack "A", two index stacks "X" and "Y", and a return
stack "R". ALU operations worked between A and any other stack, so they
only required 2 bits in the opcode. There was also a move instruction
that could move data from a source to a destination stack.

Having access to multiple stacks means you spend less time shuffling
data on the stack. There's no more need for swap, over, rot and similar
stack manipulation instructions. The only primitive operations you need
are push and pop.

For instance, I had a load instruction that could load from memory using
the address in the X stack, and push the result on the A stack. The cool
part is that the X stack itself isn't changed by this operation, so the
same address can be used multiple times. So, you could do a

LOAD (X) ; load from (X) and push on A
1 ; push literal on A
ADD ; add top two elements of A
STORE (X) ; pop A, and store in (X)

to increment a location in memory.

And if you wanted to increment X to access the next memory location,
you'd do:

1 ; push literal on A
ADD X ; pop X, pop A, add, and push result on A.
MOVE A, X ; pop A, and push on X
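
For anyone who wants to play with the idea, here is a rough Python model
of those two sequences. The class, method names and 8-bit width are my own
illustration of the behaviour described above, not the actual hardware:

# Toy model of the multi-stack idea described above. Class name, method
# names and memory size are invented for illustration only.
class ToyMultiStackCPU:
    def __init__(self, mem_size=256):
        self.A = []            # general purpose stack
        self.X = []            # index (address) stack
        self.mem = [0] * mem_size

    def lit(self, value):      # push an 8-bit literal on A
        self.A.append(value & 0xFF)

    def load_x(self):          # LOAD (X): push mem[top of X] on A, X unchanged
        self.A.append(self.mem[self.X[-1]])

    def add(self):             # ADD: pop top two of A, push their sum
        b, a = self.A.pop(), self.A.pop()
        self.A.append((a + b) & 0xFF)

    def store_x(self):         # STORE (X): pop A, store at the address on X
        self.mem[self.X[-1]] = self.A.pop()

    def add_x(self):           # ADD X: pop X, pop A, push the sum on A
        self.A.append((self.A.pop() + self.X.pop()) & 0xFF)

    def move_a_x(self):        # MOVE A,X: pop A, push on X
        self.X.append(self.A.pop())

cpu = ToyMultiStackCPU()
cpu.mem[10] = 41
cpu.X.append(10)
cpu.load_x(); cpu.lit(1); cpu.add(); cpu.store_x()   # increment mem[10]
assert cpu.mem[10] == 42 and cpu.X[-1] == 10         # address still on X
cpu.lit(1); cpu.add_x(); cpu.move_a_x()              # step X to mem[11]
assert cpu.X[-1] == 11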

It was an 8 bit architecture with 9 bit instructions (to match the FPGA
block RAM + parity bit). Having 9 bit instructions allows an 8 bit
literal push to be encoded in 1 instruction.

Feel free to e-mail if you want more details.
 
"rickman" <gnuarm@gmail.com> wrote in message
news:kj4vae$msi$1@dont-email.me...

I have been working with stack based MISC designs in FPGAs for
some years. All along I have been comparing my work to the work
of others. These others were the conventional RISC type
processors supplied by the FPGA vendors as well as the many
processor designs done by individuals or groups as open source.

So far my CPUs have always ranked reasonably well in terms of
speed, but more importantly to me, very well in terms of size
and code density. My efforts have shown it hard to improve on
code density by a significant degree while simultaneously
minimizing the resources used by the design. Careful selection
of the instruction set can both improve code density and
minimize logic used if measured together, but there is always a
tradeoff. One can always be improved at the expense of the
other.

The last couple of days I was looking at some code I plan to use
and realized that it could be a lot more efficient if I could
find a way to use more parallelism inside the CPU and use fewer
instructions. So I started looking at defining separate opcodes
for the two primary function units in the design, the data stack
and the return stack. Each has its own ALU. The data stack has
a full complement of capabilities while the return stack can
only add, subtract and compare. The return stack is actually
intended to be an "address" processing unit.

While trying to figure out how to maximize the parallel
capabilities of these units, I realized that many operations
were just stack manipulations. Then I read the thread about the
relative "cost" of stack ops vs memory accesses and I realized
these were what I needed to optimize. I needed to find a way to
not use an instruction and a clock cycle for moving data around
on the stack.

In the thread on stack ops it was pointed out repeatedly that
very often the stack operands would be optimized to register
operands, meaning they wouldn't need to do the stack ops at all
really. So I took a look at a register based MISC design.
Guess what, I don't see the disadvantage! I have pushed this
around for a couple of days and although I haven't done a
detailed design, I think I have looked at it enough to realize
that I can design a register oriented MISC CPU that will run as
fast, if not faster than my stack based design and it will use
fewer instructions. I still need to add some features like
support for a stack in memory, in other words,
pre-increment/post-decrement (or the other way around...), but I
don't see where this is a bad design. It may end up using
*less* logic as well. My stack design provides access to the
stack pointers which require logic for both the pointers and
muxing them into the data stack for reading.

I guess looking at other peoples designs (such as Chuck's) has
changed my perspective over the years so that I am willing and
able to do optimizations in ways I would not have wanted to do
in the past. But I am a bit surprised that there has been so
much emphasis on stack oriented MISC machines which it may well
be that register based MISC designs are also very efficient,
at least if you aren't building them to service a C compiler or
trying to match some ideal RISC model.
Are those your actual results or did you just reiterate what is on
Wikipedia? Yes, that's a serious question. Read the MISC page:
http://en.wikipedia.org/wiki/Minimal_instruction_set_computer

See ... ?!

Code density is a CISC concept. I don't see how it applies to
your MISC project. Increasing code density for a MISC processor
means implementing more powerful instructions, i.e., those that do
more work, while minimizing bytes in the instruction opcode
encoding. Even if you implement CISC-like instructions, you can't
forgo the MISC instructions you already have in order to add the
CISC-like instructions. So, to do that, you'll need to increase
the size of the instruction set, as well as implement a more
complicated instruction decoder. I.e., that means the processor
will no longer be MISC, but MISC+minimal CISC hybrid, or pure
CISC...

No offense, but you seem to be "reinventing the wheel" in terms of
microprocessor design. You're coming to the same conclusions that
were found in the 1980's, e.g., concluding a register based
machine can perform better than a stack based machine, except
you've applied it to MISC in an FPGA package... How is that a new
conclusion?

Also, please read up on CISC and RISC:
http://en.wikipedia.org/wiki/Reduced_instruction_set_computing
http://en.wikipedia.org/wiki/Complex_instruction_set_computing

You posted to two groups. Which group was the " ... thread about
the relative 'cost' of stack ops vs memory accesses ..." posted
on? comp.lang.forth? comp.arch.fpga? I can look it up, but I'd
rather not.

Also, you cross-posted to comp.arch.fpga. While they'll likely be
familiar with FPGAs, most there are not going to be familiar with
the features of stack-based processors or Forth processors that
you discuss indirectly within your post. They might not be
familiar with ancient CISC concepts such as "code density" either,
or understand why it was important at one point in time. E.g., I
suspect this Forth related stuff from above won't be widely
understood on c.a.f. without clarification:

"peoples[sic] designs (such as Chuck's)"
- various Forth processors by Charles Moore

"the data stack and the return stack."
- interpreted Forth machine model


Rod Pemberton
 
On 3/29/2013 5:43 PM, glen herrmannsfeldt wrote:
In comp.arch.fpga rickman<gnuarm@gmail.com> wrote:
I have been working with stack based MISC designs in FPGAs for some
years. All along I have been comparing my work to the work of others.
These others were the conventional RISC type processors supplied by the
FPGA vendors as well as the many processor designs done by individuals
or groups as open source.

You might also look at some VLIW designs. Not that the designs
themselves will be useful, but maybe some of the ideas that were used
would help.

Well, much of the idea of RISC is that code density isn't very
important, and that many of the more complicated instructions made
assembly language programming easier, but compilers didn't use them.
I am somewhat familiar with VLIW. I am also very familiar with
microcode which is the extreme VLIW. I have coded microcode for an I/O
processor on an attached array processor. That's like saying I coded a
DMA controller in a DSP chip, but before DSP chips were around.


So far my CPUs have always ranked reasonably well in terms of speed, but
more importantly to me, very well in terms of size and code density.

Do you mean the size of CPU (lines of verilog) or size of the programs
that run on it?
I don't count lines of HDL code. I can't see that as a useful metric
for much. I am referring to code density of the compiled code for the
target CPU. All of my work is in FPGAs and memory is limited inside the
chip... well, at least in the small ones... lol Some FPGAs now have
literally Mbits of memory on chip. Still, nearly all of my work is
miniature and low power, sometimes *very* low power. So there may only
be 6 block memories in the FPGA for example.


My
efforts have shown it hard to improve on code density by a significant
degree while simultaneously minimizing the resources used by the design.
Careful selection of the instruction set can both improve code density
and minimize logic used if measured together, but there is always a
tradeoff. One can always be improved at the expense of the other.

Seems to me that much of the design of VAX was to improve code
density when main memory was still a very significant part of the
cost of a machine. The large number of addressing modes allowed for
efficient use of bits. It also greatly complicated making an efficient
pipelined processor. With little overlap and a microprogrammed CPU,
it is easy to go sequentially through the instruction bytes and process
them in order. Most of the complication is in the microcode itself.
One of the big differences is that they were not much limited in the
size of the machine itself. They could throw lots of gates at the
problem and not worry too much other than how it impacted speed. I am
again more limited in the small FPGAs I use and I expect I will achieve
higher clock speeds by keeping the machine simple as well. Also, as an
architectural decision, there is no microcode and all instructions are
one clock cycle. This helps with simplification and also makes
interrupt processing very simple.


But every level of logic adds delay. Using a wide bus to fast memory
is more efficient than a complicated decoder. But sometimes RISC went
too far. In early RISC, there was the idea of one cycle per instruction.
They couldn't do that for multiply, so they added multiply-step, an
instruction that you execute many times for each multiply operation.
(And maybe no divide at all.)
I"m not sure what your point is. What part of this is "too far"? This
is exactly the type of design I am doing, but to a greater extent.


For VLIW, a very wide instruction word allows for specifying many
different operations at the same time. It relies on complicated
compilers to optimally pack the multiple operations into the
instruction stream.
Yes, in theory that is what VLIW is. This is just one step removed from
microcode where the only limitation to how parallel operations can be is
the data/address paths themselves. The primary application of VLIW I
have seen is in the TI 6000 series DSP chips. But in reality this is
not what I consider VLIW. This design uses eight CPUs which are mostly
similar, but not quite identical with two sets of four CPU units sharing
a register file, IIRC. In reality each of the eight CPUs gets its own
32 bit instruction stream. They all operate in lock step, but you can't
do eight FIR filters. I think that of the four in a set, two are set up to
do full math, etc., and two are able to generate addresses. So this is
really two CPUs with dual MACs and two address generators as it ends up
being used most of the time. But then they make it run at clocks of
over 1 GHz so it is a damn fast DSP and handles most of the cell phone
calls as part of the base station.

Regardless of whether it is a "good" example of VLIW, it was a marketing
success.


The last couple of days I was looking at some code I plan to use and
realized that it could be a lot more efficient if I could find a way to
use more parallelism inside the CPU and use fewer instructions. So I
started looking at defining separate opcodes for the two primary
function units in the design, the data stack and the return stack. Each
has its own ALU. The data stack has a full complement of capabilities
while the return stack can only add, subtract and compare. The return
stack is actually intended to be an "address" processing unit.

This kind of thought is what led to the branch delay slot on many
RISC processors. There is work to be done for branches, especially
for branch to or return from subroutine. Letting the processor know
early allows for more overlap. So, on many early (maybe not now) RISC
machines the instruction after a branch instruction is executed before
the branch is taken. (It might be a NOOP, though.) Again, reduced code
density (if it is a NOOP) in exchange for a faster instruction cycle.
That is a result of the heavy pipelining that is being done. I like to
say my design is not pipelined, but someone here finally convinced me
that my design *is* pipelined with the execution in parallel with the
next instruction fetch, but there is never a need to stall or flush
because there are no conflicts.
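
To sketch what that overlap looks like (a made-up five-instruction
program, not anything from my real design):

# Fetch/execute overlap with no stalls: while instruction i executes,
# instruction i+1 is being fetched. Program and labels are made up.
program = ["I0", "I1", "I2", "I3", "I4"]

for cycle in range(len(program) + 1):
    fetch = program[cycle] if cycle < len(program) else "--"
    execute = program[cycle - 1] if cycle > 0 else "--"
    print(f"cycle {cycle}: fetch {fetch}  execute {execute}")
# Once the pipe is full, one instruction still completes every cycle, and
# there is nothing to flush because nothing speculative was started.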

I want a design to be fast, but not at the expense of complexity. That
is one way I think like Chuck Moore. Keep it simple and that gives you
speed.


While trying to figure out how to maximize the parallel capabilities of
these units, I realized that many operations were just stack
manipulations. Then I read the thread about the relative "cost" of
stack ops vs memory accesses and I realized these were what I needed to
optimize. I needed to find a way to not use an instruction and a clock
cycle for moving data around on the stack.

Well, consider the x87 stack instructions. Some operations exist in
forms that do and don't pop something off the stack. A small increase
in the number of opcodes used allows for avoiding many POP instructions.
In Forth speak a POP would be a DROP. That is not often used in Forth
really, or in my apps. I just wrote some code for my stack CPU and I
think there were maybe two DROPs in just over 100 instructions. I am
talking about the DUPs, SWAPs, OVERs and such. They end up being needed
enough that it makes the register design look good... at least at first
blush. I am looking at how to organize a register based instruction set
without expanding the size of the instructions. I'm realizing that is
one issue with registers, you have to specify them. But I don't need to
make the machine totally general like the goal for RISC. That is so
writing compilers is easier. I don't need to consider that.
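
A quick back-of-the-envelope illustration of that cost (the register
counts and the 9-bit word are just example numbers, not a finished
encoding):

# How many opcode bits are left in a fixed 9-bit instruction once the
# register fields are spent. All figures here are hypothetical.
WORD = 9
for regs in (4, 8, 16):
    field = regs.bit_length() - 1      # bits needed to name one register
    two_op = WORD - 2 * field          # opcode bits left for two-register ops
    one_op = WORD - field              # opcode bits left for one-register ops
    print(f"{regs:2} regs: {two_op} opcode bits ({2**two_op} codes) for "
          f"2-register ops, {one_op} bits ({2**one_op} codes) for 1-register ops")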


In the thread on stack ops it was pointed out repeatedly that very often
the stack operands would be optimized to register operands, meaning they
wouldn't need to do the stack ops at all really. So I took a look at a
register based MISC design. Guess what, I don't see the disadvantage!

Well, actually the 8087 was designed to be either a register or stack
machine, as many instructions can index into the stack. If you keep
track of the current stack depth, you can find data in the stack like
in registers. Well, it was supposed to be able to do that.

I have pushed this around for a couple of days and although I haven't
done a detailed design, I think I have looked at it enough to realize
that I can design a register oriented MISC CPU that will run as fast, if
not faster than my stack based design and it will use fewer
instructions. I still need to add some features like support for a
stack in memory, in other words, pre-increment/post-decrement (or the
other way around...), but I don't see where this is a bad design. It
may end up using *less* logic as well. My stack design provides access
to the stack pointers which require logic for both the pointers and
muxing them into the data stack for reading.

Well, the stack design has the advantage that you can use instruction
bits either for a memory address or for the operation, allowing for much
smaller instructions. But that only works as long as everything is in
the right place on the stack.
Yes, it is a tradeoff between instruction size and the number of ops
needed to get a job done. I'm looking at trimming the instruction size
down to give a workable subset for register operations.


I guess looking at other peoples designs (such as Chuck's) has changed
my perspective over the years so that I am willing and able to do
optimizations in ways I would not have wanted to do in the past. But I
am a bit surprised that there has been so much emphasis on stack
oriented MISC machines which it may well be that register based MISC
designs are also very efficient, at least if you aren't building them to
service a C compiler or trying to match some ideal RISC model.

Seems to me that there hasn't been much done in stack machines since
the B5500, the first computer I ever used when I was about nine.
I wouldn't say that. There may not be many commercial designs out
there, but there have been a few stack based CPU chips (mostly from the
work of Chuck Moore) and there are a number of stack CPUs internal to
chips. Just ask Bernd Paysan. He has done several I believe. Mine has
only been run in FPGAs.

--

Rick
 
On 3/30/2013 3:20 AM, Arlet Ottens wrote:
On 03/29/2013 10:00 PM, rickman wrote:
I have been working with stack based MISC designs in FPGAs for some
years. All along I have been comparing my work to the work of others.
These others were the conventional RISC type processors supplied by the
FPGA vendors as well as the many processor designs done by individuals
or groups as open source.

So far my CPUs have always ranked reasonably well in terms of speed, but
more importantly to me, very well in terms of size and code density. My
efforts have shown it hard to improve on code density by a significant
degree while simultaneously minimizing the resources used by the design.
Careful selection of the instruction set can both improve code density
and minimize logic used if measured together, but there is always a
tradeoff. One can always be improved at the expense of the other.


I once made a CPU design for an FPGA that had multiple stacks. There was
a general purpose stack "A", two index stacks "X" and "Y", and a return
stack "R". ALU operations worked between A and any other stack, so they
only required 2 bits in the opcode. There was also a move instruction
that could move data from a source to a destination stack.
Let's see, this would be a hybrid really, between register based and
stack based as it has multiple stacks which must be selected in the
instruction set like registers, just not many.


Having access to multiple stacks means you spend less time shuffling
data on the stack. There's no more need for swap, over, rot and similar
stack manipulation instructions. The only primitive operations you need
are push and pop.
There is the extra hardware required to implement multiple stacks. Why
have quite so many? I use the return stack for addresses. That works
pretty well. Maybe A, X and R?


For instance, I had a load instruction that could load from memory using
the address in the X stack, and push the result on the A stack. The cool
part is that the X stack itself isn't changed by this operation, so the
same address can be used multiple times. So, you could do a

LOAD (X) ; load from (X) and push on A
1 ; push literal on A
ADD ; add top two elements of A
STORE (X) ; pop A, and store in (X)

to increment a location in memory.
But you aren't quite done yet, at least if you care about stack
overflows. Chuck's designs don't care. If he has data on the bottom of
the stack he can just leave it. Otherwise you need to drop the address
on the X stack.

I looked at this when designing my MISC processor. I ended up with two
fetches and two stores. One just does the fetch or store and pops the
address. The other does a post increment and retains the address to be
used in a loop. This one would have to be dropped at the end.


And if you wanted to increment X to access the next memory location,
you'd do:

1 ; push literal on A
ADD X ; pop X, pop A, add, and push result on A.
MOVE A, X ; pop A, and push on X
Add an autoincrement to the index stacks and these three instructions go
away.


It was an 8 bit architecture with 9 bit instructions (to match the FPGA
block RAM + parity bit). Having 9 bit instructions allows an 8 bit
literal push to be encoded in 1 instruction.
I was all about 9 bit instructions and 18 bit address/data bus sizes
until I started working with the iCE40 parts which only have 8 bit
memory. I think the parity bit is a comms industry thing. That one bit
makes a *big* difference in the instruction set capabilities.


Feel free to e-mail if you want more details.
Thanks.

--

Rick
 
On 03/30/2013 10:54 PM, Rod Pemberton wrote:

I guess looking at other peoples designs (such as Chuck's) has
changed my perspective over the years so that I am willing and
able to do optimizations in ways I would not have wanted to do
in the past. But I am a bit surprised that there has been so
much emphasis on stack oriented MISC machines which it may well
be that register based MISC designs are also very efficient,
at least if you aren't building them to service a C compiler or
trying to match some ideal RISC model.


Are those your actual results or did you just reiterate what is on
Wikipedia? Yes, that's a serious question. Read the MISC page:
http://en.wikipedia.org/wiki/Minimal_instruction_set_computer

See ... ?!
It sounds to me like rickman is questioning the (unsupported) claims on
Wikipedia that stack based machines have an advantage in size and/or
simplicity, not reiterating them.

Code density is a CISC concept. I don't see how it applies to
your MISC project. Increasing code density for a MISC processor
means implementing more powerful instructions, i.e., those that do
more work, while minimizing bytes in the instruction opcode
encoding. Even if you implement CISC-like instructions, you can't
forgo the MISC instructions you already have in order to add the
CISC-like instructions. So, to do that, you'll need to increase
the size of the instruction set, as well as implement a more
complicated instruction decoder. I.e., that means the processor
will no longer be MISC, but MISC+minimal CISC hybrid, or pure
CISC...
It is perfectly possible to trade one type of MISC processor for another
one. The choice between stack and register based is an obvious one. If
you switch from stack to register based, there's no need to keep stack
manipulation instructions around.

Also, you cross-posted to comp.arch.fpga. While they'll likely be
familiar with FPGAs, most there are not going to be familiar with
the features of stack-based processors or Forth processors that
you discuss indirectly within your post. They might not be
familiar with ancient CISC concepts such as "code density" either,
or understand why it was important at one point in time. E.g., I
suspect this Forth related stuff from above won't be widely
understood on c.a.f. without clarification:
The design of simple and compact processors is of great interest to many
FPGA engineers. Plenty of FPGA designs need some sort of control
processor, and for cost reduction it's important to use minimal
resources. Like rickman said, this involves a careful balance between
implementation complexity, speed, and code density, while also
considering how much work it is to write/maintain the software that's
running on the processor.

Code density is still critically important. Fast memory is small, both
on FPGA as well as general purpose processors.
 
On 03/31/2013 12:00 AM, rickman wrote:

I once made a CPU design for an FPGA that had multiple stacks. There was
a general purpose stack "A", two index stacks "X" and "Y", and a return
stack "R". ALU operations worked between A and any other stack, so they
only required 2 bits in the opcode. There was also a move instruction
that could move data from a source to a destination stack.

Let's see, this would be a hybrid really, between register based and
stack based as it has multiple stacks which must be selected in the
instruction set like registers, just not many.
Exactly. And as a hybrid, it offers some advantages from both kinds of
designs.

Having access to multiple stacks means you spend less time shuffling
data on the stack. There's no more need for swap, over, rot and similar
stack manipulation instructions. The only primitive operations you need
are push and pop.

There is the extra hardware required to implement multiple stacks. Why
have quite so many? I use the return stack for addresses. That works
pretty well. Maybe A, X and R?
I had all stacks implemented in the same block RAM, just using different
sections of it. But you are right, in my implementation I had reserved
room for the Y stack, but never really implemented it. Just using the X
was sufficient for the application I needed the CPU for. Of course, for
other applications, having an extra register/stack that you can use as
a memory pointer could be useful, so I left the Y register in the
instruction encoding. For 3 registers you need 2 bits anyway, so it
makes sense to allow for 4.

But you aren't quite done yet, at least if you care about stack
overflows. Chuck's designs don't care. If he has data on the bottom of
the stack he can just leave it. Otherwise you need to drop the address
on the X stack.
Correct, if you no longer need the address, you need to drop it from the
stack. On the other hand, if you let it drop automatically, and you need
it twice, you would have to dup it. Intuitively, I would say that in
inner loops it would be more common that you'd want to reuse an address
(possibly with offset or autoinc/dec).

I was all about 9 bit instructions and 18 bit address/data bus sizes
until I started working with the iCE40 parts which only have 8 bit
memory. I think the parity bit is a comms industry thing. That one bit
makes a *big* difference in the instruction set capabilities.
Agreed. As soon as you use the parity bits, you're tied to a certain
architecture. On the other hand, if you've already chosen a certain
architecture, and the parity bits are available for free, it can be very
advantageous to use them.
 
In comp.arch.fpga rickman <gnuarm@gmail.com> wrote:

(snip)
Well, much of the idea of RISC is that code density isn't very
important, and that many of the more complicated instructions made
assembly language programming easier, but compilers didn't use them.

I am somewhat familiar with VLIW. I am also very familiar with
microcode which is the extreme VLIW. I have coded microcode for an I/O
processor on an attached array processor. That's like saying I coded a
DMA controller in a DSP chip, but before DSP chips were around.
Well, yes, VLIW is just about like compiling from source directly to
microcode. The VLIW processor currently in production is Itanium.
I believe 128 bit wide instructions, which specify many different
operations that can happen at the same time. 128 general and 128
floating point registers, 128 bit wide data bus, but it uses a lot
of power. Something like 100W, with Icc of 100A and Vcc about 1V.

I have an actual RX2600 dual Itanium box, but don't run it very
often, mostly because of the power used.

(snip)

But every level of logic adds delay. Using a wide bus to fast memory
is more efficient than a complicated decoder. But sometimes RISC went
too far. In early RISC, there was the idea of one cycle per instruction.
They couldn't do that for multiply, so they added multiply-step, an
instruction that you execute many times for each multiply operation.
(And maybe no divide at all.)

I"m not sure what your point is. What part of this is "too far"? This
is exactly the type of design I am doing, but to a greater extent.
The early SPARC didn't have a multiply instruction. (Since it couldn't
be done in one cycle.) Instead, they did multiply, at least the Sun
machines did, through software emulation, with a software trap.

Early when I started using a Sun4/110 I generated a whole set of fonts
for TeX from Metafont source, which does a lot of multiply. Unix (SunOS)
keeps track of user time (what your program does) and system time (what
the OS does while executing your program). Multiply counted as system
time, and could be a large fraction of the total time.

For VLIW, a very wide instruction word allows for specifying many
different operations at the same time. It relies on complicated
compilers to optimally pack the multiple operations into the
instruction stream.

Yes, in theory that is what VLIW is. This is just one step removed from
microcode where the only limitation to how parallel operations can be is
the data/address paths themselves. The primary application of VLIW I
have seen is in the TI 6000 series DSP chips. But in reality this is
not what I consider VLIW. This design uses eight CPUs which are mostly
similar, but not quite identical with two sets of four CPU units sharing
a register file, IIRC. In reality each of the eight CPUs gets its own
32 bit instruction stream. They all operate in lock step, but you can't
do eight FIR filters. I think of the four in a set, two are set up to
do full math, etc and two are able to generate addresses. So this is
really two CPUs with dual MACs and two address generators as it ends up
being used most of the time. But then they make it run at clocks of
over 1 GHz so it is a damn fast DSP and handles most of the cell phone
calls as part of the base station.
For Itanium, the different units do different things. There are
instruction formats that divide up the bits in different ways to make
optimal use of the bits. I used to have the manual nearby, but I don't
see it right now.

(snip)

That is a result of the heavy pipelining that is being done. I like to
say my design is not pipelined, but someone here finally convinced me
that my design *is* pipelined with the execution in parallel with the
next instruction fetch, but there is never a need to stall or flush
because there are no conflicts.
Yes, that counts, but it gets much more interesting with pipelines
like the Cray-1 that, after some cycles of latency, generate one result
every clock cycle.

I want a design to be fast, but not at the expense of complexity. That
is one way I think like Chuck Moore. Keep it simple and that gives you
speed.
(snip)

In Forth speak a POP would be a DROP. That is not often used in Forth
really, or in my apps. I just wrote some code for my stack CPU and I
think there were maybe two DROPs in just over 100 instructions. I am
talking about the DUPs, SWAPs, OVERs and such. They end up being needed
enough that it makes the register design look good... at least at first
blush. I am looking at how to organize a register based instruction set
without expanding the size of the instructions. I'm realizing that is
one issue with registers, you have to specify them. But I don't need to
make the machine totally general like the goal for RISC. That is so
writing compilers is easier. I don't need to consider that.
For x87, they avoid the DUP, SWAP, and such by instructions being able
to specify any register in the stack. You can, for example, add any
stack register (there are eight) to the top of stack. I haven't thought
about it for a while, but I believe either pushing the result, or
replacing the previous top of the stack.
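
As a toy illustration of that addressing style (a plain Python list
standing in for the register stack; this is only in the spirit of the
x87, not its exact instruction semantics):

# Toy model of stack-relative operand addressing: the second operand is
# named by its depth in the stack instead of being SWAPped or OVERed to
# the top first.
stack = []                      # stack[-1] plays the role of st(0)

def push(x):
    stack.append(x)

def add_st(i):
    """st(0) <- st(0) + st(i): the buried operand is named by depth,
    so no DUP/SWAP/OVER is needed to bring it to the top first."""
    stack[-1] = stack[-1] + stack[-1 - i]

push(10)                        # will end up as st(2)
push(20)                        # st(1)
push(5)                         # st(0)
add_st(2)                       # add the value two deep straight into the top
assert stack == [10, 20, 15]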

(snip)

Well, the stack design has the advantage that you can use instruction
bits either for a memory address or for the operation, allowing for much
smaller instructions. But that only works as long as everything is in
the right place on the stack.

Yes, it is a tradeoff between instruction size and the number of ops
needed to get a job done. I'm looking at trimming the instruction size
down to give a workable subset for register operations.
Seems to me that one possibility is to have a really functional stack
operation instruction, such that, with the given number of bits, it
allows for the most operations. Some combination of DUP, SWAP, and POP
all at once. Though that isn't easy for the stack itself.
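
One way to picture such an instruction (my own sketch, with made-up
field encodings): let the operand bits select which of the top two
cells survive and in what order, so DROP, SWAP, DUP, OVER and NIP all
become one "permute" opcode:

# One "stack permute" instruction: the operand field picks which of the
# top two cells (N below T) survive and in what order. The field values
# below are invented for illustration.
def permute(stack, selection):
    t = stack.pop()             # T, the old top of stack
    n = stack.pop()             # N, the cell below it
    for pick in selection:      # rebuild bottom-first: 0 selects N, 1 selects T
        stack.append((n, t)[pick])

FIELDS = {                      # hypothetical operand-field encodings
    "SWAP": (1, 0),
    "DROP": (0,),
    "NIP":  (1,),
    "DUP":  (0, 1, 1),          # expressed over the top two cells
    "OVER": (0, 1, 0),
}

s = [3, 7]                      # N=3, T=7
permute(s, FIELDS["OVER"])      # ( n1 n2 -- n1 n2 n1 )
assert s == [3, 7, 3]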

-- glen
 
glen herrmannsfeldt wrote:


Seems to me that much of the design of VAX was to improve code
density when main memory was still a very significant part of the
cost of a machine.
The original VAX series (780, then 730 and 750) and the uVAX I and
uVAX II had no cache, so reducing rate of memory fetches was also
important. The first cached machine I used was the uVAX III (KA650)
and a good part of the performance increase was due to the cache.

Jon
 
"Arlet Ottens" <usenet+5@c-scape.nl> wrote in message
news:5157e1a1$0$6924$e4fe514c@news2.news.xs4all.nl...
On 03/30/2013 10:54 PM, Rod Pemberton wrote:


I guess looking at other peoples designs (such as Chuck's)
has changed my perspective over the years so that I am
willing and able to do optimizations in ways I would not have
wanted to do in the past. But I am a bit surprised that there
has been so much emphasis on stack oriented MISC machines
which it may well be that register based MISC designs are
also very efficient, at least if you aren't building them to
service a C compiler or trying to match some ideal RISC
model.


Are those your actual results or did you just reiterate what
is on Wikipedia? Yes, that's a serious question. Read the
MISC page: [link]

See ... ?!

It sounds to me rickman is questioning the (unsupported) claims
on wikipedia that stack based machines have an advantage in size
and/or simplicity, not reiterating them.
Well, you snipped a lot of context. I wouldn't have reformatted
all of it either.

Anyway, what I took from rickman's statements was that he had only
managed to confirm what is already known about MISC according to
Wikipedia. I.e., what was/is the point?

Code density is a CISC concept. I don't see how it applies to
your MISC project. Increasing code density for a MISC
processor means implementing more powerful instructions,
i.e., those that do more work, while minimizing bytes in the
instruction opcode encoding. Even if you implement
CISC-like instructions, you can't forgo the MISC instructions
you already have in order to add the CISC-like instructions.
So, to do that, you'll need to increase the size of the
instruction set, as well as implement a more complicated
instruction decoder. I.e., that means the processor will no
longer be MISC, but MISC+minimal CISC hybrid, or pure
CISC...

It is perfectly possible to trade one type of MISC processor for
another one. The choice between stack and register based is an
obvious one. If you switch from stack to register based, there's
no need to keep stack manipulation instructions around.
Yes, true. But, what does eliminating MISC stack instructions and
replacing them with MISC register instructions have to do with the
CISC concept of code density or with CISC instructions? ...

Also, you cross-posted to comp.arch.fpga. While they'll
likely be familiar with FPGAs, most there are not going
to be familiar with the features of stack-based processors
or Forth processors that you discuss indirectly within your
post. They might not be familiar with ancient CISC
concepts such as "code density" either, or understand why
it was important at one point in time. E.g., I suspect this
Forth related stuff from above won't be widely
understood on c.a.f. without clarification:

The design of simple and compact processors is of great interest
to many FPGA engineers. Plenty of FPGA designs need some
sort of control processor, and for cost reduction it's important
to use minimal resources. Like rickman said, this involves a
careful balance between implementation complexity, speed,
and code density, while also considering how much work it
is to write/maintain the software that's running on the
processor.

Code density is still critically important. Fast memory is
small, both on FPGA as well as general purpose processors.
So, shouldn't he dump the entire MISC instruction set he has, and
implement a CISC instruction set instead? That's the only way
he's going to get the "critically important" code density, which
I'll take it you rank well above MISC as being important. Of
course, a CISC instruction set requires a more complicated
instruction decoder ... So, it seems that either way he proceeds,
he is being constrained by the "minimal resources" of his FPGA.
That was what he stated:

"My efforts have shown it hard to improve on code density by a
significant degree while simultaneously minimizing the resources
used by the design."

I.e., if the FPGA he is attempting to use is insufficient to do
what he wants or needs, then it's insufficient. Or, he needs some
new techniques. He didn't explicitly ask for any though ...

Glen mentioned the numerous address modes of the VAX. The 6502
also had a lot of address modes and had instructions which used
zero-page as a set of fast registers. I would think that early,
compact designs like the 6502 and Z80 could be useful to rickman.
They had low transistor counts.

Z80 8500 transistors
6502 9000 transistors
8088 29000 transistors (for comparison...)


Rod Pemberton
 
In comp.arch.fpga Rod Pemberton <do_not_have@notemailnotq.cpm> wrote:

(snip)

Well, you snipped a lot of context. I wouldn't have reformatted
all of it either.

Anyway, what I took from rickman's statements was that he had only
managed to confirm what is already known about MISC according to
Wikipedia. I.e., what was/is the point?
(snip, someone wrote)
It is perfectly possible to trade one type of MISC processor for
another one. The choice between stack and register based is an
obvious one. If you switch from stack to register based, there's
no need to keep stack manipulation instructions around.

Yes, true. But, what does eliminating MISC stack instructions and
replacing them with MISC register instructions have to do with the
CISC concept of code density or with CISC instructions? ...
Seems to me that one could still Huffman code the opcode, even
within the MISC concept. That is, use fewer bits for more common
operations, or where it otherwise simplifies the result.
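
As an illustration of the possible gain (the opcode names and
frequencies below are invented; a real profile would come from compiled
application code):

# Huffman-coding a hypothetical opcode frequency profile to see how the
# average opcode length drops below a fixed-width encoding.
import heapq

freqs = {"LOAD": 30, "STORE": 20, "ADD": 15, "LIT": 15,
         "JMP": 8, "CALL": 6, "RET": 4, "XOR": 2}

# standard Huffman construction; the integer counter only breaks ties
heap = [(f, i, {op: ""}) for i, (op, f) in enumerate(freqs.items())]
heapq.heapify(heap)
counter = len(heap)
while len(heap) > 1:
    f1, _, c1 = heapq.heappop(heap)
    f2, _, c2 = heapq.heappop(heap)
    merged = {op: "0" + code for op, code in c1.items()}
    merged.update({op: "1" + code for op, code in c2.items()})
    heapq.heappush(heap, (f1 + f2, counter, merged))
    counter += 1

codes = heap[0][2]
total = sum(freqs.values())
avg = sum(freqs[op] * len(code) for op, code in codes.items()) / total
for op in freqs:
    print(f"{op:5s} {freqs[op]:3d}  {codes[op]}")
print(f"average opcode length: {avg:.2f} bits (vs 3 bits fixed-width)")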

As someone noted, you can have an N-1 bit load immediate instruction
where N is the instruction size.

In one of Knuth's TAOCP books he describes a two instruction computer.

Seems like if one is interested in MISC, someone should build that one.

Also, maybe a MIX and MMIX machine, maybe the decimal version.

For those who don't read TAOCP, MIX is defined independent of the
underlying base. Programs in MIXAL are supposed to assemble and
run correctly on hosts that use any base within the range specified.

Instruction bytes have, if I remember, between 64 and 100 possible
values, such that six bits or two decimal digits are possible
representations.

I believe that allows for bases 2, 3, 4, 8, 9, 10, and 16.

-- glen
 
On 04/02/2013 02:10 AM, Rod Pemberton wrote:

Yes, true. But, what does eliminating MISC stack instructions and
replacing them with MISC register instructions have to do with the
CISC concept of code density or with CISC instructions? ...
I don't think 'code density' is a CISC concept at all. Code density
applies to any kind of instruction encoding.

Any kind of MISC architecture will have a certain code density (for a
given application), and require a certain amount of FPGA resources. And
if you store the code in local FPGA memory, and you know your
application, you can even convert it all to FPGA resources.

Obviously, one MISC design will have better resource use than another,
and one of the questions is whether we can make any kind of general
statement about stack vs register implementation.

So, shouldn't he dump the entire MISC instruction set he has, and
implement a CISC instruction set instead? That's the only way
he's going to get the "critically important" code density, which
I'll take it you rank well above MISC as being important. Of
course, a CISC instruction set requires a more complicated
instruction decoder ... So, it seems that either way he proceeds,
he is being constrained by the "minimal resources" of his FPGA.
Exactly. If you want to get FPGA resources low, a MISC design seems most
appropriate. But within the MISC concept, there are still plenty of
design decisions left.

Really, the biggest problem with FPGA CPU design is that there are too
many design decisions, and each decision influences everything else.

Glen mentioned the numerous address modes of the VAX. The 6502
also had a lot of address modes and had instructions which used
zero-page as a set of fast registers. I would think that early,
compact designs like the 6502 and Z80 could be useful to rickman.
They had low transistor counts.

Z80 8500 transistors
6502 9000 transistors
8088 29000 transistors (for comparison...)
Low transistor counts do not necessarily translate to low FPGA
resources. Early CPUs used dynamic storage, dual clock latches, pass
logic and tri-state buses to create really small designs that don't
necessarily map well to FPGAs. On the other hand, FPGAs have specific
features (depending on the brand) that can be exploited to create really
tight designs.

Also, these processors are slow, using multi cycle instructions, and 8
bit operations. That may not be acceptable. And even if low performance
is acceptable, there are numerous other ways where you can trade speed
for code density, so you'd have to consider these too. For instance, I
can replace pipelining with multi-cycle execution, or use microcode.
 
On 3/31/2013 2:34 PM, glen herrmannsfeldt wrote:
In comp.arch.fpga rickman<gnuarm@gmail.com> wrote:
(snip)
I have an actual RX2600 dual Itanium box, but don't run it very
often, mostly because of the power used.
A 100 watt lightbulb uses about $10 a month if left on 24/7. So I
wouldn't worry too much about the cost of running your Itanium. If you
turn it off when you aren't using it I can't imagine it would really
cost anything noticeable to run.


For Itanium, the different units do different things. There are
instruction formats that divide up the bits in different ways to make
optimal use of the bits. I used to have the manual nearby, but I don't
see it right now.
Yes, the array processor I worked on was coded from scratch, very
laboriously. The Itanium is trying to run existing code as fast as
possible. So they have a number of units to do similar things, but also
different types, all working in parallel as much as possible. Also the
parallelism in the array processor was all controlled by the programmer.
In regular x86 processors the parallelism is controlled by the chip
itself. I'm amazed sometimes at just how much they can get the chip to
do, no wonder there are 100's of millions of transistors on the thing.
I assume parallelism in the Itanium is back to the compiler smarts to
control since it needs to be coded into the VLIW instructions.


For x87, they avoid the DUP, SWAP, and such by instructions being able
to specify any register in the stack. You can, for example, add any
stack register (there are eight) to the top of stack. I haven't thought
about it for a while, but I believe either pushing the result, or
replacing the previous top of the stack.
That is something I have not yet looked at, a hybrid approach with a
stack, but also with stack top relative addressing. It is important to
keep the hardware simple, but that might not be too hard to implement.
Arlet talked about his hybrid processor design with multiple stacks. I
would need to give this a bit of thought as the question becomes, what
possible advantage would a hybrid processor have over the other two?
Actually, the image in my mind is a bit like the C language stack frame
model. You can work with addressing relative to the top of stack and
when a sub is called it layers its variables on top of the existing
stack. That would require a bit of entry and exit code, so there will
be a tradeoff between simple register addressing and simple subroutine
entry.
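
To make that concrete (a sketch only; the machine, sizes and the
entry/exit convention are invented for the example):

# Sketch of the "stack frame" idea: a subroutine's entry code reserves room
# for its locals on a memory stack, operands are then addressed relative to
# the stack pointer, and exit code releases the frame.
class FrameMachine:
    def __init__(self, depth=64):
        self.mem = [0] * depth
        self.sp = depth          # stack grows downward

    def enter(self, nlocals):    # entry code: reserve the locals
        self.sp -= nlocals

    def leave(self, nlocals):    # exit code: release the frame
        self.sp += nlocals

    def local(self, offset):     # read a local, SP-relative
        return self.mem[self.sp + offset]

    def set_local(self, offset, value):
        self.mem[self.sp + offset] = value

m = FrameMachine()
m.enter(2)                       # a subroutine with two locals
m.set_local(0, 5)
m.set_local(1, 7)
m.set_local(0, m.local(0) + m.local(1))   # locals addressed like registers
assert m.local(0) == 12
m.leave(2)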


Seems to me that one possibility is to have a really functional stack
operation instruction, such that, with the given number of bits, it
allows for the most operations. Some combination of DUP, SWAP, and POP
all at once. Though that isn't easy for the stack itself.
Like many tradeoffs, this can complicate the hardware. If you want to
be able to combine an ADD with a SWAP or OVER, you need to be able to
access more than two operands on the stack at once. In my designs that
means pulling another register out of the proper stack. Rather than one
register and a block of memory, each stack would need two registers and
the block of memory along with the input multiplexers. This would need
to be examined in context of common code to see how much it would help
vs the expense of added hardware.

I'm still not complete in my analysis of a register based MISC design,
but at the moment I think the register approach gives better instruction
size efficiency/faster execution as well as simpler hardware design.

--

Rick
 
On 3/30/2013 5:54 PM, Rod Pemberton wrote:
"rickman"<gnuarm@gmail.com> wrote in message
news:kj4vae$msi$1@dont-email.me...

I have been working with stack based MISC designs in FPGAs for
some years. All along I have been comparing my work to the work
of others. These others were the conventional RISC type
processors supplied by the FPGA vendors as well as the many
processor designs done by individuals or groups as open source.

So far my CPUs have always ranked reasonably well in terms of
speed, but more importantly to me, very well in terms of size
and code density. My efforts have shown it hard to improve on
code density by a significant degree while simultaneously
minimizing the resources used by the design. Careful selection
of the instruction set can both improve code density and
minimize logic used if measured together, but there is always a
tradeoff. One can always be improved at the expense of the
other.

The last couple of days I was looking at some code I plan to use
and realized that it could be a lot more efficient if I could
find a way to use more parallelism inside the CPU and use fewer
instructions. So I started looking at defining separate opcodes
for the two primary function units in the design, the data stack
and the return stack. Each has its own ALU. The data stack has
a full complement of capabilities while the return stack can
only add, subtract and compare. The return stack is actually
intended to be an "address" processing unit.

While trying to figure out how to maximize the parallel
capabilities of these units, I realized that many operations
were just stack manipulations. Then I read the thread about the
relative "cost" of stack ops vs memory accesses and I realized
these were what I needed to optimize. I needed to find a way to
not use an instruction and a clock cycle for moving data around
on the stack.

In the thread on stack ops it was pointed out repeatedly that
very often the stack operands would be optimized to register
operands, meaning they wouldn't need to do the stack ops at all
really. So I took a look at a register based MISC design.
Guess what, I don't see the disadvantage! I have pushed this
around for a couple of days and although I haven't done a
detailed design, I think I have looked at it enough to realize
that I can design a register oriented MISC CPU that will run as
fast, if not faster than my stack based design and it will use
fewer instructions. I still need to add some features like
support for a stack in memory, in other words,
pre-increment/post-decrement (or the other way around...), but I
don't see where this is a bad design. It may end up using
*less* logic as well. My stack design provides access to the
stack pointers which require logic for both the pointers and
muxing them into the data stack for reading.

I guess looking at other peoples designs (such as Chuck's) has
changed my perspective over the years so that I am willing and
able to do optimizations in ways I would not have wanted to do
in the past. But I am a bit surprised that there has been so
much emphasis on stack oriented MISC machines which it may well
be that register based MISC designs are also very efficient,
at least if you aren't building them to service a C compiler or
trying to match some ideal RISC model.


Are those your actual results or did you just reiterate what is on
Wikipedia? Yes, that's a serious question. Read the MISC page:
http://en.wikipedia.org/wiki/Minimal_instruction_set_computer

See ... ?!
You clearly don't understand my post or the Wiki article or possibly
both. I suggest you reread both and then ask questions if you still
don't understand.


Code density is a CISC concept. I don't see how it applies to
your MISC project.
Yes, I get that you don't understand. Do you have a specific question?


Increasing code density for a MISC processor
means implementing more powerful instructions, i.e., those that do
more work, while minimizing bytes in the instruction opcode
encoding.
Yes, that part you seem to understand.


Even if you implement CISC-like instructions, you can't
forgo the MISC instructions you already have in order to add the
CISC-like instructions.
Really? I can't drop the entire instruction set and start over? Who
says so? Am I breaking a law?


So, to do that, you'll need to increase
the size of the instruction set, as well as implement a more
complicated instruction decoder.
Define "increase the size of the instruction set". I am using a 9 bit
opcode for my stack design and am using a similar 9 bit opcode for the
register design. In what way is the register design using a larger
instruction set?

That was exactly the blinder I was wearing until now. I had read a lot
about register CPU instruction sets where the intent was not in line
with MISC goals. MicroBlaze is a good example. I think it uses well in
excess of 1000 LUTs, maybe multiple 1000's. I need something that is
much smaller and it hadn't occurred to me that perhaps the common goals
of register designs (lots of registers, orthogonality, address mode
flexibility, etc) could be limited or even tossed out the window. I
don't need a machine that is easy for a C compiler to produce code for.
My goals are for minimal hardware without losing any more performance
than is essential. In particular I work in FPGAs, so the design needs
to work well in that environment.


I.e., that means the processor
will no longer be MISC, but MISC+minimal CISC hybrid, or pure
CISC...
Nonsense. "Minimal Instruction Set Computer (MISC) is a processor
architecture with a very small number of basic operations and
corresponding opcodes" (from Wikipedia). BTW, I don't think the term
MISC is widely used, nor is it well defined. This is the only web page
I found that even tries to define it.

Actually, if you consider only the opcodes and not the operand
combinations, I think the register design may have fewer instructions
than the stack design does. But the register design is still in work, so
I'm not done counting yet.

There are some interesting corners to be explored. For example, a MOV
rx,rx is essentially a NOP, and there are eight of these. So instead, why
not make them useful by clearing the register? MOV rx,rx becomes a clear,
given the name CLR rx... unless rx is r7, in which case MOV r7,r7 stays
a NOP and is coded as such. A CLR r7 is not needed because LIT 0 already
does the same job. Even better, the opcode for the NOP is 0x1FF, or octal
777, which is very easy to remember and recognize. It feels good to find
convenient features like this, and it makes me like the register MISC
design.
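A tiny decode sketch of that remapping (Python, purely illustrative; the
assumption that the MOV group sits in the top octet of the 9-bit space,
with dst and src as the two low 3-bit fields, is mine, chosen just so the
NOP comes out as 0x1FF):

# Hypothetical decode of the MOV rx,rx corner case described above.
NOP = 0x1FF                         # octal 777, all nine bits set

def decode(op):
    if (op >> 6) == 0b111:          # the assumed MOV group
        dst, src = (op >> 3) & 7, op & 7
        if dst != src:
            return f"MOV r{dst},r{src}"
        return "NOP" if dst == 7 else f"CLR r{dst}"
    return "(other opcode groups)"

print(decode(NOP))                  # -> NOP
print(decode(0b111_101_101))        # -> CLR r5
print(decode(0b111_010_110))        # -> MOV r2,r6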


No offense, but you seem to be "reinventing the wheel" in terms of
microprocessor design. You're coming to the same conclusions that
were found in the 1980's, e.g., concluding a register based
machine can perform better than a stack based machine, except
you've applied it to MISC in an FPGA package... How is that a new
conclusion?
I don't see how you can say that. I don't know of other MISC designs
that are very good. I think the PicoBlaze is one, but I don't think it
was designed to be very MISC really. It was designed to be small,
period. It has a lot of fixed features and so can't be altered without
tossing old code. Some of the Microchip devices might be MISC. I have
not worked with them, but I might want to take a look; there may
be useful ideas.

But "reinventing the wheel"? I don't think so.

--

Rick
 
On 4/1/2013 8:10 PM, Rod Pemberton wrote:
"Arlet Ottens"<usenet+5@c-scape.nl> wrote in message
news:5157e1a1$0$6924$e4fe514c@news2.news.xs4all.nl...
On 03/30/2013 10:54 PM, Rod Pemberton wrote:


I guess looking at other peoples designs (such as Chuck's)
has changed my perspective over the years so that I am
willing and able to do optimizations in ways I would not have
wanted to do in the past. But I am a bit surprised that there
has been so much emphasis on stack oriented MISC machines
which it may well be that register based MISC designs are
also very efficient, at least if you aren't building them to
service a C compiler or trying to match some ideal RISC
model.


Are those your actual results or did you just reiterate what
is on Wikipedia? Yes, that's a serious question. Read the
MISC page: [link]

See ... ?!

It sounds to me like rickman is questioning the (unsupported) claims
on Wikipedia that stack based machines have an advantage in size
and/or simplicity, not reiterating them.


Well, you snipped a lot of context. I wouldn't have reformatted
all of it either.

Anyway, what I took from rickman's statements was that he had only
managed to confirm what is already known about MISC according to
Wikipedia. I.e., what was/is the point?
You really need to reread the wiki description of MISC. There seems to
be a disconnect.


Code density is a CISC concept. I don't see how it applies to
your MISC project. Increasing code density for a MISC
processor means implementing more powerful instructions,
i.e., those that do more work, while minimizing bytes in the
instruction opcode encoding. Even if you implement
CISC-like instructions, you can't forgo the MISC instructions
you already have in order to add the CISC-like instructions.
So, to do that, you'll need to increase the size of the
instruction set, as well as implement a more complicated
instruction decoder. I.e., that means the processor will no
longer be MISC, but MISC+minimal CISC hybrid, or pure
CISC...

It is perfectly possible to trade one type of MISC processor for
another one. The choice between stack and register based is an
obvious one. If you switch from stack to register based, there's
no need to keep stack manipulation instructions around.


Yes, true. But, what does eliminating MISC stack instructions and
replacing them with MISC register instructions have to do with the
CISC concept of code density or with CISC instructions? ...
Weren't you the person who brought CISC into this discussion? Why are
you asking this question about CISC?


Also, you cross-posted to comp.arch.fpga. While they'll
likely be familiar with FPGAs, most there are not going
to be familiar with the features of stack-based processors
or Forth processors that you discuss indirectly within your
post. They might not be familiar with ancient CISC
concepts such as "code density" either, or understand why
it was important at one point in time. E.g., I suspect this
Forth related stuff from above won't be widely
understood on c.a.f. without clarification:

The design of simple and compact processors is of great interest
to many FPGA engineers. Plenty of FPGA designs need some
sort of control processor, and for cost reduction it's important
to use minimal resources. Like rickman said, this involves a
careful balance between implementation complexity, speed,
and code density, while also considering how much work it
is to write/maintain the software that's running on the
processor.

Code density is still critically important. Fast memory is
small, both on FPGA as well as general purpose processors.


So, shouldn't he dump the entire MISC instruction set he has, and
implement a CISC instruction set instead? That's the only way
he's going to get the "critically important" code density, which
I'll take it you rank well above MISC as being important. Of
course, a CISC instruction set requires a more complicated
instruction decoder ... So, it seems that either way he proceeds,
he is being constrained by the "minimal resources" of his FPGA.
That was what he stated:

"My efforts have shown it hard to improve on code density by a
significant degree while simultaneously minimizing the resources
used by the design."
I have "dumped" the stack related portion of the instruction set and
replaced it with register references. Why would I do otherwise?


I.e., if the FPGA he is attempting to use is insufficient to do
what he wants or needs, then it's insufficient. Or, he needs some
new techniques. He didn't explicitly ask for any though ...
No one said anything about "insufficient". I am looking for an optimal
design for a small CPU that can be used efficiently in an FPGA.


Glen mentioned the numerous address modes of the VAX. The 6502
also had a lot of address modes and had instructions which used
zero-page as a set of fast registers. I would think that early,
compact designs like the 6502 and Z80 could be useful to rickman.
They had low transistor counts.

Z80 8500 transistors
6502 9000 transistors
8088 29000 transistors (for comparison...)
I'm not counting transistors. That is one of the fallacies of comparing
chip designs to FPGA designs. When you have the flexibility of working
with transistors, you can do things in ways that are efficient in silicon
but not efficient in the LUTs of an FPGA.

The two big constraints are resources used in the FPGA and opcode size.
If an instruction would require addition of resources to the design,
it needs to justify that addition. An instruction also needs to justify
the portion of the opcode space it requires. Operations that specify
more than one register become very expensive in terms of opcode bits.
Operations that require additional data paths become expensive in terms
of resources. Doing as much as possible with as little as possible is
what I'm looking for.
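To put rough numbers on the opcode-space cost (back-of-the-envelope only;
the real field layout may differ): with a 9-bit opcode there are 512
encodings, and each 3-bit register field an operation specifies
multiplies the encodings it consumes by eight.

# Back-of-the-envelope cost of register fields in a 9-bit opcode space
# (512 encodings total); numbers are illustrative only.
TOTAL = 2 ** 9

for fields in range(3):
    used = 8 ** fields              # each 3-bit register field costs x8
    print(f"{fields} register field(s): {used:3d} encodings, "
          f"{100 * used / TOTAL:.1f}% of the opcode space per instruction")

# 0 fields:  1 encoding   ( 0.2%)
# 1 field :  8 encodings  ( 1.6%)
# 2 fields: 64 encodings  (12.5%)  <- why two-register ops are expensive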

BTW, the ZPU is a pretty good example of just how small a design can be.
But it is very slow. That's the other side of the equation: improve
speed as much as possible while using as few resources as possible.
The optimal point depends on the coefficients used... in other words,
my judgement. This is not entirely objective.

--

Rick
 
On 4/1/2013 8:54 PM, glen herrmannsfeldt wrote:
Seems to me that one could still Huffman code the opcode, even
within the MISC concept. That is, use fewer bits for more common
operations, or where it otherwise simplifies the result.

As someone noted, you can have an N-1 bit load immediate instruction
where N is the instruction size.
Yup, I am still using the high bit to indicate an immediate operand.
This requires an implied location, so this literal is always in R7, the
addressing register. In my stack design it is always the return stack,
also the addressing register.

Jumps and calls still contain a small literal that is used alone when the
relative jump is within range, or combined with the literal instruction
when it is not.

I think this is very similar to Huffman encoding.
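A sketch of that flavor of encoding (Python; the widths and bit positions
here are assumptions for illustration, not my actual format). The common
short jump costs one word and the rarer long jump costs two, which is
where it resembles Huffman coding:

# Illustrative 9-bit formats:
#   1 llllllll    LIT: load an 8-bit literal (into r7 here)
#   0 01 dddddd   JMP: 6-bit signed relative displacement
def assemble_jump(offset):
    words = []
    if -32 <= offset < 32:
        words.append(0b0_01_000000 | (offset & 0x3F))        # short: 1 word
    else:
        words.append(0b1_00000000 | ((offset >> 6) & 0xFF))  # LIT: high bits
        words.append(0b0_01_000000 | (offset & 0x3F))        # JMP: low bits
    return words

print([f"{w:09b}" for w in assemble_jump(5)])      # common case: one word
print([f"{w:09b}" for w in assemble_jump(1000)])   # rare case: two words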


In one of Knuth's TAOCP books he describes a two-instruction computer.

Seems like if one is interested in MISC, someone should build that one.
You can design a one-instruction computer, but there is a balance
between the resources used and the effectiveness of the resulting design.
The effectiveness of this sort of design is too low.
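For the record, the usual textbook example of a one-instruction machine
is subleq (subtract and branch if less than or equal to zero); a toy
interpreter (Python, illustrative only, and not the specific machine
Knuth describes) shows why the effectiveness is so low: even clearing a
cell costs a full three-word instruction.

# Toy subleq interpreter. Each instruction is three cells:
#   a, b, c  ->  mem[b] -= mem[a]; branch to c if the result is <= 0.
def subleq(mem, pc=0):
    while pc >= 0:
        a, b, c = mem[pc], mem[pc + 1], mem[pc + 2]
        mem[b] -= mem[a]
        pc = c if mem[b] <= 0 else pc + 3
    return mem

# "Clear cell 6" takes a whole three-word instruction: mem[6] -= mem[6]
mem = [6, 6, -1, 0, 0, 0, 42]
print(subleq(mem)[6])   # -> 0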


Also, maybe a MIX and MMIX machine, maybe the decimal version.

For those who don't read TAOCP, MIX is defined independent of the
underlying base. Programs in MIXAL are supposed to assemble and
run correctly on hosts that use any base within the range specified.

Instruction bytes have, if I remember, between 64 and 100 possible
values, such that six bits or two decimal digits are possible
representations.

I believe that allows for bases 2, 3, 4, 8, 9, 10, and 16.
Doesn't sound like an especially practical computer. Has anyone ever
built one?

--

Rick
 
On 4/2/2013 2:25 AM, Arlet Ottens wrote:
On 04/02/2013 02:10 AM, Rod Pemberton wrote:

Z80 8500 transistors
6502 9000 transistors
8088 29000 transistors (for comparison...)

Low transistor counts do not necessarily translate to low FPGA
resources. Early CPUs used dynamic storage, dual clock latches, pass
logic and tri-state buses to create really small designs that don't
necessarily map well to FPGAs. On the other hand, FPGAs have specific
features (depending on the brand) that can be exploited to create really
tight designs.

Also, these processors are slow, using multi-cycle instructions and 8-bit
operations. That may not be acceptable. And even if low performance
is acceptable, there are numerous other ways you can trade speed
for code density, so you'd have to consider those too. For instance, I
can replace pipelining with multi-cycle execution, or use microcode.
I think you have a very good handle on the nature of my goals.

--

Rick
 
In comp.arch.fpga rickman <gnuarm@gmail.com> wrote:
On 3/31/2013 2:34 PM, glen herrmannsfeldt wrote:
In comp.arch.fpga rickman<gnuarm@gmail.com> wrote:

(snip)

I have an actual RX2600 dual Itanium box, but don't run it very
often, mostly because of the power used.

A 100 watt lightbulb uses about $10 a month if left on 24/7. So I
wouldn't worry too much about the cost of running your Itanium. If you
turn it off when you aren't using it I can't imagine it would really
cost anything noticeable to run.
It is a dual processor box, plus all the rest of the systems
in the box. Yes, I was considering running it all the time, but
it is too expensive for that.

For Itanium, the different units do different things. There are
instruction formats that divide up the bits in different ways to make
optimal use of the bits. I used to have the manual nearby, but I don't
see it right now.

Yes, the array processor I worked on was coded from scratch, very
laboriously. The Itanium is trying to run existing code as fast as
possible. So they have a number of units to do similar things, but also
different types, all working in parallel as much as possible. Also the
parallelism in the array processor was all controlled by the programmer.
In regular x86 processors the parallelism is controlled by the chip
itself. I'm amazed sometimes at just how much they can get the chip to
do; no wonder there are hundreds of millions of transistors on the thing.
I assume parallelism in the Itanium is back under the compiler's control,
since it needs to be coded into the VLIW instructions.
Seems to me that the big problem with the original Itanium was the
need to also run x86 code. That delayed the release for some time, and
in that time other processors had advanced. I believe that later
versions run x86 code in software emulation, maybe with some hardware
assist.

-- glen
 
In comp.arch.fpga rickman <gnuarm@gmail.com> wrote:

(snip, I wrote)
Also, maybe a MIX and MMIX machine, maybe the decimal version.

For those who don't read TAOCP, MIX is defined independent of the
underlying base. Programs in MIXAL are supposed to assemble and
run correctly on hosts that use any base within the range specified.

Instruction bytes have, if I remember, between 64 and 100 possible
values, such that six bits or two decimal digits are possible
representations.

I believe that allows for bases 2, 3, 4, 8, 9, 10, and 16.

Doesn't sound like an especially practical computer.
Has anyone ever built one?
Well, it exists mostly to write the examples (and, I believe,
homework problems) in the book. I believe that there have been
software emulations, but maybe no hardware (FPGA) versions.

MIXAL programs should be base independent, but actual implementations
are likely one base.

(The model number, 1009 or MIX in Roman numerals, is supposed to
be the average of the model numbers of some popular machines.)

There is also the DLX, a RISC machine used by Hennessy and Patterson
in their book, which could also be written in Roman numerals.

-- glen
 
