Soft core processors: RISC versus stack/accumulator for equal resources

It would appear there are very similar resource needs for either RISC or stack/accumulator architectures when both are of the "load/store" classification.
Here, the same multi-port LUT RAM serves as either the RISC register file or the dual stacks, with a DSP block for multiply and block RAM for main memory. "Load/store" refers to using distinct instructions for moving data between LUT RAM and block RAM.

Has someone studied this situation?
It would appear the stack/accumulator program code would be denser?
It would appear multiple instruction issue would be simpler with RISC?

Jim Brakefield
 
On 9/26/2015 2:07 PM, jim.brakefield@ieee.org wrote:
It would appear there are very similar resource needs for either RISC or stack/accumulator architectures when both are of the "load/store" classification.
Here, the same multi-port LUT RAM serves as either the RISC register file or the dual stacks, with a DSP block for multiply and block RAM for main memory. "Load/store" refers to using distinct instructions for moving data between LUT RAM and block RAM.

Has someone studied this situation?
It would appear the stack/accumulator program code would be denser?
It would appear multiple instruction issue would be simpler with RISC?

I've done a little investigation, and the code for a stack processor was
not much denser than the code for the RISC CPU I compared it to. I don't
recall which one it was.

A lot depends on the code you use for comparison. I was using loops
that move data. Many stack processors have some level of inefficiency
because of the stack juggling required in some code. Proponents usually
say the code can be written to reduce the juggling of operands, which I
have found to be mostly true. If you code to reduce the parameter
juggling, stack processors can be somewhat more efficient in terms of
code space usage.
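
A rough sketch of the kind of comparison described above (Python is used
only to count instructions): a "copy n cells" loop for an idealized
Forth-like two-stack machine versus an idealized load/store register
machine. The sequences, mnemonics and encoding widths are invented for
illustration, not taken from any particular CPU.

stack_loop = [                  # entry stack: ( src dst n )
    ">r",                       # keep the count on the return stack
    "begin:",
    "over", "@",                # OVER reaches under dst to fetch from src
    "over", "!",                # OVER again to reach the dst address
    "swap", "cell+",            # SWAP juggling to advance src ...
    "swap", "cell+",            # ... and dst
    "r>", "1-", "dup", ">r",    # decrement the count kept on the return stack
    "0=", "until",              # loop until the count reaches zero
]                               # (cleanup of the leftover count omitted)

risc_loop = [                   # r1 = src, r2 = dst, r3 = count, r4 = scratch
    "begin:",
    "lw   r4, 0(r1)",
    "sw   r4, 0(r2)",
    "addi r1, r1, 4",
    "addi r2, r2, 4",
    "addi r3, r3, -1",
    "bnez r3, begin",
]

ops = lambda seq: sum(1 for i in seq if not i.endswith(":"))
# With hypothetical 8-bit stack opcodes versus 32-bit RISC words this is
# roughly 15 bytes against 24 bytes of loop code: denser, but not by much.
print(ops(stack_loop), "stack ops vs", ops(risc_loop), "RISC instructions")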

I have looked at a couple of things as alternatives. One is to use VLIW
to allow as much parallelism as possible among the execution units
within the processor: the data unit, address unit and instruction unit.
This presents some inherent inefficiency in that a fixed-size
instruction field is used to control the instruction unit even though
most IU instructions are just "next". But it allows both the address
unit and the data unit to be doing work at the same time, for example
moving data to/from memory while counting a loop iteration.
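
A sketch of what such a three-slot word might look like; the field
widths and opcode assignments below are invented for illustration.

DU_OPS = {"nop": 0, "add": 1, "sub": 2, "fetch": 3, "store": 4}
AU_OPS = {"nop": 0, "inc": 1, "dec": 2, "load_a": 3}
IU_OPS = {"next": 0, "jump": 1, "call": 2, "ret": 3}

def pack(du, au, iu):
    """Pack one word: 5-bit data-unit slot, 4-bit address-unit slot, 2-bit IU slot."""
    return (DU_OPS[du] << 6) | (AU_OPS[au] << 2) | IU_OPS[iu]

# The data unit and address unit both do useful work in the same word,
# while the IU slot is usually just "next", which is the fixed-cost
# inefficiency mentioned above.
word = pack("store", "dec", "next")     # store data, count the loop
print(f"{word:011b}")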

Another potential stack optimization I have looked at is combining
register and stack concepts by allowing very short offsets from the top
of stack to be used for a given operand, along with variable-size stack
adjustments. I didn't pursue this very far, but I think it has the
potential to virtually eliminate operand juggling, making a stack
processor much faster. I'm not sure of the effect on code size because
of the larger instruction size.
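
A minimal behavioral sketch of that idea: each ALU operation names one
operand by a small offset below the top of stack and carries a stack
adjustment, so most DUP/SWAP/OVER shuffling becomes unnecessary. The
operation set and encoding are invented for illustration.

class HybridStack:
    def __init__(self):
        self.s = []                       # index -1 is the top of stack (TOS)

    def push(self, v):
        self.s.append(v)

    def alu(self, op, offset, adjust):
        """Compute op(TOS, item 'offset' below TOS), pop 'adjust' items, push result."""
        a = self.s[-1]
        b = self.s[-1 - offset]           # small offset reaches under the top
        r = {"add": a + b, "sub": a - b, "mul": a * b}[op]
        if adjust:
            del self.s[-adjust:]          # variable-size stack adjustment
        self.push(r)
        return r

m = HybridStack()
for v in (10, 20, 3):                     # stack: 10 20 3   (3 on top)
    m.push(v)
# 3 * 20 with no SWAP/OVER: second operand at offset 1, drop both inputs.
print(m.alu("mul", 1, 2), m.s)            # -> 60 [10, 60]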

--

Rick
 
On Sunday, September 27, 2015 at 3:37:24 AM UTC+9:30, jim.bra...@ieee.org wrote:
Has someone studied this situation?
It would appear the stack/accumulator program code would be denser?
It would appear multiple instruction issue would be simpler with RISC?

I worked with the 1980s Lilith computer and its Modula-2 compiler, which used a stack-based architecture. Christian Jacobi includes a detailed analysis of the code generated in his dissertation titled "Code Generation and the Lilith Architecture". You can download a copy from my website:

http://www.cfbsoftware.com/modula2/

I am currently working on the 2015 RISC equivalent - the FPGA RISC5 Oberon compiler used in Project Oberon:

http://www.projectoberon.com

The code generation is described in detail in the included documentation.

I have both systems in operation and have some very similar test programs for both. I'll experiment to see if the results give any surprises. Any comparison would have to take into account the fact that the Lilith was a 16-bit architecture whereas RISC5 is 32-bit, so it might be tricky.

Regards,
Chris Burrows
CFB Software
http://www.astrobe.com
 
On Saturday, September 26, 2015 at 3:02:27 PM UTC-5, rickman wrote:

I have looked at a couple of things as alternatives. One is to use VLIW
to allow as much parallelism as possible among the execution units
within the processor: the data unit, address unit and instruction unit.
I have considered multiple stacks as a form of VLIW: each stack has its own part of the VLIW instruction or, if it has nothing to do, provides future immediates for any of the other stacks' instructions.
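
A behavioral sketch of that idea: a two-lane word, one lane per stack, where an otherwise idle lane carries literal bits that accumulate into an immediate a later instruction in either lane can consume. Mnemonics and widths are invented for illustration.

def run(words):
    stacks = {0: [], 1: []}
    imm = 0
    for slots in words:                     # one VLIW word per iteration
        for lane, (op, arg) in enumerate(slots):
            st = stacks[lane]
            if op == "imm":                 # idle lane builds the literal
                imm = (imm << 8) | arg
            elif op == "pushi":             # consume the accumulated literal
                st.append(imm); imm = 0
            elif op == "push":
                st.append(arg)
            elif op == "add":
                st.append(st.pop() + st.pop())
            elif op == "nop":
                pass
    return stacks

prog = [
    (("push", 5), ("imm", 0x12)),   # lane 1 idle: feed high byte of a literal
    (("push", 7), ("imm", 0x34)),   # lane 1 idle: feed low byte
    (("add", 0),  ("pushi", 0)),    # lane 0 adds while lane 1 pushes 0x1234
]
print(run(prog))                    # {0: [12], 1: [4660]}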

Another potential stack optimization I have looked at is combining
register and stack concepts by allowing very short offsets from the top
of stack to be used for a given operand, along with variable-size stack
adjustments. I didn't pursue this very far, but I think it has the
potential to virtually eliminate operand juggling, making a stack
processor much faster.
This is also a way to improve the processing rate, as there are fewer instructions than with "pure" stack code (each instruction has a stack/accumulator operation and a small offset for the other operand). While one is at it, one can add various instruction bits for "return", stack/accumulator mode, replace operation, stack pointer selector, ...
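
A sketch of one possible encoding along those lines, with single-bit fields for "return", replace-instead-of-push and stack-pointer select; the field positions and widths are invented for illustration.

def encode(op, offset, ret=0, replace=0, sp_sel=0):
    # 11-bit word: 5-bit opcode, 3-bit operand offset from TOS, three flag bits
    assert 0 <= op < 32 and 0 <= offset < 8
    return (op << 6) | (offset << 3) | (ret << 2) | (replace << 1) | sp_sel

def decode(word):
    return dict(op=(word >> 6) & 0x1F, offset=(word >> 3) & 0x7,
                ret=(word >> 2) & 1, replace=(word >> 1) & 1,
                sp_sel=word & 1)

w = encode(op=9, offset=2, ret=1)   # e.g. "add TOS to the item 2 below, then return"
print(f"{w:011b}", decode(w))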

Personally, I don't have hard numbers for any of this (there are open-source stack machines with small offsets and various instruction bits; what is needed is compilers so that comparisons can be done), and I don't want to duplicate any work (i.e., research) that has already been done.

Jim Brakefield
 
On Saturday, September 26, 2015 at 8:19:29 PM UTC-5, cfbso...@gmail.com wrote:

Any comparison would have to take into account the fact that the Lilith was a 16-bit architecture whereas RISC5 is 32-bit so it might be tricky.
And in the 1980s, main memory access time was a smaller multiple of the clock period than with today's DRAMs. However, the main memory for the RISC5 FPGA card is asynchronous static RAM with a fast access time, so perhaps it is comparable to the main memory of the Lilith?

Jim Brakefield
 
On Sunday, September 27, 2015 at 10:20:39 PM UTC-5, rickman wrote:

Reply:
I assume you mean two data stacks?
Yes, in particular integer arithmetic on one and floating-point on the other.

My derivation uses the return stack for addresses such as memory accesses as well as jump/calls, so I call it the address stack.
OK

I was trying hard not to expand on the hardware significantly.
The other things can require extra hardware.
With FPGA 6-LUT RAM one can have several read ports (4-LUT RAM can do it also, it's just not as efficient). At one operation per clock, and mapping both the data and address stacks to the same LUT RAM, one has two ports for operand reads, one port for the result write and one port for the "return" address read. Just about any stack or accumulator operation that fits these constraints is possible with appropriate instruction decode and ALU. The SWAP operation requires two writes, so one would need to make TOS a separate register to do it in one clock (other implementations are possible using two multi-port LUT RAMs).
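
A simple per-clock port-budget model of that mapping (three reads: two operands plus the return address; one write). It also shows why a plain SWAP, which needs two writes, wants the TOS held in a separate register. The operation list is invented for illustration.

PORTS = {"read": 3, "write": 1}

def fits_in_one_clock(reads, writes):
    return reads <= PORTS["read"] and writes <= PORTS["write"]

ops = {
    "add    (two operand reads, one result write)": (2, 1),
    "return (one return-address read)":             (1, 0),
    "swap   (needs two stack writes)":              (2, 2),
}
for name, (r, w) in ops.items():
    print(f"{name:48s} one clock: {fits_in_one_clock(r, w)}")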

Jim
 
On 9/27/2015 8:30 PM, jim.brakefield@ieee.org wrote:

I have looked at a couple of things as alternatives. One is to use VLIW
to allow as much parallelism as possible among the execution units
within the processor: the data unit, address unit and instruction unit.
I have considered multiple stacks as a form of VLIW: each stack has its
own part of the VLIW instruction or, if it has nothing to do, provides
future immediates for any of the other stacks' instructions.

I assume you mean two data stacks? I was trying hard not to expand on
the hardware significantly. The common stack machine typically has two
stacks, one for data and one for return addresses. In Forth the return
stack is also used for loop counting. My derivation uses the return
stack for addresses such as memory accesses as well as jump/calls, so I
call it the address stack. This lets you do minimal arithmetic (loop
counting and incrementing addresses) and reduces stack ops on the data
stack, such as the two drops required for a memory write.
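
A rough behavioral sketch of that arrangement: memory addresses live on
the address stack (and can be auto-incremented there), so a store only
pops the value from the data stack instead of the value plus the
address. The details here are invented for illustration.

class TwoStackMachine:
    def __init__(self, size=16):
        self.ds, self.asx, self.mem = [], [], [0] * size

    def store_classic(self):
        # single-stack style: "!" pops the address and the value (two drops)
        addr, val = self.ds.pop(), self.ds.pop()
        self.mem[addr] = val

    def store_via_astack(self):
        # address-stack style: the pointer stays put and auto-increments,
        # so only the value leaves the data stack
        self.mem[self.asx[-1]] = self.ds.pop()
        self.asx[-1] += 1

m = TwoStackMachine()
m.asx.append(4)                     # destination pointer on the address stack
for v in (11, 22, 33):
    m.ds.append(v)
    m.store_via_astack()            # no DUP/SWAP/DROP juggling on the data stack
print(m.mem[4:7])                   # [11, 22, 33]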


Another potential stack optimization I have looked at is combining
register and stack concepts by allowing very short offsets from the top
of stack to be used for a given operand, along with variable-size stack
adjustments. I didn't pursue this very far, but I think it has the
potential to virtually eliminate operand juggling, making a stack
processor much faster.
This is also a way to improve the processing rate, as there are fewer
instructions than with "pure" stack code (each instruction has a
stack/accumulator operation and a small offset for the other operand).
While one is at it, one can add various instruction bits for "return",
stack/accumulator mode, replace operation, stack pointer selector, ...

Yes, returns are common so it can be useful to provide a minimal
instruction overhead for that. The other things can require extra
hardware.



--

Rick
 
On 9/28/2015 12:31 AM, jim.brakefield@ieee.org wrote:

Reply:
I assume you mean two data stacks?
Yes, in particular integer arithmetic on one and floating-point on
the other.

Yes, if you need floating point a separate stack is often used.


My derivation uses the return stack for addresses such as memory
accesses as well as jump/calls, so I call it the address stack.
OK

I was trying hard not to expand on the hardware significantly. The
other things can require extra hardware.

With FPGA 6-LUT RAM one can have several read ports (4-LUT RAM can do
it also, it's just not as efficient). At one operation per clock, and
mapping both the data and address stacks to the same LUT RAM, one has
two ports for operand reads, one port for the result write and one
port for the "return" address read. Just about any stack or accumulator
operation that fits these constraints is possible with appropriate
instruction decode and ALU. The SWAP operation requires two writes, so
one would need to make TOS a separate register to do it in one clock
(other implementations are possible using two multi-port LUT RAMs).

I used a TOS register for each stack and used a write port and a read
port for each stack in one block RAM. The write/read ports share the
address. A read happens automatically on each cycle, and in all the
parts I have used the RAM can be set so the data written in a cycle
shows up on the read port, so it holds the next-on-stack value at all
times.

Managing the stack pointers can get a bit complex if an effort is not
made to keep them simple. As it was, the stack pointer was in the
critical timing path, which ended in the flag registers. The stack
pointers set error flags in the CPU status register for overflow and
underflow. I thought this would be useful for debugging, but there are
likely ways to minimize the timing overhead.
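
A rough behavioral sketch of that arrangement (not cycle-accurate): the
top of stack in a register, the rest in a RAM whose read and write
ports share the stack-pointer address, with sticky overflow/underflow
status flags. Sizes and wraparound behavior are assumptions.

class HwStack:
    def __init__(self, depth=16):
        self.ram = [0] * depth
        self.depth = depth
        self.tos = 0                      # top-of-stack register
        self.sp = 0                       # points at the next free RAM slot
        self.overflow = self.underflow = False

    def push(self, v):
        if self.sp == self.depth - 1:
            self.overflow = True          # sticky flag in the status register
        self.ram[self.sp] = self.tos      # old TOS spills into the RAM
        self.sp = (self.sp + 1) % self.depth
        self.tos = v

    def pop(self):
        if self.sp == 0:
            self.underflow = True
        v = self.tos
        self.sp = (self.sp - 1) % self.depth
        self.tos = self.ram[self.sp]      # next-on-stack read back into TOS
        return v

s = HwStack()
for v in (1, 2, 3):
    s.push(v)
print(s.pop(), s.pop(), s.pop(), s.underflow)   # 3 2 1 False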

--

Rick
 
On Monday, September 28, 2015 at 10:49:47 AM UTC+9:30, jim.bra...@ieee.org wrote:
> And in the 1980s, main memory access time was a smaller multiple of the clock period than with today's DRAMs. However, the main memory for the RISC5 FPGA card is asynchronous static RAM with a fast access time, so perhaps it is comparable to the main memory of the Lilith?

Rather than trying to paraphrase the information and risk getting it wrong, I refer you to the detailed description of the Lilith memory organisation in the 'Lilith Computer Hardware Manual'. You can download a copy of this and several other related documents from BitSavers:

http://www.bitsavers.org/pdf/eth/lilith/

Regards,
Chris
 
