NibzX7 processor

jacko

Well, finally got Altera Quartus 10.1 working on Jolicloud linux using the dash->bash hack. The compile speed on this netbook is quite good compared to the old windows box. I've fiddled with the instruction set, the sound filter and the video resolution, and added some modulo addressing on the R and S stack registers.

It now weighs in at 75% of a 1270 4-LUT device. No specific Altera megafunctions used. Pure VHDL. (Will add UFM SPI though at some point.) With no constraints it gives an Fmax of 85MHz in C5. All arithmetic is based on the MInus instruction. All conditional branching is based on stack return address manipulation.
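
As a rough illustration of branching by return address manipulation, here is a toy model in Python. It is not the NibzX7 instruction set: the primitive name `SKIP_IF`, the list-of-cells program and the `flag` argument are all hypothetical, chosen only to show a "subroutine" that bumps its own return address to skip the next cell.

```python
def run(program, flag):
    """Toy threaded-code model (hypothetical, not the real NibzX7 ISA)."""
    rstack = []          # return stack
    trace = []           # cells that were actually "executed"
    pc = 0
    while pc < len(program):
        cell = program[pc]
        if cell == "SKIP_IF":
            rstack.append(pc + 1)   # call: push the return address
            ret = rstack.pop()      # inside the primitive: fetch it back
            if flag:
                ret += 1            # adjust it so the next cell is skipped
            pc = ret                # return to the (possibly adjusted) address
        else:
            trace.append(cell)
            pc += 1
    return trace
```

With `flag` true, `run(["A", "SKIP_IF", "B", "C"], True)` executes A and C; with `flag` false it executes A, B and C. No dedicated branch opcode is involved in the model.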

http://code.google.com/p/nibz/downloads/detail?name=nibzX7.vhd&can=2&q

which is a re-upload of a compiling version (no VHDL errors).

Apart from later bug fixes, it's a wrap. Free BSD.

Cheers Jacko
 
On Apr 16, 12:39 pm, jacko <jackokr...@gmail.com> wrote:
<snip>
Jacko,

What was your goal in designing this CPU? What were you attempting to
optimize? Only having a minus instruction for arithmetic seems like it
might use a dozen or so fewer LUTs, but at what cost? I can only
assume that means an addition is done by first subtracting one addend
from 0 and then subtracting it from the other addend. So every add
requires two instructions. I believe your instruction set is pretty
minimal from what I've seen. How many bits wide are the stacks? Just
under 1000 LUTs is not bad for a 16 bit processor and is really good
for a 32 bit machine.
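
The two-subtraction scheme described above can be sketched in Python. This is only an illustration under assumptions: a 16 bit word that wraps on overflow, and the helper names `minus` and `add`, none of which come from the NibzX7 source.

```python
MASK = 0xFFFF  # assumed 16 bit word size

def minus(a, b):
    """a - b with 16 bit wrap-around (the only arithmetic primitive)."""
    return (a - b) & MASK

def add(a, b):
    """a + b built from two subtractions: a - (0 - b)."""
    negated_b = minus(0, b)      # first subtraction: negate one addend
    return minus(a, negated_b)   # second subtraction: a - (-b)
```

For example, `add(2, 3)` gives 5, and `add(0xFFFF, 1)` wraps to 0.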

Have you seen the ZPU? It is a stack based machine designed to be
coded in C! What will they think of next...

Rick
 
On Saturday, 16 April 2011 21:18:18 UTC+1, rickman wrote:
<snip>
> What was your goal in designing this CPU?
To make a very small system (it includes a video and sound output too), after that the priorities were a high MIPS/MB rating, a high MIPS rating, a reasonably high code density using a dynamic compression system, and a foundation to make larger SMP systems.

> What were you attempting to
> optimize?
Mainly area, but the speed optimization setting was used for the last compile, as it still fits.

> Only having a minus instruction for arithmetic seems like it
> might use a dozen or so fewer LUTs, but at what cost? I can only
> assume that means an addition is done by first subtracting one addend
> from 0 and then subtracting it from the other addend. So every add
> requires two instructions. I believe your instruction set is pretty
> minimal from what I've seen.
Yes, the instruction set is optimized for threaded code, so + would most likely be a subroutine.

> How many bits wide are the stacks? Just
> under 1000 LUTs is not bad for a 16 bit processor and is really good
> for a 32 bit machine.
The stacks (2 of them) are 16 bit wide with auto increment and decrement.

> Have you seen the ZPU? It is a stack based machine designed to be
> coded in C! What will they think of next...
I had a look, and the very small version is limited in the number of instructions it offers. Designed for C? Almost as funny a claim as designed for Haskell...

Cheers Jacko
 
On Apr 17, 5:18 am, jacko <jackokr...@gmail.com> wrote:
> I had a look, and the very small version is limited in the number of instructions it offers. Designed for C? Almost as funny a claim as designed for Haskell...
Not sure I follow. What do you mean the instructions are limited?
They use emulation to implement some instructions depending on the
core used. This is very much the Forth concept of building words.

The C claim is not funny, it's real. They are using gcc, I believe, and
people have used the ZPU in real apps. I wasn't impressed because it
is not as fast as my design, but it is even smaller, and faster
versions are not a lot bigger. So I have to give them their due.
They met their goal of making the smallest possible (32 bit!)
processor supported by a C compiler with variations designed for
higher speeds. I don't think there is ANY other soft CPU under
several thousand LUTs that has a C compiler.

Rick
 
On Sunday, 17 April 2011 23:00:57 UTC+1, rickman wrote:
> Not sure I follow. What do you mean the instructions are limited?
> They use emulation to implement some instructions depending on the
> core used. This is very much the Forth concept of building words.
Yes, the emulation is there to reduce core size, but 'having' instructions without implementing them in hardware is not having them at all; it is marketing speak.

> The C claim is not funny, it's real. They are using gcc I believe and
> people have used the ZPU in real apps. I wasn't impressed because it
> is not as fast as my design, but it is even smaller and faster
> versions are not a lot bigger. So I have to give them their due.
> They met their goal of making the smallest possible (32 bit!)
> processor supported by a C compiler with variations designed for
> higher speeds. I don't think there is ANY other soft CPU under
> several thousand LUTs that has a C compiler.
Supporting C, that's good, but "designed for C" is more marketing speak. Considering C was designed to work on processors, I'd expect a stack frame link instruction similar to the 68k at least, with word stride multiplication for pointer arithmetic. But fair dues, it's not too bad; it just suffers from hype.

Cheers Jacko
 
On 18 Apr., 00:00, rickman <gnu...@gmail.com> wrote:
> I don't think there is ANY other soft CPU under
> several thousand LUTs that has a C compiler.

Please do not forget my ERIC5: About 300 LUTs, about ATMEL AVR
performance, with C-compiler.

Regards,

Thomas

www.entner-electronics.com
 
On 18 Apr., 12:08, Thomas Entner <thomas.entne...@gmail.com> wrote:
<snip>
P.S.: And it even includes an add-instruction ;-) Not to mention the
multiplier...
 
It looks ok Thomas, haven't seen the ISA. The main reason I dropped the add instruction (originally there was no minus), was that minus is more primitive, in that construction of minus from plus requires xor. In the context of threaded code compilation the MI instruction can be used just once.
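
The claim that minus built from plus needs an xor can be checked with the usual two's complement identity a - b = a + (b xor all-ones) + 1. The sketch below assumes a 16 bit wrapping word; the names `plus` and `minus_from_plus` are made up for the illustration and are not Nibz code.

```python
MASK = 0xFFFF  # assumed 16 bit word size

def plus(a, b):
    """a + b with 16 bit wrap-around."""
    return (a + b) & MASK

def minus_from_plus(a, b):
    """a - b using only plus and xor: a + (b ^ 0xFFFF) + 1."""
    complement = b ^ MASK              # the xor step that plus alone cannot supply
    return plus(plus(a, complement), 1)
```

Going the other way, plus from minus needs nothing beyond minus itself, which is the sense in which minus is the more primitive of the two.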

The main features: a 3 in 1 compression mode, so that 3 instructions may be placed in 16 bits for high code density. No opcode is needed to prefix a subroutine jump. There are 5 registers and a borrow flag. Pre/post decrement/increment is applied to all indirect memory accesses. RAM is loaded from an SPI EEPROM at boot time through a hardware SPI interface. A simple interrupt method can be used. Code size is 48K * 16 bit when using the 16 bit generic word size and the 3 in 1 compression. Data size is up to 64K * 16 bit, as addressable memory is 128KB using a 16 bit generic. Video DMA is included for a sub-VGA resolution of 256*256 in 8 colours. A 16 bit delta sigma DAC is included. 2 * 8 bit ports (one in, one out) are included. With no cache, 0.2 MIPS/MB processing from RAM is standard (including operand access). BSD license.
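
One way to picture "no opcode prefix for a subroutine jump" together with "3 instructions in 16 bits" is sketched below. The post does not give the actual encoding, so everything here is an assumption: the 0xC000 split is only inferred from the 48K code size figure, the 4 bit opcode fields are a guess, and bits 12 and 13 are simply ignored.

```python
CALL_LIMIT = 0xC000  # assumption: words below 48K are plain subroutine addresses

def decode(word):
    """Classify a 16 bit code word (hypothetical layout, not the real one)."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("not a 16 bit word")
    if word < CALL_LIMIT:
        return ("call", word)            # the word itself is the call target
    # assumption: three 4 bit opcodes packed in the low 12 bits
    return ("ops", (word >> 8) & 0xF, (word >> 4) & 0xF, word & 0xF)
```

Under these assumptions `decode(0x1234)` is a call to address 0x1234, while `decode(0xC123)` yields the three packed opcodes 1, 2 and 3.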

For a further explanation of the preference for MInus, the Z80's DJNZ explains it best: counting down to borrow is an excellent looping mechanism. Saving a few cells is necessary considering the size of the UFM-SPI megafunction and the 1270 LE MAX II kit I am targeting. It's all very logical.
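
A count-down-to-borrow loop can be sketched as follows. This is a Python model under assumptions (16 bit word, a borrow flag set when the subtraction underflows); `sub_with_borrow` and `count_down` are illustrative names, not Nibz instructions.

```python
MASK = 0xFFFF  # assumed 16 bit word size

def sub_with_borrow(a, b):
    """Return (a - b) wrapped to 16 bits plus a borrow flag."""
    return (a - b) & MASK, a < b

def count_down(n):
    """Run a loop body n times, terminating on borrow rather than on a compare."""
    iterations = 0
    counter = n
    while True:
        counter, borrow = sub_with_borrow(counter, 1)
        if borrow:               # counter went below zero: leave the loop
            break
        iterations += 1          # stand-in for the loop body
    return iterations
```

The subtraction that updates the counter also produces the exit condition, so no separate compare is needed.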

After all is considered, the 16 bit memory model with auto +/- saves a lot of code in a stack based design. Think of all those extra cycles adding or subtracting 1 which are hidden in Nibz, and the paltry extra complexity of performing an add is tiny. The subroutine branch saving alone is a major one, considering factorization into small subroutines is where code density comes from.
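
To show what the auto +/- buys, here is a small Python model of a stack whose pointer is adjusted as part of the memory access, with wrap-around standing in for the modulo addressing mentioned in the first post. The class name, the 8 cell size and the choice of pre-decrement push with post-increment pop are assumptions for the sketch.

```python
class AutoStack:
    """Stack with pre-decrement store and post-increment load (illustrative)."""

    def __init__(self, size=8):
        self.mem = [0] * size
        self.sp = 0

    def push(self, value):
        self.sp = (self.sp - 1) % len(self.mem)   # pre-decrement, modulo size
        self.mem[self.sp] = value & 0xFFFF        # 16 bit cell

    def pop(self):
        value = self.mem[self.sp]
        self.sp = (self.sp + 1) % len(self.mem)   # post-increment, modulo size
        return value
```

Each push or pop is a single access; the pointer update that would otherwise cost a separate add or subtract of 1 happens alongside it.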

Cheers Jacko
 
On Apr 18, 6:08 am, Thomas Entner <thomas.entne...@gmail.com> wrote:
> On 18 Apr., 00:00, rickman <gnu...@gmail.com> wrote:
>> I don't think there is ANY other soft CPU under
>> several thousand LUTs that has a C compiler.
>
> Please do not forget my ERIC5: About 300 LUTs, about ATMEL AVR
> performance, with C-compiler.

I should have said "32 bit" processor. That was what they
wanted. One 32 bit processor architecture, one instruction set and
many possible speed ranges.

That is not my goal. I prefer to provide more customized CPUs which
are optimized for the application, which almost always requires more
speed, at least in bursts.

Rick
 
Jacko,

> Supporting C, that's good, but designed for C is more marketing speak.
> Considering C was designed to work on processors, I'd expect a stack
> frame link instruction similar to the 68k at least... with word stride
> multiplication for pointer arithmetic... but fair dues, it's not too bad, but
> suffers from hype.
I don't agree that it is just marketing speak - the instructions were
selected to encode C programs as compactly as possible while still
having a tiny implementation. The CRISP (sold as the AT&T Hobbit) was
a much better C processor, but an FPGA implementation of that would be
several times larger than the ZPU. The VAX was also a really great
target for C, but couldn't perform as well as RISCs (neither can the
ZPU).

-- Jecel
 
I can see how it could be so, but why go for 16 local variable indexes? The following development from the Nibz basis:

http://code.google.com/p/scc-on-gcc/wiki/FBWT

and the idea of an "unroll n instructions" instruction, extending subroutine factoring to cover sequences with similar beginnings but differing endings, should go really well on OoO machines with speculative execution. The net reduction in MB/s of traffic per MIPS could more than offset the slowdown caused by threading subroutines.

The X7 is not my shiny shiny but my pet dog. And my dog's better than your dog :)

C compiler for Nibz... it might happen, but that was never the ultimate aim..

Cheers Jacko

p.s. and if OS561 ever gets made, I will have failed in my purpose of not being bound by backward compatibility limits, stopping future development.
 
On Apr 19, 10:42 pm, jacko <jackokr...@gmail.com> wrote:
<snip>
I don't know why you are comparing the ZPU to your processor in terms
of performance and size. The ZPU is a 32 bit processor; yours is 16,
no?

Rick
 
On 20/04/11 09:45, jacko wrote:
> set the generic wide to 16, and it becomes 32 bit..
Is your processor design as good as your ability to post linked
messages?
 
On Apr 20, 4:45 am, jacko <jackokr...@gmail.com> wrote:
> set the generic wide to 16, and it becomes 32 bit..
And it doubles in size and slows down as well, yes?

Rick
 
It does not double, as the video, DAC and SPI hardware are all fixed in size, but yes, it gets bigger and slower. The memory data bus becomes 16 bit though.

Cheers Jacko
 
On 4/20/2011 10:55 AM, Jan Coombs wrote:
> On 20/04/11 09:45, jacko wrote:
>> set the generic wide to 16, and it becomes 32 bit..
>
> Is your processor design as good as your ability to post linked messages?
In other newsgroups the issue was attributed to the latest Google Groups
interface. Some solved the issue by returning to the old interface.
 
Well, seems to be working now, but I know what you mean. For example, the current compiler idea is to take Small-C, the 8080 compiler, and adapt it to compile under gcc (fixing some reserved word collisions and some missing forward definitions, but leaving the pointer and int casting warnings). I just need to complete the code generation section, get it outputting 8080 assembly, and then get it to output Nibz code with some macros for common operations. Then minimize the macro set and insert a code size optimizer.

http://scc-on-gcc.googlecode.com

Sometimes old, although not best, is simpler to get working.

Cheers Jacko
 
On 21/04/11 14:45, Christopher Felton wrote:
> On 4/20/2011 10:55 AM, Jan Coombs wrote:
>> On 20/04/11 09:45, jacko wrote:
>>> set the generic wide to 16, and it becomes 32 bit..
>>
>> Is your processor design as good as your ability to post linked
>> messages?
>
> In other newsgroups the issue was attributed to the latest Google
> Groups interface. Some solved the issue by returning to the old
> interface.
Yes, I should have remembered that. But the posts are also
difficult to understand because they are very terse and quote no
context. For example, this orphaned one was posted to c.l.f about 5
hours ago:


: f1 ... DUP ;

: f2 ... f1 ... ;

: f3 ... DROP f1 ... ;

Which leads to the idea of a NOP permutation generator.
 
