New soft processor core paper publisher?

Guest
I have a general purpose soft processor core that I developed in verilog. The processor is unusual in that it uses four indexed LIFO stacks with explicit stack pointer controls in the opcode. It is 32 bit, 2 operand, fully pipelined, 8 threads, and produces an aggregate 200 MIPs in bargain basement Altera Cyclone 3 and 4 speed grade 8 parts while consuming ~1800 LEs. The design is relatively simple (as these things go) yet powerful enough to do real work.

I wrote a fairly extensive paper describing the processor, and am about to post it and my code over at opencores.org, but was thinking the paper and the concepts might be good enough for a more formal publication. Any suggestions on who might be interested in publishing it?
 
Hi,

I was in a similar position about 5 years ago. My own processor is the
ByoRISC, a RISC-like extensible custom processor supporting multiple-
input, multiple-output custom instructions.

I have a general purpose soft processor core that I developed in verilog.  The processor is unusual in that it uses four indexed LIFO stacks with explicit stack pointer controls in the opcode.  It is 32 bit, 2 operand, fully pipelined, 8 threads, and produces an aggregate 200 MIPs in bargain basement Altera Cyclone 3 and 4 speed grade 8 parts while consuming ~1800 LEs.  The design is relatively simple (as these things go) yet powerful enough to do real work.
This reads like a "fourstack" architecture on steroids. It seems
good!
How do you compare with more classic RISC-like soft-cores like
MicroBlaze, Nios-II, LEON, etc?
There is also a classic book on stack-based computers, you really need
to go through this and reference it in your publication.

I wrote a fairly extensive paper describing the processor, and am about to post it and my code over at opencores.org, but was thinking the paper and the concepts might be good enough for a more formal publication.  Any suggestions on who might be interested in publishing it?
I chose to publish at VLSI-SoC 2008 (due to proximity; that year it was
held in Greece).
It is an OK conference, however it is not indexed well by DBLP and the
like.
Anyway, here is a link to my submitted version of the paper:
http://www.nkavvadias.com/publications/kavvadias_vlsisoc08.pdf

The paper was very well received at the conference venue; I got some of
my best reviews.
However, I didn't have the chance to present the paper in person,
because I was in really deep-S in the army and couldn't get a three-
day special leave for the conference. (I joined the army at 31, so I
was s'thing like an elderly private :). For instance, I was about the
same age as all the majors in the camp. Only colonels and people among
the permanent staff were older.

On the contrary, I had a hard time publishing an extended/long version
of the paper as a journal paper. All three publishers objected to the
existence of the conference paper, arguing that because of it no
journal version was necessary (even with ~40% material additions).

My suggestion is to:
a) go for the journal paper (e.g. IEEE Trans. on VLSI or ACM TECS if
you have s'thing really modern)
b) otherwise submit to an FPGA or architecture conference. It depends
on where you live, there are numerous European and worldwide
conferences with processor-related topics (FPGA-based architectures,
GPUs, ASIPs, novel architectures, manycores, etc).

In all cases you may have to adapt your material (e.g. due to page
limits) to the conventions of the publisher.

BTW another more recent example is the paper on the iDEA DSP soft-core
processor:
http://www.ntu.edu.sg/home/sfahmy/files/papers/fpt2012-cheah.pdf

This looks like a lean, mean architecture, well-suited to contemporary
FPGAs.


Hope these help.

Best regards,
Nikolaos Kavvadias
http://www.nkavvadias.com
 
I have a general purpose soft processor core that I developed in verilog. The processor is unusual in that it uses four indexed LIFO stacks with explicit stack pointer controls in the opcode. It is 32 bit, 2 operand, fully pipelined, 8 threads, and produces an aggregate 200 MIPs in bargain basement Altera Cyclone 3 and 4 speed grade 8 parts while consuming ~1800 LEs. The design is relatively simple (as these things go) yet powerful enough to do real work.

I wrote a fairly extensive paper describing the processor, and am about to post it and my code over at opencores.org, but was thinking the paper and the concepts might be good enough for a more formal publication. Any suggestions on who might be interested in publishing it?
Do you also have an assembler, C++ compiler and debugger for this beast?
You should have a reference design running on a FPGA board if you want to
attract a following. Ideally it should also run linux.

Why can't you do both? Post the code to opencores.org and then write a
paper about it and publish.

John

---------------------------------------
Posted through http://www.FPGARelated.com
 
Thank you for your reply Nikolaos!

This reads like a "fourstack" architecture on steroids. It seems
good!
"A Four Stack Processor" by Bernd Paysan? I ran across that paper several years ago (thanks!). Very interesting, but with multiple ALUs, access to data below the LIFO tops, TLBs, security, etc. it is much more complex than my processor. It looks like a real bear to program and manage at the lowest level.

How do you compare with more classic RISC-like soft-cores like
MicroBlaze, Nios-II, LEON, etc?
The target audience for my processor is an FPGA developer who needs to implement complex functionality that tolerates latency but requires deterministic timing. Hand coding with no toolchain (verilog initial statement boot code). Simple enough to keep the processor model and current state in one's head (with room to spare). Small enough to fit in the smallest of FPGAs (with room to spare). Not meant at all to run a full-blown OS, but not a trivial processor.

There is also a classic book on stack-based computers, you really need
to go through this and reference it in your publication.
"Stack Computers: The New Wave" by Philip J. Koopman, Jr.? Also ran across that many years ago (thanks!). The main thrust of it seems to be the advocating of single data stack, single return stack, zero operand machines, which I feel (nothing personal) are crap. Easy to design and implement (I've made several while under the spell) but impossible to program in an efficient manner (gobs of real time wasted on stack thrash, the minimization of which leads directly to unreadable procedural coding practices, which leads to catastrophic stack faults).

On the contrary, I had a hard time publishing an extended/long version
of the paper as a journal paper. All three publishers objected to the
existence of the conference paper, arguing that because of it no
journal version was necessary (even with ~40% material additions).
Hmm. The last thing I want is to have my hands tied when I'm trying to give something away for free. But my paper would likely benefit from external editorial input.

My suggestion is to:
a) go for the journal paper (e.g. IEEE Trans. on VLSI or ACM TECS if
you have s'thing really modern)
My processor incorporates what I believe are a couple of new innovations (but who ever really knows?) that I'd like to get out there if possible. And I wouldn't mind a bit of personal recognition, if only for my efforts.

IEEE is probably out. I fundamentally disagree with the hoarding of technical papers behind a greedy paywall.

BTW another more recent example is the paper on the iDEA DSP soft-core
processor:
http://www.ntu.edu.sg/home/sfahmy/files/papers/fpt2012-cheah.pdf
Wow, very nice paper describing a very nice design, thanks!
 
Thanks for the response John!

Do you also have an assembler, C++ compiler and debugger for this beast?
You should have a reference design running on a FPGA board if you want to
attract a following. Ideally it should also run linux.
See my response to Nikolaos above. Full-blown OS support was not the development target. But it's not a pico-blaze either. Somewhere in the middle, mainly for FPGA algorithms that can benefit from serialization.

Why can't you do both. Post the code to opencores.org and then write a
paper about it and publish.
That's probably the route I'll end up taking.
 
Eric Wallin <tammie.eric@gmail.com> wrote:
Thanks for the response John!

Do you also have an assembler, C++ compiler and debugger for this beast?
You should have a reference design running on a FPGA board if you want to
attract a following. Ideally it should also run linux.

See my response to Nikolaos above. Full-blown OS support was not the
development target. But it's not a pico-blaze either. Somewhere in the
middle, mainly for FPGA algorithms that can benefit from serialization.
Benchmarks. Tell us why we should use your processor. How does it win
compared with the alternatives?

How easy is it to program? An assembler or a C compiler are really
necessary to make something usable - LLVM may come in handy as a C compiler
toolkit, I'm not sure what's an equivalent assembler toolkit.

Actually synthesise the thing. It's hard to take seriously something that's
never actually been tested for real, especially if it makes assumptions like
having gigabytes of single-cycle-latency memory. Debug it and make sure it
works in real hardware.

Why can't you do both. Post the code to opencores.org and then write a
paper about it and publish.
If you put it on opencores, document document document. There are tons of
half-baked projects with lame or nonexistent documentation, that kind of
half work on the author's dev system but fall over in real life for one
reason or another.

Is it vendor-independent, or does it use Xilinx/Altera/etc special stuff?
If so, how easily can that be replaced with an alternative vendor?

Regression tests and test suites. How do we know it's working? Can we work
on the code and make sure we don't break anything? What does 'working' mean
in the first place?


If you're trying to make an argument in computer architecture you can get
away without some of this stuff (a research prototype can have rough edges
because it's only to prove a point, as long as you tell us what they are).
Generally you need to tell a convincing story, and either the story is that
XYZ is a useful approach to take (so we can throw away the prototype and
build something better) or XYZ is a component people should use (when it
becomes more convincing if there's more support)

Some lists of well-known conferences:
http://sites.google.com/site/calasweb/fpga-conferences-and-workshops
http://tcfpga.org/conferences.html

Good luck :)

Theo
 
Thanks for your response Theo!

On Thursday, June 13, 2013 8:44:03 PM UTC-4, Theo Markettos wrote:

Benchmarks. Tell us why we should use your processor. How does it win
compared with the alternatives?
Good point. So far I've coded a verification boot code gauntlet that it has passed, as well as restoring division and log2. If I had more code to push through it I could statistically tailor the instruction set (size the immediates, etc.) but I don't. I may at some point but I may not either. This is mainly for me, to help me to implement various projects that require complex computations in an FPGA (I currently need it for a digital Theremin that is under development), but I want to release it so others may examine and possibly use it or help me make it better, or use some of the ideas in there in their own stuff.
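For what it's worth, the division code is just the textbook restoring (shift-and-subtract) algorithm running in software. A behavioral Verilog sketch of the algorithm itself, purely for illustration (this is not my processor code), would be something like:

module div_sketch;
    // 32/32 restoring division: quotient ends up in the low half of r,
    // remainder in the high half. Illustrative only.
    function [63:0] restoring_div;
        input [31:0] dividend, divisor;
        reg   [63:0] r;
        integer i;
        begin
            r = {32'd0, dividend};
            for (i = 0; i < 32; i = i + 1) begin
                r = r << 1;                          // shift remainder:quotient pair
                if (r[63:32] >= divisor) begin
                    r[63:32] = r[63:32] - divisor;   // subtract fits: quotient bit = 1
                    r[0] = 1'b1;
                end                                  // else "restore" (no change): bit = 0
            end
            restoring_div = r;                       // {remainder, quotient}
        end
    endfunction

    initial begin : demo
        reg [63:0] q;
        q = restoring_div(32'd100, 32'd7);
        $display("100 / 7 = %0d rem %0d", q[31:0], q[63:32]);  // 14 rem 2
    end
endmodule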

How easy is it to program? An assembler or a C compiler are really
necessary to make something usable - LLVM may come in handy as a C compiler
toolkit, I'm not sure what's an equivalent assembler toolkit.
It's fairly general purpose and I think if you read the paper you might (or might not) find it easy to understand and program by hand using verilog initial statements. My main goals were that it be simple enough to grasp without tools, complex and fast enough to do real things, have compact opcodes so BRAM isn't wasted, etc. A compiler, OS, etc. are overkill and definitely not the intended target.

There is a middle ground between trivial and full-blown processors (particularly for use inside FPGA logic). Of all the commercial offerings in this range that I'm aware of, my processor is probably most similar to the Parallax Propeller, which is almost certainly pipeline threaded (though they don't tell you that in the documentation). The Propeller has a video generator; character, sine, and log tables; and other stuff mine doesn't. But mine has a simpler, more unified overall concept and programming model. It is a true middle ground between register and stack machines.

Actually synthesise the thing. It's hard to take seriously something that's
never actually been tested for real, especially if it makes assumptions like
having gigabytes of single-cycle-latency memory. Debug it and make sure it
works in real hardware.
Not trying to argue from authority, but I've got 10 years of professional HDL experience, and have made several processors in the past for my own edification and had them up and running on Xilinx demo boards. This one hasn't actually run in the flesh yet, but it has gone through the build process many times and has been pretty thoroughly verified, so I would be amazed if there were any issues (famous last words). But I'll run it on a Cyclone IV board before releasing it.

If you put it on opencores, document document document. There are tons of
half-baked projects with lame or nonexistent documentation, that kind of
half work on the author's dev system but fall over in real life for one
reason or another.
I know what you mean, I never use any code directly from there. To be fair, most of the code I ran across in industry was fairly poor as well. Anyway, I've got a really nice document that took me about a month to write, with lots of drawings, tables, examples, etc. describing the design and my thoughts behind it. Even if people don't particularly like my processor they might be able to get something out of the background info in the paper (FPGA multipliers and RAM, LIFO & ALU design, pipelining, register set construction, etc.).

Is it vendor-independent, or does it use Xilinx/Altera/etc special stuff?
If so, how easily can that be replaced with an alternative vendor?
I was careful to not use vendor specific constructs in the verilog. The block RAM for main memory and the stacks is inferred, as are the ALU signed multipliers. I spent a long time on the modular partitioning of the code with a strong eye towards verification (as I usually do). The code was developed in Quartus, and has been compiled many, many times, but I haven't run it through XST yet.
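If anyone is curious what "inferred" means here, it's just the plain behavioral RAM template that both Quartus and XST recognize, along these lines (a generic sketch, not my actual stack or memory module; widths and names are only for illustration):

module generic_ram #(
    parameter DW = 32,               // data width (illustrative)
    parameter AW = 8                 // address width (illustrative)
) (
    input                 clk,
    input                 wr_en,
    input      [AW-1:0]   wr_addr,
    input      [DW-1:0]   wr_data,
    input      [AW-1:0]   rd_addr,
    output reg [DW-1:0]   rd_data
);
    reg [DW-1:0] mem [0:(1<<AW)-1];

    always @(posedge clk) begin
        if (wr_en)
            mem[wr_addr] <= wr_data;
        rd_data <= mem[rd_addr];     // synchronous read, so it should map to block RAM
    end
endmodule

No altsyncram, no RAMB16, so it should port to other vendors' tools unchanged.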

Regression tests and test suites. How do we know it's working? Can we work
on the code and make sure we don't break anything? What does 'working' mean
in the first place?
I'm probably an odd man out, but I don't agree with a lot of "standard" industry verification methodology. Test benches are fine for really complex code and / or data environments, but there is no substitute for good coding, proper modular partitioning, and thorough hand testing of each module. I've seen too many out of control projects with designers throwing things over various walls, leaving the verification up to the next guy who usually isn't familiar enough with it to really bang on the sensitive parts. And I kind of hate modelsim.

Anyone that codes should spend a lot of time verifying - I do, and for the most part really enjoy it. The industry has turned this essential activity into something most people loathe, so it just doesn't happen unless people get pushed into doing it. And even then it usually doesn't get done very thoroughly. Co-developing in environments like that is a nightmare.

Some lists of well-known conferences:
http://sites.google.com/site/calasweb/fpga-conferences-and-workshops
http://tcfpga.org/conferences.html
Thanks, I'll check them out!
 
Eric Wallin <tammie.eric@gmail.com> wrote:
Thanks for your response Theo!

On Thursday, June 13, 2013 8:44:03 PM UTC-4, Theo Markettos wrote:

Benchmarks. Tell us why we should use your processor. How does it win
compared with the alternatives?

Good point. So far I've coded a verification boot code gauntlet that it
has passed, as well as restoring division and log2. If I had more code to
push through it I could statistically tailor the instruction set (size the
immediates, etc.) but I don't. I may at some point but I may not either.
This is mainly for me, to help me to implement various projects that
require complex computations in an FPGA (I currently need it for a digital
Theremin that is under development), but I want to release it so others
may examine and possibly use it or help me make it better, or use some of
the ideas in there in their own stuff.
FWIW 'benchmarks' doesn't necessarily mean running SPECfoo at 2.7 times
quicker than a 4004, but things like 'how many instructions does it take to
write division/FFT/quicksort/whatever' compared with the leading brand. Or
how many LEs, BRAMs, mW, etc. Numbers are good (as is publishing the source
so we can reproduce them).

How easy is it to program? An assembler or a C compiler are really
necessary to make something usable - LLVM may come in handy as a C compiler
toolkit, I'm not sure what's an equivalent assembler toolkit.

It's fairly general purpose and I think if you read the paper you might
(or might not) find it easy to understand and program by hand using
verilog initial statements. My main goals were that it be simple enough
to grasp without tools, complex and fast enough to do real things, have
compact opcodes so BRAM isn't wasted, etc. A compiler, OS, etc. are
overkill and definitely not the intended target.
Fair enough. If you're making architectural points, you can probably get
away with assembly examples. A simple assembler is good for developer
sanity, though. Could probably be knocked up in Python reasonably fast.

Actually synthesise the thing. It's hard to take seriously something that's
never actually been tested for real, especially if it makes assumptions like
having gigabytes of single-cycle-latency memory. Debug it and make sure it
works in real hardware.

Not trying to argue from authority, but I've got 10 years of professional
HDL experience, and have made several processors in the past for my own
edification and had them up and running on Xilinx demo boards. This one
hasn't actually run in the flesh yet, but it has gone through the build
process many times and has been pretty thoroughly verified, so I would be
amazed if there were any issues (famous last words). But I'll run it on a
Cyclone IV board before releasing it.
I'm just a bit jaded from seeing papers at conferences where somebody wrote
some verilog which they only ran in modelsim, and never had to worry about
limited BRAM, or meeting timing, or multiple clock domains, or...

If you put it on opencores, document document document. There are tons of
half-baked projects with lame or nonexistent documentation, that kind of
half work on the author's dev system but fall over in real life for one
reason or another.

I know what you mean, I never use any code directly from there. To be
fair, most of the code I ran across in industry was fairly poor as well.
Anyway, I've got a really nice document that took me about a month to
write, with lots of drawings, tables, examples, etc. describing the
design and my thoughts behind it. Even if people don't particularly like
my processor they might be able to get something out of the background
info in the paper (FPGA multipliers and RAM, LIFO & ALU design,
pipelining, register set construction, etc.).
This is good. Just a thought - could you angle it as 'how to do processor
design' using your processor as a case study? That makes it more of a
useful tutorial than 'buy our brand, it's great'...

I'm probably an odd man out, but I don't agree with a lot of "standard"
industry verification methodology. Test benches are fine for really
complex code and / or data environments, but there is no substitute for
good coding, proper modular partitioning, and thorough hand testing of
each module. I've seen too many out of control projects with designers
throwing things over various walls, leaving the verification up to the
next guy who usually isn't familiar enough with it to really bang on the
sensitive parts. And I kind of hate modelsim.
That's not exactly what I meant... let's say you rearrange the pipelining on
your CPU. It turns out you introduce some obscure bug that causes branches to
jump to the wrong place if there's a multiply 3 instructions back from the
branch. How would you know if you did this, and make sure it didn't happen
again? Hand testing modules won't catch that.

It's worse if there's an OS involved, of course. But it can be easy to
introduce stupid bugs when you're refactoring something, and waste a lot of
time tracking them down.

We use Bluespec so avoid modelsim ;-)
(with Jenkins so we run the test suite for every commit. A bit overkill for
your needs, perhaps)

Anyone that codes should spend a lot of time verifying - I do, and for the
most part really enjoy it. The industry has turned this essential
activity into something most people loathe, so it just doesn't happen
unless people get pushed into doing it. And even then it usually doesn't
get done very thoroughly. Co-developing in environments like that is a
nightmare.
I admit the tools don't always make it easy...

Theo
 
On Friday, June 14, 2013 5:29:15 PM UTC-4, Theo Markettos wrote:

FWIW 'benchmarks' doesn't necessarily mean running SPECfoo at 2.7 times
quicker than a 4004, but things like 'how many instructions does it take to
write division/FFT/quicksort/whatever' compared with the leading brand. Or
how many LEs, BRAMs, mW, etc. Numbers are good (as is publishing the source
so we can reproduce them).
I have FPGA resource numbers for the Cyclone III target in the paper. Briefly, it consumes ~1800 LEs, 4 18x18 multipliers, 4 BRAMs for the stacks, plus whatever the main memory needs. This is roughly 1/3 of the smallest Cyclone III part. I have a restoring division example in the paper that gives 197 / 293 cycles best / worst case (a thread cycle is 8 200MHz clocks, but there are 8 threads running at this speed so aggregate throughput is potentially 200 MIPs if all threads are busy doing something).
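To spell out the arithmetic from those numbers: each thread issues one instruction every 8 clocks, so a single thread runs at 200 MHz / 8 = 25 MIPS, and 8 threads x 25 MIPS = 200 MIPS aggregate. A thread cycle is 8 / 200 MHz = 40 ns, so the 197 / 293 cycle division works out to roughly 7.9 / 11.7 us best / worst case.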

I've seen lots of papers that claim speed numbers but don't give the speed grade, or tell you what hoops they jumped through to get those speeds. Without that info the speeds are meaningless.

Fair enough. If you're making architectural points, you can probably get
away with assembly examples. A simple assembler is good for developer
sanity, though. Could probably be knocked up in Python reasonably fast.
That's certainly possible. At this point I'm writing code for it directly in verilog using an initial statement text file that gets included in the main memory. Several define statements make this clearer and actually fairly easy. But uploading code to a boot loader would require something like an assembler. I'm really trying to stay away from the need for toolsets.
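Just to give the flavor of the mechanism (the mnemonics, widths, and encodings below are invented for illustration only; the real ones are in the paper):

`define OP_LIT  8'h01   // push immediate byte   (made-up encoding)
`define OP_ADD  8'h02   // pop two, push the sum (made-up encoding)
`define OP_JMP  8'h03   // jump to address       (made-up encoding)

module boot_mem (
    input             clk,
    input       [9:0] addr,
    output reg [15:0] data
);
    reg [15:0] mem [0:1023];         // inferred BRAM main memory

    initial begin
        // normally this is an `include'd text file of hand-assembled code
        mem[0] = {`OP_LIT, 8'd5};    // push 5
        mem[1] = {`OP_LIT, 8'd7};    // push 7
        mem[2] = {`OP_ADD, 8'd0};    // 5 + 7
        mem[3] = {`OP_JMP, 8'd3};    // spin here
    end

    always @(posedge clk)
        data <= mem[addr];
endmodule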

This is good. Just a thought - could you angle it as 'how to do processor
design' using your processor as a case study? That makes it more of a
useful tutorial than 'buy our brand, it's great'...
The paper is kind of that, background and general how to, but my processor doesn't have caches, branch prediction, pipeline hazards, TLBs, etc. so people wanting to know how to do that stuff will come up totally empty.

That's not exactly what I meant... let's say you rearrange the pipelining on
your CPU. It turns out you introduce some obscure bug that causes branches to
jump to the wrong place if there's a multiply 3 instructions back from the
branch. How would you know if you did this, and make sure it didn't happen
again? Hand testing modules won't catch that.
It's correct by construction! ;-) Seriously though, there are no hazards to speak of and very little internal state, so branches pretty much either work or they don't. Once basic functionality was confirmed in simulation, I used processor code to check the processor itself e.g. I wrote some code that checks all branches against all possible branch conditions. Each test increments a count if it passes or decrements if it fails. The final passing number can only be reached if all tests pass. I've got simple code like this to test all of the opcodes. This exercise can help give an early feel for the completeness of the instruction set as well. Verifying something like the Pentium must be one agonizingly painful mountain to climb. Verifying each silicon copy must be a bear as well.
 
So is there anything like the old Byte magazine (or a web equivalent) where enthusiasts and other non-academic, non-industry types can publish articles on computers / computing hardware?
 
On 6/13/2013 1:07 PM, Eric Wallin wrote:
Thank you for your reply Nikolaos!

This reads like a "fourstack" architecture on steroids. It seems
good!

"A Four Stack Processor" by Bernd Paysan? I ran across that paper several years ago (thanks!). Very interesting, but with multiple ALUs, access to data below the LIFO tops, TLBs, security, etc. it is much more complex than my processor. It looks like a real bear to program and manage at the lowest level.

How do you compare with more classic RISC-like soft-cores like
MicroBlaze, Nios-II, LEON, etc?

The target audience for my processor is an FPGA developer who needs to implement complex functionality that tolerates latency but requires deterministic timing. Hand coding with no toolchain (verilog initial statement boot code). Simple enough to keep the processor model and current state in one's head (with room to spare). Small enough to fit in the smallest of FPGAs (with room to spare). Not meant at all to run a full-blown OS, but not a trivial processor.
That's the ground I have been plowing off and on for the last 10 years.


There is also a classic book on stack-based computers, you really need
to go through this and reference it in your publication.

"Stack Computers: The New Wave" by Philip J. Koopman, Jr.? Also ran across that many years ago (thanks!). The main thrust of it seems to be the advocating of single data stack, single return stack, zero operand machines, which I feel (nothing personal) are crap. Easy to design and implement (I've made several while under the spell) but impossible to program in an efficient manner (gobs of real time wasted on stack thrash, the minimization of which leads directly to unreadable procedural coding practices, which leads to catastrophic stack faults).
I assume that you do understand that the point of MISC is that the
implementation can be minimized so that the instructions run faster. In
theory this makes up for the extra instructions needed to manipulate the
stack on occasion. But I understand your interest in minimizing the
inconvenience of stack ops. I spent a little time looking at
alternatives and am currently looking at a stack CPU design that allows
offsets into the stack to get around the extra stack ops. I'm not sure
how this compares to your ideas. It is still a dual stack design as I
have an interest in keeping the size of the implementation at a minimum.
1800 LEs won't even fit on the FPGAs I am targeting.


My processor incorporates what I believe are a couple of new innovations (but who ever really knows?) that I'd like to get out there if possible. And I wouldn't mind a bit of personal recognition, if only for my efforts.
I would like to hear about your innovations. As you seem to understand,
it is hard to be truly innovative finding new ideas that others have not
uncovered. But I think you are certainly in an area that is not
thoroughly explored.


IEEE is probably out. I fundamentally disagree with the hoarding of technical papers behind a greedy paywall.
I won't argue with that. Even when I was an IEEE member, I never found
a document I didn't have to pay for.

When can we expect to see your paper?

--

Rick
 
Thanks for your response rickman!

On Saturday, June 15, 2013 8:40:27 PM UTC-4, rickman wrote:
That's the ground I have been plowing off and on for the last 10 years.
Ooo, same here, and my condolences. I caught a break a couple of months ago and have been beavering away on it ever since, and I finally have something that doesn't cause me to vomit when I code for it. Multiple indexed simple stacks with explicit pointer control make everything a lot easier than a bog standard stack machine. I think the auto-consumption of literally everything, particularly the data, indexes, and pointers you dearly want to use again, is at the bottom of all the craziness people just accept with stack machines. This mechanism works great for manual data entry on HP calculators, but not so much for stack machines IMHO. Auto consumption also pretty much rules out conditional execution of single instructions.

I assume that you do understand that the point of MISC is that the
implementation can be minimized so that the instructions run faster. In
theory this makes up for the extra instructions needed to manipulate the
stack on occasion. But I understand your interest in minimizing the
inconvenience of stack ops. I spent a little time looking at
alternatives and am currently looking at a stack CPU design that allows
offsets into the stack to get around the extra stack ops. I'm not sure
how this compares to your ideas. It is still a dual stack design as I
have an interest in keeping the size of the implementation at a minimum.
MISC is interesting, but you have to consider that all ops, including simple stack manipulations, will generally consume as much real time as a multiply, which suddenly makes all of those confusing stack gymnastics you have to perform to dig out your loop index or whatever from underneath your read/write pointer from underneath your data and such overly burdensome.

Indexes into a moving stack - that way lies insanity. Ever hit the roll down button on an HP calculator and get instantly flummoxed? Maybe a compiler can keep track of that kind of stuff, but my weak brain isn't up to the task.

Altera BRAM doesn't go as wide as Xilinx with true dual port. When I was working in Xilinx I was able to use a single BRAM for both the data and return stacks (16 bit data).

1800 LEs won't even fit on the FPGAs I am targeting.
I'm not sure anything less than the smallest Cyclone 2 is really worth developing in. A lot of the stuff below that is often more expensive due to the built-in configuration memory and such. There are quite inexpensive Cyclone dev boards on eBay from China.

I would like to hear about your innovations. As you seem to understand,
it is hard to be truly innovative finding new ideas that others have not
uncovered. But I think you are certainly in an area that is not
thoroughly explored.
I haven't seen anything exactly like it, certainly not the way the stacks are implemented. And I deal with extended arithmetic results in an unusual way. In terms of scheduling and pipelining, the Parallax Propeller is probably the closest in architecture (you can infer from the specs and operational model what they don't explicitly tell you in the datasheet).

I won't argue with that. Even when I was an IEEE member, I never found
a document I didn't have to pay for.
I was a member too right out of grad school. But, like Janet Jackson sang: "What have they done for me lately?"

When can we expect to see your paper?
It's all but done, just picking around the edges at this point. As soon as the code is verified to my satisfaction I'll release both and post here.
 
Eric Wallin wrote:

Indexes into a moving stack - that way lies insanity. Ever hit the roll down button on an HP calculator and get instantly flummoxed? Maybe a compiler can keep track of that kind of stuff, but my weak brain isn't up to the task.
Have a look at comp.arch, in particular the current discussion about the "belt" in the Mill processor. Start by watching the video.

The Mill is a radical architecture that offers far greater instruction level parallelism than existing processors, partly by having no general purpose registers.

The Mill is irrelevant to FPGA processors; it is aimed at beating x86 machines.
 
On Sunday, June 16, 2013 5:23:01 AM UTC-4, Tom Gardner wrote:

The Mill is irrelevant to FPGA processors...
The Mill looks vaguely interesting (if you're into billion transistor processors) but as you indicated I'm not sure how it is relevant to this thread?
 
Eric Wallin wrote:
On Sunday, June 16, 2013 5:23:01 AM UTC-4, Tom Gardner wrote:

The Mill is irrelevant to FPGA processors...

The Mill looks vaguely interesting (if you're into billion transistor processors) but as you indicated I'm not sure how it is relevant to this thread?
You wrote "Indexes into a moving stack - that way lies insanity."
The Mill's belt is effectively exactly that, and they appear not to have gone insane.
 
On Sunday, June 16, 2013 1:16:42 PM UTC-4, Tom Gardner wrote:

You wrote "Indexes into a moving stack - that way lies insanity."
The Mill's belt is effectively exactly that, and they appear not to have gone insane.
I bet they would if they tried to hand code it in assembly! ;-)

The first video comment is priceless: "Gandalf?"
 
Eric Wallin wrote:
On Sunday, June 16, 2013 1:16:42 PM UTC-4, Tom Gardner wrote:

You wrote "Indexes into a moving stack - that way lies insanity."
The Mill's belt is effectively exactly that, and they appear not to have gone insane.

I bet they would if they tried to hand code it in assembly! ;-)
It *is* considerably easier than hand-coding Itanium. With that
you change *any* aspect of the microarchitecture and you go back
to the beginning. How do I know? I asked someone that was doing
it to assess its performance, and decided to Run Away from
anything to do with the Itanium.


The first video comment is priceless: "Gandalf?"
How shallow :)
 
On 6/15/2013 10:17 PM, Eric Wallin wrote:
Thanks for your response rickman!

On Saturday, June 15, 2013 8:40:27 PM UTC-4, rickman wrote:
That's the ground I have been plowing off and on for the last 10 years.

Ooo, same here, and my condolences. I caught a break a couple of months ago and have been beavering away on it ever since, and I finally have something that doesn't cause me to vomit when I code for it. Multiple indexed simple stacks with explicit pointer control make everything a lot easier than a bog standard stack machine. I think the auto-consumption of literally everything, particularly the data, indexes, and pointers you dearly want to use again, is at the bottom of all the craziness people just accept with stack machines. This mechanism works great for manual data entry on HP calculators, but not so much for stack machines IMHO. Auto consumption also pretty much rules out conditional execution of single instructions.
I was looking at how to improve a stack design a few months ago and came
to a similar conclusion. My first attempt at getting around the stack
ops was to use registers. I was able to write code that was both
smaller and faster, since in my design all instructions are one clock
cycle, so executed instruction count equals the number of machine cycles.
Well, sort of. My original dual stack design was literally one clock
per instruction. In order to work with clocked block RAM, the register
machine would use either both phases of two clocks per machine cycle or
four clock cycles.

While pushing ideas around on paper, the J1 design gave me the idea of
adjusting the stack pointer as well as using an offset in each
instruction. That gave a design that is even faster with fewer
instructions. I'm not sure if it is practical in a small opcode. I
have been working with 8 and 9 bit opcodes; the latest approach with
stack pointer control can fit in 9 bits, but I would be happier with a
couple more bits.
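Very roughly, and hand-waving all the details I haven't settled yet, the
hardware side of the idea would be something like this (widths, field
sizes, and names made up on the spot):

module offset_stack #(
    parameter DW = 16,               // data width (made up)
    parameter AW = 5                 // 32-entry stack (made up)
) (
    input                 clk,
    input  signed [1:0]   sp_adj,     // from the opcode: -1 pop, 0 hold, +1 push
    input      [AW-1:0]   rd_offset,  // from the opcode: read this far below the new top
    input                 wr_en,      // write the new top
    input      [DW-1:0]   wr_data,
    output reg [DW-1:0]   rd_data
);
    reg [AW-1:0] sp = 0;
    reg [DW-1:0] mem [0:(1<<AW)-1];

    // sign-extend the pointer adjust; arithmetic wraps modulo the stack depth
    wire [AW-1:0] sp_next = sp + {{(AW-2){sp_adj[1]}}, sp_adj};

    always @(posedge clk) begin
        sp <= sp_next;
        if (wr_en)
            mem[sp_next] <= wr_data;             // push / overwrite the new top
        rd_data <= mem[sp_next - rd_offset];     // registered offset read (BRAM friendly)
    end
endmodule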


I assume that you do understand that the point of MISC is that the
implementation can be minimized so that the instructions run faster. In
theory this makes up for the extra instructions needed to manipulate the
stack on occasion. But I understand your interest in minimizing the
inconvenience of stack ops. I spent a little time looking at
alternatives and am currently looking at a stack CPU design that allows
offsets into the stack to get around the extra stack ops. I'm not sure
how this compares to your ideas. It is still a dual stack design as I
have an interest in keeping the size of the implementation at a minimum.

MISC is interesting, but you have to consider that all ops, including simple stack manipulations, will generally consume as much real time as a multiply, which suddenly makes all of those confusing stack gymnastics you have to perform to dig out your loop index or whatever from underneath your read/write pointer from underneath your data and such overly burdensome.
Programming to facilitate stack optimization is king on a stack machine.
I'm not sure how the multiply speed is relevant, but the real question
is just how fast does an algorithm run which has to include all the
instructions needed as well as the clock speed. Then it is also
important to consider resources used. I think you said your design uses
1800 LEs which is a *lot* more than a simple two stack design. They
aren't always available.


Indexes into a moving stack - that way lies insanity. Ever hit the roll down button on an HP calculator and get instantly flummoxed? Maybe a compiler can keep track of that kind of stuff, but my weak brain isn't up to the task.
Then I don't know why you are designing CPUs, lol! I like RPN
calculators and have trouble using anything else. I also program in
Forth so this all works for me.


Altera BRAM doesn't go as wide as Xilinx with true dual port. When I was working in Xilinx I was able to use a single BRAM for both the data and return stacks (16 bit data).
I expect Xilinx has some patent that Altera can't get around for a
couple more years. Lattice seems to be pretty good though. I would
just prefer to have an async read, since that works better in a one
clock machine cycle.


1800 LEs won't even fit on the FPGAs I am targeting.

I'm not sure anything less than the smallest Cyclone 2 is really worth developing in. A lot of the stuff below that is often more expensive due to the built-in configuration memory and such. There are quite inexpensive Cyclone dev boards on eBay from China.
I don't know about dev board cost, but I can get a 1280 LUT Lattice part
for under $4 in reasonable quantity. That is the area I typically work
in. My big problem is packages. I don't want to have to use extra fine
pitch on PCBs to avoid the higher costs. BGAs require very fine via
holes and fine pitch PCB traces and run the board costs up a bit. None
of the FPGA makers support the parts I like very well. VQ100 is my
favorite, small but enough pins for most projects.


I would like to hear about your innovations. As you seem to understand,
it is hard to be truly innovative finding new ideas that others have not
uncovered. But I think you are certainly in an area that is not
thoroughly explored.

I haven't seen anything exactly like it, certainly not the way the stacks are implemented. And I deal with extended arithmetic results in an unusual way. In terms of scheduling and pipelining, the Parallax Propeller is probably the closest in architecture (you can infer from the specs and operational model what they don't explicitly tell you in the datasheet).

I won't argue with that. Even when I was an IEEE member, I never found
a document I didn't have to pay for.

I was a member too right out of grad school. But, like Janet Jackson sang: "What have they done for me lately?"
My mistake was getting involved in the local chapters. Seems IEEE is
just a good ol' boys network and is all about status and going along to
get along. They don't believe in the written rules, more so the
unwritten ones.


When can we expect to see your paper?

It's all but done, just picking around the edges at this point. As soon as the code is verified to my satisfaction I'll release both and post here.
Ok, looking forward to it.

--

Rick
 
Eric Wallin wrote:
Quite the contrary, I've used HP calculators religiously since
I won one in a HS engineering contest almost 30 years ago.
Too bad they don't make the "real" ones anymore (35S is the
best they can do it seems, maybe they lost the plans along
with those of the Saturn V).
I'm sure HP still has the plans for the Saturn, viz
http://www.hpmuseum.org/saturn.htm

Sorry, couldn't resist.


I really want to like Forth, but after reading the books
and being repeatedly repelled by the syntax and programming
model I gave up.
Nobody /writes/ Forth. They write programs that emit Forth.
The most mainstream example of that is printer drivers
emitting PostScript.
 
On Wednesday, June 19, 2013 5:13:04 PM UTC-4, rickman wrote:

While pushing ideas around on paper, the J1 design gave me the idea of
adjusting the stack pointer as well as using an offset in each
instruction. That gave a design that is even faster with fewer
instructions. I'm not sure if it is practical in a small opcode.
Interesting. The J1 strongly influenced me as well.

I have been working with 8 and 9 bit opcodes; the latest approach with
stack pointer control can fit in 9 bits, but I would be happier with a
couple more bits.
I decided to stay away from non-power-of-2 widths for instructions and data; they're not efficient in standard storage. Having multiple instructions per word I now see as more of a bug than a feature, because you have to index into the word to return from a subroutine, and how / where do you store the index?

Programming to facilitate stack optimization is king on a stack machine.
I feel that this is a fiddly activity that wastes the programmer's time and creates code that is exceedingly difficult to figure out later.

I'm not sure how the multiply speed is relevant, but the real question
is just how fast does an algorithm run which has to include all the
instructions needed as well as the clock speed.
Multiply is relevant because in a 32 bit machine it will likely be THE speed bottleneck, pulling overall timing down. The vendors include non-fabric registering at the I/O of the FPGA multiply hardware to help pipeline it. Same with BRAM - reads really speed up if you use the "free" output registering (in addition to the synchronous read register you are generally forced to use).
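For example, registering the multiplier inputs and the product lets the tools absorb those registers into the hard multiplier block, which is what gets the timing up (a generic sketch, not my actual ALU):

module mult_pipe #(
    parameter W = 32
) (
    input                        clk,
    input  signed [W-1:0]        a,
    input  signed [W-1:0]        b,
    output reg signed [2*W-1:0]  p
);
    reg signed [W-1:0] a_r, b_r;

    always @(posedge clk) begin
        a_r <= a;              // input registers, absorbed by the multiplier block
        b_r <= b;
        p   <= a_r * b_r;      // product register, likewise
    end
endmodule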

Indexes into a moving stack - that way lies insanity. Ever hit the roll down button on an HP calculator and get instantly flummoxed? Maybe a compiler can keep track of that kind of stuff, but my weak brain isn't up to the task.

Then I don't know why you are designing CPUs, lol! I like RPN
calculators and have trouble using anything else. I also program in
Forth so this all works for me.
Quite the contrary, I've used HP calculators religiously since I won one in a HS engineering contest almost 30 years ago. Too bad they don't make the "real" ones anymore (35S is the best they can do it seems, maybe they lost the plans along with those of the Saturn V). But when I hit the roll down button to find a value on the stack, I have to give up on the other stack items due to confusion. I really want to like Forth, but after reading the books and being repeatedly repelled by the syntax and programming model I gave up.

My goal with CPU design was to make one simple enough to program without special tools, but complex enough to do real work and I think I've finally achieved that.

I expect Xilinx has some patent that Altera can't get around for a
couple more years. Lattice seems to be pretty good though. I just
would prefer to have an async read since that works in a one clock
machine cycle better.
I like Lattice parts too, and used the original MachXO on many boards in lieu of a CPLD.

But I gave up on single cycle along with two stacks and autoconsumption. Like you say async read BRAM is hard to come by. Single cycle is also slow and strands a bazillion FFs in the fabric.

I wonder if you've read this article:

http://spectrum.ieee.org/semiconductors/processors/25-microchips-that-shook-the-world

Moore made a lot of money off of what seem like frivolous lawsuits, which brings him down several notches in my eyes.
 
