Altera Cyclone replacement

On Wednesday, January 30, 2019 at 8:56:28 PM UTC-5, jim.bra...@ieee.org wrote:
On Wednesday, January 30, 2019 at 7:37:56 PM UTC-6, gnuarm.del...@gmail.com wrote:
On Wednesday, January 30, 2019 at 7:42:54 PM UTC-5, jim.bra...@ieee.org wrote:
On Wednesday, January 30, 2019 at 11:14:26 AM UTC-6, gnuarm.del...@gmail.com wrote:

]>Microblaze is proprietary. I believe there may be some open source versions available, but I expect there are open source versions of the NIOS available as well.

Microblaze clones: aeMB, an-noc-mpsoc, mblite, mb-lite-plus, myblaze, openfire_core, openfire2, secretblaze

No NIOS clones that I know of

]>But perhaps more importantly, they are far from optimal.
Ugh, they have some of the best figure-of-merit numbers available.
(Instructions per second per LUT)
And are available in many configuration options.

There are a large variety of RISC-V cores available some of which have low LUT counts.

Jim Brakefield

Not sure what figures you are talking about. Has anyone compiled a comparison?


Rick C.

++ Get 6 months of free supercharging
++ Tesla referral code - https://ts.la/richard11209

Altera/Intel: "Nios II Performance Benchmarks"
Xilinx: appendix of the MicroBlaze Processor Reference Guide

Not sure what these are about. They certainly don't compare third party implementations of their architectures.

Rick C.

--- Get 6 months of free supercharging
--- Tesla referral code - https://ts.la/richard11209
 
Am 31.01.19 um 01:42 schrieb jim.brakefield@ieee.org:
On Wednesday, January 30, 2019 at 11:14:26 AM UTC-6, gnuarm.del...@gmail.com wrote:
On Wednesday, January 30, 2019 at 11:24:17 AM UTC-5, kkoorndyk wrote:


]>Microblaze is proprietary. I believe there may be some open source versions available, but I expect there are open source versions of the NIOS available as well.

Microblaze clones: aeMB, an-noc-mpsoc, mblite, mb-lite-plus, myblaze, openfire_core, openfire2, secretblaze

Proprietary maybe; when the re-implementation is clean, it's OK.
You might also have to re-implement the assembler & C-compiler
for license reasons.

I once changed the register implementation of a PicoBlaze.
That was not too hard. Its VHDL representation is compiled
with the rest of your FPGA circuit.

The problem was, we used PicoBlazes in an ORIGINAL dinosaur Virtex
in a space application, and we had to scrub the configuration
memory every minute or so. That means reloading the configuration
memory to fight the accumulation of bad bits due to radiation etc.
It works just like booting the FPGA at powerup - and killing this
process one clock before the global reset happens!

The icing on the cake was that the reload circuitry was in the FPGA
itself. That's much like exchanging the carpet under your feet.

I have written a nice package of triple-module-redundant standard logic
vectors for that, and for other sensitive processing.

tmr_sl and tmr_slv could be used almost like std_logic and the
peculiarities were carefully hidden, like preventing ISE from
proudly optimizing the redundancy away. The Xilinx TMR tool was
unavailable for European space projects because of ITAR. :-(
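
A minimal sketch of the idea in modern VHDL - the names (tmr_slv,
tmr_vote) and the fixed width are illustrative guesses, not the
original package; in practice each copy also needs a synthesis
"keep"-style attribute so the tool cannot merge the triplicated
signals away:

library ieee;
use ieee.std_logic_1164.all;

package tmr_pkg is
  -- Three redundant copies of one logical 32-bit vector.
  type tmr_slv is record
    a, b, c : std_logic_vector(31 downto 0);
  end record;
  function tmr_vote (s : tmr_slv) return std_logic_vector;
end package tmr_pkg;

package body tmr_pkg is
  -- Bitwise 2-of-3 majority vote: a single upset copy is outvoted.
  function tmr_vote (s : tmr_slv) return std_logic_vector is
  begin
    return (s.a and s.b) or (s.a and s.c) or (s.b and s.c);
  end function;
end package body tmr_pkg;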

(Maybe I should do an open source re-implementation in modern VHDL
as a WARM THANK YOU. I know now how to make it even better and we
could make tamagotchis for the children of Fukushima.)

But I digress. The reason for the PicoBlaze modification was
that PicoBlaze uses CLB RAMs for its registers, and these are
really snippets of the configuration RAM. So, during each
scrubbing of the configuration, the CPU forgets its register contents.

Replacing the RAMs with arrays of flip-flops increased the
resource consumption but it was _not_ much slower.
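
A sketch of the flip-flop variant under assumed signal names
(PicoBlaze has sixteen 8-bit registers); depending on the tool, a
constraint such as XST's RAM_EXTRACT = no may also be needed so the
array is not pulled back into LUT RAM:

-- Assumes ieee.std_logic_1164 and ieee.numeric_std are in scope;
-- clk, reg_write, wr_addr, rd_addr, wr_data, rd_data are made-up names.
type reg_array_t is array (0 to 15) of std_logic_vector(7 downto 0);
signal regs : reg_array_t;   -- sixteen 8-bit registers as flip-flops
...
process (clk)
begin
  if rising_edge(clk) then
    if reg_write = '1' then
      regs(to_integer(unsigned(wr_addr))) <= wr_data;
    end if;
  end if;
end process;

rd_data <= regs(to_integer(unsigned(rd_addr)));  -- asynchronous read mux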

best regards,
Gerhard


Hoffmann Consulting: ANALOG, RF and DSP Design.
 
On Friday, January 25, 2019 at 5:16:04 PM UTC+2, Stef wrote:
Hi,

We got an old design with an Altera Cyclone FPGA (EP1C12F324).
These are probably obsolete (Can't find any info on them on the Intel
site, Farnell is out of stock, etc.). Currently active are the Cyclone-IV
and Cyclone-V if I understood correctly.

Is a design from a Cyclone portable to a Cyclone-IV/V? What kind of
changes should I expect to code and board? Design includes NIOS.

Or alternatively, are there sources for these old Cyclone chips?
(We actually would need 3 different types :-( )


--
Stef (remove caps, dashes and .invalid from e-mail address to reply by mail)

There is never time to do it right, but always time to do it over.

As you probably found out yourself, the least painful and most cost-effective migration path is to Cyclone 10LP. Despite the name 10, it is a relatively old family (60 nm) that is less likely than newer chips to have problems with 3/3.3V external I/O.
MAX 10 is very cheap at 2 KLUTs. If your design is bigger than that then Cy10LP would be cheaper.

For relatively big volumes consider Lattice Mach. Their list price is no good, but volume discounts are fantastic. But be ready for a much higher level of pain during development than what you are probably accustomed to with Cyclone.
 
On Thursday, January 31, 2019 at 2:42:54 AM UTC+2, jim.bra...@ieee.org wrote:
On Wednesday, January 30, 2019 at 11:14:26 AM UTC-6, gnuarm.del...@gmail.com wrote:
On Wednesday, January 30, 2019 at 11:24:17 AM UTC-5, kkoorndyk wrote:
On Tuesday, January 29, 2019 at 7:57:05 PM UTC-5, gnuarm.del...@gmail.com wrote:
On Monday, January 28, 2019 at 10:49:32 AM UTC-5, kkoorndyk wrote:
You may as well take the opportunity to "future proof" the design by migrating to another vendor that isn't likely to get acquired or axed. Xilinx has the single core Zynq-7000 devices if you want to go with a more main-stream, ARM processor sub-system (although likely overkill for whatever your Nios is doing). Otherwise, the Artix-7 and Spartan-7 would be good targets if you want to migrate to a Microblaze or some other soft core. The Spartan-7 family is essentially the Artix-7 fabric with the transceivers removed and is offered in 6K to 100K logic cell densities.

I don't think you actually got my point. Moving to a Spartan by using a MicroBlaze processor isn't "future proofing" anything. It is just shifting from one brand to another with the exact same problems.

If you want to future proof a soft CPU design you need to drop any FPGA company in-house processor and use an open source processor design. Then you can use any FPGA you wish.

Here is some info on the J1, an open source processor that was used to replace a microblaze when it became unequal to the task at hand.

http://www.forth.org/svfig/kk/11-2010-Bowman.pdf

http://www.excamera.com/sphinx/fpga-j1.html

http://www.excamera.com/files/j1.pdf


Rick C.

-- Get 6 months of free supercharging
-- Tesla referral code - https://ts.la/richard11209

No, I got your point perfectly, hence the following part of my recommendation: "or some other soft core."

I am making the point that porting from one proprietary processor to another is of limited value. Microblaze is proprietary. I believe there may be some open source versions available, but I expect there are open source versions of the NIOS available as well. But perhaps more importantly, they are far from optimal. That's why I posted the info on the J1 processor. It was invented to replace a Microblaze that wasn't up to the task.


If the original Nios was employed, I'm not entirely convinced a soft core is necessary (yet). How simple is the software running on it? Can it reasonably be ported to HDL, thus ensuring portability? I tend to lean that way unless the SW was simple due to capability limitations in the earlier technologies (e.g., old Cyclone and Nios) and the desire is to add more features that are realizable with new generation devices and soft (or hard) core capabilities.

Sometimes soft CPUs are added to reduce the size of logic. Other times they are added because of the complexity of expression. Regardless of how simply we can write HDL, a large part of the engineering world perceives HDL as much more complex than other languages and is not willing to port code to an HDL unless absolutely required. So if the code is currently in C, it won't get ported to HDL without a compelling reason.

Personally I think Xilinx and Altera are responsible for the present perception that FPGAs are difficult to use, expensive, large and power hungry. That is largely true if you use their products only. Lattice has been addressing a newer market with small, low power, inexpensive devices intended for the mobile market. Now if someone would approach the issue of ease of use by something more than throwing an IDE on top of their command line tools, the FPGA market can explode into territory presently dominated by MCUs.

Does anyone really think toasters can only be controlled by MCUs? We just need a cheap enough FPGA in a suitable package.


Rick C.

+- Get 6 months of free supercharging
+- Tesla referral code - https://ts.la/richard11209

]>Microblaze is proprietary. I believe there may be some open source versions available, but I expect there are open source versions of the NIOS available as well.

Microblaze clones: aeMB, an-noc-mpsoc, mblite, mb-lite-plus, myblaze, openfire_core, openfire2, secretblaze

No NIOS clones that I know of

I am playing with one right now.
Already have a half-dozen working variants, each with its own advantage/disadvantage in terms of resource usage (LEs vs M9K) and Fmax. The smallest one is still not as small as Altera's Nios2e and the fastest one is still not as fast as Altera's Nios2f. Beating Nios2e on size is in my [near] future plans; beating Altera's Nios2f on speed and features is of lesser priority.

My cores are less full-featured than even Nios2e. They are intended for one certain niche that I would call a "soft MCU". In particular, the only supported program memory is what Altera calls "tightly coupled memory", i.e. embedded dual-ported SRAM blocks with no other master connected. Other limitations are the absence of exceptions and external interrupts. For me that's o.k.; that's how I use Nios2e anyway.
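
For illustration, this is roughly what such a tightly coupled program memory looks like when inferred (signal names are made up; whether your tool maps this exact style onto M9Ks is worth checking against its templates):

-- Assumes ieee.numeric_std; one port fetches instructions, the other
-- serves the CPU's data master - and nothing else is connected.
type mem_t is array (0 to 4095) of std_logic_vector(31 downto 0);
signal mem : mem_t;
...
process (clk)
begin
  if rising_edge(clk) then
    instr <= mem(to_integer(unsigned(pc)));          -- port A: fetch
    if d_we = '1' then
      mem(to_integer(unsigned(d_addr))) <= d_wdata;  -- port B: store
    end if;
    d_rdata <= mem(to_integer(unsigned(d_addr)));    -- port B: load
  end if;
end process;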

I didn't check if what I am doing is legal.
Probably does not matter as long as it's just a repo on github.


>But perhaps more importantly, they are far from optimal.
Ugh, they have some of the best figure-of-merit numbers available.
(Instructions per second per LUT)
And are available in many configuration options.

There are a large variety of RISC-V cores available some of which have low LUT counts.

The fixed-instruction-width 32-bit subset of the RISC-V ISA is nearly identical to Nios2, down to the level of instruction formats. The biggest difference is the 12-bit immediate in RV vs 16-bit in N2. Not a big deal.
So I expect that RV32 cores available in source form can be modified to run Nios2 code in a few days (or, if the original designer is involved, in a few hours).

The bigger difference would be the external interface. In N2 one expects Avalon-MM. I have no idea what the standard bus/fabric is in the world of RV soft cores and how similar it is to AVM.
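
To make the similarity concrete, a sketch of the immediate decode for both (field positions as I remember them from the two ISA manuals, worth double-checking there; the arithmetic forms of both sign-extend):

-- Assumes ieee.numeric_std; instr is the 32-bit instruction word.
-- RV32 I-type: imm[11:0] sits in instr(31 downto 20).
imm_rv32  <= std_logic_vector(resize(signed(instr(31 downto 20)), 32));
-- Nios2 I-type: IMM16 sits in instr(21 downto 6).
imm_nios2 <= std_logic_vector(resize(signed(instr(21 downto 6)), 32));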


Jim Brakefield
 
On Wednesday, February 6, 2019 at 6:54:27 AM UTC-5, already...@yahoo.com wrote:
On Thursday, January 31, 2019 at 2:42:54 AM UTC+2, jim.bra...@ieee.org wrote:
On Wednesday, January 30, 2019 at 11:14:26 AM UTC-6, gnuarm.del...@gmail.com wrote:
On Wednesday, January 30, 2019 at 11:24:17 AM UTC-5, kkoorndyk wrote:
On Tuesday, January 29, 2019 at 7:57:05 PM UTC-5, gnuarm.del...@gmail.com wrote:
On Monday, January 28, 2019 at 10:49:32 AM UTC-5, kkoorndyk wrote:
You may as well take the opportunity to "future proof" the design by migrating to another vendor that isn't likely to get acquired or axed. Xilinx has the single core Zynq-7000 devices if you want to go with a more main-stream, ARM processor sub-system (although likely overkill for whatever your Nios is doing). Otherwise, the Artix-7 and Spartan-7 would be good targets if you want to migrate to a Microblaze or some other soft core. The Spartan-7 family is essentially the Artix-7 fabric with the transceivers removed and is offered in 6K to 100K logic cell densities.

I don't think you actually got my point. Moving to a Spartan by using a MicroBlaze processor isn't "future proofing" anything. It is just shifting from one brand to another with the exact same problems.

If you want to future proof a soft CPU design you need to drop any FPGA company in-house processor and use an open source processor design. Then you can use any FPGA you wish.

Here is some info on the J1, an open source processor that was used to replace a microblaze when it became unequal to the task at hand.

http://www.forth.org/svfig/kk/11-2010-Bowman.pdf

http://www.excamera.com/sphinx/fpga-j1.html

http://www.excamera.com/files/j1.pdf


Rick C.

-- Get 6 months of free supercharging
-- Tesla referral code - https://ts.la/richard11209

No, I got your point perfectly, hence the following part of my recommendation: "or some other soft core."

I am making the point that porting from one proprietary processor to another is of limited value. Microblaze is proprietary. I believe there may be some open source versions available, but I expect there are open source versions of the NIOS available as well. But perhaps more importantly, they are far from optimal. That's why I posted the info on the J1 processor. It was invented to replace a Microblaze that wasn't up to the task.


If the original Nios was employed, I'm not entirely convinced a soft core is necessary (yet). How simple is the software running on it? Can it reasonably be ported to HDL, thus ensuring portability? I tend to lean that way unless the SW was simple due to capability limitations in the earlier technologies (e.g., old Cyclone and Nios) and the desire is to add more features that are realizable with new generation devices and soft (or hard) core capabilities.

Sometimes soft CPUs are added to reduce the size of logic. Other times they are added because of the complexity of expression. Regardless of how simply we can write HDL, a large part of the engineering world perceives HDL as much more complex than other languages and is not willing to port code to an HDL unless absolutely required. So if the code is currently in C, it won't get ported to HDL without a compelling reason.

Personally I think Xilinx and Altera are responsible for the present perception that FPGAs are difficult to use, expensive, large and power hungry. That is largely true if you use their products only. Lattice has been addressing a newer market with small, low power, inexpensive devices intended for the mobile market. Now if someone would approach the issue of ease of use by something more than throwing an IDE on top of their command line tools, the FPGA market can explode into territory presently dominated by MCUs.

Does anyone really think toasters can only be controlled by MCUs? We just need a cheap enough FPGA in a suitable package.


Rick C.

+- Get 6 months of free supercharging
+- Tesla referral code - https://ts.la/richard11209

]>Microblaze is proprietary. I believe there may be some open source versions available, but I expect there are open source versions of the NIOS available as well.

Microblaze clones: aeMB, an-noc-mpsoc, mblite, mb-lite-plus, myblaze, openfire_core, openfire2, secretblaze

No NIOS clones that I know of


I am playing with one right now.
Already have a half-dozen working variants, each with its own advantage/disadvantage in terms of resource usage (LEs vs M9K) and Fmax. The smallest one is still not as small as Altera's Nios2e and the fastest one is still not as fast as Altera's Nios2f. Beating Nios2e on size is in my [near] future plans; beating Altera's Nios2f on speed and features is of lesser priority.

My cores are less full-featured than even Nios2e. They are intended for one certain niche that I would call a "soft MCU". In particular, the only supported program memory is what Altera calls "tightly coupled memory", i.e. embedded dual-ported SRAM blocks with no other master connected. Other limitations are the absence of exceptions and external interrupts. For me that's o.k.; that's how I use Nios2e anyway.

I didn't check if what I am doing is legal.
Probably does not matter as long as it's just a repo on github.


]>But perhaps more importantly, they are far from optimal.
Ugh, they have some of the best figure-of-merit numbers available.
(Instructions per second per LUT)
And are available in many configuration options.

There are a large variety of RISC-V cores available some of which have low LUT counts.

The fixed-instruction-width 32-bit subset of the RISC-V ISA is nearly identical to Nios2, down to the level of instruction formats. The biggest difference is the 12-bit immediate in RV vs 16-bit in N2. Not a big deal.
So I expect that RV32 cores available in source form can be modified to run Nios2 code in a few days (or, if the original designer is involved, in a few hours).

The bigger difference would be the external interface. In N2 one expects Avalon-MM. I have no idea what the standard bus/fabric is in the world of RV soft cores and how similar it is to AVM.

Should I assume you are not using C to program these CPUs?

If that is correct, have you considered a stack based CPU? When you refer to CPUs like the RISC-V I'm thinking they use thousands of LUT4s. Many stack based CPUs can be implemented in 1k LUT4s or less. They can run fast, >100 MHz and typically are not pipelined.

There is a lot of interest in stack CPUs in the Forth community since typically their assembly language is similar to the Forth virtual machine.

I'm not familiar with Avalon and I don't know what N2 is. A popular bus in the FPGA embedded world is Wishbone.



Rick C.

--+ Tesla referral code - https://ts.la/richard11209
 
On Thursday, February 7, 2019 at 8:36:36 PM UTC+2, gnuarm.del...@gmail.com wrote:
On Wednesday, February 6, 2019 at 6:54:27 AM UTC-5, already...@yahoo.com wrote:
On Thursday, January 31, 2019 at 2:42:54 AM UTC+2, jim.bra...@ieee.org wrote:
On Wednesday, January 30, 2019 at 11:14:26 AM UTC-6, gnuarm.del...@gmail.com wrote:
On Wednesday, January 30, 2019 at 11:24:17 AM UTC-5, kkoorndyk wrote:
On Tuesday, January 29, 2019 at 7:57:05 PM UTC-5, gnuarm.del...@gmail.com wrote:
On Monday, January 28, 2019 at 10:49:32 AM UTC-5, kkoorndyk wrote:
You may as well take the opportunity to "future proof" the design by migrating to another vendor that isn't likely to get acquired or axed. Xilinx has the single core Zynq-7000 devices if you want to go with a more main-stream, ARM processor sub-system (although likely overkill for whatever your Nios is doing). Otherwise, the Artix-7 and Spartan-7 would be good targets if you want to migrate to a Microblaze or some other soft core. The Spartan-7 family is essentially the Artix-7 fabric with the transceivers removed and is offered in 6K to 100K logic cell densities.

I don't think you actually got my point. Moving to a Spartan by using a MicroBlaze processor isn't "future proofing" anything. It is just shifting from one brand to another with the exact same problems.

If you want to future proof a soft CPU design you need to drop any FPGA company in-house processor and use an open source processor design. Then you can use any FPGA you wish.

Here is some info on the J1, an open source processor that was used to replace a microblaze when it became unequal to the task at hand.

http://www.forth.org/svfig/kk/11-2010-Bowman.pdf

http://www.excamera.com/sphinx/fpga-j1.html

http://www.excamera.com/files/j1.pdf


Rick C.

-- Get 6 months of free supercharging
-- Tesla referral code - https://ts.la/richard11209

No, I got your point perfectly, hence the following part of my recommendation: "or some other soft core."

I am making the point that porting from one proprietary processor to another is of limited value. Microblaze is proprietary. I believe there may be some open source versions available, but I expect there are open source versions of the NIOS available as well. But perhaps more importantly, they are far from optimal. That's why I posted the info on the J1 processor. It was invented to replace a Microblaze that wasn't up to the task.


If the original Nios was employed, I'm not entirely convinced a soft core is necessary (yet). How simple is the software running on it? Can it reasonably be ported to HDL, thus ensuring portability? I tend to lean that way unless the SW was simple due to capability limitations in the earlier technologies (e.g., old Cyclone and Nios) and the desire is to add more features that are realizable with new generation devices and soft (or hard) core capabilities.

Sometimes soft CPUs are added to reduce the size of logic. Other times they are added because of the complexity of expression. Regardless of how simply we can write HDL, a large part of the engineering world perceives HDL as much more complex than other languages and is not willing to port code to an HDL unless absolutely required. So if the code is currently in C, it won't get ported to HDL without a compelling reason.

Personally I think Xilinx and Altera are responsible for the present perception that FPGAs are difficult to use, expensive, large and power hungry. That is largely true if you use their products only. Lattice has been addressing a newer market with small, low power, inexpensive devices intended for the mobile market. Now if someone would approach the issue of ease of use by something more than throwing an IDE on top of their command line tools, the FPGA market can explode into territory presently dominated by MCUs.

Does anyone really think toasters can only be controlled by MCUs? We just need a cheap enough FPGA in a suitable package.


Rick C.

+- Get 6 months of free supercharging
+- Tesla referral code - https://ts.la/richard11209

]>Microblaze is proprietary. I believe there may be some open source versions available, but I expect there are open source versions of the NIOS available as well.

Microblaze clones: aeMB, an-noc-mpsoc, mblite, mb-lite-plus, myblaze, openfire_core, openfire2, secretblaze

No NIOS clones that I know of


I am playing with one right now.
Already have a half-dozen working variants, each with its own advantage/disadvantage in terms of resource usage (LEs vs M9K) and Fmax. The smallest one is still not as small as Altera's Nios2e and the fastest one is still not as fast as Altera's Nios2f. Beating Nios2e on size is in my [near] future plans; beating Altera's Nios2f on speed and features is of lesser priority.

My cores are less full-featured than even Nios2e. They are intended for one certain niche that I would call a "soft MCU". In particular, the only supported program memory is what Altera calls "tightly coupled memory", i.e. embedded dual-ported SRAM blocks with no other master connected. Other limitations are the absence of exceptions and external interrupts. For me that's o.k.; that's how I use Nios2e anyway.

I didn't check if what I am doing is legal.
Probably does not matter as long as it's just a repo on github.


]>But perhaps more importantly, they are far from optimal.
Ugh, they have some of the best figure-of-merit numbers available.
(Instructions per second per LUT)
And are available in many configuration options.

There are a large variety of RISC-V cores available some of which have low LUT counts.

The fixed-instruction-width 32-bit subset of the RISC-V ISA is nearly identical to Nios2, down to the level of instruction formats. The biggest difference is the 12-bit immediate in RV vs 16-bit in N2. Not a big deal.
So I expect that RV32 cores available in source form can be modified to run Nios2 code in a few days (or, if the original designer is involved, in a few hours).

The bigger difference would be the external interface. In N2 one expects Avalon-MM. I have no idea what the standard bus/fabric is in the world of RV soft cores and how similar it is to AVM.

Should I assume you are not using C to program these CPUs?

That would be a wrong assumption.
The exact opposite is far closer to reality - I pretty much never use anything but C to program these CPUs.

> If that is correct, have you considered a stack based CPU? When you refer to CPUs like the RISC-V I'm thinking they use thousands of LUT4s.

It depends on the performance one is looking for.
2-2.5 KLUT4s (+ a few embedded memory blocks and multipliers) is the size of a fully pipelined single-issue CPU with direct-mapped instruction and data caches, a multiplier and a divider, that runs at very decent Fmax but features no MMU or MPU.
On the other end of the spectrum you find the winners of the RISC-V core size competition - under 400 LUTs, but (I would guess, didn't check it) glacially slow in terms of CPI. But Fmax is still decent.

My half-dozen Nios2 cores are in the middle - 700 to 850 LUT4s, CPI ranging from (approximately) 2.1 to 4.7 and Fmax ranging from reasonable to impractically high.
But my main goal was (is) a learning experience rather than practicality. In particular, for the majority of variants I set myself the impractical constraint of implementing the register file in a single embedded memory block. Doing it in two blocks is far more practical, but less challenging. The same goes for aiming at very high Fmax - not practical, but fun.
Maybe, after I explore another half-dozen or dozen of fun possibilities, I will settle on building the most practical solutions. But it's no less probable that I'll lose interest and/or focus before that. I am not too passionate about the whole thing.


Many stack based CPUs can be implemented in 1k LUT4s or less. They can run fast, >100 MHz and typically are not pipelined.

There is a lot of interest in stack CPUs in the Forth community since typically their assembly language is similar to the Forth virtual machine.

I'm not familiar with Avalon and I don't know what N2 is.

N2 is my shorthand for Nios2.

A popular bus in the FPGA embedded world is Wishbone.

I noticed that Wishbone is popular in Lattice circles. But the Altera world is many times bigger than the Lattice one, and here Avalon is king. Also, when performance matters, Avalon is much better technically.


Rick C.

--+ Tesla referral code - https://ts.la/richard11209
 
On Thursday, February 7, 2019 at 4:00:57 PM UTC-5, already...@yahoo.com wrote:
On Thursday, February 7, 2019 at 8:36:36 PM UTC+2, gnuarm.del...@gmail.com wrote:

Should I assume you are not using C to program these CPUs?


That would be a wrong assumption.
The exact opposite is far closer to reality - I pretty much never use anything but C to program these CPUs.

If that is correct, have you considered a stack based CPU? When you refer to CPUs like the RISC-V I'm thinking they use thousands of LUT4s.

It depends on the performance one is looking for.
2-2.5 KLUT4s (+ a few embedded memory blocks and multipliers) is the size of a fully pipelined single-issue CPU with direct-mapped instruction and data caches, a multiplier and a divider, that runs at very decent Fmax but features no MMU or MPU.
On the other end of the spectrum you find the winners of the RISC-V core size competition - under 400 LUTs, but (I would guess, didn't check it) glacially slow in terms of CPI. But Fmax is still decent.

My half-dozen Nios2 cores are in the middle - 700 to 850 LUT4s, CPI ranging from (approximately) 2.1 to 4.7 and Fmax ranging from reasonable to impractically high.
But my main goal was (is) a learning experience rather than practicality. In particular, for the majority of variants I set myself the impractical constraint of implementing the register file in a single embedded memory block. Doing it in two blocks is far more practical, but less challenging. The same goes for aiming at very high Fmax - not practical, but fun.
Maybe, after I explore another half-dozen or dozen of fun possibilities, I will settle on building the most practical solutions. But it's no less probable that I'll lose interest and/or focus before that. I am not too passionate about the whole thing.

Ok, if you are doing C in FPGA CPUs then you are in a different world than the stuff I've worked on. My projects use a CPU as a controller and often have very critical real time requirements. While C doesn't prevent that, I prefer to just code in assembly language and more importantly, use a CPU design that provides single cycle execution of all instructions. That's why I like stack processors, they are easy to design, use a very simple instruction set and the assembly language can be very close to the Forth high level language.


Many stack based CPUs can be implemented in 1k LUT4s or less. They can run fast, >100 MHz and typically are not pipelined.

There is a lot of interest in stack CPUs in the Forth community since typically their assembly language is similar to the Forth virtual machine.

I'm not familiar with Avalon and I don't know what N2 is.

N2 is my shorthand for Nios2.

A popular bus in the FPGA embedded world is Wishbone.


I noticed that Wishbone is popular in Lattice circles. But the Altera world is many times bigger than the Lattice one, and here Avalon is king. Also, when performance matters, Avalon is much better technically.

I'm not familiar with what bus is preferred where. I just know that every project I've looked at on OpenCores using a standard bus used Wishbone. If you say Avalon is better, ok. Is it open source? Can it be used on other than Intel products?


Rick C.

-+- Tesla referral code - https://ts.la/richard11209
 
On Thursday, February 7, 2019 at 11:43:39 PM UTC+2, gnuarm.del...@gmail.com wrote:
On Thursday, February 7, 2019 at 4:00:57 PM UTC-5, already...@yahoo.com wrote:
On Thursday, February 7, 2019 at 8:36:36 PM UTC+2, gnuarm.del...@gmail.com wrote:


A popular bus in the FPGA embedded world is Wishbone.


I noticed that Wishbone is popular in Lattice circles. But the Altera world is many times bigger than the Lattice one, and here Avalon is king. Also, when performance matters, Avalon is much better technically.

I'm not familiar with what bus is preferred where. I just know that every project I've looked at on OpenCores using a standard bus used Wishbone. If you say Avalon is better, ok. Is it open source? Can it be used on other than Intel products?

I am not sure what "open source" means in this context.
Avalon-MM and Avalon-ST are specifications, i.e. documents. The documents are freely downloadable from the Altera/Intel web site.

https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/manual/mnl_avalon_spec.pdf

The GUI tool that connects together components conforming to the Avalon specs - called SOPC Builder in the 00s, then QSYS, and now something like Intel Platform Designer - is a proprietary closed-source program.
The code that the tool generates is normal VHDL or, more often, normal Verilog, containing a copyright statement like this:
// -----------------------------------------------------------
// Legal Notice: (C)2007 Altera Corporation. All rights reserved. Your
// use of Altera Corporation's design tools, logic functions and other
// software and tools, and its AMPP partner logic functions, and any
// output files any of the foregoing (including device programming or
// simulation files), and any associated documentation or information are
// expressly subject to the terms and conditions of the Altera Program
// License Subscription Agreement or other applicable license agreement,
// including, without limitation, that your use is for the sole purpose
// of programming logic devices manufactured by Altera and sold by Altera
// or its authorized distributors. Please refer to the applicable
// agreement for further details.

So, you can't legally use QSYS-generated code with non-Intel devices.
But (IANAL) nobody prevents you from writing your own interconnect generation tool. Or from not using any CAD tool at all and just connecting components manually within your HDL. Isn't that mostly what you do with Wishbone components, anyway?
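
For example, a minimal hand-written Avalon-MM slave - one 32-bit register, zero wait states, registered readdata (read latency 1) - using only the signal roles from the spec linked above; the entity and port names here are my own, and nothing is tool-generated:

library ieee;
use ieee.std_logic_1164.all;

entity avmm_reg is
  port (
    clk           : in  std_logic;
    reset         : in  std_logic;
    avs_write     : in  std_logic;
    avs_writedata : in  std_logic_vector(31 downto 0);
    avs_readdata  : out std_logic_vector(31 downto 0)
  );
end entity;

architecture rtl of avmm_reg is
  signal reg0 : std_logic_vector(31 downto 0);
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        reg0 <= (others => '0');
      elsif avs_write = '1' then
        reg0 <= avs_writedata;
      end if;
      avs_readdata <= reg0;  -- registered: fixed read latency of 1
    end if;
  end process;
end architecture;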

Rick C.

-+- Tesla referral code - https://ts.la/richard11209
 
On Thursday, February 7, 2019 at 5:11:20 PM UTC-5, already...@yahoo.com wrote:
On Thursday, February 7, 2019 at 11:43:39 PM UTC+2, gnuarm.del...@gmail.com wrote:
On Thursday, February 7, 2019 at 4:00:57 PM UTC-5, already...@yahoo.com wrote:
On Thursday, February 7, 2019 at 8:36:36 PM UTC+2, gnuarm.del...@gmail.com wrote:


A popular bus in the FPGA embedded world is Wishbone.


I noticed that Wishbone is popular in Lattice circles. But the Altera world is many times bigger than the Lattice one, and here Avalon is king. Also, when performance matters, Avalon is much better technically.

I'm not familiar with what bus is preferred where. I just know that every project I've looked at on OpenCores using a standard bus used Wishbone. If you say Avalon is better, ok. Is it open source? Can it be used on other than Intel products?


I am not sure what "open source" means in this context.
Avalon-MM and Avalon-ST are specifications, i.e. documents. The documents are freely downloadable from the Altera/Intel web site.

https://www.intel.com/content/dam/www/programmable/us/en/pdfs/literature/manual/mnl_avalon_spec.pdf

The GUI tool that connects together components conforming to the Avalon specs - called SOPC Builder in the 00s, then QSYS, and now something like Intel Platform Designer - is a proprietary closed-source program.
The code that the tool generates is normal VHDL or, more often, normal Verilog, containing a copyright statement like this:
// -----------------------------------------------------------
// Legal Notice: (C)2007 Altera Corporation. All rights reserved. Your
// use of Altera Corporation's design tools, logic functions and other
// software and tools, and its AMPP partner logic functions, and any
// output files any of the foregoing (including device programming or
// simulation files), and any associated documentation or information are
// expressly subject to the terms and conditions of the Altera Program
// License Subscription Agreement or other applicable license agreement,
// including, without limitation, that your use is for the sole purpose
// of programming logic devices manufactured by Altera and sold by Altera
// or its authorized distributors. Please refer to the applicable
// agreement for further details.

So, you can't legally use QSYS-generated code with non-Intel devices.
But (IANAL) nobody prevents you from writing your own interconnect generation tool. Or from not using any CAD tool at all and just connecting components manually within your HDL. Isn't that mostly what you do with Wishbone components, anyway?

Sorry, I don't know what you are referring to. But my concern with the bus is that it is entirely possible and not at all uncommon for such a design to have aspects which are under license. Some time ago it was ruled that the Z80 did not infringe on Intel's 8080 design, but the mnemonics were copyrighted so Zilog had to develop their own assembler syntax. ARM decided to protect their CPU design with a patent on some aspect of interrupt handling if I recall correctly. So while there are equivalent CPUs on the market (RISC-V for example), there are no ARM clones even though all the ARM architecture documents are freely available.

The point is I don't know if this Altera bus is protected in some way or not. That's why I was asking. IANAL either.

I think the term open source is pretty clear in all contexts. Lattice has their own CPU designs for use in FPGAs. The difference is they don't care if you use them in a Xilinx chip.


Rick C.

-++ Tesla referral code - https://ts.la/richard11209
 
On 07/02/2019 22:43, gnuarm.deletethisbit@gmail.com wrote:
On Thursday, February 7, 2019 at 4:00:57 PM UTC-5,
already...@yahoo.com wrote:
On Thursday, February 7, 2019 at 8:36:36 PM UTC+2,
gnuarm.del...@gmail.com wrote:

I'm not familiar with Avalon and I don't know what N2 is.

N2 is my shorthand for Nios2.

A popular bus in the FPGA embedded world is Wishbone.


I noticed that Wishbone is popular in Lattice circles. But the
Altera world is many times bigger than the Lattice one, and here
Avalon is king. Also, when performance matters, Avalon is much
better technically.

I'm not familiar with what bus is preferred where. I just know that
every project I've looked at on OpenCores using a standard bus used
Wishbone. If you say Avalon is better, ok. Is it open source? Can
it be used on other than Intel products?

I have no idea of the legal aspects of Avalon (I only ever used it on
Altera devices, long ago). But technically it is very similar to
Wishbone for many common uses. Things always get complicated when you
need priorities, bursts, variable wait states, etc., but for simpler and
static connections, I don't remember it as being difficult to mix them.
(It was many years ago when I did this, however.)

<https://en.wikipedia.org/wiki/Wishbone_(computer_bus)#Comparisons>
 
On Thursday, February 7, 2019 at 11:43:39 PM UTC+2, gnuarm.del...@gmail.com wrote:
Ok, if you are doing C in FPGA CPUs then you are in a different world than the stuff I've worked on. My projects use a CPU as a controller and often have very critical real time requirements. While C doesn't prevent that, I prefer to just code in assembly language and more importantly, use a CPU design that provides single cycle execution of all instructions. That's why I like stack processors, they are easy to design, use a very simple instruction set and the assembly language can be very close to the Forth high level language.

Can you quantify the criticality of your real-time requirements?

Also, even for the most critical requirements, what's wrong with multiple cycles per instruction as long as the # of cycles is known up front?
Things like caches and branch predictors indeed cause variability (which by itself is o.k. for 99.9% of uses), but that's orthogonal to the # of cycles per instruction.

Many stack based CPUs can be implemented in 1k LUT4s or less. They can run fast, >100 MHz and typically are not pipelined.

1 cycle per instruction not pipelined means that the stack cannot be implemented
in memory block(s), which, in combination with 1K LUT4s, means that either the stack is very shallow or it is not wide (i.e. 16 bits rather than 32 bits). Either of these means that you need many more instructions (relative to a 32-bit RISC with 32 or 16 registers) to complete the job.

Also 1 cycle per instruction necessitates either strict Harvard memories or true dual-ported memories.

And even with all those conditions in place, non-pipelined conditional branches at 100 MHz sound hard. Not impossible if your FPGA is very fast, like a top-speed Arria-10, where you can instantiate Nios2e at 380 MHz and a full-featured Nios2f at 300 MHz+. But it does look impossible in low-speed-grade budget parts, like the slowest speed grades of Cyclone4E/10LP or even of Cyclone5. And I suppose that the Lattice Mach series is somewhat slower than even those.
The only way that I can see non-pipelined conditional branches working at 100 MHz in low-end devices is if your architecture has a branch delay slot. But that by itself is a sort of pipelining, just pipelining exposed to SW instead of being done in HW.

Besides, my current hobby interest is in 500-700 LUT4s rather than in 1000+ LUT4s. If 1000 LUT4s are available then 1400 LUT4s are probably available too, so one can as well use the OTS Nios2f, which is pretty fast and validated to a level that hobbyists' cores can't even dream about.
 
On Thursday, February 14, 2019 at 5:07:53 AM UTC-5, already...@yahoo.com wrote:
On Thursday, February 7, 2019 at 11:43:39 PM UTC+2, gnuarm.del...@gmail.com wrote:

Ok, if you are doing C in FPGA CPUs then you are in a different world than the stuff I've worked on. My projects use a CPU as a controller and often have very critical real time requirements. While C doesn't prevent that, I prefer to just code in assembly language and more importantly, use a CPU design that provides single cycle execution of all instructions. That's why I like stack processors, they are easy to design, use a very simple instruction set and the assembly language can be very close to the Forth high level language.


Can you quantify the criticality of your real-time requirements?

Eh? Are you asking what my requirements are, or how important they are? Not sure how to answer that question. I can only say that my CPU designs give single cycle execution, so I can design with them the same way I design the hardware in VHDL.


> Also, even for the most critical requirements, what's wrong with multiple cycles per instruction as long as the # of cycles is known up front?

It increases interrupt latency which is not a problem if you aren't using interrupts, a common technique for such embedded processors. Otherwise multi-cycle instructions complicate the CPU instruction decoder. Using a short instruction format allows minimal decode logic. Adding a cycle counter increases the number of inputs to the instruction decode block and so complicates the logic significantly.


> Things like caches and branch predictors indeed cause variability (which by itself is o.k. for 99.9% of uses), but that's orthogonal to the # of cycles per instruction.

Cache, branch predictors??? You have that with 1 kLUT CPUs??? I think we design in very different worlds. My program storage is inside the FPGA and runs at the full speed of the CPU. The CPU is not pipelined (according to me, someone insisted that it was a 2 level pipeline, but with no pipeline delay, oh well) so no branch prediction needed.


Many stack based CPUs can be implemented in 1k LUT4s or less. They can run fast, >100 MHz and typically are not pipelined.

1 cycle per instruction not pipelined means that the stack cannot be implemented
in memory block(s), which, in combination with 1K LUT4s, means that either the stack is very shallow or it is not wide (i.e. 16 bits rather than 32 bits). Either of these means that you need many more instructions (relative to a 32-bit RISC with 32 or 16 registers) to complete the job.

Huh? So my block RAM stack is pipelined or are you saying I'm only imagining it runs in one clock cycle? Instructions are things like

ADD, CALL, SHRC (shift right with carry), FETCH (read memory), RET (return from call), RETI (return from interrupt). The interrupt pushes the return address to the return stack and the PSW to the data stack in one cycle with no latency so, like the other instructions, it is single cycle, again making using it like designing with registers in the HDL code.


> Also 1 cycle per instruction necessitates either strict Harvard memories or true dual-ported memories.

Or both. To get the block RAMs single cycle, the read and write happen on different phases of the main clock. I think read is on the falling edge while write is on the rising edge, like the rest of the logic. Instructions and data are in physically separate memories within the same address map, but there is no way to use either one as the other mechanically. Why would Harvard ever be a problem for an embedded CPU?
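
In VHDL the scheme looks roughly like this (signal names are illustrative; whether a given tool maps the inverted-clock port cleanly onto block RAM is worth verifying):

-- Assumes ieee.numeric_std; ram, we, waddr, raddr, wdata, rdata
-- are made-up names.
process (clk)   -- write port: rising edge, like the rest of the logic
begin
  if rising_edge(clk) then
    if we = '1' then
      ram(to_integer(unsigned(waddr))) <= wdata;
    end if;
  end if;
end process;

process (clk)   -- read port: falling edge, so the data arrives half a
begin           -- cycle later, within the same instruction
  if falling_edge(clk) then
    rdata <= ram(to_integer(unsigned(raddr)));
  end if;
end process;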


> And even with all those conditions in place, non-pipelined conditional branches at 100 MHz sound hard.

Not hard when the CPU is simple and designed to be easy to implement rather than designed to be like all the other CPUs with complicated functionality.


> Not impossible if your FPGA is very fast, like a top-speed Arria-10, where you can instantiate Nios2e at 380 MHz and a full-featured Nios2f at 300 MHz+. But it does look impossible in low-speed-grade budget parts, like the slowest speed grades of Cyclone4E/10LP or even of Cyclone5. And I suppose that the Lattice Mach series is somewhat slower than even those.

I only use the low grade parts. I haven't used NIOS and this processor won't get to 380 MHz, I'm pretty sure. Pipelining it would be counter to its design goals but might be practical; never thought about it.


> The only way that I can see non-pipelined conditional branches working at 100 MHz in low-end devices is if your architecture has a branch delay slot. But that by itself is a sort of pipelining, just pipelining exposed to SW instead of being done in HW.

Or the instruction is simple and runs fast.


> Besides, my current hobby interest is in 500-700 LUT4s rather than in 1000+ LUT4s. If 1000 LUT4s are available then 1400 LUT4s are probably available too, so one can as well use the OTS Nios2f, which is pretty fast and validated to a level that hobbyists' cores can't even dream about.

That's where my CPU lies; I think it was 600 LUT4s last time I checked.

Rick C.
 
On Thursday, February 14, 2019 at 1:24:40 PM UTC+2, gnuarm.del...@gmail.com wrote:
On Thursday, February 14, 2019 at 5:07:53 AM UTC-5, already...@yahoo.com wrote:
On Thursday, February 7, 2019 at 11:43:39 PM UTC+2, gnuarm.del...@gmail.com wrote:

Ok, if you are doing C in FPGA CPUs then you are in a different world than the stuff I've worked on. My projects use a CPU as a controller and often have very critical real time requirements. While C doesn't prevent that, I prefer to just code in assembly language and more importantly, use a CPU design that provides single cycle execution of all instructions. That's why I like stack processors, they are easy to design, use a very simple instruction set and the assembly language can be very close to the Forth high level language.


Can you quantify the criticality of your real-time requirements?

Eh? Are you asking what my requirements are, or how important they are?

How important they are. What happens if a particular instruction most of the time
takes n clocks, but sometimes, rarely, could take n+2 clocks? Are system-level requirements impacted?


Not sure how to answer that question. I can only say that my CPU designs give single cycle execution, so I can design with them the same way I design the hardware in VHDL.


Also, even for the most critical requirements, what's wrong with multiple cycles per instruction as long as the # of cycles is known up front?

It increases interrupt latency which is not a problem if you aren't using interrupts, a common technique for such embedded processors.

I don't like interrupts in small systems. Neither in MCUs nor in FPGAs.
In MCUs nowadays we have badass DMAs. In FPGAs we can build a badass DMA ourselves. Or throw multiple soft cores at multiple tasks. That's why I am interested in *small* soft cores in the first place.

> Otherwise multi-cycle instructions complicate the CPU instruction decoder.

I see no connection to the decoder. Maybe you mean the microsequencer?
Generally, I disagree. At least for very fast clock rates it is easier to design a non-pipelined or partially pipelined core where every instruction flows through several phases.

Or, maybe, you are thinking about variable-length instructions? That's, again, orthogonal to the number of clocks per instruction. Anyway, I think that variable-length instructions are very cool, but not for a 500-700 LUT4 budget. I would start to consider VLI at something like 1200 LUT4s.

Using a short instruction format allows minimal decode logic. Adding a cycle counter increases the number of inputs to the instruction decode block and so complicates the logic significantly.


Things like caches and branch predictors indeed cause variability (which by itself is o.k. for 99.9% of uses), but that's orthogonal to the # of cycles per instruction.

Cache, branch predictors??? You have that with 1 kLUT CPUs??? I think we design in very different worlds.

I don't *want* data caches in the sort of tasks that I do with these small cores. An instruction cache is something else. I am not against them in "hard" MCUs.
In the small soft cores that we are discussing right now they are impractical rather than evil.
But static branch prediction is something else. I can see how static branch prediction is practical in 700-800 LUT4s. I haven't implemented it in my half-dozen (in the meantime the # is growing). But it is practical, esp. for applications that spend most of the time in very short loops.

> My program storage is inside the FPGA and runs at the full speed of the CPU. The CPU is not pipelined (according to me, someone insisted that it was a 2 level pipeline, but with no pipeline delay,

I am starting to suspect that you have a very special definition of "not pipelined" that differs from the definition used in the literature.

oh well) so no branch prediction needed.


Many stack based CPUs can be implemented in 1k LUT4s or less. They can run fast, >100 MHz and typically are not pipelined.

1 cycle per instruction not pipelined means that the stack cannot be implemented
in memory block(s), which, in combination with 1K LUT4s, means that either the stack is very shallow or it is not wide (i.e. 16 bits rather than 32 bits). Either of these means that you need many more instructions (relative to a 32-bit RISC with 32 or 16 registers) to complete the job.

Huh? So my block RAM stack is pipelined or are you saying I'm only imagining it runs in one clock cycle? Instructions are things like

ADD, CALL, SHRC (shift right with carry), FETCH (read memory), RET (return from call), RETI (return from interrupt). The interrupt pushes the return address to the return stack and the PSW to the data stack in one cycle with no latency so, like the other instructions, it is single cycle, again making using it like designing with registers in the HDL code.


Also 1 cycle per instruction necessitates either strict Harvard memories or true dual-ported memories.

Or both. To get the block RAMs single cycle, the read and write happen on different phases of the main clock. I think read is on the falling edge while write is on the rising edge, like the rest of the logic. Instructions and data are in physically separate memories within the same address map, but there is no way to use either one as the other mechanically. Why would Harvard ever be a problem for an embedded CPU?

It's less of a problem when you are in full control of the software stack.
When you are not in full control, compilers sometimes like to place data, esp. jump tables implementing the HLL switch/case construct, in program memory.
Still, even with full control of the code generation tools, sometimes you want
an architecture consisting of tiny startup code that loads the bulk of the code from external memory, most commonly from SPI flash.
Another, less common possible reason is saving space by placing code and data in the same memory block, esp. when blocks are relatively big and there are few of them.

And even with all those conditions in place, non-pipelined conditional branches at 100 MHz sound hard.

Not hard when the CPU is simple and designed to be easy to implement rather than designed to be like all the other CPUs with complicated functionality.

It is certainly easier when branching is based on arithmetic flags rather than
on the contents of a register, as is the case in MIPS derivatives, including Nios2 and RISC-V. But it is still hard. You have to wait for the instruction to arrive from memory, decode the instruction, do logical operations on flags and select between two alternatives based on the result of the logical operation, all in one cycle.
If the branch is PC-relative, which is the case in nearly all popular 32-bit architectures, you also have to do an address addition, all in the same cycle.

But even if it's somehow doable for PC-relative branches, I don't see how, assuming that the stack is stored in block memory, it is doable for *indirect* jumps. I'd guess you are somehow cutting corners here, most probably by requiring the address of an indirect jump to be in the top-of-stack register, which is not in block memory.

Not impossible if your FPGA is very fast, like a top-speed Arria-10, where you can instantiate Nios2e at 380 MHz and a full-featured Nios2f at 300 MHz+. But it does look impossible in low-speed-grade budget parts, like the slowest speed grades of Cyclone4E/10LP or even of Cyclone5. And I suppose that the Lattice Mach series is somewhat slower than even those.

I only use the low grade parts. I haven't used NIOS

Nios, not NIOS. The proper name and spelling is Nios2, because for a brief period in the early 00s Altera had a completely different architecture that was called Nios.

and this processor won't get to 380 MHz, I'm pretty sure. Pipelining it would be counter to its design goals but might be practical; never thought about it.


The only way that I can see non-pipelined conditional branches working at 100 MHz in low-end devices is if your architecture has a branch delay slot. But that by itself is a sort of pipelining, just pipelining exposed to SW instead of being done in HW.

Or the instruction is simple and runs fast.

I don't doubt that you did it, but answers like that smell of hand-waving.

Besides, my current hobby interest is in 500-700 LUT4s rather than in 1000+ LUT4s. If 1000 LUT4s are available then 1400 LUT4s are probably available too, so one can as well use the OTS Nios2f, which is pretty fast and validated to a level that hobbyists' cores can't even dream about.

That's where my CPU lies; I think it was 600 LUT4s last time I checked.

Does it include single-cycle 32-bit shift/rotate by an arbitrary 5-bit count (5 variations: logical and arithmetic right shift, logical left shift, rotate right, rotate left)?
Does it include zero-extended and sign-extended byte and half-word loads (fetches, in your language)?
In my cores these two functions combined are the biggest block, bigger than the 32-bit ALU, and comparable in size to the result writeback mux.
Also, I assume that your cores have no multiplier, right?
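
For reference, this is the sort of logarithmic shifter I mean - a sketch with made-up names, right shifts only (the rotates would feed the shifted-out bits back in as fill); the five mux stages are why this block tends to dominate a small core's LUT budget:

-- Assumes ieee.std_logic_1164; arith = '1' selects arithmetic shift.
function shift_right32 (x     : std_logic_vector(31 downto 0);
                        cnt   : std_logic_vector(4 downto 0);
                        arith : std_logic) return std_logic_vector is
  variable v    : std_logic_vector(31 downto 0) := x;
  variable fill : std_logic_vector(31 downto 0);
begin
  fill := (others => (x(31) and arith));  -- sign bit or zeros
  for i in 0 to 4 loop                    -- stage i shifts by 2**i
    if cnt(i) = '1' then
      v := fill(2**i - 1 downto 0) & v(31 downto 2**i);
    end if;
  end loop;
  return v;
end function;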


> Rick C.
 
On Thursday, February 14, 2019 at 8:38:47 AM UTC-5, already...@yahoo.com wrote:
On Thursday, February 14, 2019 at 1:24:40 PM UTC+2, gnuarm.del...@gmail.com wrote:
On Thursday, February 14, 2019 at 5:07:53 AM UTC-5, already...@yahoo.com wrote:
On Thursday, February 7, 2019 at 11:43:39 PM UTC+2, gnuarm.del...@gmail.com wrote:

Ok, if you are doing C in FPGA CPUs then you are in a different world than the stuff I've worked on. My projects use a CPU as a controller and often have very critical real time requirements. While C doesn't prevent that, I prefer to just code in assembly language and more importantly, use a CPU design that provides single cycle execution of all instructions. That's why I like stack processors, they are easy to design, use a very simple instruction set and the assembly language can be very close to the Forth high level language.


Can you quantify the criticality of your real-time requirements?

Eh? Are you asking what my requirements are, or how important they are?

How important they are. What happens if a particular instruction most of the time
takes n clocks, but sometimes, rarely, could take n+2 clocks? Are system-level requirements impacted?

Of course, that depends on the application. In some cases it would simply not work correctly, because it was designed into the rest of the logic, not entirely unlike an FSM. In other cases it would make the timing indeterminate, which means it would be harder to design the logic surrounding this piece.


Not sure how to answer that question. I can only say that my CPU designs give single cycle execution, so I can design with them the same way I design the hardware in VHDL.


Also, even for the most critical requirements, what's wrong with multiple cycles per instruction as long as the # of cycles is known up front?

It increases interrupt latency which is not a problem if you aren't using interrupts, a common technique for such embedded processors.

I don't like interrupts in small systems. Neither in MCUs nor in FPGAs.
In MCUs nowadays we have badass DMAs. In FPGAs we can build a badass DMA ourselves. Or throw multiple soft cores at multiple tasks. That's why I am interested in *small* soft cores in the first place.

Yup, interrupts can be very bad. But if your requirements are to do one thing in software that has real time requirements (such as servicing an ADC/DAC or a fast UART) while the rest of the code is managing functions with much more relaxed real time requirements, using an interrupt can eliminate a CPU core or the design of a custom DMA with particular features that are easy in software.

There are things that are easy to do in hardware and things that are easy to do in software with some overlap. Using a single CPU and many interrupts fits into the domain of not so easy to do. That doesn't make simple use of interrupts a bad thing.


Otherwise multi-cycle instructions complicate the CPU instruction decoder.

I see no connection to the decoder. Maybe you mean the microsequencer?

The decoder has outputs y(i) = f(x(j)) where x(j) is all the inputs, y(i) is all the outputs and f() is the function mapping inputs to outputs. If you have multiple states per instruction, the decoding function has more inputs than if you only decode instructions plus whatever state flags might be used, such as carry or zero or an interrupt input.

In general this will result in more complex instruction decoding.
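
A toy illustration with hypothetical opcodes and widths - two alternative versions of one decode output:

-- Single cycle: the output is a function of the 6-bit opcode alone.
alu_op <= "00" when opcode = OP_ADD else
          "01" when opcode = OP_SUB else
          "10";

-- Multi-cycle: the same output now also depends on a 2-bit phase
-- counter, so f() has 8 inputs instead of 6 - on LUT4s that quickly
-- costs an extra logic level.
alu_op <= "00" when opcode = OP_ADD and phase = "00" else
          "01" when opcode = OP_SUB and phase = "01" else
          "10";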


> Generally, I disagree. At least for very fast clock rates it is easier to design a non-pipelined or partially pipelined core where every instruction flows through several phases.

If by "easier" you mean possible, then yes. That's why they use pipelining, to achieve clock speeds that otherwise can't be met. But it is seldom simple since pipelining is more than just adding registers. Instructions interact and on branches the pipeline has to be flushed, etc.


> Or, maybe, you are thinking about variable-length instructions? That's, again, orthogonal to the number of clocks per instruction. Anyway, I think that variable-length instructions are very cool, but not for a 500-700 LUT4 budget. I would start to consider VLI for something like 1200 LUT4s.

Nope, just talking about using multiple clock cycles for instructions. Using a variable number of clock cycles would be more complex in general, and multiple-length instructions even worse... in general. There are always possibilities to simplify some aspect of this by complicating some aspect of that.


Using a short instruction format allows minimal decode logic. Adding a cycle counter increases the number of inputs to the instruction decode block and so complicates the logic significantly.


Things like caches and branch predictors indeed cause variability (which by itself is o.k. for 99.9% of uses), but that's orthogonal to # of cycles per instruction.

Cache, branch predictors??? You have that with 1 kLUT CPUs??? I think we design in very different worlds.

I don't *want* data caches in the sort of tasks that I do with these small cores. Instruction caches are something else; I am not against them in "hard" MCUs.
In the small soft cores that we are discussing right now they are impractical rather than evil.

Or unneeded. If the program fits in the on-chip memory, no cache is needed. What sort of programming are you doing in <1 kLUT CPUs that would require slow off-chip program storage?


> But static branch prediction is something else. I can see how static branch prediction is practical in 700-800 LUT4s. I didn't have it implemented in my half-dozen cores (in the meantime the # is growing). But it is practical, esp. for applications that spend most of the time in very short loops.

If the jump instruction is one clock cycle and there is no pipeline, branch prediction is not possible, I think.


My program storage is inside the FPGA and runs at the full speed of the CPU. The CPU is not pipelined (according to me; someone insisted that it was a 2-level pipeline, but with no pipeline delay,

I am starting to suspect that you have a very special definition of "not pipelined" that differs from the definition used in the literature.

Ok, not sure what that means. Every instruction takes one clock cycle. While a given instruction is being executed, the next instruction is being fetched, but the *actual* next instruction, not the "possible" next instruction. All branches happen during the branch instruction's execution, which fetches the correct next instruction.

This guy said I was pipelining the fetch and execute... I see no purpose in calling that pipelining since it carries no baggage of any sort.
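
As a sketch of what I mean (the signal names and widths here are mine, invented for illustration, not from the actual design): the branch condition steers the next-PC mux combinationally, and the registered fetch always grabs the real successor, so nothing is ever speculated or flushed.

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  entity fetch_demo is
      port (
          clk           : in  std_logic;
          take_branch   : in  std_logic;            -- decided during execute
          branch_target : in  unsigned(7 downto 0);
          instr         : out std_logic_vector(15 downto 0)
      );
  end entity;

  architecture rtl of fetch_demo is
      type rom_t is array (0 to 255) of std_logic_vector(15 downto 0);
      constant program_rom : rom_t := (others => (others => '0'));  -- placeholder contents
      signal pc, pc_next : unsigned(7 downto 0) := (others => '0');
  begin
      -- the branch resolves in the same cycle, so the fetch below is
      -- always the *actual* next instruction, never a predicted one
      pc_next <= branch_target when take_branch = '1' else pc + 1;

      process (clk)
      begin
          if rising_edge(clk) then
              pc    <= pc_next;
              instr <= program_rom(to_integer(pc_next));
          end if;
      end process;
  end architecture;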


oh well) so no branch prediction needed.


Many stack based CPUs can be implemented in 1k LUT4s or less. They can run fast, >100 MHz and typically are not pipelined.

1 cycle per instruction, not pipelined, means that the stack cannot be implemented
in memory block(s). Which, in combination with 1K LUT4s, means that either the stack is very shallow or it is not wide (i.e. 16 bits rather than 32 bits). Either of these means that you need many more instructions (relative to a 32-bit RISC with 32 or 16 registers) to complete the job.

Huh? So my block RAM stack is pipelined or are you saying I'm only imagining it runs in one clock cycle? Instructions are things like

ADD, CALL, SHRC (shift right with carry), FETCH (read memory), RET (return from call), RETI (return from interrupt). The interrupt pushes the return address to the return stack and the PSW to the data stack in one cycle with no latency, so, like the other instructions, it is single cycle, again making using it like designing with registers in the HDL code.


Also 1 cycle per instruction necessitates either strict Harvard memories or true dual-ported memories.

Or both. To get single-cycle block RAMs, the read and write happen on different phases of the main clock. I think read is on the falling edge while write is on the rising edge, like the rest of the logic. Instructions and data are in physically separate memory within the same address map, but there is no way to use either one as the other mechanically. Why would Harvard ever be a problem for an embedded CPU?
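
For what it's worth, here is a minimal sketch of that phasing (my names; the 16-bit/32-deep sizes are picked arbitrarily): write on the rising edge with the rest of the logic, read on the falling edge, so a value pushed by one instruction is visible to the next. Whether a given tool maps the two-edge arrangement onto block RAM is device and tool dependent.

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  entity stack_ram is
      port (
          clk      : in  std_logic;
          push     : in  std_logic;
          wr_addr  : in  unsigned(4 downto 0);
          rd_addr  : in  unsigned(4 downto 0);
          data_in  : in  std_logic_vector(15 downto 0);
          data_out : out std_logic_vector(15 downto 0)
      );
  end entity;

  architecture rtl of stack_ram is
      type ram_t is array (0 to 31) of std_logic_vector(15 downto 0);
      signal ram : ram_t;
  begin
      process (clk)
      begin
          if rising_edge(clk) then           -- write with the rest of the logic
              if push = '1' then
                  ram(to_integer(wr_addr)) <= data_in;
              end if;
          end if;
      end process;

      process (clk)
      begin
          if falling_edge(clk) then          -- read half a cycle later
              data_out <= ram(to_integer(rd_addr));
          end if;
      end process;
  end architecture;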


Less of a problem when you are in full control of the software stack.
When you are not in full control, compilers sometimes like to place data, esp. jump tables implementing the HLL switch/case construct, in program memory.
Still, even with full control of the code generation tools, sometimes you want an
architecture consisting of tiny startup code that loads the bulk of the code from external memory, most commonly from SPI flash.
Another, less common, possible reason is saving space by placing code and data in the same memory block, esp. when blocks are relatively big and there are few of them.

There is nothing to prevent loading code into program memory. It's all one address space and can be written to by machine code. So I guess it's not really Harvard, it's just physically separate memory. Since instructions are not a word wide, I think the program memory does not implement a full word width... to be honest, I don't recall. I haven't used this CPU in years. I've been programming in Forth on PCs more recently.

Another stack processor is the J1, which is used in a number of applications and even had a TCP/IP stack implemented in about 8 kW (kB? kinstructions?). You can find info on it with a google search. It is every bit as small as mine and a lot better documented, and it is programmed in Forth while mine is programmed in assembly which is similar to Forth.


And even with all those conditions in place, non-pipelined conditional branches at 100 MHz sound hard.

Not hard when the CPU is simple and designed to be easy to implement rather than designing it to be like all the other CPUs with complicated functionality.


It is certainly easier when branching is based on arithmetic flags rather than
on the content of a register, as is the case in MIPS derivatives, including Nios2 and RISC-V. But still hard. You have to wait for the instruction to arrive from memory, decode the instruction, do logical operations on the flags and select between two alternatives based on the result of the logical operation, all in one cycle.
If the branch is PC-relative, which is the case in nearly all popular 32-bit architectures, you also have to do an address addition, all in the same cycle.

I guess this is where I disagree on the pipelining aspect of my design. I register the current instruction, so the memory fetch is in the previous cycle based on that instruction. So my delay path starts with the instruction, not the instruction pointer. The instruction decode for each section of the CPU is in parallel, of course.

The three sections of the CPU are the instruction fetch, the data path and the address path. The data path and address path roughly correspond to the data and return stacks in Forth. In my CPU they can operate separately, and the return stack can perform simple math like increment/decrement/test since it handles addressing memory. In Forth everything is done on the data stack other than holding the return addresses, managing DO loop counts and user-specific operations.

My CPU has both PC-relative addressing and absolute addressing. One way I optimize for speed is by careful management of the low level implementation. For example, I use an adder as a multiplexor when it's not adding: A+0 is A, 0+B is B, A+B is, well, A+B.
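
A minimal sketch of that adder-as-multiplexer trick (the entity and names are invented for illustration): gate one operand to zero and the adder passes the other one through, so no separate mux is needed in front of it.

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  entity add_mux is
      port (
          a, b   : in  unsigned(15 downto 0);
          pass_a : in  std_logic;              -- '1': y = a (b forced to 0)
          pass_b : in  std_logic;              -- '1': y = b (a forced to 0)
          y      : out unsigned(15 downto 0)   -- both '0': y = a + b
      );
  end entity;

  architecture rtl of add_mux is
      signal a_g, b_g : unsigned(15 downto 0);
  begin
      a_g <= (others => '0') when pass_b = '1' else a;  -- 0 + B = B
      b_g <= (others => '0') when pass_a = '1' else b;  -- A + 0 = A
      y   <= a_g + b_g;                                 -- A + B otherwise
  end architecture;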


> But even if it's somehow doable for PC-relative branches, I don't see how, assuming that the stack is stored in block memory, it is doable for *indirect* jumps. I'd guess you are somehow cutting corners here, most probably by requiring the address of an indirect jump to be in the top-of-stack register that is not in block memory.

Indirect addressing??? Indirect addressing requires multiple instructions, yes. The return stack is typically used for address calculations, and that stack is fed directly into the instruction fetch logic... it is the "return" stack (or address unit, your choice) after all.


Not impossible if your FPGA is very fast, like top-speed Arria-10, where you can instantiate Nios2e at 380 MHz and full-featured Nios2f at 300 MHz+. But it does look impossible in low speed grades budget parts, like slowest speed grades of Cyclone4E/10LP or even of Cyclone5. And I suppose that Lattice Mach series is somewhat slower than even those.

I only use the low grade parts. I haven't used NIOS

Nios, not NIOS. The proper name and spelling is Nios2, because for a brief period in the early 00s Altera had a completely different architecture that was called Nios.

I haven't used those processors either.


and this processor won't get to 380 MHz, I'm pretty sure. Pipelining it would be counter to its design goals but might be practical; I never thought about it.


The only way that I can see non-pipelined conditional branches work at 100 MHz in low end devices is if your architecture has branch delay slot. But that by itself is sort of pipelining, just instead of being done in HW, it is pipelining exposed to SW.

Or the instruction is simple and runs fast.


I don't doubt that you did it, but answers like that smell of hand-waving.

Ok, whatever that means.


Besides, my current hobby interest is in 500-700 LUT4s rather than in 1000+ LUT4s. If 1000 LUT4s are available then 1400 LUT4s are probably available too, so one can as well use the OTS Nios2f, which is pretty fast and validated to a level that hobbyists' cores can't even dream about.

That's where my CPU lies, I think it was 600 LUT4s last time I checked.


Does it include single-cycle 32-bit shift/rotate by an arbitrary 5-bit count (5 variations: logical and arithmetic right shift, logical left shift, rotate right, rotate left)?

There are shift instructions. It does not have a barrel shifter, if that is what you are asking. A barrel shifter is not really a CPU; it is a CPU feature, and it is large and slow. Why slow down the rest of the CPU with a very slow feature? That is the sort of thing that should be external hardware.

When they design CPU chips, they have already made compromises that require larger, slower logic, which requires pipelining. The barrel shifter is perfect for pipelining, so it fits right in.


> Does it include zero-extended and sign-extended byte and half-word loads (fetches, in your language)?

I don't recall, but I'll say no. I do recall some form of sign extension, but I may be thinking of setting the top of stack by the flags. Forth has words that treat the word on the top of stack as a word, so the mapping is better if this is implemented. I'm not convinced this is really better than using the flags directly in the asm, but for now I'm compromising. I'm not really a compiler writer, so...


> In my cores these two functions combined are the biggest block, bigger than the 32-bit ALU, and comparable in size to the result writeback mux.

Sure, the barrel shifter is O(n^2) like a multiplier. That's why in small CPUs it is often done in loops. Since loops can be made efficient with the right instructions, that's a good way to go. If you really need the optimum speed for barrel shifting, then I guess a large block of logic and pipelining is the way to go.
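
The loop version, expressed in hardware terms as a hedged sketch (my names throughout): a one-bit-per-cycle shifter takes up to 31 cycles for a 32-bit word but costs a handful of LUTs instead of an O(n^2) mux tree.

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  entity shift_loop is
      port (
          clk   : in  std_logic;
          load  : in  std_logic;
          din   : in  std_logic_vector(31 downto 0);
          count : in  unsigned(4 downto 0);   -- shift distance 0..31
          done  : out std_logic;
          dout  : out std_logic_vector(31 downto 0)
      );
  end entity;

  architecture rtl of shift_loop is
      signal sh  : std_logic_vector(31 downto 0) := (others => '0');
      signal cnt : unsigned(4 downto 0) := (others => '0');
  begin
      process (clk)
      begin
          if rising_edge(clk) then
              if load = '1' then
                  sh  <= din;
                  cnt <= count;
              elsif cnt /= 0 then
                  sh  <= '0' & sh(31 downto 1);  -- logical right, one bit per cycle
                  cnt <= cnt - 1;
              end if;
          end if;
      end process;
      done <= '1' when cnt = 0 else '0';
      dout <= sh;
  end architecture;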

I needed to implement multiplications, but they are on 24 bit words that are being shifted into and out of a CODEC bit-serially. I found a software shift-and-add to work perfectly well, no need for special hardware.
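
In my CPU that was a software loop; in HDL terms the same shift-and-add looks roughly like this sketch (the 24-bit width comes from the CODEC case, everything else is assumed): one partial product per clock, 24 cycles per multiply, almost no logic.

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  entity mul24 is
      port (
          clk     : in  std_logic;
          start   : in  std_logic;
          a, b    : in  std_logic_vector(23 downto 0);
          done    : out std_logic;
          product : out std_logic_vector(47 downto 0)
      );
  end entity;

  architecture rtl of mul24 is
      signal acc    : unsigned(47 downto 0) := (others => '0');
      signal mcand  : unsigned(47 downto 0) := (others => '0');
      signal mplier : unsigned(23 downto 0) := (others => '0');
      signal cnt    : unsigned(4 downto 0)  := (others => '0');
  begin
      process (clk)
      begin
          if rising_edge(clk) then
              if start = '1' then
                  acc    <= (others => '0');
                  mcand  <= resize(unsigned(a), 48);
                  mplier <= unsigned(b);
                  cnt    <= to_unsigned(24, 5);
              elsif cnt /= 0 then
                  if mplier(0) = '1' then
                      acc <= acc + mcand;            -- add when current bit set
                  end if;
                  mcand  <= shift_left(mcand, 1);    -- next partial product
                  mplier <= shift_right(mplier, 1);  -- examine next bit
                  cnt    <= cnt - 1;
              end if;
          end if;
      end process;
      done    <= '1' when cnt = 0 else '0';
      product <= std_logic_vector(acc);
  end architecture;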

Bowman was using his J1 for video work (don't recall the details), but the Microblaze was too slow and used too much memory. The J1 did the same functions faster and in less code with generic instructions, nothing unique to the application if I remember correctly... not that the Microblaze is the gold standard.


> Also, I assume that your cores have no multiplier, right?

By "cores" you mean CPUs? Core, actually; remember the interrupt: one CPU, one interrupt. Yes, no hard multiplier as yet. The pure hardware implementation of the CODEC app used shift-and-add in hardware as well, but new features were needed and space was running out in the small FPGA, 3 kLUTs. The slower, simpler stuff could be ported to software easily for an overall reduction in LUT4 usage along with the new features.

I don't typically try to compete with the functionality of ARMs with my CPU designs. To me they are FPGA logic adjuncts. So I try to make them as simple as the other logic.

I wrote some code for a DDS in software once as a benchmark for CPU instruction set designs. The shortest and fastest I came up with was a hybrid between a stack CPU and a register CPU where objects near the top of stack could be addressed, rather than always having to move things around to put the nouns where the verbs could reach them. I have no idea how to program that in anything other than assembly, which would be ok with me. I used an Excel spreadsheet to analyze the 50 to 90 instructions in this routine. It would be interesting to write an assembler that would produce the same outputs.

Rick C.
 
32 bit RISC mcus with 32 registers... do you have any actual devices in
mind?

Hul

already5chosen@yahoo.com wrote:
On Thursday, February 7, 2019 at 11:43:39 PM UTC+2, gnuarm.del...@gmail.com wrote:
[snip]
 
On Thursday, February 14, 2019 at 11:15:58 PM UTC+2, Hul Tytus wrote:
32 bit RISC mcus with 32 registers... do you have any actual devices in
mind?

Hul

First, I don't like answering top-posters.
Next time I won't answer.

The discussion was primarily about soft cores.
The two most popular soft cores, Nios2 and Microblaze, are 32-bit RISCs with 32 registers.

In "hard" MCUs there are MIPS-based products from Microchip.

More recently a few RISC-V MCUs have appeared. Probably more are going to follow.

In the past, PPC-based MCU devices from various vendors were popular. They are less popular today, but still exist. Freescale (now NXP) e200 core variants are designed specifically for MCU applications.
https://www.nxp.com/products/product-selector:pRODUCT-SELECTOR#/category/c731_c381_c248

So, not the whole 32-bit MCU world is ARM Cortex-M. Just most of it ;-)
 
On 2019-02-14 at 11:07, already5chosen@yahoo.com wrote:
On Thursday, February 7, 2019 at 11:43:39 PM UTC+2, gnuarm.del...@gmail.com wrote:
[snip]



I think the best way to get exact performance is to implement a
multithreaded architecture.
This is not the smallest CPU architecture, but the pipeline will run at
very high frequency.

The multithreaded architecture I have used has
a classic three stage pipeline, fetch, decode, execute,
so there are three instructions active all the time.

The architecture implements ONLY 1 clock cycle in each stage.

Many CPUs implement multicycle functionality by having state machines
inside the decode stage.
The decode stage can either control the execute stage (the datapath)
directly, by decoding the instruction at the fetch stage output,
or it can control the execute stage from one of several state machines
implementing things like interrupt entry, interrupt exit, etc.

The datapath can easily require 80-120 control signals,
so each state machine needs to have the same number of state registers.
On top of that you need to multiplex all the state machines together.
This is a considerable amount of logic.

I do it a little bit differently. The CPU has an instruction set
which is basically 16 bit + immediates. This gives room for 16 registers
if you want to have a decent instruction set. 8 bit instruction
and 2 x 4 bit register addresses.

The instruction decoder supports an extended 22 bit instruction set.
This gives room for a 10 bit extended instruction set, and 2 x 6 bit
register addresses.
The extended register address space is used for two purposes.
1. To address special registers like the PSR
2. To address a constant ROM, for a few useful constants.

The fetch stage can fetch instructions from two places.
1. The instruction queue(2). The instruction queue only supports 16 bit
instructions with 16/32 bit immediates.
2. A small ROM which provides 22 bit instructions (with 22 bit immediates)

Whenever something happens which normally would require a multicycle
instruction, the thread makes a subroutine jump (0 clock cycle jump)
into the ROM, and executes 22 bit instructions.

A typical use would be an interrupt.
To clear the interrupt flag, you want to clear one bit in the PSR.

The instruction ROM contains
ANDC PSR, const22 ; AND constantROM[22] with PSR.
; ConstantROM[22] == 0xFFFFFEFF
; Clear bit 9 (I) of PSR


To implement multithreading, I need a single decoder,
but multiple register banks, one per thread.
Several special purpose registers per thread (like the PSR)
are also needed.

I also need multiple instruction queues (one per thread)

To speed up the pipeline, it is important to follow a simple rule.
A thread cannot ever execute in a cycle if the instruction
depends in any way on the result of the previous instruction.
If that rule is followed, you do not need to feed back
the result of an ALU operation to the ALU.

The simplest way to follow the rule is to never
let a thread execute during two adjacent clock cycles.
This limits the performance of a thread to at most 1/2 of
what the CPU is capable of, but at the same time
there is less logic in the critical path, so you
can increase the clock frequency.
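
A minimal sketch of that issue rule (my names, not the actual design): a free-running counter hands out issue slots round-robin, so with two or more threads no thread ever executes on adjacent cycles, and no ALU-result forwarding path is needed.

  library ieee;
  use ieee.std_logic_1164.all;

  entity thread_select is
      generic (N_THREADS : positive := 4);
      port (
          clk       : in  std_logic;
          thread_id : out natural range 0 to N_THREADS - 1
      );
  end entity;

  architecture rtl of thread_select is
      signal cnt : natural range 0 to N_THREADS - 1 := 0;
  begin
      process (clk)
      begin
          if rising_edge(clk) then
              if cnt = N_THREADS - 1 then
                  cnt <= 0;            -- wrap: each thread gets 1/N of the slots
              else
                  cnt <= cnt + 1;
              end if;
          end if;
      end process;
      thread_id <= cnt;
  end architecture;

With a fixed rotation each thread gets exactly 1/N of the issue slots, which is also what makes the exact instructions-per-millisecond property below possible.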

Now you can suddenly run code with exact properties.
You can say "I want to execute 524 instructions
per millisecond", and that is what the CPU will do.

You can let all the interrupts be executed in one thread,
so you do not disturb the time critical threads.

The architecture is well suited for FPGA work since
you can use standard dual port RAMs for registers.

I use two dual port RAMs to implement the register banks (each has one
read port and one write port)
The writes are connected together, so you have in effect a register
memory with 1 write port and 2 read ports.

If the CPU architectural model has, let's say, 16 registers x 32 bits,
and you use 2 x (256 x 32) dual port RAMs, you have storage for
16 threads: 2 x (16 threads x 16 registers x 32 bits).
If you use 512 x 32 bit DPRAMs you have room for 32 threads.
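
A hedged sketch of that register file arrangement (my port names; 16 threads x 16 registers as in the example above): two simple dual-port RAMs with their write ports tied together behave as one 1-write/2-read register memory, with the thread id forming the upper address bits.

  library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;

  entity regfile_1w2r is
      port (
          clk            : in  std_logic;
          thread         : in  unsigned(3 downto 0);   -- selects the register bank
          we             : in  std_logic;
          waddr          : in  unsigned(3 downto 0);
          ra1, ra2       : in  unsigned(3 downto 0);
          wdata          : in  std_logic_vector(31 downto 0);
          rdata1, rdata2 : out std_logic_vector(31 downto 0)
      );
  end entity;

  architecture rtl of regfile_1w2r is
      type ram_t is array (0 to 255) of std_logic_vector(31 downto 0);
      signal ram_a, ram_b : ram_t;   -- two 256 x 32 dual port RAMs
  begin
      process (clk)
      begin
          if rising_edge(clk) then
              if we = '1' then
                  -- identical writes keep the two copies coherent
                  ram_a(to_integer(thread & waddr)) <= wdata;
                  ram_b(to_integer(thread & waddr)) <= wdata;
              end if;
              rdata1 <= ram_a(to_integer(thread & ra1));  -- read port 1
              rdata2 <= ram_b(to_integer(thread & ra2));  -- read port 2
          end if;
      end process;
  end architecture;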

If you want to study a real example look at the MIPS multithreaded cores
https://www.mips.com/products/architectures/ase/multi-threading/

They decided to build that after I presented my research to their CTO.
They had more focus on the performance than on the real time control,
which is a pity.
FPGA designers do not have that limitation.

AP
 
