highest frequency periodic interrupt?...

On Mon, 16 Jan 2023 04:54:38 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

On 1/16/2023 3:21 AM, Martin Brown wrote:
On 15/01/2023 10:11, Don Y wrote:
On 1/15/2023 2:48 AM, Martin Brown wrote:
I prefer to use RDTSC for my Intel timings anyway.

On many of the modern CPUs there is a freerunning 64 bit counter clocked at
once per cycle. Intel deprecates using it for such purposes but I have never
found it a problem provided that you bracket it before and after with CPUID
to force all the pipelines into an empty state.
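A minimal sketch of that bracketing (assumptions: GCC/clang on x86-64; the names are mine). CPUID is used purely as a serializing instruction so the pipelines drain before the timestamp is read:

#include <stdint.h>

static inline uint64_t tsc_serialized(void)
{
    uint32_t eax = 0, lo, hi;

    __asm__ __volatile__("cpuid"                 /* serialize: drain pipelines */
                         : "+a"(eax)
                         :
                         : "rbx", "rcx", "rdx", "memory");
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* usage:
     uint64_t t0 = tsc_serialized();
     code_under_test();
     uint64_t t1 = tsc_serialized();
     // (t1 - t0) is elapsed TSC ticks -- look at the whole distribution,
     // not a single run.
*/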

The equivalent DWT_CYCCNT on the Arm CPUs that support it is described here:

https://stackoverflow.com/questions/32610019/arm-m4-instructions-per-cycle-ipc-counters

I prefer hard numbers to a vague scope trace.
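For the Arm side, a similar sketch of reading DWT_CYCCNT on a Cortex-M3/M4/M7 that has the DWT unit (the addresses are the architectural ones; a CMSIS header exposes the same registers as DWT->CYCCNT, DWT->CTRL and CoreDebug->DEMCR):

#include <stdint.h>

#define DEMCR       (*(volatile uint32_t *)0xE000EDFCu)
#define DWT_CTRL    (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT  (*(volatile uint32_t *)0xE0001004u)

static void cyccnt_init(void)
{
    DEMCR     |= (1u << 24);   /* TRCENA: power up the DWT/ITM block  */
    DWT_CYCCNT = 0;
    DWT_CTRL  |= 1u;           /* CYCCNTENA: start the cycle counter  */
}

/* usage:
     cyccnt_init();
     uint32_t c0 = DWT_CYCCNT;
     code_under_test();
     uint32_t cycles = DWT_CYCCNT - c0;   // free-running, wraps at 2^32
*/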

Two downsides:
- you have to instrument your code (but, if you're concerned with performance,
   you've already done this as a matter of course)

You have to make a test framework to exercise the code in as realistic a manner
as you can - that isn't quite the same as instrumenting the code (although it
can be).

It depends on how visible the information of interest is to outside
observers. If you have to "do something" to make it so, then you
may as well put in the instrumentation and get things as you want them.

I have never found profile directed compilers to be the least bit useful on my
fast maths codes because their automatic code instrumentation breaks the very
code that it is supposed to be testing (in the sense of wrecking cache lines
and locality etc.).

Exactly. The same holds true of adding invariants to code;
removing them (#ifndef DEBUG) changes the code -- subtly but
nonetheless. So, you have to put in place two levels of
final test:
- check to see if you THINK it will pass REAL final test
- actually DO the final test
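A tiny illustration of that point about debug-only invariants: the check only exists in the DEBUG build, so stripping it removes loads, a compare and a branch -- the binary you ship is not quite the binary you exercised. (The names and the 64-entry ring buffer are just stand-ins.)

#include <assert.h>
#include <stdint.h>

#ifdef DEBUG
#define INVARIANT(cond)  assert(cond)   /* debug build: trap on a violated assumption */
#else
#define INVARIANT(cond)  ((void)0)      /* release build: the check (and its code) vanishes */
#endif

typedef struct { uint8_t buf[64]; uint32_t head, tail; } ring_t;

void ring_put(ring_t *r, uint8_t byte)
{
    INVARIANT(((r->head + 1u) & 63u) != r->tail);  /* caller guaranteed space */
    r->buf[r->head] = byte;
    r->head = (r->head + 1u) & 63u;
}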

When installing copy protection/anti-tamper mechanisms in products,
there's a time when you've just enabled them and, thus, changed
how the product runs. If it *stops* running (properly), you have
to wonder if your "measures" are at fault or if some latent
bug has crept in, aggravated by the slight differences in
execution patterns.

The only profiling method I have found to work reasonably well is probably by
chance the highest frequency periodic ISR I have ever used in anger which was
to profile code by accumulating a snapshot of PC addresses allowing just a few
machine instructions to execute at a time. It used to work well back in the old
days when 640k was the limit and code would reliably load into exactly the same
locations every run.

It is a great way to find the hotspots where most time is spent.
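A sketch of what such a sampling ISR can look like on a Cortex-M target (assumptions: GCC, everything runs on the main stack, SysTick already configured to fire at a high rate; TEXT_BASE and the bucket size are placeholders). The hardware stacks r0-r3, r12, lr, pc, xPSR on exception entry, so the interrupted PC sits at [sp, #24]:

#include <stdint.h>

#define TEXT_BASE    0x08000000u          /* assumed start of the image       */
#define GRAIN_SHIFT  6u                   /* one bucket per 64 bytes of code  */
#define NBUCKETS     4096u                /* covers 256 KB of code            */

volatile uint32_t pc_hist[NBUCKETS];

void pc_sample(uint32_t pc)
{
    uint32_t bucket = (pc - TEXT_BASE) >> GRAIN_SHIFT;
    if (bucket < NBUCKETS)
        pc_hist[bucket]++;                /* dump and map to symbols offline  */
}

__attribute__((naked)) void SysTick_Handler(void)
{
    __asm volatile(
        "ldr   r0, [sp, #24]  \n"         /* PC of the interrupted code       */
        "push  {r4, lr}       \n"         /* keep 8-byte stack alignment      */
        "bl    pc_sample      \n"
        "pop   {r4, pc}       \n"         /* pops EXC_RETURN into PC          */
    );
}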

IMO, this is where logic analyzers shine. I don't agree with
using them to "trace code" (during debug) as there are better ways to
get that information. But, *watching* to see how code runs (passively)
can be a real win. Especially when you are trying to watch for
RARE aberrant behavior.

- it doesn\'t tell you about anything that happens *before* the code runs
   (e.g., latency between event and recognition thereof)

True enough. Sometimes you need a logic analyser for weird behaviour - we once
caught a CPU chip where RTI didn't always do what it said on the tin and the
instruction following the RTI instruction got executed with a frequency of
about 1:10^8. They replaced all the faulty CPUs FOC but we had to sign a
non-disclosure agreement.

If I'm really serious about finding out why something is unusually slow I
run a dangerous system level driver that allows me full access to the model
specific registers to monitor cache misses and pipeline stalls.

But, those results can change from instance to instance (as can latency,
execution time, etc.).  So, you need to look at the *distribution* of
values and then think about whether that truly represents "typical"
and/or *worst* case.

It just means that you have to collect an array of data and take a look at it
later and offline. Much like you would when testing that a library function
does exactly what it is supposed to.

Yes. So, you either have the code do the collection (using a black box)
*or* have to have an external device (logic analyzer) that can collect it
for you.

The former is nice because the code can actually make decisions
(at run time) that a passive observer often can't (because the
observer can't see all of the pertinent data). But, that starts
to have a pronounced impact on the *intended* code...
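A minimal sketch of such a "black box" (the names and the timestamp source are placeholders): events are stamped into a small circular buffer that gets dumped and inspected offline, so the running code pays only a few stores per event.

#include <stdint.h>

#define BB_SLOTS 256u                       /* power of two                  */

struct bb_entry { uint32_t when; uint16_t event; uint16_t arg; };

static struct bb_entry bb_log[BB_SLOTS];
static volatile uint32_t bb_next;

extern uint32_t cycle_count(void);          /* DWT_CYCCNT, a free timer, etc.*/

void bb_record(uint16_t event, uint16_t arg)
{
    /* single writer assumed; the index wraps, overwriting the oldest entry */
    uint32_t i = bb_next++ & (BB_SLOTS - 1u);
    bb_log[i].when  = cycle_count();
    bb_log[i].event = event;
    bb_log[i].arg   = arg;
}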

Relying on exact timings is sort of naive; it ignores how much
things can vary with the running system (is the software in a
critical region when the ISR is invoked?) and the running
*hardware* (multilevel caches, etc.)

It is quite unusual to see bad behaviour from the multilevel caches but it can
add to the variance. You always get a few outliers here and there in user code
if a higher level disk or network interrupt steals cycles.

Being an embedded system developer, the issues that often muck
up execution are often periodic -- but, with periods that are varied
enough that they only beat against the observed phenomenon occasionally.

I am always amused by folks WHO OBSERVE A F*CKUP. Then, when they
can't reproduce it or identify a likely cause, ACT AS IF IT NEVER
HAPPENED! Sheesh, you're not relying on some third-hand report
of an anomaly... YOU SAW IT! How can you pretend it didn't happen?

Apply all of your engineering creativity.
 
On 1/16/2023 7:27 AM, Gerhard Hoffmann wrote:
On 1/16/23 at 13:25, Don Y wrote:
On 1/16/2023 5:04 AM, Gerhard Hoffmann wrote:
On 1/16/23 at 11:23, Martin Brown wrote:

I didn't properly appreciate at the time quite how good this trick was for
realtime work until we tried to implement the same algorithms on the much
later and on paper faster 68000 series of CPUs.

At TU Berlin we had a place called the Zoo where there was
at least one sample of each CPU family. We used the Zoo to
port Andrew Tanenbaum\'s Experimental Machine to all of them
under equal conditions. That was a p-code engine from the
Amsterdam Free University Compiler Kit.

The 9900 was slowest, by a large margin, Z80-league.

But, only for THAT particular benchmark.  I learned, early on,
that I could create a benchmark for damn near any two processors
to make *either* (my choice) look better, just by choosing
the conditions of the test.

That was not a benchmark; that was a given large p-code machine
with the intent to use the same compilers everywhere. Not unlike
UCSD-Pascal.

Everything that you use as an example of performance is a benchmark.
If you are running a computer for a teaching course, do you
really care how fast the code executes? Can you name any products
that embodied that code -- where the performance of the hosting
processor would have a cost-performance tradeoff?

Do you care how fast your LISP machine runs?

Having no cache AND no registers was a braindead idea.

They could argue that adding cache was a logical way to
design "systems" with the device.

with a non-existing cache controller and cache rams that
cost as much as the cpu. I got a feeling for the price
of cache when I designed this:
https://www.flickr.com/photos/137684711@N07/52631074700/in/dateposted-public/
    

TI is a chip vendor. They look at things differently than
folks who *use* the chips.

NatSemi used to make DRAM controllers -- that cost more than
the DRAM! "What does this do that a few multiplexers *won't*,
for us?"

When you buy things in volume, you're "paying for the plastic";
the die have a relatively insignificant part in the cost.

Remember, the 9900/99K were from the "home computer" era.
They lost out to a dog slow 8086!

8086 was NOT slow. Have you ever used an Olivetti M20 with
a competently engineered memory system? That even challenged
early ATs when protected mode was not needed.

The 8086 (4.77MHz) was slower than a Z80, at the time.
Because applications that could use its extra abilities
were relatively few and far between.

By contrast, the number of embedded designs outpaced
\"PCs\" by likely two orders of magnitudes; far better
\"bang for buck\".

Some friends built a hardware machine around the Fairchild
Clipper. They found out that moving the hard disk driver
just a few bytes made a difference between speedy and slow
as molasses. If the data had already passed under the head you had
to wait for another disc revolution.

Unless, of course, you were already planning on being busy
doing something else, at that time.   :>  "Benchmarks lie"

That benchmark was Unix System V, as licensed from Bell.
Find something better to do when you need to swap.

How many video games was it used in? Pinball machines?
Medical devices? Process control systems? Navigation
systems? etc. None of those run a UNIX kernel nor any of
the sorts of algorithms that you'd *find* in a UNIX
kernel.

So, why would I look at the performance of a processor running
UNIX... if my product is NOT running UNIX?

Instead, I'd wonder how well it ran *my* code and what the
*cost* of that performance was -- as that will determine if
I can *price* my product so that it sells in a given market.
 
On a sunny day (Mon, 16 Jan 2023 06:29:25 -0800) it happened John Larkin
<jlarkin@highlandSNIPMEtechnology.com> wrote in
<cenashpl5f7cdk37nq4dab7hcdlqknptqh@4ax.com>:

On Mon, 16 Jan 2023 10:21:43 +0000, Martin Brown
\'\'\'newspam\'\'\'@nonad.co.uk> wrote:

On 15/01/2023 10:11, Don Y wrote:
On 1/15/2023 2:48 AM, Martin Brown wrote:
I prefer to use RDTSC for my Intel timings anyway.

On many of the modern CPUs there is a freerunning 64 bit counter
clocked at once per cycle. Intel deprecates using it for such purposes
but I have never found it a problem provided that you bracket it
before and after with CPUID to force all the pipelines into an empty
state.

The equivalent DWT_CYCCNT on the Arm CPUs that support it is described
here:

https://stackoverflow.com/questions/32610019/arm-m4-instructions-per-cycle-ipc-counters

I prefer hard numbers to a vague scope trace.

Two downsides:
- you have to instrument your code (but, if you\'re concerned with
performance,
  you\'ve already done this as a matter of course)

You have to make a test framework to exercise the code in as realistic a
manner as you can - that isn\'t quite the same as instrumenting the code
(although it can be).

I have never found profile directed compilers to be the least bit useful
on my fast maths codes because their automatic code instrumentation
breaks the very code that it is supposed to be testing (in the sense of
wrecking cache lines and locality etc.).

The only profiling method I have found to work reasonably well is
probably by chance the highest frequency periodic ISR I have ever used
in anger which was to profile code by accumulating a snapshot of PC
addresses allowing just a few machine instructions to execute at a time.
It used to work well back in the old days when 640k was the limit and
code would reliably load into exactly the same locations every run.

It is a great way to find the hotspots where most time is spent.

- it doesn\'t tell you about anything that happens *before* the code runs
  (e.g., latency between event and recognition thereof)

True enough. Sometimes you need a logic analyser for weird behaviour -
we once caught a CPU chip where RTI didn\'t always do what it said on the
tin and the instruction following the RTI instruction got executed with
a frequency of about 1:10^8. They replaced all the faulty CPUs FOC but
we had to sign a non-disclosure agreement.

If I\'m really serious about finding out why something is unusually
slow I run a dangerous system level driver that allows me full access
to the model specific registers to monitor cache misses and pipeline
stalls.

But, those results can change from instance to instance (as can latency,
execution time, etc.).  So, you need to look at the *distribution* of
values and then think about whether that truly represents \"typical\"
and/or *worst* case.

It just means that you have to collect an array of data and take a look
at it later and offline. Much like you would when testing that a library
function does exactly what it is supposed to.

Relying on exact timings is sort of naive; it ignores how much
things can vary with the running system (is the software in a
critical region when the ISR is invoked?) and the running
*hardware* (multilevel caches, etc.)

It is quite unusual to see bad behaviour from the multilevel caches but
it can add to the variance. You always get a few outliers here and there
in user code if a higher level disk or network interrupt steals cycles.

Do you have a way of KNOWING when your expectations (which you
have now decided are REQUIREMENTS!) are NOT being met?  And, if so,
what do you do (at runtime) with that information?   ("I'm sorry,
one of my basic assumptions is proving to be false and I am not
equipped to deal with that...")

Instrumenting for timing tests is very much development rather than
production code. ie is it fast enough or do we have to work harder.

Like you I prefer HLL code but I will use ASM if I have to or there is
no other way (like wanting 80 bit reals in the MS compiler). Actually I
am working on a class library to allow somewhat clunky access to it.

They annoyingly zapped access to 80 bit reals in v6 I think it was for
"compatibility" reasons since SSE2 and later can only do 64bit reals.

PowerBasic has 80-bit reals as a native variable type.

As far as timing analysis goes, we always bring out a few port pins to
test points, from uPs and FPGAs, so we can scope things. Raise a pin
at ISR entry, drop it before the RTI, scope it.
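A sketch of that arrangement (the register addresses and names are invented; substitute your part's GPIO set/clear registers). The scope then shows the ISR's execution time directly, and its latency when triggered against the originating event:

#include <stdint.h>

#define TESTPOINT_SET  (*(volatile uint32_t *)0x40020018u)  /* hypothetical set reg   */
#define TESTPOINT_CLR  (*(volatile uint32_t *)0x40020028u)  /* hypothetical clear reg */
#define TESTPOINT_PIN  (1u << 5)

void TIM2_IRQHandler(void)
{
    TESTPOINT_SET = TESTPOINT_PIN;      /* pin high: ISR entered            */

    /* ... the actual interrupt work ... */

    TESTPOINT_CLR = TESTPOINT_PIN;      /* pin low just before returning    */
}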

We wrote one Linux program that just toggled a test point as fast as
it could. That was interesting on a scope, namely the parts that
didn't toggle.

It all depends. Using a Raspberry Pi as an FM transmitter (80 to 100 MHz or so):
https://linuxhint.com/turn-raspberry-pi-fm-transmitter/

That code gave me the following idea,
freq_pi:
http://panteltje.com/panteltje/newsflex/download.html#freq_pi

and that was for a very old Pi model,
somebody then ported it to a later model, no idea how fast you can go on a Pi4.
 
On Sat, 14 Jan 2023 22:21:08 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

On 1/14/2023 10:10 PM, upsidedown@downunder.com wrote:
In the past coding ISRs in assembly was the way to go, but the
complexity of current processors (cache, pipelining) makes it hard to
beat a _good_ compiler.

Exactly. And, it's usually easier to see what you are trying
to do in a HLL vs. ASM (and heaven forbid you want to port
the application to a different processor!)

The problem with using an HLL is making sure you actually
understand what some "line of code" translates into when it comes
to actual opcode/memory accesses (not just which instructions
but, rather, the *cost* of those instructions)

And, this can change, based on *how* the compiler is invoked
(how aggressive the code generator)

The main principle still is to minimize the number of registers saved
at interrupt entry (and restored at exit). On a primitive processor only
the processor status word and program counter need to be saved (and
restored). Additional registers may need to be saved (and restored) if the
ISR uses them.

Some "advanced" processors still support a "Fast IRQ" that saves
just an accumulator and PSW. A tacit acknowledgement that you
don't want to have to save the *entire* processor state (as
you likely don't know what portions of it the compiler *might*
call on).

If the processor has separate FP registers and/or separate FP status
words, avoid using FP registers in ISRs.

As with everything, *how* you use them can make a difference.
E.g., if your ISR reenables interrupts (prior to completion), it
can make sense to use "expensive" instruction sequences (assuming
the ISR doesn\'t interrupt itself).

[Degenerate example: the scheduler being invoked!]

In a system with multiple ISRs, spending too long in a single ISR is a
bad idea. Better just read inputs in the ISR and postpone the time
consuming processing to a lower priority "pseudo ISR" (SW ISR).

Many processors have software interrupts (SWI), traps or whatever each
manufacturer calls it.

In such environment the HW ISR at close to the end issues the TRAP
instruction (SWI request), which is not activated as long as the HW
ISR is still executing. When the HW ISR exits, interrupts are enabled.

If there are other HW interrupts pending, those are executed
first. When no more HW interrupts are pending the SW ISR can start
executing. This SW ISR can be quite time consuming. A new hardware
interrupt request may interrupt the SW ISR.

When the SW ISR finally exits, the originally interrupted program is
resumed.
 
On 1/16/2023 10:39 AM, upsidedown@downunder.com wrote:
On Sat, 14 Jan 2023 22:21:08 -0700, Don Y
blockedofcourse@foo.invalid> wrote:

On 1/14/2023 10:10 PM, upsidedown@downunder.com wrote:
In the past coding ISRs in assembly was the way to go, but the
complexity of current processors (cache, pipelining) makes it hard to
beat a _good_ compiler.

Exactly. And, it\'s usually easier to see what you are trying
to do in a HLL vs. ASM (and heaven forbid you want to port
the application to a different processor!)

The problem with using an HLL is making sure you actually
understand some \"line of code\" translates into when it comes
to actual opcode/memory accesses (not just which instructions
but, rather, the *cost* of those instructions)

And, this can change, based on *how* the compiler is invoked
(how aggressive the code generator)

The main principle still is to minimize the number of registers saved
at interrupt entry (and restored at exit). On a primitive processor only
the processor status word and program counter need to be saved (and
restored). Additional registers may need to be saved (and restored) if the
ISR uses them.

Some \"advanced\" processors still support a \"Fast IRQ\" that saves
just an accumulator and PSW. A tacit acknowledgement that you
don\'t want to have to save the *entire* processor state (as
you likely don\'t know what portions of it the compiler *might*
call on).

If the processor has separate FP registers and/or separate FP status
words, avoid using FP registers in ISRs.

As with everything, *how* you use them can make a difference.
E.g., if your ISR reenables interrupts (prior to completion), it
can make sense to use \"expensive\" instruction sequences (assuming
the ISR doesn\'t interrupt itself).

[Degenerate example: the scheduler being invoked!]

In a system with multiple ISRs, spending too long in a single ISR is a
bad idea.

Of course! Paraphrasing the man with the wild hair: "Do as little
as possible -- but no less!"

Better just read inputs in the ISR and postpone the time
consuming processing to a lower priority \"pseudo ISR\" (SW ISR).

Or "outputs". I always build a "driver" (that runs in the ISR)
and a "handler" for "devices". Only the handler need be concerned
with the driver. And, system code should only need to deal with
the handler.

If writing in ASM, one can dick with the stack frame and "arrange"
to RTI to an intermediate level of code that runs below the
background but above the ISR. This lets you allow more ISRs to
be serviced while you are, effectively, still servicing the
previous one.

You can also structure the ISR as a state machine (if appropriate
to the task at hand) to remove conditionals from executing in
the ISR. So, subsequent IRQs are dispatched to different
routines, instead of having a "big" ISR that tries to juggle
the various needs.

[I have a clever little bit of code that makes this relatively inexpensive
but requires the code to reside in writeable RAM (as it is self-modifying)]
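A sketch of that structure for, say, a UART transmitter: each state installs the routine that should service the *next* interrupt, so the handler is one indirect call with no flag testing. (This is the function-pointer variant of the trick, not the self-modifying one; register names are invented.)

#include <stdint.h>

static void tx_idle(void);
static void tx_body(void);
static void tx_last(void);

static void (*tx_state)(void) = tx_idle;
static uint32_t tx_remaining;

void UART_TX_IRQHandler(void)           /* the vector points here            */
{
    tx_state();
}

void uart_send(uint32_t n)              /* called from thread level          */
{
    tx_remaining = n;
    tx_state = tx_body;
    /* enable the TX-empty interrupt to start the state machine              */
}

static void tx_body(void)
{
    /* UART_TDR = next byte;  (register name assumed)                        */
    if (--tx_remaining == 1u)
        tx_state = tx_last;
}

static void tx_last(void)
{
    /* UART_TDR = final byte; then disable the TX-empty interrupt            */
    tx_state = tx_idle;
}

static void tx_idle(void)
{
    /* nothing queued: disable/clear the spurious interrupt and log it       */
}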

Many processors have software interrupts (SWI), traps or whatever each
manufacture is calling it.

In such environment the HW ISR at close to the end issues the TRAP
instruction (SWI request), which is not activated as long as the HW
ISR is still executing. When the HW ISR exits, interrupts are enabled.

A trap is often expensive. You can emulate this functionality (as above).

If there is an other HW interrupt(s) pending, those are first
executed. When no more HW interrupts are pending the SW ISR can start
executing. This SW ISR can be quite time consuming. A new hardware
interrupt request may interrupt the SW ISR.

When the SW ISR finally exits, the originally interrupted program is
resumed.

Nowadays, processors have more than just (legacy) interrupts utilizing the
same "exception" mechanism. Ideally, you structure all of your
"exception handlers" with a similar set of guidelines, even though
some don't have the timeliness constraints of traditional ISRs.
 
On 1/16/23 at 17:19, Don Y wrote:
On 1/16/2023 7:27 AM, Gerhard Hoffmann wrote:

That was not a benchmark; that was a given large p-code machine
with the intent to use the same compilers everywhere. Not unlike
UCSD-Pascal.

Everything that you use as an example of performance is a benchmark.

That was not created as a benchmark. The goal was to have the
same compiler and operating system on most of the upcoming
microsystems available. Not too unexpected for an operating
systems department at a university.
And when there were underperformers, that would not go
unnoticed.

> Do you care how fast your LISP machine runs?

We had some ICL Perqs; not my cup of meat.
I had a small Prolog system on my Z80, funny but
nothing for real work.

And yes, I was interested how fast my machines ran.
In the VLSI course, I talked a group of other students
into doing a stack machine much like Tanenbaum\'s, only
simpler in HP\'s dynamic NMOS process. Unluckily, we caught
a metal flake from a neighbor project that the DRC did not get.


with a non-existing cache controller and cache rams that
cost as much as the cpu. I got a feeling for the price
of cache when I designed this:

https://www.flickr.com/photos/137684711@N07/52631074700/in/dateposted-public/      

TI is a chip vendor.  They look at things differently than
folks who *use* the chips.

So is Intel.


NatSemi used to make DRAM controllers -- that cost more than
the DRAM!  \"What does this do that a few multiplexers *won\'t*,
for us?\"

When you buy things in volume, you\'re \"paying for the plastic\";
the die have a relatively insignificant part in the cost.

Remember, the 9900/99K were from the \"home computer\" era.
They lost out to a dog slow 8086!

8086 was NOT slow. Have you ever used an Olivetti M20 with
a competently engineered memory system? That even challenged
early ATs when protected mode was not needed.

The 8086 (4.77MHz) was slower than a Z80, at the time.
Because applications that could use its extra abilities
were relatively few and far between.

Ah, I had both of them, in the same 19" box.

Unless, of course, you were already planning on being busy
doing something else, at that time.   :>  \"Benchmarks lie\"

That benchmark was Unix System V, as licensed from Bell.
Find something better to do when you need to swap.

How many video games was it used in?  Pinball machines?
Medical devices?  Process control systems?  Navigation
systems?  etc.  None of those run a UNIX kernel nor any of
the sorts of algorithms that you\'d *find* in a UNIX
kernel.

Pinball machines with a Fairchild Clipper? Do you have
an idea what a Clipper module did cost? The machine was
intended as a multi user Unix machine. I later got a paid
project to build a VME bus terminal concentrator based
on 80186 for it.

Why should I care about medical devices, video games, pinball
or navigation systems? GPS was an experiment at that time and
the 50 Baud navigation strings no problem for sure.

So, why would I look at the performance of a processor running
UNIX... if my product is NOT running UNIX?

The product WAS running UNIX. I wrote that they had bought a
commercial source license from Bell.

Cheers, Gerhard
 
On Monday, January 16, 2023 at 18:39:13 UTC+1, upsid...@downunder.com wrote:
On Sat, 14 Jan 2023 22:21:08 -0700, Don Y
blocked...@foo.invalid> wrote:
On 1/14/2023 10:10 PM, upsid...@downunder.com wrote:
In the past coding ISRs in assembly was the way to go, but the
complexity of current processors (cache, pipelining) makes it hard to
beat a _good_ compiler.

Exactly. And, it\'s usually easier to see what you are trying
to do in a HLL vs. ASM (and heaven forbid you want to port
the application to a different processor!)

The problem with using an HLL is making sure you actually
understand some \"line of code\" translates into when it comes
to actual opcode/memory accesses (not just which instructions
but, rather, the *cost* of those instructions)

And, this can change, based on *how* the compiler is invoked
(how aggressive the code generator)

The main principle still is to minimize the number of registers saved
at interrupt entry (and restored at exit). On a primitive processor only
the processor status word and program counter need to be saved (and
restored). Additional registers may need to be saved (and restored) if the
ISR uses them.

Some \"advanced\" processors still support a \"Fast IRQ\" that saves
just an accumulator and PSW. A tacit acknowledgement that you
don\'t want to have to save the *entire* processor state (as
you likely don\'t know what portions of it the compiler *might*
call on).

If the processor has separate FP registers and/or separate FP status
words, avoid using FP registers in ISRs.

As with everything, *how* you use them can make a difference.
E.g., if your ISR reenables interrupts (prior to completion), it
can make sense to use \"expensive\" instruction sequences (assuming
the ISR doesn\'t interrupt itself).

[Degenerate example: the scheduler being invoked!]

In a system with multiple ISRs, spending too long in a single ISR is a
bad idea. Better just read inputs in the ISR and postpone the time
consuming processing to a lower priority \"pseudo ISR\" (SW ISR).

Many processors have software interrupts (SWI), traps or whatever each
manufacture is calling it.

In such environment the HW ISR at close to the end issues the TRAP
instruction (SWI request), which is not activated as long as the HW
ISR is still executing. When the HW ISR exits, interrupts are enabled.

If there is an other HW interrupt(s) pending, those are first
executed. When no more HW interrupts are pending the SW ISR can start
executing. This SW ISR can be quite time consuming. A new hardware
interrupt request may interrupt the SW ISR.

When the SW ISR finally exits, the originally interrupted program is
resumed.

I've sometimes done that by setting the pending bit on an otherwise unused
interrupt set at a low priority; cortex-m does tail chaining, so RTI from
an interrupt while another is pending is effectively just a jump and a change in priority

another (and tricky to get right) way is to add a stack frame with the new code's
address and do an RTI, à la a task switch in an OS
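A sketch of that pending-bit trick on a Cortex-M (the addresses are the architectural NVIC ones; with CMSIS this is just NVIC_SetPendingIRQ()). The IRQ number, handler names and the data register are placeholders:

#include <stdint.h>

#define DEFERRED_IRQn  42u                   /* an IRQ the design never uses */
#define NVIC_ISPR(n)   (*(volatile uint32_t *)(0xE000E200u + 4u * ((n) / 32u)))

static volatile uint16_t latest_sample;

void ADC_IRQHandler(void)                    /* high priority, kept short    */
{
    latest_sample = *(volatile uint16_t *)0x40012040u;   /* placeholder reg  */
    NVIC_ISPR(DEFERRED_IRQn) = 1u << (DEFERRED_IRQn % 32u);
}

void Deferred_IRQHandler(void)               /* installed at a low priority  */
{
    /* the time-consuming processing of latest_sample goes here; any
       higher-priority interrupt can preempt it, and on return the core
       tail-chains to whatever else is pending                              */
}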
 
On 1/16/2023 12:30 PM, Gerhard Hoffmann wrote:
Am 16.01.23 um 17:19 schrieb Don Y:
On 1/16/2023 7:27 AM, Gerhard Hoffmann wrote:

That was not a benchmark; that was a given large p-code machine
with the intent to use the same compilers everywhere. Not unlike
UCSD-Pascal.

Everything that you use as an example of performance is a benchmark.

That was not created as a benchmark. The goal was to have the
same Compiler and operating system on most of the upcoming
microssystems available. Not too unexpected for an operation
system department at a univ.
And when there were underperformers, that would not go
unnoticed.

You used <whatever> as a way of comparing the performance of
different processors. It matters not why the application
was *created*; you, HERE, used it as a metric for comparing
which makes it a benchmark.

If I report how well my app runs on N different processors,
I am doing so to demonstrate the "value" of those different
processors, when running my app.

If I can assign a cost for each platform, then I can tell
you where my app runs "most economically".

Do you care how fast your LISP machine runs?

We had some ICL Perqs; not my cup of meat.
I had a small Prolog system on my Z80, funny but
nothing for real work.

And yes, I was interested how fast my machines ran.
In the VLSI course, I talked a group of other students
into doing a stack machine much like Tanenbaum\'s, only
simpler in HP\'s dynamic NMOS process. Unluckily, we caught
a metal flake from a neighbor project that the DRC did not get.

So, you migrated EVERY instance over to the "best"
processor? Or, was it just an academic exercise?

I build things. I make design decisions based on
how much performance I can get for a given outlay of
money.

When I was doing video games, the software folks clamored for
a 2MHz design -- claiming the processor was only a dollar or two
more expensive than the 1MHz part. But, the ramifications to
the rest of the system were considerably more; what value a 2MHz
part if it spends extra cycles waiting for each opcode fetch?

So, why add a dollar to the BoM if you aren't going to get
any performance increase for it? And, no one is going to
justify adding all the other costs to make the 2MHz part
actually run twice as fast as the 1MHz part.

with a non-existing cache controller and cache rams that
cost as much as the cpu. I got a feeling for the price
of cache when I designed this:

https://www.flickr.com/photos/137684711@N07/52631074700/in/dateposted-public/      

TI is a chip vendor.  They look at things differently than
folks who *use* the chips.

So is Intel.

Where's Fairchild, today? TI still struggles to produce
mainstream (non-DSP) processors -- without ARM's IP.

NatSemi used to make DRAM controllers -- that cost more than
the DRAM!  \"What does this do that a few multiplexers *won\'t*,
for us?\"

When you buy things in volume, you\'re \"paying for the plastic\";
the die have a relatively insignificant part in the cost.

Remember, the 9900/99K were from the \"home computer\" era.
They lost out to a dog slow 8086!

8086 was NOT slow. Have you ever used an Olivetti M20 with
a competently engineered memory system? That even challenged
early ATs when protected mode was not needed.

The 8086 (4.77MHz) was slower than a Z80, at the time.
Because applications that could use its extra abilities
were relatively few and far between.

Ah, I had both of them, in the same 19\" box.

How many did you sell?

Unless, of course, you were already planning on being busy
doing something else, at that time.   :>  \"Benchmarks lie\"

That benchmark was Unix System V, as licensed from Bell.
Find something better to do when you need to swap.

How many video games was it used in?  Pinball machines?
Medical devices?  Process control systems?  Navigation
systems?  etc.  None of those run a UNIX kernel nor any of
the sorts of algorithms that you\'d *find* in a UNIX
kernel.

Pinball machines with a Fairchild Clipper? Do you have
an idea what a Clipper module did cost? The machine was
intended as a multi user Unix machine. I later got a paid
project to build a VME bus terminal concentrator based
on 80186 for it.

We looked at putting T11\'s and F11\'s in games.

Why should I care about medical devices, video games, pinball
or navigation systems? GPS was an experiment at that time and
the 50 Baud navigation strings no problem for sure.

Because people make decisions based on the products that they
sell. I'm not going to put a costly processor in a *mouse*.
And, I'm not going to put a 5c processor in a process control
system.

So, why would I look at the performance of a processor running
UNIX... if my product is NOT running UNIX?

The product WAS running UNIX. I wrote they had bought a
commercial source license from Bell.

How many did you sell? You can put a PLC in a product -- if
the product can bear the cost/size of a PLC. *If* it
"benchmarks appropriately" for your product (price,
performance, development time, etc.). Yet, you might be
able to provide the same functionality with a little SoC.
How do you make that decision?
 
On 1/16/2023 11:10, Don Y wrote:
On 1/15/2023 7:10 AM, Dimiter_Popoff wrote:
How many registers does it stack automatically? I knew the HLL nonsense
would catch up with CPU design eventually.

IIRC, it stacks PC, PSR, LR (i.e., R14) & R[0-3].

Lasse also mentioned it, I see and it makes sense; I did not realize
this was a "small" flavour of ARM, I am not familiar with any ARM.

PC and PSR must be preserved, of course (if you have a special shadow
register for each, that's just an optimization -- that only works if
ISRs can't be interrupted.  Remember, you can ALWAYS throw an exception,
even in an ISR!  E.g. the "push" can signal a page fault).  The link
register (think "BAL" -- what's old is now new!  :> ) determines *how*
the ISR terminates.

On power, you get the PC and the MSR saved in special purpose registers
assigned to do exactly that. You have to stack them yourself, even that.
You don\'t even have a stack pointer, it is up to you to assign one
of the GPR-s to that. The 603e core maximizes delegation to a huge
extent and this is very convenient when you know what you are doing.
Even when you get a TLB miss, all you get is an exception plus
r0-r3 being switched to a shadow bank, Z80 style, meant for you
to locate the entry in the page translation table and put it in the
TLB; if not in the table you do a page fault, go into allocation,
swapping if needed, fixing etc., you know how it goes. You don\'t need
to stack anything, the 4 regs you have are enough to do the TLB fix.

Getting a page fault in an ISR is hugely problematic, if this is
possible it compromises the entire design (so much about interrupt
latency). In dps for power there is a "block translated area" (no
page translation for it, it is always there) where interrupt handling
code is to be placed.
And there are 3 stack pointers in dps for 32 bit power: user, supervisor
and interrupt. The interrupt stack pointer is always translated (also
in BAT memory) and any exception first stacks a few registers in that
interrupt stack; then it can switch to say supervisor stack pointer
and go on to preempt a task, just do a system call etc.

.....

Good CPU design still means
load/store machines, stacking *nothing* at IRQ, just saving PC and CCR
to special purpose regs which can be stacked as needed by the IRQ

What do you do if you throw an exception BEFORE (or while!) doing
that stacking?  Does the CPU panic?  :>  (e.g., a double fault on
a 68k!)

I talked about this above, must have anticipated the question :).

....

Some *really* well designed for control applications processors allow
you to lock a part of the cache but I doubt ARM have that, they seem to
have gone the way "make programming a two click job" to target a
wider audience.

The "application processors" most definitely let you exert control over
the cache -- as well as processor affinity.

But, you *really* need to be wary about doing this as it sorely
impacts the utility of those mechanisms on the rest of your code!
I.e., if you wire-down part of the cache to expedite an ISR, then
you have forever taken that resource away from the rest of your code
to use.  Are you smart enough to know how to make that decision,
"in general" (specific cases are a different story)?

The power core I mostly use can lock parts of the cache, IIRC in
1/4-th (i.e. 4k) increments. I have never used that though.

I was somewhat surprised that ARM has the ability to truly prioritize
interrupts, 68k style. Both you and Lasse said that, this is an
important thing to have.
 
On Monday, January 16, 2023 at 21:53:01 UTC+1, Dimiter Popoff wrote:
On 1/16/2023 11:10, Don Y wrote:
On 1/15/2023 7:10 AM, Dimiter_Popoff wrote:
How many registers does it stack automatically? I knew the HLL nonsense
would catch up with CPU design eventually.

IIRC, it stacks PC, PSR, LR (i.e., R14) & R[0-3].
Lasse also mentioned it, I see and it makes sense; I did not realize
this was a \"small\" flavour of ARM, I am not familiar with any ARM.

now there are basically two types: cortex-Mx, which is a 32bit microcontroller
where, with increasing x, features are added like DSP instructions and single- and double-precision FPU,

and the cortex-A which is a 32/64bit cpu used in cell phones etc.

PC and PSR must be preserved, of course (if you have a special shadow
register for each, that\'s just an optimization -- that only works if
ISRs can\'t be interrupted. Remember, you can ALWAYS throw an exception,
even in an ISR! E.g. the \"push\" can signal a page fault). The link
register (think \"BAL\" -- what\'s old is now new! :> ) determines *how*
the ISR terminates.
On power, you get the PC and the MSR saved in special purpose registers
assigned to do exactly that. You have to stack them yourself, even that.
You don\'t even have a stack pointer, it is up to you to assign one
of the GPR-s to that. The 603e core maximizes delegation to a huge
extent and this is very convenient when you know what you are doing.
Even when you get a TLB miss, all you get is an exception plus
r0-r3 being switched to a shadow bank, Z80 style, meant for you
to locate the entry in the page translation table and put it in the
TLB; if not in the table you do a page fault, go into allocation,
swapping if needed, fixing etc., you know how it goes. You don\'t need
to stack anything, the 4 regs you have are enough to do the TLB fix.

looks similar to the older generation ARM, ARM7-TDMI

stack pointer was also just a GP register defined as stack pointer

it had one IRQ, and it only shadowed the return address and status register
and an FIQ (fast) that shadowed the return address, status register, and (afair)
seven general purpose registers

quite a bit of code needed to find the interrupt source, stacking, etc. and even worse
if preemption was needed
 
On 1/16/2023 23:22, Lasse Langwadt Christensen wrote:
mandag den 16. januar 2023 kl. 21.53.01 UTC+1 skrev Dimiter Popoff:
On 1/16/2023 11:10, Don Y wrote:
On 1/15/2023 7:10 AM, Dimiter_Popoff wrote:
How many registers does it stack automatically? I knew the HLL nonsense
would catch up with CPU design eventually.

IIRC, it stacks PC, PSR, LR (i.e., R14) & R[0-3].
Lasse also mentioned it, I see and it makes sense; I did not realize
this was a \"small\" flavour of ARM, I am not familiar with any ARM.

now there is basically two types cortex-Mx which is a 32bit microcontroller
with increasing x features are added like DSP instructions single, and double, FPU

and the cortex-A which is a 32/64bit cpu used in cell phones etc.


PC and PSR must be preserved, of course (if you have a special shadow
register for each, that\'s just an optimization -- that only works if
ISRs can\'t be interrupted. Remember, you can ALWAYS throw an exception,
even in an ISR! E.g. the \"push\" can signal a page fault). The link
register (think \"BAL\" -- what\'s old is now new! :> ) determines *how*
the ISR terminates.
On power, you get the PC and the MSR saved in special purpose registers
assigned to do exactly that. You have to stack them yourself, even that.
You don\'t even have a stack pointer, it is up to you to assign one
of the GPR-s to that. The 603e core maximizes delegation to a huge
extent and this is very convenient when you know what you are doing.
Even when you get a TLB miss, all you get is an exception plus
r0-r3 being switched to a shadow bank, Z80 style, meant for you
to locate the entry in the page translation table and put it in the
TLB; if not in the table you do a page fault, go into allocation,
swapping if needed, fixing etc., you know how it goes. You don\'t need
to stack anything, the 4 regs you have are enough to do the TLB fix.


look similar to the older generation ARM, ARM7-TDMI

stack pointer was also just a GP register defined as stack pointer

it had one IRQ, and it only shadowed the return address and status register
and an FIQ (fast) that shadowed the return address, status register, and (afair)
seven general purpose registers

quite a bit code needed to find the interrupt source, stacking, etc. and even worse
if preemption was needed

Finding the source does not take much on SOCs with an interrupt
priority encoder; they all have one nowadays (off-core, it only
prioritizes which vector will be supplied to the core, which has
just one IRQ line).
Preemption from a hardware interrupt is a complex thing to do whatever
the scheme, unless the interrupt is guaranteed to be received only while
running at user level.
But true prioritized interrupts, 68k style, are a huge step forward ARM
have made; this does make a difference.
 
On Monday, January 16, 2023 at 22:56:54 UTC+1, Dimiter Popoff wrote:
On 1/16/2023 23:22, Lasse Langwadt Christensen wrote:
mandag den 16. januar 2023 kl. 21.53.01 UTC+1 skrev Dimiter Popoff:
On 1/16/2023 11:10, Don Y wrote:
On 1/15/2023 7:10 AM, Dimiter_Popoff wrote:
How many registers does it stack automatically? I knew the HLL nonsense
would catch up with CPU design eventually.

IIRC, it stacks PC, PSR, LR (i.e., R14) & R[0-3].
Lasse also mentioned it, I see and it makes sense; I did not realize
this was a \"small\" flavour of ARM, I am not familiar with any ARM.

now there is basically two types cortex-Mx which is a 32bit microcontroller
with increasing x features are added like DSP instructions single, and double, FPU

and the cortex-A which is a 32/64bit cpu used in cell phones etc.


PC and PSR must be preserved, of course (if you have a special shadow
register for each, that\'s just an optimization -- that only works if
ISRs can\'t be interrupted. Remember, you can ALWAYS throw an exception,
even in an ISR! E.g. the \"push\" can signal a page fault). The link
register (think \"BAL\" -- what\'s old is now new! :> ) determines *how*
the ISR terminates.
On power, you get the PC and the MSR saved in special purpose registers
assigned to do exactly that. You have to stack them yourself, even that.
You don\'t even have a stack pointer, it is up to you to assign one
of the GPR-s to that. The 603e core maximizes delegation to a huge
extent and this is very convenient when you know what you are doing.
Even when you get a TLB miss, all you get is an exception plus
r0-r3 being switched to a shadow bank, Z80 style, meant for you
to locate the entry in the page translation table and put it in the
TLB; if not in the table you do a page fault, go into allocation,
swapping if needed, fixing etc., you know how it goes. You don\'t need
to stack anything, the 4 regs you have are enough to do the TLB fix.


look similar to the older generation ARM, ARM7-TDMI

stack pointer was also just a GP register defined as stack pointer

it had one IRQ, and it only shadowed the return address and status register
and an FIQ (fast) that shadowed the return address, status register, and (afair)
seven general purpose registers

quite a bit code needed to find the interrupt source, stacking, etc. and even worse
if preemption was needed

Finding the source does not take much on SOCs with an interrupt
priority encoder, they all have that one nowadays (off-core, it does
only prioritize which vector will be supplied to the core which has
just one IRQ line).
Preemption from a hardware interrupt is a complex thing to do whatever
the scheme unless the interrupt is guaranteed to be received only while
running at user level.
But true prioritized interrupts 68k style is a huge step forward ARM
have made, this does make a difference.

yes, the cortex-M basically only has one IRQ like the old ARM, but a module called the Nested Vectored Interrupt Controller was added to the cortex-m that does all the decoding, vectoring, priority, stacking, and unstacking through a "back door"
 
On 1/17/2023 0:05, Lasse Langwadt Christensen wrote:
mandag den 16. januar 2023 kl. 22.56.54 UTC+1 skrev Dimiter Popoff:
On 1/16/2023 23:22, Lasse Langwadt Christensen wrote:
mandag den 16. januar 2023 kl. 21.53.01 UTC+1 skrev Dimiter Popoff:
On 1/16/2023 11:10, Don Y wrote:
On 1/15/2023 7:10 AM, Dimiter_Popoff wrote:
How many registers does it stack automatically? I knew the HLL nonsense
would catch up with CPU design eventually.

IIRC, it stacks PC, PSR, LR (i.e., R14) & R[0-3].
Lasse also mentioned it, I see and it makes sense; I did not realize
this was a \"small\" flavour of ARM, I am not familiar with any ARM.

now there is basically two types cortex-Mx which is a 32bit microcontroller
with increasing x features are added like DSP instructions single, and double, FPU

and the cortex-A which is a 32/64bit cpu used in cell phones etc.


PC and PSR must be preserved, of course (if you have a special shadow
register for each, that\'s just an optimization -- that only works if
ISRs can\'t be interrupted. Remember, you can ALWAYS throw an exception,
even in an ISR! E.g. the \"push\" can signal a page fault). The link
register (think \"BAL\" -- what\'s old is now new! :> ) determines *how*
the ISR terminates.
On power, you get the PC and the MSR saved in special purpose registers
assigned to do exactly that. You have to stack them yourself, even that.
You don\'t even have a stack pointer, it is up to you to assign one
of the GPR-s to that. The 603e core maximizes delegation to a huge
extent and this is very convenient when you know what you are doing.
Even when you get a TLB miss, all you get is an exception plus
r0-r3 being switched to a shadow bank, Z80 style, meant for you
to locate the entry in the page translation table and put it in the
TLB; if not in the table you do a page fault, go into allocation,
swapping if needed, fixing etc., you know how it goes. You don\'t need
to stack anything, the 4 regs you have are enough to do the TLB fix.


look similar to the older generation ARM, ARM7-TDMI

stack pointer was also just a GP register defined as stack pointer

it had one IRQ, and it only shadowed the return address and status register
and an FIQ (fast) that shadowed the return address, status register, and (afair)
seven general purpose registers

quite a bit code needed to find the interrupt source, stacking, etc. and even worse
if preemption was needed

Finding the source does not take much on SOCs with an interrupt
priority encoder, they all have that one nowadays (off-core, it does
only prioritize which vector will be supplied to the core which has
just one IRQ line).
Preemption from a hardware interrupt is a complex thing to do whatever
the scheme unless the interrupt is guaranteed to be received only while
running at user level.
But true prioritized interrupts 68k style is a huge step forward ARM
have made, this does make a difference.

yes, the cortex-M basically only have one IRQ like the old ARM,
but a module called the Nested Vectored Interrupt Controller was added
to the cortex-m that does all the decoding, vectoring, priority, stacking, unstacking thought a \"back door\"

Wait a second, this is not like 68k. If there is just one IRQ
line, the external controller cannot interrupt the core while it
deals with an interrupt and is still masked. Or is there something
more sophisticated to it?
 
On Monday, January 16, 2023 at 23:31:19 UTC+1, Dimiter Popoff wrote:
On 1/17/2023 0:05, Lasse Langwadt Christensen wrote:
mandag den 16. januar 2023 kl. 22.56.54 UTC+1 skrev Dimiter Popoff:
On 1/16/2023 23:22, Lasse Langwadt Christensen wrote:
mandag den 16. januar 2023 kl. 21.53.01 UTC+1 skrev Dimiter Popoff:
On 1/16/2023 11:10, Don Y wrote:
On 1/15/2023 7:10 AM, Dimiter_Popoff wrote:
How many registers does it stack automatically? I knew the HLL nonsense
would catch up with CPU design eventually.

IIRC, it stacks PC, PSR, LR (i.e., R14) & R[0-3].
Lasse also mentioned it, I see and it makes sense; I did not realize
this was a \"small\" flavour of ARM, I am not familiar with any ARM.

now there is basically two types cortex-Mx which is a 32bit microcontroller
with increasing x features are added like DSP instructions single, and double, FPU

and the cortex-A which is a 32/64bit cpu used in cell phones etc.


PC and PSR must be preserved, of course (if you have a special shadow
register for each, that\'s just an optimization -- that only works if
ISRs can\'t be interrupted. Remember, you can ALWAYS throw an exception,
even in an ISR! E.g. the \"push\" can signal a page fault). The link
register (think \"BAL\" -- what\'s old is now new! :> ) determines *how*
the ISR terminates.
On power, you get the PC and the MSR saved in special purpose registers
assigned to do exactly that. You have to stack them yourself, even that.
You don\'t even have a stack pointer, it is up to you to assign one
of the GPR-s to that. The 603e core maximizes delegation to a huge
extent and this is very convenient when you know what you are doing.
Even when you get a TLB miss, all you get is an exception plus
r0-r3 being switched to a shadow bank, Z80 style, meant for you
to locate the entry in the page translation table and put it in the
TLB; if not in the table you do a page fault, go into allocation,
swapping if needed, fixing etc., you know how it goes. You don\'t need
to stack anything, the 4 regs you have are enough to do the TLB fix.


look similar to the older generation ARM, ARM7-TDMI

stack pointer was also just a GP register defined as stack pointer

it had one IRQ, and it only shadowed the return address and status register
and an FIQ (fast) that shadowed the return address, status register, and (afair)
seven general purpose registers

quite a bit code needed to find the interrupt source, stacking, etc. and even worse
if preemption was needed

Finding the source does not take much on SOCs with an interrupt
priority encoder, they all have that one nowadays (off-core, it does
only prioritize which vector will be supplied to the core which has
just one IRQ line).
Preemption from a hardware interrupt is a complex thing to do whatever
the scheme unless the interrupt is guaranteed to be received only while
running at user level.
But true prioritized interrupts 68k style is a huge step forward ARM
have made, this does make a difference.

yes, the cortex-M basically only have one IRQ like the old ARM,
but a module called the Nested Vectored Interrupt Controller was added
to the cortex-m that does all the decoding, vectoring, priority, stacking, unstacking thought a \"back door\"
Wait a second, this is not like 68k. If there is just one IRQ
line the external controller cannot interrupt the core while it
deals with an interrupt and is still being masked. Or is there something
more sophisticated in that?

the core itself only has one interrupt signal, but the NVIC takes care of all the stuff that
makes it possible for higher priority interrupts to interrupt an already running interrupt

being "in" an interrupt only masks that priority and lower priorities
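A small sketch of that in practice (CMSIS names; the particular IRQs and the device header are assumptions): a numerically lower priority value is more urgent and will preempt a handler already running at a higher value.

#include "stm32f4xx.h"                      /* any CMSIS device header       */

void irq_setup(void)
{
    NVIC_SetPriority(TIM3_IRQn,   1);       /* urgent: can preempt the UART  */
    NVIC_SetPriority(USART2_IRQn, 5);       /* same/lower urgency must wait  */
    NVIC_EnableIRQ(TIM3_IRQn);
    NVIC_EnableIRQ(USART2_IRQn);
}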
 
On 1/17/2023 0:45, Lasse Langwadt Christensen wrote:
mandag den 16. januar 2023 kl. 23.31.19 UTC+1 skrev Dimiter Popoff:
On 1/17/2023 0:05, Lasse Langwadt Christensen wrote:
mandag den 16. januar 2023 kl. 22.56.54 UTC+1 skrev Dimiter Popoff:
On 1/16/2023 23:22, Lasse Langwadt Christensen wrote:
mandag den 16. januar 2023 kl. 21.53.01 UTC+1 skrev Dimiter Popoff:
On 1/16/2023 11:10, Don Y wrote:
On 1/15/2023 7:10 AM, Dimiter_Popoff wrote:
How many registers does it stack automatically? I knew the HLL nonsense
would catch up with CPU design eventually.

IIRC, it stacks PC, PSR, LR (i.e., R14) & R[0-3].
Lasse also mentioned it, I see and it makes sense; I did not realize
this was a \"small\" flavour of ARM, I am not familiar with any ARM.

now there is basically two types cortex-Mx which is a 32bit microcontroller
with increasing x features are added like DSP instructions single, and double, FPU

and the cortex-A which is a 32/64bit cpu used in cell phones etc.


PC and PSR must be preserved, of course (if you have a special shadow
register for each, that\'s just an optimization -- that only works if
ISRs can\'t be interrupted. Remember, you can ALWAYS throw an exception,
even in an ISR! E.g. the \"push\" can signal a page fault). The link
register (think \"BAL\" -- what\'s old is now new! :> ) determines *how*
the ISR terminates.
On power, you get the PC and the MSR saved in special purpose registers
assigned to do exactly that. You have to stack them yourself, even that.
You don\'t even have a stack pointer, it is up to you to assign one
of the GPR-s to that. The 603e core maximizes delegation to a huge
extent and this is very convenient when you know what you are doing.
Even when you get a TLB miss, all you get is an exception plus
r0-r3 being switched to a shadow bank, Z80 style, meant for you
to locate the entry in the page translation table and put it in the
TLB; if not in the table you do a page fault, go into allocation,
swapping if needed, fixing etc., you know how it goes. You don\'t need
to stack anything, the 4 regs you have are enough to do the TLB fix.


look similar to the older generation ARM, ARM7-TDMI

stack pointer was also just a GP register defined as stack pointer

it had one IRQ, and it only shadowed the return address and status register
and an FIQ (fast) that shadowed the return address, status register, and (afair)
seven general purpose registers

quite a bit code needed to find the interrupt source, stacking, etc. and even worse
if preemption was needed

Finding the source does not take much on SOCs with an interrupt
priority encoder, they all have that one nowadays (off-core, it does
only prioritize which vector will be supplied to the core which has
just one IRQ line).
Preemption from a hardware interrupt is a complex thing to do whatever
the scheme unless the interrupt is guaranteed to be received only while
running at user level.
But true prioritized interrupts 68k style is a huge step forward ARM
have made, this does make a difference.

yes, the cortex-M basically only have one IRQ like the old ARM,
but a module called the Nested Vectored Interrupt Controller was added
to the cortex-m that does all the decoding, vectoring, priority, stacking, unstacking thought a \"back door\"
Wait a second, this is not like 68k. If there is just one IRQ
line the external controller cannot interrupt the core while it
deals with an interrupt and is still being masked. Or is there something
more sophisticated in that?

the core it self only has one interrupt signal, but the NVIC takes care of all the stuff that
makes it possible for higher priority interrupts to interrupt an already running interrupt

being \"in\" an interrupt only mask that priority and lower priorities

I am not sure I get it, I have had such an interrupt controller on
the power cores I have used (first one on the mpc8240, late 90-s).

What happens to the core when the IRQ line from the interrupt
controller is asserted? On power, this would cause the core
to set the interrupt mask bit and go to a specific address,
roughly speaking. Then it is up to the core to clear *its*
mask bit in its MSR, no matter what the interrupt controller
does.

On 68k/coldfire it is different; the core has a 3-bit interrupt
mask in its status register and 7 interrupt wires.
If you assert, say, IRQ2 the core sets its mask to 2, which masks
IRQ2, and goes through the IRQ2 vector. Now if IRQ3 is asserted
it will be taken the same way immediately and the mask will be
set to 3; if IRQ1 comes it will have to wait until the core sets
its mask bits to 0. Level 7 IRQ is unmaskable (useful only for
debugging purposes, not in general code).

(Please feel free to ignore me if you don't feel like explaining
all that ARM stuff now, obviously I can dig it myself if I
really need it).
 
Monday, 16 January 2023 at 23.56.31 UTC+1, Dimiter Popoff wrote:
I am not sure I get it; I have had such an interrupt controller on
the power cores I have used (the first one on the MPC8240, late 90s).

What happens to the core when the IRQ line from the interrupt
controller is asserted? On power, this would cause the core
to set the interrupt mask bit and go to a specific address,
roughly speaking. Then it is up to the core to clear *its*
mask bit in its MSR, no matter what the interrupt controller
does.

On 68k/coldfire it is different; the core has a 3-bit interrupt
mask in its status register and 7 interrupt wires.
If you assert, say, IRQ2 the core sets its mask to 2, which masks
IRQ2, and goes through the IRQ2 vector. Now if IRQ3 is asserted
it will be taken the same way immediately and the mask will be
set to 3; if IRQ1 comes it will have to wait until the core sets
its mask bits to 0. Level 7 IRQ is unmaskable (useful only for
debugging purposes, not in general code).

(Please feel free to ignore me if you don't feel like explaining
all that ARM stuff now, obviously I can dig it myself if I
really need it).

it is basically the same as how you describe the 68K; it just isn't
inside the core, it is in a module that sits next to the core

all the different interrupts are connected to the NVIC and a single
interrupt goes from the NVIC to the core; each interrupt has a programmable priority level

when an unmasked interrupt is asserted the NVIC interrupts the core, masks the
same and lower priorities, "re-enables" interrupts and vectors to that ISR

the same happens if another interrupt that isn't masked, i.e. of higher priority, is asserted while an ISR is running

https://www.motioncontroltips.com/wp-content/uploads/2019/03/Nested-Interrupt-Diagram-Feature.jpg

main is interrupted by ISR1, ISR1 is interrupted by the higher priority ISR2
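In CMSIS terms the setup is just a couple of calls (sketch only: it assumes a
vendor device header, and the IRQ names and levels are made up). Lower number
means higher priority, so the timer below can preempt the UART handler but not
the other way around:

#include "device.h"   /* vendor CMSIS header: provides NVIC_* and the IRQn names */

void irq_setup(void)
{
    NVIC_SetPriority(UART0_IRQn, 3);   /* less urgent */
    NVIC_SetPriority(TIM2_IRQn,  1);   /* more urgent: may nest over UART0 */

    NVIC_EnableIRQ(UART0_IRQn);
    NVIC_EnableIRQ(TIM2_IRQn);
}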
 
On 1/16/2023 3:56 PM, Dimiter_Popoff wrote:
I am not sure I get it; I have had such an interrupt controller on
the power cores I have used (the first one on the MPC8240, late 90s).

What happens to the core when the IRQ line from the interrupt
controller is asserted? On power, this would cause the core
to set the interrupt mask bit and go to a specific address,
roughly speaking. Then it is up to the core to clear *its*
mask bit in its MSR, no matter what the interrupt controller
does.

There is one EXTERNAL interrupt signal available. All of the
INTERNAL (SoC) interrupt sources connect to the NVIC
directly. There are dozens of such sources. The NVIC makes it
possible (to some extent) to prioritize how "important"
(priority) each source is.

On 68k/coldfire it is different; the core has a 3-bit interrupt
mask in its status register and 7 interrupt wires.
If you assert, say, IRQ2 the core sets its mask to 2, which masks
IRQ2, and goes through the IRQ2 vector. Now if IRQ3 is asserted
it will be taken the same way immediately and the mask will be
set to 3; if IRQ1 comes it will have to wait until the core sets
its mask bits to 0. Level 7 IRQ is unmaskable (useful only for
debugging purposes, not in general code).

There are 256 interrupt vectors (addresses to which control should
be transferred, more-or-less), each of which can have a priority assigned.
Remember, *exceptions* are also treated the same (or, you
can view interrupts AS exceptions).

The NVIC looks at the priorities for all exceptions that
want to be serviced "now", and picks the one with the
highest priority (lower values represent higher priorities).

Thereafter (until the exception is "done"), only exceptions of
higher priority can interrupt the handling of the exception.

So, a bus fault can interrupt your "handling" of the
jiffy (also an exception); an NMI can interrupt the
handling of a timer interrupt (because timers come into
the NVIC as "external interrupts"), etc.

(Please feel free to ignore me if you don't feel like explaining
all that ARM stuff now, obviously I can dig it myself if I
really need it).
 
On 1/16/2023 1:52 PM, Dimiter_Popoff wrote:
I was somewhat surprised that ARM has the ability to truly prioritize
interrupts, 68k style. Both you and Lasse said that, this is an
important thing to have.

It's almost essential when you have as many *possible* IRQ sources
("exceptions") as these devices *can* support. You need <something>
to at least summarize the state of all (unmasked) requests that you
can quickly examine AT entry to the service routine -- otherwise
you'd be wasting time polling a bunch of status registers: "Are YOU
the source of this IRQ?"

You have to also remember that ARM is just IP. Vendors "assemble"
the IP into real devices. So, they have a library of "blocks"
to choose from. The "Timer Status Register" will be present in
a *timer* block -- which may or may not be present in a
particular design. Ditto PWM controller, etc.

You want a means of bringing together this (IRQ) information
from disjoint blocks to a place where they can all be
examined/processed. Instead of just giving you a 256-bit
status "word" representing the status of each IRQ source,
the NVIC also lets you arrange them into some order that
is appropriate for YOUR application.

And, because *it* ultimately signals the core to handle the
exception, *it* knows when an ISR is being executed -- and
WHY (priority).
 
On a sunny day (Mon, 16 Jan 2023 06:29:25 -0800) it happened John Larkin
<jlarkin@highlandSNIPMEtechnology.com> wrote in
<cenashpl5f7cdk37nq4dab7hcdlqknptqh@4ax.com>:


We wrote one Linux program that just toggled a test point as fast as
it could. That was interesting on a scope, namely the parts that
didn't toggle.

That way the task switch will interrupt it; any multitasker does that.
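For reference, such a toggler is just a few lines of mmap'ed GPIO on a Pi.
A sketch (offsets are the usual BCM283x-style GPSET0/GPCLR0 ones -- check them
against the datasheet for your model; pin number arbitrary, and the pin must
already be configured as an output):

#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>

#define GPSET0 (0x1c / 4)     /* set-pin register, word offset   */
#define GPCLR0 (0x28 / 4)     /* clear-pin register, word offset */
#define PIN    17             /* arbitrary test point            */

int main(void)
{
    int fd = open("/dev/gpiomem", O_RDWR | O_SYNC);
    if (fd < 0)
        return 1;

    volatile uint32_t *gpio = mmap(0, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (gpio == MAP_FAILED)
        return 1;

    /* Hammer set/clear forever; the gaps you see on the scope are where
       the scheduler takes the CPU away. */
    for (;;) {
        gpio[GPSET0] = 1u << PIN;
        gpio[GPCLR0] = 1u << PIN;
    }
}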

It all depends; look at
'using raspberry Pi as FM transmitter' (80 to 100 MHz or so):
https://linuxhint.com/turn-raspberry-pi-fm-transmitter/

That code gave me the following idea,
freq_pi:
http://panteltje.com/panteltje/newsflex/download.html#freq_pi

and that was for a very old Pi model;
somebody then ported it to a later model, no idea how fast you can go on a Pi4.

So, basically, you mentioned:
"will give it (Pi4) to some programmer"

Now let me tell you this!!!

STUDY THE HARDWARE.
That Pi4 has some quite complex, very powerful hardware on board.
Using that will beat any 'just code a loop in C or asm or whatever' or interrupt stuff.
The same goes for most micros, most have extra hardware on board.
What we see now is programmers who have no clue about hardware
building libraries that are bloated and slow,
which are then used by others who have even less knowledge of the hardware
AND of programming and user interfaces, to make modern bloat like microsore and jmail email crap.

So my advice to you: study the hardware of that Pi4 you bought for all those dollars.
Most of the time there is no need to learn ARM asm (and all its registers); just make use of the hardware that is there.

'A programmer' likely will not understand the hardware and will resort to bloat where it is not needed.
 
