highest frequency periodic interrupt?...

On Sunday, January 15, 2023 at 15:10:17 UTC+1, Dimiter Popoff wrote:
On 1/15/2023 12:48, Lasse Langwadt Christensen wrote:
On Sunday, January 15, 2023 at 06:10:24 UTC+1, upsid...@downunder.com wrote:
On Sat, 14 Jan 2023 04:47:22 GMT, Jan Panteltje
pNaonSt...@yahoo.com> wrote:

On a sunny day (Fri, 13 Jan 2023 15:46:16 -0800) it happened John Larkin
jla...@highlandSNIPMEtechnology.com> wrote in
q5p3shh8f34tt34ka...@4ax.com>:

What's the fastest periodic IRQ that you have ever run?

We have one board with 12 isolated LPC1758 ARMs. Each gets interrupted
by its on-chip ADC at 100 KHz and does a bunch of filtering and runs a
PID loop, which outputs to the on-chip DAC. We cranked the CPU clock
down some to save power, so the ISR runs for about 7 usec max.

I ask because if I use a Pi Pico on some new projects, it has a
dual-core 133 MHz CPU, and one core may have enough compute power that
we wouldn't need an FPGA in a lot of cases. Might even do DDS in
software.

RP2040 floating point is tempting but probably too slow for control
use. Things seem to take 50 or maybe 100 us. Back to scaled integers,
I guess.

I was also thinking that we could make a 2 or 3-bit DAC with a few
resistors. The IRQ could load that at various places and a scope would
trace execution. That would look cool. On the 1758 thing we brought
out a single bit to a test point and raised that during the ISR so we
could see ISR execution time on a scope. My C guy didn't believe that
a useful ISR could run at 100K and had no idea what execution time
might be.

Well in that sort of thing you need to think in asm, instruction times,
but I have no experience with the RP2040, and little with ASM on ARM.
Should be simple to test how long the C code takes, do you have an RP2040?
Playing with one would be a good starting point.
Should I get one? Was thinking just for fun...
In the past coding ISRs in assembly was the way to go, but the
complexity of current processors (cache, pipelining) makes it hard to
beat a _good_ compiler.

The main principle still is to minimize the number of registers saved
at interrupt entry (and restored at exit). On a primitive processor only
the processor status word and program counter need to be saved (and
restored). Additional registers may need to be saved (and restored) if
the ISR uses them.

If the processor has separate FP registers and/or separate FP status
words, avoid using FP registers in ISRs.

Some compilers may have "interrupt" keywords or similar extensions and
the compiler knows which registers need to be saved in the ISR. To
help the compiler, include all functions that are called by the ISR in
the same module (preferably in-lined) prior to the ISR, so that the
compiler knows what needs to be saved. Do not call external library
routines from an ISR, since the compiler doesn't know which registers
need to be saved and so saves them all.

Cortex-M automatically stacks the registers needed to call a regular C
function, and if it has an FPU it supports "lazy stacking", which means it
keeps track of whether the FPU is used and only stacks/un-stacks the FPU
registers when they are actually used.

It also knows that if another interrupt is pending at ISR exit it doesn't
need to un-stack/re-stack before calling the other interrupt.

How many registers does it stack automatically?

Eight, and usually in parallel with the fetch of the ISR address and first
instructions, so the total overhead is 12 cycles from the interrupt until
the first instruction of the ISR is executed.

I knew the HLL nonsense
would catch up with CPU design eventually. Good CPU design still means
load/store machines, stacking *nothing* at IRQ, just saving PC and CCR
to special purpose regs which can be stacked as needed by the IRQ
routine, along with registers to be used in it.

\"good\" depending of what your objective is, automatic stacking save code space
and the time it takes to fetch that code

Memory accesses are
the bottleneck, and with HLL code being bloated as it is chances
are some cache will have to be flushed to make room for stacking.
Some *really* well designed for control applications processors allow
you to lock a part of the cache but I doubt ARM have that, they seem to
have gone the way "make programming a two click job" to target a
wider audience.

We are talking Cortex-M with no real caches. The Pico has a special cache to
run code directly from slow external serial flash at reasonable speed, but you
can tell the compiler to copy a function to RAM and keep it there.
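
A minimal sketch of that with the pico-sdk (the __not_in_flash_func() placement
macro comes from the SDK headers; the filter body here is just a placeholder,
not anyone's real loop):

#include "pico/stdlib.h"

/* Keep the hot function in SRAM so it never stalls on the XIP flash
   cache; __not_in_flash_func() is the pico-sdk macro for that. */
static int32_t __not_in_flash_func(filter_step)(int32_t x)
{
    static int32_t acc;                      /* one-pole low-pass as a stand-in */
    acc += (x - acc) >> 4;
    return acc;
}

int main(void)
{
    stdio_init_all();
    for (;;) {
        volatile int32_t y = filter_step(123);   /* placeholder input */
        (void)y;
    }
}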
 
On Sunday, January 15, 2023 at 15:29:08 UTC+1, Dimiter Popoff wrote:
On 1/14/2023 1:46, John Larkin wrote:
What's the fastest periodic IRQ that you have ever run?

We have one board with 12 isolated LPC1758 ARMs. Each gets interrupted
by its on-chip ADC at 100 KHz and does a bunch of filtering and runs a
PID loop, which outputs to the on-chip DAC. We cranked the CPU clock
down some to save power, so the ISR runs for about 7 usec max.

I ask because if I use a Pi Pico on some new projects, it has a
dual-core 133 MHz CPU, and one core may have enough compute power that
we wouldn't need an FPGA in a lot of cases. Might even do DDS in
software.

RP2040 floating point is tempting but probably too slow for control
use. Things seem to take 50 or maybe 100 us. Back to scaled integers,
I guess.

I was also thinking that we could make a 2 or 3-bit DAC with a few
resistors. The IRQ could load that at various places and a scope would
trace execution. That would look cool. On the 1758 thing we brought
out a single bit to a test point and raised that during the ISR so we
could see ISR execution time on a scope. My C guy didn't believe that
a useful ISR could run at 100K and had no idea what execution time
might be.

10 us for a 100+ MHz CPU should be doable; I don't know about ARM
though, they keep on surprising me with this or that nonsense. (never
used one, just by chance stumbling on that sort of thing).
What you might need to consider is that on modern day CPUs you
don't have the nice prioritized IRQ scheme you must be used to from
the CPU32; once in an interrupt you are just masked for all interrupts,
they have some priority resolver which only resolves which interrupt
will come next *after* you get unmasked.

Not on a Cortex-M. The interrupt controller has a programmable priority for
each interrupt. Higher-priority interrupts preempt lower-priority interrupts.
Another sub-priority determines which interrupt runs first if two or more
interrupts of the same priority are pending.
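
A minimal sketch of that setup using the pico-sdk IRQ wrappers (the IRQ numbers
and PICO_*_IRQ_PRIORITY constants come from the SDK headers; which interrupt
deserves which priority is, of course, application-dependent):

#include "hardware/irq.h"

static void control_isr(void) { /* fast PID work goes here */ }
static void console_isr(void) { /* slow UART housekeeping   */ }

/* Lower hardware priority value = more urgent on Cortex-M. */
static void setup_irq_priorities(void)
{
    irq_set_exclusive_handler(TIMER_IRQ_0, control_isr);
    irq_set_exclusive_handler(UART0_IRQ,   console_isr);
    irq_set_priority(TIMER_IRQ_0, PICO_HIGHEST_IRQ_PRIORITY); /* preempts...  */
    irq_set_priority(UART0_IRQ,   PICO_LOWEST_IRQ_PRIORITY);  /* ...this one  */
    irq_set_enabled(TIMER_IRQ_0, true);
    irq_set_enabled(UART0_IRQ,   true);
}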
 
On Sun, 15 Jan 2023 13:53:12 -0800 (PST), Lasse Langwadt Christensen
<langwadt@fonz.dk> wrote:

<snip>

We are talking Cortex-M with no real caches. The Pico has a special cache to
run code directly from slow external serial flash at reasonable speed, but you
can tell the compiler to copy a function to RAM and keep it there.

That's the thing to do: run the fast control loop on one CPU in RAM
and let the other CPU do the slow stuff and thrash cache.
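
A bare-bones sketch of that split with the pico-sdk (multicore_launch_core1()
and __not_in_flash_func() are SDK names; the PI arithmetic is only a
placeholder):

#include "pico/stdlib.h"
#include "pico/multicore.h"

/* Core 1: tight control loop kept in SRAM, never touching flash. */
static void __not_in_flash_func(core1_control_loop)(void)
{
    int32_t integ = 0;
    for (;;) {
        int32_t error = 0;                      /* placeholder: ADC reading minus setpoint */
        integ += error;
        int32_t out = (3 * error + integ) >> 4; /* toy PI law in scaled integers */
        (void)out;                              /* placeholder: write DAC/PWM here */
    }
}

int main(void)
{
    stdio_init_all();
    multicore_launch_core1(core1_control_loop); /* fast loop on core 1 */
    for (;;)
        tight_loop_contents();                  /* core 0: comms, housekeeping */
}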
 
On Sun, 15 Jan 2023 16:13:21 -0500, Joe Gwinn <joegwinn@comcast.net>
wrote:

On Sun, 15 Jan 2023 11:16:45 -0800, John Larkin
jlarkin@highlandSNIPMEtechnology.com> wrote:

On Sun, 15 Jan 2023 12:16:36 -0500, Joe Gwinn <joegwinn@comcast.net
wrote:

On Sun, 15 Jan 2023 08:00:39 -0800, John Larkin
jlarkin@highlandSNIPMEtechnology.com> wrote:

On Sun, 15 Jan 2023 04:39:22 GMT, Jan Panteltje
pNaonStpealmtje@yahoo.com> wrote:

On a sunny day (Sat, 14 Jan 2023 10:21:59 -0800) it happened John Larkin
jlarkin@highlandSNIPMEtechnology.com> wrote in
epr5sh59k5q62qkapubhkfk8ubf9r0vnng@4ax.com>:

On Sat, 14 Jan 2023 15:52:49 +0000, Martin Brown
'''newspam'''@nonad.co.uk> wrote:

On 13/01/2023 23:46, John Larkin wrote:
What's the fastest periodic IRQ that you have ever run?

Usually try to avoid having fast periodic IRQs in favour of offloading
them onto some dedicated hardware. But CPUs were slower then than now.

We have one board with 12 isolated LPC1758 ARMs. Each gets interrupted
by its on-chip ADC at 100 KHz and does a bunch of filtering and runs a
PID loop, which outputs to the on-chip DAC. We cranked the CPU clock
down some to save power, so the ISR runs for about 7 usec max.

I ask because if I use a Pi Pico on some new projects, it has a
dual-core 133 MHz CPU, and one core may have enough compute power that
we wouldn't need an FPGA in a lot of cases. Might even do DDS in
software.

RP2040 floating point is tempting but probably too slow for control
use. Things seem to take 50 or maybe 100 us. Back to scaled integers,
I guess.

It might be worth benchmarking how fast the FPU really is on that device
(for representative sample code). The Intel i5 & i7 can do all except
divide in a single cycle these days - I don't know what Arm is like in
this respect. You get some +*- for free close to every divide too.

The RP2040 chip has FP routines in the ROM, apparently code with some
sort of hardware assist, but they're callable subroutines and not native
instructions to a hardware FP engine. When it returns it's done.

Various web sites seem to confuse microseconds and nanoseconds. 150 us
does seem slow for a "fast" fp operation. We'll have to do
experiments.

I wrote one math package for the 68K, with the format signed 32.32.
That behaved just like floating point in real life, but was small and
fast and avoided drecky scaled integers.


*BIG* time penalty for having two divides or branches too close
together. Worth playing around to find patterns the CPU does well.

Without true hardware FP, call locations probably don't matter.


Beware that what you measure gets controlled, but for polynomials up to 5
terms or rationals up to about 5,2, call overhead may dominate the
execution time (particularly if the stupid compiler puts a 16-byte
structure across a cache boundary on the stack).

We occasionally use polynomials, but 2nd order and rarely 3rd is
enough to get analog i/o close enough.


Forcing inlining of small code sections can help. Do it to excess and it
will slow things down - there is a sweet spot. Loop unrolling is much
less useful these days now that branch prediction is so good.

I was also thinking that we could make a 2 or 3-bit DAC with a few
resistors. The IRQ could load that at various places and a scope would
trace execution. That would look cool. On the 1758 thing we brought
out a single bit to a test point and raised that during the ISR so we
could see ISR execution time on a scope. My C guy didn't believe that
a useful ISR could run at 100K and had no idea what execution time
might be.

ISR code is generally very short and best done in assembler if you want
it as quick as possible. Examining the code generation of GCC is
worthwhile since it sucks compared to Intel (better) and MS (best).

In my tests GCC is between 30% and 3x slower than Intel or MS for C/C++
when generating Intel CPU specific SIMD code with maximum optimisation.

MS compiler still does pretty stupid things like internal compiler
generated SIMD objects of 128, 256 or 512 bits (16, 32 or 64 bytes) and
having them cross a cache line boundary.

Nobody has answered my question. Generalizations about software timing
abound but hard numbers are rare. Programmers don't seem to use
oscilloscopes much.

That is silly
http://panteltje.com/panteltje/pic/scope_pic/index.html

Try reading the asm, it is well commented.
:)

And if you are talking Linux or other multi-taskers there is a lot more involved.

I was thinking about doing closed-loop control, switching power
supplies and dummy loads and such, using one core of an RP2040 instead
of an FPGA. That would be coded hard-metal, no OS or RTOS.

I guess I don\'t really need interrupts. I could run a single
persistent loop that waits on a timer until it's time to compute
again, to run at for instance 100 KHz. If execution time is reasonably
constant, it could just loop as fast as it can; even simpler. I like
that one.

This is a very common approach, being pioneered by Bell Labs when
designing the first digital telephone switch, the 1ESS:

<https://en.wikipedia.org/wiki/Number_One_Electronic_Switching_System>

The approach endures in such things as missile autopilots, but always
with some way to gracefully handle when the control code occasionally
runs too long and isn't done in time for the next frame to start.

I was thinking of an endless loop that just runs compute bound as hard
as it can. The \"next frame\" is the top of the loop. The control loop
time base is whatever the average loop execution time is.

As you say, no interrupt overhead.

To be more specific, the frames effectively run at interrupt priority,
triggered by a timer interrupt, but we also run various background
tasks at user level utilizing whatever CPU is left over, if any. The
sample rate is set by controller dynamics, and going faster does not
help. Especially if FFTs are being performed over a moving window of
samples.

Joe Gwinn

No, just one control loop running full-blast on one of the CPUs,
running in SRAM, and no interrupts.

I don't think a power supply needs FFTs. Maybe a little lowpass
filtering, but that's just a few lines of code. Or one line.

The actual control loop might be a page of code.
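
That one line, in scaled integers, might look something like the sketch below
(the shift constant is an arbitrary choice; with k = 4 at a 100 kHz update rate
the corner lands around 1 kHz):

#include <stdint.h>

/* One-pole low-pass in scaled integers: y += (x - y) / 2^k.
   Relies on arithmetic right shift of a signed value, which ARM
   compilers provide in practice. */
static inline int32_t lowpass_step(int32_t x, int32_t *y, unsigned k)
{
    *y += (x - *y) >> k;
    return *y;
}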
 
On Sun, 15 Jan 2023 14:33:50 -0800, John Larkin
<jlarkin@highlandSNIPMEtechnology.com> wrote:

<snip>

No, just one control loop running full-blast on one of the CPUs,
running in SRAM, and no interrupts.

I don't think a power supply needs FFTs. Maybe a little lowpass
filtering, but that's just a few lines of code. Or one line.

Probably so. I was thinking radars and missile autopilots.

Generally, the FFTs (or anything lengthy) are not done at interrupt
level. The interrupt code grabs and stores the data in ram, sets a
flag to release the user level code doing the signal processing, and
then exits the interrupt. Whereupon the user level code commences
running the signal processing code. Otherwise, the system could not
respond to important but rare interrupts.


>The actual control loop might be a page of code.

Could be. What I've seen the power-supply folk do is to use SPICE to
tweak the PS's control law, which is generally implemented in a FIR
filter. IIR filters are feared because they can become unstable,
especially in the somewhat wild environment of a power supply.

Joe Gwinn.
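
For reference, a small fixed-point FIR of the kind such a control law might use
could be sketched as below (the tap count and Q15 coefficients are placeholders,
not a designed filter):

#include <stdint.h>

#define NTAPS 5

static const int16_t coeff[NTAPS] = { 3277, 6554, 13107, 6554, 3277 }; /* placeholder Q15 taps */
static int16_t hist[NTAPS];                                            /* input history, newest first */

/* One FIR step per sample: shift in the new input, return the output. */
static int16_t fir_step(int16_t x)
{
    int32_t acc = 0;
    for (int i = NTAPS - 1; i > 0; --i)
        hist[i] = hist[i - 1];
    hist[0] = x;
    for (int i = 0; i < NTAPS; ++i)
        acc += (int32_t)coeff[i] * hist[i];
    return (int16_t)(acc >> 15);   /* back to Q15 */
}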
 
John Larkin <jlarkin@highlandsnipmetechnology.com> wrote:
What's the fastest periodic IRQ that you have ever run?

We have one board with 12 isolated LPC1758 ARMs. Each gets interrupted
by its on-chip ADC at 100 KHz and does a bunch of filtering and runs a
PID loop, which outputs to the on-chip DAC. We cranked the CPU clock
down some to save power, so the ISR runs for about 7 usec max.

I ask because if I use a Pi Pico on some new projects, it has a
dual-core 133 MHz CPU, and one core may have enough compute power that
we wouldn't need an FPGA in a lot of cases. Might even do DDS in
software.

RP2040 floating point is tempting but probably too slow for control
use. Things seem to take 50 or maybe 100 us. Back to scaled integers,
I guess.

I was also thinking that we could make a 2 or 3-bit DAC with a few
resistors. The IRQ could load that at various places and a scope would
trace execution. That would look cool. On the 1758 thing we brought
out a single bit to a test point and raised that during the ISR so we
could see ISR execution time on a scope. My C guy didn't believe that
a useful ISR could run at 100K and had no idea what execution time
might be.

Not exactly periodic, but I did 2 Mb/s interrupt-driven bi-directional
serial communication. That is about 5 us between characters, and
there were 2 interrupts per character (one to receive, the other
to transmit the answer). In other words, about a 400 kHz interrupt rate.
That was on an STM32F103 running at 72 MHz (that is Cortex-M3). I also
tried 3 Mb/s, but apparently that was too much for the USB bus in the PC
(standard 12 Mb/s port).

Concerning interrupt overhead, for an STM32F030 running code from
RAM the overhead seems to be between 26 and 28 clocks. More precisely,
I had a very simple interrupt handler that just increments a variable
(a millisecond counter). The "work" part of the interrupt handler
should execute in 7 clocks. When I timed a busy loop, the interrupt
increased the execution time of the loop by 33-35 clocks. That
agrees reasonably well with the cycle counts for Cortex-M0 published
in ARM forums: 16 clocks to enter the interrupt handler and 12
clocks to get back to the main program. The processor in the Pi Pico is
a Cortex-M0+, which is supposed to take 15 clocks to enter the interrupt
handler, so you can expect 1 clock less overhead than for the Cortex-M0.
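
A sketch of that kind of measurement (CMSIS SysTick names; the cycle numbers
above are the measurement itself, not something this code guarantees):

#include <stdint.h>
#include "stm32f0xx.h"        /* CMSIS device header, assumed available */

volatile uint32_t ms_ticks;   /* the entire "work" of the handler */

void SysTick_Handler(void)
{
    ms_ticks++;
}

/* Time this with the tick interrupt masked and then unmasked; the
   difference divided by the number of interrupts taken gives the
   per-interrupt entry/exit cost. */
uint32_t busy_loop(uint32_t n)
{
    uint32_t x = 0;
    while (n--)
        x += n;               /* fixed, data-independent work */
    return x;
}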

Concerning useful procedures, there are a lot of things which can
slow down the code. For example, a read-modify-write cycle on an I/O
port is likely to insert some extra wait states. Most MCUs
execute code from flash, and usually flash cannot run at max
CPU speed, so there are extra wait states. For example, a Cortex-M4
running from one RAM bank and having its stack in a separate RAM bank
can do an interrupt like the above in 27-28 clocks, so the overhead is
probably 20-21 cycles (I write probably because the Cortex-M4 has complex
rules concerning instruction times, so I am not sure the interrupt
handler takes 7 clocks). But a different configuration can bring the
time up to 42-48 clocks. A Cortex-M3 (which should have very similar
times to the Cortex-M4) running from flash with 0 wait states (8 MHz
clock) needs 24 clocks to execute the interrupt handler, but with
2 wait states (needed to run at 72 MHz) needs 29 to 31 clocks,
and more when there are more wait states.

The RP2040 in the Pi Pico normally runs from RAM, so it should be free
from slowdown due to flash. But with two cores and several
DMA channels there may be bus contention. Still, interrupt
rates on the order of 1M/s should not be a problem.

--
Waldek Hebisch
 
On Sun, 15 Jan 2023 18:21:12 -0500, Joe Gwinn <joegwinn@comcast.net>
wrote:

On Sun, 15 Jan 2023 14:33:50 -0800, John Larkin
jlarkin@highlandSNIPMEtechnology.com> wrote:

<snip>

No, just one control loop running full-blast on one of the CPUs,
running in SRAM, and no interrupts.

I don't think a power supply needs FFTs. Maybe a little lowpass
filtering, but that's just a few lines of code. Or one line.

Probably so. I was thinking radars and missile autopilots.

Generally, the FFTs (or anything lengthy) are not done at interrupt
level. The interrupt code grabs and stores the data in ram, sets a
flag to release the user level code doing the signal processing, and
then exits the interrupt. Whereupon the user level code commences
running the signal processing code. Otherwise, the system could not
respond to important but rare interrupts.


The actual control loop might be a page of code.

Could be. What I've seen the power-supply folk do is to use SPICE to
tweak the PS's control law, which is generally implemented in a FIR
filter. IIR filters are feared because they can become unstable,
especially in the somewhat wild environment of a power supply.

Joe Gwinn.

A proportional+integral error amplifier is all that most power
supplies need. That is easily Spiced and then easily turned into a few
lines of code.

An integrator is of course IIR. A FIR filter has a finite gain hence
some DC error.

I'm not afraid of integrators!
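
Those few lines might be sketched as below (a minimal scaled-integer PI update;
the gains, Q-format and 12-bit output clamp are placeholders, not a tuned
design):

#include <stdint.h>

typedef struct {
    int32_t kp;      /* proportional gain, Q8 */
    int32_t ki;      /* integral gain, Q8     */
    int32_t integ;   /* integrator state      */
} pi_t;

/* Called once per sample. */
static int32_t pi_update(pi_t *pi, int32_t setpoint, int32_t measured)
{
    int32_t error = setpoint - measured;

    pi->integ += pi->ki * error;                     /* the integrator is IIR by nature */
    int32_t out = (pi->kp * error + pi->integ) >> 8;

    if (out > 4095) { out = 4095; pi->integ -= pi->ki * error; }  /* crude anti-windup */
    if (out < 0)    { out = 0;    pi->integ -= pi->ki * error; }
    return out;
}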
 
On Sun, 15 Jan 2023 23:37:38 -0000 (UTC), antispam@math.uni.wroc.pl
wrote:

<snip>

The RP2040 in the Pi Pico normally runs from RAM, so it should be free
from slowdown due to flash. But with two cores and several
DMA channels there may be bus contention. Still, interrupt
rates on the order of 1M/s should not be a problem.

Sounds like roughly 200 ns of overhead, interrupt entry and exit, on
the 133 MHz Pico. That's not bad for a 100 KHz interrupt.

I probably don't even need 100 KHz for a power supply control loop. A
1 ms step response would be fine.

It would be fun to do a DDS in software, for an AC supply.
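
A software DDS is little more than a phase accumulator and a sine lookup; a
minimal sketch (the table size, amplitude scaling and init-time use of sin()
are arbitrary choices):

#include <stdint.h>
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define TABLE_BITS 10
#define TABLE_SIZE (1u << TABLE_BITS)

static int16_t  sine_table[TABLE_SIZE];
static uint32_t phase;        /* 32-bit phase accumulator */
static uint32_t phase_step;   /* tuning word */

void dds_init(double f_out, double f_sample)
{
    for (unsigned i = 0; i < TABLE_SIZE; i++)
        sine_table[i] = (int16_t)(2047.0 * sin(2.0 * M_PI * i / TABLE_SIZE));
    phase_step = (uint32_t)(f_out / f_sample * 4294967296.0);   /* f_out/f_s * 2^32 */
}

/* Call once per sample period, e.g. from the 100 kHz loop; feed the DAC. */
int16_t dds_step(void)
{
    phase += phase_step;
    return sine_table[phase >> (32 - TABLE_BITS)];
}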
 
On Monday, January 16, 2023 at 1:26:03 PM UTC+11, John Larkin wrote:
On Sun, 15 Jan 2023 23:37:38 -0000 (UTC), anti...@math.uni.wroc.pl
wrote:
John Larkin <jla...@highlandsnipmetechnology.com> wrote:

<snip>

The RP2040 in the Pi Pico normally runs from RAM, so it should be free
from slowdown due to flash. But with two cores and several
DMA channels there may be bus contention. Still, interrupt
rates on the order of 1M/s should not be a problem.

Sounds like roughly 200 ns of overhead, interrupt entry and exit, on
the 133 MHz Pico. That's not bad for a 100 KHz interrupt.

I probably don't even need 100 KHz for a power supply control loop. A
1 ms step response would be fine.

It would be fun to do a DDS in software, for an AC supply.

Sort of off the point though. An AC supply has to deliver a sine-wave voltage while looking like a low-impedance source to the load.

You can use software to calculate what that voltage ought to be (which is what direct digital synthesis is all about), but the switching arrangements that connect a more or less stable DC voltage source to the load and deliver the desired voltage, and the currents required to sustain that voltage, might require quite a lot of fast processing capacity to create the desired effect without wasting a lot of power in the process.

It wouldn't look much like a regular DDS source.

--
Bill Sloman, Sydney
 
On 1/15/2023 7:29 AM, Dimiter_Popoff wrote:
> 10 us for a 100+ MHz CPU should be doable;

I was doing 6us on a 386 running with I/O on the ISA bus
(where I/O accesses were dreadfully slow!)

I don\'t know about ARM
though, they keep on surprising me with this or that nonsense. (never
used one, just by chance stumbling on that sort of thing).
What you might need to consider is that on modern day CPUs you
don't have the nice prioritized IRQ scheme you must be used to from
the CPU32; once in an interrupt you are just masked for all interrupts,
they have some priority resolver which only resolves which interrupt
will come next *after* you get unmasked. Some I have used have a

No, there is an NVIC (nested vectored interrupt controller) that
allows the software to decide what priorities to assign to each
interrupt source. It then decides if it can preempt a "lower"
priority interrupt being serviced at the present time, returning
to that ISR after the higher priority ISR executes.

This is almost a requirement as many of the ARMs have *scores*
(50+) of onboard interrupt sources; you wouldn't want to have to
poll them *in* an ISR to decide if you wanted to reenable the
interrupt. (Of course, if you map two sources to the same priority
level, then you will have to resolve the conflict at run-time.)

And, of course, some sources have fixed, immutable priorities.

You can also configure the NVIC to wake the processor (if in a
"sleep" mode) when one of these events is detected. And, to
return to sleep after servicing the IRQ.

[And, of course, you can wait FOR an interrupt...]

The bigger problem is knowing, at design time, how these IRQs
can stack up as time spent in the (nested/cascaded) ISRs is
time that isn\'t spent running in \"Thread Mode\".

[I designed a product that could spend 100.0% of its time in
an ISR, in response to (unconstrainable) user actions. But,
when the user eventually got tired of trying to f*ck the
system up, it would silently recover where it had left off.]

second, higher priority IRQ (like the 6809 FIRQ) but on the core I have
used they differ from the 6809-s FIRQ in that the errata sheet says
they don\'t work.
On load/store machines latency should be less of an issue for the
jitter you will get as long as you don't do division in your code to
be interrupted.

Latency matters to the current ISR but is defined/influenced
by factors OUTSIDE that ISR. E.g., if a higher priority ISR
is executing (or, the processor is in a critical region), then
servicing THIS IRQ will be delayed. This delay may not be predictable.

Make sure you look into the FPU you'd consider deep enough, none
will get you your 32.32 bit accuracy. 64 bit FP numbers have a 52 or
so (can't remember exactly now) mantissa, the rest goes on the
exponent. I have found 32 bit FP numbers convenient to store some
constants (on the core I use the load is 1 cycle, expanding
automatically to 64 bit), did not find any other use for those.

Finally, to give you some numbers :). Back during the 80-s I wrote
a floppy disk controller for the 765 on a 1 MHz 6809. It had about
10 us per byte IIRC; doing IRQ was out of question. But the 6809
had a "sync" opcode, if IRQs were masked it would stop and wait
for an IRQ; and would just resume execution once the line was pulled.
This worked for the fastest of floppies (5" HD), so perhaps you
can use a 6809 :D. (I may have one or two somewhere here, 2 MHz
ones at that - in DIP40....).

A lot depends on what you have to *do* in the ISR. If all
you are doing is grabbing a value from an I/O port and
stuffing it in a FIFO, then you have very little code to
execute *in* the ISR and your ISR can be really short and
responsive. (My MT driver was like that; read the available
byte from the i/f, stuff it in a FIFO and done -- 160KHz.
It would be silly to try to do the ECC in the ISR; just let
the background process handle that at its leisure -- and,
repeat the operation, if needed -- e.g., READ REVERSE or
BACKSPACE, READ FORWARD, depending on the next scheduled
transport operation in the queue)
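
That "grab and stuff" pattern is only a handful of instructions; a generic
sketch (the device register address is a made-up placeholder and the FIFO is
single-producer/single-consumer):

#include <stdint.h>
#include <stdbool.h>

#define FIFO_SIZE 256                                  /* power of two: index wrap is free */
#define DEV_DATA  (*(volatile uint8_t *)0x40001000u)   /* hypothetical data register */

static volatile uint8_t  fifo[FIFO_SIZE];
static volatile uint32_t head, tail;      /* ISR writes head, background reads tail */

/* ISR: grab the byte, stuff it in the FIFO, get out. */
void device_rx_isr(void)
{
    fifo[head & (FIFO_SIZE - 1)] = DEV_DATA;
    head++;
}

/* Background task drains the FIFO at its leisure (ECC, parsing, ...). */
bool fifo_pop(uint8_t *out)
{
    if (tail == head)
        return false;
    *out = fifo[tail & (FIFO_SIZE - 1)];
    tail++;
    return true;
}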
 
On 1/15/2023 7:10 AM, Dimiter_Popoff wrote:
How many registers does it stack automatically? I knew the HLL nonsense
would catch up with CPU design eventually.

IIRC, it stacks xPSR, PC, LR (i.e., R14), R12 & R[0-3].

PC and PSR must be preserved, of course (if you have a special shadow
register for each, that's just an optimization -- that only works if
ISRs can't be interrupted. Remember, you can ALWAYS throw an exception,
even in an ISR! E.g. the "push" can signal a page fault). The link
register (think "BAL" -- what's old is now new! :> ) determines *how*
the ISR terminates.

So, the \"overhead\" that the processor assumes for an ISR is really just
4 (of the 12) general purpose registers. When you consider how much
\"state\" the processor holds, this isn\'t really excessive. A routine
coded in a HLL would likely be using ALL of the registers (though use
of an \"interrupt\" keyword could give it a hint to only use those that
it knows are already preserved; this would vary based on the targeted
processor. And, the compiler could dynamically decide whether
adding code to protect some additional register(s) will offset the
performance gains possible by using those extra registers *in* the
ISR code... that *it* is generating!

The bigger concern (for me) is worrying about which buses I'm
calling on during the execution of the ISR and what other cores
might be doing ON those busses, at the same time (e.g., if I'm
accessing a particular I/O to query/set a GPIO, is another core
accessing that same I/O -- for a different GPIO?) You never had
to worry about this in single-core architectures (excepting for
the presence of another "bus master")

Good CPU design still means
load/store machines, stacking *nothing* at IRQ, just saving PC and CCR
to special purpose regs which can be stacked as needed by the IRQ

What do you do if you throw an exception BEFORE (or while!) doing
that stacking? Does the CPU panic? :> (e.g., a double fault on
a 68k!)

[Remember, the exception handling/trap interface uses the same
mechanisms as that of an IRQ -- they are just instantiated
by different sources!]

routine, along with registers to be used in it. Memory accesses are
the bottleneck, and with HLL code being bloated as it is chances
are some cache will have to be flushed to make room for stacking.

Of course! But, invoking the ISR will likely also displace some
contents of the cache -- unless your entire ISR fits in a single
cache line *and* is already in the cache. (that includes the
DATA that your ISR may need, as well)

Remember, the whole need for cache is because processors are SO
much faster than memory!

Some *really* well designed for control applications processors allow
you to lock a part of the cache but I doubt ARM have that, they seem to
have gone the way \"make programming a two click job\" to target a
wider audience.

The \"application processors\" most definitely let you exert control over
the cache -- as well as processor affinity.

But, you *really* need to be wary about doing this as it sorely
impacts the utility of those mechanisms on the rest of your code!
I.e., if you wire-down part of the cache to expedite an ISR, then
you have forever taken that resource away from the rest of your code
to use. Are you smart enough to know how to make that decision,
\"in general\" (specific cases are a different story)?

The Z80 (et al.) had an "alternate register set". So, one could
EX AF,AF'
EXX
at the top of an ISR -- and again, just before exit -- to preserve
(and restore) the current contents of the (main!) register set. But,
this means only one ISR can be active at a time (no nesting). Or,
requires only a specific ISR to be active (and never interrupting
itself) as the alternate register set is indistinguishable from the
"regular" register set. Q: Are you willing to live without the use
of the alternate registers *in* your code, just for the sake of *an* ISR?

[I've had a really hard time NOT assigning specific cores to
specific portions of the design -- e.g., letting one core just
handle the RMI mechanism. I'm not sure that I can predict how
effective such an assignment would be vs. letting the processor
*dynamically* adjust to the load, AT THAT TIME.]

Other processors have register *banks* that you can switch to/from
to expedite context switches. Same sort of restrictions apply.

The 99k allowed you to switch "workspaces" efficiently. But, as
workspaces resided in RAM (screwed up THAT one, eh, TI?)...

Processors with tiny states (680x with A/B and index) don't really
have much to preserve. OTOH, they are forever loading and storing
just to get anything done -- no place to "hold onto" results
INSIDE the CPU. So, has their lack of internal state made them
BETTER workhorses? Or, just lessened the work required in an ISR
(because they aren't very *capable*, otherwise)?
 
On 15/01/2023 10:11, Don Y wrote:
On 1/15/2023 2:48 AM, Martin Brown wrote:
I prefer to use RDTSC for my Intel timings anyway.

On many of the modern CPUs there is a free-running 64-bit counter
clocked once per cycle. Intel deprecates using it for such purposes
but I have never found it a problem provided that you bracket it
before and after with CPUID to force all the pipelines into an empty
state.

The equivalent DWT_CYCCNT on the Arm CPUs that support it is described
here:

https://stackoverflow.com/questions/32610019/arm-m4-instructions-per-cycle-ipc-counters
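
For the Arm case, the usual recipe with the CMSIS register names looks roughly
like this (DWT is present on Cortex-M3/M4/M7 but not on the M0+ in the Pico;
the symbols normally come in via the device header, here assumed to be an
STM32F4 part):

#include <stdint.h>
#include "stm32f4xx.h"     /* any CMSIS device header for an M3/M4/M7 part */

static inline void cyccnt_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable the DWT block  */
    DWT->CYCCNT = 0;
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;            /* start the cycle count */
}

/* usage:
     cyccnt_init();
     uint32_t t0 = DWT->CYCCNT;
     code_under_test();
     uint32_t cycles = DWT->CYCCNT - t0;   // unsigned math handles wrap
*/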

I prefer hard numbers to a vague scope trace.

Two downsides:
- you have to instrument your code (but, if you're concerned with
performance,
  you've already done this as a matter of course)

You have to make a test framework to exercise the code in as realistic a
manner as you can - that isn't quite the same as instrumenting the code
(although it can be).

I have never found profile directed compilers to be the least bit useful
on my fast maths codes because their automatic code instrumentation
breaks the very code that it is supposed to be testing (in the sense of
wrecking cache lines and locality etc.).

The only profiling method I have found to work reasonably well is
probably by chance the highest frequency periodic ISR I have ever used
in anger which was to profile code by accumulating a snapshot of PC
addresses allowing just a few machine instructions to execute at a time.
It used to work well back in the old days when 640k was the limit and
code would reliably load into exactly the same locations every run.

It is a great way to find the hotspots where most time is spent.

- it doesn't tell you about anything that happens *before* the code runs
  (e.g., latency between event and recognition thereof)

True enough. Sometimes you need a logic analyser for weird behaviour -
we once caught a CPU chip where RTI didn't always do what it said on the
tin and the instruction following the RTI instruction got executed with
a frequency of about 1:10^8. They replaced all the faulty CPUs FOC but
we had to sign a non-disclosure agreement.

If I'm really serious about finding out why something is unusually
slow I run a dangerous system level driver that allows me full access
to the model specific registers to monitor cache misses and pipeline
stalls.

But, those results can change from instance to instance (as can latency,
execution time, etc.).  So, you need to look at the *distribution* of
values and then think about whether that truly represents "typical"
and/or *worst* case.

It just means that you have to collect an array of data and take a look
at it later and offline. Much like you would when testing that a library
function does exactly what it is supposed to.
Relying on exact timings is sort of naive; it ignores how much
things can vary with the running system (is the software in a
critical region when the ISR is invoked?) and the running
*hardware* (multilevel caches, etc.)

It is quite unusual to see bad behaviour from the multilevel caches but
it can add to the variance. You always get a few outliers here and there
in user code if a higher level disk or network interrupt steals cycles.

Do you have a way of KNOWING when your expectations (which you
have now decided are REQUIREMENTS!) are NOT being met?  And, if so,
what do you do (at runtime) with that information?   ("I'm sorry,
one of my basic assumptions is proving to be false and I am not
equipped to deal with that...")

Instrumenting for timing tests is very much development rather than
production code. ie is it fast enough or do we have to work harder.

Like you I prefer HLL code but I will use ASM if I have to or there is
no other way (like wanting 80 bit reals in the MS compiler). Actually I
am working on a class library to allow somewhat clunky access to it.

They annoyingly zapped access to 80 bit reals in v6 I think it was, for
"compatibility" reasons, since SSE2 and later can only do 64-bit reals.

Esp given that your implementation will likely evolve and
folks doing that work may not be as focused as you were on
this specific issue...

That will be their problem not mine ;-)

--
Regards,
Martin Brown
 
On 15/01/2023 14:10, Dimiter_Popoff wrote:
On 1/15/2023 12:48, Lasse Langwadt Christensen wrote:
On Sunday, January 15, 2023 at 06:10:24 UTC+1,
upsid...@downunder.com wrote:

If the processor has separate FP registers and/or separate FP status
words, avoid using FP registers in ISRs.

Generally good advice unless the purpose of the interrupt is to time
share the available CPU and FPU between various competing numerical
tasks. Cooperative multitasking has lower overheads if you can do it.

For my money ISRs should do as little as possible at such a high
privilege level although checking if their interrupt flag is already set
again before returning is worthwhile for maximum burst transfer speed.
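
Something like the sketch below, i.e. drain everything that is already pending
before returning (the UART register names here are hypothetical placeholders,
not a real device map):

#include <stdint.h>

#define UART_STATUS  (*(volatile uint32_t *)0x40002000u)  /* hypothetical */
#define UART_DATA    (*(volatile uint32_t *)0x40002004u)  /* hypothetical */
#define RX_READY     (1u << 0)

extern void fifo_push(uint8_t b);   /* background-drained FIFO */

/* One interrupt entry/exit per burst instead of one per byte. */
void uart_rx_isr(void)
{
    do {
        fifo_push((uint8_t)UART_DATA);
    } while (UART_STATUS & RX_READY);
}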

Some compilers may have "interrupt" keywords or similar extensions and
the compiler knows which registers need to be saved in the ISR. To
help the compiler, include all functions that are called by the ISR in
the same module (preferably in-lined) prior to the ISR, so that the
compiler knows what needs to be saved. Do not call external library
routines from an ISR, since the compiler doesn't know which registers
need to be saved and so saves them all.

Cortex-M automatically stacks the registers needed to call a regular C
function, and if it has an FPU it supports "lazy stacking", which means it
keeps track of whether the FPU is used and only stacks/un-stacks the FPU
registers when they are actually used.

It also knows that if another interrupt is pending at ISR exit it doesn't
need to un-stack/re-stack before calling the other interrupt.


How many registers does it stack automatically? I knew the HLL nonsense
would catch up with CPU design eventually. Good CPU design still means
load/store machines, stacking *nothing* at IRQ, just saving PC and CCR
to special purpose regs which can be stacked as needed by the IRQ
routine, along with registers to be used in it. Memory accesses are
the bottleneck, and with HLL code being bloated as it is chances
are some cache will have to be flushed to make room for stacking.
Some *really* well designed for control applications processors allow
you to lock a part of the cache but I doubt ARM have that, they seem to
have gone the way \"make programming a two click job\" to target a
wider audience.

Actually there were processors which took the exact opposite position
quite early on, and they were incredibly good for realtime performance,
but their registers were no different from RAM - they were *in* RAM, as
was the program-counter return address. There was a master workspace-pointer
register and 16 registers in RAM; the TI TMS9900 series, for instance.

https://en.wikipedia.org/wiki/Texas_Instruments_TMS9900#Architecture

I didn't properly appreciate at the time quite how good this trick was
for realtime work until we tried to implement the same algorithms on the
much later and on paper faster 68000 series of CPUs.

--
Regards,
Martin Brown
 
On 1/16/2023 3:23 AM, Martin Brown wrote:
For my money ISRs should do as little as possible at such a high privilege
level although checking if their interrupt flag is already set again before
returning is worthwhile for maximum burst transfer speed.

+42 on both counts.

OTOH, you have to be wary of misbehaving hardware (or unforeseen
circumstances) causing the ISR to loop continuously.

Many processors will give one (a couple?) of instructions in the
background a chance to execute (after RTI), even if there is a
"new" IRQ pending. So, you can gradually make *some* progress.

If, instead, you let the ISR loop, then you're stuck there...

I like to *quickly* reenable interrupts and take whatever measures
needed to ensure the work that *I* need to do will get done, properly,
even if postponed by a newer IRQ. This can be treacherous if a series
of different IRQ sources conspire to interrupt each other and leave
you interrupted by *yourself*, later!

How many registers does it stack automatically? I knew the HLL nonsense
would catch up with CPU design eventually. Good CPU design still means
load/store machines, stacking *nothing* at IRQ, just saving PC and CCR
to special purpose regs which can be stacked as needed by the IRQ
routine, along with registers to be used in it. Memory accesses are
the bottleneck, and with HLL code being bloated as it is chances
are some cache will have to be flushed to make room for stacking.
Some *really* well designed for control applications processors allow
you to lock a part of the cache but I doubt ARM have that, they seem to
have gone the way \"make programming a two click job\" to target a
wider audience.

Actually there were processors which took the exact opposite position quite
early on and they were incredibly good for realtime performance but their
registers were no different to ram - they were *in* ram so was the program
counter return address. There was a master register workspace pointer and 16
registers TI TMS9900 series for instance.

https://en.wikipedia.org/wiki/Texas_Instruments_TMS9900#Architecture

Guttag came out to pitch the 99K, in person (we used a lot of MPUs).
But, at that time (early 80's?), memory access times were already starting
to creep up past the cycle times of processors of that day. This was,
IMHO, a bad technological prediction on TI's part.

[IIRC, they also predicted sea-of-gates would be the most economical
semi-custom approach (they actually proposed a \"sea of inverters\"
wired in a mask layer much like DTL)]

I didn\'t properly appreciate at the time quite how good this trick was for
realtime work until we tried to implement the same algorithms on the much later
and on paper faster 68000 series of CPUs.
 
On 1/16/2023 3:21 AM, Martin Brown wrote:
On 15/01/2023 10:11, Don Y wrote:
On 1/15/2023 2:48 AM, Martin Brown wrote:
I prefer to use RDTSC for my Intel timings anyway.

On many of the modern CPUs there is a freerunning 64 bit counter clocked at
once per cycle. Intel deprecates using it for such purposes but I have never
found it a problem provided that you bracket it before and after with CPUID
to force all the pipelines into an empty state.

The equivalent DWT_CYCCNT on the Arm CPUs that support it is described here:

https://stackoverflow.com/questions/32610019/arm-m4-instructions-per-cycle-ipc-counters

I prefer hard numbers to a vague scope trace.
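
[A minimal sketch of both counters, assuming GCC/Clang intrinsics on x86 and
the documented DWT addresses on Cortex-M3/M4; work_under_test() is made up:

#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>                    /* __rdtsc()     */
#include <cpuid.h>                        /* __get_cpuid() */

static inline uint64_t cycles(void)
{
    unsigned a, b, c, d;
    __get_cpuid(0, &a, &b, &c, &d);       /* CPUID serialises the pipeline */
    return __rdtsc();
}
#else                                     /* Cortex-M3/M4 DWT cycle counter */
#define DEMCR       (*(volatile uint32_t *)0xE000EDFCu)
#define DWT_CTRL    (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT  (*(volatile uint32_t *)0xE0001004u)

static inline uint32_t cycles(void)
{
    DEMCR    |= (1u << 24);               /* TRCENA: turn the DWT block on */
    DWT_CTRL |= 1u;                       /* CYCCNTENA                     */
    return DWT_CYCCNT;
}
#endif

extern void work_under_test(void);        /* whatever is being timed */
volatile uint64_t last_cycles;

void time_it(void)
{
    uint64_t t0 = cycles();
    work_under_test();
    uint64_t t1 = cycles();
    last_cycles = t1 - t0;                /* minus the small bracketing overhead */
}
]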

Two downsides:
- you have to instrument your code (but, if you\'re concerned with performance,
   you\'ve already done this as a matter of course)

You have to make a test framework to exercise the code in as realistic a manner
as you can - that isn\'t quite the same as instrumenting the code (although it
can be).

It depends on how visible the information of interest is to outside
observers. If you have to \"do something\" to make it so, then you
may as well put in the instrumentation and get things as you want them.

I have never found profile directed compilers to be the least bit useful on my
fast maths codes because their automatic code instrumentation breaks the very
code that it is supposed to be testing (in the sense of wrecking cache lines
and locality etc.).

Exactly. The same holds true of adding invariants to code;
removing them (#ifndef DEBUG) changes the code -- subtly but
nonetheless. So, you have to put in place two levels of
final test:
- check to see if you THINK it will pass REAL final test
- actually DO the final test

When installing copy protection/anti-tamper mechanisms in products,
there\'s a time when you\'ve just enabled them and, thus, changed
how the product runs. If it *stops* running (properly), you have
to wonder if your \"measures\" are at fault or if some latent
bug has crept in, aggravated by the slight differences in
execution patterns.

The only profiling method I have found to work reasonably well is probably by
chance the highest frequency periodic ISR I have ever used in anger which was
to profile code by accumulating a snapshot of PC addresses allowing just a few
machine instructions to execute at a time. It used to work well back in the old
days when 640k was the limit and code would reliably load into exactly the same
locations every run.

It is a great way to find the hotspots where most time is spent.

IMO, this is where logic analyzers shine. I don\'t agree with
using them to \"trace code\" (during debug) as there are better ways to
get that information. But, *watching* to see how code runs (passively)
can be a real win. Especially when you are trying to watch for
RARE aberrant behavior.
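
[The PC-sampling trick still works on a hosted target; a minimal sketch for
Linux/x86-64 using SIGPROF in place of the raw timer ISR -- the ~1 kHz rate
and 16-byte buckets are arbitrary choices:

#define _GNU_SOURCE                        /* for REG_RIP */
#include <signal.h>
#include <stdint.h>
#include <sys/time.h>
#include <ucontext.h>

#define NBUCKETS 4096
static volatile unsigned long hist[NBUCKETS];

static void prof_handler(int sig, siginfo_t *si, void *ucv)
{
    ucontext_t *uc = ucv;
    uintptr_t pc = (uintptr_t)uc->uc_mcontext.gregs[REG_RIP];
    hist[(pc >> 4) % NBUCKETS]++;          /* the hotspots pile up quickly */
    (void)sig; (void)si;
}

void start_profiling(void)
{
    struct sigaction sa = { .sa_sigaction = prof_handler,
                            .sa_flags     = SA_SIGINFO | SA_RESTART };
    sigaction(SIGPROF, &sa, NULL);

    struct itimerval it = { .it_interval = { 0, 1000 },   /* ~1 kHz of CPU time */
                            .it_value    = { 0, 1000 } };
    setitimer(ITIMER_PROF, &it, NULL);
    /* ...run the code under test, then sort and dump hist[]... */
}
]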

- it doesn\'t tell you about anything that happens *before* the code runs
   (e.g., latency between event and recognition thereof)

True enough. Sometimes you need a logic analyser for weird behaviour - we once
caught a CPU chip where RTI didn\'t always do what it said on the tin and the
instruction following the RTI instruction got executed with a frequency of
about 1:10^8. They replaced all the faulty CPUs FOC but we had to sign a
non-disclosure agreement.

If I\'m really serious about finding out why something is unusually slow I
run a dangerous system level driver that allows me full access to the model
specific registers to monitor cache misses and pipeline stalls.

But, those results can change from instance to instance (as can latency,
execution time, etc.).  So, you need to look at the *distribution* of
values and then think about whether that truly represents \"typical\"
and/or *worst* case.

It just means that you have to collect an array of data and take a look at it
later and offline. Much like you would when testing that a library function
does exactly what it is supposed to.

Yes. So, you either have the code do the collection (using a black box)
*or* have to have an external device (logic analyzer) that can collect it
for you.

The former is nice because the code can actually make decisions
(at run time) that a passive observer often can\'t (because the
observer can\'t see all of the pertinent data). But, that starts
to have a pronounced impact on the *intended* code...

Relying on exact timings is sort of naive; it ignores how much
things can vary with the running system (is the software in a
critical region when the ISR is invoked?) and the running
*hardware* (multilevel caches, etc.)

It is quite unusual to see bad behaviour from the multilevel caches but it can
add to the variance. You always get a few outliers here and there in user code
if a higher level disk or network interrupt steals cycles.

Being an embedded system developer, the issues that often muck
up execution are often periodic -- but, with periods that are varied
enough that they only beat against the observed phenomenon occasionally.

I am always amused by folks WHO OBSERVE A F*CKUP. Then, when they
can't reproduce it or identify a likely cause, ACT AS IF IT NEVER
HAPPENED! Sheesh, you're not relying on some third-hand report
of an anomaly... YOU SAW IT! How can you pretend it didn't happen?

Do you have a way of KNOWING when your expectations (which you
have now decided are REQUIREMENTS!) are NOT being met? And, if so,
what do you do (at runtime) with that information? ("I'm sorry,
one of my basic assumptions is proving to be false and I am not
equipped to deal with that...")

Instrumenting for timing tests is very much development rather than production
code, i.e. is it fast enough or do we have to work harder?

Again, depends on the code and application. Few (interesting!) systems have
any sort of \"steady state\". Rather, they have to react to a variety of
circumstances occurring \"whenever\" they choose. The specifications rarely
say \"in this, that or the-other situation, these timing constraints do not
apply\". And, testing for every possible pile-up of events is just not
conceivable.

The alternative is to rethink your deadlines (avoid HARD deadlines because
so few things truly *are* hard!) and how you would recover from a missed
(or delayed) deadline.

I develop with deadline support in the code so the code can sort out how to
react to situations that I can\'t foresee nor test for.

A deadline handler may not be invoked in a timely fashion (if it *was*,
then why not just code the actual *task* as the deadline handler and
get THAT guarantee?! :> ). But, at least it lets the code/system
realize that something unplanned/unintended *has* happened: \"Whaddya
gonna do about it?\"
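
[One way to phrase that in hosted code, as a sketch: arm a one-shot timer for
the budget before starting the work; the handler only *records* the miss,
what to do about it is decided elsewhere. do_the_work() and the 5 ms budget
are made up:

#include <signal.h>
#include <stdbool.h>
#include <time.h>

static volatile sig_atomic_t deadline_missed;

static void deadline_handler(int sig) { (void)sig; deadline_missed = 1; }

extern void do_the_work(void);             /* the task with a 5 ms budget */

bool run_with_deadline(void)
{
    deadline_missed = 0;
    signal(SIGALRM, deadline_handler);

    timer_t t;
    struct sigevent ev = { .sigev_notify = SIGEV_SIGNAL,
                           .sigev_signo  = SIGALRM };
    timer_create(CLOCK_MONOTONIC, &ev, &t);

    struct itimerspec budget = { .it_value = { .tv_nsec = 5000000 } };
    timer_settime(t, 0, &budget, NULL);    /* the clock is running */

    do_the_work();

    struct itimerspec off = { 0 };
    timer_settime(t, 0, &off, NULL);       /* disarm (possibly too late) */
    timer_delete(t);

    return !deadline_missed;               /* caller decides how to degrade */
}
]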

Like you I prefer HLL code but I will use ASM if I have to or there is no other
way (like wanting 80 bit reals in the MS compiler). Actually I am working on a
class library to allow somewhat clunky access to it.

\"Measure and THEN optimize\". Let the compiler take a stab at it.
If you discover (measurement) that it\'s not meeting your expectations
(requirements), figure out why and the possible approaches you can take
to remedy that.

I designed a barcode reader that had, as a single input, the \"video\"
from the photodetector routed to an IRQ pin. Program the IRQ to sense
the white-to-black edge and wait. In ISR, capture the system time
from a high resolution timer; program the IRQ to sense the black-to-white
edge and wait. Lather, rinse, repeat (you never know *when* a user
may try to read a barcode!).

A background task would watch the FIFO maintained by the IRQ and
pull data out of it (atomically) to keep the FIFO from overflowing
as well as get a head start on decoding the \"transition times\"
into \"width intervals\".

Competing IRQs would introduce lots of latency into the captured
system times. So, I modified the ISRs to also tell me how long ago
the actual transition occurred. A bit more work for the ISRs and
similarly for the task monitoring the FIFO. But, it allowed me
to capture barcodes with features as small as 0.007" at 100 IPS...
on a 2 MHz 8-bit CPU.

A modern compiler could probably generate \"as effective\" code;
the real performance gain was obtained by changing the algorithm
instead of the implementation language.
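
[The shape of it, as a sketch with made-up names and a 16-bit free-running
timer (the real thing additionally had the ISR report how stale its
timestamp was):

#include <stdbool.h>
#include <stdint.h>

#define FIFO_SIZE 64u                         /* power of two */
static volatile uint16_t fifo[FIFO_SIZE];
static volatile uint8_t  head, tail;          /* one producer (ISR), one consumer */

extern uint16_t timer_now(void);              /* free-running high-res timer */
extern void     irq_set_edge(bool rising);    /* flip the sense of the IRQ pin */

void edge_isr(void)                           /* fires on each video transition */
{
    static bool rising;

    uint16_t t = timer_now();                 /* capture the time first */
    rising = !rising;
    irq_set_edge(rising);                     /* now wait for the opposite edge */

    uint8_t next = (uint8_t)((head + 1u) & (FIFO_SIZE - 1u));
    if (next != tail) {                       /* drop the edge rather than corrupt */
        fifo[head] = t;
        head = next;
    }
}

bool pop_width(uint16_t *width)               /* background task: times -> widths */
{
    static uint16_t last;

    if (tail == head)
        return false;
    uint16_t t = fifo[tail];
    tail = (uint8_t)((tail + 1u) & (FIFO_SIZE - 1u));
    *width = (uint16_t)(t - last);            /* unsigned subtraction handles wrap */
    last = t;
    return true;
}
]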

They annoyingly zapped access to 80 bit reals (in v6, I think it was) for
"compatibility" reasons, since SSE2 and later can only do 64-bit reals.

I had an early product use BCD data formats (supported by an
ancient compiler). When *that* support went away, it was a
real nightmare to go through and rework everything to use
bare ints (and have to bin-to-bcd all the time)
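
[e.g. the sort of shim that ends up sprinkled everywhere once the native
type is gone -- packing 0..9999 into four BCD nibbles:

#include <stdint.h>

static uint16_t bin_to_bcd16(uint16_t bin)    /* 1234 -> 0x1234 */
{
    uint16_t bcd = 0;

    for (int shift = 0; shift < 16; shift += 4) {
        bcd |= (uint16_t)((bin % 10u) << shift);
        bin /= 10u;
    }
    return bcd;                               /* the reverse runs the loop backwards */
}
]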

Esp given that your implementation will likely evolve and
folks doing that work may not be as focused as you were on
this specific issue...

That will be their problem not mine ;-)

Ah, most of my projects are prototypes or proof of concept.
So, my code *will* be reworked. If that proves to be hard,
folks won\'t recommend me to other clients! :>

[So, I make it REALLY easy for folks to Do The Right Thing
(in my opinion of \"right\") to maximize their chance of getting
good results. If you want to reinvent the wheel, then don\'t
fret if you *break* it - cuz you can SEE that it worked!!]
 
Am 16.01.23 um 11:23 schrieb Martin Brown:

Actually there were processors which took the exact opposite position
quite early on and they were incredibly good for realtime performance
but their registers were no different to ram - they were *in* ram so was
the program counter return address. There was a master register
workspace pointer and 16 registers TI TMS9900 series for instance.

https://en.wikipedia.org/wiki/Texas_Instruments_TMS9900#Architecture

I didn\'t properly appreciate at the time quite how good this trick was
for realtime work until we tried to implement the same algorithms on the
much later and on paper faster 68000 series of CPUs.

At TU Berlin we had a place called the Zoo where there was
at least one sample of each CPU family. We used the Zoo to
port Andrew Tanenbaum\'s Experimental Machine to all of them
under equal conditions. That was a p-code engine from the
Amsterdam Free University Compiler Kit.

The 9900 was slowest, by a large margin, Z80-league.
Having no cache AND no registers was a braindead idea.



Some friends built a hardware machine around the Fairchild
Clipper. They found out that moving the hard disk driver
just a few bytes made a difference between speedy and slow
as molasses. When the data had already passed under the head you had
to wait for another disc revolution.

It turned out that Fairchild simply lasered away some faulty
cache lines and sold it. No warning given.
It was entertaining to see, not being in that project.

Gerhard
 
On 1/16/2023 4:25 AM, Don Y wrote:
On 1/16/2023 3:23 AM, Martin Brown wrote:
For my money ISRs should do as little as possible at such a high privilege
level although checking if their interrupt flag is already set again before
returning is worthwhile for maximum burst transfer speed.

+42 on both counts.

OTOH, you have to be wary of misbehaving hardware (or unforeseen
circumstances) causing the ISR to loop continuously.

E.g., I particularly object to folks trying to detect counter
wrap by:
do {
    high = read(HIGH);
    low  = read(LOW);
} while ( high != read(HIGH) );
and similar. What *guarantees* do you have that this will ever
complete? (yeah, unlikely for it to hang here, but not *impossible*!)

Better:
high = read(HIGH);
low  = read(LOW);
if ( high != read(HIGH) ) {
    high = high + 1;   /* HIGH rolled over while we were reading */
    low  = 0;
}
or similar (e.g., high = high; low = LOWMAX depending on how you want to
bias the approximation).
 
On 1/16/2023 5:04 AM, Gerhard Hoffmann wrote:
Am 16.01.23 um 11:23 schrieb Martin Brown:

Actually there were processors which took the exact opposite position quite
early on and they were incredibly good for realtime performance but their
registers were no different to ram - they were *in* ram so was the program
counter return address. There was a master register workspace pointer and 16
registers TI TMS9900 series for instance.

https://en.wikipedia.org/wiki/Texas_Instruments_TMS9900#Architecture

I didn\'t properly appreciate at the time quite how good this trick was for
realtime work until we tried to implement the same algorithms on the much
later and on paper faster 68000 series of CPUs.

At TU Berlin we had a place called the Zoo where there was
at least one sample of each CPU family. We used the Zoo to
port Andrew Tanenbaum\'s Experimental Machine to all of them
under equal conditions. That was a p-code engine from the
Amsterdam Free University Compiler Kit.

The 9900 was slowest, by a large margin, Z80-league.

But, only for THAT particular benchmark. I learned, early on,
that I could create a benchmark for damn near any two processors
to make *either* (my choice) look better, just by choosing
the conditions of the test.

[And, as I was often designing the hardware, my \"input\"
carried a lot of weight]

Are we holding clock frequency constant? Memory access time?
Code size? Memory dollars? Board space? \"Throughput\"?
Algorithm? etc.

I tend to like processors with lots of internal registers
from my days writing ASM; it was an acquired skill to be able
to think about how to design an algorithm so you could keep
everything *in* the processor -- instead of having to constantly
load/store/reload.

But, moving away from ASM, I\'m less concerned as to what the
programmer\'s model looks like. I\'m more interested in what
the architecture supports and how easy it is for me to make
use of those mechanisms (in hardware and software).

> Having no cache AND no registers was a braindead idea.

They could argue that adding cache was a logical way to
design \"systems\" with the device.

Remember, the 9900/99K were from the \"home computer\" era.
They lost out to a dog slow 8086!

Some friends built a hardware machine around the Fairchild
Clipper. They found out that moving the hard disk driver
just a few bytes made a difference between speedy and slow
as molasses. When the data had already passed under the head you had
to wait for another disc revolution.

Unless, of course, you were already planning on being busy
doing something else, at that time. :> \"Benchmarks lie\"

It turned out that Fairchild simply lasered away some faulty
cache lines and sold it. No warning given.
It was entertaining to see, not being in that project.
 
Am 16.01.23 um 13:25 schrieb Don Y:
On 1/16/2023 5:04 AM, Gerhard Hoffmann wrote:
Am 16.01.23 um 11:23 schrieb Martin Brown:

I didn\'t properly appreciate at the time quite how good this trick
was for realtime work until we tried to implement the same algorithms
on the much later and on paper faster 68000 series of CPUs.

At TU Berlin we had a place called the Zoo where there was
at least one sample of each CPU family. We used the Zoo to
port Andrew Tanenbaum\'s Experimental Machine to all of them
under equal conditions. That was a p-code engine from the
Amsterdam Free University Compiler Kit.

The 9900 was slowest, by a large margin, Z80-league.

But, only for THAT particular benchmark.  I learned, early on,
that I could create a benchmark for damn near any two processors
to make *either* (my choice) look better, just by choosing
the conditions of the test.

That was not a benchmark; that was a given large p-code machine
with the intent to use the same compilers everywhere. Not unlike
UCSD-Pascal.


Having no cache AND no registers was a braindead idea.

They could argue that adding cache was a logical way to
design \"systems\" with the device.

with a non-existent cache controller and cache RAMs that
cost as much as the CPU. I got a feeling for the price
of cache when I designed this:
<https://www.flickr.com/photos/137684711@N07/52631074700/in/dateposted-public/>

Remember, the 9900/99K were from the "home computer" era.
They lost out to a dog slow 8086!

8086 was NOT slow. Have you ever used an Olivetti M20 with
a competently engineered memory system? That even challenged
early ATs when protected mode was not needed.

Some friends built a hardware machine around the Fairchild
Clipper. They found out that moving the hard disk driver
just a few bytes made a difference between speedy and slow
as molasses. When the data had already passed under the head you had
to wait for another disc revolution.

Unless, of course, you were already planning on being busy
doing something else, at that time.   :>  \"Benchmarks lie\"

That benchmark was Unix System V, as licensed from Bell.
Find something better to do when you need to swap.

Gerhard
 
On Mon, 16 Jan 2023 10:21:43 +0000, Martin Brown
<\'\'\'newspam\'\'\'@nonad.co.uk> wrote:

On 15/01/2023 10:11, Don Y wrote:
On 1/15/2023 2:48 AM, Martin Brown wrote:
I prefer to use RDTSC for my Intel timings anyway.

On many of the modern CPUs there is a freerunning 64 bit counter
clocked at once per cycle. Intel deprecates using it for such purposes
but I have never found it a problem provided that you bracket it
before and after with CPUID to force all the pipelines into an empty
state.

The equivalent DWT_CYCCNT on the Arm CPUs that support it is described
here:

https://stackoverflow.com/questions/32610019/arm-m4-instructions-per-cycle-ipc-counters

I prefer hard numbers to a vague scope trace.

Two downsides:
- you have to instrument your code (but, if you\'re concerned with
performance,
  you\'ve already done this as a matter of course)

You have to make a test framework to exercise the code in as realistic a
manner as you can - that isn\'t quite the same as instrumenting the code
(although it can be).

I have never found profile directed compilers to be the least bit useful
on my fast maths codes because their automatic code instrumentation
breaks the very code that it is supposed to be testing (in the sense of
wrecking cache lines and locality etc.).

The only profiling method I have found to work reasonably well is
probably by chance the highest frequency periodic ISR I have ever used
in anger which was to profile code by accumulating a snapshot of PC
addresses allowing just a few machine instructions to execute at a time.
It used to work well back in the old days when 640k was the limit and
code would reliably load into exactly the same locations every run.

It is a great way to find the hotspots where most time is spent.

- it doesn\'t tell you about anything that happens *before* the code runs
  (e.g., latency between event and recognition thereof)

True enough. Sometimes you need a logic analyser for weird behaviour -
we once caught a CPU chip where RTI didn\'t always do what it said on the
tin and the instruction following the RTI instruction got executed with
a frequency of about 1:10^8. They replaced all the faulty CPUs FOC but
we had to sign a non-disclosure agreement.

If I\'m really serious about finding out why something is unusually
slow I run a dangerous system level driver that allows me full access
to the model specific registers to monitor cache misses and pipeline
stalls.

But, those results can change from instance to instance (as can latency,
execution time, etc.).  So, you need to look at the *distribution* of
values and then think about whether that truly represents \"typical\"
and/or *worst* case.

It just means that you have to collect an array of data and take a look
at it later and offline. Much like you would when testing that a library
function does exactly what it is supposed to.

Relying on exact timings is sort of naive; it ignores how much
things can vary with the running system (is the software in a
critical region when the ISR is invoked?) and the running
*hardware* (multilevel caches, etc.)

It is quite unusual to see bad behaviour from the multilevel caches but
it can add to the variance. You always get a few outliers here and there
in user code if a higher level disk or network interrupt steals cycles.

Do you have a way of KNOWING when your expectations (which you
have now decided are REQUIREMENTS!) are NOT being met? And, if so,
what do you do (at runtime) with that information? ("I'm sorry,
one of my basic assumptions is proving to be false and I am not
equipped to deal with that...")

Instrumenting for timing tests is very much development rather than
production code, i.e. is it fast enough or do we have to work harder?

Like you I prefer HLL code but I will use ASM if I have to or there is
no other way (like wanting 80 bit reals in the MS compiler). Actually I
am working on a class library to allow somewhat clunky access to it.

They annoyingly zapped access to 80 bit reals (in v6, I think it was) for
"compatibility" reasons, since SSE2 and later can only do 64-bit reals.

PowerBasic has 80-bit reals as a native variable type.

As far as timing analysis goes, we always bring out a few port pins to
test points, from uPs and FPGAs, so we can scope things. Raise a pin
at ISR entry, drop it before the RTI, scope it.
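
[i.e. something like this at the top and bottom of the handler; the
write-1-to-set / write-1-to-clear register addresses are made up, the point
is that each marker costs a single store:

#include <stdint.h>

#define GPIO_SET  (*(volatile uint32_t *)0x40020018u)
#define GPIO_CLR  (*(volatile uint32_t *)0x4002001Cu)
#define TESTPOINT (1u << 5)

void control_loop_isr(void)
{
    GPIO_SET = TESTPOINT;       /* rising edge = ISR entry */

    /* ...filtering, PID, DAC write... */

    GPIO_CLR = TESTPOINT;       /* falling edge just before return: the pulse
                                   width on the scope is the ISR execution time */
}
]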

We wrote one Linux program that just toggled a test point as fast as
it could. That was interesting on a scope, namely the parts that
didn\'t toggle.
 
