rowhammer mitigation...

Don Y

I don't see anything that I can do in hardware *or* software to
mitigate against such exploits.

Nor, do I see a reliable way of detecting such a possible
exploit -- or an "unintended" vulnerability caused by an
unfortunate instruction sequence (which may be generated
by a compiler and, thus, not as obvious to the coder).

Add to the mix SoCs and it seems like all bets are off
even *in* the lab.

Constrain systems to smaller footprints? (ain't gonna
happen)

Perhaps harden the OS and watch for access violations as
a side effect of such a manifestation?
 
On a sunny day (Thu, 20 Apr 2023 01:42:16 -0700) it happened Don Y
<blockedofcourse@foo.invalid> wrote in <u1qttf$h93o$1@dont-email.me>:

I don't see anything that I can do in hardware *or* software to
mitigate against such exploits.

Nor, do I see a reliable way of detecting such a possible
exploit -- or an "unintended" vulnerability caused by an
unfortunate instruction sequence (which may be generated
by a compiler and, thus, not as obvious to the coder).

Add to the mix SoCs and it seems like all bets are off
even *in* the lab.

Constrain systems to smaller footprints? (ain't gonna
happen)

Perhaps harden the OS and watch for access violations as
a side effect of such a manifestation?

Yes
https://en.m.wikipedia.org/wiki/Row_hammer

Maybe use static (SRAM) memory?

Depends on how much memory you need.
 
On 20/04/2023 09:42, Don Y wrote:
I don't see anything that I can do in hardware *or* software to
mitigate against such exploits.

There are some DRAM modules that are less susceptible to this sort of
attack, but ultimately the aggressor has the upper hand here since there
are plenty of systems about with vulnerable CPUs and DRAM.

Intel has some guidelines for best practice and assesses the potential
risk as "Low", so you pays your money and takes your choice.

https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00247.html

Cisco probably have the best discussion of it that I have seen:

https://blogs.cisco.com/security/mitigations-available-for-the-dram-row-hammer-vulnerability
Nor, do I see a reliable way of detecting such a possible
exploit -- or an "unintended" vulnerability caused by an
unfortunate instruction sequence (which may be generated
by a compiler and, thus, not as obvious to the coder).

It might be possible to recognise the most common coding techniques used
to implement this form of attack and screen executables for it being
present before letting them execute. It has to be in a very tight loop
or loop unrolled to work so that there are not all that many ways to
implement it (although there may still be enough to make doing this
impossible).
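
(For concreteness, the canonical inner loop -- as published, not any
particular exploit's actual code -- looks something like this minimal C
sketch for x86; addr1/addr2 are assumed to map to different rows of the
same DRAM bank, and the names are mine:)

#include <emmintrin.h>
#include <stdint.h>

/* Classic row hammer access pattern: alternately activate two rows,
   flushing the cache so that every read goes all the way to DRAM. */
void hammer(volatile uint8_t *addr1, volatile uint8_t *addr2, long reps)
{
    while (reps--) {
        (void)*addr1;                      /* activate row 1 */
        (void)*addr2;                      /* activate row 2 */
        _mm_clflush((const void *)addr1);  /* evict: next read hits DRAM */
        _mm_clflush((const void *)addr2);
    }
}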

Add to the mix SoCs and it seems like all bets are off
even *in* the lab.

Constrain systems to smaller footprints?  (ain't gonna
happen)

Perhaps harden the OS and watch for access violations as
a side effect of such a manifestation?

Although it is a theoretical risk I don't expect it is one that you
particularly need to fret about unless you work for GCHQ or FBI.

--
Martin Brown
 
On 4/20/2023 2:47 AM, Martin Brown wrote:
On 20/04/2023 09:42, Don Y wrote:
I don't see anything that I can do in hardware *or* software to
mitigate against such exploits.

There are some DRAM modules that are less susceptible to this sort of attack,
but ultimately the aggressor has the upper hand here since there are plenty of
systems about with vulnerable CPUs and DRAM.

Intel has some guidelines for best practice and assesses the potential risk as
"Low", so you pays your money and takes your choice.

https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00247.html

Cisco probably have the best discussion of it that I have seen:

https://blogs.cisco.com/security/mitigations-available-for-the-dram-row-hammer-vulnerability

But, these aren't typically the types of CPUs and memory devices that
one builds "devices" with (if that's the route you're going, you're
likely just "buying a PC" and using *it* as your hardware platform).

Nor, do I see a reliable way of detecting such a possible
exploit -- or an "unintended" vulnerability caused by an
unfortunate instruction sequence (which may be generated
by a compiler and, thus, not as obvious to the coder).

It might be possible to recognise the most common coding techniques used to
implement this form of attack and screen executables for it being present
before letting them execute. It has to be in a very tight loop or loop unrolled
to work so that there are not all that many ways to implement it (although
there may still be enough to make doing this impossible).

You'd have to inspect the machine code as you can't reliably infer
the optimizations that the code generator (HLL) might employ in any
given stanza.

And, you can't deploy the code (test harness) and definitively claim
you can see/not see examples of such a problem (that you can't also
wonder about whether or not they're unrelated "bugs").

Furthermore, there's no reliable way to determine if this was a
deliberate attempt at an exploit (in which case, you can "ban"
the application) or just an unfortunate coincidence.

Add to the mix SoCs and it seems like all bets are off
even *in* the lab.

Constrain systems to smaller footprints?  (ain't gonna
happen)

Perhaps harden the OS and watch for access violations as
a side effect of such a manifestation?

Although it is a theoretical risk I don't expect it is one that you
particularly need to fret about unless you work for GCHQ or FBI.

Apparently, there are exploits in-the-wild targeting browsers, phones,
servers, etc. I don't imagine all of those are operated by intelligence
agencies!

I've been examining proposed countermeasures, but most are geared
towards "big iron", not the sorts of CPUs/SoCs/MCUs that get designed
*into* products. The "easy ones" have all been proven ineffective.
Some of the more sophisticated rely on hardware features that aren't
present in these "components".

I think the only *practical* approach is leaving invariants in the
codebase (wired to "panic()") along with active enforcement of
access controls. E.g., I have a capabilities-based RTOS so *any*
violation should be treated as suspect. ("You're not supposed to
be doing THAT; so why are you even trying? You're either an
adversary or buggy. Goodbye!").
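
A sketch of the shape I mean (names made up; not my actual API):

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical invariant hook: a failed check is never "handled";
   it is treated as an adversary or a bug, and the offender dies. */
static void panic(const char *file, int line, const char *what)
{
    fprintf(stderr, "PANIC %s:%d: %s\n", file, line, what);
    abort();
}

#define REQUIRE(cond) \
    do { if (!(cond)) panic(__FILE__, __LINE__, #cond); } while (0)

static int has_capability(int task, int cap) { return 0; /* stub */ }

void service_request(int task, int cap)
{
    REQUIRE(has_capability(task, cap));  /* "why are you even trying?" */
    /* ... perform the operation only if the check passed ... */
}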

But, all this tells me is that the system is compromised; it doesn't
identify the source of the problem nor give me a remedy (short of
reloading the image) to temporarily eliminate the problem.

[And, as above, if it manifests as a corruption of a legitimate
piece of code, I'm going to "blame" the corrupted code... not the
aggressor!]

As geometries shrink and power budgets fall, this sort of thing
is going to become increasingly possible!

[Even if designing a *closed* system/product, you'd never know if
some aspect of YOUR code isn't going to trigger such a fault in the
future; just because it works "now" means nothing about *later*!
It's a probabilistic situation.]

We're living at a precarious time -- where hardware is ALMOST cheap enough
to do wonderful things but still not cheap enough to do them safely!
It's so much easier to wire a couple of discretes together and not
worry that R27 is surreptitiously trying to subvert Q71! :<
 
On 20/04/2023 11:21, Don Y wrote:
On 4/20/2023 2:47 AM, Martin Brown wrote:
On 20/04/2023 09:42, Don Y wrote:
I don't see anything that I can do in hardware *or* software to
mitigate against such exploits.

There are some DRAM modules that are less susceptible to this sort of
attack, but ultimately the aggressor has the upper hand here since
there are plenty of systems about with vulnerable CPUs and DRAM.

Intel has some guidelines for best practice and assesses the potential
risk as "Low", so you pays your money and takes your choice.

https://www.intel.com/content/www/us/en/security-center/advisory/intel-sa-00247.html

Cisco probably have the best discussion of it that I have seen:

https://blogs.cisco.com/security/mitigations-available-for-the-dram-row-hammer-vulnerability

But, these aren't typically the types of CPUs and memory devices that
one builds "devices" with (if that's the route you're going, you're
likely just "buying a PC" and using *it* as your hardware platform).

In a device you have the option of, for example, only allowing code that
has been inspected and digitally signed by you to run.

In a general-use PC you can try to do that, but even some major hardware
manufacturers CBA to digitally sign their own drivers, so end users are
habituated to clicking the override-as-administrator button to install
unsigned and potentially unsafe drivers at privileged levels.

Nor, do I see a reliable way of detecting such a possible
exploit -- or an "unintended" vulnerability caused by an
unfortunate instruction sequence (which may be generated
by a compiler and, thus, not as obvious to the coder).

It might be possible to recognise the most common coding techniques
used to implement this form of attack and screen executables for it
being present before letting them execute. It has to be in a very
tight loop or loop unrolled to work so that there are not all that
many ways to implement it (although there may still be enough to make
doing this impossible).

You'd have to inspect the machine code as you can't reliably infer
the optimizations that the code generator (HLL) might employ in any
given stanza.

You can inspect the machine code in hexadecimal, looking for particular
4-, 8-, or 16-byte code signatures that might be dodgy.
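
(As a toy example -- much too crude for real screening, but it shows the
idea -- you could scan an image for CLFLUSH, encoded as 0F AE /7 on x86:)

#include <stddef.h>
#include <stdint.h>

/* Toy signature scan: does the image contain a CLFLUSH instruction?
   Real screening would need proper disassembly plus loop analysis. */
int contains_clflush(const uint8_t *img, size_t len)
{
    for (size_t i = 0; i + 2 < len; i++)
        if (img[i] == 0x0F && img[i + 1] == 0xAE &&
            ((img[i + 2] >> 3) & 7) == 7 &&   /* ModRM reg field: /7 */
            ((img[i + 2] >> 6) & 3) != 3)     /* memory operand (else SFENCE) */
            return 1;
    return 0;
}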

And, you can't deploy the code (test harness) and definitively claim
you can see/not see examples of such a problem (that you can't also
wonder about whether or not they're unrelated "bugs").

Rowhammer does require a characteristic very tight loop - I think that
there are not so many ways to do it that would be tight enough to work.

Furthermore, there's no reliable way to determine if this was a
deliberate attempt at an exploit (in which case, you can "ban"
the application) or just an unfortunate coincidence.

Any signature-based AV or malware detection invariably has a few false
positives. I have had unusual go-faster code that does things not unlike
a virus sometimes fall foul of my over-zealous AV software. I could see
immediately why the AV didn't much like my code once I thought about it.
Add to the mix SoCs and it seems like all bets are off
even *in* the lab.

Constrain systems to smaller footprints?  (ain't gonna
happen)

Perhaps harden the OS and watch for access violations as
a side effect of such a manifestation?

Although it is a theoretical risk I don't expect it is one that you
particularly need to fret about unless you work for GCHQ or FBI.

Apparently, there are exploits in-the-wild targeting browsers, phones,
servers, etc.  I don't imagine all of those are operated by intelligence
agencies!

I guess there will be, although it would usually seem to require the
user to have downloaded some code in the first place. That, or a browser
exploit - remove email and web browsing and your attack surface is tiny.
I've been examining proposed countermeasures, but most are geared
towards "big iron", not the sorts of CPUs/SoCs/MCUs that get designed
*into* products.  The "easy ones" have all been proven ineffective.
Some of the more sophisticated rely on hardware features that aren't
present in these "components".

In products, you just don't allow side-loading of other code.
I think the only *practical* approach is leaving invariants in the
codebase (wired to "panic()") along with active enforcement of
access controls.  E.g., I have a capabilities-based RTOS so *any*
violation should be treated as suspect.  ("You're not supposed to
be doing THAT; so why are you even trying?  You're either an
adversary or buggy.  Goodbye!").

I suspect that killing any process that steps out of line immediately
will prevent a lot of them from gaining a foothold.
But, all this tells me is that the system is compromised; it doesn't
identify the source of the problem nor give me a remedy (short of
reloading the image) to temporarily eliminate the problem.

[And, as above, if it manifests as a corruption of a legitimate
piece of code, I'm going to "blame" the corrupted code... not the
aggressor!]

Corrupting code these days is a lot harder than it used to be. Not so
long ago I tried to write a tiny piece of self-modifying code and it
wasn't allowed. The lateral-thinking solution was to code it in asm:

jmp here+1
andi ax,MagicConstant

Where magic constant was the hex encoding of the missing instruction
that was unsupported by the inline assembler. In fact it turned out to
be supported (just undocumented - the disassembler decoded it OK).
Knowing how it decoded gave me the right mnemonic to code it.

As geometries shrink and power budgets fall, this sort of thing
is going to become increasingly possible!

[Even if designing a *closed* system/product, you'd never know if
some aspect of YOUR code isn't going to trigger such a fault in the
future; just because it works "now" means nothing about *later*!
It's a probabilistic situation.]

I have been caught out that way just once when, in the transition between
8086 and 80286, the TEST instruction became a lot faster and it caused a
race condition in a video driver for the NEC7220 (the chip couldn't cope
with being polled that fast in such a tight loop).
Changing TEST for AND in the loop solved it with a slight slowdown.

We're living at a precarious time -- where hardware is ALMOST cheap enough
to do wonderful things but still not cheap enough to do them safely!
It's so much easier to wire a couple of discretes together and not
worry that R27 is surreptitiously trying to subvert Q71!  :<

I think in a closed embedded environment you are relatively safe from
malware. It is when you let people install code or apps that things can
go seriously wrong - more so given the propensity of people to click
"OK" without a second thought about what they are allowing to happen.

--
Martin Brown
 
On 4/20/2023 5:19 AM, Martin Brown wrote:
But, these aren't typically the types of CPUs and memory devices that
one builds "devices" with (if that's the route you're going, you're
likely just "buying a PC" and using *it* as your hardware platform).

In a device you have the option of, for example, only allowing code that has
been inspected and digitally signed by you to run.

Yes. But, that puts you in the loop for *every* application.
As there are potentially a multitude of "other developers",
*you* become the bottleneck.

In a general-use PC you can try to do that, but even some major hardware
manufacturers CBA to digitally sign their own drivers, so end users are
habituated to clicking the override-as-administrator button to install
unsigned and potentially unsafe drivers at privileged levels.

Nor, do I see a reliable way of detecting such a possible
exploit -- or an "unintended" vulnerability caused by an
unfortunate instruction sequence (which may be generated
by a compiler and, thus, not as obvious to the coder).

It might be possible to recognise the most common coding techniques used to
implement this form of attack and screen executables for it being present
before letting them execute. It has to be in a very tight loop or loop
unrolled to work so that there are not all that many ways to implement it
(although there may still be enough to make doing this impossible).

You'd have to inspect the machine code as you can't reliably infer
the optimizations that the code generator (HLL) might employ in any
given stanza.

You can inspect the machine code in hexadecimal, looking for particular 4-,
8-, or 16-byte code signatures that might be dodgy.

I think you really have to look at the actual instruction sequences
and their possible side-effects. E.g., anything that accesses memory
(including opcode fetches, "transfers of control", subroutine call/return,
etc.) can be used to "irritate" the problem -- not just load/stores.

And, you can't deploy the code (test harness) and definitively claim
you can see/not see examples of such a problem (that you can't also
wonder about whether or not they're unrelated "bugs").

Rowhammer does require a characteristic very tight loop - I think that there
are not so many ways to do it that would be tight enough to work.

It seems to be driven more by how many disturbances you can
cause in a unit of time -- e.g., a refresh interval. A "slow
loop" with N iterations causes just as much disturbance as a
"fast loop" with an equal number. The downside of the fast
loop is that the disturbances can accumulate before a restorative
action (refresh event).
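
[Back of the envelope: with a ~64 ms retention window and a row-activation
cycle on the order of 50 ns, a tight loop can manage roughly
64 ms / 50 ns ~ 1.3M activations between refreshes of a victim row --
comfortably above the disturbance counts (order 10^5) reported in the
published work.]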

This led to the poor-man's solution of just doubling the refresh rate.
But, this has also been shown to be an insufficient safeguard.

Furthermore, there's no reliable way to determine if this was a
deliberate attempt at an exploit (in which case, you can "ban"
the application) or just an unfortunate coincidence.

Any signature-based AV or malware detection invariably has a few false
positives. I have had unusual go-faster code that does things not unlike a
virus sometimes fall foul of my over-zealous AV software. I could see
immediately why the AV didn't much like my code once I thought about it.

But that just complicates development; "you" haven't done anything
wrong, yet are being tasked with altering your product because of
the shortcomings of a flawed detector.

Or, having to redesign the detector to distinguish legitimate code
sequences from problematic ones.

[Hardening code with anti-counterfeiting features is a similar
bit of annoying overhead. All of the effort expended to install
those features is counterproductive; it doesn't make the code
any *better* (and, depending on how done, can make it buggier).]

Add to the mix SoCs and it seems like all bets are off
even *in* the lab.

Constrain systems to smaller footprints?  (ain't gonna
happen)

Perhaps harden the OS and watch for access violations as
a side effect of such a manifestation?

Although it is a theoretical risk I don't expect it is one that you
particularly need to fret about unless you work for GCHQ or FBI.

Apparently, there are exploits in-the-wild targeting browsers, phones,
servers, etc.  I don't imagine all of those are operated by intelligence
agencies!

I guess there will be, although it would usually seem to require the user to
have downloaded some code in the first place. That, or a browser exploit -
remove email and web browsing and your attack surface is tiny.

In my case, it's an "open" system -- much like Android phones (where a user
can sideload an app without having to acquire it from a walled-garden store).
And, an app need not be foolishly eager to malign its host! Instead, the
developer could design it to sit and wait for a period where many instances
have "infected" the market before manifesting its malevolence.

So, the "buggy behavior" may not be immediately associated with THAT
particular app -- unless your code can identify it as the source.

I've been examining proposed countermeasures, but most are geared
towards "big iron", not the sorts of CPUs/SoCs/MCUs that get designed
*into* products.  The "easy ones" have all been proven ineffective.
Some of the more sophisticated rely on hardware features that aren't
present in these "components".

In products, you just don't allow side-loading of other code.

How successful would PCs be if the user couldn't load "non-MS" code?
You (I) want to encourage others to embellish your platform so *you*
don't have to do it all. And, because *you* may not foresee all of the
potential uses/applications!

I think the only *practical* approach is leaving invariants in the
codebase (wired to "panic()") along with active enforcement of
access controls.  E.g., I have a capabilities-based RTOS so *any*
violation should be treated as suspect.  ("You're not supposed to
be doing THAT; so why are you even trying?  You're either an
adversary or buggy.  Goodbye!").

I suspect that killing any process that steps out of line immediately will
prevent a lot of them from gaining a foothold.

Yes, but I can only do that by watching established interfaces.
E.g., trying to call "shutdown" or accessing memory outside of your
address space or invoking a function to which you haven't been
granted explicit permission.

But, I can't monitor "memory access patterns" WITHIN your approved
address space; most MCU/SoC offerings don't have such fine-grained
RUNTIME tooling in place.

But, all this tells me is that the system is compromised; it doesn't
identify the source of the problem nor give me a remedy (short of
reloading the image) to temporarily eliminate the problem.

[And, as above, if it manifests as a corruption of a legitimate
piece of code, I'm going to "blame" the corrupted code... not the
aggressor!]

Corrupting code these days is a lot harder than it used to be. Not so long
ago I tried to write a tiny piece of self-modifying code and it wasn't
allowed. The lateral-thinking solution was to code it in asm:

But, that's a *deliberate* write. I can mark regions of memory
as read-only, no-execute, etc. But, that only watches for explicit
attempts to do those things. It doesn't watch for memory to be
accessed via side-channels.

jmp here+1
andi ax,MagicConstant

Where magic constant was the hex encoding of the missing instruction that was
unsupported by the inline assembler. In fact it turned out to be supported
(just undocumented - the disassembler decoded it OK). Knowing how it decoded
gave me the right mnemonic to code it.

As geometries shrink and power budgets fall, this sort of thing
is going to become increasingly possible!

[Even if designing a *closed* system/product, you'd never know if
some aspect of YOUR code isn't going to trigger such a fault in the
future; just because it works "now" means nothing about *later*!
It's a probabilistic situation.]

I have been caught out that way just once when, in the transition between 8086
and 80286, the TEST instruction became a lot faster and it caused a race
condition in a video driver for the NEC7220 (the chip couldn't cope with being
polled that fast in such a tight loop).
Changing TEST for AND in the loop solved it with a slight slowdown.

Yeah, I've seen cases where you had to add (effectively) NoOps to
counter the pipeline prefetch in processors. Especially when custom
memory hardware was involved (instruction 1 alters the address mapping but
THE WRONG instruction 2 is fetched before the new mapping took effect).

We're living at a precarious time -- where hardware is ALMOST cheap enough
to do wonderful things but still not cheap enough to do them safely!
It's so much easier to wire a couple of discretes together and not
worry that R27 is surreptitiously trying to subvert Q71!  :<

I think in a closed embedded environment you are relatively safe from malware.

Is \"legitimate code\" that happens to exhibit an unfortunate side-effect
considered malware? If I fill memory (tight loop) in a particular pattern
and that causes a read-disturb induced error, is it \"my fault\"? Or, the fault
of the current hardware? (\"My code has worked fine for decades; something is
wrong with YOUR *newer* hardware...\")

It is when you let people install code or apps that things can go seriously
wrong - more so given the propensity of people to click "OK" without a second
thought about what they are allowing to happen.

There's only one way *into* my system (hardware or software additions).
But, I can only "see" what I can "see"; if I can't reliably recognize
a problem (intentional or accidental), then I can't do anything about it.

Once installed, I can refuse to allow a piece of code to execute if I
directly *observe* it doing something disallowed (e.g., accessing a
resource to which access hasn't been explicitly granted *in* the
installation configuration). And, I don't have to allow for a "user
override" of that decision -- just like I don't have to try to use a disk
that is known to be broken.

I can guard against these problems in the scripted applications
by ensuring my interpreter/compiler doesn't generate memory
access patterns that could run afoul in this way.

But, if I allow binaries to be installed (performance), I don't see any
way to limit the risk.
 
On 4/20/2023 7:00 AM, Don Y wrote:
jmp here+1
andi ax,MagicConstant

Where magic constant was the hex encoding of the missing instruction that was
unsupported by the inline assembler. In fact it turned out to be supported
(just undocumented - the disassembler decoded it OK). Knowing how it decoded
gave me the right mnemonic to code it.

On small (resource starved) processors, I often implement multitasking using
a structure like:


Main:
    Call task1
    Call task2
    Call task3
    Jmp Main

And, within any task:

    Call Yield
Here:

unwinds the stack (to get the address IN THE TASK at which the Call was
executed) and plugs the return address to replace the "taskX" address field
of the associated "Call" within Main. So, the next time around the loop,
"Call task1" may execute as "Call Here".

[This makes for incredibly fast, low-overhead context switches as the new PC
is "restored" directly *in* the invoking Call]

Doesn't work with XIP, though.
 
On 20/04/2023 09:42, Don Y wrote:
I don't see anything that I can do in hardware *or* software to
mitigate against such exploits.

Some people (including me) use BIOS settings on their PC to double (or
triple, even) the refresh rate of their RAM, which will help at the
expense of slowing things down a little, but it's difficult to determine
how much it helps.

--
Brian Gregory (in England).
 
On 20/04/2023 15:10, Don Y wrote:
On small (resource starved) processors, I often implement multitasking
using
a structure like:


Main:
    Call task1
    Call task2
    Call task3
    Jmp Main

And, within any task:

    Call Yield
Here:

unwinds the stack (to get the address IN THE TASK at which the Call was
executed) and plugs the return address to replace the "taskX" address field
of the associated "Call" within Main.  So, the next time around the loop,
"Call task1" may execute as "Call Here".

[This makes for incredibly fast, low-overhead context switches as the
new PC is "restored" directly *in* the invoking Call]

Doesn't work with XIP, though.

I can't see how that works. Do you save the task's stack somewhere too?

I'd just write a proper task switcher, either interrupt-driven or not as
suits the situation. It's usually no more than a handful of pages of C
and/or assembler.

--
Brian Gregory (in England).
 
On 4/20/2023 11:30 AM, Brian Gregory wrote:
On 20/04/2023 09:42, Don Y wrote:
I don't see anything that I can do in hardware *or* software to
mitigate against such exploits.

Some people (including me) use BIOS settings on their PC to double (or triple,
even) the refresh rate of their RAM, which will help at the expense of slowing
things down a little, but it's difficult to determine how much it helps.

Memtest has some tools that will check this. But, like similar tests,
a "pass" only means that it worked *now*...
 
On 4/20/2023 11:48 AM, Brian Gregory wrote:
On 20/04/2023 15:10, Don Y wrote:
On small (resource starved) processors, I often implement multitasking using
a structure like:


Main:
     Call task1
     Call task2
     Call task3
     Jmp Main

And, within any task:

     Call Yield
Here:

unwinds the stack (to get the address IN THE TASK at which the Call was
executed) and plugs the return address to replace the "taskX" address field
of the associated "Call" within Main.  So, the next time around the loop,
"Call task1" may execute as "Call Here".

[This makes for incredibly fast, low-overhead context switches as the new PC
is "restored" directly *in* the invoking Call]

Doesn't work with XIP, though.

I can't see how that works. Do you save the task's stack somewhere too?

In truly \"resource starved\" applications (think \"a few KB of RAM -- or less,
total\"), you don\'t have *space* for a stack per task. And, some really small
devices don\'t have an \"accessible\" stack that one can preserve (and restore)
as part of a context switch.

I'd just write a proper task switcher, either interrupt-driven or not as suits
the situation. It's usually no more than a handful of pages of C and/or assembler.

It's not that it is difficult. Rather, that it consumes runtime resources.

But, being able to break an application into smaller, concurrent pieces
is too valuable a technique to discard just because "you can't afford it".
 
On 20/04/2023 19:55, Don Y wrote:
On 4/20/2023 11:48 AM, Brian Gregory wrote:
On 20/04/2023 15:10, Don Y wrote:
On small (resource starved) processors, I often implement
multitasking using
a structure like:


Main:
     Call task1
     Call task2
     Call task3
     Jmp Main

And, within any task:

     Call Yield
Here:

unwinds the stack (to get the address IN THE TASK at which the Call was
executed) and plugs the return address to replace the "taskX" address field
of the associated "Call" within Main.  So, the next time around the loop,
"Call task1" may execute as "Call Here".

[This makes for incredibly fast, low-overhead context switches as the
new PC is "restored" directly *in* the invoking Call]

Doesn't work with XIP, though.

I can't see how that works. Do you save the task's stack somewhere too?

In truly \"resource starved\" applications (think \"a few KB of RAM -- or
less,
total\"), you don\'t have *space* for a stack per task.  And, some really
small
devices don\'t have an \"accessible\" stack that one can preserve (and
restore)
as part of a context switch.

I'd just write a proper task switcher, either interrupt-driven or not
as suits the situation. It's usually no more than a handful of pages
of C and/or assembler.

It's not that it is difficult.  Rather, that it consumes runtime resources.

But, being able to break an application into smaller, concurrent pieces
is too valuable a technique to discard just because "you can't afford it".

Then I don't get it. Unless maybe the calls to the tasks are just from
habit and could more straightforwardly be replaced by jumps or gotos.

--
Brian Gregory (in England).
 
On 20/04/2023 20:14, Brian Gregory wrote:
On 20/04/2023 19:55, Don Y wrote:
On 4/20/2023 11:48 AM, Brian Gregory wrote:
On 20/04/2023 15:10, Don Y wrote:
On small (resource starved) processors, I often implement
multitasking using
a structure like:


Main:
     Call task1
     Call task2
     Call task3
     Jmp Main

And, within any task:

     Call Yield
Here:

unwinds the stack (to get the address IN THE TASK at which the Call was
executed) and plugs the return address to replace the "taskX" address field
of the associated "Call" within Main.  So, the next time around the loop,
"Call task1" may execute as "Call Here".

[This makes for incredibly fast, low-overhead context switches as
the new PC is "restored" directly *in* the invoking Call]

Doesn't work with XIP, though.

I can't see how that works. Do you save the task's stack somewhere too?

In truly \"resource starved\" applications (think \"a few KB of RAM -- or
less,
total\"), you don\'t have *space* for a stack per task.  And, some
really small
devices don\'t have an \"accessible\" stack that one can preserve (and
restore)
as part of a context switch.

You can still have a stack per task provided that it is very small.

I'd just write a proper task switcher, either interrupt-driven or not
as suits the situation. It's usually no more than a handful of pages
of C and/or assembler.

It's not that it is difficult.  Rather, that it consumes runtime resources.

But, being able to break an application into smaller, concurrent pieces
is too valuable a technique to discard just because "you can't afford it".

Then I don't get it. Unless maybe the calls to the tasks are just from
habit and could more straightforwardly be replaced by jumps or gotos.

I think he is describing a scheme that I recognise as coroutines in
Modula-2, where diverse modules can be designed to share out CPU resources
equitably. It can work well provided that you do it carefully - tasks
that are not ready to run immediately pass control to the next in line.

My recollection was that you have to be careful if there are tasks which
allocate big chunks of memory interspersed with others that do small
allocations, or else the available single stack fragments over time.

TI9900 family CPUs were incredibly good at fast context switching since
the program counter and all 16 registers were in RAM. This worked fine
until the context pointer for the active process found itself pointing
at ROM, which doesn't increment too easily. I didn't appreciate at the
time how good it was until we tried to do the same task on a 68k.

--
Martin Brown
 
On 4/20/2023 12:14 PM, Brian Gregory wrote:
On 20/04/2023 19:55, Don Y wrote:
On 4/20/2023 11:48 AM, Brian Gregory wrote:
On 20/04/2023 15:10, Don Y wrote:
On small (resource starved) processors, I often implement multitasking using
a structure like:


Main:
     Call task1
     Call task2
     Call task3
     Jmp Main

And, within any task:

     Call Yield
Here:

unwinds the stack (to get the address IN THE TASK at which the Call was
executed) and plugs the return address to replace the "taskX" address field
of the associated "Call" within Main.  So, the next time around the loop,
"Call task1" may execute as "Call Here".

[This makes for incredibly fast, low-overhead context switches as the new PC
is "restored" directly *in* the invoking Call]

Doesn't work with XIP, though.

I can't see how that works. Do you save the task's stack somewhere too?

In truly "resource starved" applications (think "a few KB of RAM -- or less,
total"), you don't have *space* for a stack per task.  And, some really small
devices don't have an "accessible" stack that one can preserve (and restore)
as part of a context switch.

I'd just write a proper task switcher, either interrupt-driven or not as
suits the situation. It's usually no more than a handful of pages of C
and/or assembler.

It's not that it is difficult.  Rather, that it consumes runtime resources.

But, being able to break an application into smaller, concurrent pieces
is too valuable a technique to discard just because "you can't afford it".


Then I don't get it. Unless maybe the calls to the tasks are just from habit
and could more straightforwardly be replaced by jumps or gotos.

By CALLing a task, the location (in Main) of the call TO the task
is available on THE (one) stack.

Calling YIELD (within the body of *a* task) puts the location AFTER
the call to YIELD on the stack.

Main:
    Call task1
    Call task2
There:
    Call task3
    Jmp Main


Task2:
    ...
    Call YIELD
Here:
    ...

So, while *in* YIELD (for the example above), you have:

    Here   (return address for call to YIELD)
    There  (return address for call to Task2)

YIELD can extract "Here" from the ToS. It can then place
this value (address) *into* the instruction preceding
the location referenced by the next stack entry at "There".
(because YIELD knows the nature of the instructions that
caused it to be invoked, it knows how to patch that instruction)

Having done this, it can then set the PC to "There",
consuming both entries off the stack. This will
effectively continue onto the next task in Main.

Which can repeat this trick, at will.

This leaves Main as:

Main:
Call task1
Call Here <----------
There:
Call task3
Jmp Main

So, the next trip around Main will cause "Here" to be used
as the entrypoint for Task2 -- effectively letting task2
continue from the place after the YIELD.

Lather, rinse, repeat.

If you want to *suspend* a task, you simply modify the
opcode of the CALL to something benign -- but that also
preserves the value of the latest entry point stored
in that instruction.
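
If you wanted the same shape in C -- where patching a Call isn't
portable -- the nearest analogue keeps the "address field" in a function
pointer that the YIELD points update. A sketch with illustrative names,
not what the asm version actually does:

#include <stdio.h>

typedef void (*entry_t)(void);

static void task2_a(void);              /* forward declarations */
static void task2_b(void);

static entry_t task2_entry = task2_a;   /* the "Call task2" address field */

static void task2_a(void)
{
    puts("task2: before YIELD");
    task2_entry = task2_b;              /* the YIELD: patch the next entry */
}

static void task2_b(void)
{
    puts("task2: after YIELD");
    task2_entry = task2_a;              /* wrap around */
}

static void task1(void) { puts("task1"); }
static void task3(void) { puts("task3"); }

int main(void)
{
    for (;;) {                /* Main: Call task1/task2/task3, Jmp Main */
        task1();
        task2_entry();        /* dispatches to wherever task2 "is" now */
        task3();
    }
}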
 
On 4/20/2023 12:32 PM, Martin Brown wrote:
On 20/04/2023 20:14, Brian Gregory wrote:
On 20/04/2023 19:55, Don Y wrote:
On 4/20/2023 11:48 AM, Brian Gregory wrote:
On 20/04/2023 15:10, Don Y wrote:
On small (resource starved) processors, I often implement multitasking using
a structure like:


Main:
     Call task1
     Call task2
     Call task3
     Jmp Main

And, within any task:

     Call Yield
Here:

unwinds the stack (to get the address IN THE TASK at which the Call was
executed) and plugs the return address to replace the "taskX" address field
of the associated "Call" within Main.  So, the next time around the loop,
"Call task1" may execute as "Call Here".

[This makes for incredibly fast, low-overhead context switches as the new PC
is "restored" directly *in* the invoking Call]

Doesn't work with XIP, though.

I can't see how that works. Do you save the task's stack somewhere too?

In truly "resource starved" applications (think "a few KB of RAM -- or less,
total"), you don't have *space* for a stack per task.  And, some really small
devices don't have an "accessible" stack that one can preserve (and restore)
as part of a context switch.

You can still have a stack per task provided that it is very small.

Of course. But, if you want to preserve the entire state of the
task (including state of stack), it doesn't take many tasks to consume
a lot of RAM.

If, OTOH, you opt not to preserve *anything* (other than PC), you
can have a reasonably large stack shared by all (recall that the
stack must handle the worst case penetration -- plus ISRs -- that
each task might require!)

I'd just write a proper task switcher, either interrupt-driven or not as
suits the situation. It's usually no more than a handful of pages of C
and/or assembler.

It's not that it is difficult.  Rather, that it consumes runtime resources.

But, being able to break an application into smaller, concurrent pieces
is too valuable a technique to discard just because "you can't afford it".

Then I don't get it. Unless maybe the calls to the tasks are just from habit
and could more straightforwardly be replaced by jumps or gotos.

I think he is describing a scheme that I recognise as coroutines in Modula-2,
where diverse modules can be designed to share out CPU resources equitably. It
can work well provided that you do it carefully - tasks that are not ready to
run immediately pass control to the next in line.

No. This is a non-preemptive form of multitasking -- but, one that doesn't
(necessarily) preserve any task state. This places limits on how you can code
the tasks as the stack must remain at the same level of penetration for *any*
YIELD operation (because it relies on being able to unwind the stack to figure
out how to patch Main).

But, it is surprisingly effective (for the sorts of applications that
use resource starved processors).

Because there is so little overhead to a task switch (a handful of opcodes),
you don't hesitate to liberally pepper your code with YIELDs. So, the time
around the loop can be made relatively short -- which keeps latency to any
given task low.

If you want to give more CPU to a particular task after the fact,
you can wrap that task in its own little subroutine and invoke that
multiple times from Main:

Main:
    Call TaskA
    Call Task2
    Call TaskA
    Call Task3
    Call TaskA
    Jmp Main

TaskA:
    Call Task1
    Ret


Task1:
    do_something
    YIELD
Here:
    do_something_else
    YIELD
There:
    do_another_thing
    YIELD

So, the first time TaskA is invoked, it runs "Task1" up to the
first YIELD which then rewrites TaskA to be:

    Call Here
    Ret

The second time it is invoked, Task1 resumes at the "Here"
entry point. Once it reaches the second YIELD, it is again
rewritten to be:

    Call There
    Ret

Etc. As such, it gets invoked three times per Main iteration
while Task2 and Task3 only see one invocation per loop.

You can embellish this to allow arbitrary weightings:

Main:
    Call Level1
    Call Level2
    Call Level2
    Call Level4
    Call Level4
    Call Level4
    Call Level4
    Jmp Main

The tasks called from Level1 will be invoked once per iteration;
Level2, twice; Level4 four times. Placing a task in Level1 & Level4
means the task gets invoked 5 times per iteration.

[There are slicker ways of doing this but the point is that you
can reapportion processor time to tasks AFTER they have been
written to tune performance/maximize throughput/minimize latency/etc.
WITHOUT having to go through the code and add or subtract YIELDs]

My recollection was that you have to be careful if there are tasks which
allocate big chunks of memory interspersed with others that do small
allocations, or else the available single stack fragments over time.

No heap. No saved state. If you want to save *something*, save it
as a static.

This is a big kluge. But, it lets you avoid big messes of spaghetti
code that would otherwise be needed for even the simplest applications.

E.g., design an egg timer: show time remaining, blink the colon at
1 Hz, monitor the buttons/controls (stop/pause/reset), sound annunciator
when expired. A relatively trivial goal but quickly leads to messy code if
you have to refresh a multiplexed display, debounce the buttons, blink
the colon, update time remaining, etc.

[Imagine running a few bidir UARTs *and* doing something "meaningful"
without multitasking hooks!]

TI9900 family CPUs were incredibly good at fast context switching since the
program counter and all 16 registers were in RAM. This worked fine until the
context pointer for the active process found itself pointing at ROM, which
doesn't increment too easily. I didn't appreciate at the time how good it was
until we tried to do the same task on a 68k.

Yes. Workspaces in RAM was a great feature. But, history has shown that
memory is now a bottleneck. So, moving things out of the CPU is usually
a step down in performance.

Better to have a small RAM in the CPU and just select "register banks".

[I mentioned this to Guttag when he was pitching the device to us. He
didn't seem to think CPU/memory tradeoffs were headed the way we did!]
 
On 4/20/2023 3:53 PM, Don Y wrote:
This is a big kluge.  But, it lets you avoid big messes of spaghetti
code that would otherwise be needed for even the simplest applications.

E.g., design an egg timer:  show time remaining, blink the colon at
1 Hz, monitor the buttons/controls (stop/pause/reset), sound annunciator
when expired.  A relatively trivial goal but quickly leads to messy code if
you have to refresh a multiplexed display, debounce the buttons, blink
the colon, update time remaining, etc.

[Imagine running a few bidir UARTs *and* doing something "meaningful"
without multitasking hooks!]

Here's (pseudocode) a routine to flash an indicator:

while (FOREVER) {
    timer7 = half_second
    indicator = ON
    while (timer7) {
        YIELD
    }
    timer7 = half_second
    indicator = OFF
    while (timer7) {
        YIELD
    }
}

Some OTHER *task* decrements every "timer*" at some periodic rate,
clamping each value at 0 once it reaches that point.

There's no need to access the timer*'s atomically -- because each
timer access happens entirely while a task retains control of the
processor. The "state" of the process is encoded in the PC;
the (inner) while loop that is executing determines whether the indicator
is on or off -- "indicator" can be a write-only hardware latch
and need not have readback capability.

And the indicator only needs to be turned on/off *once* (per cycle),
not continuously.

Want a 60/40 duty cycle? Change the first "half_second" to three_fifth_second
and the second "half_second" to two_fifth_second.
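
The timer task itself is trivial -- something like this, run once per
tick (tick_elapsed() is a made-up name for whatever your tick source is):

extern int tick_elapsed(void);            /* hypothetical tick source */

#define NTIMERS 8
volatile unsigned char timers[NTIMERS];   /* timer7 is timers[7] */

void timer_task(void)        /* invoked from Main like any other task */
{
    if (!tick_elapsed())     /* nothing to do until the tick fires */
        return;
    for (int i = 0; i < NTIMERS; i++)
        if (timers[i])
            timers[i]--;     /* decrement, clamping at zero */
}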
 
On 2023-04-20, Brian Gregory <void-invalid-dead-dontuse@email.invalid> wrote:
On 20/04/2023 19:55, Don Y wrote:
On 4/20/2023 11:48 AM, Brian Gregory wrote:
On 20/04/2023 15:10, Don Y wrote:
On small (resource starved) processors, I often implement
multitasking using
a structure like:


Main:
     Call task1
     Call task2
     Call task3
     Jmp Main

And, within any task:

     Call Yield
Here:

unwinds the stack (to get the address IN THE TASK at which the Call was
executed) and plugs the return address to replace the "taskX" address field
of the associated "Call" within Main.  So, the next time around the loop,
"Call task1" may execute as "Call Here".

[This makes for incredibly fast, low-overhead context switches as the
new PC is "restored" directly *in* the invoking Call]

Doesn't work with XIP, though.

I can't see how that works. Do you save the task's stack somewhere too?

In truly \"resource starved\" applications (think \"a few KB of RAM -- or
less,
total\"), you don\'t have *space* for a stack per task.  And, some really
small
devices don\'t have an \"accessible\" stack that one can preserve (and
restore)
as part of a context switch.

I'd just write a proper task switcher, either interrupt-driven or not
as suits the situation. It's usually no more than a handful of pages
of C and/or assembler.

It's not that it is difficult.  Rather, that it consumes runtime resources.

But, being able to break an application into smaller, concurrent pieces
is too valuable a technique to discard just because "you can't afford it".


Then I don't get it. Unless maybe the calls to the tasks are just from
habit and could more straightforwardly be replaced by jumps or gotos.

The use of call puts the address of the instruction after the call on
the stack from where it can be fetched by the yield() call.

I'm guessing that Yield() pops the return address off the stack and saves
it in a register, then pops the next return address off the stack, subtracts
some constant, puts it in a pointer register, and then saves the
first register there, adds the constant back on, and does a jump to that
address.


--
Jasen.
🇺🇦 Glory to Ukraine
 
On 4/20/2023 10:35 PM, Jasen Betts wrote:
On 2023-04-20, Brian Gregory <void-invalid-dead-dontuse@email.invalid> wrote:
On 20/04/2023 19:55, Don Y wrote:
On 4/20/2023 11:48 AM, Brian Gregory wrote:
On 20/04/2023 15:10, Don Y wrote:
On small (resource starved) processors, I often implement
multitasking using
a structure like:


Main:
     Call task1
     Call task2
     Call task3
     Jmp Main

And, within any task:

     Call Yield
Here:

unwinds the stack (to get the address IN THE TASK at which the Call was
executed) and plugs the return address to replace the "taskX" address field
of the associated "Call" within Main.  So, the next time around the loop,
"Call task1" may execute as "Call Here".

[This makes for incredibly fast, low-overhead context switches as the
new PC is "restored" directly *in* the invoking Call]

Doesn't work with XIP, though.

I can't see how that works. Do you save the task's stack somewhere too?

In truly \"resource starved\" applications (think \"a few KB of RAM -- or
less,
total\"), you don\'t have *space* for a stack per task.  And, some really
small
devices don\'t have an \"accessible\" stack that one can preserve (and
restore)
as part of a context switch.

I'd just write a proper task switcher, either interrupt-driven or not
as suits the situation. It's usually no more than a handful of pages
of C and/or assembler.

It's not that it is difficult.  Rather, that it consumes runtime resources.

But, being able to break an application into smaller, concurrent pieces
is too valuable a technique to discard just because "you can't afford it".


Then I don't get it. Unless maybe the calls to the tasks are just from
habit and could more straightforwardly be replaced by jumps or gotos.

The use of call puts the address of the instruction after the call on
the stack from where it can be fetched by the yield() call.

I'm guessing that Yield() pops the return address off the stack and saves
it in a register, then pops the next return address off the stack, subtracts
some constant, puts it in a pointer register, and then saves the
first register there,

Or, encodes the contents of the first register into a valid instruction
that references that location -- depends on the specifics of the
instruction set encoding.

The point is, the only bit of task state that is preserved is
the PC (for the location after the yield) AND that it is
preserved in an instruction that can efficiently dispatch
to that address when next encountered.

[Storing the entire state of a task typically has the
PC "restored" as the last step in reestablishing the
task's state just prior to next invocation with the
PC value (as well as the rest of the state) retrieved
from some TCB structure in general purpose memory
(here, it is stored in "program memory")]

adds the constant back on, and does a jump to that
address.

Because you don't have to restore any of the rest of the
machine state, you are free to make that "jump" in whatever
means is easiest (consuming the entire machine state in the
process, if necessary)
 
On 4/20/2023 10:35 PM, Jasen Betts wrote:
The use of call puts the address of the instruction after the call on
the stack from where it can be fetched by the yield() call.

Note that this is just one way to implement such a skeletal
multitasking hack.

If you assume the code is immutable, then you could just as
easily implement YIELD as a macro that generates inline
code that specifically stores the PC at that point into
a variable associated with whichever task it happens to be
encountered in.

[This assumes that YIELDs are only done from task-level code
and not something else (e.g., a common subr) *invoked* by a task]
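
In GNU C you can get surprisingly close to that with labels-as-values
and computed goto -- a GCC/Clang-specific sketch, with illustrative names:

#include <stdio.h>

#define CAT2(a, b) a##b
#define CAT(a, b)  CAT2(a, b)
/* Record the resume point in the task's saved "PC", then return. */
#define YIELD(pc)  do { pc = &&CAT(yield_, __LINE__); return; \
                        CAT(yield_, __LINE__): ; } while (0)

static void *task1_pc;          /* NULL on first entry */

void task1(void)
{
    if (task1_pc)
        goto *task1_pc;         /* resume just after the last YIELD */
    for (;;) {
        puts("phase A");
        YIELD(task1_pc);
        puts("phase B");
        YIELD(task1_pc);
    }
}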
 
On 21/04/2023 00:06, Don Y wrote:
On 4/20/2023 3:53 PM, Don Y wrote:
This is a big kluge.  But, it lets you avoid big messes of spaghetti
code that would otherwise be needed for even the simplest applications.

E.g., design an egg timer:  show time remaining, blink the colon at
1 Hz, monitor the buttons/controls (stop/pause/reset), sound annunciator
when expired.  A relatively trivial goal but quickly leads to messy code if
you have to refresh a multiplexed display, debounce the buttons, blink
the colon, update time remaining, etc.

[Imagine running a few bidir UARTs *and* doing something "meaningful"
without multitasking hooks!]

Here's (pseudocode) a routine to flash an indicator:

while (FOREVER) {
     timer7 = half_second
     indicator = ON
     while (timer7) {
         YIELD
     }
     timer7 = half_second
     indicator = OFF
     while (timer7) {
         YIELD
     }
}

Some OTHER *task* decrements every "timer*" at some periodic rate,
clamping each value at 0 once it reaches that point.

There's no need to access the timer*'s atomically -- because each
timer access happens entirely while a task retains control of the
processor.  The "state" of the process is encoded in the PC;
the (inner) while loop that is executing determines whether the indicator
is on or off -- "indicator" can be a write-only hardware latch
and need not have readback capability.

And the indicator only needs to be turned on/off *once* (per cycle),
not continuously.

Want a 60/40 duty cycle?  Change the first "half_second" to
three_fifth_second and the second "half_second" to two_fifth_second.

This all makes perfect sense and is in line with techniques I have used
on occasion, except this:

Main:
    Call task1
    Call task2
    Call task3
    Jmp Main

--
Brian Gregory (in England).
 
