John Larkin
On Fri, 24 Feb 2023 12:32:07 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:
On 2/24/2023 8:49 AM, Martin Brown wrote:
Anyone who is serious about timing code knows how to read the free-running
system clock. RDTSC in Intel CPUs is very handy (even if they warn against
using it for this purpose, it works very well).
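For anyone who hasn't played with it, a minimal sketch using the GCC/Clang
intrinsics on x86 (__rdtsc() plus lfence to tamp down out-of-order noise)
looks roughly like this:

#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(), _mm_lfence() */

/* Return elapsed TSC ticks around one call to work(). */
static uint64_t ticks_for(void (*work)(void))
{
    _mm_lfence();                 /* keep earlier work from bleeding in */
    uint64_t start = __rdtsc();
    work();
    _mm_lfence();                 /* order the second read after work() */
    uint64_t stop = __rdtsc();
    return stop - start;
}

Raw ticks are enough for comparing two versions of a routine; converting to
wall-clock time needs the TSC frequency.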
But, if you are running code on anything other than bare metal, you are
seeing a composite effect from the OS, other co-executing tasks, etc.
As most of the code I've written is RT (all of the products I've designed
but none of the desktop tools I've written), I have to have a good handle
on how an algorithm will perform, in the time domain, over the range of
different operating conditions. Hence, the *design* of the algorithm
becomes keenly important.
And, while there are hardware mechanisms (on SOME processors and to varying
degrees) that act as performance enhancers, there are also things that
work to defeat these mechanisms (e.g., different tasks competing for cache
lines).
In a desktop environment, the jiffy is an eternity; not so in embedded
designs. So, while a desktop app can run "long enough" to make effective
use of some of these mechanisms, an embedded one may actually just be
perpetually "muddying the waters". (if a line of YOUR task's code gets
installed in the cache, but pieces of 30 other tasks run before you get
another opportunity, how likely is the cache to still be "warm", from
YOUR point of view?)
Other CPUs have equivalent performance monitoring registers although they may
be hidden in some fine print in dark recesses of the manual.
You often need to understand more than the CPU to be able to guesstimate
(or measure) performance. E.g., at the most abstract implementation
levels in my design, users write "scripts" in a JITed language (Limbo).
As it does a lot of dynamic object allocation, behind the scenes, the
GC has to run periodically.
So, if you happen to measure performance of an algorithm -- but the GC ran
during some number of those iterations -- your results can be difficult
to interpret; what's the performance of the algorithm vs. that of the
supporting *system*?
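One crude way to separate the two (just a sketch, not from any particular
project): time the same call many times and report the minimum alongside the
median. The minimum is close to the algorithm by itself; everything above it
is what the supporting system (GC, scheduler, cache traffic) is adding.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp_double(const void *a, const void *b)
{
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

/* Time work() 'runs' times and print min / median / max seconds. */
static void profile(void (*work)(void), int runs)
{
    double *t = malloc(runs * sizeof *t);
    for (int i = 0; i < runs; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        work();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        t[i] = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }
    qsort(t, runs, sizeof *t, cmp_double);
    printf("min %.9f  median %.9f  max %.9f\n", t[0], t[runs / 2], t[runs - 1]);
    free(t);
}

If the gap between the minimum and the median blows up when the GC kicks in,
at least you can see it happening.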
These days most binary operations are single cycle and potentially less if
there are subexpressions that have no interdependencies. Divides are still a
lot slower. This makes Padé 5,4 a good choice for rational approximations on
current Intel CPUs: the numerator and denominator evaluate in parallel (the
physical hardware is that good) and some of the time for the divide is lost
along the way.
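In code it looks something like this (the coefficients below are placeholders,
not a real approximation - the point is that the two Horner chains have no
dependency on each other, so the core evaluates them side by side and only the
final divide is serial):

/* Pade(5,4)-style rational approximation, f(x) ~= P5(x) / Q4(x).
   The c[] and d[] values are dummies, for illustration only. */
static double pade_5_4(double x)
{
    static const double c[6] = { 1.0, 0.5, 0.1, 0.01, 0.001, 0.0001 };
    static const double d[5] = { 1.0, 0.3, 0.03, 0.002, 0.0001 };

    /* Two independent Horner chains - nothing in 'den' waits on 'num'. */
    double num = ((((c[5]*x + c[4])*x + c[3])*x + c[2])*x + c[1])*x + c[0];
    double den =  (((d[4]*x + d[3])*x + d[2])*x + d[1])*x + d[0];

    return num / den;   /* the only long-latency, serial step */
}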
But, that's still only for "bigger" processors. Folks living in 8051-land
are still looking at subroutines to perform math operations (in anything
larger than a byte).
In the old days we were warned to avoid conditional branches, but today you can
get away with them, and sometimes active loop unrolling will work against you.
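For instance, the hand-rolled branch-free idioms we used to reach for (a clamp,
sketched from memory below) now often buy nothing over the obvious code,
because the predictor gets a well-behaved branch right almost every time and
the compiler will emit conditional moves on its own:

/* Obvious version - the branch predictor usually handles this fine. */
static int clamp_branchy(int v, int lo, int hi)
{
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}

/* Old-school "branch-free" version (assumes no overflow in lo-v, hi-v);
   rarely worth the obscurity on a modern core. */
static int clamp_branchless(int v, int lo, int hi)
{
    v = v + ((lo - v) & -(v < lo));   /* v = max(v, lo) */
    v = v + ((hi - v) & -(v > hi));   /* v = min(v, hi) */
    return v;
}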
If it has to be as fast as possible then you need to understand every last
quirk of the target CPU.
And, then you are intimately bound to THAT target. You have to rethink your
implementation when you want (or need!) to move to another (economic reasons,
parts availability, etc.).
I find it best (embedded systems) to just design with good algorithms,
understand how their performance can vary (and try to avoid application
in those cases where it will!) and rely on the RTOS to ensure the
important stuff gets done in a timely manner.
If your notion of how your "box" works is largely static, then you will
likely end up with more hardware than you need. If you are cost conscious,
then you're at a disadvantage if you can't decouple things that MUST
get done, now, from things that you would LIKE to get done, now.
[This is the folly that most folks fall into with HRT designs; they
overspecify the hardware because they haven't considered how to
address missed deadlines. Because many things that they want to think
of as being "hard" deadlines really aren't.]
How fast things will go in practice can only be determined today by putting all
of the pieces together and seeing how fast they run.
Exactly. Then, wonder what might happen if something gets *added* to the
mix (either a future product enhancement -- but same old hardware -- or
interactions with external agencies in ways that you hadn't anticipated).
[IMO, this is where the future of embedded systems lies -- especially with
IoT. Folks are going to think of all sorts of ways to leverage functionality
in devices B and F to enhance the performance of a system neither the B nor F
designers had envisioned at their respective design times!]
[[E.g., I use an EXISTING security camera to determine when the mail has been
delivered. Why try to "wire" the mailbox with some sort of "mail sensor"??]]
Benchmarks can be misleading too. They don't tell you how the component will
behave in the environment where you will actually use it.
Yes. So, "published" benchmarks always have to be taken with a grain
of salt.
In the 70/80's (heyday of MPU diversity), vendors all had their own
favorite benchmarks that they would use to tout how great THEIR
product was, vs. their competitor(s) -- cuz they all wanted design-ins.
We'd run our own benchmarks and make OUR decisions based on how
the product performed running the sorts of code *we* wanted to run,
at the sorts of hardware COSTS that we wanted to afford.
Even C is becoming difficult, in some cases, to 'second guess'.
And, ASM isn't immune, as the hardware is evolving to provide
performance-enhancing features that often can't be quantified
at design/compile time.
I have a little puzzle on that one too. I have some verified correct code for
cube root running on x87 and SSE/AVX hardware and, when benchmarked
aggressively for blocks of 10^7 cycles, it gets progressively faster - by as
much as a factor of two. I know that others have seen this effect too, but it
only happens sometimes - usually on dense, frequently executed x87 code. These
are cube root benchmarks:
How have you eliminated the effects of the rest of the host system
from your measurements? Are you sure other tasks aren't muddying
the cache? Or, forcing pages out of RAM? (Or, *allowing* you better
access to cache and physical memory?)
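On a Linux-ish host the usual first step - and it only takes the scheduler out
of the picture, not cache or paging pressure from everything else - is
something like this sketch:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to one core and raise it to SCHED_FIFO so the
   ordinary time-sharing tasks can't preempt it mid-measurement. */
static int isolate_for_benchmark(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof set, &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }

    struct sched_param sp = { .sched_priority = 50 };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");   /* needs root or CAP_SYS_NICE */
        return -1;
    }
    return 0;
}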
The system cbrt routines on GNU and MS are best avoided entirely - both are
amongst the slowest and least accurate of all those I have ever tested. The
best public algorithm for general use is the one Sun wrote for their Unix
library. It is a variant of Kahan's magic constant hack. BSD's is slower but
OK.
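For flavour, the general shape of that family of routines is below. This is
NOT Sun's code and not Kahan's constants - just a sketch of the "cheap seed,
then polish" structure, with frexp() standing in for the integer
magic-constant trick and more Newton steps than a tuned version would need:

#include <math.h>

/* Toy cube root: crude seed from the exponent, then Newton steps.
   Real implementations seed with the integer "magic constant" trick
   and get away with fewer, carefully analysed, iterations. */
static double cbrt_sketch(double x)
{
    if (x == 0.0 || !isfinite(x))
        return x;                            /* 0, inf, NaN pass through */

    double ax = fabs(x);
    int e;
    double m = frexp(ax, &e);                /* ax = m * 2^e, m in [0.5, 1) */
    double t = ldexp(0.7 + 0.3 * m, e / 3);  /* rough seed near ax^(1/3) */

    for (int i = 0; i < 6; i++)              /* Newton: t = (2t + ax/t^2)/3 */
        t = (2.0 * t + ax / (t * t)) / 3.0;

    return x < 0.0 ? -t : t;
}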
Most applications aren't scraping around in the mud, trying to eke out a
few processor cycles. Instead, their developers are more worried about
correctness and long-term "support"; I have algorithms that start with
large walls of commentary that essentially say: "Here There Be Dragons.
Don't muck with this unless you completely understand all of the following
documentation and the reasons behind each of the decisions made in THIS
implementation."
[So, when some idiot DOES muck with it, I can just grin and say, "Well, I
guess YOU'LL have to figure out what you did wrong, eh? Or, figure out how
to restore the original implementation and salvage any other hacks you
may have needed."]
We don't need no stinkin' OS!
You may not think you need one but you do need a way to share the physical
resources between the tasks that want to use them.
Single-threaded and trivial designs can usually live without an OS.
I still see people using "superloops" (which should have gone away
in the 70's as they are brittle to maintain).
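For anyone who hasn't met the term, a "superloop" is just this shape: every
handler polled in one endless round-robin, with timing that is whatever it
happens to be this week. The function names below are made up, obviously.

/* The classic "superloop": no scheduler, just poll everything forever.
   (All of these handlers are hypothetical; declarations only.) */
void hw_init(void);
void poll_uart(void);
void poll_keypad(void);
void update_display(void);
void run_control_loop(void);

int main(void)
{
    hw_init();
    for (;;) {
        poll_uart();            /* every handler must return quickly...  */
        poll_keypad();
        update_display();
        run_control_loop();     /* ...or everyone else's timing slips    */
    }
}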
But, you've really constrained a product's design if it doesn't
exploit the abstractions that an OS supplies.
I built a box, some years ago, and recreated a UNIX-like API
for the application. Client was blown away when I was able to
provide the same functionality that was available on the "console
port" (RS232) on ALL the ports. Simultaneously. By adding *one*
line of code for each port!
Cuz, you know, sooner or later something like that would be on
the wish list and a more naive implementation would lead to
a costly refactoring.
Cooperative multitasking can be done with interrupts for the realtime IO but
life is a lot easier with a bit of help from an RTOS.
Especially if you truly are operating with timeliness constraints.
How do you know that each task met its deadline? Can you even quantify
them? What do you do when a task *misses* its deadline? How do you
schedule your tasks' execution to optimize resource utilization?
What happens when you add a task? Or, change a task's workload?
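Even bookkeeping as crude as the sketch below answers the first of those
questions and gives you a hook for deciding what a miss should actually *do*.
(now_ms() and deadline_miss_handler() are stand-ins for whatever monotonic
clock and policy hook your RTOS gives you.)

/* Hypothetical per-task deadline bookkeeping: timestamp the release,
   check at completion, count and report the misses. */
struct task_stats {
    unsigned deadline_ms;       /* relative deadline for one activation */
    unsigned worst_ms;          /* worst observed response time         */
    unsigned misses;            /* how often the deadline was blown     */
};

extern unsigned now_ms(void);                           /* assumed RTOS tick   */
extern void deadline_miss_handler(struct task_stats *); /* assumed policy hook */

void run_one_activation(struct task_stats *ts, void (*job)(void))
{
    unsigned release = now_ms();
    job();
    unsigned elapsed = now_ms() - release;

    if (elapsed > ts->worst_ms)
        ts->worst_ms = elapsed;
    if (elapsed > ts->deadline_ms) {
        ts->misses++;
        deadline_miss_handler(ts);      /* shed load, log, degrade, ... */
    }
}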
The future is lots of CPUs. Task switching and CPU contention are so
last millennium. Each process can just own a CPU or three.