John Larkin
On Fri, 24 Feb 2023 12:32:07 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:
On 2/24/2023 8:49 AM, Martin Brown wrote:
Anyone who is serious about timing code knows how to read the free-running
system clock. RDTSC in Intel CPUs is very handy (even if they warn against
using it for this purpose, it works very well).
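For anyone who hasn't played with it, a minimal sketch using the GCC/Clang
intrinsics on x86 (__rdtsc() plus lfence to tamp down out-of-order noise)
looks roughly like this:

#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc(), _mm_lfence() */

/* Return elapsed TSC ticks around one call to work(). */
static uint64_t ticks_for(void (*work)(void))
{
    _mm_lfence();                 /* keep earlier work from bleeding in */
    uint64_t start = __rdtsc();
    work();
    _mm_lfence();                 /* order the second read after work() */
    uint64_t stop = __rdtsc();
    return stop - start;
}

Raw ticks are enough for comparing two versions of a routine; converting to
wall-clock time needs the TSC frequency.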
But, if you are running code on anything other than bare metal, you are
seeing a composite effect from the OS, other co-executing tasks, etc.
As most of the code I've written is RT (all of the products I've designed
but none of the desktop tools I've written), I have to have a good handle
on how an algorithm will perform, in the time domain, over the range of
different operating conditions. Hence, the *design* of the algorithm
becomes keenly important.
And, while there are hardware mechanisms (on SOME processors and to varying
degrees) that act as performance enhancers, there are also things that
work to defeat these mechanisms (e.g., different tasks competing for cache
lines).
In a desktop environment, the jiffy is an eternity; not so in embedded
designs. So, while a desktop app can run "long enough" to make effective
use of some of these mechanisms, an embedded one may actually just be
perpetually "muddying the waters". (if a line of YOUR task's code gets
installed in the cache, but pieces of 30 other tasks run before you get
another opportunity, how likely is the cache to still be "warm", from
YOUR point of view?)
Other CPUs have equivalent performance monitoring registers although they may
be hidden in some fine print in dark recesses of the manual.
You often need to understand more than the CPU to be able to guesstimate
(or measure) performance. E.g., at the most abstract implementation
levels in my design, users write "scripts" in a JITed language (Limbo).
As it does a lot of dynamic object allocation, behind the scenes, the
GC has to run periodically.
So, if you happen to measure performance of an algorithm -- but the GC ran
during some number of those iterations -- your results can be difficult
to interpret; what's the performance of the algorithm vs. that of the
supporting *system*?
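One crude way to separate the two (just a sketch, not from any particular
project): time the same call many times and report the minimum alongside the
median. The minimum is close to the algorithm by itself; everything above it
is what the supporting system (GC, scheduler, cache traffic) is adding.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int cmp_double(const void *a, const void *b)
{
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

/* Time work() 'runs' times and print min / median / max seconds. */
static void profile(void (*work)(void), int runs)
{
    double *t = malloc(runs * sizeof *t);
    for (int i = 0; i < runs; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        work();
        clock_gettime(CLOCK_MONOTONIC, &t1);
        t[i] = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }
    qsort(t, runs, sizeof *t, cmp_double);
    printf("min %.9f  median %.9f  max %.9f\n", t[0], t[runs / 2], t[runs - 1]);
    free(t);
}

If the gap between the minimum and the median blows up when the GC kicks in,
at least you can see it happening.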
These days most binary operations are single cycle and potentially less if
there are subexpressions that have no interdependencies. Divides are still a
lot slower. This makes Padé 5,4 a good choice for rational approximations on
current Intel CPUs: the numerator and denominator evaluate in parallel (the
physical hardware is that good) and some of the time for the divide is lost
along the way.
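In code it looks something like this (the coefficients below are placeholders,
not a real approximation - the point is that the two Horner chains have no
dependency on each other, so the core evaluates them side by side and only the
final divide is serial):

/* Pade(5,4)-style rational approximation, f(x) ~= P5(x) / Q4(x).
   The c[] and d[] values are dummies, for illustration only. */
static double pade_5_4(double x)
{
    static const double c[6] = { 1.0, 0.5, 0.1, 0.01, 0.001, 0.0001 };
    static const double d[5] = { 1.0, 0.3, 0.03, 0.002, 0.0001 };

    /* Two independent Horner chains - nothing in 'den' waits on 'num'. */
    double num = ((((c[5]*x + c[4])*x + c[3])*x + c[2])*x + c[1])*x + c[0];
    double den =  (((d[4]*x + d[3])*x + d[2])*x + d[1])*x + d[0];

    return num / den;   /* the only long-latency, serial step */
}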
But, that's still only for "bigger" processors. Folks living in 8051-land
are still looking at subroutines to perform math operations (in anything
larger than a byte).
In the old days we were warned to avoid conditional branches, but today you can
get away with them, and sometimes active loop unrolling will work against you.
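For instance, the hand-rolled branch-free idioms we used to reach for (a clamp,
sketched from memory below) now often buy nothing over the obvious code,
because the predictor gets a well-behaved branch right almost every time and
the compiler will emit conditional moves on its own:

/* Obvious version - the branch predictor usually handles this fine. */
static int clamp_branchy(int v, int lo, int hi)
{
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}

/* Old-school "branch-free" version (assumes no overflow in lo-v, hi-v);
   rarely worth the obscurity on a modern core. */
static int clamp_branchless(int v, int lo, int hi)
{
    v = v + ((lo - v) & -(v < lo));   /* v = max(v, lo) */
    v = v + ((hi - v) & -(v > hi));   /* v = min(v, hi) */
    return v;
}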
If it has to be as fast as possible then you need to understand every last
quirk of the target CPU.
And, then you are intimately bound to THAT target. You have to rethink your
implementation when you want (or need!) to move to another (economic reasons,
parts availability, etc.).
I find it best (embedded systems) to just design with good algorithms,
understand how their performance can vary (and try to avoid application
in those cases where it will!) and rely on the RTOS to ensure the
important stuff gets done in a timely manner.
If your notion of how your "box" works is largely static, then you will
likely end up with more hardware than you need. If you are cost conscious,
then you're at a disadvantage if you can't decouple things that MUST
get done, now, from things that you would LIKE to get done, now.
[This is the folly that most folks fall into with HRT designs; they
overspecify the hardware because they haven't considered how to
address missed deadlines. Because many things that they want to think
of as being "hard" deadlines really aren't.]
How fast things will go in practice can only be determined today by putting all
of the pieces together and seeing how fast they run.
Exactly. Then, wonder what might happen if something gets *added* to the
mix (either a future product enhancement -- but same old hardware -- or
interactions with external agencies in ways that you hadn't anticipated).
[IMO, this is where the future of embedded systems lies -- especially with
IoT. Folks are going to think of all sorts of ways to leverage functionality
in devices B and F to enhance the performance of a system neither the B nor F
designers had envisioned at their respective design times!]
[[E.g., I use an EXISTING security camera to determine when the mail has been
delivered. Why try to "wire" the mailbox with some sort of "mail sensor"??]]
Benchmarks can be misleading too. They don't tell you how the component will
behave in the environment where you will actually use it.
Yes. So, "published" benchmarks always have to be taken with a grain
of salt.
In the 70/80's (heyday of MPU diversity), vendors all had their own
favorite benchmarks that they would use to tout how great THEIR
product was, vs. their competitor(s) -- cuz they all wanted design-ins.
We'd run our own benchmarks and make OUR decisions based on how
the product performed running the sorts of code *we* wanted to run,
at the sorts of hardware COSTS that we wanted to afford.
Even C is becoming difficult, in some cases, to 'second guess'.
And, ASM isn't immune, as the hardware is evolving to provide
performance-enhancing features that often can't be quantified
at design/compile time.
I have a little puzzle on that one too. I have some verified correct code for
cube root running on x87 and SSE/AVX hardware and, when benchmarked
aggressively for blocks of 10^7 cycles, it gets progressively faster - by as
much as a factor of two. I know that others have seen this effect too, but it
only happens sometimes - usually on dense, frequently executed x87 code. These
are cube root benchmarks:
How have you eliminated the effects of the rest of the host system
from your measurements? Are you sure other tasks aren't muddying
the cache? Or, forcing pages out of RAM? (Or, *allowing* you better
access to cache and physical memory?)
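On a Linux-ish host the usual first step - and it only takes the scheduler out
of the picture, not cache or paging pressure from everything else - is
something like this sketch:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to one core and raise it to SCHED_FIFO so the
   ordinary time-sharing tasks can't preempt it mid-measurement. */
static int isolate_for_benchmark(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof set, &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }

    struct sched_param sp = { .sched_priority = 50 };
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");   /* needs root or CAP_SYS_NICE */
        return -1;
    }
    return 0;
}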
The system cbrt routines on GNU and MS are best avoided entirely - both are
amongst the slowest and least accurate of all those I have ever tested. The
best public algorithm for general use is the one Sun wrote for their Unix
library. It is a variant of Kahan's magic constant hack. BSD's is slower but
OK.
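For flavour, the general shape of that family of routines is below. This is
NOT Sun's code and not Kahan's constants - just a sketch of the "cheap seed,
then polish" structure, with frexp() standing in for the integer
magic-constant trick and more Newton steps than a tuned version would need:

#include <math.h>

/* Toy cube root: crude seed from the exponent, then Newton steps.
   Real implementations seed with the integer "magic constant" trick
   and get away with fewer, carefully analysed, iterations. */
static double cbrt_sketch(double x)
{
    if (x == 0.0 || !isfinite(x))
        return x;                            /* 0, inf, NaN pass through */

    double ax = fabs(x);
    int e;
    double m = frexp(ax, &e);                /* ax = m * 2^e, m in [0.5, 1) */
    double t = ldexp(0.7 + 0.3 * m, e / 3);  /* rough seed near ax^(1/3) */

    for (int i = 0; i < 6; i++)              /* Newton: t = (2t + ax/t^2)/3 */
        t = (2.0 * t + ax / (t * t)) / 3.0;

    return x < 0.0 ? -t : t;
}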
Most applications aren't scraping around in the mud, trying to eke out a
few processor cycles. Instead, their developers are more worried about
correctness and long-term "support"; I have algorithms that start with
large walls of commentary that essentially say: "Here There Be Dragons.
Don't muck with this unless you completely understand all of the following
documentation and the reasons behind each of the decisions made in THIS
implementation."
[So, when some idiot DOES muck with it, I can just grin and say, "Well, I
guess YOU'LL have to figure out what you did wrong, eh? Or, figure out how
to restore the original implementation and salvage any other hacks you
may have needed."]
We don't need no stinkin' OS!
You may not think you need one but you do need a way to share the physical
resources between the tasks that want to use them.
Single-threaded and trivial designs can usually live without an OS.
I still see people using "superloops" (which should have gone away
in the 70's as they are brittle to maintain).
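For anyone who hasn't met the term, a "superloop" is just this shape: every
handler polled in one endless round-robin, with timing that is whatever it
happens to be this week. The function names below are made up, obviously.

/* The classic "superloop": no scheduler, just poll everything forever.
   (All of these handlers are hypothetical; declarations only.) */
void hw_init(void);
void poll_uart(void);
void poll_keypad(void);
void update_display(void);
void run_control_loop(void);

int main(void)
{
    hw_init();
    for (;;) {
        poll_uart();            /* every handler must return quickly...  */
        poll_keypad();
        update_display();
        run_control_loop();     /* ...or everyone else's timing slips    */
    }
}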
But, you've really constrained a product's design if it doesn't
exploit the abstractions that an OS supplies.
I built a box, some years ago, and recreated a UNIX-like API
for the application. Client was blown away when I was able to
provide the same functionality that was available on the "console
port" (RS232) on ALL the ports. Simultaneously. By adding *one*
line of code for each port!
Cuz, you know, sooner or later something like that would be on
the wish list and a more naive implementation would lead to
a costly refactoring.
Cooperative multitasking can be done with interrupts for the realtime IO but
life is a lot easier with a bit of help from an RTOS.
Especially if you truly are operating with timeliness constraints.
How do you know that each task met its deadline? Can you even quantify
them? What do you do when a task *misses* its deadline? How do you
schedule your tasks' execution to optimize resource utilization?
What happens when you add a task? Or, change a task's workload?
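Even bookkeeping as crude as the sketch below answers the first of those
questions and gives you a hook for deciding what a miss should actually *do*.
(now_ms() and deadline_miss_handler() are stand-ins for whatever monotonic
clock and policy hook your RTOS gives you.)

/* Hypothetical per-task deadline bookkeeping: timestamp the release,
   check at completion, count and report the misses. */
struct task_stats {
    unsigned deadline_ms;       /* relative deadline for one activation */
    unsigned worst_ms;          /* worst observed response time         */
    unsigned misses;            /* how often the deadline was blown     */
};

extern unsigned now_ms(void);                           /* assumed RTOS tick   */
extern void deadline_miss_handler(struct task_stats *); /* assumed policy hook */

void run_one_activation(struct task_stats *ts, void (*job)(void))
{
    unsigned release = now_ms();
    job();
    unsigned elapsed = now_ms() - release;

    if (elapsed > ts->worst_ms)
        ts->worst_ms = elapsed;
    if (elapsed > ts->deadline_ms) {
        ts->misses++;
        deadline_miss_handler(ts);      /* shed load, log, degrade, ... */
    }
}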
The future is lots of CPUs. Task switching and CPU contention are so
last millennium. Each process can just own a CPU or three.