Martin Brown
On 23/02/2023 16:43, John Larkin wrote:
> On Wed, 22 Feb 2023 20:33:57 -0700, Don Y
> <blockedofcourse@foo.invalid> wrote:
>> On 2/22/2023 8:15 PM, Sylvia Else wrote:
>>> But can you afford the memory and time overheads inherent in run-time
>>> range checks of things like array accesses?
>> That's a small cost. Modern tools can often make (some) of those
>> tests at compile time.
>> The bigger problem with many newer languages is that they rely heavily
>> on dynamic memory allocation, garbage collection, etc.
>> And, most of the folks I've met can't look at a line of arbitrary
>> code and tell you -- with *confidence* -- that they know what it
>> costs to execute, regardless of how comfortable they are with
>> the language in question.
> Programmers typically can't estimate run times for chunks of their
> code. They typically guess pessimistically, by roughly 10:1.
Anyone who is serious about timing code knows how to read the free
running system clock. RDTSC on Intel CPUs is very handy (even if Intel
warns against using it for this purpose, it works very well).
Other CPUs have equivalent performance monitoring registers, although
they may be hidden in some fine print in the dark recesses of the manual.
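As a minimal sketch (assuming GCC/Clang and the __rdtsc() intrinsic
from x86intrin.h; the function pointer, rep count and volatile sink are
scaffolding of mine, and serialising instructions such as lfence are
deliberately left out):

#include <stdint.h>
#include <x86intrin.h>

/* time an arbitrary double->double routine with the free-running TSC */
uint64_t time_it(double (*f)(double), double x, int reps)
{
    volatile double sink;       /* keeps the calls from being optimised away */
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < reps; i++)
        sink = f(x);
    uint64_t dt = __rdtsc() - t0;
    (void)sink;
    return dt / reps;           /* rough cycles per call */
}
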
> There's nothing unreasonable about an IRQ doing a decent amount of i/o
> and signal processing on a small ARM at 100 KHz, if programmed in bare
> C.
These days most binary operations are single cycle, and potentially less
if there are sub-expressions with no interdependencies. Divides are
still a lot slower. This makes a Pade [5,4] rational approximation a
good choice on current Intel CPUs: the numerator and denominator
evaluate in parallel (the physical hardware is that good) and some of
the divide's latency is hidden along the way.
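To illustrate the shape of such an evaluation (a sketch only - the
coefficient arrays are placeholders for whatever fit is in use): the
two Horner chains share nothing but x, so an out-of-order core runs
them side by side, and only the final divide serialises.

/* Pade [5,4]: degree-5 numerator over degree-4 denominator */
double pade54(double x, const double n[6], const double d[5])
{
    double num = ((((n[5]*x + n[4])*x + n[3])*x + n[2])*x + n[1])*x + n[0];
    double den = (((d[4]*x + d[3])*x + d[2])*x + d[1])*x + d[0];
    return num / den;           /* the one divide, at the very end */
}
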
In the old days we were warned to avoid conditional branches, but today
you can get away with them, and sometimes aggressive loop unrolling will
work against you. If it has to be as fast as possible then you need to
understand every last quirk of the target CPU.
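As a toy illustration of the trade-off (my own example, not from any
particular codebase): a clamp can be written with branches, or in a
form most compilers turn into branch-free min/max instructions. Which
wins depends on how predictable the data is - which is exactly why you
measure on the target.

double clamp_branchy(double x, double lo, double hi)
{
    if (x < lo) return lo;      /* cheap when well predicted */
    if (x > hi) return hi;
    return x;
}

double clamp_branchless(double x, double lo, double hi)
{
    double t = (x < lo) ? lo : x;   /* typically compiles to maxsd */
    return (t > hi) ? hi : t;       /* typically compiles to minsd */
}
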
Speculative execution is a bit like quantum mechanics in that it
explores both paths simultaneously and then the wavefunction collapses
onto the path that was taken once the outcome of the comparison is
known. This significantly alters the viability of some algorithms.
The only caveat is that everything stops dead if two divides or
comparisons occur before the previous one has completely finished
executing, and you get a hard pipeline stall (very bad for speed).
How fast things will go in practice can only be determined today by
putting all of the pieces together and seeing how fast they run. On
some CPUs, Halley's more complicated third-order method runs as fast as
or sometimes faster than the more commonly used Newton-Raphson.
This has changed quite recently. On an i5-12600 they benchmark at 25
cycles and 26 cycles respectively, but Halley has cubic convergence
whereas NR is only quadratic (and intrinsically less stable).
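For concreteness, one refinement step of each method applied to
f(y) = y^3 - x, the cube-root equation used below. Each step contains
exactly one divide, which is why the cycle counts come out so close:

/* Newton-Raphson: y - (y^3 - x)/(3y^2), quadratic convergence */
double newton_step(double y, double x)
{
    return (2.0*y + x/(y*y)) / 3.0;
}

/* Halley: y*(2x + t)/(x + 2t) with t = y^3, cubic convergence */
double halley_step(double y, double x)
{
    double t = y*y*y;
    return y + y*(x - t)/(x + 2.0*t);
}
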
Benchmarks can be misleading too. They don't tell you how the component
will behave in the environment where you will actually use it.
>> Even C is becoming difficult, in some cases, to 'second guess'.
>> And, ASM isn't immune as the hardware is evolving to provide
>> performance enhancing features that can't often be quantified,
>> at design/compile time.
I have a little puzzle on that one too. I have some verified correct
code for cube root running on x87 and SSE/AVX hardware which, when
benchmarked aggressively in blocks of 10^7 cycles, gets progressively
faster - by as much as a factor of two. I know that others have seen
this effect too, but it only happens sometimes - usually on dense,
frequently executed x87 code. These are cube root benchmarks:
             cycles per call, successive 10^7-cycle blocks
ASM_log2      67  64  64  64  64 ...
ASM_2^x      107  90  90  90  90 ...
ASM_cbrt     122 123  96  64  64 ...
ASM_cbrt2    152 153 122  85  85 ...
Note how all the assembler-based code speeds up a bit, or even a lot,
after the first few measurements and then stabilises.
ASM_log2 and ASM_2^x are very short routines.
ASM_cbrt contains one FP divide; ASM_cbrt2 has no division.
Compare these with timings of pure SSE2 or AVX SIMD code:
Pow(x,1/3)    136 136 137 136 138
exp(log(x)/3) 168 164 147 147 171
N_cbrt         69  69  69  69  69
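For what it's worth, the harness that produces numbers like these is
roughly the following sketch (BLOCK matches the 10^7-cycle figure
above; the rest is scaffolding of mine):

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define BLOCK 10000000ull       /* 10^7 reference cycles per block */

void bench_blocks(double (*f)(double), double x, int nblocks)
{
    volatile double sink;
    for (int b = 0; b < nblocks; b++) {
        uint64_t t0 = __rdtsc(), calls = 0;
        while (__rdtsc() - t0 < BLOCK) {  /* run for one block */
            sink = f(x);
            calls++;
        }
        printf("block %d: ~%llu cycles/call\n",
               b, (unsigned long long)(BLOCK / calls));
    }
    (void)sink;
}
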
I also have to mention how impressed I am with the latest 2023 Intel
compiler - their system cbrt is more accurate than is possible with a
strict 64-bit-only FP representation, and the fastest too!
Sneaky use of fused multiply-add allows very precise refinement of the
answer using more than the 53-bit mantissa of 64-bit FP.
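A sketch of the underlying trick - not Intel's actual code, just the
standard C99 fma() from math.h: fma(a,b,c) rounds a*b + c only once,
so the residual of an approximate root keeps bits that an ordinary
multiply-then-subtract would cancel away, and those bits drive a final
correction.

#include <math.h>

/* ~= y^3 - x with far less cancellation than (y*y*y) - x */
double cbrt_residual(double y, double x)
{
    double y2 = y * y;
    return fma(y, y2, -x);      /* y*y2 - x, rounded once */
}
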
The system cbrt on GNU and MS is best avoided entirely - both are
amongst the slowest and least accurate of all those I have ever tested.
The best public algorithm for general use is by Sun for their Unix
library; it is a variant of Kahan's magic-constant hack. BSD's is
slower but OK.
The routine below is, amazingly, a best buy thanks to remarkable recent
speed improvements in the exp and log library functions (and also sqrt).
#include <math.h>

/* exp/log-based cube root with one Halley refinement step
   (named cbrt_halley here so it doesn't clash with the library cbrt) */
double cbrt_halley(double x)
{
    double y, t;
    if (x == 0) return x;       /* preserves signed zero */
    if (x > 0) y = exp(log(x)/3); else y = -exp(log(-x)/3);
    t = y*y*y;
    return y + y*(x - t)/(x + 2*t);  /* Halley refinement, = y*(2x+t)/(x+2t) */
}
It is faster than most other algorithms, at least on Intel CPUs, and as
accurate as all but the very cunning Intel C++ library routine.
PS: don't be tempted to rearrange the refinement algebraically into
y*(2x+t)/(x+2t). It might look neater, but you will lose (at least) a
factor of two in numerical precision by using that form of the
expression.
>> And, while the interface looks like a traditional function call,
>> the developer now has to consider the possibility that the
>> service may be unavailable -- even if it WAS available on the
>> previous line of code! (developers have a tough time thinking
>> in pseudo-parallel, let alone *true* parallelism)
> Lots of CPUs help.
... to make a horrible mess. Few problems scale well on multiple CPUs.
>> So, the language becomes less of an issue but the system design
>> and OS features/mechanisms (the days of toy RTOSs are rapidly coming
>> to an end)
> We don't need no stinkin' OS!
You may not think you need one, but you do need a way to share the
physical resources between the tasks that want to use them.
Cooperative multitasking can be done with interrupts for the realtime
I/O, but life is a lot easier with a bit of help from an RTOS.
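As a minimal sketch of that split (generic, not any particular RTOS;
read_sample, process_sample and background_work are hypothetical
hooks): the interrupt does only the time-critical capture, and
everything else cooperates in a run-to-completion super-loop.

#include <stdbool.h>
#include <stdint.h>

extern uint16_t read_sample(void);        /* hypothetical device read */
extern void process_sample(uint16_t s);   /* hypothetical realtime task */
extern void background_work(void);        /* hypothetical housekeeping */

static volatile bool sample_ready = false;
static volatile uint16_t sample;

void io_isr(void)                   /* ISR: only the urgent part */
{
    sample = read_sample();
    sample_ready = true;
}

int main(void)
{
    for (;;) {                      /* cooperative super-loop */
        if (sample_ready) {
            sample_ready = false;
            process_sample(sample); /* runs to completion */
        }
        background_work();
    }
}
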
--
Martin Brown