John Larkin
On Mon, 16 Jan 2023 04:54:38 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:
On 1/16/2023 3:21 AM, Martin Brown wrote:
On 15/01/2023 10:11, Don Y wrote:
On 1/15/2023 2:48 AM, Martin Brown wrote:
I prefer to use RDTSC for my Intel timings anyway.
On many of the modern CPUs there is a free-running 64-bit counter clocked
once per cycle. Intel deprecates using it for such purposes but I have never
found it a problem provided that you bracket it before and after with CPUID
to force all the pipelines into an empty state.
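Something along those lines, as a minimal sketch for GCC/Clang on x86 (the
function being timed is just a stand-in):

#include <stdint.h>
#include <stdio.h>

/* Drain the pipelines; CPUID is used only for its serializing
   side effect (leaf 0, results discarded). */
static inline void serialize(void)
{
    uint32_t a = 0, b, c, d;
    __asm__ __volatile__("cpuid"
                         : "+a"(a), "=b"(b), "=c"(c), "=d"(d)
                         :: "memory");
}

/* Read the free-running time-stamp counter. */
static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

static void work_under_test(void)   /* stand-in for the code being timed */
{
    volatile int x = 0;
    for (int i = 0; i < 1000; i++) x += i;
}

int main(void)
{
    serialize();                    /* bracket: CPUID before the first read */
    uint64_t t0 = rdtsc();
    work_under_test();
    serialize();                    /* ...and before the second read        */
    uint64_t t1 = rdtsc();
    printf("%llu cycles\n", (unsigned long long)(t1 - t0));
    return 0;
}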
The equivalent DWT_CYCCNT on the Arm CPUs that support it is described here:
https://stackoverflow.com/questions/32610019/arm-m4-instructions-per-cycle-ipc-counters
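For reference, on a Cortex-M part that has the DWT block, enabling and reading
the cycle counter looks roughly like this (the addresses are the standard
ARMv7-M ones; check your part's documentation before trusting them):

#include <stdint.h>

#define DEMCR       (*(volatile uint32_t *)0xE000EDFCu)  /* Debug Exception & Monitor Control */
#define DWT_CTRL    (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT  (*(volatile uint32_t *)0xE0001004u)

static void cyccnt_init(void)
{
    DEMCR      |= (1u << 24);   /* TRCENA: enable the DWT/ITM blocks  */
    DWT_CYCCNT  = 0;
    DWT_CTRL   |= 1u;           /* CYCCNTENA: start the cycle counter */
}

/* Bracket a region of interest. The counter is only 32 bits, so keep
   the measured region short or handle wraparound yourself. */
static uint32_t time_region(void (*fn)(void))
{
    uint32_t t0 = DWT_CYCCNT;
    fn();
    return DWT_CYCCNT - t0;
}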
I prefer hard numbers to a vague scope trace.
Two downsides:
- you have to instrument your code (but, if you're concerned with performance,
you've already done this as a matter of course)
You have to make a test framework to exercise the code in as realistic a manner
as you can - that isn't quite the same as instrumenting the code (although it
can be).
It depends on how visible the information of interest is to outside
observers. If you have to "do something" to make it so, then you
may as well put in the instrumentation and get things as you want them.
I have never found profile directed compilers to be the least bit useful on my
fast maths codes because their automatic code instrumentation breaks the very
code that it is supposed to be testing (in the sense of wrecking cache lines
and locality etc.).
Exactly. The same holds true of adding invariants to code;
removing them (#ifndef DEBUG) changes the code -- subtly, but
it changes it nonetheless. So, you have to put in place two
levels of final test:
- check to see if you THINK it will pass the REAL final test
- actually DO the final test
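A trivial sketch of what such an invariant looks like (the macro and function
names here are made up) -- the checked and stripped builds really are
different code:

#include <stdio.h>
#include <stdlib.h>

#ifdef DEBUG
  /* Checked build: every call site carries a compare-and-branch and the
     expression (with whatever it reads) stays in the generated code. */
  #define INVARIANT(cond)                                        \
      do {                                                       \
          if (!(cond)) {                                         \
              fprintf(stderr, "invariant failed: %s (%s:%d)\n",  \
                      #cond, __FILE__, __LINE__);                \
              abort();                                           \
          }                                                      \
      } while (0)
#else
  /* Release build: the check, its string literals, and its code footprint
     all vanish -- along with their cache and timing effects. */
  #define INVARIANT(cond) ((void)0)
#endif

static int buffer_used(void) { return 3; }   /* stand-in state query */

void push_item(void)
{
    INVARIANT(buffer_used() < 16);
    /* ... actual work ... */
}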
When installing copy protection/anti-tamper mechanisms in products,
there's a time when you've just enabled them and, thus, changed
how the product runs. If it *stops* running (properly), you have
to wonder if your "measures" are at fault or if some latent
bug has crept in, aggravated by the slight differences in
execution patterns.
The only profiling method I have found to work reasonably well is, probably by
chance, the highest-frequency periodic ISR I have ever used in anger: it
profiled code by accumulating snapshots of the PC address, allowing just a few
machine instructions to execute between samples. It used to work well back in
the old days when 640k was the limit and code would reliably load into exactly
the same locations every run.
It is a great way to find the hotspots where most time is spent.
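Roughly, the sampling side looks like this -- how you get at the interrupted
PC is entirely target-specific, and the bucket scheme below is only
illustrative:

#include <stdint.h>

#define CODE_BASE   0x00010000u          /* assumed start of the code image */
#define BUCKET_SIZE 16u                  /* bytes of code per histogram bin */
#define NBUCKETS    4096u

static uint32_t hits[NBUCKETS];

/* Called from the high-rate timer ISR with the PC that was interrupted.
   Over enough samples the hot spots dominate the histogram. */
void profile_sample(uintptr_t interrupted_pc)
{
    uint32_t bucket = (uint32_t)((interrupted_pc - CODE_BASE) / BUCKET_SIZE);
    if (bucket < NBUCKETS)
        hits[bucket]++;
}

/* Offline or at idle: dump the bins and map bucket*BUCKET_SIZE + CODE_BASE
   back onto the linker map to put names on the hot routines. */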
IMO, this is where logic analyzers shine. I don't agree with
using them to "trace code" (during debug) as there are better ways to
get that information. But, *watching* to see how code runs (passively)
can be a real win. Especially when you are trying to watch for
RARE aberrant behavior.
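One cheap way to give the analyzer something to watch passively is to bracket
the region of interest with a spare port pin. A sketch, with made-up GPIO
register addresses -- substitute your part's own:

#include <stdint.h>

#define GPIO_SET  (*(volatile uint32_t *)0x40020018u)  /* hypothetical set register   */
#define GPIO_CLR  (*(volatile uint32_t *)0x4002001Cu)  /* hypothetical clear register */
#define PROBE_PIN (1u << 5)

static inline void probe_high(void) { GPIO_SET = PROBE_PIN; }
static inline void probe_low(void)  { GPIO_CLR = PROBE_PIN; }

void isr_of_interest(void)
{
    probe_high();        /* analyzer sees entry...                      */
    /* ... handler body ... */
    probe_low();         /* ...and exit; pulse width = service time,
                            and jitter/gaps show up over hours of
                            passive capture                             */
}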
- it doesn\'t tell you about anything that happens *before* the code runs
(e.g., latency between event and recognition thereof)
True enough. Sometimes you need a logic analyser for weird behaviour - we once
caught a CPU chip where RTI didn't always do what it said on the tin and the
instruction following the RTI instruction got executed with a frequency of
about 1:10^8. They replaced all the faulty CPUs FOC but we had to sign a
non-disclosure agreement.
If I'm really serious about finding out why something is unusually slow I
run a dangerous system-level driver that allows me full access to the
model-specific registers to monitor cache misses and pipeline stalls.
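On Linux the tamer equivalent is the msr driver, which exposes the MSRs as a
seekable device file (root only). A sketch -- it only reads the TSC as a
sanity check; the cache-miss and stall counters have to be programmed through
the event-select MSRs per the Intel SDM before they count anything:

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

/* pread() offset is the MSR number; each read returns 8 bytes. */
static int read_msr(int fd, uint32_t reg, uint64_t *val)
{
    return pread(fd, val, sizeof *val, reg) == sizeof *val ? 0 : -1;
}

int main(void)
{
    int fd = open("/dev/cpu/0/msr", O_RDONLY);   /* needs 'modprobe msr' and root */
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    uint64_t tsc;
    if (read_msr(fd, 0x10, &tsc) == 0)           /* 0x10 = IA32_TIME_STAMP_COUNTER */
        printf("IA32_TIME_STAMP_COUNTER = %llu\n", (unsigned long long)tsc);

    close(fd);
    return 0;
}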
But, those results can change from instance to instance (as can latency,
execution time, etc.). So, you need to look at the *distribution* of
values and then think about whether that truly represents "typical"
and/or *worst* case.
It just means that you have to collect an array of data and take a look at it
later and offline. Much like you would when testing that a library function
does exactly what it is supposed to.
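The collect-now, analyse-later part can be as simple as this sketch -- sort
the samples and look at the median and the tail, not just one number:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NSAMPLES 10000
static uint64_t sample[NSAMPLES];   /* filled with per-iteration cycle counts */

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Report the distribution rather than one "representative" figure. */
static void report(void)
{
    qsort(sample, NSAMPLES, sizeof sample[0], cmp_u64);
    printf("min    %llu\n", (unsigned long long)sample[0]);
    printf("median %llu\n", (unsigned long long)sample[NSAMPLES / 2]);
    printf("99%%    %llu\n", (unsigned long long)sample[(NSAMPLES * 99) / 100]);
    printf("max    %llu\n", (unsigned long long)sample[NSAMPLES - 1]);
}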
Yes. So, you either have the code do the collection (using a black box)
*or* have to have an external device (logic analyzer) that can collect it
for you.
The former is nice because the code can actually make decisions
(at run time) that a passive observer often can't (because the
observer can't see all of the pertinent data). But, that starts
to have a pronounced impact on the *intended* code...
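Such a "black box" can be as small as a power-of-two ring of timestamped
records that the code writes and something else drains after the fact.
A sketch, with all names illustrative:

#include <stdint.h>

#define LOG_SIZE 256u                 /* power of two so the index wraps cheaply */

struct event {
    uint64_t when;                    /* cycle counter or tick at time of entry  */
    uint16_t id;                      /* which checkpoint in the code fired      */
    uint16_t arg;                     /* small payload: queue depth, state, ...  */
};

static struct event log_buf[LOG_SIZE];
static volatile uint32_t log_head;    /* single writer assumed; guard it if
                                         several contexts log                    */

/* Cheap enough to leave in the shipping build, which sidesteps the
   "instrumented build behaves differently" trap discussed above. */
static inline void log_event(uint16_t id, uint16_t arg, uint64_t now)
{
    uint32_t i = log_head++ & (LOG_SIZE - 1u);
    log_buf[i].when = now;
    log_buf[i].id   = id;
    log_buf[i].arg  = arg;
}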
Relying on exact timings is sort of naive; it ignores how much
things can vary with the running system (is the software in a
critical region when the ISR is invoked?) and the running
*hardware* (multilevel caches, etc.)
It is quite unusual to see bad behaviour from the multilevel caches but it can
add to the variance. You always get a few outliers here and there in user code
if a higher level disk or network interrupt steals cycles.
Being an embedded system developer, I find the issues that muck up
execution are often periodic -- but, with periods varied enough
that they only beat against the observed phenomenon occasionally.
I am always amused by folks WHO OBSERVE A F*CKUP. Then, when they
can\'t reproduce it or identify a likely cause, ACT AS IF IT NEVER
HAPPENED! Sheesh, you're not relying on some third-hand report
of an anomaly... YOU SAW IT! How can you pretend it didn't happen?
Apply all of your engineering creativity.