Approach to Finding the Root Cause of Failures

Rick C · Apr 1, 2020

On Tuesday, March 31, 2020 at 8:30:16 PM UTC-4, Nomen Nescio wrote:
> i need guydense from helpful outsides on how 2 troll

Depends. If you are in a bass boat you need a trolling motor and a car battery.

--

Rick C.


Apr 1, 2020

On Tuesday, March 31, 2020 at 7:02:43 PM UTC-4, Phil Allison wrote:

blo...@columbus.rr.com wrote:

-------------------------------
Another topic that I hope can elicit engineering discussion:

** IOW another mindless troll.

I guess i succeeded in my post then?

What makes up a good skill set for finding the root cause of a
failure that is rare, intermittent or obscure?

** Analyse the actual failure first.

Something good service techs do every day, but few designers have a clue about.

Your dopey rules are all context free generalizations, so totally meaningless.

...... Phil

Phil Allison · Apr 1, 2020

Rick Cretin the Bullshitter wrote:

On Tuesday, March 31, 2020 at 7:02:43 PM UTC-4, Phil Allison wrote:
blo...@columbus.rr.com wrote:

-------------------------------
Another topic that I hope can elicit engineering discussion:

** IOW another mindless troll.

Whew, pot calling the kettle black!

** Another blatant lie and mindless insult.

Rick must be one of those nut case slogan chanters who try to "de-platform " anyone with a different point of view.

..... Phil

George Herold · Apr 1, 2020

On Tuesday, March 31, 2020 at 7:34:08 PM UTC-4, jla...@highlandsniptechnology.com wrote:

On Tue, 31 Mar 2020 13:35:57 -0700 (PDT), George Herold
<ggherold@gmail.com> wrote:

On Tuesday, March 31, 2020 at 3:04:56 PM UTC-4, John Larkin wrote:
On Tue, 31 Mar 2020 14:55:02 -0400, ABLE1 <somewhere@nowhere.net>
wrote:

On 3/31/2020 11:34 AM, blocher@columbus.rr.com wrote:

Another topic that I hope can elicit engineering discussion:

What makes up a good skill set for finding the root cause of a failure that is rare, intermittent or obscure?

Over the past several years I have been more involved in root cause failure than I was when I was doing more design work. In many ways I think it is more challenging than design work. It takes a mindset that is different than design.

Here is my reminder list when doing root cause studies

1. never root for a particular outcome when performing a test. Root for not being fooled by the results of your test

2. Assign weighting factors to everything you believe. Never assign a weighting factor of 1 to anything until you know you have the problem solved

3. Expect to have to do certain tests over again and that you will draw an opposite conclusion when you repeat a test than what you concluded after the first test.

4. Taking guidance from "helpful" outsiders is challenging. On the one hand they care and are smart; on the other hand, if you go about chasing other people's ideas (often conceived of just to demonstrate they are concerned in a meeting) you will never get a clear path to troubleshoot the problem in your own way.
Help is a two edged sword. It is important but can sometimes be problematic.

5. As an aside - I have learned that when I "see something" during the design phase, I no longer look at that as a curse, but as a blessing. It is going to come back and get you later.

6. Get past the notion that having nothing to show for a day's work is bad. As a designer you can show a day's work for a day's pay. In root cause you feel like you have accomplished nothing for a long time. Frequently, though, these problems are the most visible problems in an organization and can make a difference between losing a customer and keeping one.

7. Look for contradictions in your thinking. Use other people to help you find contradictions in your thinking.

OK - enough for now......

Whoops!! 2nd try!!

With all the above being typed and read I have a much simpler way
to look at the problem.

Just use the "Not Method of Troubleshooting".

The Not Method goes like this.

It's Not this!!
It's Not that!!
Once you have identified all the Not's, the only thing left
is not a not, but is the real problem.
Fix it or replace and move on!!

Now I am sure someone will find fault with my method, well Ok then!!
Some days the Not's just have to be adjusted.

Have a good day!!

Les

That's the Sherlock Holmes technique. It doesn't work very well. The
list of NOTs to test is too big, and you are unlikely to include in
the list the things you missed when you did the design.

The NOT technique is a last resort. (it's how I found a
leaky toggle switch... we had a bag of leaky
switches, most circuits didn't care if there
is a few meg ohm of resistance.) Before you pull all
your hair out, you pull all the components out and replace 'em.
But how do you know the replacement component is good!
Quickly a knotty nightmare.

I find it best to get as much data as possible,
and then sleep on it*. When you think about it 'actively'
you tend to get stuck in your first assumption rut.
(And if your first assumption had been right, it'd be
fixed/found already. :^)

George H.
*or go explain the problem to someone else... not that they will
be able to help (well they might) but because having to explain it
makes you go over the whole circuit and may remind you of the part
you haven't been thinking about.

--

John Larkin Highland Technology, Inc
picosecond timing precision measurement

jlarkin att highlandtechnology dott com
http://www.highlandtechnology.com

I say to myself, "This was designed by an idiot. What stupid mistake
did he make?"

There is a tendency to blame parts, when the problem is usually
design.

I'm mostly talking about circuits I designed.
But I do say the exact same things to myself. :^)

George H.

--

John Larkin Highland Technology, Inc

Science teaches us to doubt.

Claude Bernard

Bill Sloman · Apr 1, 2020

On Wednesday, April 1, 2020 at 11:54:44 AM UTC+11, Phil Allison wrote:

Rick Cretin the Bullshitter wrote:

On Tuesday, March 31, 2020 at 7:02:43 PM UTC-4, Phil Allison wrote:
blo...@columbus.rr.com wrote:

-------------------------------
Another topic that I hope can elicit engineering discussion:

** IOW another mindless troll.

Whew, pot calling the kettle black!

** Another blatant lie and mindless insult.

Rick must be one of those nut case slogan chanters who try to "de-platform " anyone with a different point of view.

Phil clearly has a different point of view, but dignifying his insights by calling them a "point of view" rather misses their rabid element.

--
Bill Sloman, Sydney

Apr 1, 2020

On Tue, 31 Mar 2020 19:33:22 -0700 (PDT), George Herold
<ggherold@gmail.com> wrote:

On Tuesday, March 31, 2020 at 7:34:08 PM UTC-4, jla...@highlandsniptechnology.com wrote:
On Tue, 31 Mar 2020 13:35:57 -0700 (PDT), George Herold
<ggherold@gmail.com> wrote:

On Tuesday, March 31, 2020 at 3:04:56 PM UTC-4, John Larkin wrote:
On Tue, 31 Mar 2020 14:55:02 -0400, ABLE1 <somewhere@nowhere.net>
wrote:

On 3/31/2020 11:34 AM, blocher@columbus.rr.com wrote:

Another topic that I hope can elicit engineering discussion:

What makes up a good skill set for finding the root cause of a failure that is rare, intermittent or obscure?

Over the past several years I have been more involved in root cause failure than I was when I was doing more design work. In many ways I think it is more challenging than design work. It takes a mindset that is different than design.

Here is my reminder list when doing root cause studies

1. never root for a particular outcome when performing a test. Root for not being fooled by the results of your test

2. Assign weighting factors to everything you believe. Never assign a weighting factor of 1 to anything until you know you have the problem solved

3. Expect to have to do certain tests over again and that you will draw an opposite conclusion when you repeat a test than what you concluded after the first test.

4. Taking guidance from "helpful" outsiders is challenging. On the one hand they care and are smart; on the other hand, if you go about chasing other people's ideas (often conceived of just to demonstrate they are concerned in a meeting) you will never get a clear path to troubleshoot the problem in your own way.
Help is a two edged sword. It is important but can sometimes be problematic.

5. As an aside - I have learned that when I "see something" during the design phase, I no longer look at that as a curse, but as a blessing. It is going to come back and get you later.

6. Get past the notion that having nothing to show for a day's work is bad. As a designer you can show a day's work for a day's pay. In root cause you feel like you have accomplished nothing for a long time. Frequently, though, these problems are the most visible problems in an organization and can make a difference between losing a customer and keeping one.

7. Look for contradictions in your thinking. Use other people to help you find contradictions in your thinking.

OK - enough for now......

Whoops!! 2nd try!!

With all the above being typed and read I have a much simpler way
to look at the problem.

Just use the "Not Method of Troubleshooting".

The Not Method goes like this.

It's Not this!!
It's Not that!!
Once you have identified all the Not's, the only thing left
is not a not, but is the real problem.
Fix it or replace and move on!!

Now I am sure someone will find fault with my method, well Ok then!!
Some days the Not's just have to be adjusted.

Have a good day!!

Les

That's the Sherlock Holmes technique. It doesn't work very well. The
list of NOTs to test is too big, and you are unlikely to include in
the list the things you missed when you did the design.

The NOT technique is a last resort. (it's how I found a
leaky toggle switch... we had a bag of leaky
switches, most circuits didn't care if there
is a few meg ohm of resistance.) Before you pull all
your hair out, you pull all the components out and replace 'em.
But how do you know the replacement component is good!
Quickly a knotty nightmare.

I find it best to get as much data as possible,
and then sleep on it*. When you think about it 'actively'
you tend to get stuck in your first assumption rut.
(And if your first assumption had been right, it'd be
fixed/found already. :^)

George H.
*or go explain the problem to someone else... not that they will
be able to help (well they might) but because having to explain it
makes you go over the whole circuit and may remind you of the part
you haven't been thinking about.

--

John Larkin Highland Technology, Inc
picosecond timing precision measurement

jlarkin att highlandtechnology dott com
http://www.highlandtechnology.com

I say to myself, "This was designed by an idiot. What stupid mistake
did he make?"

There is a tendency to blame parts, when the problem is usually
design.
I'm mostly talking about circuits I designed.
But I do say the exact same things to myself. :^)

I meant myself.

--

John Larkin Highland Technology, Inc

Science teaches us to doubt.

Claude Bernard

Bill Sloman · Apr 1, 2020

On Wednesday, April 1, 2020 at 2:55:31 PM UTC+11, jla...@highlandsniptechnology.com wrote:

On Tue, 31 Mar 2020 19:33:22 -0700 (PDT), George Herold
<ggherold@gmail.com> wrote:
On Tuesday, March 31, 2020 at 7:34:08 PM UTC-4, jla...@highlandsniptechnology.com wrote:
On Tue, 31 Mar 2020 13:35:57 -0700 (PDT), George Herold
<ggherold@gmail.com> wrote:
On Tuesday, March 31, 2020 at 3:04:56 PM UTC-4, John Larkin wrote:
On Tue, 31 Mar 2020 14:55:02 -0400, ABLE1 <somewhere@nowhere.net>
wrote:
On 3/31/2020 11:34 AM, blocher@columbus.rr.com wrote:

<snip>

I say to myself, "This was designed by an idiot. What stupid mistake
did he make?"

There is a tendency to blame parts, when the problem is usually
design.
I'm mostly talking about circuits I designed.
But I do say the exact same things to myself. :^)

I meant myself.

Cognitive behaviour therapy would probably suggest that you should settle for "What did I miss here?".

Calling yourself an idiot - no matter how correctly - doesn't set up a good frame of mind.

--
Bill Sloman, Sydney

Michael Terrell · Apr 1, 2020

On Tuesday, March 31, 2020 at 7:02:43 PM UTC-4, Phil Allison wrote:

blo...@columbus.rr.com wrote:

-------------------------------
Another topic that I hope can elicit engineering discussion:

** IOW another mindless troll.

What makes up a good skill set for finding the root cause of a
failure that is rare, intermittent or obscure?

** Analyse the actual failure first.

Something good service techs do every day, but few designers have a clue about.

Your dopey rules are all context free generalizations, so totally meaningless.

Wrong. Finding a one-off field failure is not the same as failure analysis at the factory, where the fault affects an entire production run. I did it at Microdyne, where it was often traced to out-of-spec components, or to an OEM that had changed its manufacturing process. Another cause was purchasing substituting unauthorized components, like when they switched suppliers of variable inductors without asking for samples and verifying the new coils. Their excuse was "5% is always better than 10%, isn't it?" The SRF was 25 to 35% lower on the 5% parts, so we had an entire run of boards that had insufficient bandwidth. You won't find problems like that as a service tech.
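
A rough way to see how different the substituted coils were, as a minimal sketch. It assumes the usual lumped model in which an inductor's self-resonant frequency is SRF = 1/(2*pi*sqrt(L*Cp)), with Cp the parasitic winding capacitance, and it assumes the inductance itself was unchanged. The function name is hypothetical and none of this comes from the Microdyne boards themselves.

```python
def parasitic_capacitance_ratio(srf_drop_fraction: float) -> float:
    """Return Cp_new / Cp_old for an inductor whose SRF fell by the given fraction.

    Assumes SRF = 1 / (2*pi*sqrt(L*Cp)) with L unchanged, so Cp scales
    as 1 / SRF**2.
    """
    if not 0.0 <= srf_drop_fraction < 1.0:
        raise ValueError("drop must be in [0, 1)")
    remaining = 1.0 - srf_drop_fraction  # fraction of the original SRF left
    return 1.0 / remaining ** 2


# The post quotes an SRF that was 25 to 35% lower on the 5% parts.
for drop in (0.25, 0.35):
    ratio = parasitic_capacitance_ratio(drop)
    print(f"SRF down {drop:.0%}: parasitic capacitance up about {ratio:.2f}x")
```

Under those assumptions the new coils would carry roughly 1.8 to 2.4 times the parasitic capacitance, which a tolerance figure on the inductance value says nothing about.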

David Brown · Apr 1, 2020

On 31/03/2020 20:40, Rick C wrote:

On Tuesday, March 31, 2020 at 12:41:36 PM UTC-4, David Brown wrote:
On 31/03/2020 17:40, blocher@columbus.rr.com wrote:
On Tuesday, March 31, 2020 at 11:34:44 AM UTC-4,
snip

Also - the FPGA guys and the SW guys will only acknowledge a
problem when it is laid out under their nose. It is never their
fault

That's because it's usually a hardware fault - and it can be solved
by using a bigger capacitor

You laugh, I once used a telephony part that had a PSRR of 0dB which
I had missed. (Who expects 0 dB?) On the customer's work bench they
were getting noise in the audio that turned out to be from the DSP
power consumption. They were using clip leads to provide power to
the UUT and the on board capacitance wasn't enough to mitigate it.
We told them to use better power connections and also used a larger
cap.

I had a smiley, but I have seen more than a few systems reliability
improved by adding a bigger capacitor. There is a rule in software
development that "almost all programming can be viewed as an exercise in
caching". (Yes, it is an exaggeration - but there's a grain of truth in
it.) Capacitors are the hardware equivalent of software caches.

Mind you, I have seen problems with too big capacitors too. I remember
long ago trying to find why a card communicated fine (at 9600 baud
RS-232) with some computers but not others. Looking with a scope, the
RS-232 signals were lovely triangle waves - someone had added 100 nF
capacitors to the lines to reduce the noise...
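
A back-of-envelope check of why 100 nF turns 9600 baud into triangles, as a minimal sketch. The 10 mA driver current limit and the 16 V swing (about +/-8 V) are illustrative assumptions, not measurements from that card, and real drivers vary. The 2500 pF figure is the maximum load capacitance usually quoted for RS-232.

```python
RS232_MAX_LOAD_F = 2500e-12  # load capacitance limit usually quoted for RS-232


def bit_time_s(baud: float) -> float:
    """Duration of one bit at the given baud rate."""
    return 1.0 / baud


def slew_time_s(cap_f: float, swing_v: float, drive_a: float) -> float:
    """Time for a current-limited driver to move cap_f through swing_v (t = C*V/I)."""
    return cap_f * swing_v / drive_a


cap = 100e-9     # capacitor value from the post
swing = 16.0     # ASSUMED: about -8 V to +8 V
current = 10e-3  # ASSUMED: driver current limit

bit = bit_time_s(9600)
slew = slew_time_s(cap, swing, current)
print(f"bit time  : {bit * 1e6:.0f} us")
print(f"slew time : {slew * 1e6:.0f} us")
print(f"load is {cap / RS232_MAX_LOAD_F:.0f}x the usual capacitance limit")
```

With these assumed numbers the line needs longer than one bit time to swing from rail to rail, so it never settles. A receiver with lower thresholds or a stronger driver at the other end could still cope, which would fit "some computers but not others".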

Apr 1, 2020

David Brown <david.brown@hesbynett.no> wrote in
news:r61klr$369$1@dont-email.me:

On 31/03/2020 20:40, Rick C wrote:
On Tuesday, March 31, 2020 at 12:41:36 PM UTC-4, David Brown
wrote:
On 31/03/2020 17:40, blocher@columbus.rr.com wrote:
On Tuesday, March 31, 2020 at 11:34:44 AM UTC-4,
snip

Also - the FPGA guys and the SW guys will only acknowledge a
problem when it is laid out under their nose. It is never
their fault

That's because it's usually a hardware fault - and it can be
solved by using a bigger capacitor

You laugh, I once used a telephony part that had a PSRR of 0dB
which I had missed. (Who expects 0 dB?) On the customer's work
bench they were getting noise in the audio that turned out to be
from the DSP power consumption. They were using clip leads to
provide power to the UUT and the on board capacitance wasn't
enough to mitigate it. We told them to use better power
connections and also used a larger cap.

I had a smiley, but I have seen more than a few systems
reliability improved by adding a bigger capacitor. There is a
rule in software development that "almost all programming can be
viewed as an exercise in caching". (Yes, it is an exaggeration -
but there's a grain of truth in it.) Capacitors are the hardware
equivalent of software caches.

Mind you, I have seen problems with too big capacitors too. I
remember long ago trying to find why a card communicated fine (at
9600 baud RS-232) with some computers but not others. Looking
with a scope, the RS-232 signals were lovely triangle waves -
someone had added 100 nF capacitors to the lines to reduce the
noise...

I have a trusted engineer friend who once said that most failures
occur at power up or power down. He always left his computers at
work and his home up all the time.

Old net and system admin guys usually like keeping systems up and
running at all times too.
The big computer rooms of the sixties would lose thousands an hour
in insurance if the room temperature rose above a preset level like
63°F.

Rick C · Apr 1, 2020

On Wednesday, April 1, 2020 at 4:51:11 AM UTC-4, David Brown wrote:

On 31/03/2020 20:40, Rick C wrote:
On Tuesday, March 31, 2020 at 12:41:36 PM UTC-4, David Brown wrote:
On 31/03/2020 17:40, blocher@columbus.rr.com wrote:
On Tuesday, March 31, 2020 at 11:34:44 AM UTC-4,
snip

Also - the FPGA guys and the SW guys will only acknowledge a
problem when it is laid out under their nose. It is never their
fault

That's because it's usually a hardware fault - and it can be solved
by using a bigger capacitor

You laugh, I once used a telephony part that had a PSRR of 0dB which
I had missed. (Who expects 0 dB?) On the customer's work bench they
were getting noise in the audio that turned out to be from the DSP
power consumption. They were using clip leads to provide power to
the UUT and the on board capacitance wasn't enough to mitigate it.
We told them to use better power connections and also used a larger
cap.

I had a smiley, but I have seen more than a few systems reliability
improved by adding a bigger capacitor. There is a rule in software
development that "almost all programming can be viewed as an exercise in
caching". (Yes, it is an exaggeration - but there's a grain of truth in
it.) Capacitors are the hardware equivalent of software caches.

Mind you, I have seen problems with too big capacitors too. I remember
long ago trying to find why a card communicated fine (at 9600 baud
RS-232) with some computers but not others. Looking with a scope, the
RS-232 signals were lovely triangle waves - someone had added 100 nF
capacitors to the lines to reduce the noise...

Yeah, I've gone overboard too. The board I'm making now has a 150 uF tant on the 12 volt line because I have no real specs on the power source, and it's used as a daughtercard on multiple boards anyway. The original design was for a 1 kHz tone signal driving a 50 ohm load at ±8 volts. So I used the biggest part I could find, not knowing what else might be on that power rail glitching away. Turns out 150 uF x 8 daughtercards was a bit much for the supply at power up! Fortunately the chip they used had a cap you could change to set the ramp speed, and once it was dialed back it worked fine.

After that the only problem was ham fisted installers who shove the boards into the rack misaligned scraping these tall caps right off the card!
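
To put numbers on the power-up problem, a minimal sketch using I = C * dV/dt. The 150 uF, eight cards and 12 V come from the post above; the two ramp times are made-up illustrations, since the actual ramp setting is not given.

```python
def inrush_current_a(cap_per_board_f: float, boards: int,
                     rail_v: float, ramp_s: float) -> float:
    """Average charging current while the rail ramps linearly from 0 to rail_v."""
    total_cap_f = cap_per_board_f * boards
    return total_cap_f * rail_v / ramp_s


CAP_F = 150e-6  # per daughtercard, from the post
BOARDS = 8      # from the post
RAIL_V = 12.0   # from the post

# ASSUMED ramp times, chosen only to show the trend.
for ramp_s in (1e-3, 100e-3):
    amps = inrush_current_a(CAP_F, BOARDS, RAIL_V, ramp_s)
    print(f"ramp {ramp_s * 1e3:5.0f} ms -> about {amps:.2f} A into the caps")
```

Slowing the ramp by a factor of 100 cuts the charging current by the same factor, which is why changing the ramp-setting cap was enough.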

--

Rick C.


Rick C · Apr 1, 2020

On Wednesday, April 1, 2020 at 6:21:52 AM UTC-4, Phil Allison wrote:

Michael Terrell is nut Case WANKER wrote:

--------------------------------

Phil Allison wrote:
blo...@columbus.rr.com wrote:

-------------------------------
Another topic that I hope can elicit engineering discussion:

** IOW another mindless troll.

What makes up a good skill set for finding the root cause of a
failure that is rare, intermittent or obscure?

** Analyse the actual failure first.

Something good service techs do every day, but few designers have a clue about.

Your dopey rules are all context free generalizations, so totally meaningless.

Wrong.

** No it ain't.

Finding a one off field failures are not the same as failure
analysis at the factory that affects an entire production run.

** If Terrell would kindly like to READ the actual FUCKING question - he just might - and I mean *just* see how funking WRONG he is.

I will not hold my breath, cos the guy is basically a raving lunatic.

Yeah, what you say about him is mostly true, but just as true about you as well. Certainly the "raving" part applies to you in spades. No? Do you think you don't "rave"?

--

Rick C.


Phil Allison · Apr 1, 2020

Michael Terrell is nut Case WANKER wrote:

--------------------------------

Phil Allison wrote:

blo...@columbus.rr.com wrote:

-------------------------------
Another topic that I hope can elicit engineering discussion:

** IOW another mindless troll.

What makes up a good skill set for finding the root cause of a
failure that is rare, intermittent or obscure?

** Analyse the actual failure first.

Something good service techs do every day, but few designers have a clue about.

Your dopey rules are all context free generalizations, so totally meaningless.

Wrong.

** No it ain't.

Finding a one off field failures are not the same as failure
analysis at the factory that affects an entire production run.

** If Terrell would kindly like to READ the actual FUCKING question - he just might - and I mean *just* see how funking WRONG he is.

I will not hold my breath, cos the guy is basically a raving lunatic.

..... Phil

Michael Terrell · Apr 1, 2020

On Wednesday, April 1, 2020 at 6:21:52 AM UTC-4, Phil Allison wrote:

Michael Terrell is nut Case WANKER wrote:

--------------------------------

Phil Allison wrote:
blo...@columbus.rr.com wrote:

-------------------------------
Another topic that I hope can elicit engineering discussion:

** IOW another mindless troll.

What makes up a good skill set for finding the root cause of a
failure that is rare, intermittent or obscure?

** Analyse the actual failure first.

Something good service techs do every day, but few designers have a clue about.

Your dopey rules are all context free generalizations, so totally meaningless.

Wrong.

** No it ain't.

Finding a one off field failures are not the same as failure
analysis at the factory that affects an entire production run.

** If Terrell would kindly like to READ the actual FUCKING question - he just might - and I mean *just* see how funking WRONG he is.

I will not hold my breath, cos the guy is basically a raving lunatic.

Yawn.......................

Martin Brown · Apr 1, 2020

On 31/03/2020 16:34, blocher@columbus.rr.com wrote:

Another topic that I hope can elicit engineering discussion:

What makes up a good skill set for finding the root cause of a failure that is rare, intermittent or obscure?

Over the past several years I have been more involved in root cause failure than I was when I was doing more design work. In many ways I think it is more challenging than design work. It takes a mindset that is different than design.

Here is my reminder list when doing root cause studies

1. never root for a particular outcome when performing a test. Root for not being fooled by the results of your test

Rank the tests you have by their ability to cut down the area where the
fault must lie. In my field McCabe's CCI metric is quite good for that.

If the code complexity index is too high there is a very good chance
that the code doesn't actually work correctly.

2. Assign weighting factors to everything you believe. Never assign a weighting factor of 1 to anything until you know you have the problem solved

I disagree with this at least in part - you should make a list of things
which ought to be true and a list of invariants that you expect to
remain true if things are operating correctly.

3. Expect to have to do certain tests over again and that you will draw an opposite conclusion when you repeat a test than what you concluded after the first test.

Always worry if a test can pass or fail apparently at random.

4. Taking guidance from "helpful" outsiders is challenging. On the one hand they care and are smart; on the other hand, if you go about chasing other people's ideas (often conceived of just to demonstrate they are concerned in a meeting) you will never get a clear path to troubleshoot the problem in your own way.
Help is a two edged sword. It is important but can sometimes be problematic.

Here I disagree massively. A well chosen helpful outsider can sometimes
help you break a problem even if they are not all that skilled in the
art. Explaining to a junior who isn't afraid to ask apparently dumb
questions can sometimes allow you to see your own mistaken assumptions.
With practice verbally explaining it to an empty chair can also work
since it runs the problem through a different part of the brain.

5. As an aside - I have learned that when I "see something" during the design phase, I no longer look at that as a curse, but as a blessing. It is going to come back and get you later.

Sooner you catch a fault the less it costs.

6. Get past the notion that having nothing to show for a day's work is bad. As a designer you can show a day's work for a day's pay. In root cause you feel like you have accomplished nothing for a long time. Frequently, though, these problems are the most visible problems in an organization and can make a difference between losing a customer and keeping one.

7. Look for contradictions in your thinking. Use other people to help you find contradictions in your thinking.

OK - enough for now......

Explaining your reasoning to a relatively junior engineer (or if it is a
really tough problem an engineer of the same rank or higher and who
thinks differently to how you do) can be very powerful.

--
Regards,
Martin Brown
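
The remark about code complexity can be made concrete with a rough sketch. Assuming "CCI" refers to something like McCabe's cyclomatic complexity (the post does not spell it out), the hypothetical helper below approximates it for Python source by counting decision points with the standard ast module. It is a simplified count, not a full implementation of the metric and not any particular tool's algorithm.

```python
import ast

# Node types treated as one extra path each (a simplification).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    score = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two short-circuit branches.
            score += len(node.values) - 1
    return score


SAMPLE = """
def first_in_window(limit):
    for i in range(limit):
        if i > 2 and i < 5:
            return i
    return 0
"""

print(cyclomatic_complexity(SAMPLE))  # prints 4: base 1 + for + if + and
```

Ranking modules by a score like this is one way to decide where to look first; a high number flags code with many paths, not code that is proven wrong.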

David Brown · Apr 1, 2020

On 01/04/2020 11:17, DecadentLinuxUserNumeroUno@decadence.org wrote:

I have a trusted engineer friend who once said that most failures
occur at power up or power down. He always left his computers at
work and his home up all the time.

He is right.

I keep my PCs on all the time. Even a Windows machine can run for
months without a restart if treated with due care and kindness. But
it's not just about risk of failure - I usually have so many projects
open at a time on different workspaces (on the Linux systems) that it is a
big effort and waste of time to restart the thing.

Apr 1, 2020

On Wednesday, April 1, 2020 at 7:07:36 AM UTC-4, Martin Brown wrote:

On 31/03/2020 16:34, blocher@columbus.rr.com wrote:

Another topic that I hope can elicit engineering discussion:

What makes up a good skill set for finding the root cause of a failure that is rare, intermittent or obscure?

Over the past several years I have been more involved in root cause failure than I was when I was doing more design work. In many ways I think it is more challenging than design work. It takes a mindset that is different than design.

Here is my reminder list when doing root cause studies

1. never root for a particular outcome when performing a test. Root for not being fooled by the results of your test

Rank the tests you have by their ability to cut down the area where the
fault must lie. In my field McCabe's CCI metric is quite good for that.

If the code complexity index is too high there is a very good chance
that the code doesn't actually work correctly.

2. Assign weighting factors to everything you believe. Never assign a weighting factor of 1 to anything until you know you have the problem solved

I disagree with this at least in part - you should make a list of things
which ought to be true and a list of invariants that you expect to
remain true if things are operating correctly.

To clarify, I meant assigning weighting factors to the conclusions you make as you run through various tests.

3. Expect to have to do certain tests over again and that you will draw an opposite conclusion when you repeat a test than what you concluded after the first test.

Always worry if a test can pass or fail apparently at random.

Again, to clarify, it is not that the test randomly changes the result. It is that there is some subtle element of the test that you missed the first time, and that subtlety produces the opposite result.

4. Taking guidance from "helpful" outsiders is challenging. On the one hand they care and are smart; on the other hand, if you go about chasing other people's ideas (often conceived of just to demonstrate they are concerned in a meeting) you will never get a clear path to troubleshoot the problem in your own way.
Help is a two edged sword. It is important but can sometimes be problematic.

Here I disagree massively. A well chosen helpful outsider can sometimes
help you break a problem even if they are not all that skilled in the
art.

Most of the time we do not get to choose who helps us. There are meetings with a room full of ideas. The intentions are all good... but the road to hell is paved with good intentions.

Explaining to a junior who isn't afraid to ask apparently dumb
questions can sometimes allow you to see your own mistaken assumptions.
With practice verbally explaining it to an empty chair can also work
since it runs the problem through a different part of the brain.

5. As an aside - I have learned that when I "see something" during the design phase, I no longer look at that as a curse, but as a blessing. It is going to come back and get you later.

Sooner you catch a fault the less it costs.

6. Get past the notion that having nothing to show for a day's work is bad. As a designer you can show a day's work for a day's pay. In root cause you feel like you have accomplished nothing for a long time. Frequently, though, these problems are the most visible problems in an organization and can make a difference between losing a customer and keeping one.

7. Look for contradictions in your thinking. Use other people to help you find contradictions in your thinking.

OK - enough for now......

Explaining your reasoning to a relatively junior engineer (or if it is a
really tough problem an engineer of the same rank or higher and who
thinks differently to how you do) can be very powerful.

--
Regards,
Martin Brown
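
One way to read the advice about weighting factors on conclusions, as a minimal sketch: treat each weight as a probability and revise it with Bayes' rule after every test. This is an illustration chosen here, not a method described in the thread, and the function name and numbers are hypothetical. It also shows why a weight of exactly 1 is dangerous: no later evidence can move it.

```python
def update_weight(prior: float, p_result_if_true: float,
                  p_result_if_false: float) -> float:
    """Revise the weight on a belief after seeing one test result (Bayes' rule).

    prior             -- weight on the belief before the test, 0..1
    p_result_if_true  -- chance of this result if the belief is right
    p_result_if_false -- chance of this result if the belief is wrong
    """
    numerator = prior * p_result_if_true
    evidence = numerator + (1.0 - prior) * p_result_if_false
    if evidence == 0.0:
        raise ValueError("result is impossible under both hypotheses")
    return numerator / evidence


# A result that is four times likelier if the belief is wrong.
print(update_weight(0.5, 0.2, 0.8))  # an open mind moves from 0.5 down to about 0.2
print(update_weight(1.0, 0.2, 0.8))  # a weight of 1 stays at 1 whatever the test says
```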

Apr 1, 2020

On Wednesday, April 1, 2020 at 7:54:09 AM UTC-4, David Brown wrote:

On 01/04/2020 11:17, DecadentLinuxUserNumeroUno@decadence.org wrote:

I have a trusted engineer friend who once said that most failures
occur at power up or power down. He always left his computers at
work and his home up all the time.

He is right.

The toughest issue I had to find was a power-up issue. It turned out that the memory part manufacturer had a bug in their handshake codes at power up. Occasionally the part threw a bad code, which set the DSP to the wrong clock speed, which in turn corrupted the NVRAM, and the unit bricked (although it was recoverable at the factory with a complete reprogram). There was a cryptic note in the data sheet; when we finally realized that the note seemed to rhyme with our problem, we contacted the manufacturer. They then gave us the complete story: all date codes prior to a particular time were susceptible to the problem, and later date codes were fixed.

I would have loved to hear the debate about how to put that note in the data sheet. Frankly, they knew that if they were totally candid, then the part was not valid, so they wanted to mask it, but, I guess, some engineer was screaming about how bad this was and they agreed to the cryptic note.

As another aside, this was kind of a good one for us because our customer was mad that they had bricked units in their airplane but when we presented them the problem, it was not our fault and we had been tenacious in finding the problem. And nobody looks bad for designing the thing wrong.

I keep my PCs on all the time. Even a Windows machine can run for
months without a restart if treated with due care and kindness. But
it's not just about risk of failure - I usually have so many projects
open at a time on different workspaces (on the Linux systems) that it is a
big effort and waste of time to restart the thing.

Apr 1, 2020

On Wednesday, April 1, 2020 at 8:40:03 AM UTC-4, blo...@columbus.rr.com wrote:

On Wednesday, April 1, 2020 at 7:54:09 AM UTC-4, David Brown wrote:
On 01/04/2020 11:17, DecadentLinuxUserNumeroUno@decadence.org wrote:

I have a trusted engineer friend who once said that most failures
occur at power up or power down. He always left his computers, at
work and at home, up all the time.

He is right.

The toughest issue I had to find was a power up issue. It turned out that the memory part manufacturer had a bug in their handshake codes at power up, and occasionally it threw a bad code which set the DSP to the wrong clock speed, which in turn corrupted the NVRAM. The unit bricked (although it was recoverable at the factory with a complete reprogram). There was a cryptic note in the data sheet; when we finally realized that the note seemed to rhyme with our problem, we contacted the manufacturer. They then gave us the complete story, which was that all date codes prior to a particular time were susceptible to the problem and date codes after were fixed.

I would have loved to hear the debate about how to put that note in the data sheet. Frankly, they knew that if they were totally candid, then the part was not valid, so they wanted to mask it, but, I guess, some engineer was screaming about how bad this was and they agreed to the cryptic note.

As another aside, this was kind of a good one for us because our customer was mad that they had bricked units in their airplane, but when we presented the problem to them, it was clear that it was not our fault and that we had been tenacious in finding it. And nobody looks bad for designing the thing wrong.

Also, there was one obscure LED on the board that gave an indication that the boot load had finished. Had that LED not been on the board, I do not think we would have ever found the problem. Normally at power up the LED turned on then turned off when everything finished initializing. In this case the LED stuck on, so we knew it was a power on issue. Still a real bugger to find.

I keep my PCs on all the time. Even a Windows machine can run for
months without a restart if treated with due care and kindness. But
it's not just about risk of failure - I usually have so many projects
open at a time on different workspaces (on the Linux systems) that it is a
big effort and waste of time to restart the thing.

David Brown · Apr 1, 2020

On 01/04/2020 11:47, Rick C wrote:

On Wednesday, April 1, 2020 at 4:51:11 AM UTC-4, David Brown wrote:
On 31/03/2020 20:40, Rick C wrote:
On Tuesday, March 31, 2020 at 12:41:36 PM UTC-4, David Brown
wrote:
On 31/03/2020 17:40, blocher@columbus.rr.com wrote:
On Tuesday, March 31, 2020 at 11:34:44 AM UTC-4,
snip

Also - the FPGA guys and the SW guys will only acknowledge a
problem when it is laid out under their nose. It is never
their fault

That's because it's usually a hardware fault - and it can be
solved by using a bigger capacitor

You laugh, but I once used a telephony part that had a PSRR of 0 dB
which I had missed. (Who expects 0 dB?) On the customer's work
bench they were getting noise in the audio that turned out to be
from the DSP power consumption. They were using clip leads to
provide power to the UUT and the on board capacitance wasn't
enough to mitigate it. We told them to use better power
connections and also used a larger cap.

I had a smiley, but I have seen more than a few systems'
reliability improved by adding a bigger capacitor. There is a rule
in software development that "almost all programming can be viewed
as an exercise in caching". (Yes, it is an exaggeration - but
there's a grain of truth in it.) Capacitors are the hardware
equivalent of software caches.

Mind you, I have seen problems with too big capacitors too. I
remember long ago trying to find why a card communicated fine (at
9600 baud RS-232) with some computers but not others. Looking with
a scope, the RS-232 signals were lovely triangle waves - someone
had added 100 nF capacitors to the lines to reduce the noise...
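
A rough back-of-envelope check of why 100 nF turns 9600 baud into triangle waves. This is only a sketch: the driver current limit (10 mA) and the swing (-8 V to +8 V) are assumed illustrative values, not figures from the post, and real RS-232 drivers vary a lot.

```python
# Why 100 nF on an RS-232 line can ruin 9600 baud: a current-limited
# driver can only slew the capacitor at dV/dt = I / C.
# ASSUMPTIONS (not from the post): the driver limits at about 10 mA
# and swings roughly -8 V to +8 V.

BAUD = 9600
C_LINE = 100e-9    # farads, the added capacitor mentioned in the post
I_DRIVE = 10e-3    # amps, ASSUMED driver current limit
V_SWING = 16.0     # volts, ASSUMED full swing (-8 V to +8 V)


def bit_time(baud):
    """Duration of one bit in seconds."""
    return 1.0 / baud


def transition_time(c, i, v_swing):
    """Time for a constant-current source to slew a capacitor by v_swing."""
    return c * v_swing / i


t_bit = bit_time(BAUD)
t_slew = transition_time(C_LINE, I_DRIVE, V_SWING)

print(f"bit time:        {t_bit * 1e6:.1f} us")
print(f"transition time: {t_slew * 1e6:.1f} us")
print(f"slew / bit time: {t_slew / t_bit:.2f}")
```

With those assumed numbers a full transition takes longer than one bit period, so the line never settles before the next edge, which is consistent with the triangle waves seen on the scope.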

Yeah, I've gone overboard too. The board I'm making now has a 150 uF
tant on the 12 volt line because I have no real specs on the power
source and it's used as a daughtercard on multiple boards anyway.
The original design was for a 1 kHz tone signal driving
a 50 ohm load at ±8 volts. So I used the biggest part I could find not
knowing what else might be on that power rail glitching away. Turns
out 150 uF x 8 daughtercards was a bit much for the supply at power
up! Fortunately the chip they used had a cap you could change to set
the ramp speed and once it was dialed back it worked fine.
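
To put rough numbers on the power-up load described above, here is a minimal sketch using I = C * dV/dt. The capacitance, card count and rail voltage are from the post; the ramp times are assumed purely for illustration, since the actual ramp setting is not given.

```python
# Average charging current drawn by bulk capacitance while the supply
# ramps linearly from 0 V to the rail voltage: I = C * dV/dt.
# From the post: 150 uF per daughtercard, 8 daughtercards, 12 V rail.
# ASSUMPTION: the ramp times below are illustrative, not from the post.

C_PER_CARD = 150e-6   # farads
N_CARDS = 8
V_RAIL = 12.0         # volts


def charging_current(c_total, v_rail, t_ramp):
    """Average current needed to charge c_total to v_rail in t_ramp seconds."""
    return c_total * v_rail / t_ramp


c_total = C_PER_CARD * N_CARDS

for t_ramp in (0.1e-3, 1e-3, 10e-3):   # ASSUMED ramp times, in seconds
    amps = charging_current(c_total, V_RAIL, t_ramp)
    print(f"ramp {t_ramp * 1e3:5.1f} ms -> about {amps:6.2f} A into {c_total * 1e6:.0f} uF")
```

The point is only the scaling: for a fixed capacitance, slowing the ramp by a factor of ten cuts the charging current by the same factor, which is why dialing back the ramp speed cured the problem.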

After that the only problem was ham-fisted installers who shove the
boards into the rack misaligned, scraping these tall caps right off
the card!

Ham-fisted installers are always a problem! We made a number of systems
that were used in farming industries, and we'd get boards back for
service that were an incredible mess, with electronics fried and
connectors and sockets broken. The hand-scribbled failure report would
say things like "the socket was the wrong size - I had to use a hammer
to get the plug in". Round plugs and square holes were no hindrance to
these guys.
