Approach to Finding the Root Cause of Failures

On Wed, 1 Apr 2020 12:07:30 +0100, Martin Brown
<'''newspam'''@nezumi.demon.co.uk> wrote:

On 31/03/2020 16:34, blocher@columbus.rr.com wrote:

Another topic that I hope can elicit engineering discussion:

What makes up a good skill set for finding the root cause of a failure that is rare, intermittent or obscure?

Over the past several years I have been more involved in root cause failure than I was when I was doing more design work. In many ways I think it is more challenging than design work. It takes a mindset that is different than design.

Here is my reminder list when doing root cause studies

1. never root for a particular outcome when performing a test. Root for not being fooled by the results of your test

Rank the tests you have by their ability to cut down the area where the
fault must lie. In my field McCabe's CCI metric is quite good for that.

Do a binary search when you can. Keep cutting the solution space in
half.
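
A minimal sketch of that halving strategy, assuming the system can be laid out as an ordered chain in which everything before the fault checks out good and everything from the fault onward checks out bad. The stage names and the stage_output_is_wrong probe are hypothetical, invented for illustration; they are not from the thread.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

def first_bad(items: Sequence[T], is_bad: Callable[[T], bool]) -> int:
    """Return the index of the first item for which is_bad() is true.

    Assumes a single good-to-bad transition along the sequence and
    that the last item is already known to be bad.
    """
    lo, hi = 0, len(items) - 1        # invariant: first bad index lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(items[mid]):
            hi = mid                  # fault is at mid or earlier
        else:
            lo = mid + 1              # fault is after mid
    return lo

# Hypothetical signal chain; the fault is planted at "ADC" (index 5).
stages = ["connector", "filter", "preamp", "mixer",
          "ADC driver", "ADC", "FPGA", "UART"]
probes: List[str] = []

def stage_output_is_wrong(name: str) -> bool:
    probes.append(name)               # record each measurement we had to make
    return stages.index(name) >= 5

culprit = first_bad(stages, stage_output_is_wrong)
print(stages[culprit], "located with", len(probes), "probes")
```

With eight stages this takes three probes rather than up to eight. The single-transition assumption is the weak point: an intermittent fault or a feedback path can make a "good" reading at the midpoint misleading.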



--

John Larkin Highland Technology, Inc

Science teaches us to doubt.

Claude Bernard
 
On Tue, 31 Mar 2020 16:02:29 -0700 (PDT), Phil Allison
<pallison49@gmail.com> wrote:

blo...@columbus.rr.com wrote:

-------------------------------
Another topic that I hope can elicit engineering discussion:


** IOW another mindless troll.

What makes up a good skill set for finding the root cause of a
failure that is rare, intermittent or obscure?


** Analyse the actual failure first.

Something good service techs do every day, but few designers have a clue about.

Let's fire all those engineers and replace them with guitar repairmen.



--

John Larkin Highland Technology, Inc

Science teaches us to doubt.

Claude Bernard
 
On 2020-03-31 20:54, blocher@columbus.rr.com wrote:
On Tuesday, March 31, 2020 at 7:02:43 PM UTC-4, Phil Allison wrote:
blo...@columbus.rr.com wrote:

-------------------------------
Another topic that I hope can elicit engineering discussion:


** IOW another mindless troll.

I guess I succeeded in my post then?

Yeah, you're on a roll. Good work.

Cheers

Phil Hobbs


--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510

http://electrooptical.net
http://hobbs-eo.com
 
On 2020-03-31 15:10, John Larkin wrote:
On Tue, 31 Mar 2020 08:34:34 -0700 (PDT), blocher@columbus.rr.com
wrote:


Another topic that I hope can elicit engineering discussion:

What makes up a good skill set for finding the root cause of a failure that is rare, intermittent or obscure?

Over the past several years I have been more involved in root cause failure than I was when I was doing more design work. In many ways I think it is more challenging than design work. It takes a mindset that is different than design.

Here is my reminder list when doing root cause studies

1. never root for a particular outcome when performing a test. Root for not being fooled by the results of your test

2. Assign weighting factors to everything you believe. Never assign a weighting factor of 1 to anything until you know you have the problem solved

3. Expect to have to do certain tests over again and that you will draw an opposite conclusion when you repeat a test than what you concluded after the first test.

4. Taking guidance from "helpful" outsiders is challenging. On the one hand they care and are smart; on the other hand, if you go about chasing other people's ideas (often conceived just to demonstrate in a meeting that they are concerned), you will never get a clear path to troubleshoot the problem in your own way.
Help is a two edged sword. It is important but can sometimes be problematic.

5. As an aside - I have learned that when I "see something" during the design phase, I no longer look at that as a curse, but as a blessing. It is going to come back and get you later.

6. Get past the notion that having nothing to show for a day's work is bad. As a designer you can show a day's work for a day's pay. In root cause you feel like you have accomplished nothing for a long time. Frequently, though, these problems are the most visible problems in an organization and can make the difference between losing a customer and keeping one.

7. Look for contradictions in your thinking. Use other people to help you find contradictions in your thinking.

OK - enough for now......
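
Item 2 can be read as informal Bayesian bookkeeping. A rough sketch under that reading (an interpretation, not necessarily what the poster had in mind); the hypotheses and every number below are made up for illustration.

```python
from typing import Dict

def update(weights: Dict[str, float], likelihoods: Dict[str, float]) -> Dict[str, float]:
    """One Bayes update.

    weights:     hypothesis -> current belief (should sum to 1)
    likelihoods: hypothesis -> P(observed test result | hypothesis)
    """
    unnormalised = {h: weights[h] * likelihoods[h] for h in weights}
    total = sum(unnormalised.values())
    return {h: v / total for h, v in unnormalised.items()}

# Hypothetical starting beliefs about an intermittent fault.
beliefs = {"bad solder joint": 0.5, "timing margin": 0.3, "firmware race": 0.2}

# Hypothetical observation: the fault shows up under freeze spray.
# How likely is that observation under each hypothesis? (guessed values)
seen_when_cold = {"bad solder joint": 0.6, "timing margin": 0.8, "firmware race": 0.1}

beliefs = update(beliefs, seen_when_cold)
for name, w in beliefs.items():
    print(f"{name:18s} {w:.2f}")
```

As long as no likelihood is exactly 0 or 1, no weight ever reaches 1, which matches the advice not to assign certainty until the problem is actually solved.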


One thing that helps to find intermittents is temperature testing. If
you temp test new designs, you'll have a lot fewer bugs later.

Yup. Cold spray and a heat gun can reveal all sorts of buried treasure.

Cheers

Phil Hobbs

 
On 2020-04-01 05:17, DecadentLinuxUserNumeroUno@decadence.org wrote:
David Brown <david.brown@hesbynett.no> wrote in
news:r61klr$369$1@dont-email.me:

On 31/03/2020 20:40, Rick C wrote:
On Tuesday, March 31, 2020 at 12:41:36 PM UTC-4, David Brown
wrote:
On 31/03/2020 17:40, blocher@columbus.rr.com wrote:
On Tuesday, March 31, 2020 at 11:34:44 AM UTC-4,
snip

Also - the FPGA guys and the SW guys will only acknowledge a
problem when it is laid out under their nose. It is never
their fault :)


That's because it's usually a hardware fault - and it can be
solved by using a bigger capacitor :)

You laugh, I once used a telephony part that had a PSRR of 0dB
which I had missed. (Who expects 0 dB?) On the customer's work
bench they were getting noise in the audio that turned out to be
from the DSP power consumption. They were using clip leads to
provide power to the UUT and the on board capacitance wasn't
enough to mitigate it. We told them to use better power
connections and also used a larger cap.


I had a smiley, but I have seen more than a few systems'
reliability improved by adding a bigger capacitor. There is a
rule in software development that "almost all programming can be
viewed as an exercise in caching". (Yes, it is an exaggeration -
but there's a grain of truth in it.) Capacitors are the hardware
equivalent of software caches.


Mind you, I have seen problems with too big capacitors too. I
remember long ago trying to find why a card communicated fine (at
9600 baud RS-232) with some computers but not others. Looking
with a scope, the RS-232 signals were lovely triangle waves -
someone had added 100 nF capacitors to the lines to reduce the
noise...
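
A back-of-envelope check on why 100 nF turns 9600 baud into triangle waves. The baud rate and capacitance are from the post; the driver current limit and voltage swing are assumed round numbers, not measured values, and a real driver will differ.

```python
BAUD = 9600            # from the post
C_ADDED = 100e-9       # farads, from the post
I_DRIVER = 10e-3       # amps; ASSUMED driver output current limit
SWING = 10.0           # volts; ASSUMED swing, e.g. -5 V to +5 V

bit_time = 1.0 / BAUD                      # seconds per bit
slew_time = C_ADDED * SWING / I_DRIVER     # t = C * dV / I for a current-limited driver

print(f"bit time  : {bit_time * 1e6:.1f} us")
print(f"slew time : {slew_time * 1e6:.1f} us")
print(f"ratio     : {slew_time / bit_time:.2f}")
```

Under these assumptions each edge takes nearly a whole bit period, so the line never settles. A stronger driver or a more forgiving receiver threshold on some machines would be one plausible reason it worked with some computers and not others.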

I have a trusted engineer friend who once said that most failures
occur at power up or power down. He always left his computers at
work and his home up all the time.

Old net and system admin guys usually like keeping systems up and
running at all times too.
The big computer rooms of the sixties would lose thousands an hour
in insurance if the room temperature rose above a preset level like
63°F.

Yup. At IBM Watson we used to shut the whole place down over Labor Day
weekend. It always took a couple of days to get the silicon fab line
back up, because things like corroded connections and worn-out motors
tend to fail at inrush.

Cheers

Phil Hobbs

 
On 2020-03-31 18:50, George Herold wrote:
On Tuesday, March 31, 2020 at 4:08:59 PM UTC-4, Phil Hobbs wrote:
On 2020-03-31 14:40, Rick C wrote:
On Tuesday, March 31, 2020 at 12:41:36 PM UTC-4, David Brown wrote:
On 31/03/2020 17:40, blocher@columbus.rr.com wrote:
On Tuesday, March 31, 2020 at 11:34:44 AM UTC-4, blo...@columbus.rr.com wrote:
Another topic that I hope can elicit engineering discussion:

What makes up a good skill set for finding the root cause of a failure that is rare, intermittent or obscure?

Over the past several years I have been more involved in root cause failure than I was when I was doing more design work. In many ways I think it is more challenging than design work. It takes a mindset that is different than design.

Here is my reminder list when doing root cause studies

1. never root for a particular outcome when performing a test. Root for not being fooled by the results of your test

2. Assign weighting factors to everything you believe. Never assign a weighting factor of 1 to anything until you know you have the problem solved

3. Expect to have to do certain tests over again and that you will draw an opposite conclusion when you repeat a test than what you concluded after the first test.

4. Taking guidance from "helpful" outsiders is challenging. On the one hand they care and are smart; on the other hand, if you go about chasing other people's ideas (often conceived just to demonstrate in a meeting that they are concerned), you will never get a clear path to troubleshoot the problem in your own way.
Help is a two edged sword. It is important but can sometimes be problematic.

5. As an aside - I have learned that when I "see something" during the design phase, I no longer look at that as a curse, but as a blessing. It is going to come back and get you later.

6. Get past the notion that having nothing to show for a day's work is bad. As a designer you can show a day's work for a day's pay. In root cause you feel like you have accomplished nothing for a long time. Frequently, though, these problems are the most visible problems in an organization and can make the difference between losing a customer and keeping one.

7. Look for contradictions in your thinking. Use other people to help you find contradictions in your thinking.

OK - enough for now......
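
Items 1 and 3 raise the question of how many clean runs it takes before a "fixed" intermittent can be believed. A back-of-envelope sketch, assuming each run fails independently with the same probability (a strong assumption for real hardware); the function name and the numbers are illustrative only.

```python
import math

def clean_runs_needed(p_fail: float, confidence: float) -> int:
    """Smallest n such that, if the fault were still present with per-run
    failure probability p_fail, at least one failure would be seen in
    n runs with probability >= confidence.

    Solves (1 - p_fail)**n <= 1 - confidence for n.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))

# Hypothetical: the fault used to show about once in 100 runs.
n = clean_runs_needed(p_fail=0.01, confidence=0.95)
print(n)
```

For a one-in-a-hundred fault that comes to roughly 300 consecutive clean runs for 95% confidence, which is why a handful of passing tests after a change proves very little.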

Also - the FPGA guys and the SW guys will only acknowledge a problem when it is laid out under their nose. It is never their fault :)


That's because it's usually a hardware fault - and it can be solved by
using a bigger capacitor :)

You laugh, I once used a telephony part that had a PSRR of 0dB which I had missed. (Who expects 0 dB?) On the customer's work bench they were getting noise in the audio that turned out to be from the DSP power consumption. They were using clip leads to provide power to the UUT and the on board capacitance wasn't enough to mitigate it. We told them to use better power connections and also used a larger cap.

0 dB of PSRR??? How can you even do that exactly??? CP Clare, what a piece of work they are. The other CP Clare part had a problem that virtually made it unusable, but they didn't point it out in the data sheet. I wonder if they actually use engineers or if they just let high school kids design their ICs?

Are you quoting that WRT the input or the output? PSRR and CMRR are
normally quoted input-referred, i.e. to find out the effect you have to
multiply by the overall gain.

There are lots of parts that can have negative-dB PSRR as referred to
the output.
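
A small worked example of the input-referred versus output-referred distinction. The 40 dB figure and the gain of 1000 are hypothetical numbers chosen for illustration, not data for any real part.

```python
import math

def output_referred_psrr_db(input_referred_psrr_db: float, gain: float) -> float:
    """Supply ripple referred to the input gets multiplied by the gain on
    its way to the output, so the rejection seen at the output is lower
    by 20*log10(gain) dB."""
    return input_referred_psrr_db - 20.0 * math.log10(gain)

# Hypothetical: 40 dB input-referred PSRR at the frequency of interest,
# with the stage running at a gain of 1000 (60 dB).
psrr_out = output_referred_psrr_db(40.0, 1000.0)
print(f"{psrr_out:.1f} dB")
```

With these numbers the output-referred figure is negative, meaning supply ripple appears at the output larger than it was on the rail.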
At higher frequencies aren't there many opamps that cross
0 dB PSRR? At least for one of the rails.

Negative PSRR is usually horrible in "single supply" op amps, because,
duh, they expect you to use a single positive supply. ;)

> (That's why God* invented the cap. multiplier.)

Yup.

George H.
*or one of his offspring....

Well, children, anyway. ;)

who did do the cap mult. first?

Dunno. I first saw it in an audio amp project in a magazine, circa
1977. The LED + NPN emitter-follower voltage reference, I saw in an
article of Walt Jung's at about the same time.

We should revisit that "how many two-transistor circuits are there?"
thread at some point.

Cheers

Phil Hobbs

 
On Tuesday, March 31, 2020 at 11:55:31 PM UTC-4, jla...@highlandsniptechnology.com wrote:
On Tue, 31 Mar 2020 19:33:22 -0700 (PDT), George Herold
<ggherold@gmail.com> wrote:

On Tuesday, March 31, 2020 at 7:34:08 PM UTC-4, jla...@highlandsniptechnology.com wrote:
On Tue, 31 Mar 2020 13:35:57 -0700 (PDT), George Herold
<ggherold@gmail.com> wrote:

On Tuesday, March 31, 2020 at 3:04:56 PM UTC-4, John Larkin wrote:
On Tue, 31 Mar 2020 14:55:02 -0400, ABLE1 <somewhere@nowhere.net>
wrote:

On 3/31/2020 11:34 AM, blocher@columbus.rr.com wrote:

Another topic that I hope can elicit engineering discussion:

What makes up a good skill set for finding the root cause of a failure that is rare, intermittent or obscure?

Over the past several years I have been more involved in root cause failure than I was when I was doing more design work. In many ways I think it is more challenging than design work. It takes a mindset that is different than design.

Here is my reminder list when doing root cause studies

1. never root for a particular outcome when performing a test. Root for not being fooled by the results of your test

2. Assign weighting factors to everything you believe. Never assign a weighting factor of 1 to anything until you know you have the problem solved

3. Expect to have to do certain tests over again and that you will draw an opposite conclusion when you repeat a test than what you concluded after the first test.

4. Taking guidance from "helpful" outsiders is challenging. On the one hand they care and are smart; on the other hand, if you go about chasing other people's ideas (often conceived just to demonstrate in a meeting that they are concerned), you will never get a clear path to troubleshoot the problem in your own way.
Help is a two edged sword. It is important but can sometimes be problematic.

5. As an aside - I have learned that when I "see something" during the design phase, I no longer look at that as a curse, but as a blessing. It is going to come back and get you later.

6. Get past the notion that having nothing to show for a day's work is bad. As a designer you can show a day's work for a day's pay. In root cause you feel like you have accomplished nothing for a long time. Frequently, though, these problems are the most visible problems in an organization and can make the difference between losing a customer and keeping one.

7. Look for contradictions in your thinking. Use other people to help you find contradictions in your thinking.

OK - enough for now......

Whoops!! 2nd try!!

With all the above being typed and read I have a much simpler way
to look at the problem.

Just use the "Not Method of Troubleshooting".

The Not Method goes like this.

It's Not this!!
It's Not that!!
Once you have identified all the Not's, the only thing left
is not a not, but is the real problem.
Fix it or replace and move on!!

Now I am sure someone will find fault with my method, well Ok then!!
Some days the Not's just have to be adjusted.

Have a good day!!

Les

That's the Sherlock Holmes technique. It doesn't work very well. The
list of NOTs to test is too big, and you are unlikely to include in
the list the things you missed when you did the design.

The NOT technique is a last resort. (it's how I found a
leaky toggle switch... we had a bag of leaky
switches; most circuits didn't care if there
are a few megohms of resistance.) Before you pull all
your hair out, you pull all the components out and replace 'em.
But how do you know the replacement component is good?
Quickly a knotty nightmare.

I find it best to get as much data as possible,
and then sleep on it*. When you think about it 'actively'
you tend to get stuck in your first assumption rut.
(And if your first assumption had been right, it'd be
fixed/found already. :^)

George H.
*or go explain the problem to someone else... not that they will
be able to help (well they might) but because having to explain it
makes you go over the whole circuit and may remind you of the part
you haven't been thinking about.

--

John Larkin Highland Technology, Inc
picosecond timing precision measurement

jlarkin att highlandtechnology dott com
http://www.highlandtechnology.com


I say to myself, "This was designed by an idiot. What stupid mistake
did he make?"

There is a tendency to blame parts, when the problem is usually
design.
I'm mostly talking about circuits I designed.
But I do say the exact same things to myself. :^)

I meant myself.
Grin.. sorry. Humor doesn't work well when not face to face.

GH
--

John Larkin Highland Technology, Inc

Science teaches us to doubt.

Claude Bernard
 
On 2020-04-01 08:43, blocher@columbus.rr.com wrote:
On Wednesday, April 1, 2020 at 8:40:03 AM UTC-4, blo...@columbus.rr.com wrote:
On Wednesday, April 1, 2020 at 7:54:09 AM UTC-4, David Brown wrote:
On 01/04/2020 11:17, DecadentLinuxUserNumeroUno@decadence.org wrote:

I have a trusted engineer friend who once said that most failures
occur at power up or power down. He always left his computers at
work and his home up all the time.

He is right.

The toughest issue I had to find was a power-up issue. It turned out that the memory part manufacturer had a bug in their handshake codes at power up, and occasionally it threw a bad code, which set the DSP to a wrong clock speed, which in turn corrupted the NVRAM. The unit bricked (although it was recoverable at the factory with a complete reprogram). There was a cryptic note in the data sheet; when we finally realized that the note seemed to rhyme with our problem, we contacted the manufacturer. They then gave us the complete story, which was that all date codes prior to a particular time were susceptible to the problem and later date codes were fixed.

I would have loved to hear the debate about how to put that note in the data sheet. Frankly, they knew that if they were totally candid, the part would not be considered valid, so they wanted to mask it; but, I guess, some engineer was screaming about how bad this was and they agreed to the cryptic note.

As another aside, this was kind of a good one for us because our customer was mad that they had bricked units in their airplane but when we presented them the problem, it was not our fault and we had been tenacious in finding the problem. And nobody looks bad for designing the thing wrong.

Also, there was one obscure LED on the board that gave an indication that the boot load had finished. Had that LED not been on the board, I do not think we would have ever found the problem. Normally at power up the LED turned on then turned off when everything finished initializing. In this case the LED stuck on, so we knew it was a power on issue. Still a real bugger to find.

Having the appropriate number of blinky LEDs is key. Sometimes when I
run short of pins, I'll have the housekeeping loop output a state code
from a UART. That's super helpful in keeping track of state machines
and so on.
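
A host-side sketch of what can be done with such a stream of state codes once it has been captured from the UART. The code-to-name table and the sample capture are hypothetical; a real project would define its own codes, and nothing here talks to actual hardware.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping of one-byte state codes to state names.
STATE_NAMES: Dict[int, str] = {
    0x01: "RESET",
    0x02: "CLOCK_INIT",
    0x03: "NVRAM_CHECK",
    0x04: "RUNNING",
}

def transitions(capture: bytes) -> List[Tuple[str, int]]:
    """Collapse runs of repeated state codes into (state, count) pairs,
    so a stuck state machine shows up as one long final run."""
    out: List[Tuple[str, int]] = []
    for b in capture:
        name = STATE_NAMES.get(b, f"UNKNOWN(0x{b:02X})")
        if out and out[-1][0] == name:
            out[-1] = (name, out[-1][1] + 1)
        else:
            out.append((name, 1))
    return out

# Hypothetical capture: the unit never leaves NVRAM_CHECK.
capture = bytes([0x01, 0x02, 0x02, 0x03, 0x03, 0x03, 0x03])
history = transitions(capture)
for state, count in history:
    print(f"{state:12s} x{count}")
```

The last entry in the history is where the housekeeping loop was when the capture ended, which plays the same role as the LED that stuck on in the power-up story above.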

I keep my PCs on all the time. Even a Windows machine can run for
months without a restart if treated with due care and kindness. But
it's not just about risk of failure - I usually have so many projects
open at a time on different workspaces (on the Linux systems) that it is a
big effort and waste of time to restart the thing.

I sometimes do that too, but only when the project is under version
control, which most are. (Github/Gitlab private repos are good for
projects where nobody else would know what it is. Not so much for the
crown jewels.)

Cheers

Phil Hobbs


 
On 2020-03-31 13:59, Tom Gardner wrote:
On 31/03/20 18:17, George Herold wrote:
Hmm OK.  I designate two types of problem solving.

1.) Your (prototype) gizmo is not working.
I call this de-bugging.  The problem could be somewhere in the
gizmo, or you may have made a fundamental error in your idea.
Those are the hardest types of problems.

2.) You've got several working units but this one from production
has a problem not seen before.
I call that trouble shooting... it's easier because you've got working
units, so you know it can't be a fundamental problem.
It could still be a design problem.  Like you didn't spec the spread in
cap ESR on the voltage regulator and the odd high or low esr cap causes
your voltage regulator to oscillate.

Add 3) It fails on some customers' site, but not elsewhere.

Now, is it because the customers' equipment is at fault or
the spec is inadequate (whatever that might mean)?

IME this is usually EMI. ;)
(Courtesy of Palindromic Engineering.)

Taking a perfectly good piece of equipment and having folks connect it
up with BNC cables between two different racks, with nice large ground
loops and a VF drive in the HVAC overhead is one typical way to do this.

Cheers

Phil Hobbs

 
On Wed, 1 Apr 2020 13:15:50 -0400, Phil Hobbs
<pcdhSpamMeSenseless@electrooptical.net> wrote:

On 2020-03-31 15:10, John Larkin wrote:
On Tue, 31 Mar 2020 08:34:34 -0700 (PDT), blocher@columbus.rr.com
wrote:


Another topic that I hope can elicit engineering discussion:

What makes up a good skill set for finding the root cause of a failure that is rare, intermittent or obscure?

Over the past several years I have been more involved in root cause failure than I was when I was doing more design work. In many ways I think it is more challenging than design work. It takes a mindset that is different than design.

Here is my reminder list when doing root cause studies

1. never root for a particular outcome when performing a test. Root for not being fooled by the results of your test

2. Assign weighting factors to everything you believe. Never assign a weighting factor of 1 to anything until you know you have the problem solved

3. Expect to have to do certain tests over again and that you will draw an opposite conclusion when you repeat a test than what you concluded after the first test.

4. Taking guidance from "helpful" outsiders is challenging. On the one hand they care and are smart; on the other hand, if you go about chasing other people's ideas (often conceived just to demonstrate in a meeting that they are concerned), you will never get a clear path to troubleshoot the problem in your own way.
Help is a two edged sword. It is important but can sometimes be problematic.

5. As an aside - I have learned that when I "see something" during the design phase, I no longer look at that as a curse, but as a blessing. It is going to come back and get you later.

6. Get past the notion that having nothing to show for a day's work is bad. As a designer you can show a day's work for a day's pay. In root cause you feel like you have accomplished nothing for a long time. Frequently, though, these problems are the most visible problems in an organization and can make the difference between losing a customer and keeping one.

7. Look for contradictions in your thinking. Use other people to help you find contradictions in your thinking.

OK - enough for now......


One thing that helps to find intermittents is temperature testing. If
you temp test new designs, you'll have a lot fewer bugs later.


Yup. Cold spray and a heat gun can reveal all sorts of buried treasure.

Cheers

Phil Hobbs

I have cut through days of someone chasing a timing error, by
spritzing around for a few seconds with an ancient can of Radio Shack
ozone-destroying freeze spray.




--

John Larkin Highland Technology, Inc

Science teaches us to doubt.

Claude Bernard
 
On 2020-04-01 10:38, jlarkin@highlandsniptechnology.com wrote:
On Wed, 1 Apr 2020 12:07:30 +0100, Martin Brown
<'''newspam'''@nezumi.demon.co.uk> wrote:

On 31/03/2020 16:34, blocher@columbus.rr.com wrote:

Another topic that I hope can elicit engineering discussion:

What makes up a good skill set for finding the root cause of a failure that is rare, intermittent or obscure?

Over the past several years I have been more involved in root cause failure than I was when I was doing more design work. In many ways I think it is more challenging than design work. It takes a mindset that is different than design.

Here is my reminder list when doing root cause studies

1. never root for a particular outcome when performing a test. Root for not being fooled by the results of your test

Rank the tests you have by their ability to cut down the area where the
fault must lie. In my field McCabe's CCI metric is quite good for that.

Do a binary search when you can. Keep cutting the solution space in
half.

Works great for simply-connected systems.

Another point is to pay special attention to where the signal changes
domains, e.g. fibre to free space, optical to electronic, analogue to
digital, time domain to frequency domain.

Everybody's first digital lock-in design fails, because they aren't
sufficiently paranoid about the A/D subsystem.

Cheers

Phil Hobbs

 
On Wed, 1 Apr 2020 12:58:34 -0400, Phil Hobbs
<pcdhSpamMeSenseless@electrooptical.net> wrote:

On 2020-03-31 13:59, Tom Gardner wrote:
On 31/03/20 18:17, George Herold wrote:
Hmm OK.  I designate two types of problem solving.

1.) Your (prototype) gizmo is not working.
I call this de-bugging.  The problem could be somewhere in the
gizmo, or you may have made a fundamental error in your idea.
Those are the hardest types of problems.

2.) You've got several working units but this one from production
has a problem not seen before.
I call that trouble shooting... it's easier because you've got working
units, so you know it can't be a fundamental problem.
It could still be a design problem.  Like you didn't spec the spread in
cap ESR on the voltage regulator and the odd high or low esr cap causes
your voltage regulator to oscillate.

Add 3) It fails on some customers' site, but not elsewhere.

Now, is it because the customers' equipment is at fault or
the spec is inadequate (whatever that might mean)?


IME this is usually EMI. ;)
(Courtesy of Palindromic Engineering.)

Taking a perfectly good piece of equipment and having folks connect it
up with BNC cables between two different racks, with nice large ground
loops and a VF drive in the HVAC overhead is one typical way to do this.

Cheers

Phil Hobbs

https://www.dropbox.com/s/iajkj7t0z4orwxq/Acceptable%20Energy%20Level.jpg?raw=1

Both traces are from UV photodiodes. They expect me to time-stamp them
to picosecond resolution.

I never knew that there was negative light.

--

John Larkin Highland Technology, Inc
picosecond timing precision measurement

jlarkin att highlandtechnology dott com
http://www.highlandtechnology.com
 
jlarkin@highlandsniptechnology.com wrote in
news:cpd98f587i9du46lu9uuu30d6fktvtb6l5@4ax.com:

On Tue, 31 Mar 2020 16:02:29 -0700 (PDT), Phil Allison
<pallison49@gmail.com> wrote:

blo...@columbus.rr.com wrote:

-------------------------------
Another topic that I hope can elicit engineering discussion:


** IOW another mindless troll.

What makes up a good skill set for finding the root cause of a
failure that is rare, intermittent or obscure?


** Analyse the actual failure first.

Something good service techs do every day, but few designers have
a clue about.

Let's fire all those engineers and replace them with guitar
repairmen.

Are you trying to tell us that all of your hands on experience
had/has no value or merit in your current grasp of the realm?
 
On Wednesday, April 1, 2020 at 7:58:38 AM UTC-4, David Brown wrote:
On 01/04/2020 11:47, Rick C wrote:
On Wednesday, April 1, 2020 at 4:51:11 AM UTC-4, David Brown wrote:
On 31/03/2020 20:40, Rick C wrote:
On Tuesday, March 31, 2020 at 12:41:36 PM UTC-4, David Brown
wrote:
On 31/03/2020 17:40, blocher@columbus.rr.com wrote:
On Tuesday, March 31, 2020 at 11:34:44 AM UTC-4,
snip

Also - the FPGA guys and the SW guys will only acknowledge a
problem when it is laid out under their nose. It is never
their fault :)


That's because it's usually a hardware fault - and it can be
solved by using a bigger capacitor :)

You laugh, I once used a telephony part that had a PSRR of 0dB
which I had missed. (Who expects 0 dB?) On the customer's work
bench they were getting noise in the audio that turned out to be
from the DSP power consumption. They were using clip leads to
provide power to the UUT and the on board capacitance wasn't
enough to mitigate it. We told them to use better power
connections and also used a larger cap.


I had a smiley, but I have seen more than a few systems'
reliability improved by adding a bigger capacitor. There is a rule
in software development that "almost all programming can be viewed
as an exercise in caching". (Yes, it is an exaggeration - but
there's a grain of truth in it.) Capacitors are the hardware
equivalent of software caches.


Mind you, I have seen problems with too big capacitors too. I
remember long ago trying to find why a card communicated fine (at
9600 baud RS-232) with some computers but not others. Looking with
a scope, the RS-232 signals were lovely triangle waves - someone
had added 100 nF capacitors to the lines to reduce the noise...

Yeah, I've gone overboard too. The board I'm making now has a 150 uF
tant on the 12 volt line because I have no real specs on the power
source and there are multiple boards it's used on as a daughtercard
for anyway. The original design was for a 1 kHz tone signal driving
a 50 ohm load ±8 volts. So I used the biggest part I could find not
knowing what else might be on that power rail glitching away. Turns
out 150 uF x 8 daughtercards was a bit much for the supply at power
up! Fortunately the chip they used had a cap you could change to set
the ramp speed and once it was dialed back it worked fine.
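
Rough arithmetic for why slowing the ramp helped. The capacitance, card count and rail voltage are from the post; the ramp times are assumed values for illustration, and ESR and load current are ignored.

```python
C_PER_CARD = 150e-6    # farads, from the post
N_CARDS = 8            # from the post
V_RAIL = 12.0          # volts, from the post

def inrush_current(ramp_time_s: float) -> float:
    """Average charging current for a linear voltage ramp: I = C * dV/dt."""
    return C_PER_CARD * N_CARDS * V_RAIL / ramp_time_s

for ramp_ms in (1, 10, 100):           # ASSUMED ramp times, not from the post
    amps = inrush_current(ramp_ms / 1000.0)
    print(f"{ramp_ms:>4} ms ramp -> {amps:.3f} A")
```

The total is 1200 uF, so every tenfold increase in ramp time cuts the charging current by ten, which is the knob the ramp-setting capacitor provides.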

After that the only problem was ham fisted installers who shove the
boards into the rack misaligned scraping these tall caps right off
the card!


Ham-fisted installers are always a problem! We made a number of systems
that were used in farming industries, and we'd get boards back for
service that were an incredible mess, with electronics fried and
connectors and sockets broken. The hand-scribbled failure report would
say things like "the socket was the wrong size - I had to use a hammer
to get the plug in". Round plugs and square holes were no hindrance to
these guys.

I think that's not ham fisted, that's the whole damn pig!

--

Rick C.

 
On 2020-04-01 14:25, John Larkin wrote:
On Wed, 1 Apr 2020 12:58:34 -0400, Phil Hobbs
<pcdhSpamMeSenseless@electrooptical.net> wrote:

On 2020-03-31 13:59, Tom Gardner wrote:
On 31/03/20 18:17, George Herold wrote:
Hmm OK.  I designate two types of problem solving.

1.) Your (prototype) gizmo is not working.
I call this de-bugging.  The problem could be somewhere in the
gizmo, or you may have made a fundamental error in your idea.
Those are the hardest types of problems.

2.) You've got several working units but this one from production
has a problem not seen before.
I call that trouble shooting... it's easier because you've got working
units, so you know it can't be a fundamental problem.
It could still be a design problem.  Like you didn't spec the spread in
cap ESR on the voltage regulator and the odd high or low esr cap causes
your voltage regulator to oscillate.

Add 3) It fails on some customers' site, but not elsewhere.

Now, is it because the customers' equipment is at fault or
the spec is inadequate (whatever that might mean)?


IME this is usually EMI. ;)
(Courtesy of Palindromic Engineering.)

Taking a perfectly good piece of equipment and having folks connect it
up with BNC cables between two different racks, with nice large ground
loops and a VF drive in the HVAC overhead is one typical way to do this.

Cheers

Phil Hobbs

https://www.dropbox.com/s/iajkj7t0z4orwxq/Acceptable%20Energy%20Level.jpg?raw=1

Both traces are from UV photodiodes. They expect me to time-stamp them
to picosecond resolution.

Just write the spec properly, no worries. ;)

I never knew that there was negative light.

Well, in the Silmarillion there's the Unlight of Ungoliant. Must be
something like that going on.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510

http://electrooptical.net
http://hobbs-eo.com
 
jla...@highlandsniptechnology.com wrote:

---------------------------------------------

blo...@columbus.rr.com wrote:

-------------------------------
Another topic that I hope can elicit engineering discussion:


** IOW another mindless troll.

What makes up a good skill set for finding the root cause of a
failure that is rare, intermittent or obscure?


** Analyse the actual failure first.

Something good service techs do every day, but few designers have a clue about.

Let's fire all those engineers and replace them with guitar repairmen.

** JL simply has nothing to say, but nevertheless insists on saying it over and over.

Wot a pathetic piece of shit he is.



..... Phil
 
Phil Hobbs <pcdhSpamMeSenseless@electrooptical.net> wrote in
news:ed8ce9f9-ae02-88c6-e069-2b34b17143ab@electrooptical.net:

On 2020-04-01 08:43, blocher@columbus.rr.com wrote:
On Wednesday, April 1, 2020 at 8:40:03 AM UTC-4,
blo...@columbus.rr.com wrote:
On Wednesday, April 1, 2020 at 7:54:09 AM UTC-4, David Brown
wrote:
On 01/04/2020 11:17, DecadentLinuxUserNumeroUno@decadence.org
wrote:

I have a trusted engineer friend who once said that most failures
occur at power up or power down. He always left his computers
at work and at home up all the time.

He is right.

The toughest issue I had to find was a power up issue. It
turned out that the memory part manufacturer had a bug in their
handshake codes at power up and occasionally it threw a bad code
which then set the DSP into a wrong clock speed which then
resulted in the NVRAM getting corrupted... the unit bricked
(although recoverable at the factory with a complete reprogram).
There was a cryptic note in the data sheet; when we finally
realized that the note seemed to rhyme with our problem, we
contacted the manufacturer. They then gave us the
complete story, which was that all date codes prior to a
particular time were susceptible to the problem and date codes
after were fixed.

I would have loved to hear the debate about how to put that note
in the data sheet. Frankly, they knew that if they were totally
candid, then the part was not valid so they wanted to mask it,
but, I guess, some engineer was screaming about how bad this
was and they agreed to the cryptic note.

As another aside, this was kind of a good one for us because our
customer was mad that they had bricked units in their airplane
but when we presented them the problem, it was not our fault and
we had been tenacious in finding the problem. And nobody looks
bad for designing the thing wrong.

Also, there was one obscure LED on the board that gave an
indication that the boot load had finished. Had that LED not
been on the board, I do not think we would have ever found the
problem. Normally at power up the LED turned on then turned off
when everything finished initializing. In this case the LED stuck
on, so we knew it was a power on issue. Still a real bugger to
find.

Having the appropriate number of blinky LEDs is key. Sometimes
when I run short of pins, I'll have the housekeeping loop output a
state code from a UART. That's super helpful in keeping track of
state machines and so on.
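
For anyone who hasn't done the UART trick, here is roughly what I mean, as a host-runnable Python sketch rather than real firmware. The state names, the one-ASCII-digit-per-state encoding, and the `uart_write` stand-in are all made up for illustration; on a real target the same logic would sit in the housekeeping loop and write to the actual UART.

```python
from enum import IntEnum


class State(IntEnum):
    # Hypothetical states; a real design would list its own.
    BOOT = 0
    INIT = 1
    RUN = 2
    FAULT = 3


class StateReporter:
    """Send one byte to the UART each time the state changes."""

    def __init__(self, uart_write):
        # uart_write is an assumed callable that accepts a bytes object.
        self._uart_write = uart_write
        self._last = None

    def poll(self, state):
        # Called once per pass of the housekeeping loop.
        # Only emit on a change, so the serial line is not flooded.
        if state != self._last:
            self._uart_write(bytes([ord("0") + int(state)]))
            self._last = state


# Stand-in for the UART: collect the bytes in a buffer.
log = bytearray()
reporter = StateReporter(log.extend)

# Simulated housekeeping passes, including repeated states.
for s in [State.BOOT, State.BOOT, State.INIT, State.RUN, State.RUN, State.FAULT]:
    reporter.poll(s)

print(log.decode())  # prints "0123"
```

Emitting only on a change keeps the trace short enough to read on a terminal, and the last character received tells you which state the thing was in when it hung.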


I keep my PCs on all the time. Even a Windows machine can run
for months without a restart if treated with due care and
kindness. But it's not just about risk of failure - I usually
have so many projects open at a time on different workspaces
(on the Linux systems) that it is a big effort and waste of time
to restart the thing.

I sometimes do that too, but only when the project is under
version control, which most are. (Github/Gitlab private repos are
good for projects where nobody else would know what it is. Not so
much for the crown jewels.)

Cheers

Phil Hobbs

I had a 'next step' 286 PC way back then. It had a really cool BIOS
and an LCD display on the front of the case that showed the BIOS POST
progress at each step. Once it was booted up, it showed hard drive
cylinder and sector access numbers. Like that could tell one
something then. Oh My, the 32MB drive just failed and I noticed the
track it was on when it happened. Yeah, sure... that would have been
useful to know. For those drive recovery guys. Even then what it
reads at the moment of a crash may not coincide with where the
platter failure was anyway. So I saw no use for that part, though it
was cool to see where it was hitting the drive at.

They have LED touch panels on printers. I figured that
motherboard makers would have status/setup panels by now.

Hey, there is the new standard. Was "ATX". Now it could be
"MATX" for "Monitored ATX", so the case makers could make provisions
for the panels.
 
On Wednesday, April 1, 2020 at 9:51:35 AM UTC-7, Phil Hobbs wrote:
On 2020-04-01 05:17, DecadentLinuxUserNumeroUno@decadence.org wrote:

Old net and system admin guys usually like keeping systems up and
running at all times too

Yup. At IBM Watson we used to shut the whole place down over Labor Day
weekend. It always took a couple of days to get the silicon fab line
back up, because things like corroded connections and worn-out motors
tend to fail at inrush.

But, replacing corroded connections and worn-out motors in threes after
startup might involve less down-time than getting the fab line
shut down three times at unscheduled times.
 
whit3rd <whit3rd@gmail.com> wrote in
news:641b0a17-a8ff-4dd1-883d-fa81aa3b40fa@googlegroups.com:

On Wednesday, April 1, 2020 at 9:51:35 AM UTC-7, Phil Hobbs wrote:
On 2020-04-01 05:17, DecadentLinuxUserNumeroUno@decadence.org
wrote:

Old net and system admin guys usually like keeping systems up and
running at all times too

Yup. At IBM Watson we used to shut the whole place down over
Labor Day weekend. It always took a couple of days to get the
silicon fab line back up, because things like corroded
connections and worn-out motors tend to fail at inrush.

But, replacing corroded connections and worn-out motors in threes
after startup might involve less down-time than getting the fab
line shut down three times at unscheduled times.

The whole fab family?
 
On 2020-04-02 01:29, whit3rd wrote:
On Wednesday, April 1, 2020 at 9:51:35 AM UTC-7, Phil Hobbs wrote:
On 2020-04-01 05:17, DecadentLinuxUserNumeroUno@decadence.org wrote:

Old net and system admin guys usually like keeping systems up and
running at all times too

Yup. At IBM Watson we used to shut the whole place down over Labor Day
weekend. It always took a couple of days to get the silicon fab line
back up, because things like corroded connections and worn-out motors
tend to fail at inrush.

But, replacing corroded connections and worn-out motors in threes after
startup might involve less down-time than getting the fab line
shut down three times at unscheduled times.

Might well be. They kept doing it, anyway.

Cheers

Phil Hobbs

--
Dr Philip C D Hobbs
Principal Consultant
ElectroOptical Innovations LLC / Hobbs ElectroOptics
Optics, Electro-optics, Photonics, Analog Electronics
Briarcliff Manor NY 10510

http://electrooptical.net
http://hobbs-eo.com
 
