Dorian Shainin and the Red X
“Talk to the parts; they are smarter than the engineers.” – Dorian Shainin
The HP 9825A project attempted to do something very risky: take several very new technologies, cram them into one computer, and then sell the computer as a reliable HP product. Failure or unreliable operation of any one of those component technologies would result in a lemon, one that Bill and Dave would surely frown upon because their names would be printed on the computer.
Here’s Don Morris’ story of how the HP 9825A got reliability:
“I spent the middle eight years of my career as an HP Quality Manager (most of it also as Group QA Manager for the semiconductor group). I was on the original advisory board that set up the US Baldrige Quality Award and I was a Baldrige judge the first two years of the program. That was how I met the president of the United States in the White House. Only [HP President and CEO] John Young got to talk to the prez... children are to stand and be seen and not heard. But before all that, I got to meet and work with one of the world’s foremost quality gurus. At the time, I was neither interested in quality, per se, nor knowledgeable about Dorian Shainin. But somehow we got connected. I can't remember how, but I was really, really concerned about the field reliability in the HP 9825. I know it seems so passé now, but I was a religious convert to the idea that HP built only good stuff! So I hired Dorian Shainin to come out and give the 9825 development team a couple of days training and to provide me with some consulting. We were developing a boatload of new technologies for the HP 9825:
- NMOS II, a new IC fabrication process
- A new 16-bit hybrid microprocessor with a new interconnect technology
- A new tape drive (the DC100 cartridge drive)
- A never-used-before thermal print head
- The first foam-molded case for an HP product
- A new operating system, system software, and an interpreted language with all-new features
There was just way too much new and untested stuff for me to be comfortable with. I knew we were heading for trouble. So Dorian came out and gave his normal quality spiel. Just for calibration, Dorian was the reliability consultant that handled the lunar lander module (LM) for the US Apollo program. His QA methods were what we used in the testing of the '25.
In the evening, I would take him to dinner at the Black Steer [a popular watering hole in Loveland, Colorado]. I don't remember what Dorian was drinking but I was drinking “Jack and Coke.” Dorian told me that he thought he was the best reliability consultant for the program, but that he knew with 100% certainty that the reason he got the NASA contract was because he was the low cost bidder! How? He only required three insanely expensive test modules to achieve 99.999% reliability. I won't go into his test methods, but they were evidently adequate for both NASA and HP. As he would get tanked up after dinner, Dorian became ever more valuable as a consultant because he’d tell me more.
One funny story: when he finished with NASA, they wanted him to certify that the LMs were five-nines reliable! Talk about managers wanting to cover their asses! It was at that point Dorian brought out the fine print they had signed:
1. It’s impossible to determine reliability chances for an unknown environment.
2. That reliability level required 1000s and 1000s of testing hours.
3. Dorian’s contract stated, in the fine print of course, that he only certified that the devices would be better when he was done than when he began.
Nuggets of wisdom from the gentlemen in these dark caverns of the Steer:
1. New failure modes arising from exotic and untried technologies will always get you. If you find one of them, then say “Thank you” and nail it. Don't search for new failure modes because it isn't worth it. Let them find you.
2. Even with new and exotic technologies and designs, there will still be more failures from stupid, obvious, and preventable flaws than from the exotic causes. Therefore, spend the bulk of the effort eliminating the stupid mistakes. By expecting exotic and unexpected failures, you will design for better survivability when a failure does occur.
Shainin’s approach was to find weaknesses and to eliminate them if possible through component upgrades or improved design. For instance, on high/low-temperature operations, the data-sheet specs had to be met with design margin, but Dorian wanted to test to failure by going outside of the margins. When the first component failed, it was “the weakest link” and we did something to improve its design margin. Many individual components were changed and given additional margin with this technique. In several places, the parts costs were higher than other options, but I have no regrets. More expensive parts may have saved some critical application from failing at a bad time. That was HP’s reputation.
We didn’t always fix problems with wider-margined part specs. Sometimes, we added circuitry. For example, the thermal print head on the HP 9825 was also used in the HP 9815 desktop calculator. It runs slower on the HP 9815 because it runs open loop. The HP 9815 operates the printer more slowly, at a lower current, to prevent thermal damage. Because we had a higher parts budget on the HP 9825, we added a feedback thermistor to the print-head assembly, which gave us margin for both hot and cold operation.
Shainin’s ideas also helped us with mechanical problems. Low-temperature mechanical shock testing was breaking things that should not break. For example, we needed a vibration table to find a resonant frequency of the cooling fan assembly. The plastic fan support was physically contacting a memory pc board and we thought that was impossible.
We’d made a major, late change to the fan. The prototype HP 9825 units were passing our QA tests and Henry Kohoutek was happy, but then he showed me the early HP 9825 failure data and the mature field-failure rates for earlier desktop calculators. Henry’s data showed higher failure rates for components designed into the ‘25 than the same components in the HP 9810. Same components; worse failure rate; something’s amiss. Now that was irritating.
Henry explained the elevated failure rate. The higher internal temperatures in the machine accelerated component failure rates. The HP 9825 mechanical design was not amenable to directed internal airflow. The HP 9825 was a smaller machine and was more tightly packed with electronics than the previous generation of desktop calculators. Cool air had a harder time making its way through the HP 9825.
We were using a cheap fan with a shaded-pole motor and had about 20 units running in an aluminum frame on one of our cubes for a couple of years. They seemed OK. So even though the cheap fan was new to HP, we were happy with the fan reliability based on the fan’s performance in the 20 test units. In fact, the fan’s reliability was turning out to project better than the fan reliability in the older machines. That was the fan’s reliability, not the reliability of the components being cooled by the fan. We took the biggest air mover we had in stock, stuck it into a prototype machine, and it immediately lowered the internal, worst-case temperatures by about 10 degrees C, as I recall. Lower temperatures everywhere! The bigger, noisier fan itself had a higher failure rate, but the lower internal temperatures would cut the failure rates of the other components by a hefty amount. Sure, the better fan was noisy. But in our work cubes, we lived with constant paging, so I figured “No big deal.” The projected savings in warranty (just in warranty costs) per machine would be much larger than the incremental cost of the fan.
So we did it. We changed out the fan even though it would drive up the selling price because the warranty cost was simply a fixed percentage in the pricing formula. So I figured we’d save HP money and give the customer a better product.
Consequently, the plastic fan support for the new fan was a rush job and it required retesting. Now the prototypes were failing the drop test; components were breaking on the board just above the motherboard below the stacked memory boards. The broken components could only have been damaged by being hit from a memory board above, or so we thought. We couldn't fix it, or so we thought.
For the Cost of a Fan
Now about cost. At the costing meeting there was a range of possible prices. I had watched my earlier calculator project, the 9805, languish with poor sales and did not want that again. I thought the HP 9805 didn’t sell because of a high price, so my chosen price point for the HP 9825 was a low one. Our marketing manager had come up with a monstrously high number. The pricing meeting dragged on and on... R&D, Marketing, and General Manager Tom Kelly. After a while Tom got up and said he had no more time and we could continue if we wanted, but he had already decided what the price should be. “Well, tell us,” we said. Even marketing was flabbergasted that Tom would kill the product by picking a price that was either $1K or $2K above the highest we had even considered. After Bob Watson took over Tom's job (Tom went to open the new HP Fort Collins facility), I told Bob that my respect for Tom's judgment had certainly improved when I saw that he was right. Bob said the same thing. The 9825 sold like hotcakes. It killed the HP 9830 dead in its tracks. We went into an immediate backlog situation with the HP 9825 sales and we stayed there for a year. A lower price? As they say, that would have left $$$ lying on the table. And that extra margin from the higher sales price allowed us to do whatever was needed to get the NMOS II process yields up and expand the HP 9825 product line. So that extra $20 in the fan was easily covered!
Steve Hobson and I put one of the prototype machines on a shake table and experimented. With just the right orientation and with a carefully chosen resonant frequency, everything in the machine became a single resonant structure. That big, heavy fan held by a cheap molded plastic part? Part of the resonant structure. At the right shaking frequency and with a strobe light to slowly walk through one of the resonant cycles, we saw that lower board bend down, contort itself, and then it poked the victim components up while the lower corner of the fan moved in and mowed them down. So this was the natural frequency at which parts would ring if you struck the assembly like a bell.
The HP 9825 used little white nylon spacers that were screwed to a lower board and that stuck up to capture the board above. This configuration allowed the board stack, joined by a multi-connector ribbon cable, to be folded upward on a hinge while the HP 9825 was operating. It was a troubleshooting feature. By moving a couple of these supports, we ruined the mechanical resonance. Things settled down on the shake table and the fan stopped chopping components up. It would hit the side (edge) of one of the pc boards, but that was all. Back into drop test and all was fine.
The main thing in Shainin’s ideas was to push a potential failure vector all the way to failure, not just to the specification + margin. I have never seen anything but confirmation of Dorian’s ideas. His multi-variant stress tests were the basis of the HP 9825 testing. The actual number of testing hours was not high. Like NASA, I too had only three precious proto units. We’d stress one until it failed. Then, while fixing the broken unit, we’d put the next unit in a different stress test (testing different possible failure vectors). All proto units got all the fixes as they broke and came out of the tests. So when we went into the two- and three-variable stress tests, we had all the fixes on the machines. As soon as the easy flaws and fixes were found, we took the un-fatigued unit, updated it with all the improvements, and repeated the same test sequence to verify that we had truly fixed the easy flaws. As on the lunar lander in the Apollo program, we forgot about the hard problems on the grounds that only the customers would find them no matter what we did. To mangle the words of Walter Cronkite, ‘... and that's the way it was.’ ”
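The test-to-failure idea in Morris’ account (push the stress past the spec limit until the first part fails, then treat that part as “the weakest link”) can be sketched in a few lines of Python. This is only an illustration of the reasoning, not a reconstruction of Shainin’s or HP’s actual procedure. The `Component` class, the `weakest_link` function, and every part name and temperature below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    fails_above_c: float  # hypothetical temperature at which the part stops working


def weakest_link(components, spec_limit_c, step_c=5.0, max_c=150.0):
    """Step the temperature upward from the spec limit until a part fails.

    Returns (component, temperature_at_failure), or (None, max_c) if
    nothing fails within the tested range.
    """
    temp = spec_limit_c
    while temp <= max_c:
        failed = [c for c in components if temp > c.fails_above_c]
        if failed:
            # Of the parts that have failed, the one with the lowest
            # threshold is the weakest link.
            return min(failed, key=lambda c: c.fails_above_c), temp
        temp += step_c
    return None, max_c


# Hypothetical parts and thresholds, invented for this example only.
parts = [
    Component("print head", 72.0),
    Component("tape drive motor", 90.0),
    Component("memory board", 110.0),
]
part, temp = weakest_link(parts, spec_limit_c=55.0)
print(part.name, temp)
```

In the spirit of the story, the next step would be to raise the failing part’s margin (a better component or an improved design) and run the same sweep again to expose the next-weakest part.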
The Big Red X
Dorian Shainin recognized that the Pareto principle could be applied to solving design and manufacturing flaws caused by manufacturing variation. Shainin concluded that, amongst the thousands of variables that might cause a particular failure mode, one cause-and-effect relationship had to be stronger than the others. Shainin called this primary cause “The Big Red X.”
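The Pareto reasoning behind the idea can be shown with a small Python sketch. It is plain Pareto ranking, not Shainin’s own diagnostic technique, which the text here does not describe. The `rank_causes` function, the cause names, and the contribution figures are all hypothetical.

```python
def rank_causes(contributions):
    """Sort causes by their share of the total observed variation, largest first."""
    total = sum(contributions.values())
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, value / total) for name, value in ranked]


# Hypothetical contributions to variation, invented for this example only.
observed = {
    "solder temperature": 2.0,
    "operator": 1.0,
    "fixture wear": 16.0,
    "humidity": 1.0,
}
ranking = rank_causes(observed)
red_x, share = ranking[0]  # the dominant cause and its share of the total
print(red_x, share)
```

When one cause dominates like this, fixing it first removes most of the variation. When no cause dominates, the ranking comes out nearly flat and there is no single cause to chase.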
Don Morris’ story about his experiences now continues:
“When I moved from the Calculator Products R&D lab into Jerry Harmon’s IC department as the engineering manager, there were weekly meetings with corporate help and it was all about Dorian Shainin’s big red X. One of my major accomplishments while working at HP was realizing that there was no big red X in the IC department. The fabrication process was so out of control and there were so many process flaws to fix that all the possible improvements had to be prioritized by how long it would take to accomplish each fix, no matter its importance. So we’d do the one-hour projects first, then the four-hour projects, then the one-day projects, and so on. After a month or so of this troubleshooting, things began to settle down so we could do experiments to find the next big problem. Or rather, problems.”
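The ordering Morris describes amounts to sorting the backlog by estimated duration and ignoring importance. A minimal Python sketch follows, assuming each fix has a time estimate. The `Fix` class, the `shortest_first` function, and the example fixes are hypothetical and are not taken from HP’s records.

```python
from dataclasses import dataclass


@dataclass
class Fix:
    description: str
    hours: float     # estimated time to complete the fix
    importance: int  # recorded, but deliberately ignored by the ordering


def shortest_first(fixes):
    """Order fixes by estimated duration only, quickest first."""
    return sorted(fixes, key=lambda f: f.hours)


# Hypothetical backlog, invented for this example only.
backlog = [
    Fix("recalibrate furnace profile", 8.0, importance=9),
    Fix("replace worn wafer carrier", 1.0, importance=3),
    Fix("retrain mask alignment step", 4.0, importance=7),
]
for fix in shortest_first(backlog):
    print(f"{fix.hours:>4.1f} h  {fix.description}")
```

With many small flaws and no dominant one, this ordering clears the largest number of problems in the least time, which matches the “things began to settle down” outcome in the account above.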