If you work in a startup, especially one in hardware, especially one in aerospace, you’ve probably heard the words “failure is not an option” uttered more times than you can count. It’s the default rallying cry for a team at the end of a long day (or night), a series of long days and nights, or the end of the funding runway. It’s almost always delivered by some manager or other brave person who has either not contributed in any real way, or who seems to be acting on behalf of the problem itself.
Everyone thinks the quote comes from Gene Kranz, the steely-eyed NASA flight director, but it’s actually an invention of (and tag line for) the ’90s movie Apollo 13.
From Wikipedia:
In preparation for the movie, the script writers, Al Reinert and Bill Broyles, came down to Clear Lake to interview [FIDO Flight Controller Jerry Bostick] on "What are the people in Mission Control really like?" One of their questions was "Weren't there times when everybody, or at least a few people, just panicked?" My answer was "No, when bad things happened, we just calmly laid out all the options, and failure was not one of them." ... I immediately sensed that Bill Broyles wanted to leave and assumed that he was bored with the interview. Only months later did I learn that when they got in their car to leave, he started screaming, "That's it! That's the tag line for the whole movie, Failure is not an option."
Though Kranz did employ the phrase for the title of his memoir, it came out after the movie, so it’s fair to say his character, played by Ed Harris, inspired the choice. It’s a great line; I don’t blame him. I bet it moved a lot of books. This is important to note because 1) the line is not etched into the bedrock of aerospace history, it came from Hollywood, and 2) I see those words inspire lazy thinking that’s, ironically, killing our ability to succeed.
Let me explain.
Failure is *absolutely* an option.
First of all, failure is not only an option, but also the most likely option. It is the entropically favorable outcome, no less, which means it’s damn near the default state. Our job as design and test engineers is to play the Second Law to a draw: maybe increase a little entropy universally so that it stays low locally. Put another way, we must figure out the ways things will fail and then make hardware, code, and processes which obviate those modes.
Understanding the ways we can fail is step one in this process. Risk and hazard mitigation are a *huge* part of the design phase of engineering. It’s an entire section of my design review template [1] and the associated process, which digs deeper each iteration. Running FMEAs, quantifying hazards, and looking at failure modes are all part of understanding how a system works and creating redundancies to prevent mission loss.
Sometimes we need to fail
This column is about avoiding pithy turns of phrase, but at the risk of undermining myself (with a Silicon Valley truism, no less), it’s important to fail fast and fail forward. We’re making assumptions in design and doing our best to anticipate real-world conditions. It’s too big of a job to design something and stick it right into service. Component tests, system activations, and qualification testing are ways that we can find unplanned dynamics. I see a lot of tests get skipped because they’re “expensive” or too hard. I assure you it is more expensive and much harder to have your thing fail publicly instead.
Failure can also be psychologically important for a team. I remember Flight 3 of Falcon 9, which was the COTS demo for NASA. I had just joined the company, and the Dragon capsule was having some challenges executing commands. The mission represented a huge milestone, as it would be the first commercial vehicle to dock with the ISS. I went out for drinks with some of the seasoned vets during the time when it was unclear whether Dragon would be allowed to approach. They were worried, but they’d also each worked on Falcon 1, which was marked by constant failure. They spoke about how it was good that some of the green engineers (aka me and others starting at the same time) got to see that everything didn’t just magically work. Dragon recovered and docked, but I never forgot the lesson or the way it bonded us as we set out on the huge task of building the west coast launch site at Vandy, where failure stalked us constantly.
Planning is boring but essential
Failure is like a grizzly bear. We shouldn’t fear it, but we damn well better respect it. A good design process, a robust test program, and a process that includes all contingencies are crucial to success. We must avoid magical thinking such as “We can do it we’re XXX we always get it done!” That’s dangerous, and it demonstrates a lack of understanding of why a team succeeds. Things don’t “always get done” because some VP said so out loud; they get done because some dummy like me goes outside and works all night to make it so.
If you watch Apollo 13 (which you should, because it’s great) you’ll see the attitude of both the flight team in Houston and the astronauts on board isn’t actually “failure is not an option.” They watched the explosion. They are watching the oxygen drop on board. They all know what’s likely. But they keep making decisions that choose anything besides failure. They look at every piece, part, and process on hand and assemble them in ways that beat back the obvious outcome.
Failure is always an option; we must work constantly to make it the least likely one.
[1] I’ll post the design review template soon, but DM me if you want to see an early version.