Field Guide to Human Error Notes
Tagged: learning readings education essay
One of my responsibilities at my job is to make distributed software systems more reliable. The modern Internet is critical to people's everyday lives: it orders our food, tells us where to drive, and connects us with our clients, coworkers, and families. In each of these cases, even a little downtime is enough to make a user consider switching to a competitor that can guarantee an extra nine of availability. The people who design airplanes, factories, and power plants have been researching this "reliability" stuff for decades. Why not see what they've come up with?
Sidney Dekker's "The Field Guide to Human Error" is the most accessible book I've found on this topic. While it focuses on how to increase safety in industrial projects, the knowledge applies just as well to software reliability. Dekker introduces a way of thinking about accidents, or in our case, about service outages, that results in a progressive refining of the service into a more reliable state. The book not only offers a high-level, abstract model, but also a set of concrete practices that can be applied to software projects.
I hope the following notes will make a case for getting and reading the book.
Safety Can Be Created
Imagine that safety is this fuzzy, aether-like material that imbues systems around us. Every action you take either adds or takes away from that system's ability to withstand pressure and shock. It exists on top of all the clever design choices that are supposed to make the system safe.
Take, for example, backpacking. Packing extra water or spare batteries allows me, a bipedal hiking system, to successfully endure a wider range of possible hiking conditions. Forgetting to take a map or pacing myself too aggressively decreases that range and increases the probability of failure: getting lost, prematurely tired, or dehydrated. The extra safety is paid for by more weight that I need to carry. Which items will give me the most safety? In what conditions? It took me a bunch of hikes to gain the experience to answer those questions. Early on, I lugged around a humorously large amount of weight, once including even a double D-cell flashlight on a day hike. With time, I began to grow an understanding of which type of safety each item provides under what conditions, resulting in safer, more comfortable excursions.
This experience also hints that safety is an iterative process. You can't just get it right from the start. As you build and operate a system, you gain a better understanding of the problem domain, which allows you to revisit earlier design decisions and make improvements. It's why a meta-process like blameless post-mortems is so effective: it enables a team to adapt and correct course.
Old View vs. New View
The Old View is the classic "whodunit" approach to accidents. It assumes that systems are safe by design and failure only happens when operators make a mistake. It's all about asking "who?" Who forgot what? Who flipped the switch? Who made this commit? Who lost situational awareness? It's intuitive, fast, and very satisfying. If only we get rid of the bad apples, our system will stop failing, right? The aim is to produce two things: guilt and punishment. But aren't we looking to produce safety?
The New View is about searching for the "what?" What component failed? What disrupted the supply? What interfered with the alarm? What allowed this commit to deploy to production?
The system is seen as a complex organism made of many moving parts, generating and processing many events, often simultaneously, and with some amount of illegibility. It is not safe by default because even its designers can't predict the combinatorial explosion of possible interactions between events. To overcome this, we need human operators who can continuously create more safety in the system. Investigations focus on learning the state of the system before, during, and after an incident, and figuring out what pushed it from a working to a broken state. Instead of producing guilt and punishment, the game is about producing safety by coming up with concrete recommendations for improvements. Blameless post-mortems are New View, and they are great at building the trust and curiosity that improve the whole organization. I'm glad to see them gaining traction within our industry.
Root cause analysis (RCA) is a well-known practice, but beware: focusing on finding a single root cause oversimplifies the problem, because its goal is to find that one special defect. It locks us onto one node among the many responsible for the failure, and so yields a smaller number of improvements.
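To make the contrast concrete, here's a minimal sketch of a post-mortem record that captures multiple contributing factors instead of a single root cause. All the names and incident details are hypothetical, invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class ContributingFactor:
    description: str
    recommendation: str  # a concrete improvement, not a person to blame

@dataclass
class PostMortem:
    summary: str
    factors: list = field(default_factory=list)

    def add_factor(self, description: str, recommendation: str) -> None:
        self.factors.append(ContributingFactor(description, recommendation))

    def recommendations(self) -> list:
        return [f.recommendation for f in self.factors]

pm = PostMortem("Checkout service outage (hypothetical)")
pm.add_factor("Deploy pipeline lacked a canary stage",
              "Add a canary stage that gates full rollout")
pm.add_factor("Error-rate alert threshold was too high to fire",
              "Lower the threshold and alert on rate of change")
pm.add_factor("On-call runbook had no rollback instructions",
              "Document a one-command rollback in the runbook")

# A single "root cause" would yield one fix; enumerating
# contributing factors yields several concrete improvements.
print(len(pm.recommendations()))
```

The structural point is in the data model: a list of factors, each paired with a recommendation, leaves no single node to pin the failure on.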
Narrow and Broad End of the Tunnel
Dekker uses a metaphor to illustrate how human operators behave during a system failure: a tunnel that's broad at the beginning but narrow at the end.
When a system is beginning to fail, the operators are standing at the broad end of the tunnel. They have a lot of data coming at them and many options to choose from. Each choice they make slowly narrows their future options, until they arrive at the narrow end of the tunnel, which symbolizes a broken system and very limited options.
When investigating accidents, our built-in hindsight bias kicks in and pushes us to construct a chain-of-events model that traces failure in a straight line, from the narrow end to the broad end, from the present to the past. The result is that the operators appear to make the wrong choice every step of the way despite the correct actions being obvious, which in turn pushes us towards punishing them.
What if, during the investigation, instead of going backward in time, we go forward instead, and experience the uncertainty of the operators at key moments? Imagine being there, faced with an anomaly in the data you're getting, with multiple decisions looming ahead and time running out: do you spend more time investigating, or do you take the most likely corrective measure? Is the metric you're looking at faulty, or is the component it's tracking breaking? Does this fall under your job description, or should you alert someone else? Add to that the many, often conflicting, constraints communicated to you from above: should you get the job done quickly? Or safely? Or efficiently?
There are many advantages to viewing the world using this metaphor. It produces more actionable recommendations: should the operators be trained for handling this type of failure? Did they have access to the right metrics? Did they have the right tools to handle the accident effectively? Did they get the right support from management to make the right decision as quickly as possible? It also injects a healthy dose of empathy and respect for the operators. No longer are they bumbling idiots not doing their job right, but decent people operating in an uncertain world, trying to do the best job they can.
Leading up and Down
Dekker emphasizes the need for information to flow smoothly up and down in an organization. The front-line employees get the most detailed and up-to-date information, because they're exposed to the raw quirks and rhythms of the system. Through this, they adapt the operating procedures for more efficiency through little shortcuts. The farther you go from here, the fewer details you see. Team leads, supervisors, managers, department heads, and directors get an increasingly distilled and abstract view of the system. This allows them to see the bigger picture and tackle problems that are over the horizon.
If information doesn't flow up, then the people responsible for long-term goals can't make the right decisions. Faulty processes that lead to accidents won't get changed, and resources won't get allocated for improving safety. It's too easy to look at the high-level picture and fool yourself into believing that everything is working fine. It's important to push responsibility down in an organization, so that the people facing the problems have the resources they need to make whatever fixes are necessary. This is the only sustainable way not to drown in an ocean of recurring problems. To make this effective, information must also flow downward. By communicating goals clearly, leaders make everyday life easier for the front lines by helping them decide how to prioritize work.
A healthy feedback loop allows the organization to evolve and adapt to changing conditions.
Drifting into Failure
Safety and reliability are hard to reason about. They're usually expressed as mushy probability distributions, which makes it easy for humans to trade some of that fuzzy stuff for concrete wins. For example, by omitting some testing, we can deliver a feature sooner and get praise from our boss or positive reviews from users. But exactly how much safety did that decision cost? Would other people see this trade-off in the same way? Are your coworkers engaging in similar trades that you don't know about?
When you decrease reliability, you usually don't get hard and immediate feedback. Skipping a unit test doesn't normally cause everything to burst into flames. One day, perhaps months later, someone will execute that code path with a different set of variables and then everything will explode. Because this process is slow and almost invisible, Dekker calls it drifting toward failure. I first discovered this idea as "normalization of deviance" in Dan Luu's excellent post.
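Here's a toy illustration of that latent-defect dynamic. The pricing rule and its numbers are entirely made up; the point is that the untested branch works for every input seen so far, and only explodes months later when a new kind of input arrives:

```python
def shipping_cost(weight_kg: float, express: bool = False) -> float:
    """Hypothetical pricing rule; the express branch was never tested."""
    base = 5.0 + 1.2 * weight_kg
    if express:
        # Latent bug: the surcharge table has no entry for heavy
        # parcels, so the lookup raises KeyError for weights >= 20 kg.
        surcharge = {"light": 3.0, "medium": 6.0}
        band = "light" if weight_kg < 5 else "medium" if weight_kg < 20 else "heavy"
        return base + surcharge[band]
    return base

# All traffic so far takes the standard path, so nothing burns:
print(shipping_cost(3.0))

# Months later, someone ships a heavy express parcel...
# shipping_cost(25.0, express=True)  # raises KeyError: 'heavy'
```

The safety margin was spent the day the express branch shipped untested; the bill just arrived much later.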
It's scary. It's hard to think about and even harder to express. Yet it's there, crawling unseen between lines of code, depleting the safety margins, until there's no more reliability to borrow from and suddenly the whole thing collapses. Dekker and Dan Luu agree that the only way to counteract this is to force yourself into a perpetual state of unease and vigilance. More concretely:
- Pay attention to weak signals (newcomers may notice the drift more easily, you should listen to them!)
- Resist the urge to be unreasonably optimistic (just because the thing hasn't blown up before with the safety off doesn't mean it won't blow up this time).
- Realize that oversight and monitoring are never-ending (constant, quiet paranoia!) and invest time and effort accordingly, even if a new feature looks oh so promising.
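One way to operationalize "pay attention to weak signals" is to compare a slow-moving average of some health metric against a healthy baseline, so that gradual creep trips an alert long before any single dramatic jump would. A minimal sketch, with the metric, window, and tolerance all invented for illustration:

```python
def drifting(samples: list, baseline: float,
             window: int = 7, tolerance: float = 1.5) -> bool:
    """Flag when the recent rolling mean exceeds baseline * tolerance.

    samples: chronological measurements (e.g. daily error rate, %).
    baseline: what "normal" looked like when the system was healthy.
    """
    if len(samples) < window:
        return False
    recent = samples[-window:]
    return sum(recent) / window > baseline * tolerance

# The error rate creeps up a little each day -- no single alarming
# jump, but the rolling mean eventually crosses the line.
history = [0.5, 0.5, 0.6, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2]
print(drifting(history, baseline=0.5))
```

The key design choice is anchoring against a fixed healthy baseline rather than against last week's numbers: comparing only to the recent past is exactly how the drift normalizes itself.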
It's a hard pill to swallow on both the individual and the organizational level. For the individual, it requires working against your instincts and living with a constant feeling of uncertainty. For the organization, it requires developing a culture of trust and humility, as well as consciously balancing and rebalancing the long view against short-term wins.