SRE Toolbox: Investigations
Tagged: sre work
Imagine you're on call and you get paged. Some service is broken and it doesn't look like it's a quick fix. Over the next hour or two, you check each part of the system until you find the problem and carefully deploy a patch. Next morning, as you're sharing the story with your team, one of them interrupts you: "Oh, I had the same thing happen last month."
If only you had known about it, you could have paged them for help. Better yet, if you had known how they fixed it, you could have fixed it yourself. That would have saved the users some frustration, the company a couple of bad reviews, and you a few hours of sleep. If only you had known.
There's a practice that your team can use to avoid these situations. It will not only reduce your mean-time-to-recovery and help coordinate everyone in the case of a larger outage, but also help with sprint planning and team onboarding. The way it works is that each team member writes and shares notes about anything they investigate.
Three real-life examples
Once, after joining a new team, I was added to its on-call rotation. Being unfamiliar with the code, tools, and people, I felt nervous. Of course, on my first night, I got paged. The cause wasn't obvious, but the service was running out of resources. Before paging my backup, I searched through the team's notes for the alert name. Luckily, the first note matching my query described a similar problem in the same component. Reusing the troubleshooting steps from the note, I found what was wrong and fixed it.
Another time, after I had settled into the team, I was paged. The problem felt familiar, so I queried our notes and found an exact match. It turned out that I had written it myself some months earlier. I mitigated the issue quickly, then looked over more notes to see if others had seen this issue as well. A couple of notes came up, which meant that my team had been fixing the same problem over and over again, wasting time. That was enough reason to invest a few hours in fixing the root cause and free up time for whoever was on call.
Then there was one time when I was investigating an odd-looking availability chart. A few hours in, I hit a wall. The notes I had so far described why I thought it might be a problem and included some other charts. I decided to share them with my team, hoping someone could see something I couldn't. It turned out that one of my colleagues was an expert in the component I was investigating. My notes allowed him to skip some legwork, and we were soon making progress again.
Making it work
The idea is simple. Team members write notes about anything they investigate, the notes must be easy to search, and the practice must be as lightweight as possible.
Each note should:
- describe a problem and a solution if one is found.
- contain information like error messages, log snippets, source code, commands and their output, links to or screenshots of charts, etc.
- be written as soon as possible: put down a sentence or two just when you start to investigate, then add more as you discover new facts.
- be concise and loosely structured.
Here's an example note:
Database replication latency spike at 2am. Backlog chart <link>, latency lag chart <link>.
I sshed to 192.168.1.10. It appears the replica is experiencing timeouts: (<path to log file>): <relevant log lines>. Running "systemctl status replicator" shows frequent restarts (<command output>). Running dmesg doesn't show any OOMs.
Notice how little work and attention a note like this requires. It shouldn't distract an engineer from the work they're doing.
Where should all these notes go? I don't know of any software built specifically for this purpose, but I would go with Discourse. It has a powerful text editor and good search functionality, and its topic-comment structure lends itself well to investigations: each investigation is a topic, while comments present new facts and allow others to join in or ask questions. This interface also makes it easy for everyone, including those outside the team, to browse and follow problems.
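To make the topic-and-comment idea concrete, here is a minimal Python sketch. It is an illustration only and has nothing to do with Discourse's actual API or data model; the names Investigation, NoteStore, add_fact, and search are hypothetical, the author names are made up, and the "search" is a naive case-insensitive substring match kept in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Investigation:
    """One investigation: a title (the topic) plus facts added over time (the comments)."""
    title: str
    # Each fact is (timestamp, author, text). Hypothetical layout, for illustration only.
    facts: list = field(default_factory=list)

    def add_fact(self, author: str, text: str) -> None:
        """Append a new finding as soon as it is discovered."""
        self.facts.append((datetime.now(timezone.utc), author, text))

    def matches(self, query: str) -> bool:
        """Case-insensitive substring match against the title and every fact."""
        q = query.lower()
        if q in self.title.lower():
            return True
        return any(q in text.lower() for _, _, text in self.facts)


class NoteStore:
    """Hypothetical in-memory collection of investigations."""

    def __init__(self) -> None:
        self.investigations = []

    def open(self, title: str) -> Investigation:
        """Start a new investigation with just a title; facts come later."""
        inv = Investigation(title)
        self.investigations.append(inv)
        return inv

    def search(self, query: str) -> list:
        """Return every investigation whose title or facts mention the query."""
        return [inv for inv in self.investigations if inv.matches(query)]


store = NoteStore()
inv = store.open("Database replication latency spike at 2am")
inv.add_fact("alice", 'Running "systemctl status replicator" shows frequent restarts')
inv.add_fact("bob", "Running dmesg doesn't show any OOMs")

for hit in store.search("replicator"):
    print(hit.title, "-", len(hit.facts), "facts")
```

The point of the sketch is the shape of the data: a note starts as a single line, grows one fact at a time, and can be found later by any string that appeared in it, such as an alert name or a command.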
Finally, it's good to think of these notes as similar to blameless post-mortems, so that engineers feel safe to investigate problems and put their findings into writing. Sometimes an investigation will lead nowhere, but that's OK. It's better to waste a little bit of time and catch more bugs than to save a few minutes and miss a problem that snowballs into an outage. To help create this sense of vigilance and trust, managers should only read notes to gauge the health of the system or the morale of the team. After all, the whole purpose of this practice is to help engineers run a system smoothly.