The Thrill of SRE

Published: 2020-07-31
Tagged: essay sre

It's a beautiful June day here in the Pacific Northwest. The sun is shining and a cool breeze keeps me comfortable. Wandering through the rustling leaves, my mind asks itself, "Why am I into this whole SRE thing anyway?"

SRE involves a lot of annoying things. Alerts that wake you up at night. Long, mind-bending investigations (it's always the network). Awkward discussions aimed at persuading others to fix problems. There's also no clear scope of work or skill. I might be debugging TCP connection problems, writing a new service, and leading a post-mortem meeting, all in the same day. But there's something that keeps me coming back for more, something thrilling.

To understand why, I have to take you back in time to the year 2000. I'm 10, sitting in front of a small, glowing screen and playing with the computer. Yes, not "on it", but "with it." I'm learning DOS batch scripting, Turbo Pascal, and HTML. The last one brings me the most joy because it gets the machine to display images, colored text, and GIFs. Now, fast-forward to January 2013. It's my first day at an internship at a small web agency. My manager puts me to work on a Ruby on Rails/jQuery/AngularJS project and I'm giving it everything I've got. After some time, I discover that poking around servers and networks and writing backend code is way more fun than JavaScript and browsers. Fast-forward again, to April 2017, and I'm setting up Linux on my new company laptop. My official title now says "Site Reliability Engineer."

I still remember the systems interview for that role. It was a scenario taken from a past incident. The interviewer was the game master and I was a level 1 reliability wizard on call. He drew a schematic of the system on a whiteboard. Then I got an alert. As I'm investigating, the clean whiteboard fills up with arrows, squiggles, and names. A project manager messages me. Users are beginning to experience timeouts. The failure is spreading to other systems. We're both standing up now. "Can I ssh into that box? Can I access the logs? Ok, I grep for 'ERROR'. Nothing? Wait, are there separate error logs? I use 'find' to find them. Sweet, got them. What do they say? Does that IP belong to our subnet? Ok, ok..." In the end, I got a (very) rough fix in and the interviewer told me I passed. Riding the crowded subway back home, I couldn't help but think, "holy shit, that was fun."

Something magical happens when systems grow above a certain size. Complexity transforms a legible, deterministic structure into an organic blob. Network latencies, timeouts, and the number of components cause systems to stop behaving in a predictable way. Look at it this way: a simple program fits in your head, but you explore and discover a system. How the tables have turned. Developers are constantly changing code and moving pieces around, so there's never a chance to feel too comfortable. Complexity, that mystical force, guarantees that when something breaks, it'll be damn interesting.

It's difficult to describe the satisfaction of chasing down a hard problem through a system's many layers. I envision it like a film noir classic: the SRE, a hardboiled detective of course, poring over clues and interrogating faulty components. The clock is ticking and the bad guys, the bugs, are always a step ahead because they don't have to play by the rules. Maybe someone deployed an innocent-looking change? Or changed a DNS record? What if... what if an employee left the company and that event broke an ownership chain? Suddenly, the clues all connect! _Of course, it's something completely unintuitive!_

A movie might end there, but an SRE has to ensure the problem doesn't happen again. Solving the same problem over and over again is a complete waste of time. Most investigations reveal that someone did something they shouldn't have done. It's tempting to write it off as human error. But there has to be a reason why an engineer chose that specific action. It must have seemed safe and good. Why? Were they too tired to notice the risk? Did the system give them a false sense of security? Did something obscure their view of the consequences? This is where the real fixing happens. The solution could be a new tool, like a linter, or a redesigned piece of the UI that gives the engineer more information, or maybe even a talk with the team lead about giving reliability-related work more priority. It's really satisfying to fix the problem that generates problems.

My day-to-day work brings back those childhood memories of playing with a computer. One day, I would make an HTML page about Diablo II, the next day a map for a game, then a script to prank a friend. As an SRE, I'm afforded the same broad playground: security, networks, OS internals, development, CI/CD, teaching, and more. My colleagues reflect a similar diversity. Some come from traditional sysadmin roles, others used to be backend engineers, a few worked with mainframes, and there are always one or two who migrated from the information security world. The SRE subculture, with its focus on failure, seems cynical at first, but underneath that surface is a strong optimism that things can be better. That data won't leak and crucial systems won't go offline. We're nowhere near that world, but we have the tools and ideas to get us there.

I won't sugarcoat it. Some days it's rough. The alerts, on call, and everything always breaking get to me. The future looks bleak then. Every step, no matter how careful, can be an ambush, a trapdoor, a banana peel. Why can't things be simple?

Well, as in life, so in virtual reality. Our world is insanely complex and forever changing. If, at any point in time, there's no uncertainty or risk, it means I'm dead. Through that lens then, I look at the gnarly knot of problems and think "this is some pretty cool shit right here." Twelve-year-old me would dive into it without hesitation. Adult me sees a little farther beyond: a lot of software sucks and it needs to suck less because it's a powerful tool that we can use against Molech (if you haven't read this before, this will be the best thing you've read this year).

Many challenges in SRE can be solved with code, but the most important ones have to do with people. How do we improve our understanding of problems? How can we build better classes of tools? How do we teach others to be better at problem-solving than we are? Those are all blank spots on our maps and exploring them is one hell of a fun game.

