SRE is a Way of Seeing, Not a Checklist
Explaining SRE is hard. It doesn't matter if the other person is a junior or senior engineer, or a product or sales person--it's always a struggle for me. Sometimes I describe its function within the business ("customers are willing to pay for a more reliable service"). Other times I resort to a quick tour of my daily responsibilities ("discussing and setting up SLAs, ensuring post-mortems are productive..."). But it always feels like I'm skating on the surface. It leaves me feeling like I've failed to share something wonderful.
I've witnessed again and again how SRE delivers measurable improvements that mean faster and more dependable systems, much to the delight of users. Because of that, the industry ought to have rapidly adopted these methods. I expected a surge of performance/reliability improvements across online services. But adoption has been slow and many services continue to be a pain to use.
Example: When I tried to purchase baggage allowance for an upcoming flight, the website displayed an error message--then asked me to copy the message and email it to their IT support. (This was just a few weeks ago, in 2022 D:).
I lived with this confusion until one day it hit me: SRE isn't just a grab-bag of practices, something you can turn into a checklist, but a different way of seeing computer systems.
The best way I can describe it is through John Boyd's OODA loop concept:
When I first found out about OODA, I understood it as a series of steps you execute, and the faster you do it, the better results you can expect. But with time, my understanding shifted to how Venkatesh Rao describes it: "[a] mindfulness aid to keep your decision-making creatively and imaginatively aware and attuned to the environment (...)."
Only then did I understand what Boyd meant when he said that "Orientation" is the most important part of OODA--when you look at the diagram, notice how orientation influences every other part of the loop.
SRE is all about orientation. When an organization implements SRE--SLOs/SLAs, post-mortems, and all the other fun stuff--it begins to see and understand its software differently than before. It's as if it unlocked a new set of knobs for dialing in a desired level of quality. This makes it easier to decide trade-offs between time, cost, and quality, which leads to better results.
But changing how someone orients, how they see the world, is hard. It's like explaining your culture to someone unfamiliar with it. At the beginning, all they can see are external things like weird food or funny clothing. But they don't get WHY the food and clothing are the way they are--they can't see all the implicit connections between environment, actions, etc.
Now, this isn't just an academic exercise. Thinking of SRE this way makes it easier to explain and guide its adoption, for three reasons.
First, it sets the right expectations. Because it's a new and unfamiliar way of seeing for most people, you can bet that getting SRE set up will take serious time and effort, so you can prepare both technically and emotionally.
Second, it clarifies your role as teacher and ambassador. You're not just building dashboards and automation; you're educating your peers about a new way of running software systems. So everything you do, every incident you tackle, every improvement you make serves as an example that helps others see things in a new light.
Third, it shifts your mindset to that of a gardener. What you're doing is tending to how an organization works, so you have to judge and adjust what you're doing every single day. This stands in contrast to seeing SRE as merely a grab-bag of practices that can be distilled into a simple checklist. If you tried the checklist approach with a garden, most of your plants would die.
Finally, consider the paradox in all this: SRE aims to make software systems more legible, yet it itself is so illegible. I guess that would explain why different organizations get such different results.