Observability: Getting from B to C
I was asked to elaborate on the tweet below, and I have been meaning to blog more, so I figured I would flesh out the thought here.
I don’t think what I tweeted is a particularly novel or interesting idea, but what I am getting at is the following: there’s a rough spectrum of Monitoring (not Observability, which has its own spectrum) that goes from Fly Blind Until You Get In Trouble (Point A) on the left, to Page On All The Things (Point B), to Page On The Critical Paths (Point C), all the way at the Zeno’s Paradox unreachable right. Most well-intentioned and well-led teams I have interacted with seem to get stuck at Point B.
The culture I have seen develop at Point B is that every page is a crisis, and every minor issue that isn’t immediately surfaced gets treated as something that should be paged on. You’re basically overreacting constantly, on purpose. It’s usually because someone in your org saw the positive impact that improved paging had on downtime and immediately succumbed to Hammer and Nail Syndrome. Part of this is because our current Monitoring tools (again, not Observability, but Monitoring specifically) encourage overreaction and offer little granularity about what a signal means.
In my opinion the sustainable Monitoring stance is at Point C, not because it’s evolved or accurate or what-have-you, but because it privileges not burning out your team. The problem is that unless you are incredibly good at keeping track of the vibe in your org, it is very hard to get to Point C, because you have no tools to push the team into a better place.
Point C looks like the following: The team identifies a critical path, and that critical path has an SLO. The team then puts in place monitoring on their SLIs to let them know when a risk to that SLO arises, and tracks over time how often they miss their SLO. Critically, if they miss their SLO enough times that they burn through their error budget, a contract exists between the team and their stakeholders that allows the team to thoughtfully address the causes of these errors. SLOs and SLAs cannot sustainably exist in the absence of this social contract and the error budgets that support it. When these three components go hand in hand it can create a wonderful virtuous cycle that slowly tumbles the team towards calmer on-call cycles.
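To make the relationship between an SLO and its error budget concrete, here is a minimal sketch in Python. It assumes a request-based SLI (good events over total events in some window). The names `SLO` and `error_budget_remaining` are made up for illustration and do not come from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical SLO on a critical path, e.g. 'checkout succeeds'."""
    name: str
    target: float  # 0.99 means 99% of events in the window should be good


def error_budget_remaining(slo: SLO, good_events: int, total_events: int) -> float:
    """Return the fraction of the error budget left in the window.

    1.0 means nothing has been spent, 0.0 means the budget is exactly
    used up, and a negative value means the SLO has been missed.
    """
    if total_events == 0:
        return 1.0  # no traffic, nothing spent
    allowed_bad = (1.0 - slo.target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% target leaves no budget at all (assumption: treat any
        # failure as infinitely overspent).
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


checkout = SLO(name="checkout", target=0.99)
print(error_budget_remaining(checkout, good_events=995, total_events=1000))
```

When the returned value drops to zero or below, that is the point at which the contract with stakeholders would kick in and the team gets time to address the causes.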
However, there exists a problem in the way I have seen these Objectives, Indicators, and Budgets set. The problem is exactly that sequencing. Teams will generally try to set Objectives first, often a difficult-enough task on its own in a fast-growing organization. They then will figure out Indicators, and monitor those Indicators until they threaten to miss the Objective.
When you’re somewhere between Points A and B though, you don’t necessarily know exactly what you should care about enough to set an Objective. Your product might be in flux, or you might be in a complex product space where engineering teams don’t exactly know what the #1 concern of their users is. The standard, understandable response is to try and care about everything, eventually burning the team out. Trying to derive your critical paths and SLOs from this position is difficult, if you can even convince your team to do so, because you will have no baseline for a priority cutoff (i.e. is a #5 priority pageable or not?).
What I was trying to express in the tweet was that at Point B nothing stops teams from continuously creating more and more alerts. I’ve been guilty of it myself. There exists no counter-pressure equivalent to Point C’s “SLOs vs Error Budgets” balance. So why don’t we create one?
While most teams create error budgets 1) downstream of SLOs and 2) related to one specific SLO, what if we were to create a more generic limit, and actually set the objective downstream of the limit itself? For example, teams trying to move towards SLOs and Error Budgets could put down:
“We are willing to get paged no more than 5 times per week, and no more than two times overnight in a given week. If we exceed this limit, the following cycle at least one engineer will be tasked with establishing whether the monitors that paged us are tracking Goals, or Non-Goals. We will track all Goals and Non-Goals, and use these to inform any SLOs we eventually set. If the offending pages relate to Goals, we will triage issues and write a plan of action.”
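As a rough sketch, the guardrail above could be checked with something like the following Python. The limits (5 pages per week, 2 overnight) come from the example statement. The definition of "overnight" as 22:00 to 06:00 and the names `PageBudget`, `is_overnight`, and `budget_exceeded` are my own assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class PageBudget:
    """Hypothetical weekly paging budget for one team."""
    max_pages_per_week: int = 5
    max_overnight_per_week: int = 2
    # Assumption: "overnight" means 22:00 to 06:00 in the on-call's local time.
    night_start: time = time(22, 0)
    night_end: time = time(6, 0)


def is_overnight(ts: datetime, budget: PageBudget) -> bool:
    """True if the page fired inside the overnight window (which wraps midnight)."""
    t = ts.time()
    return t >= budget.night_start or t < budget.night_end


def budget_exceeded(pages: list[datetime], budget: PageBudget = PageBudget()) -> bool:
    """Given one week's page timestamps, report whether either limit was broken."""
    overnight = sum(1 for ts in pages if is_overnight(ts, budget))
    return (
        len(pages) > budget.max_pages_per_week
        or overnight > budget.max_overnight_per_week
    )


week = [
    datetime(2024, 1, 1, 14, 30),  # daytime page
    datetime(2024, 1, 2, 23, 15),  # overnight page
    datetime(2024, 1, 4, 3, 5),    # overnight page
    datetime(2024, 1, 5, 2, 40),   # overnight page, third this week
]
if budget_exceeded(week):
    print("Budget exceeded: next cycle, review which monitors track Goals vs Non-Goals")
```

The check itself is trivial on purpose. The value is in the agreement that exceeding it triggers the Goals versus Non-Goals review, not in the arithmetic.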
When I write this out, it seems to me like a more experienced practitioner may respond with “well... duh”. The key point, though, is that if your organization is struggling with establishing or adopting SLOs and Error Budgets, it might be worth flipping the problem on its head: start with the question “How often are we willing to be interrupted?” and establish a guardrail that will push your team forward from there.
This was originally going to be a series of tweets, so apologies if it’s a bit of a ramble.