<!-- slide bg="/assets/emergent load testing-title.png" -->

note: Title page. I want to talk about what happens when things go wrong. Maybe use the "it's on fire everything is fine" meme.

---

<!-- slide bg="/assets/emergent load testing-body.png" -->

Image of Pyroscope incident.

note: Here's a real example of something that happened in production. ==More explanation of graph==

In some ways, this was a success. We had everything instrumented. Our stack was observable.

So why wasn't this enough? Why was this not a once-off but a recurring problem that we, with our highly observable everything, couldn't resolve for years?

Well, it's because observability isn't enough. Sure, it's observable, but when you have complex systems, you don't know *where* to observe. You don't know what to look at. And being observable says nothing about getting to the root cause of issues, much less fixing them.

So if our goal is simply "observability", it's a pretty limited one. Observability is not the end quality we want our systems to have; it's an incidental one. The goal should be continuous reliability. Observability is just *one of* the means to that end.

---

<!-- slide bg="/assets/emergent load testing-body.png" -->

Image of forge

note: Talk about how a forge is used to create weapons that last. Forges not only shape weapons to begin with, but make sure they're tempered enough to withstand use on the battlefield. ==Is there something about how putting a thing through the fire makes it stronger?==

---

<!-- slide bg="/assets/emergent load testing-body.png" -->

## What is continuous reliability? (FORGE)

+ Framework: Resilient infrastructure design
+ Overview: Continuous observability
+ Recovery: Incident management
+ Growth: Continuous testing and improvement
+ Engagement: Culture

note: Continuous reliability is the process of putting our system through the FORGE.

(RIGHT) It means having the right *Framework* in place: an underlying architecture that is resilient from the get-go and doesn't need to be rearchitected later when the system grows.

(RIGHT) It means always having an *Overview* of the system, through observability tools that let us peek under the hood at any time.

(RIGHT) It also means having *Recovery* mechanisms set up so that when things do go wrong, there is a process for the system to be fixed, reforged if needed, and sent off into battle again.

(RIGHT) It means leaving room for *Growth*: making sure there's enough slack between components for new, better measures to be implemented, and running continuous testing so that regressions don't get introduced (see the sketch after these notes).

(RIGHT) And continuous reliability also depends on the *Engagement* of SREs, of software developers, of stakeholders, and of people like us who are interested in making sure our systems keep improving.

I'll give examples of how we're doing all of these at Grafana, but first, maybe I should give you a summary of who we even are.
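A minimal, illustrative sketch of what the continuous-testing side of *Growth* can look like: a k6-style load test whose thresholds fail the run when performance regresses. The endpoint, load profile, and threshold values below are placeholders for illustration, not numbers from this talk.

```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  vus: 10,          // placeholder load profile: 10 virtual users...
  duration: '1m',   // ...for one minute
  thresholds: {
    // If either threshold is crossed, k6 marks the run as failed.
    http_req_duration: ['p(95)<500'], // 95th-percentile latency must stay under 500 ms
    http_req_failed: ['rate<0.01'],   // error rate must stay under 1%
  },
};

export default function () {
  // Hypothetical endpoint; swap in the service under test.
  const res = http.get('https://example.com/api/health');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
```

Run on a schedule or in CI (`k6 run regression-test.js`), a failed threshold makes the process exit non-zero, so a performance regression breaks the pipeline instead of surfacing as a production incident.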
---

<!-- slide bg="/assets/emergent load testing-body.png" -->

Grafana full stack logos

note:

---

<!-- slide bg="/assets/emergent load testing-body.png" -->

## Framework: Resilient infrastructure design

---

<!-- slide bg="/assets/emergent load testing-body.png" -->

## Overview: Continuous observability

---

<!-- slide bg="/assets/emergent load testing-body.png" -->

## Recovery: Incident management

---

<!-- slide bg="/assets/emergent load testing-body.png" -->

## Growth: Continuous testing and improvement

---

<!-- slide bg="/assets/emergent load testing-body.png" -->

## Engagement: Culture

---

<!-- slide bg="/assets/emergent load testing-body.png" -->

## What is continuous reliability? (FORGE)

+ Framework: Resilient infrastructure design
+ Overview: Continuous observability
+ Recovery: Incident management
+ Growth: Continuous testing and improvement
+ Engagement: Culture

---

<!-- slide bg="/assets/emergent load testing-body.png" -->