# [[Watching the watchers - KubeCon NA 2024]]

## Details

Conference:: [[KubeCon NA 2024]]
Date:: [[2024-11-13]]
Length:: 35 minutes

[Video of the talk on the CNCF channel](https://www.youtube.com/watch?v=pCxmwvqTklo)

[My copy of the video](https://youtu.be/d-4GThNevjc)

## Abstract

> [!abstract] Watching the Watchers: How we do continuous reliability at Grafana Labs
> Nothing is foolproof. Everything fails eventually. Observability tools help predict and lessen the impact of those failures, as the watchers of your software systems. But who watches the watchers?
>
> At Grafana Labs, we're not immune to production incidents. Just like any company, we still sometimes move too quickly. We run complex, microservices-based systems ourselves, so we have to eat our own dogfood on a daily basis.
>
> In this talk, I reveal:
>
> - how we solved a years-long mystery that cost us $100,000+
> - how we got our internal Mimir clusters to reliably hold 1.3 billion time series for metrics
> - what we've had to do to scale our Loki clusters to handle 324 TB of logs a day
> - what our Grafana dashboards to monitor Grafana Cloud look like
>
> Sometimes, it's easier to learn from failures in observability than from successes. This talk is a confession of some of our worst sins as well as a realistic look under the hood at how we're improving the continuous reliability of our stack.

## Structure

- Intro to Grafana Labs
- "Observability" isn't enough. What are we observing? Continuous reliability should be the goal
- What is continuous reliability? (FORGE)
    - Framework: Resilient infrastructure design
    - Overview: Continuous observability
    - Recovery: Incident management
    - Growth: Continuous testing and improvement
    - Engagement: Culture
- Framework: Resilient infrastructure design
    - *Case study:* how our Loki clusters can handle 324 TB of logs a day
        - Loki in microservices mode, and explain how other databases follow a similar pattern and why
    - *Case study:* We had [an incident](https://grafana.com/blog/2021/03/26/how-we-responded-to-a-2-hour-outage-in-our-grafana-cloud-hosted-prometheus-service/) where a single customer was able to overwhelm our Prometheus clusters (then [[Cortex]]) and cause a [[Cascading Failures|Cascading failure]] throughout the stack. Later, we used [[k6 (tool)|k6]] to test that [[Grafana Mimir|Mimir]] can reliably hold 1.3 billion time series for metrics (a rough k6 sketch follows this list)
- Overview: Continuous observability
    - Instrumentation (eBPF, Beyla, Alloy, manual)
    - Meta-observability: [[Erik Sommer]]'s [Paranoid Observability](https://docs.google.com/presentation/d/137MoKdf9KZu9YP24IYXIlbjVP2ikfgInxUxEIGJNA74/edit#slide=id.g30c913941f1_0_1290)
    - Screenshot of Grafana dashboards to monitor Grafana Cloud
- Recovery: Incident management
    - Who is on call? How many people?
    - [[Grafana OnCall]] + [[Grafana Incident]]
    - Every team is responsible for what it builds. There's no wall to chuck code over to an ops team that has sole responsibility for everything
- Growth: Continuous testing and improvement
    - Keda, k6
    - *Case study:* The Pyroscope story: the $100,000+ mystery
    - *Case study:* Grafana Bench
    - Faro
- Engagement: Culture
    - Default to transparency... even when it's "too much" for some people's comfort
    - Cross-pollination among teams is encouraged
    - Dogfooding is highly encouraged
    - Hackathons
    - Learning in public: RAD videos
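The Mimir case study above leans on [[k6 (tool)|k6]] for load testing. As a rough illustration only (not Grafana's actual test harness, whose target endpoint and traffic shape aren't described in these notes), a minimal k6 script has two parts: a staged virtual-user ramp and pass/fail thresholds. The `TARGET_URL`, query, and numbers below are placeholders.

```ts
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  // Ramp virtual users up, hold steady, then ramp down.
  stages: [
    { duration: '2m', target: 100 },
    { duration: '5m', target: 100 },
    { duration: '1m', target: 0 },
  ],
  // Fail the run if error rate or latency exceed these (arbitrary) limits.
  thresholds: {
    http_req_failed: ['rate<0.01'],
    http_req_duration: ['p(99)<500'],
  },
};

// Placeholder endpoint; a Mimir-style read path is assumed here, while a
// real capacity test would also push synthetic series via remote write.
const TARGET_URL =
  __ENV.TARGET_URL || 'http://localhost:9009/prometheus/api/v1/query';

export default function () {
  const res = http.get(`${TARGET_URL}?query=up`);
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
```

Run with something like `k6 run -e TARGET_URL=... script.ts`; recent k6 releases can execute TypeScript directly, otherwise transpile to JavaScript first.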
## Information

- [How we use metamonitoring Prometheus servers to monitor all other Prometheus servers at Grafana Labs](https://grafana.com/blog/2021/04/08/how-we-use-metamonitoring-prometheus-servers-to-monitor-all-other-prometheus-servers-at-grafana-labs/)
- [[Jeroen Op 't Eynde]]
- [[Erik Sommer]]'s [Paranoid Observability](https://docs.google.com/presentation/d/137MoKdf9KZu9YP24IYXIlbjVP2ikfgInxUxEIGJNA74/edit#slide=id.g30c913941f1_0_1290)
- [How we scaled our new Prometheus TSDB Grafana Mimir to 1 billion active series](https://grafana.com/blog/2022/04/08/how-we-scaled-our-new-prometheus-tsdb-grafana-mimir-to-1-billion-active-series/)
- (blog) [How we responded to a 2-hour outage in our Grafana Cloud Hosted Prometheus service](https://grafana.com/blog/2021/03/26/how-we-responded-to-a-2-hour-outage-in-our-grafana-cloud-hosted-prometheus-service/)
- (blog) [How adding Kubernetes label selectors caused an outage in Grafana Cloud Logs — and how we resolved it](https://grafana.com/blog/2022/08/31/how-adding-kubernetes-label-selectors-caused-an-outage-in-grafana-cloud-logs-and-how-we-resolved-it/)
- (blog) [How a Production Outage Was Caused Using Kubernetes Pod Priorities](https://grafana.com/blog/2019/07/24/how-a-production-outage-was-caused-using-kubernetes-pod-priorities/)
- (slides) [Grafana Bench Workshop](https://docs.google.com/presentation/d/1NypV_s2jmyCXr6I67J7Iqbb3GRIP6aAjdN9qxOV59Sc/edit?usp=sharing)
- (slides) [Ryan Perry's Pyroscope demo](https://docs.google.com/presentation/d/1t7SDSXXaLTja2DEgyZbzBNxJu2O_gUNqD147tHQWOJw/edit#slide=id.g26d0763b53a_0_0)
- (slides) [Bryan Boreham's metrics and logs cloud architecture](https://docs.google.com/presentation/d/1lXi0vbA3i0gc-EsA4ToyIGcl0OWKFLi_Y5tvtyhhJds/edit)

Faro

> [!NOTE] From Kostas
> Off the top of my head:
>
> - Hosted Grafana (our cloud)
> - OnCall
> - k6
> - Synthetic Monitoring
> - Us (Frontend Observability)
> - K8s app
> - Authentication App
> - Cloud Connections
> - Application Observability
> - Cloud Home
>
> If you want some screenshots from the dashboards they use: [https://appo11y.grafana.net/a/grafana-kowalski-app/apps](https://appo11y.grafana.net/a/grafana-kowalski-app/apps)
> [https://ops.grafana.net/a/grafana-kowalski-app/apps](https://ops.grafana.net/a/grafana-kowalski-app/apps)

## The presentation

[[Watching the watchers 1]]