# [[Watching the watchers - KubeCon NA 2024]]
## Details
Conference:: [[KubeCon NA 2024]]
Date:: [[2024-11-13]]
Length:: 35 minutes
[Video of the talk on the CNCF channel](https://www.youtube.com/watch?v=pCxmwvqTklo)
[My copy of the video](https://youtu.be/d-4GThNevjc)
## Abstract
> [!abstract] Watching the Watchers: How we do continuous reliability at Grafana Labs
> Nothing is foolproof. Everything fails eventually. Observability tools help predict and lessen the impact of those failures, as the watchers of your software systems. But who watches the watchers?
>
> At Grafana Labs, we're not immune to production incidents. Just like any company, we still sometimes move too quickly. We run complex, microservices-based systems ourselves, so we have to eat our own dogfood on a daily basis.
>
> In this talk, I reveal:
> - how we solved a years-long mystery that cost us $100,000+
> - how we got our internal Mimir clusters to reliably hold 1.3 billion time series for metrics
> - what we've had to do to scale our Loki clusters to handle 324 TB of logs a day
> - what our Grafana dashboards to monitor Grafana Cloud look like
>
> Sometimes, it's easier to learn from failures in observability than from successes. This talk is a confession of some of our worst sins as well as a realistic look under the hood at how we're improving the continuous reliability of our stack.
>
## Structure
- Intro to Grafana Labs
- "Observability" isn't enough. What are we observing? Continuous reliability should be the goal
- What is continuous reliability? (FORGE)
- Framework: Resilient infrastructure design
- Overview: Continuous observability
- Recovery: Incident management
- Growth: Continuous testing and improvement
- Engagement: Culture
- Framework: Resilient infrastructure design
- *Case study:* how our Loki clusters can handle 324 TB of logs a day
	- Loki in microservices mode; explain how other databases follow a similar pattern, and why
	- *Case study:* We had [an incident](https://grafana.com/blog/2021/03/26/how-we-responded-to-a-2-hour-outage-in-our-grafana-cloud-hosted-prometheus-service/) where a single customer was able to overwhelm our Prometheus clusters (then [[Cortex]]) and cause a [[Cascading Failures|Cascading failure]] throughout the stack. Later, we used [[k6 (tool)|k6]] to test that [[Grafana Mimir|Mimir]] can reliably hold 1.3 billion time series for metrics; see the k6 sketch after this outline
- Overview: Continuous observability
	- Instrumentation (eBPF, Beyla, Alloy, manual); see the manual-instrumentation sketch after this outline
	- Meta-observability: [[Erik Sommer]]'s [Paranoid Observability](https://docs.google.com/presentation/d/137MoKdf9KZu9YP24IYXIlbjVP2ikfgInxUxEIGJNA74/edit#slide=id.g30c913941f1_0_1290)
- Screenshot of Grafana dashboards to monitor Grafana Cloud
- Recovery: Incident management
- Who is on call? How many people?
- [[Grafana OnCall]] + [[Grafana Incident]]
	- Every team is responsible for what it builds; there's no wall to chuck code over to an ops team that's solely responsible for running everything
- Growth: Continuous testing and improvement
	- KEDA, k6
- *Case study:* The Pyroscope story: $100,000+ mystery
- *Case study:* Grafana Bench
- Faro
- Engagement: Culture
- Default to transparency... even when it's "too much" for some people's comfort
	- Cross-pollination among teams is encouraged
	- Dogfooding is highly encouraged
- Hackathons
- Learning in public: RAD videos
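
For the k6 case study above: a minimal sketch of what a k6 load test against a Prometheus-compatible query endpoint looks like. The endpoint, tenant header value, stages, and thresholds are placeholders for illustration, not the actual 1.3-billion-series test (that setup is in the Mimir blog post linked below). k6 scripts are JavaScript; this one has no TypeScript-only syntax, so it runs either way.

```ts
// Hypothetical k6 read-path load test against a Prometheus-compatible API.
// URL, tenant, stages, and thresholds are placeholders.
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '5m', target: 200 },  // ramp up to 200 virtual users
    { duration: '30m', target: 200 }, // hold steady load
    { duration: '5m', target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_failed: ['rate<0.01'],    // fail the test if >1% of requests error
    http_req_duration: ['p(99)<2000'], // fail if p99 latency exceeds 2s
  },
};

const BASE_URL = __ENV.QUERY_URL; // e.g. https://<mimir>/prometheus
const TENANT = __ENV.TENANT_ID;   // value for Mimir's multi-tenancy header

export default function () {
  const res = http.get(`${BASE_URL}/api/v1/query?query=up`, {
    headers: { 'X-Scope-OrgID': TENANT },
  });
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
```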
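
For the "manual" part of the instrumentation bullet: a small sketch of hand-written Prometheus instrumentation in a Node.js service using prom-client, exposing a /metrics endpoint for Prometheus or Alloy to scrape. Metric names, labels, and the port are illustrative; eBPF/Beyla/Alloy cover the no-code-change alternatives mentioned in the talk.

```ts
// Hypothetical manual instrumentation of a Node.js HTTP service with prom-client.
import http from 'node:http';
import { Counter, Histogram, register } from 'prom-client';

// Request counter, labelled by method and response status
const requestsTotal = new Counter({
  name: 'app_http_requests_total',
  help: 'Total HTTP requests handled',
  labelNames: ['method', 'status'],
});

// Latency histogram with prom-client's default buckets
const requestDuration = new Histogram({
  name: 'app_http_request_duration_seconds',
  help: 'HTTP request latency in seconds',
  labelNames: ['method'],
});

http
  .createServer(async (req, res) => {
    if (req.url === '/metrics') {
      // Scrape target for Prometheus, Alloy, or the metamonitoring stack
      res.setHeader('Content-Type', register.contentType);
      res.end(await register.metrics());
      return;
    }
    const stopTimer = requestDuration.startTimer({ method: req.method ?? 'GET' });
    res.end('ok');
    stopTimer();
    requestsTotal.inc({ method: req.method ?? 'GET', status: String(res.statusCode) });
  })
  .listen(3000);
```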
## Information
- (blog) [How we use metamonitoring Prometheus servers to monitor all other Prometheus servers at Grafana Labs](https://grafana.com/blog/2021/04/08/how-we-use-metamonitoring-prometheus-servers-to-monitor-all-other-prometheus-servers-at-grafana-labs/)
- [[Jeroen Op 't Eynde]]
- (slides) [[Erik Sommer]]'s [Paranoid Observability](https://docs.google.com/presentation/d/137MoKdf9KZu9YP24IYXIlbjVP2ikfgInxUxEIGJNA74/edit#slide=id.g30c913941f1_0_1290)
- (blog) [How we scaled our new Prometheus TSDB Grafana Mimir to 1 billion active series](https://grafana.com/blog/2022/04/08/how-we-scaled-our-new-prometheus-tsdb-grafana-mimir-to-1-billion-active-series/)
- (blog) [How we responded to a 2-hour outage in our Grafana Cloud Hosted Prometheus service](https://grafana.com/blog/2021/03/26/how-we-responded-to-a-2-hour-outage-in-our-grafana-cloud-hosted-prometheus-service/)
- (blog) [How adding Kubernetes label selectors caused an outage in Grafana Cloud Logs — and how we resolved it](https://grafana.com/blog/2022/08/31/how-adding-kubernetes-label-selectors-caused-an-outage-in-grafana-cloud-logs-and-how-we-resolved-it/)
- (blog) [How a Production Outage Was Caused Using Kubernetes Pod Priorities](https://grafana.com/blog/2019/07/24/how-a-production-outage-was-caused-using-kubernetes-pod-priorities/)
- (slides) [Grafana Bench Workshop](https://docs.google.com/presentation/d/1NypV_s2jmyCXr6I67J7Iqbb3GRIP6aAjdN9qxOV59Sc/edit?usp=sharing)
- (slides) [Ryan Perry's Pyroscope demo](https://docs.google.com/presentation/d/1t7SDSXXaLTja2DEgyZbzBNxJu2O_gUNqD147tHQWOJw/edit#slide=id.g26d0763b53a_0_0)
- (slides) [Bryan Boreham's metrics and logs cloud architecture](https://docs.google.com/presentation/d/1lXi0vbA3i0gc-EsA4ToyIGcl0OWKFLi_Y5tvtyhhJds/edit)
- Faro
> [!NOTE] From Kostas
> Off the top of my head
>
> - Hosted Grafana (our cloud)
> - OnCall
> - k6
> - Synthetic Monitoring
> - Us (Frontend Observability)
> - K8s app
> - Authentication App
> - Cloud Connections
> - Application Observability
> - Cloud Home
>
> If you want some screenshots from the dashboards they use: [https://appo11y.grafana.net/a/grafana-kowalski-app/apps](https://appo11y.grafana.net/a/grafana-kowalski-app/apps)
> [https://ops.grafana.net/a/grafana-kowalski-app/apps](https://ops.grafana.net/a/grafana-kowalski-app/apps)
## The presentation
[[Watching the watchers 1]]