---
Title: How We Scaled Our New Prometheus TSDB Grafana Mimir to 1 Billion Active Series
Author: Marco Pracucci
Tags: readwise, articles
date: 2024-01-30
---
# How We Scaled Our New Prometheus TSDB Grafana Mimir to 1 Billion Active Series

URL:: https://grafana.com/blog/2022/04/08/how-we-scaled-our-new-prometheus-tsdb-grafana-mimir-to-1-billion-active-series
Author:: Marco Pracucci
## AI-Generated Summary
To make sure that Grafana Mimir could support such a high scale of metrics, we performed extensive load testing and optimized performance where we found challenges.
## Highlights
> In 2021, we tested a Grafana Mimir cluster with a single tenant having 1 billion active series. We have made substantial improvements to Grafana Mimir since running this test. For commercial customers, we are able to test at higher scales with the explicit intention of publishing our results afterwards. ([View Highlight](https://read.readwise.io/read/01g2cf1ms4y9bkq05s6q3newwe))
> You can replicate our results by running [this Grafana k6 script](https://github.com/grafana/mimir/tree/main/operations/k6) against a Grafana Mimir cluster spanning 1,500 replicas, about 7,000 CPU cores, and 30 TiB of RAM. ([View Highlight](https://read.readwise.io/read/01g2cf1rcgj0ecfnnps0njjjgd))
> • A **value** is the result of a measurement. For example, 100 HTTP requests received by a server since startup.
> • A **sample** is a value together with a specific time at which the value was measured. For example, 100 HTTP requests received by a server since startup, measured on 2022-03-30 13:00 UTC.
> • A **metric** is the unique identifier of a resource being measured. It’s composed of the metric family name with labels attached to it. For example, http_requests_total{pod="app-1"} and http_requests_total{pod="app-2"} are two different metrics from the metric family http_requests_total.
> • A **time series** (or “series”) is the combination of the metric identifier and any number of samples related to it. For example, http_requests_total{pod="app-1"} with samples 100@13:00:00 UTC, 105@13:00:15 UTC, and so on.
> • An **active time series** (or “active series”) is a metric which has recorded at least one sample within the last 20 minutes. ([View Highlight](https://read.readwise.io/read/01g2cf4cjj5npakvy2x105r5c1))
> In this blog post, we are talking about **1 billion *active series* in a single Grafana Mimir cluster: 1 billion *time series* that have received a *sample* within the last 20 minutes**. ([View Highlight](https://read.readwise.io/read/01g2cf5jqxgvb72bgh6xak0twg))
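To make the terminology concrete, here is a minimal Go sketch of those definitions. The types and the `IsActive` helper are purely illustrative assumptions, not Mimir's actual data model.

```go
package main

import (
	"fmt"
	"time"
)

// Sample is a value together with the time at which it was measured.
type Sample struct {
	Timestamp time.Time
	Value     float64
}

// Series is a metric identifier (family name plus labels) with its samples.
type Series struct {
	MetricName string
	Labels     map[string]string
	Samples    []Sample
}

// IsActive reports whether the series received a sample in the last 20 minutes,
// matching the "active series" definition above.
func (s Series) IsActive(now time.Time) bool {
	for _, sample := range s.Samples {
		if now.Sub(sample.Timestamp) <= 20*time.Minute {
			return true
		}
	}
	return false
}

func main() {
	now := time.Now()
	s := Series{
		MetricName: "http_requests_total",
		Labels:     map[string]string{"pod": "app-1"},
		Samples:    []Sample{{Timestamp: now.Add(-15 * time.Second), Value: 100}},
	}
	fmt.Println("active:", s.IsActive(now)) // active: true
}
```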
> Due to Mimir’s user-configurable redundancy, three ingesters take in every sample by default. For 1 billion active series, 3 billion time series are ingested and later deduplicated back down into 1 billion during compaction. ([View Highlight](https://read.readwise.io/read/01g2cfdyfthdmy0f9pnrphcsk1))
> We started the testing with a relatively small load of 200 million active series, and we progressively scaled up over the course of a few weeks, all the way up to 1 billion active series. ([View Highlight](https://read.readwise.io/read/01g2cfqe0wpjx3y7rx34273khe))
> Moreover, assuming a scrape interval of 20 seconds, the cluster receives about 1 billion / 20s = 50 million samples per second. ([View Highlight](https://read.readwise.io/read/01g2cfsaas9t0wp72b12peaw93))
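A quick back-of-the-envelope check of those two figures, using the default replication factor of 3 and the 20-second scrape interval assumed in the post:

```go
package main

import "fmt"

func main() {
	const (
		activeSeries      = 1_000_000_000 // 1 billion active series for the tenant
		replicationFactor = 3             // default Mimir write-path redundancy
		scrapeIntervalSec = 20            // assumed scrape interval
	)

	// Each sample is written to replicationFactor ingesters, so the ingesters
	// collectively hold activeSeries * replicationFactor series in memory
	// before compaction deduplicates them back down.
	seriesInIngesters := int64(activeSeries) * replicationFactor

	// Ingest rate before replication: one sample per series per scrape interval.
	samplesPerSecond := int64(activeSeries) / scrapeIntervalSec

	fmt.Println("series held across ingesters:", seriesInIngesters) // 3000000000
	fmt.Println("samples received per second:", samplesPerSecond)   // 50000000
}
```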
> The enhancements we introduced include: ([View Highlight](https://read.readwise.io/read/01g2cfvdpd1nz5a94pmknfrmxy))
> Mimir uses [hash rings](https://grafana.com/docs/mimir/latest/operators-guide/architecture/hash-ring/?pg=blog&plcmt=body-txt) to share work across several replicas of a specific component. For example, all ingester replicas in the cluster build up a shared ingesters hash ring, which is used to shard and replicate received time series. ([View Highlight](https://read.readwise.io/read/01g2cfvnqm224pbyev9ts1cwyw))
> The hash ring is a data structure shared across the Mimir replicas using a key-value store. Mimir supports Consul, etcd, or memberlist (based on Gossip protocol). ([View Highlight](https://read.readwise.io/read/01g2cfwvhfzrrjkyyx15tmftk0))
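To illustrate how a ring shards and replicates series, here is a deliberately simplified sketch. Mimir's real ring registers many tokens per ingester and is coordinated through the key-value store; the single token per ingester, the token values, and the ingester names below are illustrative assumptions only.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type ring struct {
	tokens    []uint32          // sorted ingester tokens on the ring
	ingesters map[uint32]string // token -> ingester name
}

// hashSeries maps a series (by its labels) to a point on the ring.
func hashSeries(labels string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(labels))
	return h.Sum32()
}

// replicasFor walks the ring clockwise from the series hash and returns the
// ingesters that receive the series, one per replica.
func (r ring) replicasFor(labels string, replicationFactor int) []string {
	key := hashSeries(labels)
	start := sort.Search(len(r.tokens), func(i int) bool { return r.tokens[i] >= key })
	var out []string
	for i := 0; i < replicationFactor && i < len(r.tokens); i++ {
		token := r.tokens[(start+i)%len(r.tokens)] // wrap around the ring
		out = append(out, r.ingesters[token])
	}
	return out
}

func main() {
	r := ring{
		tokens: []uint32{1 << 28, 2 << 28, 3 << 28, 4 << 28},
		ingesters: map[uint32]string{
			1 << 28: "ingester-0", 2 << 28: "ingester-1",
			3 << 28: "ingester-2", 4 << 28: "ingester-3",
		},
	}
	// With replication factor 3, the series is written to three ingesters.
	fmt.Println(r.replicasFor(`http_requests_total{pod="app-1"}`, 3))
}
```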
> Compactor ([View Highlight](https://read.readwise.io/read/01g2cfy6d47edrds02d82c3m19))
> The [Mimir compactor](https://grafana.com/docs/mimir/latest/operators-guide/architecture/components/compactor/?pg=blog&plcmt=body-txt) is the component responsible for compacting multiple TSDB blocks uploaded by ingesters together into a single larger block. ([View Highlight](https://read.readwise.io/read/01g2cfy7mnra3qqsttha14ntr3))
> First of all, the compactor couldn’t keep up with the compaction workload. ([View Highlight](https://read.readwise.io/read/01g2cg11rw8f15xahy296qy9rj))
> Second, and most importantly, even if the compactor was able to finish the compaction, the resulting compacted block was broken. ([View Highlight](https://read.readwise.io/read/01g2cg15736pzddq7wh97sbash))
> To overcome these limitations, we built a new compaction algorithm, called split-and-merge compactor, which enables sharding and horizontally scaling the compaction of same-tenant and same-time-range blocks. ([View Highlight](https://read.readwise.io/read/01g2chaggm7h825v53ay29bef6))
> [](https://grafana.com/static/assets/img/blog/grafana-mimir-1b-series-split-and-merge-compactor-diagram.png)
> [Grafana Mimir: split-and-merge compactor](https://grafana.com/static/assets/img/blog/grafana-mimir-1b-series-split-and-merge-compactor-diagram.png) ([View Highlight](https://read.readwise.io/read/01g2chavmgkdrkqszq74np3nbw))
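The core idea of the split stage can be sketched as hashing each series into one of a fixed number of shards, so that same-tenant, same-time-range blocks can be compacted by different compactor replicas and then merged per shard. The shard count and helper below are hypothetical, not Mimir's configuration or code.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const splitShards = 4 // hypothetical shard count for illustration

// shardFor assigns a series to a split shard by hashing its labels. Series
// in the same shard end up in the same split block, and a second (merge)
// stage merges split blocks that belong to the same shard.
func shardFor(seriesLabels string) int {
	h := fnv.New32a()
	h.Write([]byte(seriesLabels))
	return int(h.Sum32() % splitShards)
}

func main() {
	series := []string{
		`http_requests_total{pod="app-1"}`,
		`http_requests_total{pod="app-2"}`,
		`cpu_seconds_total{pod="app-1"}`,
	}
	for _, s := range series {
		fmt.Printf("%-40s -> shard %d\n", s, shardFor(s))
	}
}
```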
> High-cardinality queries ([View Highlight](https://read.readwise.io/read/01g2chbe8fsm4na7w3k4zne7c3))
> Running Mimir with 1 billion active series is not just about being able to ingest such a large volume of metrics, but also being able to effectively query them. ([View Highlight](https://read.readwise.io/read/01g2chbfyspxqhjyrbj5xmen3d))
> To overcome this limitation, Mimir combines two main strategies to scale the execution of a single query across multiple CPU cores and machines:
> • Shard a query by time (time splitting)
> • Shard a query by series shards (query sharding)
> Time splitting breaks long-range queries into multiple queries. If a query spans over multiple days, it gets split into multiple one-day queries, and each query runs in parallel.
> Query sharding builds on top of time splitting, and breaks the series dataset into smaller pieces. These smaller pieces are called shards. Each shard then gets queried in a partial query, and those partial queries are distributed across multiple machines. The results of those partial queries are then aggregated, and the full query result is returned to the client. ([View Highlight](https://read.readwise.io/read/01g2chdsqv3rf24r664rxx84bj))
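A minimal sketch of how the two strategies compose, assuming one-day splits and a fixed shard count. The real query-frontend rewrites the PromQL expression per shard and merges the partial results; this toy version only enumerates the partial queries that would run in parallel.

```go
package main

import (
	"fmt"
	"time"
)

type partialQuery struct {
	Start, End time.Time
	Shard      int // which series shard this partial query covers
	Shards     int
}

// splitAndShard first splits [start, end) into one-day intervals (time
// splitting), then fans each interval out into N shard-specific partial
// queries (query sharding).
func splitAndShard(start, end time.Time, shards int) []partialQuery {
	var out []partialQuery
	for dayStart := start; dayStart.Before(end); {
		dayEnd := dayStart.Add(24 * time.Hour)
		if dayEnd.After(end) {
			dayEnd = end
		}
		for shard := 0; shard < shards; shard++ {
			out = append(out, partialQuery{Start: dayStart, End: dayEnd, Shard: shard, Shards: shards})
		}
		dayStart = dayEnd
	}
	return out
}

func main() {
	start := time.Date(2022, 3, 28, 0, 0, 0, 0, time.UTC)
	end := time.Date(2022, 3, 31, 0, 0, 0, 0, time.UTC)
	// 3 days x 4 shards = 12 partial queries running in parallel.
	for _, pq := range splitAndShard(start, end, 4) {
		fmt.Printf("%s..%s shard %d/%d\n",
			pq.Start.Format("01-02"), pq.End.Format("01-02"), pq.Shard+1, pq.Shards)
	}
}
```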
> We tested query sharding extensively. **We observed, on average, a 10x reduction in execution time for high-cardinality and CPU-intensive queries**. ([View Highlight](https://read.readwise.io/read/01g2che2mgykqxkcaxf756j6fw))
> Ingester disk I/O operations distribution ([View Highlight](https://read.readwise.io/read/01g2chet2pwdfsky648zjffdva))
> It solves the issue only partially because a chunk can’t span multiple blocks. Every two hours, at the boundary of each block time range, all chunks are forced to close and are written to disk. ([View Highlight](https://read.readwise.io/read/01g2chhfzqjv1d8vjqr379qck3))
> To further mitigate this, we worked to change the chunks writing logic in TSDB to a queue-based approach, with a background process to asynchronously write chunks to disk. We did this work in [Mimir first](https://github.com/grafana/mimir-prometheus/pull/57) and then proposed to upstream it to [Prometheus](https://github.com/prometheus/prometheus/pull/10051). ([View Highlight](https://read.readwise.io/read/01g2chk8ve0n5y9mtjmwpy30ed))
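The queue-based approach can be sketched as a buffered channel drained by a background goroutine. The types below are hypothetical and far simpler than the actual TSDB chunk write queue; they only show the shape of the idea.

```go
package main

import (
	"fmt"
	"sync"
)

type chunk struct {
	seriesRef uint64
	data      []byte
}

type chunkWriteQueue struct {
	jobs chan chunk
	wg   sync.WaitGroup
}

// newChunkWriteQueue starts a background goroutine that drains the queue and
// persists chunks, so the hot ingest path no longer writes to disk directly.
func newChunkWriteQueue(size int, writeChunk func(chunk)) *chunkWriteQueue {
	q := &chunkWriteQueue{jobs: make(chan chunk, size)}
	q.wg.Add(1)
	go func() {
		defer q.wg.Done()
		for c := range q.jobs {
			writeChunk(c)
		}
	}()
	return q
}

// enqueue blocks only when the queue is full; callers normally return immediately.
func (q *chunkWriteQueue) enqueue(c chunk) { q.jobs <- c }

func (q *chunkWriteQueue) close() {
	close(q.jobs)
	q.wg.Wait()
}

func main() {
	q := newChunkWriteQueue(16, func(c chunk) {
		fmt.Printf("wrote chunk for series %d (%d bytes)\n", c.seriesRef, len(c.data))
	})
	q.enqueue(chunk{seriesRef: 1, data: make([]byte, 120)})
	q.enqueue(chunk{seriesRef: 2, data: make([]byte, 96)})
	q.close()
}
```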
> Prometheus TSDB enhancements ([View Highlight](https://read.readwise.io/read/01g2chkyvb5rw30bf02k6xp9pj))
> We found a race condition during the TSDB WAL replay, which was causing subsequent blocks compacted from the TSDB head to contain out-of-order chunks. After testing the fix internally, we opened an [upstream PR](https://github.com/prometheus/prometheus/pull/9856) to fix it. ([View Highlight](https://read.readwise.io/read/01g2chmp0cf54dfsf7x42d1rnw))
> Multi-zone rollouts ([View Highlight](https://read.readwise.io/read/01g2chphtaakf9g1nsgabwby9s))
> Running a large Mimir cluster introduces serious challenges even when rolling out changes. When Mimir is deployed in [microservices mode](https://grafana.com/docs/mimir/latest/operators-guide/architecture/deployment-modes/?plcmt=footer#microservices-mode), ingesters and store-gateways should be rolled out sequentially in order to guarantee no failures on the write and read paths during the rollout. ([View Highlight](https://read.readwise.io/read/01g2chrzwrz0ehvrz2zb0npv7d))
> To speed up rollouts, we leveraged [Mimir zone-aware replication](https://grafana.com/docs/mimir/latest/operators-guide/configuring/configuring-zone-aware-replication/?plcmt=footer). Instead of treating ingesters individually, we treat them as grouped by zones. On the level of zones, rollouts still need to happen sequentially. Within zones, an arbitrary number of ingesters can be rolled out concurrently. ([View Highlight](https://read.readwise.io/read/01g2chv1rsksz8pf9sjy8jbhn7))
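A rough sketch of that rollout ordering, with hypothetical zone and ingester names (not Mimir's actual tooling): zones proceed one at a time, while ingesters within a zone restart concurrently, and zone-aware replication keeps the other replicas of every series available.

```go
package main

import (
	"fmt"
	"sync"
)

// rolloutZone restarts all ingesters in one zone concurrently and waits for
// the whole zone to finish before the caller moves on to the next zone.
func rolloutZone(zone string, ingesters []string) {
	var wg sync.WaitGroup
	for _, ing := range ingesters {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			fmt.Printf("restarting %s in %s\n", name, zone)
		}(ing)
	}
	wg.Wait()
}

func main() {
	zones := map[string][]string{
		"zone-a": {"ingester-a-0", "ingester-a-1"},
		"zone-b": {"ingester-b-0", "ingester-b-1"},
		"zone-c": {"ingester-c-0", "ingester-c-1"},
	}
	// Zones roll sequentially; ingesters within a zone roll concurrently.
	for _, zone := range []string{"zone-a", "zone-b", "zone-c"} {
		rolloutZone(zone, zones[zone])
	}
}
```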