Dos and Donts of scaling Loki Cache (Loki Community Call - March 2025)

# [[Dos and Donts of scaling Loki Cache (Loki Community Call - March 2025)]]

![](https://www.youtube.com/watch?v=3u8UxLfmq6I)

Related:: "[[Memcached in Loki]], [[Grafana Loki]], [[Poyzan Taneli]], [[Paul Rogers]], [[Jay Clifford]]"

## Talking points

- Cache Concepts
    - What is a cache? What's a memcache? What's memcached?
        - Cache: Software/hardware designed to speed up certain operations
        - Memcache: a cache held in memory
        - memcached: open-source, in-memory key-value store
    - What are the things that Loki caches?
        - biggest: chunk cache
        - query results (memcached frontend, memcached results cache)
            - 3 endpoints: metrics, volume queries,
        - index writes and lookups (labels)
            - You don't need this if you're using TSDB and not using BoltDB (memcached index queries in Jsonnet and Helm chart)
    - What are the telltale signs that indicate you might need a cache?
        - (e.g. high latency retrieving from the store despite all other components working fine)
    - What tradeoffs do you consider when using cache?
        - (e.g. latency versus potential inaccuracy/outdated data, and cost implications)
    - What types of queries benefit most from caching, and when might caching be less effective?
        - (repeated query behaviour, touching the same data, most queried time period, recent data)
        - Last 3 hours of data
        - It depends on how the user is using Loki. The intent is that the data you're touching very often is what should be cached.
- Memcached Tiering & Differences
    - What's the role of tiering in cache?
    - Can you describe the different components involved in memcache tiering? (Live Draw)
        - (e.g.
Memcache frontend (metrics, labels, volume; done via a hash in header), results cache (query hash, is it in cache or not in cache), and chunks cache (biggest cache, no need for tenancy awareness since the chunk ref holds this metadata))
    - What is the role of the results cache?
        - (Note: It saves computation cost)
    - What is the role of the chunks cache?
        - (Note: It reduces storage IOPS and bandwidth usage)
        - QUESTION from Andre Ziviani on dedup and chunks cache - what is a good/expected ratio for dedups?
    - What is a spillover cache and why might it not be necessary for most users?
        - (e.g. it prevents issues where memcache isn't smart enough to retain parts of a query)
    - Q: What are the key metrics to look at to know if the cache is correctly tuned?
        - IOPS
        - bandwidth usage
        - computation cost
        - latency
        - read path (query)
        - The memcached software itself has a command `stats` that will print out all sorts of info - hit rate, eviction, how often cache is compacted. Use this in conjunction with outside metrics.
        - Close to 90% utilization of memcached is what we aim for. When you deploy it as a pod, there are resource limits. But there's also an internal memcached memory setting, which should sit at about 90% of that limit. Correctly tuned memcached should never OOM.
- Scaling Memcache
    - How we handle cache at Grafana
        - memcached extstore: SSDs attached to memcached. It takes all data in memory first and then flushes to the disk. (Disk, not persistent volumes)
    - When should you scale chunk cache pods vertically/horizontally?
        - (Danny Kopping) L1 cache - huge storage (extstore)
        - (Ed Welch) L2 cache - update cache based on hits and misses. This is a spillover cache. There is a handoff period of a few days or more. When a query hits, look at the handoff time. If within L1 time, check L1. If not, check L2.
        - We keep 7 days in cache. We create a recording rule
        - Extstore is relevant for >half a PB per month.
    - How do we deploy memcache at scale in a Loki environment?
        - Mention Helm value differences from Jsonnet
        - (Note: Considerations include capacity up to 3TB and where to start)
    - How often should we revisit our caching decisions?
        - (Note: Reference [Community PR for not writing to cache if outside chosen window](https://github.com/grafana/loki/issues/14983))
        - Storage access (API calls), latencies that come from the cache; a trace would show the latency of each component
    - How to decide on cache numbers?
        - (Note: A heuristic suggests that 70% of the query timerange should hit a cell in 7 days; also, the Loki team manages our own instances)
    - How long do cached entries stay valid?
        - Reset 24 hours
        - Don't cache when we flush to storage - cache data will go stale. Time to live (TTL)
    - What Poyzan and Paul are working on: dynamically sizing memcache
        - Recording rules, then memcache scales up or down dynamically based on the hit ratio, etc. Appropriately sizing
- Failure Scenarios
    - What happens when the system is hit by a large number of queries that cover a long time period?
        - (Note: This could trigger a writeback loop for memcache)
        - Is my cache provisioned correctly? 1 day provisioned but querying 7 days
    - What are the performance impacts if we lose 20%, 50%, or all of the cache pods?
        - (Note: Loki can still serve up to 3 hours of data under such scenarios)
    - What happens if you see your cache OOMing? Or what should you do if you want to vertically scale your cache?
        - Cache CPU and RAM, and the configuration of how much it's allowed to use. Default covers for this
        - memory_limit_mb and max_item_size
        - Boxes that are too big are bad for memcache; check chunk size
- Monitoring Memcache
    - What are some best practices for monitoring cache performance in a Loki environment?
    - How can you verify that Loki is actually utilizing the cache?
- Cache vs other methods of query
    - When should you use a cache versus a recording rule?
        - (e.g. recording rules for data that changes over a period and is aggregatable vs.
cache for similar repeated queries)
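
A rough sketch of the "is the cache correctly tuned?" check from the key metrics notes: it parses the text reply of memcached's `stats` command and derives hit ratio, eviction count, and memory utilization to compare against the ~90% target. The counter names (`get_hits`, `get_misses`, `evictions`, `bytes`, `limit_maxbytes`) are standard memcached stats fields; the helper names `parse_stats` and `summarize` and the sample numbers are made up for illustration and are not anything Loki ships. With extstore in play, in-memory `bytes` alone will not tell the full story, so treat this as a starting point only.

```python
def parse_stats(raw: str) -> dict[str, str]:
    """Parse the text reply of memcached's `stats` command into a dict.

    Each data line looks like `STAT <name> <value>`; the reply ends with `END`.
    """
    stats = {}
    for line in raw.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def summarize(stats: dict[str, str]) -> dict[str, float]:
    """Derive tuning signals (hypothetical helper, not a Loki or memcached API)."""
    hits = int(stats.get("get_hits", 0))
    misses = int(stats.get("get_misses", 0))
    lookups = hits + misses
    used = int(stats.get("bytes", 0))            # bytes currently used for items
    limit = int(stats.get("limit_maxbytes", 0))  # configured memory limit in bytes
    return {
        "hit_ratio": hits / lookups if lookups else 0.0,
        "evictions": int(stats.get("evictions", 0)),
        "memory_utilization": used / limit if limit else 0.0,
    }


# Made-up sample reply, only to show the shape of the data.
SAMPLE = """STAT get_hits 900
STAT get_misses 100
STAT evictions 42
STAT bytes 966367641
STAT limit_maxbytes 1073741824
END"""

if __name__ == "__main__":
    print(summarize(parse_stats(SAMPLE)))
```

As the notes say, use these numbers in conjunction with outside metrics (IOPS, bandwidth, latency on the read path) rather than on their own.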