# [[Dos and Donts of scaling Loki Cache (Loki Community Call - March 2025)]]
[in developer-advocacy](obsidian://open?vault=developer-advocacy&file=projects%2FLoki%20Community%20Calls%2F2025-03-25-loki-cache)
![[Dos and Donts of scaling Loki Cache (Loki Community Call - March 2025)]]

Related:: "[[Memcached in Loki]], [[Grafana Loki]], [[Poyzan Taneli]], [[Paul Rogers]], [[Jay Clifford]]"
## Talking points
- Cache Concepts
- What is a cache? What's a memcache? What's memcached?
- Cache: Software/hardware designed to speed up certain operations
- Memcache: memory
- memcached: open-source, in-memory key-value store
- What are the things that Loki caches?
- biggest: chunk cache
- query results (memcached frontend, memcached results cache)
- 3 endpoints: metrics, labels, volume queries
- index writes and lookups (labels)
- You don't need this if you're using TSDB rather than BoltDB (memcached index queries in Jsonnet and the Helm chart)
- What are the telltale signs that indicate you might need a cache?
- (e.g. high latency retrieving from the store despite all other components working fine)
- What tradeoffs do you consider when using cache?
- (e.g. latency versus potential inaccuracy/outdated data, and cost implications)
- What types of queries benefit most from caching, and when might caching be less effective?
- (repeated query behaviour, touching the same data, most queried time period, recent data)
- Last 3 hours of data
- It depends on how the user is using Loki. The intent is that the data you're touching very often is what should be cached.
- Memcached Tiering & Differences
- What's the role of tiering in cache?
- Can you describe the different components involved in memcache tiering? (Live Draw)
- (e.g. Memcache frontend (metrics, labels, volume; done via a hash in the header), results cache (query hash: is it in the cache or not?), and chunks cache (the biggest cache; no need to understand tenancy, since the chunk ref holds this metadata))
- What is the role of the results cache?
- (Note: It saves computation cost)
- What is the role of the chunks cache?
- (Note: It reduces storage IOPS and bandwidth usage)
- QUESTION from Andre Ziviani on dedup and chunks cache - what is a good/expected ratio for dedups?
- What is a spillover cache and why might it not be necessary for most users?
- (e.g. it prevents issues where memcache isn’t smart enough to retain parts of a query)
- Q: What are the key metrics to look at to know if the cache is correctly tuned?
- IOPS
- bandwidth usage
- computation cost
- latency - read path (query)
- The memcached software itself has a `stats` command that prints all sorts of info: hit rate, evictions, how often the cache is compacted. Use this in conjunction with outside metrics.
- Close to 90% utilization of memcached is what we aim for. When you deploy it as a pod, there are resource limits, but there's also an internal memcached memory setting, which should sit at about 90%. Correctly tuned memcached should never OOM.
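
As a rough illustration of the `stats` point above, the sketch below parses the text that memcached's `stats` command returns and derives a hit ratio, an eviction count, and memory utilization. The counter names (`get_hits`, `get_misses`, `evictions`, `bytes`, `limit_maxbytes`) are standard memcached stats; the helper names and the sample numbers are made up for illustration, and this is not how Loki itself reports cache metrics.

```python
def parse_memcached_stats(raw: str) -> dict[str, str]:
    """Turn 'STAT <name> <value>' lines into a dict, ignoring anything else (e.g. END)."""
    stats = {}
    for line in raw.splitlines():
        parts = line.strip().split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def summarize(stats: dict[str, str]) -> dict[str, float]:
    """Derive a few tuning signals from the raw counters."""
    hits = int(stats.get("get_hits", 0))
    misses = int(stats.get("get_misses", 0))
    lookups = hits + misses
    used = int(stats.get("bytes", 0))
    limit = int(stats.get("limit_maxbytes", 0))
    return {
        "hit_ratio": hits / lookups if lookups else 0.0,
        "evictions": int(stats.get("evictions", 0)),
        "memory_utilization": used / limit if limit else 0.0,
    }


# Illustrative sample only; real output contains many more STAT lines.
SAMPLE = """STAT get_hits 900
STAT get_misses 100
STAT evictions 12
STAT bytes 943718400
STAT limit_maxbytes 1073741824
END"""

summary = summarize(parse_memcached_stats(SAMPLE))
print(summary)
```

In practice the raw text would come from the memcached instance itself (for example by sending `stats` over its text protocol), and the derived numbers would be read alongside the outside metrics listed above rather than on their own.
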
- Scaling Memcache
- How we handle cache at Grafana
- memcached-extstore: SSDs attached to memcached. It takes all data into memory and then flushes to disk. (Local disk, not persistent volumes)
- When should you scale chunk cache pods vertically/horizontally?
-
- (Danny Kopping) L1 cache - huge storage (extstore)
- (Ed Welch) L2 cache - update cache based on hits and misses. This is a spillover cache. There is a handoff period of a few days or more. When a query hits, look at the handoff time. If within L1 time, check L1. If not, check L2.
- We keep 7 days in cache. We create a recording rule
- Extstore is relevant for >half a PB per month.
- How do we deploy memcache at scale in a Loki environment?
- Mention Helm value differences from Jsonnet
- (Note: Considerations include capacity up to 3TB and where to start)
- How often should we revisit our caching decisions?
- (Note: Reference [Community PR for not writing to cache if outside chosen window](https://github.com/grafana/loki/issues/14983))
- Storage access (API calls), latencies that come from the cache; a trace would show the latency of each component
- How to decide on cache numbers?
- (Note: A heuristic suggests that 70% of the query timerange should hit a cell in 7 days; also, the Loki team manages our own instances)
- How long do cached entries stay valid?
- Reset 24 hours
- Don't cache when we flush to storage
- cache data will go stale. Time to live (TTL)
- What Poyzan and Paul are working on: dynamically sizing memcache
- Recording rules, then memcache scales up or down dynamically based on the hit ratio etc., so that it is sized appropriately
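
The L1/L2 handoff described above can be sketched roughly as follows. This assumes the handoff is expressed as a data age, with newer data looked up in L1 and older data in L2; `pick_cache_tier` and the 3-day handoff are hypothetical names and values for illustration, not Loki's API or configuration.

```python
from datetime import datetime, timedelta, timezone


def pick_cache_tier(chunk_end: datetime, now: datetime, handoff: timedelta) -> str:
    """Return which cache tier to consult for a chunk that ends at chunk_end.

    Assumption: data no older than the handoff period lives in L1,
    anything older has been handed off to L2.
    """
    age = now - chunk_end
    return "L1" if age <= handoff else "L2"


now = datetime(2025, 3, 25, tzinfo=timezone.utc)
handoff = timedelta(days=3)  # "a few days or more" in the notes; 3 is arbitrary

recent = pick_cache_tier(now - timedelta(days=1), now, handoff)
old = pick_cache_tier(now - timedelta(days=5), now, handoff)
print(recent, old)
```
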
- Failure Scenarios
- What happens when the system is hit by a large number of queries that cover a long time period?
- (Note: This could trigger a writeback loop for memcache)
- Is my cache provisioned correctly? 1 day provisioned but querying 7 days
- What are the performance impacts if we lose 20%, 50%, or all of the cache pods?
- (Note: Loki can still serve up to 3 hours of data under such scenarios)
- What happens if you see your cache OOMing? Or what should you do if you want to vertically scale your cache?
- Cache CPU and RAM: configure how much it's allowed to use. The defaults cover this
- `memory_limit_mb` and `max_item_size`
- Boxes that are too big are bad for memcache; check the chunk size
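
For the "1 day provisioned but querying 7 days" question above, a back-of-the-envelope check might look like this. It assumes chunk volume is roughly constant per day and that queries touch the queried range uniformly, both of which are simplifications. The function names and the sample volumes are hypothetical.

```python
def cached_days(cache_capacity_gb: float, daily_chunk_volume_gb: float) -> float:
    """Roughly how many days of chunks fit in the cache."""
    if daily_chunk_volume_gb <= 0:
        raise ValueError("daily_chunk_volume_gb must be positive")
    return cache_capacity_gb / daily_chunk_volume_gb


def window_coverage(cache_days: float, queried_days: float) -> float:
    """Fraction of the queried time range that could be served from cache."""
    if queried_days <= 0:
        raise ValueError("queried_days must be positive")
    return min(1.0, cache_days / queried_days)


# Illustrative numbers: a 500 GB cache with 500 GB of chunks written per day.
days = cached_days(cache_capacity_gb=500, daily_chunk_volume_gb=500)
coverage = window_coverage(days, queried_days=7)
print(f"{days:.1f} days cached, {coverage:.0%} of a 7-day query range")
```

A low coverage figure suggests that most of a long-range query falls through to object storage, which is the situation the large-query failure scenario above describes.
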
- Monitoring Memcache
- What are some best practices for monitoring cache performance in a Loki environment?
- How can you verify that Loki is actually utilizing the cache?
- Cache vs other methods of query
- When should you use a cache versus a recording rule?
- (e.g. recording rules for data that changes over a period and is aggregatable vs. cache for similar repeated queries)