# [[Meta-monitoring Loki - Loki Community Call May 2025]]
[in developer-advocacy](obsidian://open?vault=developer-advocacy&file=projects%2FLoki%20Community%20Calls%2FMeta-monitoring%20Loki%20with%20Dylan%20Guedes)
![[Meta-monitoring Loki - Loki Community Call May 2025]]
<iframe width="560" height="315" src="https://www.youtube.com/embed/OzYHV9xHIms" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
Related:: "[[Jay Clifford]], [[Dylan Guedes]], [[Grafana Loki|Loki]], [[Meta-observability]]"
## Talking points
- What is meta-monitoring and why is it necessary?
- It's having a dedicated, isolated observability stack just for observing Loki
- Why monitor Loki specifically?
- Increased reliability - there's a lot of complexity, and it's easy to overload Loki
- Compliance
- How do we do this at Grafana Labs for our own production environment?
- What's our setup like?
- One large Loki for meta-monitoring all environments
- Run us through what it's like to be on call. What type of alert would you get? What do you check first?
- What are some (general, non-customer related) issues you've encountered while on call?
- Read pods have poor performance → check log cardinality
- How to meta-monitor Loki
- Recommended: separate instance of Loki + Prometheus + Grafana or Grafana Cloud
- What should we look at?
- Questions from Matt on our Grafana Community Slack
- What log levels are worth capturing and looking at within Loki? All of them, especially if you're trying to troubleshoot.
- You can either update the Loki config or use [[Grafana Alloy|Alloy]] to discard logs with levels you don't need (see the Alloy sketch after this list).
- What kind of metrics can we get from the logs that are valuable in determining the health of Loki? What are the top key indicators that Loki is healthy vs. degraded? (Example PromQL queries follow this list.)
- Write path
- Distributor
- QPS (Queries Per Second)
- Including errors
- Latency (aggregated and per pod latency)
- Ingesters (same as distributor)
- Index (same as distributor)
- Read path
- Where to get meta-information about Loki
- Metrics: `http://localhost:3100/metrics` endpoint (Prometheus format)
- Logs: `metrics.go` emits a detailed log line for every query (query duration, number of lines returned, query throughput, the specific LogQL query executed, chunks searched, etc.). This also tells you which queries are generating the most load (see the LogQL example after this list).
- Configuring log levels
- Pod logs
- Kubernetes events
- Deployment
- k8s-monitoring-helm: https://github.com/grafana/k8s-monitoring-helm/
- Why did we choose the K8s Monitoring Helm chart?
- What are the pros and cons of using this chart?
- Run us through the config for the Helm chart. What values should we use? (A sketch of possible values follows this list.)
- How do we manage cardinality?
- Loki mixin
- What's a Loki mixin?
- How do we install a mixin? (See the install sketch after this list.)
- (brief demo?)
- What are some common failure scenarios?
- How do you use the mixin to troubleshoot an issue in production?
- Other community questions
- Since the query frontend "stitches" together the results of all the subqueries that the queriers execute, does that mean it has to hold the "whole" query result in memory, and might it therefore need double the memory a querier has? (I know that's a bit naive, and I guess there is some kind of lazy iteration, but I still want to understand this.)
- Is there any way to control and throttle such a big query without putting a hard limit in place, such as max lines / max query bytes? (A sketch of the relevant limits follows this list.)
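The talking point on discarding unneeded log levels can be made concrete with an [[Grafana Alloy|Alloy]] pipeline. This is a minimal sketch, assuming some discovery/scrape component already forwards logs to `loki.process.drop_noisy_levels.receiver`, that applications emit a `level=` field in the line, and that the push URL is a placeholder for the meta-monitoring Loki:

```alloy
// Minimal sketch: drop trace/debug lines before they are pushed to Loki.
// The regex assumes the level appears in the log line as "level=<value>";
// adjust it to match your log format.
loki.process "drop_noisy_levels" {
  forward_to = [loki.write.meta.receiver]

  stage.drop {
    expression          = "level=(trace|debug)"
    drop_counter_reason = "noisy_log_level"
  }
}

// Placeholder endpoint for the meta-monitoring Loki instance.
loki.write "meta" {
  endpoint {
    url = "http://loki-meta.example.net/loki/api/v1/push"
  }
}
```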
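For the write-path health indicators (QPS, errors, latency), the Loki mixin dashboards build on Loki's own `loki_request_duration_seconds` histogram. The queries below sketch that pattern; the `route` regex is an assumption and may need adjusting to your Loki version and deployment labels.

```promql
# Write-path QPS by status code (push route; adjust the route regex to your version)
sum by (status_code) (
  rate(loki_request_duration_seconds_count{route=~".*push.*"}[5m])
)

# Write-path error ratio
  sum(rate(loki_request_duration_seconds_count{route=~".*push.*", status_code=~"5.."}[5m]))
/
  sum(rate(loki_request_duration_seconds_count{route=~".*push.*"}[5m]))

# p99 push latency, aggregated across pods
histogram_quantile(0.99,
  sum by (le) (rate(loki_request_duration_seconds_bucket{route=~".*push.*"}[5m]))
)
```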
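To dig into the `metrics.go` query stats (and spot which queries generate the most load), a LogQL query along these lines works against the meta-monitoring Loki. The `namespace`/`container` selectors and the 10s threshold are assumptions about how the monitored Loki is labelled and what counts as "slow":

```logql
{namespace="loki", container="query-frontend"}
  |= "metrics.go"
  | logfmt
  | duration > 10s
  | line_format "{{.duration}} {{.total_bytes}} {{.query}}"
```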
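For the k8s-monitoring-helm deployment, here is a hedged sketch of a `values.yaml`. The key names follow the v2-style schema of the chart and may differ between chart versions, so treat this as a starting point and verify against the chart's values reference; hostnames are placeholders.

```yaml
cluster:
  name: loki-meta-monitoring

destinations:
  - name: metrics
    type: prometheus
    url: https://prometheus.example.net/api/prom/push
  - name: logs
    type: loki
    url: https://loki-meta.example.net/loki/api/v1/push

# Features: collect the monitored cluster's metrics, events, and pod logs
clusterMetrics:
  enabled: true
clusterEvents:
  enabled: true
podLogs:
  enabled: true

# Collectors that the features above rely on
alloy-metrics:
  enabled: true
alloy-singleton:
  enabled: true
alloy-logs:
  enabled: true
```

Installed with something like:

```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm install k8s-monitoring grafana/k8s-monitoring --namespace meta --create-namespace -f values.yaml
```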
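Installing the Loki mixin is typically done with jsonnet-bundler plus mixtool. The commands below are a sketch; the `production/loki-mixin` path and entrypoint reflect the layout of the Loki repo at the time of writing and may move.

```bash
# Install the jsonnet tooling (assumes a Go toolchain is available)
go install github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb@latest
go install github.com/monitoring-mixins/mixtool/cmd/mixtool@latest

# Vendor the mixin from the Loki repo
jb init
jb install github.com/grafana/loki/production/loki-mixin@main

# Render dashboards, recording rules, and alerts (default output locations)
mixtool generate all vendor/loki-mixin/mixin.libsonnet
```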
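On the question of throttling big queries without hard-rejecting them: Loki can split long queries into subqueries and bound their concurrency and runtime. A sketch of the relevant settings, assuming a recent Loki 3.x; option names and sensible values should be verified against the configuration reference for the version in use.

```yaml
limits_config:
  split_queries_by_interval: 30m   # break long time ranges into smaller subqueries
  max_query_parallelism: 32        # cap how many subqueries run concurrently per query
  query_timeout: 5m                # bound query runtime instead of result size

querier:
  max_concurrent: 8                # subqueries a single querier processes in parallel

frontend:
  max_outstanding_per_tenant: 2048 # per-tenant queue depth before load shedding
```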
## Timestamps
00:00:00 Announcements and introductions
00:04:27 What is meta-monitoring and why should we meta-monitor Loki?
00:08:17 Q - How many instances of Loki do we run at Grafana Labs?
00:10:03 Common on-call incidents for Loki
00:15:09 What should we meta-monitor?
00:18:49 Loki Mixin demo - what metrics to monitor for Loki
00:32:34 Where to get meta-monitoring data from Loki
00:36:09 Labels and querying best practices
00:39:12 How to deploy meta-monitoring for Loki via Helm (demo)
00:53:05 Gotcha: Scrape interval for Grafana Cloud
00:54:44 Q - How can you find which query is generating the most load?
00:55:15 Q - Best practices for migrating from ELK to Loki?
00:56:22 Q - Does the query-frontend need double the memory of queriers?
%%
# Excalidraw Data
## Text Elements
## Embedded Files
caae7077695870f678bc11a9c0b6f4a0d58ca17e: [[Loki Community Call - May 2025.png]]
## Drawing
```json
{
"type": "excalidraw",
"version": 2,
"source": "https://github.com/zsviczian/obsidian-excalidraw-plugin/releases/tag/2.11.0",
"elements": [
{
"id": "1ygz0qqn",
"type": "image",
"x": 56.3692626953125,
"y": -71.14180755615234,
"width": 500,
"height": 281.25,
"angle": 0,
"strokeColor": "transparent",
"backgroundColor": "transparent",
"fillStyle": "hachure",
"strokeWidth": 1,
"strokeStyle": "solid",
"roughness": 1,
"opacity": 100,
"roundness": null,
"seed": 36400,
"version": 2,
"versionNonce": 1412730691,
"updated": 1746709077726,
"isDeleted": false,
"groupIds": [],
"boundElements": [],
"link": null,
"locked": false,
"fileId": "caae7077695870f678bc11a9c0b6f4a0d58ca17e",
"scale": [
1,
1
],
"index": "a0"
}
],
"appState": {
"theme": "dark",
"viewBackgroundColor": "#ffffff",
"currentItemStrokeColor": "#1e1e1e",
"currentItemBackgroundColor": "transparent",
"currentItemFillStyle": "solid",
"currentItemStrokeWidth": 2,
"currentItemStrokeStyle": "solid",
"currentItemRoughness": 1,
"currentItemOpacity": 100,
"currentItemFontFamily": 5,
"currentItemFontSize": 20,
"currentItemTextAlign": "left",
"currentItemStartArrowhead": null,
"currentItemEndArrowhead": "arrow",
"currentItemArrowType": "round",
"scrollX": 324.2117919921875,
"scrollY": 345.1911926269531,
"zoom": {
"value": 1
},
"currentItemRoundness": "round",
"gridSize": 20,
"gridStep": 5,
"gridModeEnabled": false,
"gridColor": {
"Bold": "rgba(217, 217, 217, 0.5)",
"Regular": "rgba(230, 230, 230, 0.5)"
},
"currentStrokeOptions": null,
"frameRendering": {
"enabled": true,
"clip": true,
"name": true,
"outline": true
},
"objectsSnapModeEnabled": false,
"activeTool": {
"type": "selection",
"customType": null,
"locked": false,
"fromSelection": false,
"lastActiveTool": null
}
},
"files": {}
}
```
%%