# [[Alerting Best Practices - Loki Community Call February]]

[in developer-advocacy]()

![[Alerting Best Practices - Loki Community Call February]]

![](https://www.youtube.com/watch?v=3_DtwTOppiI)

Related:: "[[Grafana Loki|Loki]], [[George Robinson]], [[Jay Clifford]], [[Alerting]], [[Grafana Alerting]]"

## Timestamps

00:00:00 Introductions
00:04:35 Meet George Robinson
00:06:55 What is an alert?
00:09:12 What is a recording rule?
00:12:00 How alerting and recording rules work in Loki
00:14:19 Q: Can you use a recording rule in another recording rule?
00:15:51 How does alerting actually work? (diagram)
00:22:00 Q: When should the ruler query the querier or the store directly?
00:26:36 Q: How does the ruler work with multi-tenant Loki SSD or distributed deployment?
00:29:47 Common mistakes in alerting
00:41:00 Q: How do you use a recording rule if you have a delay in receiving logs?
00:42:51 Q: When should you use Grafana Managed Alertmanager vs Mimir Alertmanager?
00:49:53 Q: Is there a way to recover recording rules?
00:52:48 Q: Troubleshooting Loki recording rules not being sent
00:56:51 Q: Local vs. remote ruler evaluation mode
00:59:15 Q: How can you check loki-rules in a CD pipeline?
01:02:30 Q: How can I run multi-tenant queries in the ruler?
01:04:51 Q: Can I perform aggregation queries on labels extracted from metadata?
01:06:14 Q: Are all alerting rule settings available as configuration-as-code (SNS)?

## George's answers

**Who are you?**
**What do you do? How long have you been using Grafana?**

My name is George, I'm a software engineer on the Loki team. I've been at Grafana Labs for almost 4 years, of which I spent 3 years on the Alerting team before joining the Loki team.

I first used Grafana in 2016. At the time I was using a lot of the TICK stack (Telegraf, InfluxDB, etc.), but used Grafana as our visualization tool. I have been using Grafana ever since, and now I work here.

I have spent a lot of time since I joined Grafana understanding and contributing to the Prometheus Alertmanager, and that is probably where most of my expertise is. However, I also have good knowledge of alerting best practices, and have written a couple of blog posts on a number of best practices.

**What are alerts used for in Loki?**

There are a number of good use cases for alerts in Loki.

A lot of the time we don't control the source code of the software we run. This can be both open source and closed source software. Some software might not support metrics, and other software might be too complex or difficult to change without the right experience. But with Loki, you can still set up alerts with alert rules, and create your own metrics with recording rules, to observe these programs. This allows you to observe and monitor software that you wouldn't otherwise be able to.

Another use case is to use Loki for single event alerts, or information that cannot be well encapsulated in a metric. For example, has some event occurred at least once? An example of a single event alert is paging the engineer on-call if one of your applications has experienced a crash.

You can also use Loki alerts in situations where a Prometheus metric might not be a good choice. For example, high cardinality data that would create lots of series in Prometheus. Instead of having high cardinality series in Prometheus, you can put high cardinality data into a much smaller number of streams, and then match them with line filter expressions (grep) using LogQL. A great example of this is alerting on a per-tenant basis in multi-tenanted systems. Using Loki means high cardinality data can be extracted at query time, rather than at ingestion time. This is good because storing high cardinality data in something like Prometheus uses a large amount of memory, as you have the overhead of storing lots of series. But in Loki, you can have an order of magnitude fewer streams for the same amount of information.
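
To make the single-event case concrete, here is a minimal sketch of a Loki alerting rule that pages when a crash line is seen. The file layout mirrors Prometheus rule files; the stream selector, line filter, and label values are placeholders, not anything shown on the call.

```yaml
# Hypothetical rules file for the Loki ruler (layout mirrors Prometheus rule files).
# The {app="payments"} selector, the "panic:" filter, and the label values are placeholders.
groups:
  - name: crash-alerts
    rules:
      - alert: ApplicationCrashed
        # Count crash lines over the last 5 minutes; fires if at least one is seen.
        expr: |
          sum by (app) (count_over_time({app="payments"} |= "panic:" [5m])) > 0
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.app }} has crashed at least once in the last 5 minutes"
```
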
**What are recording rules used for in Loki?**

Recording rules let you turn logs into metrics. You do this by writing a LogQL expression that queries some amount of logs and outputs time series. For example, a query that counts the number of occurrences of a specific event in the last 5 minutes.

Recording rules, from the perspective of Loki, have a number of good use cases:

1. They allow you to turn log data into time series data. While it is common to use LogQL to turn logs into time series data, recording rules give you additional flexibility, as you can query the time series data from your recording rule alongside other metric data in your Mimir or Prometheus datasources. You might want to show this in a dashboard, or use it in an alert rule.
2. They allow you to pre-compute frequent or computationally expensive LogQL queries into time series data. You can then query the time series data instead, which, depending on how complex the query is and how much data you want to query, can be much, much faster. This is especially useful for dashboards, but it is also useful for alerts, which can be evaluated as frequently as once per minute. If your alerts are running very intensive LogQL queries, you may be able to reduce the load on Loki significantly with recording rules.
3. You can use recording rules to anonymize log data into metric data to be used by other teams in your organization, without having to give those teams access to your log data. This is useful because log data can contain sensitive information and PII.

**What component of Loki handles alerting and recording rules?**

The Loki ruler is responsible for evaluating both alerting and recording rules.

When evaluating alert rules, the Loki ruler will evaluate the LogQL in the alert rule definition at the requested interval. It keeps track of which alerts are firing and which alerts are resolved. It sends these alerts to the Alertmanager (if you have one running), where the alerts will be grouped, checked against any inhibition rules or silences, and then sent out as one or more notifications.

When evaluating recording rules, the Loki ruler will evaluate the LogQL in the recording rule definition at the requested interval. Most of the time you will want to configure remote write, so the Loki ruler sends the result of your recording rules to another datasource like Mimir or Prometheus.
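
Here is a hedged sketch of what such a recording rule group might look like. The stream selector, parsed label, and recorded metric name are placeholders chosen for illustration.

```yaml
# Hypothetical recording rule group for the Loki ruler.
# The stream selector, parsed label, and recorded metric name are placeholders.
groups:
  - name: nginx-recording
    interval: 1m
    rules:
      - record: nginx:requests:rate5m
        # Pre-compute the per-status-code request rate so dashboards and alerts
        # can query this series (via remote write into Prometheus or Mimir)
        # instead of re-running the LogQL query against raw logs every time.
        expr: |
          sum by (status_code) (rate({job="nginx"} | json | __error__="" [5m]))
```

With remote write enabled on the ruler, the resulting `nginx:requests:rate5m` series ends up in Prometheus or Mimir, where it can be used in dashboards and alert rules like any other metric.
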
**Do you still need an Alertmanager?**

If you want to receive notifications for firing alerts, or use inhibition rules, silences, or active and mute time intervals, then you need an Alertmanager. If you don't want any of this, and just want to see which alerts are firing, then you can use a visualization tool such as Grafana. It can show which rules are firing, and the alert instances for those rules, without having to set up an Alertmanager.

**How does alerting with Loki differ for Cloud vs. OSS users?**

In Cloud, as far as Loki is concerned, most of the features are the same. One big difference is that almost everything is set up for you when you sign up to Grafana Cloud. That includes the Alertmanager and remote write for recording rules. With OSS, you will need to set up and configure everything yourself. Grafana Labs has a great free tier if you are just looking to get started.

However, one of the other differentiating features in Grafana Cloud is being able to use all the products in Grafana IRM, such as Grafana Incident and Grafana OnCall, Grafana ML, and Grafana SLOs. A lot of the time we talk about having a single pane of glass, and Grafana Cloud offers that. You have your dashboards, alerts, on-call schedules, and incident tracking all in the same place.

**When is it better to use Grafana Managed Alerts vs. Datasource Managed Alerts?**

It might help to define some terms going forward, as it will help differentiate between the two options here. There are Grafana Managed Alerts, and there are Datasource Managed Alerts. You might hear me abbreviating these as GMA and DMA.

Grafana Managed Alerts are managed, stored, and evaluated within Grafana. Datasource Managed Alerts are managed, stored, and evaluated inside Loki, or inside Mimir, depending on whether your alerts are on logs or metrics. Specifically, this all happens inside the Loki and Mimir rulers. The reason it is called a Datasource Managed Alert is because the alert rule definition is stored and evaluated in Loki via the ruler component.

There are a couple of things to consider when evaluating Grafana Managed Alerts vs. Datasource Managed Alerts, and the answer also depends on whether you use Grafana Cloud or whether you use our open source products.

Grafana Managed Alerts have a number of features that are not available when using Datasource Managed Alerts. For example, role-based access control (RBAC), alerting state history, dashboard/panel screenshots, joining data across multiple datasources, pause/resume, etc. GMA also has good integration with Grafana Incident and OnCall. Grafana Managed Alerts even has its own Prometheus Alertmanager embedded inside it, meaning you don't need to set up an Alertmanager yourself. This is not the case with Datasource Managed Alerts.

Until recently, Grafana Managed Alerts lacked support for recording rules. However, this is coming and is expected to be generally available in Grafana 12 later this year, making Grafana Managed Alerts almost feature complete with Datasource Managed Alerts.

However, I say almost because, if you are using our open source products, an important factor in deciding between Grafana Managed Alerts and Datasource Managed Alerts is whether you operate at a large scale. In Grafana Cloud, both GMA and DMA are automatically scaled to meet your needs. However, in open source, only Datasource Managed Rules support horizontal scaling. To scale GMA, you would need to run separate Grafana installations and separate your workloads between them. GMA in Grafana OSS does not support horizontal scaling.

You can still use Grafana as a user interface to manage and interact with your Datasource Managed Alerts. All you need to do is add the ruler as a connection in the Grafana UI.

**What methods are there for setting up alerts and recording rules?**

When using Grafana Managed Alerts, you can create alerts using the Grafana UI, using the Grafana Provisioning API, and also using Terraform. The Grafana UI is great for experimenting, testing, and adjusting alert rules, and Terraform is a great place to manage rules as code. You can do both at the same time.

When using Datasource Managed Alerts, you can create alerts using the Grafana UI, or using a tool like lokitool. If you have ever used mimirtool, lokitool is the equivalent for Loki. If not, lokitool allows you to upload your rules as YAML files, just like Mimir, and similar to Prometheus. This allows you to manage your rules as code, and even provision alert rules using automation workflows such as GitHub Actions and CI/CD pipelines. Datasource Managed Alerts do not have official Terraform providers, but there are some community maintained providers out there.
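
As a hedged sketch of that rules-as-code workflow, a CI job might lint rule files on every pull request and upload them on merge. The paths, ruler address, and tenant ID are placeholders, and the lokitool subcommands and flags are assumed to mirror mimirtool; verify them with `lokitool rules --help` for your version.

```yaml
# Hypothetical GitHub Actions workflow for managing Loki rules as code.
# Paths, the ruler address, and the tenant ID are placeholders. The lokitool
# subcommands/flags are assumed to mirror mimirtool and should be verified.
# Assumes lokitool is already installed on the runner.
name: loki-rules
on:
  push:
    branches: [main]
    paths: ["rules/**"]
  pull_request:
    paths: ["rules/**"]
jobs:
  rules:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint rule files
        run: lokitool rules lint rules/*.yaml
      - name: Upload rules to the Loki ruler (main branch only)
        if: github.ref == 'refs/heads/main'
        run: |
          lokitool rules sync rules/*.yaml \
            --address=https://loki.example.com \
            --id=my-tenant
```
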
**What is lokitool? Is it the best way to set up alerts and recording rules?**

Explained above.

**What are the most common alerting rules?**

Most alert rules on metrics are counting rates and occurrences. This could be something like an error rate in the last 5 minutes, or the number of crashes in the last 5 minutes. The exact queries and metrics tend to be unique to each person's use case, but the general picture is the same: rates and occurrences.

A lot of alert rules on logs are looking for information that cannot be, or is not, well captured with metrics. For example, as we already spoke about: single events, alerts on high cardinality data, and of course software that might not even support metrics.

**When creating an alerting rule, what are the most important things to consider? (labels, time range, etc)**

There are a number of important areas to consider when defining alert rules. These are:

- Evaluation interval
    - The evaluation interval is how often the rule is evaluated. The lower this value, the more often the rule will be evaluated. The default evaluation interval of 1 minute is often a good default value that creates a good balance between knowing about problems early and load on datasources. Obviously, the more alert rules you have, the higher the load. You might want to increase the evaluation interval if the rules in a group are falling behind and missing evaluations, and you cannot move the rules into other groups for parallelization or optimize the queries any more.
- For duration
    - The for duration is how long the threshold must be met for the alert to fire. It's common for beginners to want to set the lowest for duration possible. But one of the most important factors that separates good alerts from bad alerts is that good alerts are actionable. If you are getting alerts for small blips that are un-actionable, then this is not a good alert, and it will just lead to alert fatigue (ignoring the alert). When choosing the for duration, take a step back and think "when do I have enough data to know that this is a problem or needs my attention?". Anything below that threshold can go into a dashboard instead for offline analysis.
- Labels vs. annotations
    - This is another topic that a lot of people get confused about. The difference between labels and annotations is that labels are what uniquely identify an alert. You can think of them like a primary key, if you are familiar with databases. What happens is the labels are hashed to derive a fingerprint, which is a series of characters and numbers. It looks just like a checksum.
    - You want the labels of the alert to be stable over its evaluation, and that means avoiding putting query results, for example, in labels, which can change every evaluation. In a lot of cases, this just causes you to have 100s of "duplicate" alerts.
    - Annotations, on the other hand, do not change an alert's fingerprint. If you want, you can put query results in annotations. You can also put human readable messages like summary and description.
    - A good summary and description is really important for making actionable alerts. Use the summary to tell you "what's wrong" in a sentence. Use the description to explain what the alert means and where to go to start fixing it. Alternatively, set a Runbook URL annotation containing a link to a runbook.
- Time range when doing range queries
    - When doing range queries, you want to choose your range selector to make sure you are querying enough samples. This is especially important when doing metric queries that calculate a rate. For example, alerting on the output of one of your Loki recording rules. If the rule is evaluated once per minute, having a range selector of 5 minutes would be sensible. This means any rate is calculated over the last 5 minutes. It is generally recommended to have a range of at least 4x the scrape interval. In the case of recording rules, this would be 4x the evaluation interval.
    - When writing alerts just using LogQL, you might want to account for any lag between your agents and Loki, otherwise you could miss some entries. To do this you can use the offset modifier, which lets your query start from now minus x minutes. For example, you may have an alert rule that queries the last 5 minutes of data, starting from 5 minutes ago. This will help avoid problems with late-arriving logs or metrics.

**Is there a way to test alerting rules before deploying them?**

This isn't very easy today. What you can do is use Grafana Explore to visualize your query and then compare what you see on the graph to the threshold you want to alert on. You can compare it to historical data and say "I would have liked to have been alerted here" and then adjust your threshold to match that.

I also understand that the Alerting team have been experimenting with a feature called "backtesting", where you can evaluate an alert rule against historical data. This would let you see exactly when your rule would have fired, had it been running at that time. However, at this time, the team cannot say when this feature will be generally available. If you are interested in this, check out https://github.com/grafana/grafana/issues/53443, where you can add your support with an emoji reaction.

**What are some common mistakes people make when creating alerting rules?**

Here are a number of common mistakes that people make when creating alert rules:

1. Putting variable data in labels (for example, the results of queries). As mentioned, this creates a number of problems where you end up with a "different" alert firing every minute for the same problem.
2. I have seen a lot of questions about how to include the log line in an alert. You can do this by aggregating/grouping on the log line, but if the log line contains non-repeating information like timestamps (which most do), you can end up with lots of alerts firing very quickly at the same time, for the same problem. If you really, really want to do this, try to match text from the log line that you know has low cardinality.
3. Aggregating away important labels, for example the service and sometimes the pod.
4. Rate windows that are too short. This is more of an issue when alerting on metrics (such as the results of recording rules), but it can affect LogQL alerts. You'll end up with gaps in the recording rule where there is no data, and in the case of alerts it might cause alerts to resolve despite the problem still existing (only to fire again at the next evaluation).
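
Pulling the points above together, here is a hedged sketch of a rule that keeps its labels stable, puts the human-readable detail in annotations, uses a for duration, and offsets the range to allow for late-arriving logs. All selectors, names, and thresholds are placeholders.

```yaml
# Hypothetical alerting rule illustrating the practices discussed above.
groups:
  - name: checkout-alerts
    interval: 1m
    rules:
      - alert: HighCheckoutErrorRate
        # 5m range (at least 4x the 1m evaluation interval), offset by 5m so
        # late-arriving log lines are still counted.
        expr: |
          sum by (service) (
            count_over_time({service="checkout"} |= "level=error" [5m] offset 5m)
          ) > 50
        # Only fire once the condition has held long enough to be actionable.
        for: 10m
        labels:
          # Stable, low-cardinality labels only -- no query results here.
          severity: warning
          team: payments
        annotations:
          summary: "checkout is logging an elevated number of errors"
          description: "More than 50 error lines in 5 minutes for {{ $labels.service }}. Start with the checkout dashboard and recent deploys."
          runbook_url: "https://example.com/runbooks/checkout-errors"
```
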
**How do you troubleshoot alerting rules?**

**How do you know if an alert is working?**

**Are there ways to aggregate multiple Loki alerts into a single notification?**

Yes. You would do this in Alertmanager using the `group_by` field in your route/notification policy. This field tells Alertmanager how to group related alerts together, and it does so using labels. For example, suppose you had two alerts with the labels `{foo=bar,service=api}` and `{bar=baz,service=api}`. You might have a `group_by` of `service`. Since both these alerts have the same service label (`api`), they would be grouped together into a single notification. Grouping is one of the most powerful features of Alertmanager.

**What are the most common recording rules?**

**Are there any performance considerations when creating recording rules?**

One of the main purposes of recording rules is to pre-compute frequent or computationally expensive LogQL queries. However, you need to make sure that all rules in a group, whether alerting or recording, can be evaluated within the interval you have specified. Remember that rules are evaluated sequentially too, which means in order, one at a time. If evaluation of the group takes longer than the interval, subsequent evaluations might be missed. If this happens, you may find gaps/missing time series in the output of your recording rules. You can split rules out into a larger number of smaller groups, which can be evaluated concurrently.

**What are some common mistakes people make when creating recording rules?**

1. Not aggregating enough labels. For example, when creating a recording rule, most of the time you do not want to keep all labels from the log query. You will want to reduce them (i.e. remove the non-important ones) by using aggregation functions such as `count by` and `sum by`. This reduces the cardinality of the metrics produced by your recording rules, makes them more efficient and faster to query, and reduces your memory and disk usage on Prometheus.
2. Rate windows that are too short. This is more of an issue when alerting on metrics (such as the results of recording rules), but it can affect LogQL alerts. You'll end up with gaps in the recording rule where there is no data, and in the case of alerts it might cause alerts to resolve despite the problem still existing (only to fire again at the next evaluation).

**Do you need to scale the ruler or should one replica be enough?**

This depends on the number of rules you have, how much data they query, and how long the queries take to run. A very simple, but effective, metric is memory and CPU usage. If you find your ruler is maxing out either of these, consider scaling it horizontally.

Poor performance of the `ruler` manifests as recording rule metrics with gaps, or missed alerts. This situation can be detected by alerting on the `loki_prometheus_rule_group_iterations_missed_total` metric when it has a non-zero value (see the sketch below). If you are finding missed evaluations, or your ruler is maxing out CPU or getting OOM'd because it is running out of memory, these are good indications that you need to scale your ruler deployment.

**Is the ruler stateful or stateless?**

The ruler is stateful. It has per-tenant write-ahead logs that are persisted to disk. The ruler uses these to recover the data output by your recording rules should it crash or be restarted. It also keeps a small in-memory store whose purpose is to lazy-load past evaluations when rescheduling or resharding rulers.
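
As a hedged sketch of that meta-alert, the missed-iterations metric can itself be alerted on wherever the ruler's own metrics are scraped (Prometheus or Mimir). The threshold and window are placeholders.

```yaml
# Hypothetical Prometheus/Mimir alerting rule on the Loki ruler's own metrics.
# The threshold and time window are placeholders; group by a rule-group label
# here if you want one alert per group.
groups:
  - name: loki-ruler-health
    rules:
      - alert: LokiRulerMissedIterations
        # Fires if the ruler has missed any rule group evaluations recently.
        expr: |
          sum(increase(loki_prometheus_rule_group_iterations_missed_total[10m])) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Loki ruler is missing rule group evaluations"
          description: "The ruler has missed rule group evaluations in the last 10 minutes; check its CPU and memory, split large groups, or scale the ruler."
```
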
**What is the ruler sharding strategy? Does this help with sharding expensive queries?**

Rulers shard their workload (evaluating rules) using rule groups. That means all rules in a rule group will always be evaluated by the same ruler. If you have really large groups that take a long time to execute, you might want to consider splitting them up into smaller groups that can be sharded.

Sharding does help with expensive queries, as each ruler acts as its own querier, in the sense that it executes queries against the store without using the query-frontend or queriers.

Another solution to this problem is to externalize rule evaluation from the `ruler` process. The `ruler`'s embedded query engine is single-threaded, meaning that rules are not split, sharded, or otherwise accelerated like regular Loki queries. The `query-frontend` component exists explicitly for this purpose and, when combined with a number of `querier` instances, can massively improve rule evaluation performance and lead to fewer missed iterations. It is recommended to create a separate pool of query-frontends and queriers that are dedicated to evaluating recording rules and alerts. This allows rules to be given "priority" over ad-hoc queries from dashboards etc., as they are used to produce metrics or alerts which may be crucial to the reliable operation of your service.
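
As a hedged illustration of externalized rule evaluation, the ruler can be pointed at a query frontend. The exact configuration keys can differ between Loki versions, so treat this as a sketch and check the ruler configuration reference for your release; the address is a placeholder.

```yaml
# Hypothetical snippet of Loki configuration enabling remote rule evaluation.
# The address is a placeholder; ideally it points at a query-frontend pool
# dedicated to rule evaluation, separate from ad-hoc/dashboard queries.
# The exact keys may differ between Loki versions -- check the ruler
# configuration reference for your release.
ruler:
  evaluation:
    mode: remote
    query_frontend:
      address: dns:///rules-query-frontend.loki.svc.cluster.local:9095
```
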
**(Matt Browne) Is anyone aware of what might cause Loki recording rules that you define in Grafana to disappear? Is there a way to recover recording rules?**

**I'm trying to figure out why my Loki setup is not running recording rules and not sending the resulting metrics to the Prometheus remote write endpoint. The rules do get added to the /rules directory by the sidecar container, but I don't see anything related in the logs of the loki container in the loki-backend pod or the loki-sc-rules container, even after enabling debug logging for both. Any ideas on how to troubleshoot this?**

If rules are being evaluated, you will see debug log lines for `evaluating rule`. If you see this, but remote write isn't working, then I would start to look for problems with remote write. If you don't see this, I would start to look for problems with rules being loaded into Loki.

I did find a bug report for this, https://github.com/grafana/loki/issues/11508, where it has been suggested that the problem is that the rules are loaded at start-up time, but the sidecar downloads the rules after the ruler has already started running. Loki should also sync all rules from the filesystem every minute. I tested this earlier and can at least see that it works, and it works for files that are added after start-up time. But I didn't have time to test with the sidecar. I think it's something we will have to look into, as it looks like other users have reported this problem too.

**How do you include log lines in the alert notifications? (Maybe just a few samples, without modifying/complicating the alert condition itself. Example: show the logs causing the alert on level=error.)**

The only way to do this today is to extract the log line as a label. But really, I don't recommend this, because it can very easily create a cardinality explosion where you have 1000s of active alerts all for the same problem. It's not very useful.

The problem is that Loki does not alert on log lines. It alerts on time series. When you create an alert in Loki, what you are doing is writing a LogQL query that turns logs into time series, and alerting on that instead. In other words, you are inherently aggregating it from something discrete into something continuous, like a rate. Your alert isn't on the log line, but on the rate or number of log lines, which is quite different.

Instead, what I recommend is thinking about writing good summary and description annotations for the alert that tell you what is happening. You can then link to Grafana Explore, for example, in an annotation that takes a part of the LogQL query, so you can then explore the log lines there. For example, if the alert query is `sum(count_over_time({} |= "foo"))`, the sub-query to link to in Explore would be `{} |= "foo"`.

I think it's a good feature request, but it needs some careful consideration. For example, if you have an alert that is counting 1000s or more lines, which lines do you want in the alert notification? Is it the first couple of lines, the last couple of lines, a random selection of lines?

**Alerting in multi-tenant environments: I run Loki ingesting logs from multiple clusters. Each cluster is identical but hosted in different zones/regions, and each has a tenant ID. At the moment, I need to create one alert per cluster (x20) because the ruler component does not support multi-tenancy. Are there any plans to support multi-tenant queries in the ruler component?**

It sounds like you have created different tenants for each cluster. Is there a reason for that? Tenants are meant to be isolated from each other; this is quite an intentional decision. Think of them like customers or teams, who should not be able to see each other's data.

You could create a recording rule for each tenant that remote writes to a single Prometheus, and then create an alert rule there, but you still have the same problem of creating N rules. I would consider having one tenant for all clusters, unless there is a really strong need for isolation between them. In which case, I would look into provisioning the alerts using lokitool and an automation pipeline (i.e. CI/CD).

## Notes

- What is an alert?
    - A rule or condition that is continuously evaluated according to a specified frequency. When the condition is met, the alert fires. Then you are notified.
    - Runs autonomously. Notification is key.
- Recording rule
    - Common use case: saves the result of computationally expensive queries as a Prometheus metric. Saves money and time by just loading the intermediate results from Prometheus.
    - You use the recording rule to calculate the metric that you want to alert on, and then you use an alert on it. The recording rule feeds into an alert.
- Alerts and recording rules in Loki are handled by the ruler
    - YAML: you specify a group that is a series of rules. You say how often they're evaluated and then you specify the rules. Groups are evaluated in parallel, but within a group the rules are run sequentially.
- ? Can you use the result of a recording rule as input to another recording rule?
    - Typically yes. In [[Grafana Mimir|Mimir]] or [[Prometheus]] you can.
    - In Loki no, because the Loki ruler queries logs and outputs metrics. So there's no way to output logs.
- How alerting works
    - Within Loki: the ruler by default queries the store and index using its own querier, although it *can* use the querier.
    - If it's a recording rule: it computes the result (metrics). This goes to something like Prometheus/Mimir, where you might have other metrics. Then Prometheus/Mimir could have a ruler too, which could also do alerts.
    - If it's an alert rule: fire an alert.
    - All alerts go to the [[Prometheus Alertmanager|Alertmanager]].
        - The Alertmanager's job is not to evaluate rules; it's to receive and group alerts so that you don't get notifications for every single one if they're saying the same thing.
        - The Alertmanager passes these groups through pre-determined routes that then notify via Slack, email, etc.
- Pros and cons of the ruler querying the querier vs. the store directly
    - Querier: can take advantage of caching, query splitting, and query sharding (the querier and query-frontend can split queries among other queriers).
        - If you want to use a querier with a ruler, you should create a set of queriers separate from your dashboard queriers, because alerting and recording queries are highest priority, as they're time-sensitive.
    - Store directly (use the ruler's own querier): rulers can be horizontally scaled and you can shard queries by group (not by rule).
        - To mitigate this, though, you could just run a recording rule to avoid running slow queries in the first place.
        - Though recording rules can generate a lot of [[AWS Simple Storage Service (S3)|S3]] traffic.
- ? Ruler and multi-tenant Loki SSD/distributed deployment
    - Defining two tenants in the Loki ruler config
- Common mistakes in alerting
    - Not being clear on what you want to know or what you need to know. Alerts should be:
        - Relevant
        - Specific
        - Actionable
    - Having a too-low evaluation interval (every 10-15 seconds). Consider this in conjunction with your collection rate. Maybe you don't need the interval to be that low.
    - Improper `for` duration: how long should the condition be met before it fires?
        - Too short a duration causes alert fatigue. Consider doing something like 5 min.
    - Using dynamic labels can impact alerting too.
        - Labels are like the primary keys of databases. They're key-value pairs. You don't want to put highly dynamic values in there, because then you'll end up with duplicate alerts even if they really are about the same issue.
        - Instead of labels, use annotations. They are descriptive, and you can also link to a dashboard within the alert.
    - The Loki alerts are actually on the time series data produced from the logs. That's why it's hard to get log lines into alerts.
- ? How do you use a recording rule if you have a delay in receiving logs?
    - The offset modifier lets you change the window that the query is looking at.
- Grafana Managed Alerts (GMA) vs. Datasource Managed Alerts (DMA)
    - GMA: managed and stored in the Grafana installation.
    - DMA: alerts are stored and evaluated in the data source; predate GMA.
    - If Grafana Cloud: GMA.
    - Other things that GMA has: RBAC, the alerting state history feature, screenshots of dashboards and panels when an alert fires, joining datasources for alerts, pausing and resuming of alert rules, integration with Grafana Incident and OnCall.
    - Why DMA
        - GMA doesn't have recording rules yet, though it's coming. This would make GMA feature complete with DMA.
        - Inhibition rules are also not supported by GMA. It's coming soon to Cloud, but not OSS.
        - The OSS GMA doesn't have horizontal scaling and sharding of rules, whereas the Mimir and Loki rulers do. You'd probably have to run separate instances of GMA and then split rules manually yourself. But with DMA it's taken care of. In Grafana Cloud, horizontal scaling and sharding is taken care of.
- Troubleshooting Loki recording rules
    - Check that the rules are written to the filesystem.
- How to tell whether you should use remote ruler evaluation (whether you use the query frontend)
    - Are the rules being delayed because the previous evaluation took too long?
        - Split queries if they cover long time ranges.
    - Did your ruler run out of memory?
    - Are you missing evaluations?
- Multi-tenant queries in the ruler
    - We don't really need multi-tenant queries in the ruler: just have one tenant across multiple clusters, unless there's a really strong need for isolation.