Grafana Machine Learning

# [[Grafana Machine Learning]]

[docs](https://grafana.com/docs/grafana-cloud/machine-learning/)

Grafana Machine Learning is a feature available in [[Grafana Cloud]] that lets you create predictions and detect anomalies based on time-series data. The project started in 2020 and was originally created by [[Gerry Boland]], [[Ben Sully]], and [[Yas Ekinci]], with [[Luccas Quadros]] joining about a year and a half later.

GML is used in [[Grafana Kubernetes Monitoring]] to predict memory and CPU utilization at the node and pod levels.

## Features

There are two main features that are actively maintained:

- [[Grafana Metrics Forecasting]]: predict the future behavior of a Prometheus series (or other time-series data source) and optionally alert on it.
- [[Grafana Outlier Detection]]: given a group of series that should behave similarly, identify any that are behaving differently from the rest.

Both can be accessed from the [[Grafana Machine Learning]] homepage or from the alerting page, where they are surfaced as a way to create dynamic alerts.

## Pricing and limits

Both forecasting and outlier detection are free, even for free [[Grafana Cloud]] customers. There are limits per tier:

- Free: 5 forecasts, 5 outlier detection jobs
- Hosted: 10 forecasts, 10 outlier detection jobs
- 100 series maximum per query (expandable to 1000 on request)

Limits can be bumped on request via escalation; there's internal documentation advising when to bump and when to schedule a call to understand the customer's use case first.

## Architecture

Grafana ML is a monorepo containing:

- **UI plugin**: a Grafana app plugin written in TypeScript/React.
- **ML API**: the [[Go]] layer responsible for management and scheduling (jobs, tenants, scheduler, predict route).
- **Model plane**: the Python layer. Includes the gateway (FastAPI), the worker containers (Prophet, Chronos, etc.), and the Grafana Data library.
- **Grafana Data**: a Python library that abstracts queries against the many supported data sources.

Training is orchestrated via a [[Celery]] queue. When a training job is scheduled, a Python container picks up the job, queries the data via the Grafana Data layer, trains the model, pickles it, and saves it to [[Redis]]. At prediction time, the model is loaded from Redis on demand and the result is returned via a fast API that mimics a [[Prometheus]] data source.

A cache layer prevents re-fetching the same data on every retrain. For example, when a daily retrain runs, only the most recent day of data is fetched.

The "data source" exposed to Grafana is actually an API that pretends to be Prometheus. This is what lets ML retroactively rewrite past values (e.g. the `anomalous` boolean metric) when a model is retrained, something Prometheus itself doesn't allow easily.

## Supported data sources

Not all Grafana data sources are supported. A subset is allow-listed, including most major databases and logs sources. Logs sources like [[Grafana Loki|Loki]] require the query to produce a time series (e.g. via `rate()`) rather than raw log lines.

Adding a new data source requires an escalation ticket and changes to the Grafana Data library in two places (front end allow-list, back end allow-list), plus documentation updates.

## Deprecated projects

Two other projects lived in the ML space and are being retired:

- [[Sift Investigations]]: being replaced by [[Assistant Investigations]].
- [[LLM Gateway]]: Grafana's first LLM initiative, now superseded by [[Grafana Assistant]] and the AI SDK.

The team is running a session at the offsite called "Spring cleaning" focused on how to properly deprecate legacy projects so they don't keep getting demoed by sales or appearing in the catalog.

## Status

No major new feature work is happening: mostly keeping the lights on. Escalations are mainly customers requesting higher limits.
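
As a rough illustration of the training and prediction flow described under Architecture, here is a minimal Python sketch. It is not the actual Grafana ML code, and every name in it (`fake_query`, `SeriesCache`, `train`, `predict`, the `predicted` metric name, the `model:<job_id>` key) is hypothetical. A plain dict stands in for Redis, the Celery queue is left out entirely, and a toy mean model stands in for Prophet or Chronos. The response shape is assumed to follow the Prometheus range-query format, since the note only says the API mimics Prometheus.

```python
import pickle

DAY = 86_400  # seconds
HOUR = 3_600


def fake_query(start, end):
    """Hypothetical data source: one sample per hour, value = hour of day."""
    first = -(-start // HOUR) * HOUR  # round start up to a full hour
    return [(ts, float((ts // HOUR) % 24)) for ts in range(first, end + 1, HOUR)]


class SeriesCache:
    """Keeps fetched samples so a retrain only queries the missing range.

    Assumes the training window only ever moves forward in time.
    """

    def __init__(self):
        self.samples = []
        self.fetched_ranges = []  # recorded only to make the behavior visible

    def load(self, query_fn, start, end):
        # Resume just after the newest cached sample, if there is one.
        fetch_from = self.samples[-1][0] + 1 if self.samples else start
        if fetch_from <= end:
            self.fetched_ranges.append((fetch_from, end))
            self.samples.extend(query_fn(fetch_from, end))
        return [s for s in self.samples if start <= s[0] <= end]


def train(job_id, store, cache, start, end):
    """Query data, fit a toy model, pickle it, and save it under the job id."""
    samples = cache.load(fake_query, start, end)
    values = [v for _, v in samples]
    model = {"kind": "mean", "mean": sum(values) / len(values)}  # toy model
    store[f"model:{job_id}"] = pickle.dumps(model)  # dict stands in for Redis


def predict(job_id, store, timestamps):
    """Load the pickled model on demand and shape the result like a
    Prometheus range-query response (resultType "matrix")."""
    model = pickle.loads(store[f"model:{job_id}"])
    values = [[ts, str(model["mean"])] for ts in timestamps]
    return {
        "status": "success",
        "data": {
            "resultType": "matrix",
            "result": [{"metric": {"__name__": "predicted", "job_id": job_id},
                        "values": values}],
        },
    }


store, cache = {}, SeriesCache()
train("job-1", store, cache, start=0, end=7 * DAY)    # first run fetches the full week
train("job-1", store, cache, start=DAY, end=8 * DAY)  # retrain fetches only the new day
response = predict("job-1", store, [8 * DAY + HOUR])
print(cache.fetched_ranges)
```

The second `train` call only queries the range after the newest cached sample, which mirrors the daily-retrain example above. Because the model is re-pickled under the same key on every retrain, a later `predict` for past timestamps returns values from the new model, which is one way to picture how past values can be rewritten.
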

## Resources

- [[2026-05-27 Machine Learning walkthrough]]: internal walkthrough by [[Luccas Quadros]]