Live - TCW 05 - Overview of AI Evaluation

# [[Live - TCW 05 - Overview of AI Evaluation]]

<iframe width="560" height="315" src="https://www.youtube.com/embed/_XGLBqVpL-Y" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

Related:: ""

[Google doc](https://docs.google.com/document/d/1WiTQA1bTqSxxwwE_HfWuAbbzvV1OoeIpgs0NqS0Dc2Y/edit?tab=t.nwyirt306sjk)

![[TCW 04 - overview of ai evaluation.png]]

## Topics

- ai evaluations (63)
- ai evals (37)
- how we test ai (64)
- how to test ai agents (66)
- everything you need to know about ai evaluation (100)
- overview of ai evaluation (89)

## Timestamps

## Notes

- Guest: Yas Ekinci, engineer on the AI team (~5 years at Grafana), now focused mostly on evals. Episode is prerecorded ahead of his two-month leave.
- **Why trusting AI is hard:** models are non-deterministic (it's all machine learning underneath), they write *a lot* of code, and PRs tend to be bigger and more over-engineered than human ones. The bottleneck has shifted from writing to reviewing (Nicole's framing: developers as "shepherds of intent").
    - Guidance is still needed but decreasing: from steering an agent 50–70 times per session a year or two ago to ~5–10 times now. Blindly vibe coding still ends badly.
- **The "final answer problem":** a plausible-looking outcome isn't enough. If the process is wrong you can't trust the outcome, and AI-generated tests often just get rewritten by the agent to make changes pass. Plausible ≠ correct, and plausible-but-wrong is the dangerous case.
- **What evals are:** testing whether an AI agent behaves the way you want (e.g. "who are you?" → "I'm Grafana Assistant"). The point is to know with confidence whether a change made the agent *better*, which is what enables consistent improvement over time.
- **Online vs offline evals** (maps to production vs pre-prod testing):
    - *Online*: scores real production conversations (a sample), usually via LLM-as-a-judge. Good as a pulse check (e.g. catching quality drops when a model provider swaps models), but only tests the deployed variant and is hard to verify. Closest analogy: traditional observability / RUM. See [[Live - Grafana Sigil and AI O11y (Context Window 03)]] for the AI observability episode.
    - *Offline*: a "golden set" of tasks run during development/CI. Controlled and reproducible; can test every change, prompt, and implementation, and dig into specific skills (PromQL vs LogQL, dashboarding). Closest analogy: integration testing.
- **How Assistant evals work:** an agent + tools (the harness) runs a task against a real-ish environment (Grafana with data sources, alerts, dashboards), then the full transcript/outcome is graded by "graders"/"verifiers." Key tension: environments must be real *and* reproducible.
    - The front-end [[Grafana Assistant]] runs some tools in the browser (good for auth/security), which makes testing harder; they use [[Playwright]] to replicate real browser behavior and pull telemetry out. Nicole pitched [[k6 (tool)|k6]]'s browser module (now approaching Playwright parity) as a dogfooding option; Yas was interested.
    - Internal eval tool is a CLI called **LLM Spec** (now ~3.0 after o11y-bench learnings); runs against cloud instances or a local controlled environment.
- **[[o11y-bench]]:** public benchmark announced at GrafanaCON. Six task categories (dashboarding, investigation, logs, metrics, traces). Built on the **Harbor** framework (also used by terminal-bench), which separates harness / environment / tasks, so you can swap harnesses and compare GCX+Claude Code vs the [[Grafana MCP]]. Has a leaderboard including open models (Qwen 3.6, Kimi).
    - **pass@3 vs pass^3:** pass@3 = gets it right at least once in three runs (rough capability); pass^3 = gets it right all three times (reliability). 10–20% gaps between the two highlight the reliability story.
    - Uses a **synthetic environment** (Prometheus/Loki/Tempo in Docker) with generated data, so the correct answer is known in advance. Also uses **fact-based rubrics**: the grader runs its own known-good queries against the environment and checks the agent's answer against them, catching subtle failures (e.g. a broken Tempo tool returning no spans while the agent still answered plausibly from metadata).
- **Public vs internal evals:** internal set is a superset that also includes scenarios based on real (private) conversations and Assistant-specific behavior (e.g. Slack's not-quite-markdown formatting). o11y-bench focuses on generic observability tasks and is harness-independent, grading on outcome not method.
- **The evaluation loop** (subject of Yas's blog post, [[Building an evaluation loop for Grafana Assistant]], and the GrafanaCON talk): measure → make changes (new tools, prompt/instruction tweaks, bug fixes) → measure again. The loop can be run by a coding agent (via the `LLM Spec Improve` skill), since you can't unit-test natural-language "programming" the way you test code. o11y-bench is mainly the measurement piece, used more manually.
- **Where it's going / open questions:** higher environment fidelity, more and harder tasks (likely an o11y-bench 2.0), getting more contributors (including PMs writing evals to capture desired product behavior), and automatically turning real online conversations into verifiable offline tasks (bridging online↔offline). Nicole floated a [[k6 (tool)|k6]]-extension-style community model for evaluators, and using o11y-bench to support BYOM (bring your own model) by showing users where smaller/local models fall short.
- **Industry angle:** Yas hopes labs tune models against o11y-bench (free fine-tuning toward better observability/Grafana performance). Noted benchmark reliability concerns: the new DeepSWE benchmark reshuffled rankings (GPT-5.5 leading over Opus), partly because labs tune against popular benchmarks.
- **Tooling note:** Yas is "bare bones" (Claude Code, Codex, OpenCode, one terminal plugin) and has recently been enjoying [[OpenSpec]] for ping-ponging the spec then verifying the agent followed the plan. Related episode: [[Live - TCW 04 - MCP vs CLIs for AI]].
- **Announcements/news covered:** Assistant now available for OSS/self-hosted Grafana; new "Unprompted" community Medium blog (Yas's eval-loop post was the first engineering post); new weekly AI team demo videos; Warp open-sourced; ChatGPT's "goblin problem"; GPT-5.5, Gemini 3.1 Ultra (2M-token context), Anthropic's Claude Mythos security preview; Matt Shumer's "Something Big Is Happening."

## Transcript

![[TCW 04-Evals_transcript]]
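
To make the pass@3 vs pass^3 distinction from the notes concrete, here is a minimal Python sketch. It assumes each task was run three times and each run was graded pass/fail. The task names and results are invented for illustration (they are not o11y-bench data), and the helper names are hypothetical; this is not how o11y-bench or LLM Spec computes scores.

```python
from typing import Dict, List


def pass_at_k(runs: List[bool]) -> bool:
    """True if at least one of the k runs passed (rough capability)."""
    return any(runs)


def pass_hat_k(runs: List[bool]) -> bool:
    """True only if every one of the k runs passed (reliability)."""
    return all(runs)


def summarize(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Fraction of tasks that satisfy each criterion."""
    if not results:
        raise ValueError("need at least one task")
    n = len(results)
    return {
        "pass@k": sum(pass_at_k(r) for r in results.values()) / n,
        "pass^k": sum(pass_hat_k(r) for r in results.values()) / n,
    }


# Hypothetical graded outcomes: three runs per task (k = 3).
results = {
    "dashboard-latency-panel": [True, True, True],
    "logs-error-spike": [True, False, True],
    "metrics-p99-query": [False, False, True],
    "traces-slow-span": [False, False, False],
    "investigate-oom": [True, True, True],
}

scores = summarize(results)
print(scores)  # {'pass@k': 0.8, 'pass^k': 0.4}
```

The gap between the two numbers is the reliability story from the notes: tasks the agent can solve, but not every time.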