TCW 04-Evals_transcript - Fork My Brain

Nicole (00:00:02): Hey everyone. Welcome to another episode of The Context Window. This is a semi-regular podcast. We try to make it every month and it's all about the things that are going on with AI at Grafana. Usually this is live and you might think looking at this now that this is live, but it's actually prerecorded and that's because Yas has a busy schedule ahead and I for sure wanted to get him on here before he left. So hi, I'm Nicole van der Hoeven. I'm a senior developer advocate at Grafana Labs and I'm assigned to the AI team these days. And one of the people on the AI team is, whoops, I always get the directions wrong. This person over here, Yas. How do you say your last name actually? Yas (00:00:51): Oh, it's Ekinci. Nicole (00:00:54): Ekinci. Oh, it's the C with a little thing. Yas (00:00:57): Yeah, it's last names. C is something like that. Nicole (00:01:03): Yeah. I was totally confused because I just found out that you speak Dutch. Did not know that. Yas (00:01:11): I'm based in Belgium and I was like, oh, Nicole van der Hoeven. I was like, okay, that sounds familiar enough to me. Nicole (00:01:18): Yeah. It's the naam van mijn ex-man. That's the name of my ex-husband, but it's okay. We still like him. Yas (00:01:27): Also, you mentioned before I left, but technically I'm not leaving Grafana. I'm just going on a two month Nicole (00:01:34): Crazy. I probably should. Yeah. I'm starting rumors now that this is the last ever one. Yas (00:01:41): I've got to be careful. You know how they spread around the virtual office. Nicole (00:01:46): Yeah. No, I am Dutch. I've lived there for nine years now. I have Dutch nationality. So I do speak it not natively, but I do speak it. Yas (00:01:55): Can we do this in Dutch fully now? Nicole (00:01:58): Yeah, I got Yas (00:01:58): That so cute. Specifically for the Dutch audience. Nicole (00:02:02): That's super niche. 
But how about we do it in English and tell us what you do on the AI team. Yas (00:02:11): Yeah. So yes, I am an engineer on the AI team. I've been at Grafana now for, I think- Nicole (00:02:19): At least three years. Yas (00:02:20): I mean, I'm getting to five years in next week. Nicole (00:02:24): I was saying at least three years because we both have our- Exactly. This is a subtle flex for people outside of Grafana. You only get that when you've hit three years in Grafana, right? Yas (00:02:38): The only problem is, due to cost savings, we didn't get the fancy remote that the initial people got. So we can't dim it as they do. We have to actually go and turn it off, but that's about it. Nicole (00:02:49): Like animals. Yas (00:02:51): No, what do I do? So when I joined, this is pre-AI. We had the machine learning team and we were building predictive models, things to make sense of the signals and do forecasting, outlier detection, that kind of stuff. And as the AI wave came, we all got absorbed into this newer, larger AI department. So I would say it's been quite varied. So I'm working on things like obviously the assistant or AI agent inside of Grafana, but it's been quite a lot. I mean, it's hard to say, I'm just working on this one thing. I can say the last, I think the latter half of, especially since this year, I've been more and more focused on specifically evals. So specifically how do we figure out what is the quality, what's the reliability of these things that we're deploying out? AI is a very useful tool and it works quite nicely. (00:03:58): It has utility, right? It's not like I was going to mention crypto or what was it? Oh my God. I even forgot the name. I was going to say NFTs. NFTs. NFTs, but it's definitely not like that. It definitely has uses. We use it day to day. I use it day to day. My job has changed drastically over the last two years, but you still notice that it does sometimes need guidance. It does sometimes have issues with reliability. 
It doesn't quite get it always all the time. And even in my day-to-day work, like working with these coding agents, you kind of notice they still, even the top ones, even the GPT-5.5s and the Opuses, they still need a lot of guidance. And the thing is, the trend is that the guidance has come down. If I compare with like a year ago or like two years ago, where maybe during a session I would have to steer it maybe 50 times, 60 times, 70 times, that has kind of come down to like, now I just need to do that five or 10 times. (00:05:04): But you still need to do it. I think if you just blindly vibe code your way out of this, you're not going to end up in a good place. Nicole (00:05:12): Yeah. It really moves so quickly. Whenever I hear from someone that they tried AI and it didn't do what they were wanting it to do, I'm like, well, when was that? Because if it's like six months ago, I'm like, try it again because maybe it does now. And that's without getting into prompt engineering and maybe people don't know yet this is a new thing. We're all learning how to prompt better. But yeah, things are moving all the time. And in this episode, we're going to be talking all about evals, but first I wanted to go through just some announcements because there's a lot that's going on. Everything that I mention though is going to be linked in the description below. So I won't be putting up the links here, but I will put it in the description. The first is that I have a video for Grafana Assistant for Grafana OSS. (00:06:12): That's a mouthful. Assistant is available for OSS and self-hosted Grafana now. I just made a video showing you how to do that. Another thing is that Yas ... Oh yeah. So there's this new Medium blog for Grafana Labs now, which is called Unprompted. It's a cool name actually. And I really love the idea of it being a community-focused blog for all things Grafana. And Yas just created one. He wrote an article about building an evaluation loop. 
That's kind of topical. Yas (00:06:50): Yeah, we can talk about that. Yeah, it was quite nice to be- Nicole (00:06:52): Yeah, great. Yas (00:06:54): ... working on this and being, I think, the first engineering post on Unprompted, which is quite nice. Nicole (00:07:02): Yeah, that's great. And then there's also this thing that we're starting, which is the AI weekly videos. So every week the AI team, so both Yas and I are in it, has this weekly team meeting and I don't know, we just, actually, Mat Ryer had this crazy idea of like, what if we just took an excerpt from that and actually published it? So it's usually around the demo part and I edit those and I put it out and the first two are already out. So if you ever wanted to know what it's like, but just be warned that we demo things that may not ever make it into production and almost certainly even if they do, will look completely different. So you're getting it raw, warts and all. Yas (00:07:54): Yeah. I need to make sure now that I don't wear my PJs anymore for the AI weekly. Everyone has to be camera. And also need to keep it PG-13. Nicole (00:08:06): Yeah. Okay. Well, also last month in AI, some quick AI news. Warp is now open source. Do you use Warp, Yas? Yas (00:08:17): I used to, but I don't. Like, my tooling, it's quite weird. I'm very bare bones actually. So I'm like Claude Code, Codex and OpenCode. I think those are my three go-tos. And I don't do a lot of ... I think in OpenCode, I just have one plugin to help me with terminal shells. I don't do a lot of these crazy plugins and multiple agents or memory systems and everything. Everybody's making some wild claims about like, "Oh, use this or use this approach and that's going to make your development flow like a hundred X better." I'm pretty bare bones. There's a good blog post about this that's about "just talk to it" and it's really about like you just need to keep talking, keep guiding and I think that works quite well. 
The only recent shift, I think, is that I've been trying out OpenSpec and I quite like it. Nicole (00:09:11): Ah, yeah, me too. Yas (00:09:13): Yeah. It's because you do, especially the explore, the ping-ponging of an idea. I do feel like it works better than just the regular plan mode and then it is encoded. And again, we're going to talk about reliability. OpenSpec can help with that as well because now you first ping-pong the idea and then you get a nice little plan and you have a way to verify if the agent actually did everything according to plan. So yeah, I've been doing more and more of that. Nicole (00:09:43): Yeah, I've been using that as well. And I actually have on my list of things to do to get Lucas on here to talk about OpenSpec. So that's for sure on the cards. Another thing, oh, this is a really cool one. ChatGPT has a goblin problem. Have you heard about this? I love this story. So what happened was that OpenAI released this blog post, linked below, saying that some versions of ChatGPT they realized had this weird bug that seemed harmless enough where it just mentioned goblins a lot and it apparently had something to do with a particular personality that you can select for ChatGPT. And it's an interesting read. I recommend it. It's just so, of all things, goblins, like actual goblins. It apparently had learned that people who are nerdy, because it was the nerdy personality that was selected, talk about goblins a disproportionate amount of the time. Yas (00:10:55): And it's quite stereotypical like, oh yeah, nerdy. It has to be like Magic: The Gathering and all that kind of stuff. I did play Magic. Nicole (00:11:07): Yeah, I played D&D. It's a stereotype for a reason. There are also a few new models: GPT-5.5, and Gemini 3.1 Ultra now has a two million token context window. That's huge. I think that's the biggest one of the frontier models. 
Oh, another big thing was Anthropic talked about the Claude Mythos preview, which is their model that is specifically targeting security vulnerabilities. Yas (00:11:45): Also quite topically interesting. I do feel like it's hard to distinguish how much of this is marketing and hype and how much of it is real. Obviously it's going to be better than what we have, but is it that much better or is it just optimized for security? I don't know. But I think they announced it, or at least I don't know if they announced it or if there are rumors that it is going to go public soon in a few weeks. Nicole (00:12:13): Yeah. I think they gave a few people private preview, a few companies, a private preview to allow them, I guess, to kind of fix some of those issues before they release it. So yeah, we'll see what's happening there. Another thing that I wanted to talk about was an awesome article by Matt Shumer and it's entitled "Something Big Is Happening" and it's exactly what you were saying at the beginning, that your job has changed and very drastically and very recently. And there's an interesting quote in it where he says, "I am no longer needed for the actual technical work of my job," which was like, "Ooh, that hits hard." His advice was, "Think about where you stand and lean into what's hardest to replace." Which I thought was pretty solid. Yas (00:13:15): I think it's still that guidance thing, right? So there's still some piece where right now you still need the human in the loop, let's say, to kind of guide it. But in a way, it's also inevitable. If you ask me how much code have I manually written, so the actual technical piece, very close to zero. So that is quite interesting. We'll talk about this with evals as well because in evals it doesn't work as well as it does for coding, but we'll get to that. Nicole (00:13:54): Okay. Well, let's talk about difficulties in trusting AI to begin with. 
You talked about just basically YOLOing it and just accepting commits and things that we're exploring to soften that, OpenSpec and stuff. Why is it so difficult to trust AI? Why don't we just YOLO all the time and just accept everything? Yas (00:14:20): So I think it's a mix of ... I mean, A, everybody knows that these models, they're non-deterministic and they are not perfect. These are not new problems to be fair. In the past there was a thing called machine learning, which is still AI. AI is just a nice little rebranding of it, but at the end of the day, all of these models, they are just based on ... They're all machine learning. When you get down to it, it's just all numbers and math and stats and they're not perfect. They're not deterministic. So every time you talk to it, it might give you a slightly different response. I think that's one part of the problem. I think the second part is it writes a lot of code. Suddenly, because it's a new tool, you can 10X your output essentially. I can create because it takes much less effort for me to actually write the code. (00:15:13): I can just send 10 agents to do 10 different features or bug fixes or whatever, like 10 different types of work, launch all those PRs. But then because the pace is moving so fast and because you have tens, hundreds of PRs that are open, somebody needs to review them. It also just gets hard to review because A, the PRs are typically quite big. This is also why when I talk about the guidance piece, one of the things I always do at the end when I'm done with a piece of work, I go, okay, now try to find a simpler way, simplify this because (00:15:48): it does tend to create a lot of code. We talk about AI slop, it just creates a very big over-engineered solution to the problem and then you need to guide it to get to a much smaller version of that. Or you need to break up that PR, that whole branch of changes, into smaller bits that you can merge more easily. 
But yeah, so it's like both bigger PRs. I think generally, probably if we actually track this, the amount of changes in a PR is bigger on average compared to human PRs. And because so many changes are happening, it's just hard to keep track. You have a mental model of the entire thing. Even in the assistant repo, we've got a monorepo setup. So there's like 10, 12 working groups that are all working on different parts of the assistant and it's really hard to keep track of what every working group is doing all the time. (00:16:52): In the beginning, I can give the example of like, I was quite involved in the beginning when we were working on the Slack assistant and over time I focused on something else and now when I try to review some of those changes to the Slack assistant, it does take me a while. And interestingly, I do then use AI to help me review that PR. (00:17:14): And again, you get back to the same problem of like, well, if you use AI to review the AI, well, how much can you trust that? And you also lose a bit of your opinion of like, this is good code or this is bad code. Nicole (00:17:28): Yeah. I forget where I got it from, but I like the phrase that now developers are becoming shepherds of intent. So we're not really writing the code anymore, but we are making sure that the intent is clear and that it is implemented. Yas (00:17:46): Yeah. But on the review, I think the burden is almost higher on the review side. On the review. Yeah. Yeah. Because writing stuff is now so like the cost of writing code has become so low that it's just cheap to just write a bunch of features and changes and let's add a whole new section to the UI and just creating that doesn't take up that much effort. But then somebody needs to actually look at that and say, "Actually this is good or this is not good and these are changes you need to make." Yeah, it's objective. Especially as these PRs get too big. Nicole (00:18:26): Yeah. 
And this is one of my problems with OpenSpec, that it generates a lot of documentation. It is good in that there's like a clear process for describing your intent and then the proposal and the design. But yeah, AI can just generate so much stuff and what, are we supposed to somehow read all of that, whether it's code or just text? And then also there's the difference between the process and the outcome and I call this like the final answer problem because often, because I'm a tester, I come from a performance testing background and I'm used to looking at the outcome and saying, "Did this test pass or fail?" But now even if the outcome looks like it might pass and it is what you expected, if the process is wrong, I don't trust that outcome. Did you just magically ... Where did that come from then? (00:19:30): And then the reverse is also true. If the process is right but then the outcome is wrong, then that's like, okay, then something else is wrong here and I'm not sure what. Yas (00:19:43): It will change. Tests are interesting now that I mentioned it because most of those tests are also AI generated and AI generated tests are, it's like on average quite implementation specific and also the AI doesn't shy away from just changing the test to make their branch pass, like to make their changes pass. I don't think we found the goal. It's like, if you go on X or whatever, you'll see a lot of different claims and hype, people like, "Oh, use this approach, use this memory, and everything is going to work perfectly fine. And this is how you write very high quality code with these agents." But I think you still need that human touch somewhere. I still, for any branch that I try to review, I will look at all the changes. I'm not going to just blindly rubber stamp things. I look at the changes, I want to test it out. If you make a change to, whatever, the assistant UI, I want to test it out, try out some use cases. 
(00:20:50): And yeah, I think that's currently probably the biggest bottleneck of everything. Yeah. Nicole (00:21:00): And you're talking about still wanting to be involved, that's a whole human in the loop thing. One approach to testing AI for quality or trusting it is human evaluation. So this can be like sometimes you get a prompt, you get a response from an assistant and it's like thumbs up or thumbs down. That's like the most basic one, right? It is a human that clicks which of those thumbs matches their experience. But then there's also things like LMArena where a user, a real human sends a prompt and then you get two answers back from different agents and you don't know which one it is, that's not identified, and you just say like, "Is A better than B, or are they both bad or are they both good?" And that's useful, but it's also not really scalable. You can't run that in a CI/CD pipeline. So useful but limited. Yas (00:21:58): On the eval side, I think they are all complementary. So you've got, roughly speaking, two big sides of evals. In our case, we talk about offline evals and online evals. I mean, also we keep talking about evals, but what is evals? Evals is essentially you're testing to see if the AI agent is behaving like you want. So a simple eval could be, let's say, I want to make sure that my agent identifies itself correctly. So then your eval is you ask it as a user, who are you? And you expect the LLM to answer, "I am Grafana Assistant." That's a simple one. There's much more complicated ones around querying, dashboarding, or software engineering benchmarks are very different as well. But at its core, it's really about an eval is how do you figure out if this AI agent is any good? And the reason we do that is because when we make changes to our agent, how do we know if it's better? (00:23:01): And I think if you can answer those questions with confidence, then that gives you a path in the longer run to actually improve your agent over time consistently. 
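The identity check Yas describes can be written down as a tiny eval. The following is only a sketch under stated assumptions: `EvalCase`, `run_eval`, `mentions_grafana_assistant` and `fake_agent` are hypothetical names made up for illustration, and the agent is a stand-in function rather than a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One eval: a user prompt plus a grader that inspects the response."""
    name: str
    prompt: str
    grader: Callable[[str], bool]


def mentions_grafana_assistant(response: str) -> bool:
    # Hard-coded check: the agent should identify itself by name.
    return "grafana assistant" in response.lower()


def run_eval(agent: Callable[[str], str], case: EvalCase) -> bool:
    # Send the prompt to the agent, then grade whatever comes back.
    return case.grader(agent(case.prompt))


def fake_agent(prompt: str) -> str:
    # Hypothetical stand-in for the real agent; a real run would call an LLM.
    return "I am Grafana Assistant, here to help with your observability data."


identity_case = EvalCase("identity", "Who are you?", mentions_grafana_assistant)
identity_passed = run_eval(fake_agent, identity_case)
print(identity_case.name, "passed" if identity_passed else "failed")
```

Because the grader is separate from the agent, the same case can be rerun after every prompt or code change to see whether the behavior still holds.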
And I think that's probably also going to be more of a differentiating factor between the agents that will be actually useful in the long run versus the ones that are just like they look nice and they can have really good marketing, but then when it gets down to it, they will have reliability issues and if you just vibe change it, you're not going to end up in a good place. So that's kind of like the evals thing and like we talked about offline and online. So the difference I think is like online is your, and we've recently launched AI observability for agents and online is really about that part. So online is about looking at how are real people interacting with your agent and then the online aspect is that it's online. (00:24:03): We're measuring how our agent that is deployed in production is doing against the conversations that it's having in production. So with these real users. And that's quite valuable because I mean, A, you get a score based on what real behavior looks like, but these are a bit hard to verify. Typically we use LLMs as a judge to judge like, okay, the user asked for this, this is the whole transcript, figure out if that was like a good answer or not. But again, it is still, you're relying on an LLM. So it is not as robust. It's not quite as verifiable as something you would do like offline. And I think the offline counterpart of that is that offline is used more during development. It's during development, it's kind of like in your CI/CD, it's much more controlled and whereas online is just you're scoring every conversation or like a sample of the conversations you're having in the real world. (00:25:02): Offline is about you have this golden set of evals that try to cover all of the behavior that would happen when you were to deploy this. So you still need to have an idea of like, what can my agent do and how will users interact with it? But that's kind of like what you're testing here. 
And the thing is like you're always testing the same set. So if you make changes, you can actually see, is it getting better at this set? But then also dig deeper and see is it getting better at PromQL versus LogQL? Is it getting better or worse at dashboarding? So I think that's the big difference. Online, really useful to have almost like a pulse check of is my agent behaving. It could also be that sometimes like Anthropic swaps their models and like you see quality go down, that's something you can't really capture with offline. (00:25:51): So it's really useful for that, but online only tests the variant that is deployed. Even if you do A/B testing, it would only test those two variants. Whereas offline you can test all of your changes, your implementation, your entire code and all your prompts. Nicole (00:26:05): Yeah. I think it took me some time to translate these terms because the testing industry already has this and we just call it pre-prod and production testing. And so in pre-prod it's good because it's all the way from unit testing to system testing, system integration, which means you're testing in staging and dev and pre-prod environments. And it's good because you can really hammer the thing. You can change your scope, you can exercise absolutely everything and it's not going to go down or if it goes down, it doesn't matter. You have to be a bit more careful when it's in a production environment, but we still have things like synthetic monitoring or RUM, real user monitoring, and that's the same idea just with an AI flavor. Yas (00:26:54): Definitely. On the online side, it's definitely much more almost like traditional observability in a sense. You're trying to observe a system, it just happens to be powered by AI. There is this natural language and text component to it, but at the end of the day, you still try to turn that into a number you can track over time. And then it's basically number goes up, number goes down and then alert on it. 
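The "turn it into a number and alert on it" idea for online evals could look roughly like the sketch below. It assumes each production conversation has already been scored between 0 and 1 (for example by an LLM judge); `RollingScore`, the window size and the threshold are all hypothetical choices for illustration, not how Grafana's tooling works.

```python
from collections import deque


class RollingScore:
    """Tracks the mean of the most recent conversation scores."""

    def __init__(self, window: int, threshold: float) -> None:
        self.scores = deque(maxlen=window)  # old scores fall off automatically
        self.threshold = threshold

    def add(self, score: float) -> None:
        self.scores.append(score)

    def mean(self) -> float:
        return sum(self.scores) / len(self.scores)

    def should_alert(self) -> bool:
        # Only alert once the window is full, so one bad early score is not noisy.
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and self.mean() < self.threshold


tracker = RollingScore(window=5, threshold=0.7)
for judged in [1.0, 1.0, 1.0, 0.0, 1.0]:
    tracker.add(judged)
alert_before = tracker.should_alert()  # mean is 0.8, above the threshold

tracker.add(0.0)  # window is now [1, 1, 0, 1, 0], mean 0.6
alert_after = tracker.should_alert()
print(alert_before, alert_after)  # False True
```

In practice the score would more likely be exported as a metric and the alert rule defined on the metric, but the shape of the check is the same.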
The offline aspect is, I would say the closest analogy is probably an integration test because you're trying to test the whole system. So it's like you give it a task, like a user prompt, you say, who are you? Or I'm trying to figure out what my CPU usage is for service X. Then it runs the whole thing, like the whole agent against a real environment and then you get something out and then separately from that, you need to then grade that outcome essentially. (00:27:49): So it's a bit closer to an integration test, I would say. Nicole (00:27:52): Okay. Yeah. So we did an AI observability episode actually and this was unfortunately without you because you were at GrafanaCON and we wanted to time it for that time. Well, not just so that you were away, but also to talk about it from our side. So I'll also leave a link to that if you want to go more into details because we actually demo it and stuff and I showed how I used it as well, but maybe we could talk a bit more about offline evals because that's not something we've talked about at all. Yas (00:28:30): Yes. So offline, it's a bit of a beast. I would recommend, there's a good post by Anthropic called "Demystifying evals". Let's see it. It really goes into detail. I can share my screen up. I mean, this is a post, right? It's quite a lot. It talks about the structure of an eval, but just to give you, I might make this a bit bigger. When we talk about evals, this is not really like the split between offline or online in this case. This could also be online except you don't really have your environment stuff. You just have an agent that runs and your grading is just grading that conversation. But previously, so all the way back when we had simple single turn LLM calls, I'm thinking, I think one of the first ones that we launched was incident summary. So it would take all of your prompt would essentially be, "Here is all of the context of the incident. (00:29:38): Can you summarize this?" 
That would go through an LLM and you would get a summary out and then our grading could be things like make sure it's readable, make sure it captures all the details, make sure the length is appropriate, like it's not too long, it's not too short. And that's how we would evaluate a single turn one. The good thing here is like you just have a prompt, like you just have text. You have an LLM, which is an API call. Again, this grading could be a mix of hard-coded checks and it could be LLM as a judge, but it's just a single turn. So this is relatively easy to both test and set up. Whereas with agents, they get a bit more complicated. And the biggest complication is that you don't know ... Every time you run a task against an agent, it can perform the task in very different ways. (00:30:37): It can take very different trajectories. Sometimes it will finish the task in five steps, sometimes it will finish it in seven steps. But because the agent is in this loop where it keeps executing until it's done, you have to ... I mean, A, it takes longer to test. So you write a task, like in this case, this is I think a coding one. Your agent has a certain set of tools and I think the tools and the agent, this is roughly speaking the harness. Now we also have things like skills as well, but you can kind of put that in the tools bucket for now. And then the agent runs against some type of environment. So the environment in our case would be it runs against a real Grafana with data sources, with a certain state, alerts set up, dashboarding and all that. And then essentially the agent will just keep interacting with this environment using the tools to complete this task. (00:31:29): And then once it's done, it will say, "I've done this." Or if you ask a data question like, how much CPU is my service using? It will use the Prometheus tools and answer with, this is how much it's using. 
Then we need to take the output of all of this, all the steps it took, the whole transcript, and then grade that to verify given this task, did it actually get it right? That's the core of it. The other thing that's quite nice in this one is just kind of like an overview of all the different components, but we're going to talk about graders or sometimes also called verifiers. So you've got a suite of evaluations that are like a bunch of different tasks that run against this harness that run against the environment and then you have an evaluation that comes out. Nicole (00:32:23): Okay. So how do we do it for assistant? You talked a little bit about the environment being everything to do with Grafana, which is like a super, it's a very broad environment. There's a lot of scope there. How do we do offline evals for Yas (00:32:41): Assistant? (00:32:42): Yeah. So we have both, I think we have two things. We have the public side of benchmarks that we released recently and then we've got our internal evals. They generally follow the same approach. Just the number of tasks are different and some of like for internal evals, we might be using different data sources. Roughly speaking, what we want to do, and this is still like a big challenge is like we want to have an environment that is real, as real as possible, but the tension is like we want to have something that's very real, but we also want something that we can control. If we want to run a benchmark, we want to be able to make sure that this benchmark is reproducible. If we can't reproduce it, it kind of means you can't really fully trust the results because if I run it today and then somebody else from the team runs it tomorrow and they get a completely different score, then it's like, which one was right? (00:33:43): And so this definitely has been one of the big challenges. And maybe if I go back to how we all started. 
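The pieces described above (task, tools, environment, transcript, grader) can be sketched in a few lines. Everything here is a hypothetical illustration: `run_agent`, `scripted_policy`, `fake_prometheus`, `grade_transcript` and the PromQL string are invented for the example, the "agent" is a scripted function rather than an LLM, and the environment is a canned value instead of a real Grafana.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str]  # (tool name, tool argument, observation)


@dataclass
class Transcript:
    task: str
    steps: List[Step] = field(default_factory=list)
    final_answer: str = ""


def run_agent(task: str,
              policy: Callable[[str, List[Step]], Tuple[str, str, str]],
              tools: Dict[str, Callable[[str], str]],
              max_steps: int = 10) -> Transcript:
    """Loop until the policy answers or the step budget runs out."""
    transcript = Transcript(task)
    for _ in range(max_steps):
        kind, first, second = policy(task, transcript.steps)
        if kind == "answer":
            transcript.final_answer = first
            return transcript
        observation = tools[first](second)  # call the tool against the environment
        transcript.steps.append((first, second, observation))
    transcript.final_answer = "gave up: step limit reached"
    return transcript


def fake_prometheus(query: str) -> str:
    # Hypothetical canned environment; a real run would query a data source.
    return "0.42"


def scripted_policy(task: str, steps: List[Step]) -> Tuple[str, str, str]:
    # Stand-in for the LLM: query once, then answer from the observation.
    if not steps:
        return ("call", "query_prometheus", 'rate(cpu_seconds_total{service="x"}[5m])')
    observation = steps[-1][2]
    return ("answer", f"Service X is using {float(observation):.0%} CPU.", "")


def grade_transcript(t: Transcript, expected: str, required_tool: str) -> bool:
    # Grade both the outcome and the process: right answer, via the right tool.
    used_tool = any(tool == required_tool for tool, _, _ in t.steps)
    return used_tool and expected in t.final_answer


transcript = run_agent("How much CPU is service X using?",
                       scripted_policy, {"query_prometheus": fake_prometheus})
passed = grade_transcript(transcript, expected="42%", required_tool="query_prometheus")
print(transcript.final_answer, passed)
```

Grading the steps as well as the final answer is one way to address the "final answer problem" mentioned earlier in the conversation, where a correct-looking outcome reached by the wrong process is still not trustworthy.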
So as with everybody, like when we first started developing, we just started with some manual tests. So every time somebody would make a change, me, somebody else from the team would go in and actually just chat with the assistant to see how it would do on like ... There were a few almost like I remember listing the data sources, making sure that the tool actually works was like a very simple one just to see that everything is wired up correctly. Some CPU questions like showing your logs and all that. So you would just do these types of tests manually and that, to be fair, it's not necessarily a bad thing. I think you have to do something. Maybe even without any type of setup, the easiest way to get started is just to encode this, like put this in a doc, put this in a script, like, here's our manual testing script. (00:34:42): And it's like, bring up the app, this is the first prompt scenario, this is the second one, this is the third one, this is the fourth one. And then essentially you judge, you are the judge, the human judge that judges whether or not the assistant did the right thing. Obviously you can kind of do this at a very small scale with like four or five scenarios. You can't do this as a human with like 500 scenarios. So from there we started looking into, okay, how can we actually run the whole thing and see what we can do? So I think the initial big challenge was really around the machinery and agent, and especially in our context, the assistant especially is like, it's a front end assistant, right? Some of the tools run in your browser. That has some benefits security-wise, auth-wise, it runs as you, but it makes testing a bit hard as well. (00:35:38): So for example, for the front-end assistant, we use Playwright because we want to make sure that we are replicating the real behavior and the implementation fully as it is on the benchmarking side as well. 
So the benchmark just acts as like me in the browser and then it will run and go to the page and open up the sidebar and put in the doc and then we do some magic to get all the telemetry out. But yeah, so it kind of does that. And then so the machinery was about how do we test a front end plugin, that was like a specific challenge. The other challenge was environment. What we do is we kind of have two approaches to it. So we're testing against like cloud instances and this is mostly because, for example, for incidents, it's just a lot of data. (00:36:33): The machinery to get that snapshot of that data out and reload it in. Now we have some things in place for that as well, but at least to get started- Nicole (00:36:46): Yeah. Sorry to interrupt. Before you go on to the environment, I was just curious really, did you consider using k6 instead of Playwright? Yas (00:36:58): Interesting. Yes and no, I would say, because k6 is also like a tool that the assistant needs to use, but our goal is not to load test. I think it's really to hook into the plugin. And the thing was we still needed to do some special things to get the messages, the telemetry, like things like the steps out of the frontend plugin. So that's why we have kind of like a Playwright harness that can run the front end. Nicole (00:37:36): There's actually a browser module for k6 that is going towards feature parity with Playwright. Yas (00:37:47): I did not know that. So it might be interesting. Yeah, yeah. Nicole (00:37:51): Yeah. Just like dogfood all the things. Yas (00:37:55): Yeah, I haven't considered that, no. Nicole (00:37:57): Yeah, I'd love to work with you on that. Okay, sorry, go on with the environment. Yas (00:38:03): No, I was just saying environment. Yeah, so we have cloud environments that run against cloud instances and then we can run this internal eval tool, which I haven't mentioned, but we call it LLM Spec. 
It's kind of like a CLI tool, but we can also run it against what we call our local controlled environment. And that one is quite similar to our open benchmark that we published and maybe- Nicole (00:38:32): Could we see that actually? Yas (00:38:33): I can show you that one as well. Yeah. Nicole (00:38:35): So o11y-bench is the benchmark that Yas was talking about, and it was just announced at the last GrafanaCON too. It was pretty fast. Yas (00:38:48): It was pretty fast. So o11y-bench is our set. I'll start with the tasks. We've got a bunch of tasks around these six different categories. We've got dashboarding tasks, investigation, some log tasks, metrics, and traces. And with o11y-bench, we kind of, A, wanted to bring the approaches we're having internally outwards, and it was also a learning opportunity because, for example, we use this framework called Harbor, which is also used by the makers of Terminal-Bench, and it was quite useful to see, okay, how do they approach benchmarking from their end? So that's what we used for o11y-bench as well. And the nice thing about that is it separates the harness from the environment from the tasks, which means you can run o11y-bench against different harnesses. The default harness that we have is a very simple small system prompt and the Grafana MCP. (00:39:49): But I know the folks that are working on GCX, the CLI for Grafana, are using Claude Code plus GCX and then testing it against o11y-bench to see how GCX compares with MCP. So that's quite a nice feature to have with this and it's also useful to, for example, test how our MCP server is doing and how we can improve our Grafana MCP as well. We've got this leaderboard. It shows you a bunch of different models. We also have some interesting open models like the Qwen 3.6+ and the Kimi ones as well. So you get an overview of how all these models are doing against each other. 
Nicole (00:40:32): Yeah, they're getting good. Qwen 3.6 is awesome. Still not in the top five, but still for an open weight model, it's pretty good. Yas (00:40:44): And the interesting thing is like, especially this, right? So we talked about like reliability. So if you sort on pass@3, it looks a bit different than if you sort on pass^3. And the difference, I would kind of put it here as well. So the pass@3, it says every model, it runs the same task three times. So it will be, I think if we take one of the tasks, like let's say the metrics one, it will be like check the service health and it will run that three times. And if it gets it right one out of three times, it gets a pass on that task. So that task will be a pass here. Whereas pass^3 means to get one point you need to score, like you need to get a three out of three on every task. So we run the same task three times and then we check, did it pass every time of those three times? (00:41:38): Like you get three shots, did you score all three shots or did you just happen to be lucky and you shoot three times and you only score once? And that's kind of like the difference between, okay, the model has somewhere the rough capabilities to solve this task versus the model is actually reliably solving this task every time that you ask it to solve this task. And that distinction is quite interesting because like you see like 10, 20% differences between these two scores. (00:42:09): Yeah, just want to mention that. Nicole (00:42:12): Yeah, that's awesome. And this is public, right? Yas (00:42:17): Yep. I can go to the Git repo here as well. So we have a repo on Git. We've got all of the tasks. You can see how they're defined in here as well. So you can see like all the different things that we're using and this I think is quite interesting. So one thing that we're doing, I talked about like LLM as a judge. Now we are running o11y-bench against this synthetic environment. We are generating the data. 
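[Editor's sketch, not part of the recording: the pass@3 versus pass^3 distinction Yas describes above can be written down in a few lines. This is a minimal illustration that follows his spoken definition (three runs per task, "any run passed" versus "every run passed"). The task names and results are made up for the example and are not o11y-bench data, and the function name score is hypothetical rather than anything from the benchmark's code.]

```python
from typing import Dict, List


def score(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Compute both leaderboard-style scores from per-task run outcomes.

    results maps a task name to the pass/fail outcome of each repeated run.
    pass@k: a task counts if at least one run passed (rough capability).
    pass^k: a task counts only if every run passed (reliability).
    """
    n_tasks = len(results)
    solved_once = sum(1 for runs in results.values() if any(runs))
    solved_always = sum(1 for runs in results.values() if all(runs))
    return {
        "pass@k": solved_once / n_tasks,
        "pass^k": solved_always / n_tasks,
    }


# Hypothetical outcomes: four tasks, three runs each.
example = {
    "check-service-health": [True, False, True],   # lucky sometimes
    "p95-latency": [True, True, True],             # reliable
    "find-error-logs": [False, False, False],      # never solved
    "build-dashboard": [False, True, False],       # lucky once
}

print(score(example))  # {'pass@k': 0.75, 'pass^k': 0.25}
```

[In this made-up example the same model scores 75% on pass@3 but only 25% on pass^3, which is the kind of gap between "can solve it" and "solves it every time" that the two leaderboard columns are meant to expose.]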
So we have a Prometheus, a Loki, a Tempo inside of this Docker setup that we also like generate the data so that every time you run the benchmark, it's the same data. But it also means like before we even run the benchmark, we know what the answer should be because we control how the data is generated. And that means we can set up certain criteria to pass, which is like, it has to cite this particular thing and we know it has to be that particular thing because we know what the data looks like. (00:43:20): The other thing that we're doing is we're doing this thing called fact-based rubrics. So in this case, we are not just like giving the LLM judge a piece of text, but separately from what the agent does, we, the grading side, also runs like certain checks against the environment. So it can run queries, get the result and say, does the result of this known good query, this query that we know is the right answer, does that match with what the agent said? And this is quite important because sometimes these agents, they fail in very nuanced subtle ways. So if you ask it things like the CPU question, the simple one or like a latency question, like, what's the P95 latency for my API? And let's say it responds with three seconds or 3.1. How do you know if it's actually 3.1 or if it just made it up? (00:44:17): How do you know if it's not 2.5 or 5.5 or something else? So these facts are trying to resolve that by running a query that we know has the answer against this environment and then passing that through to our judge and that way we can capture this like these small nuances. And this is kind of like where benchmark gets tricky because most, especially most of the good LLMs now, they all look plausible enough. If you look at it from the (00:44:49): Outside, most of the LLM output, like the top tier models especially, they all look plausible, but plausible does not mean- That's the dangerous one. 
And that's the dangerous- Nicole (00:45:00): That's what I was saying with process versus outcome, right? Because if the result of that question was like negative two seconds, then it's like, okay, it's clearly wrong, or 500 seconds, it's like probably not. But if it gives an answer that is plausible, that's even more dangerous. Yas (00:45:20): Yeah. And that happens a lot. I remember there was this one change that we made to, I think, the Tempo tool and it actually broke the spans. So the agent was not getting any of the spans properly, but it was still answering. And interestingly, its answer was not completely wrong because I think from the metadata it could kind of come up with, "Oh, the span should look like this." But that was quite interesting where even with the tool results being empty, it was still trying to answer that in a realistic enough fashion, and if you didn't test for it, you wouldn't even capture this at all. Just from the answer, just from the text, it looked realistic enough. Nicole (00:46:08): I know with an early version of Assistant I said like, "Oh, I have this data source, can you create this kind of a dashboard?" And it was like, "Okay, okay." And it created a dashboard. I was like, "Huh, this doesn't line up. I was just using this. There should be usage here and it doesn't seem right." And then upon interrogation, it was like, "Oh yeah, I didn't find that data source, so I just used dummy data." I'm like, "What?" Yas (00:46:38): Yeah. Interestingly, there was one, this was when I was working on it locally, I broke all the tools. So none of the tools were going into the agent and still, because we had some rough instructions in the system prompt about which tools to use, it would come up with some tool calls that it couldn't actually execute. Nicole (00:47:00): Okay. 
So this o11y-bench one, because you said that there's a public one and there's an internal one, what is missing in o11y-bench that you still need an internal one? Yas (00:47:13): That's a good one. I think the internal one is, I think it's on two levels. I think one is that there are some scenarios that are kind of based on some real conversations that we had, for example, with the assistant or that are based on some data we have from our own internal testing, like from our own internal environments, let's say. So obviously anything that involves real data, we can't put that in the public. So we can't have scenarios that are based on public data, or sorry, private data. So anything we do with o11y-bench is going to be limited to this synthetic, generated-data type of setup and we can still do a lot with that and we still use that internally as well. The way that I see it is like we've got o11y-bench, our public set, but then our internal evals, the private set, is just a larger collection, and the public set is still part of this as well. (00:48:16): So internally we want to test against o11y-bench as well, not just our internal evals. So I think that's one of the differences and the other, what was the other one? The other is probably because we have some very specific things we want to test about the assistant itself. I'll give the example of Slack. So in Slack we have certain things like we want the assistant to behave slightly differently in Slack. There's also some formatting things that are different in Slack compared to the web, like Slack has their own version of Markdown that looks like Markdown, but it isn't quite Markdown. And the initial versions of the Slack assistant had an issue consistently writing Slack Markdown. So we had to write some tests around that. So some of these tests are very specific to Grafana Assistant. 
So those would be in the private set, but things that would be in the public set would be generally observability tasks in the most generic sense. (00:49:20): It's what are people trying to do with observability: ask questions about their data, do dashboarding, some basic Grafana API type of checks that we need to make sure that the assistants, the agents can call. And the interesting thing is that I talked a bit about that harness. So, as I said, you can use GCX, you can test MCP, we can test our assistant, we can test any other agent actually against o11y-bench and that kind of focuses our grading towards the outcome. We don't really care how the agents solved it. Every one of these agents and harnesses have different tools. They have different ways of operating, like Claude Code is going to work differently than like OpenCode and the MCP server, for example. And we want to make sure that the benchmark is as independent as possible from the how and the harness and it's really grading things based on, okay, what is the real answer? Nicole (00:50:23): Yeah. So I was also, sorry, I was playing around with the layout, distracted. Yeah. I also wanted to ask what is ... We talked about your blog posts already. What is the evaluation loop? Maybe we should actually define that. Yas (00:50:45): Yes. So I think it starts with ... So at its core, and we talked about this in the GrafanaCON talk as well, I think it's up on YouTube now. So at its core, it's really about we measure, right? We measure the current state and we look at how good is it now and where is it working, where is it not working. Then based on that, we try to make some changes. We learn from all of these like hundreds, thousands of transcripts and figure out, okay, what are some changes we want to make? Maybe we want to add a new tool, change some tool instructions, change the system prompt, maybe fix some bugs with the implementation that are causing issues. 
We make the changes and then we measure again and then we make changes and then we measure again. We make changes, we measure again. (00:51:33): The thing is that this loop can be done by a coding agent. It's not us manually doing this. So the whole setup is that if your evals are defined well enough and robust enough, so you've got like really good ways to verify if the answer was correct, you can pass this signal, like this clean feedback signal, back to a coding agent. It can then make changes and then basically it can keep improving until it hits 100% on the evals and then you get into, okay, now I need to like make harder tasks or increase the coverage. But that's the core idea of the eval loop, that at scale it's nearly impossible for one human to actually understand the whole system and make targeted improvements, because an agent is not programmed as code in the regular sense. We program it with natural language and there is no clear relationship between the language and instructions you put in versus what the AI, the LLM, is going to do. (00:52:43): And the only way to really test that and probe that is you need to run a bunch of these loops to figure out, I made a change, now I need to know what the impact of that change was. With code, you can write a test and actually do that. With LLMs, you can't. Nicole (00:53:00): And is this evaluation loop equally implemented and effective for both o11y-bench and LLM Spec? Yas (00:53:11): So it is implemented internally. o11y-bench is really about just the first part I would say. o11y-bench's goal is not to be an eval loop for agents, although we indirectly have used it for GCX and MCP in that manner, but more manually, I would say. Nicole (00:53:28): Manually, yeah. Yas (00:53:30): But for example, in LLM Spec, we have this skill called LLM Spec Improve and when you call it, it will ask, okay, which agent do you want to improve? 
You select your agent and then it will just go and run, measure, improve, measure, improve, and so on. Nicole (00:53:46): Okay. So maybe let's zoom out a little bit and think what have you been working on now on this, whether it's o11y-bench or LLM Spec and then also where are you going? Just directions, no promises. Yas (00:54:03): So I've had a lot of work done on especially the harnesses. I talked about this Playwright stuff that's been quite painful. There's also been an interesting, I think historically, if I would say there was an LLM Spec 1.0, then there was a 2.0 revamp that was about how do you make it run faster and more robustly. Then we did o11y-bench. And then after o11y-bench, there was quite a lot of learnings from o11y-bench that we put into what I would call LLM Spec 3.0, which is the thing that I worked on after o11y-bench and Grafana. Now I think the two big challenges, I think one is I talked about environment fidelity that is, especially with all the different plugins that we have. So this is probably more of an issue for internal LLM Spec, like internal evals, more so than o11y-bench. That's one. The second big thing is just the number of tasks and how can we improve them both in quantity and quality? (00:55:09): How can we cover more of Grafana? And in same sense that o11y-bench, we weren't quite sure when we did o11y-bench, is this going to be a thing or is this a one-off thing that we're going to do or not? But I think even internally it's got a lot of interest that I would say we're definitely doing a version of that. We're going to probably work on an o11y-bench 2.0 or that type of stuff. So both for using that internally because o11y-bench has proven quite useful for the development of GCX and MCP, the Grafana MCP. So just having more harder tasks is the big one going next. And as part of that, one of the things I've been working on is like, how can we get more people to contribute? 
So like yesterday I was talking at the PM all-hands with the product managers on like, how can we get product managers to write some of these evals? (00:56:08): Because evals, these tasks, they also capture a bit like the desired behavior. So for product managers, it's like when somebody asks about my product, how should these agents respond ideally? Nicole (00:56:24): Yeah. It's almost like what I would love to see is some sort of kind of sub extension or I don't know what you want to call it. It's not really a plugin because when Grafana acquired k6, I was actually on k6. So that's why I'm like, k6, how about k6? How did you try it? But k6 also has extensions. So there are Grafana plugins, but k6 itself also has extensions and modules and stuff. I would love to see that same sort of model applied to evaluators because we've gotten a lot of really cool community extensions and I think evaluators would be the same. Just like I did a talk in Madrid yesterday and I showed off a bit of AI observability stuff and then the evaluator section and the questions were all like, "Can we add this? Can we add that? Why is this not in the template?" It's like, "It could be." (00:57:25): That would be really awesome. Yas (00:57:28): Yeah. And closing, it would be nice if we could do that across both o11y-bench and the internal ones. There's still some open questions on how can we make those two work more nicely together. There is a difference because internally we're working in Golang, whereas o11y-bench, it's based on Harbor, it's Python, but the tasks themselves are kind of like language independent and so are most of the verifiers at least in design. There was probably a fourth thing and this is mostly more on the tasks, like how do we get to this better golden set over time as well? 
And that is really about how, well, we have kind of done it as one-offs, but we haven't quite automated the part where we can go from AI observability, like real conversations, and turn those into verifiable tasks in our internal evals, because as the product moves on or as people's usage of the product moves on, the tasks that you have right now might not be relevant anymore or they might not cover the whole spectrum. (00:58:42): Imagine you launch a completely new product, or if k6 didn't exist and then now k6 is in Grafana and people are asking Assistant questions about k6 or interacting with k6 through Assistant, that's going to look very different. And if you don't cover that in your offline evals yet, essentially bridging that gap between online and offline evals and doing that in a more automated way, I think that would be like the fourth big thing. Nicole (00:59:09): Yeah. I also have a wish, which isn't really just for o11y-bench but for what it could be used for. Speaking totally with no internal knowledge, I'm not promising anything or whatever, but when I saw it, I was like, "Oh, this could be a step towards making Assistant BYOM, bring your own model." Because one of the barriers to that is people will try Assistant with their whatever crappy, small, tiny local LLM and think, "Oh, Assistant doesn't work or something." And that's not true. It's just you need to use certain models for it. And having a standardized way to measure those and just presenting it to the user and being like, "Hey, you can do this, but here's where your model falls short and here's where whatever Opus model is the latest is at on that leaderboard." Then they can just make an informed decision and we can say, "You can still do this, but this is what you're signing up for." Yas (01:00:14): I would go even beyond the Assistant. We've got GCX, which is quite nice if you use the GCX CLI. We've got MCP as well that people use. 
And if you want to use your own models with GCX with MCP, we want to make sure that this benchmark reflects, okay, if you're interacting with Grafana and on working on observability tasks, these are the best models that there are and these are the best open models that there are. Nicole (01:00:48): Good Yas (01:00:49): Point. It gets to your point of people can make a more informed decision. Nicole (01:00:55): Yeah. Well, what about not just for o11y-bench and LLM Spec, but just evaluation in general, where do you think we're still missing we as an industry, not just as a company? Yas (01:01:11): I mean, interestingly, was it yesterday or day before there was this new benchmark called DeepSWE, Deep Software Engineering that was released. And it is a bit of controversy because it kind of flips the ranking quite a lot from the most popular software engineering benchmark so far where Opus was always leading and in this benchmark it's GPT-5.5 that is leading by quite a margin. And even for example, I think 3.5, the new Gemini 3.5 Flash is beating Opus 4.6 or something. For what Nicole (01:01:46): Tasks? Yas (01:01:48): For coding tasks, for engineering tasks. So I think there's still something to be said about the reliability of these benchmarks themselves, like how obviously models are or the labs are using some of these benchmarks to both test their models against, sometimes even tune their models against those benchmarks. Nicole (01:02:15): Yeah. 
Yas (01:02:16): So I think this, to be fair, also might be a good thing for o11y-bench because I would want a lab to tune against o11y-bench, for example, so that essentially we get free fine tuning out of the box, because now that o11y-bench is in the open, if we keep working on improving o11y-bench and labs use o11y-bench to tune their models on observability tasks, then in the future these newer models like 3.7 or 3.8 or even the big labs, they're going to get better at using Grafana and solving observability tasks. So the end goal I think is that overall. Nicole (01:02:55): Okay. Thank you. We just hit our time here, but it was really informative and I'm so glad that we both found time to do this. It's been rescheduled and stuff for various reasons. So thank you so much. That was a really great overview of all the evals that we're doing. Yas (01:03:18): No, thanks for this as well. It's nice to sometimes just step out of our day to day and talk about stuff. Nicole (01:03:26): Yeah. And if anyone is watching this, if you have any questions about evals, you can leave it in the comments below and I'm going to bug Yas to answer them. Thanks everyone for watching. Yas (01:03:41): Bye.
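[Editor's sketch, not part of the recording: the fact-based grading Yas describes around 00:43:20 compares what the agent claimed with a value obtained independently from a known-good query against the controlled environment. The snippet below is only an illustration of that idea under stated assumptions. The expected value is hard-coded instead of coming from a real query, the comparison is a plain numeric tolerance instead of the LLM judge used in the benchmark, and the names extract_seconds and fact_matches are hypothetical.]

```python
import math
import re
from typing import Optional

# Matches a number followed by a time unit, e.g. "3.1 seconds" or "3100 ms".
_DURATION = re.compile(r"(\d+(?:\.\d+)?)\s*(milliseconds|ms|seconds|second|s)\b")


def extract_seconds(answer: str) -> Optional[float]:
    """Pull the first duration out of free-text agent output, in seconds."""
    match = _DURATION.search(answer)
    if match is None:
        return None
    value = float(match.group(1))
    unit = match.group(2)
    return value / 1000.0 if unit in ("ms", "milliseconds") else value


def fact_matches(answer: str, expected_seconds: float, rel_tol: float = 0.05) -> bool:
    """True if the agent's stated duration agrees with the known-good value."""
    claimed = extract_seconds(answer)
    return claimed is not None and math.isclose(
        claimed, expected_seconds, rel_tol=rel_tol
    )


# Assumption: in a real setup this would be the result of the known-good
# query run by the grading side, not a constant.
EXPECTED_P95_SECONDS = 3.1

print(fact_matches("The P95 latency for the API is about 3.1 seconds.", EXPECTED_P95_SECONDS))
print(fact_matches("P95 latency is roughly 2.5 seconds.", EXPECTED_P95_SECONDS))
print(fact_matches("Latency looks healthy overall.", EXPECTED_P95_SECONDS))
```

[The first answer agrees with the known value, the second is plausible but wrong, and the third never states a number at all. Only a check grounded in the environment's real data separates the first two, which is the point Yas makes about plausible-looking output.]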