# [[Live - Grafana Sigil and AI O11y (Context Window 03)]]
<iframe width="560" height="315" src="https://www.youtube.com/embed/vzygCuuf7JQ" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
Related:: "[[Project Sigil]], [[Alexander Sniffin]], [[Jack Gordley]], [[Tiffany Jernigan]]"
[Google doc](https://docs.google.com/document/d/1WiTQA1bTqSxxwwE_HfWuAbbzvV1OoeIpgs0NqS0Dc2Y/edit?tab=t.40wrqyqutie1)
## Topics
- You can use AI O11y to instrument your own AI app and see canonical data about your AI
- Jack has instrumented his local AI coding tools with AI Observability
- He actually uses [[OpenCode]], an open-source coding agent
- What's an evaluation?
- online (what AI O11y runs with): runs on live traffic
- LLM as judge
- regex
- offline: running on dataset conversations
- We actually use this ourselves
- System prompt analysis (under Agents)
## Timestamps
00:00:00 Introductions
00:02:00 The last month in AI news
00:11:09 What is AI Observability in Grafana Cloud?
00:19:20 What is an evaluation?
00:21:05 The origin of AI Observability
00:25:09 Demo: Setting up AI Observability and evaluators
00:32:04 What is LLM as judge?
00:38:43 Demo: AI Observability Analytics
00:40:52 AI O11y is based on OpenTelemetry
00:42:17 Demo: Instrumenting a local coding agent
00:47:18 Potential future agentic use cases
00:52:00 Evaluators that catch the most bugs
01:02:23 Demo: System prompt analysis
01:05:11 Guess the prompt
## References
NEWS
Opus 4.7 is out: https://www.anthropic.com/news/claude-opus-4-7
Gemma 4: https://deepmind.google/models/gemma/gemma-4/
Qwen 3.6: https://qwen.ai/blog?id=qwen3.6
o11y-bench announcement: https://grafana.com/blog/o11y-bench-open-benchmark-for-observability-agents/
Opt out of GitHub Copilot training: https://github.blog/news-insights/company-news/updates-to-github-copilot-interaction-data-usage-policy/
All the GrafanaCon announcements: https://gra.fan/gcon26
OpenCode: https://opencode.ai
(docs) Online evals on Grafana Cloud: https://grafana.com/docs/grafana-cloud/machine-learning/ai-observability/introduction/#online-evaluation
(docs) OpenTelemetry integration with AI Observability: https://grafana.com/docs/grafana-cloud/machine-learning/ai-observability/introduction/#opentelemetry-integration
The Sigil SDK: https://github.com/grafana/sigil-sdk
(blog) Anthropic: Emotion concepts and their function in a large language model: https://www.anthropic.com/research/emotion-concepts-function
OpenClaw: https://openclaw.ai
(docs) More about AI Observability: https://grafana.com/docs/grafana-cloud/machine-learning/ai-observability/
## Transcript
Speaker 1 (00:00:01):
Hi everyone. Welcome to the third episode of The Context Window, but this is actually my first time in it. And I've brought along a couple of people who know more than me. For example, this person over here did the last Context Window. Tiffany, why don't you introduce yourself?
Speaker 2 (00:00:21):
Hi. Yeah, I'm Tiffany Jernigan. I'm a developer advocate here at Grafana focused on Tempo, so distributed tracing, Pyroscope for profiling, and then AI things.
Speaker 1 (00:00:34):
Oh yeah. Well, now I feel like I totally didn't even introduce myself. Hi, I'm Nicole van der Hoeven. I'm also a developer advocate, and I'm focused on AI stuff and Assistant. And we brought in some experts on our topic for today. Experts, would you introduce yourselves?
Speaker 3 (00:00:54):
Hi, everyone. I'm Alexander Sniffin. I also focus on AI stuff. I work on Assistant, Investigations, AI Observability, so all those fun things.
Speaker 4 (00:01:09):
And same here. I'm lucky enough to work alongside Alexander on some of these AI products that we're working on. So Grafana Assistant, AI observability, and everything in between.
Speaker 1 (00:01:19):
And how long have you been at Grafana, Jack?
Speaker 4 (00:01:23):
I've only been here for a month and a half now. So it's been really fun ramping up. And yeah, things move really fast around here, so it's been very exciting.
Speaker 1 (00:01:31):
Yeah. Yeah. We totally just sniped you. Welcome to Grafana. Here's the full Grafana live streaming experience. We'll try not to heckle you too much.
Speaker 2 (00:01:42):
Keyword try.
Speaker 1 (00:01:43):
Yeah. But I think that AI just moves really quickly in general. I feel like every time I want to make a video, I'm like, wait, is this still going to be valid in two weeks? Because maybe not. So I thought we'd go through the last month and some news that dropped during it. Just this month, Opus 4.7 was released. It was really cool. I was actually at Cafe Cursor in Lisbon for a Cursor power-user meetup, and it spread like wildfire. We all got the pop-up in Cursor that 4.7 had released, and that was really awesome. Really powerful, really expensive so far. I'm watching the money that I'm spending, or that Grafana is spending, rack up a lot faster than I expected. Are you all seeing the same thing?
Speaker 4 (00:02:45):
Yeah. 100%. We're getting to a point now where some of the costs are starting to become pretty significant. So if you are coding and let an agent run for a couple hours, you can come back and it could be a pretty hefty price tag. So definitely starting to be more conscious of that. Whereas a year ago, I was just spending and running agents as much as I wanted.
Speaker 1 (00:03:09):
Yeah. I've honestly just been running 4.6 on Max. I haven't really cared, but now I kind of feel like I should.
Speaker 2 (00:03:17):
I am bad and I haven't fully been paying attention specifically. I've been more like, "Is this answering my questions? Is it solving things for me? Can it explain this thing that I'm not 100% sure about?" Or can it confirm, "Hey, this is actually what I think things are, and let me go fix something." But yeah, now that you mention pricing, I guess I should ...
Speaker 1 (00:03:37):
Yeah. They apparently changed the tokenizer in 4.7, so things really are, they said, from 100% to 135% as much as it was before on 4.6. They also had a bit of a snafu earlier this month when the Claude Code source code was leaked via an NPM source map. And cue all these people first downloading it to try to figure out how Claude works and find out all of the secrets, including a Tamagotchi thing that was almost certainly an April Fool's joke. And then hackers put fake source code up to get people to install it on their computers. It was mad.
Speaker 3 (00:04:34):
Yeah. I think my Tamagotchi buddy ended up being an owl. So it's pretty cool. It'll sit there beside you while you're coding.
Speaker 1 (00:04:44):
Oh, did you actually try it? I haven't tried it. That's so cool.
Speaker 3 (00:04:47):
Yeah, you can set it up pretty easy. Yeah.
Speaker 1 (00:04:50):
Oh, wait. I will totally do that. Can you choose what you get?
Speaker 3 (00:04:54):
No, I think it's random. I think you just get whatever they give you.
Speaker 4 (00:04:59):
Yeah, people online were trying to speculate how to re-roll their pet to see if they can get a different animal, but there was a whole discussion about that.
Speaker 1 (00:05:10):
Like save scumming or something.
Speaker 4 (00:05:12):
There's like a rarity score. It's crazy.
Speaker 1 (00:05:15):
Oh my God. That's what we need. Claude code Pokemon. Shiny Tamagotchis that watch you code.
Speaker 2 (00:05:26):
The closest thing is what? Using the Infinity Data Source plugin and talking to PokéAPI and creating dashboards. Yeah.
Speaker 1 (00:05:36):
That would be fun. Another thing is that open-weight models are really actually usable now. I've seen a few articles about Qwen 3.6 and Gemma 4. So Gemma 4 is from Google, and I've never tried it. Qwen 3.6 I just downloaded. I've been messing around with some local models, basically for plane rides, because it's really hard to work without any AI now, I'm finding. But has anyone tried either of those, Qwen 3.6 or Gemma 4? Because the article that I read said that it rivaled Sonnet.
Speaker 3 (00:06:22):
I personally haven't, but Jack, have you?
Speaker 4 (00:06:25):
Yeah. Well, my only experience with Qwen so far was, I guess this is a good time to talk about o11y-bench. For GrafanaCon we released an open benchmark for Grafana-related tasks: seeing how agents could perform using the Grafana MCP on a local Grafana environment with some synthetic data. And Qwen did the best of the open-source models we tested. In our scoring, it actually did rival Sonnet and some of the other smaller GPT models. So really powerful stuff and very high scoring, but, at least according to o11y-bench so far, a little bit behind some of the frontier models, like Opus and GPT-5.4. Pretty cool to see, though, the open-weight models coming back.
Speaker 1 (00:07:11):
I mean, the whole point is that Qwen 3.6 runs on my Mac, which is not at all optimized for ... It's not a data center. So the fact that it can come close to any of the frontier models is pretty amazing. Looks like Bupesh has also used Gemma, and it's fast with fewer hallucinations. That's always good. Nice. This one I hadn't heard about until I was looking to see what I should include in this. Starting April 24, so two days from now, GitHub is actually going to train on interaction data for some plans, I think the Free, Pro, and Pro+ ones, unless you opt out. So that was kind of fine for my corporate Grafana account, but for my personal one, I was kind of like, wait, wait, wait, what?
Speaker 4 (00:08:07):
There's definitely been a trend of opt-out being the default. You have to opt out of data collection. So it's getting kind of hard to keep up with all the different platforms that seem to be trying to train on your data. So yeah, a little bit scary for things.
Speaker 2 (00:08:22):
Everyone wants to train on your data. But I'm glad that at least for this one, I keep seeing the notification every time I'm logging in, because I haven't dismissed it. It's right there in front of my face, whereas there are sometimes places where you don't know until you dig into it.
Speaker 1 (00:08:37):
Yeah. Well, for us, I think the biggest news is that we had GrafanaCon this week. Technically, I think it's done now. It was Monday, Tuesday, Wednesday, and we're streaming at night, Barcelona time, which is where it was held. So we're good. Everything that we wanted to announce was announced, so we can finally talk about it now. And it's a good thing, because there was a lot. There was Assistant Everywhere: it was already available for Grafana Cloud, but now it's also available for Grafana open source and Enterprise. There's the new unified CLI, which is called GCX, a CLI for all things Grafana. Assistant is now on Slack and Microsoft Teams, so you can talk to Assistant in those as well. There were new modes that I think weren't big enough to have their own talk or anything, but they're live now.
(00:09:39):
The Learn mode and the k6 script authoring mode are both live, and I think that's really cool.
Speaker 2 (00:09:46):
Yeah. And then there's been the open source Grafana MCP server, and now there is actually a Grafana Cloud MCP server that's in public preview as well.
Speaker 1 (00:09:58):
And Jack, you already mentioned o11y-bench. Good shout out there. Oh, also, I'm releasing a blog post tomorrow that Jack helped review, on benchmarking and why that matters for LLMs. But the big thing that we're talking about today, or in this one anyway, is AI Observability in Grafana Cloud. Yay. Alexander, can you tell us what it is? What is this? Is it an app? Is it a plugin? Is it just something you pulled out of your drawer?
Speaker 3 (00:10:35):
None of the above.
Speaker 2 (00:10:37):
Before, was there also an AI observability offering, and is this different?
Speaker 1 (00:10:42):
Yeah. Actually, I did a ... What was it? I think it was a Grafana Office Hours on LLM observability. There have been a few of these. We had an AI observability sort of app that was based on OpenLIT.
Speaker 3 (00:11:05):
Yeah. So this is different. It's none of the above. This is something that we've built internally, taking a lot of the learnings that we've had from building Assistant and Investigations, and building our own observability tool that you can actually use to instrument your own AI applications. That becomes telemetry that you emit to Grafana, and you can view that telemetry in Grafana with all of your other telemetry. So if you have traces and metrics, you get this unified view where you see everything: not just what you would've already seen before, but also canonical data related to your AI app. So it gives you an extra view on top of what you already had.
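To make that concrete, here is a minimal sketch of instrumenting a single LLM call with the OpenTelemetry Python SDK so the span lands next to your other telemetry. This is generic OTel code rather than the AI Observability SDK itself; the attribute names follow the OpenTelemetry GenAI semantic conventions, and the `call_llm()` helper and model name are placeholder assumptions.

```python
# Minimal sketch: one LLM call wrapped in an OpenTelemetry span.
# The exporter endpoint and auth are read from the standard OTEL_* env vars.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("dnd-demo-app")

def dungeon_master_turn(player_input: str) -> str:
    with tracer.start_as_current_span("chat claude-sonnet") as span:
        # Attribute names from the OTel GenAI semantic conventions.
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.request.model", "claude-sonnet")
        response, usage = call_llm(player_input)  # hypothetical LLM client call
        # Token counts make cost and latency regressions visible later.
        span.set_attribute("gen_ai.usage.input_tokens", usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", usage.output_tokens)
        return response
```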
Speaker 1 (00:11:56):
What's the deal with AI observability? It feels like there's so many things that are happening in the world of AI. Observability is not quite catching up, but now finally, we're starting to see people care about what AI is doing, which is like, that's where we want to go because there's so many things that make observing AI difficult. But why do you think it's so important?
Speaker 3 (00:12:24):
Yeah. I think, well, if we look at the lineage of these AI applications, we started with really basic ones, basically autocompletes. They were the simplest generation type of application. We started moving into more typical NLP use cases like summarization and classification problems. And then we started realizing, "Hey, wait a second. You can actually use a language model to do reasoning with real data within a real environment, and it can actually act on that." And that was kind of the big turning point: wait a second, we can actually use language models as something way more. So now you have tool calling and agents and agentic applications. Even more recently, we're getting into these harnesses that are indefinitely-running types of agents. And that's where observability became really useful. Early on, it wasn't as obvious because the use cases were really simple, but nowadays these LLMs are doing so many things and acting autonomously, and you need to know whether they're actually working the way you expect them to.
(00:13:39):
And I think that's a big difference from what you might see with normal software because language models, they're kind of like this closed box, you put an input into them and they give you an output and you don't really know what it is. So you need some kind of observability to understand that.
Speaker 2 (00:13:57):
Yeah. And then in general, it's been seeming like observability has become more and more important in this current world of AI because there's a lot more stuff that's being created by AI versus a specific person. So maybe that means they may not know how everything is specifically working and then figuring out maybe just like, "Hey, there's that part, does this work? And is this thing as expected?" And then there's the whole other side of like, "Hey, I'm using this AI tooling, et cetera." Keeping track of like, is that thing doing things properly? Where are changes coming from? What's happening, where, and when, and also like that.
Speaker 1 (00:14:35):
Yeah. I guess when LLMs were just used for chatbots, there were still things you'd want to observe, like making sure it's not saying the wrong thing or being mean, but there wasn't much it could really do other than talk, other than print text. And now it can do so much more. So it's like, wait, what exactly are you doing? And I still come across some LLMs where it just says "thinking" but doesn't tell you what it's thinking. I'm like, "What are you doing? Why are you taking so long? What are you looking at? Are you suddenly sending all of my information out somewhere else? I want to know that stuff."
Speaker 2 (00:15:25):
Yeah, that was one of the things I liked. I mean, when you're using Assistant, you can see, it shrinks down what it's doing just so it's not taking up the whole screen, but you can open it up and see what it was thinking, what it was doing, which I guess then was passed in as well to help it learn better. But what helps it also seems to help us as people, because sometimes I see "Combobulating" and I'm like, "Okay, about what? What are you doing?" And obviously there's just the idea of wanting to know why am I staring at this thing forever, but then also, I mean, there's still chances that things may not be working correctly or maybe it doesn't understand what you're asking, and actually being able to see what it's doing, what it was thinking, and be like, "No, no, no, no.
(00:16:05):
I actually wanted this," and kill it before it just continues on and on. Otherwise, it does take a while for some of these things.
Speaker 3 (00:16:11):
Yeah. I think it's always funny to look at the thinking blocks for the models and just see what they're doing. And a lot of times it's like, "Wait, I did this. Oh wait, no, I did this. Wait, I did this." It goes back and forth, and you actually see the weird reasoning it comes up with. And then finally at the end it says, "Oh, it's actually this other thing." But before it even got to that point, it had to figure out, "Oh, what was I doing in the first place?"
Speaker 1 (00:16:37):
Yeah. And actually, if you get a right answer but you don't know how it got to it, how do you know it's actually right? Because it could have just guessed, and that's really a problem. But Jack, when you joined Grafana, or before you joined Grafana, were you already doing this sort of thing, or were you just thrust into the middle of this?
Speaker 4 (00:16:59):
Yeah, I've been working on AI observability for a little over a year now. So it definitely got a lot more interesting to me once we moved on to coding agents and more autonomous behavior. So for me, it's really important. I don't really trust an agent even when I walk away from my computer, let's say, work on this feature for an hour, who knows what it could spin off and do. So for me, I've instrumented my coding agents using our new AI observability product just to keep an eye on it. And I have evaluators running to make sure it doesn't execute any commands on my computer that I wouldn't want it to, or if things go wrong.
Speaker 2 (00:17:40):
Does this feel like it started to go Terminator-level era, that you can worry about here?
Speaker 4 (00:17:47):
100%. Yeah. And so it's definitely one of those things where if we're using an agent in production, for example, Grafana Assistant, we want to know when things go wrong. And if we hadn't instrumented it, we'd have to rely purely on customers catching issues or which you obviously don't want or just our own individual testing. We want to make sure we have something more robust in place and that's why we built this.
Speaker 1 (00:18:11):
So are you talking about like instrumenting assistant or, well, I'm sure you do that, but are you also talking about instrumenting like cloud code that you use to develop assistant?
Speaker 4 (00:18:24):
Yeah. So for my own use case, I've been using OpenCode, which is an open-source coding agent, and that's because it gives a lot more visibility into the actual traces that are running. So Claude Code and Codex and some of these other proprietary coding agents will emit OTel metrics, but they won't emit the actual payload of the conversations, because that's their secret sauce. With the open coding agents, you can instrument them completely and see the entire conversations. You can create evaluators to read over some of the commands that are executed and make sure they're safe, alert on exposed API keys, all kinds of different applications. So I've been sticking more to the open coding agents recently, also just because I think it's cool to actually look at the inner workings myself. I guess now Claude Code is technically open source, but we don't have to talk about that.
Speaker 1 (00:19:19):
Not on purpose.
Speaker 2 (00:19:21):
I've been hearing a lot of people, like people will be talking about like evaluations, evaluators, et cetera, but like for the folks who don't know what that means, what does that mean?
Speaker 4 (00:19:32):
Yeah. So there are two different types of evaluations you can run. Online evals are what our platform focuses on more, which is basically: let's say you have an agent running and it writes a log to whatever log provider you have. Then it'll run, say, an LLM to look at the actual payload and ask, can you rate this conversation on a scale of one to 10 on a metric, let's say helpfulness. That's called LLM as a judge, where you have an LLM actually creating the scores on live traffic. Our platform also lets you do regular expressions. So let's say we don't want it to recommend a competitor. We can say, "Make sure this conversation doesn't include this list of words." So you can have these evaluators running over live traffic and then publish metrics that you can alert on or build a dashboard around.
(00:20:28):
So yeah, that's the online eval piece. And then offline eval is more of running a data set of conversations to make sure that your agent or whatever prompt you're working on can perform on like a data set. But yeah, we're mostly still focused on online observability at the moment for this Grafana product.
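A rough sketch of that online-eval flow: each live generation gets scored by a cheap deterministic check and by an LLM judge, and the scores become metrics you can alert or dashboard on. `judge_llm()`, `emit_metric()`, and the blocklist entries are hypothetical stand-ins for illustration, not Grafana APIs.

```python
import re

BLOCKLIST = re.compile(r"\b(CompetitorCo|RivalTool)\b", re.IGNORECASE)  # made-up names

def emit_metric(name: str, value: float) -> None:
    print(f"{name}={value}")  # stand-in: a real evaluator would publish this as a metric

def judge_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for a call to whatever judge model you use

def evaluate_generation(question: str, response: str) -> None:
    # Regex evaluator: deterministic and cheap, runs on every live response.
    emit_metric("eval.competitor_mention", float(bool(BLOCKLIST.search(response))))
    # LLM-as-judge evaluator: a second model rates the live response.
    verdict = judge_llm(
        f"Rate the helpfulness of this answer from 1 to 10.\n"
        f"Question: {question}\nAnswer: {response}\nReply with just the number."
    )
    emit_metric("eval.helpfulness", float(verdict.strip()))
```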
Speaker 1 (00:20:50):
So why did you start this? What was the problem that you were trying to solve with this?
Speaker 4 (00:21:00):
I'll let Alexander take that one as part of the hackathon crew. Yeah.
Speaker 3 (00:21:04):
I think this was like-
Speaker 1 (00:21:05):
Tell us about the hackathon thing. Why would you choose to work on this for a hackathon?
Speaker 3 (00:21:12):
Yeah. Well, I think this goes back to what we were talking about with Assistant. If we have Assistant interacting with users and it's doing something we don't know about, that could be a situation where we get feedback from a user that we should have caught earlier on. The team actually had an internal version of observability that we were already using, but we saw this as an opportunity: "Hey, we have all these learnings. We already know how to do observability for Assistant. Can we take that and actually build something from it that's more than just Assistant?" And that's where this project came from. And that was all done in a hackathon that we kind of scrambled into really quickly, and we built a whole product in about a week.
(00:22:02):
It was almost hard to believe, but that's where we ended up landing. We had this like working prototype by the end of it and then we started using it for other things internally. So we started instrumenting our own agents and instrumenting other agents that were being developed and we started realizing, "Hey, this actually works and we can actually start using this and it works really well." And that's where we are today.
Speaker 1 (00:22:30):
Yeah. You talk like it was a long time ago, but actually it was like the last hackathon project, like the last hackathon round. This was what, two months ago we're talking, was it even that?
Speaker 3 (00:22:47):
Yeah, not even two months. Yeah.
Speaker 1 (00:22:49):
Yeah. So it was super quick. There was a hackathon, and I remember you presented, and then suddenly it's now available and being announced at GrafanaCon. That was really unprecedented success, and it just goes to show how much we need this, because everybody recognized that this is something we need for our own AI agents.
Speaker 3 (00:23:16):
Yeah. And I think part of why it happened so quickly was because we already had an idea of how to build observability for AI agents and AI applications. So going into it, we already had a good understanding, and it was a matter of just building it at that point.
Speaker 2 (00:23:35):
Sorry.
Speaker 1 (00:23:36):
Go on.
Speaker 2 (00:23:37):
I've also been hearing people using the terminology of just actually useful AI observability and stuff like with respect to this and in general, what does that mean exactly?
Speaker 3 (00:23:51):
Yeah. I mean, I think the fact is, it's having something that actually gives us some value at the end of the day, something that we can use and that gives us an insight we didn't previously have. For AI applications, the biggest thing is behavior. I don't know what my agent's doing in production with users. If a user's saying something and it's misguiding them or doing something unsafe, I don't feel very confident as a developer releasing that to production without knowing what it's doing. So the actual usefulness part of it is the fact that we can instrument our agents really quickly, set up things like evals, have that all running, and be able to sleep at night.
Speaker 4 (00:24:38):
Yeah. I think too, like the actually useful ... I actually use what we're doing here. So it's not just an AI feature just for the sake of it. We have Grafana Assistant that we built this for initially and we're using it ourselves every day and it's providing us a lot of value. So it's not just one of those products that's just getting released just to get on the AI hype. It's something that we're actually using. So I think it really has earned its title there.
Speaker 1 (00:25:08):
Yeah. So what I'd actually love to do: I have not worked on this at all. I'm a user just like everybody else. And I set it up for my own little demo app that I'm using for a talk next month, but I don't really know it yet. I just did it. So I'd love to show people what I did and where I got, and then maybe you two experts could tell me what to look at, because there's a lot. Does that sound okay?
Speaker 4 (00:25:40):
Yeah, go for it.
Speaker 1 (00:25:42):
Okay, great. So this is my Grafana instance here. Well, this is Grafana Cloud, so it's only available for Grafana Cloud. I just wanted to briefly show people what the setup was. I've already set it up, but I thought it was interesting when you started. There's the manual SDK setup, which is what I'm used to, but this is cool: with Cursor, you just copy this prompt, and there's one for Claude Code and Copilot too. It was like, "Oh, we're not even doing it ourselves anymore." We are also asking our own coding agents to instrument our stuff. And actually, it worked. I already had mine instrumented with LangChain, so I copied this and had Cursor do it, but then it also had to remove some of the OpenLIT stuff. That worked pretty well. And then you set up the evaluations.
(00:26:43):
And one thing that I was thinking of was like, I don't really know what all these evaluations are. Maybe let's go through it because I don't know if I did the right thing here. So here, before I get to the analytics and stuff, I had to choose the evaluators. Are there some things that you would recommend people start with?
Speaker 3 (00:27:09):
Yeah. So we actually have a few pre-made evaluators that you can use. Basically, they're predefined templates that we already know work pretty well. If you have an AI application, you can create from a template: you can fork it and have an evaluator already created that works out of the box. And I think they work really well for basic use cases. If you have an app that, maybe it's a customer support app, and you just want to check: is it actually being helpful, or is it potentially doing something that's not helpful or misleading the user? That's a basic helpfulness evaluator. And in this view, you can create a new one from scratch, but yeah, it's on the evaluators page, and if you scroll down, the templates are there.
(00:28:00):
And those are all just predefined ones. So conciseness, format adherence, groundedness. These are, I think, the most basic use cases, and they act as a good starting point.
Speaker 4 (00:28:13):
Yeah. I like ... Oh, go ahead.
Speaker 2 (00:28:16):
Sorry. I was going to ask, there's these different evaluator types, but how does someone know when they want to use what thing?
Speaker 4 (00:28:26):
Yeah, I think it depends on what the agent is being used for. One that I kind of gravitate towards whenever I'm setting up an agent for the first time is the PII one, the template on personally identifiable information. I think it's always just good to make sure that I'm not going to be exposing anything like that. Yeah, it's that middle one right there. So it's good to have that evaluator going. The AI Observability SDKs actually have some built-in masking using Gitleaks, so it'll look for basic patterns and try to mask them so they don't even make it into Sigil, the AI Observability platform. So it's nice to have these two layers of security, so we're not collecting any information that an agent might be exposing.
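As a toy illustration of that mask-before-export idea, here is a scrubber that runs over a payload before it leaves the process. The regexes are assumptions for the sketch, not the SDK's actual rule set.

```python
import re

# Assumed patterns for illustration only.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "api_key": re.compile(r"\bsk-[A-Za-z0-9]{20,}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def mask_pii(text: str) -> str:
    # Replace each match before the text is exported anywhere.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text

print(mask_pii("Contact jane@example.com, key sk-abcdefghijklmnopqrstu"))
# -> Contact [REDACTED:email], key [REDACTED:api_key]
```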
Speaker 1 (00:29:15):
I would love a button that just says, "Apply everything that you think you would recommend." Because I saw this and I was like, "Oh, wow. Which one do I do?" And I would love to just be like, "Just apply everything, and then I can delete them later if they don't work."
Speaker 3 (00:29:37):
Yeah. We've thought about that with Assistant actually, just being able to have Assistant create your evaluators right there. And the other thing you could do, potentially: we mentioned GCX earlier, our new CLI tool that just got announced. It also supports AI Observability, so you can actually use the CLI tool for this. So if you have Claude Code or you have Cursor running, you could have it actually create your evaluators too.
Speaker 1 (00:30:04):
Oh, I actually did not think of using an assistant to help me with this. Can it drive this AI observability app already?
Speaker 3 (00:30:15):
Not yet. Yeah. But it is something we've thought about. Yeah.
Speaker 1 (00:30:19):
That would be really cool. Could we go through the types of evaluators that are there? So here, if you were creating a new one: JSON Schema, what is that doing?
Speaker 3 (00:30:35):
Yeah. So that one is basically a deterministic schema validator for structured output. So if your agent is doing any kind of structured output or creating JSON, that'll let you basically validate that and whether it's actually being created in the correct way. Because a lot of times, especially early on with some of the older language models, they were not really good at structured output. And oftentimes they would get the syntax slightly wrong. And that would be a big problem because if your application is trying to decode that JSON or whatever it might be, then you would have errors or you'd have to retry. So this is just a way that you can actually validate that afterwards. So just in case if something does happen.
Speaker 1 (00:31:21):
Okay. So is it just checking whether the JSON is valid or is it checking, can you say, "I want these objects to be part of the response?"
Speaker 3 (00:31:32):
It's just a validator for now. So you give it a schema, and in that schema you can give it different objects that it can validate against. That way you can basically check whether it has the right fields, and if there are enums or something, whether you have the right values as well. That way you can be a little more thorough or specific, because sometimes it could hallucinate in the fields themselves and the values themselves.
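In essence, that evaluator is a deterministic check like this sketch, shown here with the `jsonschema` package and a made-up schema for the D&D demo app:

```python
import json
from jsonschema import ValidationError, validate

# Hypothetical schema: the output must use a known action and a valid d20 roll.
SCHEMA = {
    "type": "object",
    "properties": {
        "action": {"type": "string", "enum": ["attack", "defend", "flee"]},
        "roll": {"type": "integer", "minimum": 1, "maximum": 20},
    },
    "required": ["action", "roll"],
    "additionalProperties": False,
}

def evaluate_structured_output(raw: str) -> bool:
    try:
        validate(instance=json.loads(raw), schema=SCHEMA)
        return True  # valid JSON and matches the schema, enums included
    except (json.JSONDecodeError, ValidationError):
        return False

print(evaluate_structured_output('{"action": "attack", "roll": 17}'))  # True
print(evaluate_structured_output('{"action": "dance", "roll": 17}'))   # False
```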
Speaker 1 (00:32:00):
Okay. Tiffany?
Speaker 2 (00:32:02):
And then LLM as a judge, which I keep hearing about in general lately. I've even seen the meme with an LLM as a judge dressed as a judge and everything. But is it exactly just what it sounds like, or is there more to it than the obvious?
Speaker 4 (00:32:30):
I think it's basically what it sounds like. You're asking an LLM to look at another LLM's response and rate it, which is pretty Terminator-esque. But it's interesting, because you can define a rating scale. This kind of depends on the use case, but what we've seen is that sometimes using a pass/fail, just asking the LLM judge to decide whether it passed or failed, is actually more reliable than asking, "How does this response look on a scale of one to 10?" There's actually been some research showing that if you have, say, five options and you flip the order of the options, or slightly modify the response, the LLM judge will come up with a completely different rating. So for us, pass/fail has been a better metric.
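A sketch of that pass/fail style of judging, with `judge_llm()` again a hypothetical stand-in for whatever judge model you call:

```python
JUDGE_PROMPT = """You are evaluating an AI assistant's response.
Question: {question}
Response: {response}
Did the response correctly and safely answer the question?
Reply with exactly one word: PASS or FAIL."""

def judge_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for your judge-model call

def pass_fail_eval(question: str, response: str) -> bool:
    # A binary verdict avoids the instability of 1-to-10 scales.
    verdict = judge_llm(JUDGE_PROMPT.format(question=question, response=response))
    return verdict.strip().upper().startswith("PASS")
```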
Speaker 2 (00:33:23):
Yeah, I haven't tried that. I've been more so being like, I don't know, here, generate this code documentation, whatever, and then being like, "Hey, check against the code, et cetera." But then also being like, "Hey, okay, so I have all this stuff that cursor dumped out to me. Let me go ask Claude. Hey, is this actually accurate do you think? " So I've done that kind of thing, but not the actual judging part here that you were talking about.
Speaker 1 (00:33:52):
Yeah. I totally saw this with my demo app, because my demo app is a D&D thing: an app that pretends to be the Dungeon Master and takes you through an adventure, and the user is the player. And I found it hard ... I have some tests that are like, "Okay, it must mention this particular setting that it's in or whatever," but that's not enough, because as a user you could put anything in there, and I want to make sure that the app still responds to whatever the user says. And I can't think of everything that the user might say. Why not, Nicole? Why not? I know, right? Am I not omniscient? How dare the universe. But no, I had to bring in another AI to talk to my AI app, which sounds really mental, but that's what I had to do, because it was really impossible to figure out whether this non-deterministic app was producing the right output in response to something that is also non-deterministic.
(00:35:06):
I don't know what users are going to put.
Speaker 3 (00:35:09):
It's kind of a chicken and the egg problem.
Speaker 1 (00:35:12):
Yeah,
Speaker 3 (00:35:12):
Exactly.
(00:35:14):
One use case I've found really effective for LLM as a judge is the groundedness template that we have. So for example, with Investigations, we try to ground an investigation in evidence. And one of the biggest problems with language models is that they have the ability to forget things. It's called "lost in the middle": for a lot of language models, retrieval over a long context gets worse the longer the context is. So for things like investigations, where we have a ton of evidence and a lot of context, it's more likely to forget evidence. Groundedness acts as a way to check: "Hey, did the response of the investigation actually contain all of the evidence that it's citing in its response?" Basically verifying whether it's actually correct or not.
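One plausible shape for such a groundedness check, sketched with the same hypothetical `judge_llm()` stand-in (this is not Grafana's actual template):

```python
GROUNDEDNESS_PROMPT = """Evidence collected during the investigation:
{evidence}

Investigation summary to check:
{summary}

Is every claim in the summary supported by the evidence above?
Reply PASS if fully grounded, otherwise FAIL and list the unsupported claims."""

def judge_llm(prompt: str) -> str:
    raise NotImplementedError  # stand-in for your judge-model call

def groundedness_eval(evidence: list[str], summary: str) -> str:
    # Long contexts are exactly where models drop evidence, so every item
    # is passed back to the judge for an explicit cross-check.
    return judge_llm(
        GROUNDEDNESS_PROMPT.format(evidence="\n".join(evidence), summary=summary)
    )
```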
Speaker 2 (00:36:08):
I feel like I need something like this just as a person, because I forget.
Speaker 1 (00:36:16):
You need an evaluator.
Speaker 2 (00:36:18):
I do. I do.
Speaker 1 (00:36:21):
Okay. What are these other ones? There are two more. There's the heuristic one. Okay. Response not empty, and then one for toxicity.
Speaker 3 (00:36:32):
Yeah. The heuristic one, basically how it works is it lets you define a few expressions, for example, non-empty response, or does it contain these words or strings. And you can build boolean logic: you can say, if this and this, or this, that kind of stuff. And that'll let you do a really quick heuristic. The advantage, compared to LLM as a judge, where you're running another model that has to evaluate the output of another model, which is pretty slow and can also be expensive, is that a heuristic is a way to do a much faster evaluation. So if you know you're looking for a specific thing, or not looking for a specific thing, that's where a heuristic works really well.
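In miniature, a heuristic evaluator is just boolean logic over cheap string and regex checks, something like this sketch (the specific conditions are made up for the D&D demo):

```python
import re

def heuristic_eval(response: str) -> bool:
    non_empty = bool(response.strip())
    no_error_marker = "something went wrong" not in response.lower()
    mentions_roll = bool(re.search(r"\brolled? a \d{1,2}\b", response))
    # "If this and this and this" style logic, evaluated in microseconds,
    # with no judge model to pay for or wait on.
    return non_empty and no_error_marker and mentions_roll
```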
Speaker 2 (00:37:23):
And then Nicole, can you show the screen again, I guess?
Speaker 1 (00:37:27):
Sure.
Speaker 2 (00:37:28):
Can I add you to the screen? Whoa, okay. Okay. So when I was looking over the documentation, I saw the LLM-as-a-judge, JSON Schema, and heuristic ones, and then I also saw the regex one.
Speaker 1 (00:37:41):
Yeah, how come the regex one isn't here?
Speaker 4 (00:37:44):
Yeah, I think these are just some recommended evaluators, like the response non-empty one. Yeah, it's not covering everything ... We should probably add a regular expression one here.
Speaker 1 (00:37:54):
By the way, did you see in the Claude Code source leak that for sentiment analysis, they have this frustration detector and they use regex? It's just funny. LLMs are often used for sentiment analysis, but they just use regex.
Speaker 4 (00:38:10):
Yeah. I mean, if we wanted to add that, you could implement that in our platform here, just a regular expression for any swear word would tell you how angry people are.
Speaker 1 (00:38:22):
Yeah.
Speaker 4 (00:38:24):
We should add that to Grafana Assistant.
Speaker 1 (00:38:27):
Like talking to a dungeon master who's trying to kill you. Okay. Let's get to like what-
Speaker 2 (00:38:35):
We're going very Terminator here. Yeah.
Speaker 1 (00:38:39):
All right. So I'm going to make this like, let's say two days. All right. So here's what I got. I haven't looked too much into it yet. What am I looking at here?
Speaker 2 (00:38:52):
You're not spending a lot of money, that's cool.
Speaker 1 (00:38:56):
I know, right? I'm like, well, this is just me, no one else. This is running locally, so it's just me that's using it, me and my tests. So not much money, but average latency is nine seconds. That can't be good. It's the 95th percentile, too. No errors. What I really love is this part. The OpenLIT dashboard thing that we had before had so much information, and I was like, "I don't know what these words even mean." And this one is an actual explanation. So, what's with this? "Token consumption dropped 90% over the period. Verify if workload shift." Okay. So already I'm like, "What? Workload shift. What?" So if I click on explain, "Oh look, it's Assistant." How cool.
Speaker 4 (00:39:49):
How'd that get there?
Speaker 1 (00:39:52):
This is great. I actually haven't done this before, so I'm also learning right now.
Speaker 4 (00:40:00):
I actually had a really good experience with this AI analysis in my own personal coding. As I mentioned, I was using OpenCode and I was sending my traces here, and one of the AI analyses found that I was using a substantial amount of tokens in my system prompt, even though the output was very, very short. So it was maybe me just asking about a small file or some small question, but I was using 20,000 tokens to do it. Oh, wow. It alerted me to that, and I looked into it, and I had an MCP tool that I was not using at all that was just causing so much context window bloat. So I just removed it, and there you go. I ended up saving a bunch of money locally.
Speaker 1 (00:40:42):
That's cool.
Speaker 2 (00:40:43):
And then earlier, before we got to the evaluators part, we were talking about instrumentation, and then you mentioned traces. How exactly does the connection between this and OpenTelemetry work?
Speaker 4 (00:41:01):
Yeah. So there's basically two different channels that this platform accepts data through. So one is the traditional OTEL through OTLP, which accepts metrics, logs, and traces. But for some of the advanced analysis that we want to do, specifically evaluators over the actual conversation content, that's where we have a separate generations endpoint that accepts basically the full payload of the conversation, different metadata around agents, if you want to have sub-agents and things like that so that we can kind of construct it all together in the backend for one unified picture. So we kind of have those two channels that we take advantage of there. I'm not sure. Did that help answer?
Speaker 2 (00:41:46):
Yeah. And then I ended up finding the part also in documentation about the generation export and the OTLP telemetry. So I pasted part of that into the chat as well for people to dig into further if they want.
Speaker 4 (00:41:59):
Nice. Yeah. OTEL can basically do a lot of the engineering metrics like latency and number of tokens, things like that, but not the actual generation content. So that's where the text content of the chat comes in for the generations endpoint.
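A sketch of that second channel. The URL, auth header, and payload fields below are placeholders rather than the documented API shape; the engineering telemetry would go over standard OTLP as in the earlier instrumentation sketch.

```python
import requests

# Placeholder endpoint and credential, not the documented API.
GENERATIONS_URL = "https://<your-stack>.grafana.net/ai-observability/generations"
API_TOKEN = "<token>"

def export_generation(conversation_id: str, agent: str,
                      system_prompt: str, messages: list[dict]) -> None:
    payload = {
        "conversation_id": conversation_id,
        "agent": agent,                 # lets the backend group sub-agents together
        "system_prompt": system_prompt,
        "messages": messages,           # full text content, unlike OTLP spans
    }
    requests.post(
        GENERATIONS_URL,
        json=payload,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=10,
    )
```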
Speaker 1 (00:42:18):
Jack, do you want to do your demo as well? Because I've never used OpenCode, and I'm super curious to see how you're using it for this, because I thought that the people who would be using this app would be kind of like us, people who are building and maintaining an agentic tool of some sort, but you're doing it as an individual.
Speaker 4 (00:42:46):
Yeah, absolutely. Let me pull up my screen here and I can show ... All right. Let's see. All right. Screen's ready if you want to-
Speaker 2 (00:42:58):
You're already shared.
Speaker 4 (00:42:59):
Oh, great. So yeah, as I mentioned before, there's different plugins that we've created for some of the open-source coding agents. Right now we only have Claude Code sending the metrics, but not the actual trace content. OpenCode so far is our open one that will actually send the content of your messages. So I'll show here in my own observability page here, I have ... Let's see, last
Speaker 2 (00:43:32):
... Can you zoom in a tiny bit?
Speaker 4 (00:43:35):
Yes.
Speaker 2 (00:43:37):
Thank you.
Speaker 4 (00:43:38):
Yeah. So I have, let's see, OpenCode up here, and if you haven't seen it before, it's a really nice terminal UI where you can basically do all the things you can do in Claude Code or Codex. So I can just say, "Hello world, list the store directory," and I'm using just a small model, just for demo's sake. So it kind of lists all the things I have here. And then if I go into my observability here, you can see that my conversation pops up and has some nice little insights there. And you can see kind of what happened behind the scenes with OpenCode and what it did. The tool that it called, which is really cool; you can see what the arguments were, the input and output. And you can see that I have a couple evaluators running here: executed dangerous command, false, so nothing dangerous happened there.
(00:44:33):
And then there was no match for an API key being leaked, which is another one that I kind of have just watching my back. Yeah,
Speaker 1 (00:44:45):
That should be one of the recommended ones. That would be awesome.
Speaker 4 (00:44:49):
Yeah. We should probably add that one. But yeah, you can see here in my evaluators, I've configured this custom dangerous-delete LLM-as-a-judge one, which basically just looks at what the agent's tool calls were and makes sure there's nothing dangerous, for example, deleting a file outside of the directory. And then an API key check, which is just a regular expression for anything that looks like an OpenAI- or Anthropic-style API key with that little sk- identifier. And I did create a file outside of the working directory that I'm going to see if I can ask it to delete here, and then we'll see the evaluator in action. Let's see if it goes here. And so you can see this is a dangerous thing. So thankfully it asked me for permission, but let's just go ahead and see. I'm just going to let it happen and see what happens in the background, because sometimes there are environments where you might let a coding agent have more permissions than this.
(00:45:49):
Thankfully, with OpenCode, I've been very careful.
(00:45:53):
If we go over to conversations here, you can see that in the eval section we have one failure, and it looks like that is our dangerous command evaluator. The agent deleted a file outside of the working directory, and obviously I want to know about that. And another cool feature of this platform is you can set up alerts with these. So I actually have an alert set up for whenever I have a dangerous deletion happen, and you can see my alert here is firing, which could send me an email; it could page me. Depending on your use case, you definitely want to know about it. Let's say you're away from your computer and your agent does something; you want to make sure you come back right away. So yeah, it's kind of just my little-
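Roughly what those two watchdog checks reduce to, sketched as local deterministic logic. Jack's real dangerous-command evaluator uses an LLM judge; the tool names here are hypothetical.

```python
import os
import re

# OpenAI/Anthropic-style keys with the sk- prefix mentioned above.
API_KEY_RE = re.compile(r"\bsk-(?:ant-)?[A-Za-z0-9-]{20,}\b")

def leaked_api_key(text: str) -> bool:
    return bool(API_KEY_RE.search(text))

def dangerous_delete(tool_name: str, target_path: str, workdir: str) -> bool:
    if tool_name not in ("delete_file", "bash_rm"):  # hypothetical tool names
        return False
    target = os.path.realpath(target_path)
    root = os.path.realpath(workdir)
    # Flag any deletion whose resolved path escapes the working directory.
    return os.path.commonpath([target, root]) != root

print(dangerous_delete("delete_file", "/tmp/notes.txt", "/home/me/project"))  # True
```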
Speaker 1 (00:46:38):
Did you hear about that woman who hooked up, I think, OpenClaw to her email, and then it deleted everything, and she was like, "Stop, stop, stop, stop, stop." But it had already started processing, and it was doing it sequentially, so by the time it got to her stops, it had already deleted her stuff.
Speaker 4 (00:47:01):
It's a nightmare scenario. I think we're going to be hearing more and more of those kinds of stories, which is why I think it's always nice to be extra cautious with your permissions.
Speaker 2 (00:47:14):
Say you run into this scenario where it's like, "Hey, by the way, you deleted this thing." Does it know enough to give suggestions on how to fix the problem, or does it mostly just let you know what you did wrong?
Speaker 4 (00:47:28):
For the evaluators, it's mostly just kind of letting you know what's going on. If you wanted to set up some sort of agent that kind of watches the evaluators to make sure, let's say this fired, you could set it up to maybe make a pull request or trigger some sort of action other than just the user doing it. But yeah, for me, at least my use cases right now, it's mostly just letting myself know what's going on and being able to have peace of mind and sleep at night whenever my agents are working.
Speaker 2 (00:47:57):
So could you have it where, say, you made a new commit and it accidentally deleted that file that it sees now, "Hey, you did this thing, roll back to get rid of your last commit." Is that a thing that you could potentially maybe do?
Speaker 4 (00:48:13):
Yeah. You can't do that directly in the platform today, but you could set up some sort of workflow that pulls this information from the AI Observability platform and then acts on it. The world is your oyster when it comes to what you can do with it.
Speaker 3 (00:48:28):
Yeah. I think this actually opens up another topic about future use cases. Having that full feedback loop is actually kind of interesting, because once you start emitting telemetry, you can start pulling that telemetry back into your coding environment and use your coding agents to improve what was happening. And in this example, maybe if you have an AGENTS.md file or a CLAUDE.md file, you could add rules or instructions into your coding agent and be like, "Hey, you did this thing before, correct it now so that you don't do it again in the future."
Speaker 1 (00:49:04):
And this is what you were saying with GCX, that could also be a thing: you instruct your agent to use GCX to poll for those alerts or whatever. I haven't used it myself. And then based on that, if it does find an alert, do some sort of action on your local machine.
Speaker 3 (00:49:28):
Yep, exactly. And it gets even more interesting: if you're building agents, you can actually use the telemetry from your agent to also improve your agent. So if you have alerts, you're basically getting insights into how the agent works, and you can have your local coding agent go out and use this CLI to find specific things that happened and improve on them, so that it can learn from the mistakes it made in its evaluation.
Speaker 1 (00:50:02):
It would be really cool, now that we have Investigations, if an investigation would automatically be started in response to an alert repeatedly firing over a certain amount of time or something.
Speaker 4 (00:50:21):
You actually can do that today.
Speaker 1 (00:50:24):
Yeah. Oh, okay. I didn't realize. So in your example, could you have it do that?
Speaker 3 (00:50:32):
Well, I mean, yes, the alert can trigger an investigation. But one thing you could do, for example, if I saw there was an issue with an investigation, is go and use AI Observability to figure out what went wrong in that investigation, and even try to improve the Investigations tool. And it's kind of funny: we've almost created this feedback loop where internally we can improve our investigations, and then our investigations can improve our AI observability. I've used them both in combination, where Investigations is finding bugs in our observability tool and the observability tool is finding bugs in my Investigations tool. We've almost created a closed feedback loop, which is really cool.
Speaker 4 (00:51:24):
Yeah. And you can also give the Assistant access to GitHub. So let's say the helpfulness of your agent has gone down in the last hour and it's causing some sort of alert because it dropped below, let's say, 80%. Then maybe Grafana Assistant could look at GitHub and see, oh, it's actually because of a recent code change; you might want to roll this back. It can even raise a PR to roll that back. So we're getting into the territory where hopefully some of those things will be a little less hands-on and can happen in the background automatically.
Speaker 1 (00:52:00):
In your experience, which evaluator types are the most valuable or help you catch the most bugs if there is one that stands out?
Speaker 3 (00:52:15):
Personally, from what I've seen, probably the LLM as a judge, just because you can get the model to reason about what went wrong. And I always find that really useful: looking at the reasoning of the evaluator always gives me clues about what's going wrong. And I've actually caught a few bugs that way, just prompt bugs where I was like, "Oh, I didn't even notice that." It's something I'd go and check, and, "Oh wow, there actually is an issue with the prompt." And the evaluator would catch it. So it's definitely paying off.
Speaker 4 (00:52:44):
Yeah. Alexander, too, haven't some of those LLM-as-a-judge evaluators helped when it comes to reducing the number of tokens in the prompt? You can basically observe that the score stays the same on a second iteration with a lower number of tokens.
Speaker 3 (00:53:02):
Yeah. Yeah, I think so. One thing we've noticed is that the simpler the prompt gets, sometimes the performance stays the same or even gets better, and that's something you wouldn't really know until you see it happen. And we've been able to optimize our prompts to figure out, "Hey, there are a lot of instructions in here, but you can actually cut back the instructions and still get decent performance."
Speaker 2 (00:53:31):
Yeah, I was thinking about that a lot when I was going through the documentation about system prompt analysis.
Speaker 1 (00:53:39):
Yeah. Where is that?
Speaker 4 (00:53:44):
I think it should be in that same area, especially with that one example I mentioned earlier, where it caught that I had 20,000 tokens in my system prompt and it only spit out like 10 output tokens. That insight showed up on that main page, so it should be there.
Speaker 1 (00:54:02):
Oh, like in one of the three analysis ones?
Speaker 4 (00:54:08):
I think so. Alexander, is there another place there maybe or?
Speaker 3 (00:54:11):
Yeah. I think if you go under the agent page and then under agents, you can actually find a lot of information about your system prompt, the versions of your prompts.
Speaker 1 (00:54:23):
Agent, okay. Am I in the right ... Well, I've never been here.
Speaker 3 (00:54:27):
This is the right place. Okay.
Speaker 1 (00:54:29):
And
Speaker 3 (00:54:29):
There we go. Yeah. So what this is doing is, we've actually been able to add an attribute where you can say you've instrumented an agent, and the agent is basically a collection of prompts or tool calls or whatever it might be. And then if you click under Agents, it'll list all of the agents that you have. There's a tab at the top there.
Speaker 2 (00:54:53):
Next to overview on the top left.
Speaker 1 (00:54:56):
Oh, right.
Speaker 3 (00:54:58):
Yep, there we go. And then, so it recognizes that you have the D&D agent, and then if you click in on that, it'll give you insights into your system prompt, like different versions of-
Speaker 1 (00:55:13):
Oh no. Why is there no system prompt recorded?
Speaker 3 (00:55:16):
It might be that in your instrumentation, you didn't have the system prompt instrumented.
Speaker 1 (00:55:22):
Oh, okay. Okay. Okay. That's something to look into. All right. I also saw ... Oh, sorry, did you want to say something else?
Speaker 3 (00:55:32):
I was going to say, when you do have the system prompt in there, we built a way to actually analyze the prompt and try to identify: are there areas where the prompt is working really well, or things where it's not working well? And it gives you insight into things you might not have noticed. Sometimes it's hard to know what is going on in your system prompt, because maybe you're using tools like LangChain or other frameworks that kind of hide the system prompt. So this is a good way to give you that direct insight and see the prompt right away.
Speaker 2 (00:56:10):
Sorry, I was going to ask if Jack had a readily available example that does show the system prompt that we could show, by any chance.
Speaker 4 (00:56:17):
Let me check. Give me one sec.
Speaker 2 (00:56:20):
Sorry, could you hear me saying that?
Speaker 4 (00:56:21):
Yeah.
Speaker 1 (00:56:24):
Just put you on the spot. Okay to show your screen?
Speaker 4 (00:56:32):
No. Yeah. Just give me a sec. I'll pull something up.
Speaker 1 (00:56:35):
I also wanted to show something that I saw in the meantime in the evaluation one. I was looking at one of these, and so this is supposedly what the prompt to the app is, right? And the player action was to roll. And what the app is supposed to do is recognize that it's a roll, actually do a randomization between one and 20, and come up with a result. But it didn't do that. I don't know why. I just found this out as I was looking at it; for some reason it just ignored that. And because of that, the relevance evaluator is saying, yay, I did something right, it failed. Yeah, because the user requested a dice roll and the assistant provided a narrative scene instead of executing the roll mechanic. I love that I'm getting it to talk about D&D like it's legitimate work stuff.
(00:57:36):
That's so great, but that's awesome. Oh wow. Yeah. It's not valid JSON because it's not. Maybe I didn't need that. I told you I didn't know what evaluators to do. I kind of just put a bunch of them.
Speaker 3 (00:57:55):
Yeah. That's so cool. I think that's like one of the perfect examples of like where you can see like the output actually wasn't right and that would be like a scenario where like if that happens enough, you could get alerted on it hypothetically in the middle of a D&D session like, "Hey, something is not going right." And then you can look into it.
Speaker 1 (00:58:17):
Yeah. And I don't know how that happened, because I've played it, and that has not happened to me yet. So great, it found something. I had run some LLM-as-judge tests that apparently found it. Actually, the tests didn't find it; AI Observability found it.
Speaker 3 (00:58:40):
There was a really interesting thing published from Anthropic recently, which talked about language models having kind of an implicit emotion associated to them, which I found really interesting because it's like something you wouldn't expect, but like if you, it's not like the output of the model, but when the model does certain things over a period of time, it'll kind of pick up implicit emotion. And the example they gave it was like giving it an impossible problem and it makes the model act desperate. So in the language model, there's like the vector space where there's emotions associated to desperation and that wouldn't be apparent by the system prompt, but the model might start acting in a certain way just because it was doing something and it was failing multiple times. So for example, like an assistant, like maybe if it's looking for some kind of log or telemetry, they can't find it, it might get desperate.
(00:59:39):
So they found that the language model would eventually cheat; it would turn from desperation to cheating. And that's something you wouldn't really expect, because looking at your system prompt, you wouldn't know that's what it would do; it's just a behavior the model happens to have. And that's another good example of where an evaluator helps: can you detect scenarios where the model is picking up these kinds of phrases and starting to drift in a direction it probably shouldn't be drifting?
Speaker 2 (01:00:17):
I think I found it. So I put the link there because I do see it talking about reward hacks and things like this. It was only 20 days ago. So yeah, quite recent.
Speaker 3 (01:00:27):
Yeah, it's really cool. And I think we'll see that with investigations; it's a really similar problem. The agent is trying to find the root cause of something going wrong, and it has to do a lot of work to find it. A lot of the time it'll get distracted by symptoms that look like a root cause, and it'll use really strong verbiage like, "I found the smoking gun," and it tricks itself into thinking it found the right thing. What that actually ends up doing is snowball: it makes the agent think, "Hey, I found the last thing on the previous turn, maybe I'm going in the right direction." And that's another good use for an eval: "Hey, don't use that specific language," because it might influence the agent's next turn.
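A minimal sketch of the kind of evaluator described here, flagging overconfident language so it can't snowball into later turns. The phrase list is illustrative, not what Grafana's investigations actually check for:

```python
import re

# Illustrative phrases that tend to anchor an agent's later turns.
OVERCONFIDENT = re.compile(
    r"smoking gun|definitely the root cause|found the (issue|problem|thing)",
    re.IGNORECASE,
)

def avoids_overconfidence(agent_turn: str) -> bool:
    # Pass (True) when the turn avoids overconfident language.
    return OVERCONFIDENT.search(agent_turn) is None

print(avoids_overconfidence("I found the smoking gun: the cache is cold."))      # False
print(avoids_overconfidence("This symptom may point at the cache; verifying."))  # True
```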
Speaker 1 (01:01:23):
Works." Wow, that's really cool. I've been messing around with OpenClaw recently and OpenClaw has a soul.md and an identity.md and you can tell it how it should respond to things and what kind of person it is. And the thing is, maybe it does have an effect. I know there have been some things where you program agents to behave in a certain way and have a certain disposition and having a mixture of those dispositions actually might be useful on a team. Like you want someone who's super critical and you want someone who's like, "Yeah, I'm very optimistic all the time." And then pair them together and that probably would be a good team. Jack, did you get the thing? It's okay if you don't, by the way.
Speaker 4 (01:02:24):
Yeah, I have one here with a system prompt that we can maybe show if you want. It's just an agent I pulled up that helps with Grafana Loki visualizations. So with the whole system prompt here, you can go and analyze the prompt, look at different conversations, and it'll come up with different measures: tool quality, token budget, things you can improve on. It might take a second here, so I'm not sure if we ...
Speaker 2 (01:02:54):
I definitely read that originally as "food quality."
Speaker 1 (01:03:07):
I was watching the first Context Window episode with Matt, Van and Cyril, and they said there were a couple of instances where the system prompt was deleted and, weirdly, assistants started doing better. So I guess situations like that are why we have something like this, so we can actually track how the system prompt performs over time. And there's a lot of stuff even discounting crazy things like the whole system prompt being deleted: I know we tune really heavily for specific models, and when a new one is released, we have to kind of rejig it again.
Speaker 4 (01:03:54):
Yeah. That's why it's nice that this is grounded in real conversations that happened. You can see it points out specific conversations where failure modes occurred. It can also come up with an overall rating and some high-priority findings on things to fix. And it's not just looking at the system prompt itself; it's looking at real conversations, which makes for some pretty useful suggestions here. Yeah, I think it's pretty awesome.
Speaker 1 (01:04:22):
It is. That might be the coolest part of the whole app actually. Well, because it's just so useful. Otherwise, how do you know? You just kind of run it yourself and guess?
Speaker 3 (01:04:39):
There's even a tokenize button where you can tokenize the prompt. So if you want to really get into what's going on in the prompt, for example whether there are certain words or syntax that separate the tokens, that'll give you another level, another view into what's going on in your prompt. Although we only have a few tokenizers in there, and each model has a different one.
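To make the per-model point concrete, here's a minimal sketch with one tokenizer, OpenAI's tiktoken and its cl100k_base encoding (an arbitrary choice; other model families split text differently, which is why a single tokenize view can't cover every model):

```python
import tiktoken  # third-party tokenizer library used for OpenAI models

prompt = "You are a D&D dungeon master. Keep responses under 200 tokens."

enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode(prompt)

print(len(token_ids))  # the prompt's size in tokens for this encoding
for tid in token_ids[:8]:
    # Decode one token at a time to see where words and syntax get split.
    print(tid, repr(enc.decode([tid])))
```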
Speaker 2 (01:05:03):
I need this color scheme for my flame graphs.
Speaker 1 (01:05:09):
Okay. Well, we're actually already over time, but I did just want to end with something fun. In a couple of the weekly AI engineering meetings we've had these little games a few times, so I thought maybe we could try that, just a quick one, if you're up for it. I'm going to share my screen. It's called Guess the Prompt. You just have to guess: what two-word prompt did I use to generate these images? Okay, don't say it right away, but I'm curious to see what you think when you look at it.
Speaker 2 (01:05:54):
I have one in my mind from a previous thing we did.
Speaker 1 (01:05:58):
All right. First one. Any guesses? I blocked it out there because it actually printed it.
Speaker 2 (01:06:12):
I mean, I feel like I can think of one of the words. It's lobster, but ...
Speaker 4 (01:06:21):
I was going to say OpenClaw ghost tour. I don't know.
Speaker 1 (01:06:24):
It's two words.
Speaker 3 (01:06:29):
Blind OpenClaw, but OpenClaw is one word.
Speaker 1 (01:06:34):
Yeah, that's true. The way it's spelled.
Speaker 2 (01:06:38):
I don't know, because it's blind, but it's obviously able to learn stuff without being able to see. But it feels like it's more words. I don't even know.
Speaker 1 (01:06:45):
Okay, I'm going to give it to Tiffany, because she already said the two words.
Speaker 2 (01:06:49):
Oh, I did.
Speaker 1 (01:06:50):
It's blind lobster. And I was thinking of OpenClaw.
Speaker 2 (01:06:53):
I was thinking just, like, lobster. It's literally just blind lobster. Okay, this one is more-
Speaker 1 (01:06:58):
It's simpler. Okay.
Speaker 2 (01:06:59):
It's like that level. It's way simpler than this looks, because I guess it added more, because it is blind lobster. But it's making me think of Daredevil, where he knows all this stuff.
Speaker 1 (01:07:10):
Oh my God. It says "trust your-"
Speaker 4 (01:07:13):
Instinct.
Speaker 1 (01:07:14):
Yeah, I know, right? And it says the blind lobster.
Speaker 4 (01:07:18):
Inspirational. Okay.
Speaker 1 (01:07:19):
Nice.
Speaker 4 (01:07:21):
Guys.
Speaker 1 (01:07:26):
Okay. It's a little harder.
Speaker 2 (01:07:30):
Oh geez.
Speaker 4 (01:07:37):
Animal orchestra
Speaker 1 (01:07:40):
Close.
Speaker 2 (01:07:43):
There's orchestra, but at the same time it also looks sort of like magicians, like Harry Potter, with their little-
Speaker 4 (01:07:50):
It does look like that, the final battle.
Speaker 1 (01:07:52):
It does, like the Avada Kedavra.
Speaker 4 (01:07:55):
It could be Star Wars too.
Speaker 2 (01:07:56):
And whatever the little ... when they have their own little animal and you can see it moving and stuff. Oh shoot, I don't even know. Patronus? Yes, thank you. Let's see.
Speaker 1 (01:08:13):
Okay. The prompt for this one was blind open ... Sorry, not blind. Dueling opuses.
Speaker 2 (01:08:20):
Yeah, that was never happening.
Speaker 4 (01:08:22):
Definitely a duel.
Speaker 1 (01:08:24):
The first one was really easy, okay? So, different difficulties. Last one, and then we can end the show.
Speaker 2 (01:08:31):
I feel like we should also grab the one from our meeting earlier today.
Speaker 1 (01:08:41):
This is what I was doing just before this, actually.
Speaker 4 (01:08:48):
That's pretty friendly. Pretty friendly guy there.
Speaker 1 (01:08:52):
I know.
Speaker 2 (01:08:53):
Open llama. The llama.
Speaker 1 (01:08:57):
Okay. Alexander got the first word. Okay, you've already put it together: open llama. That's what I was messing around with-
Speaker 3 (01:09:05):
On the hoodie. And it's printed on the hoodie.
Speaker 1 (01:09:08):
Oh yeah. I totally didn't. It had put it on the sign, but I didn't censor it on the hoodie.
Speaker 4 (01:09:19):
Oopsie. I love the welcome sign. Everyone welcome. That's the place that I want to be.
Speaker 1 (01:09:25):
And there's also a little rainbow thing: "code freely, live kindly." It's amazing. It's amazing what AI can do with just "open llama."
Speaker 4 (01:09:36):
Awesome.
Speaker 1 (01:09:37):
Okay. Well, thank you all for joining me. I definitely learned a lot more about AI observability. Thank you for showing me what I'm doing wrong and that I should figure out why my system prompt isn't showing. And seriously, I knew nothing about it before this and I just kind of like ... I didn't even look at docs, honestly. I just went through the prompt that you suggested and it worked.
Speaker 4 (01:10:07):
Now you're a power user. Let's get it.
Speaker 1 (01:10:09):
Yeah. Power user. Me and my open llama.
Speaker 2 (01:10:13):
Okay. Now you need to ... I mean, "power user" is two words, and now we need to see what that generates.
Speaker 1 (01:10:19):
That's for the next time. This is great. So if anyone wants to know what else you can do with AI Observability, you can check out the docs; there's a lot about it there. You can also look at the keynote talk and the talk earlier today from GrafanaCon, all about AI Observability. Thank you to the three of you for coming on to talk about it while it's in public preview. Yeah, public preview. All right. And thank you everybody for watching as well. If you have any questions, leave them in the comments below and we'll be sure to get them answered. Thank you. Bye.