CLI and MCP Basics for AI Agents: Transcript

Speaker 1 (00:00:03): And we're live. I think this is the most people that we've had on an episode at the same time.

Speaker 2 (00:00:11): I think so. And are we not in five different countries right now? You messed it up.

Speaker 3 (00:00:23): Sorry. If that helps.

Speaker 2 (00:00:26): I'm in Portugal.

Speaker 1 (00:00:29): USA, Seattle.

Speaker 4 (00:00:30): [00:00:30] And the Netherlands.

Speaker 2 (00:00:36): Well, maybe we should get started. Let's introduce ourselves. I'm Nicole van der Hoeven. I'm a dev advocate at Grafana and I'm here with Tiffany.

Speaker 1 (00:00:48): I was like, am I in the same direction as you on my screen, but I guess so.

Speaker 2 (00:00:52): I think so.

Speaker 1 (00:00:55): Yeah. So I'm a developer advocate at Grafana for Tempo for [00:01:00] tracing, Pyroscope for profiling, and AI things. I'm going to go downwards. Hello, Ben.

Speaker 5 (00:01:08): Which I think is me. Yeah. Hello, everyone. I'm a software engineer at Grafana. I've been here for five years now, so practically forever. I've been on the machine learning team and now work on lots of different AI-based products, including the MCP server, which we'll be talking about a little bit, [00:01:30] and several other bits and pieces. So yeah, let's go that way, to Ward.

Speaker 4 (00:01:37): I think it's the other way around. It doesn't matter. You called me out, so I'll do my introduction. Yeah. So I've been with Grafana Labs for almost seven years now and had various functions, but currently I'm kind of a product manager focused on Grafana Cloud and, specifically nowadays, on GCX.

Speaker 3 (00:02:00): [00:02:00] Yeah, I guess I'll go last. So I'm David. I've been here about the same time as Ben, so just over five years. I've been building GCX with Ward and most recently have been focusing on trying to understand how GCX performs with its skills against other tools.

Speaker 2 (00:02:20): Awesome.
So I guess today we're going to be talking about MCP versus CLI, what all those letters mean, and which one [00:02:30] Ben thinks is appropriate for all situations.

Speaker 5 (00:02:34): The big versus word came out. I thought we were all trying to-

Speaker 2 (00:02:39): Yeah, we're picking sides, except only Ben ended up doing that.

Speaker 1 (00:02:45): I wonder which one.

Speaker 5 (00:02:49): I know.

Speaker 2 (00:02:49): I guess some announcements first. If you're here, then you're probably interested in what we're doing in AI. We're actually doing this maybe brave, maybe [00:03:00] silly thing where I'm taking a bit of an excerpt from every Team Weekly meeting and I'm putting it up on YouTube, and it's mostly the demos part. If you haven't seen it, it's called Inside AI Weekly and I'm trying to get one up every Monday. But let's go through some AI news from the last month. I think one of the biggest things is that [00:03:30] 4.8 dropped. I think it was just like, was that last week or something?

Speaker 1 (00:03:34): It was very recent.

Speaker 4 (00:03:36): I think it was last week indeed. Yeah. It's much better than the previous one because it's a higher number.

Speaker 1 (00:03:44): Yeah, it's May 20th.

Speaker 2 (00:03:45): Yeah.

Speaker 4 (00:03:47): The world's changing. Revolutionary.

Speaker 2 (00:03:49): But I do think it's better. I think it seems more like evolution than a revolution. I think they fixed some cost inefficiencies with [00:04:00] 4.7. At least that's what I got reading between the lines. It seemed just a little bit better on everything, but especially a bit more efficient in calling tools and stuff.

Speaker 4 (00:04:13): Yeah. And I also think they wanted to get rid of the 4.7 bad rep. Yeah, I was going to say. Yeah.

Speaker 5 (00:04:21): Yeah.
I think there's an association with, like, the new tokenizer in 4.7, and they were like, okay, well, quickly paper over that and get a new one out that doesn't have that association. [00:04:30] That's the cynic in me. I don't know.

Speaker 2 (00:04:34): There's a couple of things from Anthropic. Anthropic is going to IPO now.

Speaker 4 (00:04:41): Yeah. They did a not-so-secret filing, I think, with the SEC.

Speaker 2 (00:04:47): That's always scary when something that big, like one of the frontier models, is potentially up for grabs. All of these quality issues, like, [00:05:00] who's going to control one of the best frontier models in the world? That's kind of scary. But we'll see what happens there.

Speaker 4 (00:05:09): And also, if you see their latest funding rounds ... Oh sorry, I didn't want to talk over you, but if you also saw their latest funding rounds, those are big numbers. It's amazing. Like 980 billion or something valuation. It's incredible.

Speaker 2 (00:05:22): Oh, wow.

Speaker 4 (00:05:23): Yeah. That's a lot of cappuccinos.

Speaker 1 (00:05:28): Yeah. I'm [00:05:30] just thinking of how much money the employees there are potentially getting.

Speaker 4 (00:05:38): Yeah. Drinks are definitely on them if we

Speaker 2 (00:05:41): Don't use one. They get paid in tokens.

Speaker 4 (00:05:42): Yes.

Speaker 2 (00:05:45): They also introduced dreaming for managed agents. It's interesting because I think OpenClaw was actually the first to come out with this dreaming feature. It's a memory consolidation feature. They didn't say this directly, [00:06:00] but I think they got it from when the Claude source code was leaked. There were already some hints of dreaming there. OpenClaw actually implemented it first, and now this is Anthropic's actual implementation. I've been playing around heavily with OpenClaw, mostly using Opus. So I'm interested in seeing what Anthropic had in mind.
Speaker 4 (00:06:23): Wow. And what does it do when it dreams? I didn't follow that story.

Speaker 2 (00:06:28): Yeah. It's actually a really [00:06:30] tidy thing. Oh, I wrote a blog post just on this. Shameless plug.

Speaker 4 (00:06:35): There we go.

Speaker 2 (00:06:36): Yeah. It's called The Architecture of Forgetting. It's on Context Horizon, our AI-based kind of blog. And it's basically a three-stage memory process. It's basically how AI agents remember things and then also forget things on purpose.

Speaker 4 (00:07:00): [00:07:00] Interesting. That is interesting. More and more agents, or at least AI systems, are operating like a human brain in that sense. We're dreaming, they are now also dreaming. What's next?

Speaker 2 (00:07:15): Yeah. OpenClaw's implementation is particularly interesting because you can look and see what it dreamt in the light, deep, and REM phases, and you can see what it chose [00:07:30] to carry over into long-term memory.

Speaker 1 (00:07:34): I wish I had that choice.

Speaker 2 (00:07:40): Also, OpenTelemetry graduated in the CNCF, only officially on May 21, apparently. And although this isn't just an AI thing, the whole project really is leaning hard into GenAI, and our own AI observability app follows the GenAI semantic conventions, [00:08:00] so I thought that it was relevant. And then Google I/O also happened and dropped a bunch of models. The most interesting one was Gemini Spark, which sounds to me like their OpenClaw, like a continuously running personal agent that's only available on their brand new Google AI Ultra plans, which I can't get access to because I'm on a Workspace. My Google account is not personal, which [00:08:30] kind of sucks. I would like to access that.

Speaker 1 (00:08:34): Yeah. And then I was also seeing that it was like Antigravity 2.0 or something like that. Now they have some IDE and whatnot.

Speaker 2 (00:08:41): Yeah.
It's like VS Code. It's very Cursor-like, I think. I haven't played around so much with it, but that's also included in the AI Ultra plans. So it's like use of Antigravity and Gemini and YouTube Premium, which isn't AI, and Google Drive. [00:09:00] That's how Google's going to try to win this, by bundling in all of the Google things. Another thing is that Pinecone, who's the one that, maybe not came up with, but popularized the idea of RAG, retrieval-augmented generation, just released this blog post saying basically, we don't think that RAG is the way to go now. Instead, pre-compiled [00:09:30] knowledge is, and they're calling it Pinecone Nexus. I'm not entirely sure how that works yet, but I just thought it was interesting to see a flip-flop like this. They popularized it, so they own this space, and they're kind of saying, "Well, these are the areas where it's not so good."

Speaker 4 (00:09:51): Interesting. From the article that I just scanned, it looks very similar to what we do in Grafana Cloud around the [00:10:00] knowledge graph and AI memories, or the infrastructure memories that we have in Grafana Assistant. So I guess we're on the right track there.

Speaker 2 (00:10:11): Sounds like it. Well, since you mentioned it, Ward, why don't we get started and you can tell us what Grafana Assistant is.

Speaker 4 (00:10:19): What is Grafana Assistant? So Grafana Assistant is our kind of chat-based agent for all your observability questions. And the Assistant is kind [00:10:30] of an engine, and you can basically access it through all kinds of different means, different doors if you will. And there's two doors that we're going to talk about today, I think, the GCX door and the MCP door, but most of our customers are probably now familiar with using the Assistant through kind of a chat interface, a little bit similar to ChatGPT, and they can ask it all kinds of questions.
So think about simple questions like, help me write [00:11:00] PromQL, the query language for our metrics database. You can ask it to create a dashboard, but you can also do more advanced things like a full root cause analysis, or get help building SLOs, and much, much more. But that's essentially what the Assistant is, and what we're making available in all kinds of ways like MCP and GCX and APIs and the web.

Speaker 1 (00:11:29): Yeah. And then also, [00:11:30] if you've used Assistant, I don't know, before GrafanaCon, it was basically all about stuff you're running in Cloud. But as of GrafanaCon, you can also use Grafana Assistant with Grafana OSS and also Enterprise. And I'm putting a blog in here that came out with that if you want to see more about it.

Speaker 4 (00:11:55): Yeah, exactly. It was the big news of our GrafanaCon conference that [00:12:00] we just had.

Speaker 2 (00:12:03): Yeah. And also at GrafanaCon, I noticed that we announced so many things all at the same time, and now Assistant has tools, and there's a lot of people who are confused. If they want to use their AI tools, their AI agents, to do things in Grafana, there are now many different ways. They can use Assistant everywhere, it seems, and then [00:12:30] they can use those tools. They can use the Grafana MCP. There's now Grafana Cloud MCP, there's the CLI, there's GCX. So let's figure all of this out, because otherwise it's super confusing. What does it mean for Assistant to use tools, and what can you do with it?

Speaker 5 (00:12:52): Yeah, good question. I think tools are the thing that makes AI into agents. That's the way I think of it. And [00:13:00] it gives them the power to interact with the real world in some way, rather than just being like text in and text out.
So it allows the Assistant to do all of the clever things it can do, such as looking for dashboards and creating dashboards. Basically, it allows the agents to perform useful tasks, and to do so more accurately as well, by having real knowledge bases that they can query or being able to hit APIs in predefined ways.

(00:13:30): [00:13:30] About a year and a half ago, there was this kind of big thing released by Anthropic, which was the Model Context Protocol, MCP. And this was like a standardization of the way that tools are defined. The idea was to solve the problem where you have lots of different applications that want to talk to lots of different agents [00:14:00] and have lots of different tools to define. How do you do that without each application, such as Cursor or VS Code or whatever your IDE is, defining all of those tools separately? Each one of them would have to go and define their own tools to interact with GitHub and to interact with Grafana. And likewise, if GitHub wanted their tools to be available everywhere, they would have to contribute them to each of those applications. [00:14:30] So MCP is kind of a standardized protocol for defining those tools. Amongst other things, it's not just tools, but it basically allows you to tell a model, an LLM, an agent, these are some tools I have, and every client application knows how to talk to them.

(00:14:49): So some examples are the GitHub MCP server, which has tools such as create issue or pull request, or fetch some code, and [00:15:00] any one of the various IDEs can connect that up, and the user just logs in quite simply, ideally quite simply, and gets all of those tools for free. There are myriad trade-offs with this. It's great for lots of reasons. It's terrible for some others. I'm very partisan, but we'll talk about some of that hopefully soon.

Speaker 1 (00:15:25): Yeah.
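To make the idea of a standardized tool definition concrete, here is a minimal sketch in Python. It assumes the general shape MCP uses when a server lists its tools (a `name`, a `description`, and an `inputSchema` written as JSON Schema). The tool name, its fields, and the `missing_arguments` helper are illustrative inventions, not taken from the GitHub or Grafana MCP servers.

```python
import json

# Hypothetical tool definition: a name, a human-readable description, and a
# JSON Schema describing the arguments. Names here are illustrative only.
create_issue_tool = {
    "name": "create_issue",
    "description": "Create an issue in a repository.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "repo": {"type": "string", "description": "owner/name"},
            "title": {"type": "string"},
            "body": {"type": "string"},
        },
        "required": ["repo", "title"],
    },
}


def missing_arguments(tool, arguments):
    """Return the required argument names absent from a proposed call."""
    required = tool["inputSchema"].get("required", [])
    return [name for name in required if name not in arguments]


# Because the definition is self-describing, any client can print it for the
# model and check a proposed call without server-specific code.
print(json.dumps(create_issue_tool, indent=2))
print(missing_arguments(create_issue_tool, {"repo": "example/repo"}))
```

The point of the sketch is only that the description and parameter schema travel with the tool, which is what lets any client application talk to any server.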
I guess for people that maybe are on the observability side, less so [00:15:30] than the AI side, you can kind of think about it like OpenTelemetry, where instead of having to figure out something for every single backend that you have, whether it's for logging or metrics, or different types of logging systems, or tracing, if you're like, "Hey, I have Jaeger or I have Tempo," you don't have to set up everything separately for every single one of those. It's kind of similar. Different, but similar.

Speaker 5 (00:15:59): Right.

Speaker 2 (00:16:00): [00:16:00] Okay. So MCP is kind of like an API schema of sorts. It just kind of defines a standard format for how to define these tools.

Speaker 5 (00:16:15): Yeah. And that in itself is really powerful, even if you just had that plus the authentication side. So now, at last, we finally have a way of saying, this is how a client should authenticate with [00:16:30] a server, these are the RPCs or the functions or the tools that you can call, and they have standardized descriptions. That alone is really valuable. I think that's not really been something that the industry has had for a long time. There have been various proposed things on top of JSON APIs like REST and those sorts of things, but this is quite good from that point of view.

Speaker 2 (00:16:56): Okay. So what is the CLI then? Well, the [00:17:00] CLI is not actually an AI thing, it's a command line interface, right? But to anyone who wants to answer, what is the CLI in an AI context?

Speaker 4 (00:17:10): Right, right. So the CLI, indeed, everybody is kind of familiar with a command line on Unix, Linux, et cetera. It's one of the oldest concepts in computing in that sense.
People are very well aware of that, but the thing in the AI [00:17:30] age is that it's optimized not so much for human usage. In the case of GCX, which is the Grafana Cloud CLI, it's fine if humans use it, but it's actually optimized for agentic usage. So we're assuming that you're not calling the CLI directly, but you're calling it through Claude Code, Cursor, Gemini, Codex, all those types of agentic frameworks, and we need to make sure that [00:18:00] the agent, like Cursor or Claude Code, can figure out, based on the flags and the commands within GCX or within the CLI, which ones to call, and then the CLI will actually do the heavy lifting.

(00:18:21): And the idea is, because it's agent-optimized, we need to make sure that it's doing that very efficiently. And when we're talking about [00:18:30] efficiency, we're talking about low token usage, because of context and the context window. We need to make sure that there's only relevant data in the context window and that we're not filling that up with a big dump. Let's say you go to Grafana Cloud and you use GCX and you ask it, "Hey, give me all the dashboards." That can be maybe thousands of dashboards. You don't want to put that in the context. So it's really important that the CLI gives a small result back. And [00:19:00] then one of the things that we are doing is that GCX, the CLI, can actually also call the Assistant. So it even has access to the specialized agent that we have on Grafana Cloud for observability.

Speaker 1 (00:19:17): And then when I hear people talking about these things, with MCP, you just hear people talk about MCP. When people are talking about CLIs in the AI sense, I hear people talking about skills. Could someone kind of explain what that is?
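Ward's point about returning a small result instead of a full dump can be sketched in a few lines of Python. This is an illustration under stated assumptions only: `compact_listing`, its `limit` parameter, and the `uid`/`title` fields are hypothetical, and nothing here describes how GCX actually formats its output.

```python
def compact_listing(items, limit=5):
    """Return a short text summary of a long listing.

    Only the first `limit` entries are shown, followed by a count of what
    was left out, so an agent can decide whether to narrow its query.
    """
    shown = items[:limit]
    lines = [f"{item['uid']}  {item['title']}" for item in shown]
    remaining = len(items) - len(shown)
    if remaining > 0:
        lines.append(f"... {remaining} more; narrow the search to see them")
    return "\n".join(lines)


# Two thousand made-up dashboards collapse to six lines of context.
dashboards = [{"uid": f"d{i}", "title": f"Dashboard {i}"} for i in range(2000)]
summary = compact_listing(dashboards)
print(summary)
```

The design choice is to trade completeness for a bounded response size, and to tell the agent explicitly that more exists.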
Speaker 3 (00:19:30): [00:19:30] Well, I guess if you consider the commands that the CLI provides as the building blocks of functionality, you can decide to manually call a command to do a particular action: get a dashboard, or list your dashboards, or push an alert rule or something. The skills are a way of defining a workflow to maybe weave a couple of those command calls together to perform a particular action. [00:20:00] So it's kind of like bridging a gap from a set of unopinionated command line calls towards something more like what MCP tools do, which is to provide an opinionated action for a given use case. So they're really flexible things. They're essentially just a set of markdown files with some references to maybe some other markdown files or bash scripts or something like that. But for our use case, we're trying [00:20:30] to use them to build CLI commands, build GCX commands, into actionable workflows.

Speaker 4 (00:20:37): Yeah. And you can maybe also see, in the case of GCX, that skills are a little bit like training wheels, because our bet, or our hypothesis, is that the more information there is around GCX, its usage and use cases, the more LLMs will actually be trained on GCX. So if you ask an LLM to use the GitHub CLI [00:21:00] or the AWS CLI, it knows exactly what to do because there's so much information, so many blog posts around it. So hopefully, and of course we need to prove that, at some point in time there will be so much information around GCX that we don't need skills that much, because it's basically part of the training set of the LLMs. But that will probably take a little bit of time, because there's always kind of a lag between the information that LLMs are [00:21:30] trained on and the information that's out there on the interwebs.
Speaker 1 (00:21:35): And is the way that skills are described with GCX the same? Because I'm just looking at an article about Claude skills. Is "skills" used in the same way or is it different? Because, say, if someone talks about services, whether you're talking about Docker or Kubernetes or a bunch of other stuff, services doesn't always mean the same thing.

Speaker 3 (00:21:55): Yeah. In our use case, skills means the same as if you were [00:22:00] installing one of the Claude Code skills or something like this. The method of delivery is something different, but the result is that you get the same files in the same place. So if you're downloading some skills from, say, one of the Claude Code skills, or some skills from skills.sh or some other kind of registry or place where skills are published, you are essentially downloading a set of markdown files to a specific directory on your computer that the agents know to look in. And [00:22:30] I think each vendor has a different one. So Claude looks at a predefined path and maybe Codex looks at a different one, but normally these are symlinked from some kind of standard place or something. So there are some skills bundled within GCX, and there's a specific command that you can run that essentially writes those skills files in the standard format to the expected place on your computer so that they are available for the agents to use.

Speaker 4 (00:23:00): [00:23:00] Yeah. And maybe good to mention is that we're talking about CLI and MCP, and you never know, in a few weeks' time there may be something that is even more hip in the AI world. You know how it goes, right? It's kind of like fashion nowadays. And from our perspective, from Grafana Cloud's perspective, from Grafana Labs' perspective, the interface is kind of the fashion, even though we try to make it very high quality, very useful, [00:23:30] very available for our customers.
But I think the real value is the insight that it provides, the observability insights. So one of the things that we're focusing on when we're building Grafana Cloud is not only providing raw data, like results of PromQL queries, to our customers, but actually giving high-quality observability insights. So these are vetted insights, they're prioritized, they're actionable, and we try to deliver them in a fast and cheap way.

(00:23:59): So [00:24:00] how can we give an answer with the least amount of tokens, in a way that is really valuable for the agent or the human that is accessing those insights? That's really where we put a lot of time and effort, and we design products like the Assistant and our APIs around that. So the doors, if you will, the CLI, MCP, and API doors, are able to leverage that. And then if there is something next, if there's a new fashion [00:24:30] in town, then we can adopt that and we can make our insights accessible through that interface.

Speaker 2 (00:24:37): We do have a question here, actually, from one of our Grafana champions, Huleman. He says, "Hi all, can we use a custom MCP server from Grafana AI assistant? Do we have any sample configs or docs or blog posts?" So I'm posting a link to docs here for that, but maybe Ben, would you like to take [00:25:00] this?

Speaker 5 (00:25:00): Yeah, sure. Yes is the answer, with a couple of caveats. I think the main caveat is that it has to be a hosted custom MCP server, so it has to use the Streamable HTTP MCP transport. Oh, nice. I've got a demo.

Speaker 1 (00:25:18): Yeah, I had it up, so I figured, yeah, here's one that I added, and then you can go on to add here and ...

Speaker 5 (00:25:24): Yeah, exactly. So long as it supports remote MCP, which means the Streamable HTTP [00:25:30] transport, and it has to be hosted somewhere.
It can be hosted, ideally publicly as the easiest way, but there are ways that you can make it work with a private network connection, and it supports authentication via headers or OAuth. So as long as you use pretty standard auth mechanisms, that should work fine.

Speaker 2 (00:25:55): So Ben, while you're at it, maybe you could explain to us why you're on team [00:26:00] MCP.

Speaker 5 (00:26:02): Okay. Yeah. Well, the reason I'm on the MCP side is because I initially wrote the MCP server, the Grafana MCP server, in December 2024, so a little while ago now, and that was kind of just as MCP was becoming a thing and people weren't really as aware of tools and how tools could work with agents. [00:26:30] And MCP was a nice way of getting that, and you used to get a lot for free back then because you didn't really have as many frameworks as you do now. So yeah, I wrote the initial MCP server as a hackathon project and it did quite well, and then we decided to open source it, and since then it's kind of taken off quite nicely. We have, I think, more than 3,000 GitHub stars now, which is really exciting, and we recently released a cloud version of the [00:27:00] MCP server, which allows you to connect seamlessly to a Grafana Cloud instance.

(00:27:09): And the thing I really like about MCP, despite all the hate, is the explicitness. I guess you do have to define a lot of things like tools upfront and you have to be a little bit more explicit there, but what you get in exchange [00:27:30] for that is self-describing tools and parameters, and you get authentication built into the protocol, which is really nice. You can use it across lots of different transports. So you can use it either locally using the standard I/O (stdio) one or remotely with Streamable HTTP, and that means that it kind of works everywhere, roughly, as long as you have the right MCP servers.
You can use it whether you're in, for example, the Grafana [00:28:00] Assistant, which can connect to any remote MCP server. It doesn't matter what language it was developed in or where it's hosted. As long as it has a publicly addressable URL, you can connect to it and use it.

(00:28:15): Not that I don't like CLIs at all. I'm a big fan of them.

Speaker 2 (00:28:19): That's what I heard.

Speaker 4 (00:28:22): Yeah, what people had against MCP in the past was that MCPs [00:28:30] define the tools, and the tool definitions get loaded into the context window, and when you only have like a 200K context window, that fills up quite quickly. I did some calculations: around 76 tools, that's probably like 20K tokens that it will cost you. So that's about 10% of your token window already being used before you even start the conversation. However, what you see nowadays is that a lot of platforms are implementing tool search, [00:29:00] so that means that they're only loading those tools on demand, so that you're not filling up the token window or the context window that much. So that's one of the main pet peeves of a lot of folks that were using MCP in the past, and that's getting less and less true depending on the agent that you're using to access the MCP.

Speaker 5 (00:29:20): Yeah, exactly. I think the main great mitigation that's appeared in the last year is the deferred [00:29:30] tool search side of things. The main limitation is that the MCP host (Claude Code, Cursor) has to implement deferred tool search itself, but that is increasingly common. So yeah, agreed, Ward, that was my main complaint until quite recently as well.

Speaker 1 (00:29:49): And a lot of people could be like, "Oh, I'm using Cursor, Claude, or insert other tool." Or I guess not tool, but insert another, [00:30:00] I guess, what's even the correct word that describes all of those things? My brain is not working at the moment.
Yeah, basically if people are already using those for their own stuff, they're like, "Oh, I know how to use this thing. Let me just add the MCP server that I need, whether it's Grafana or GitHub or something else," and they don't have to start with some other entirely different thing.

Speaker 4 (00:30:29): Yeah. [00:30:30] Well, a lot of times, especially in technology, good enough is sometimes just fine. It doesn't need to be perfect. There might be some downsides, but for a lot of folks, that's perfectly acceptable. Maybe from a purist point of view, people are like, "It uses a lot of tokens," but nowadays, maybe depending on the way that you're interacting with agents, that doesn't really matter. And also the agents, for example, I use Claude Code. We now have one-million-token [00:31:00] context windows. And so yeah, it's less of a big problem, I think. But if you look at CLIs, one of the reasons why a lot of folks like CLIs is that with a CLI, you're aiming for a near-zero baseline where you have progressive disclosure. It has some commands where you can then find, okay, what are the commands that I need?

(00:31:27): And that is basically discovered [00:31:30] by the agent. And so you could say that a CLI has way less token usage, but again, those differences become less and less relevant, I think. Or well, not irrelevant, but they're closing in on each other.

Speaker 2 (00:31:50): I think one area where I like MCP better is that I feel like the tool discoverability is better on MCP. [00:32:00] On CLIs, you kind of have to know those commands, and yeah, you can go through --help or whatever, but you still have to go through each one, and then each command has different parameters and stuff. Sometimes you just want to know what's possible before you can figure out what you want to do next.

Speaker 5 (00:32:22): Yeah.
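The context-window arithmetic quoted in this part of the discussion can be checked with a short Python sketch. The inputs are the speakers' rough figures (about 20K tokens for roughly 76 tool definitions, a 200K window, and a one-million-token window); the function name is made up for illustration, and real token counts depend on the tokenizer and the tool descriptions.

```python
def share_of_window(tool_tokens, window_tokens):
    """Fraction of the context window consumed by tool definitions."""
    return tool_tokens / window_tokens


TOOL_DEFINITION_TOKENS = 20_000  # rough estimate quoted for ~76 tools
TOOL_COUNT = 76

for window in (200_000, 1_000_000):
    share = share_of_window(TOOL_DEFINITION_TOKENS, window)
    print(f"{window:>9} token window: {share:.0%} used before the first message")

# Average cost per tool definition under the same rough estimate.
print(f"about {TOOL_DEFINITION_TOKENS / TOOL_COUNT:.0f} tokens per tool")
```

Under these figures the same tool set drops from roughly a tenth of a 200K window to about 2% of a one-million-token window, which is the sense in which the gap between the two approaches is narrowing.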
And each one of those cycles to discover the arguments and the subcommands and everything costs latency and tokens, [00:32:30] and that's going to be done again in every conversation that you have. It's not something that's upfront. So that can be slightly ... After a while, I do wonder. It is quite fun to see it the first time you use a new CLI. You're like, "Oh my God, it's discovering everything. So good. How impressive." And that gives you that feeling of things happening and it's making progress, which is nice, I have to admit.

Speaker 2 (00:32:58): What do you think, Dave?

Speaker 3 (00:32:59): Yeah, [00:33:00] I think you see a lot of variability as well with the models and things like this. So for example, very anecdotally, we've run agents trying to use GCX to interact with Grafana for various use cases, with and without the GCX skills installed to give it any hints. And we can see a lot of variability in whether the models know that GCX exists, or whether they bother trying [00:33:30] to look for the command structure before calling anything, or just give it a crack and see if it works. But you'll see it quite often do this kind of stuttering thing where it gives something a go, fails, then decides to read the help docs for the given command to try and look for guidance on how to do it. So some of the models aren't so great at trying to discover exactly what to do before they try. They just try first, then fail, and then discover.

(00:33:57): So maybe the [00:34:00] skills are trying to bridge that gap, but you can see MCP provides a lot more guidance from the start for these models, maybe especially the cheaper, smaller ones.

Speaker 1 (00:34:10): Yeah. I actually had something similar, not with the CLI, but the other day I gave Claude a blog and was asking questions, and then I was reading and I was like, "This is weird." I'm like, "Did you actually read the blog?" It was like, "No, let me go pull that now."
And I'm like, "Oh."

Speaker 3 (00:34:26): Yeah.

Speaker 1 (00:34:26): Exactly.

Speaker 4 (00:34:28): Yeah. And what [00:34:30] I think is also important for us as creators of MCPs and CLIs is that we want to make sure that when a change is made, we know there are no regressions and it's actually an improvement. And Dave was already talking about a lot of the complexities, because it's a very diverse landscape. There are new versions of LLMs every day. There are different coding tools that people use. A coding tool that's popular today might not be popular tomorrow. It can change [00:35:00] overnight. Yeah, we do need to have a way to figure out, okay, how effective are our tools for our users? And that's, I think, something that Dave has been working on as well.

Speaker 3 (00:35:13): Yeah. So we can go into evaluations and benchmarks if we want to, but just initially, and Ben, I think, alluded to this previously, models have training data. So without any external data or tools, they have a certain amount of knowledge baked into [00:35:30] them. And if they've been trained on CLIs before, they'll have more information or knowledge about how they work and stuff. So you'll see agents use the GitHub CLI very effectively today because it's been around for a long time. It's very stable and things like that. But what we're seeing, especially as GCX is so new, is that sometimes the training data is fighting the way that we want to present the commands and stuff. A specific example of this is [00:36:00] it loves, well, Claude Code loves to, or Sonnet and Opus love to, redirect stderr to stdout and slurp all of that up into its input and do things with that.

(00:36:13): But it tries to do that and then it tries to compose the output of that command line tool call into a Python script or into jq or something like this.
So what we were doing is adding hints and other usage [00:36:30] ideas to the stderr output, but when the model redirects that to stdout as well and then pipes it into some other thing that assumes pure JSON output, it was falling over with an error. So there are a few little quirks like that that we're learning about along the way, and all of those kinds of things change constantly over time as these models are trained on new data. So regardless of what we try to pin down about what we need [00:37:00] to change to make our command line interface and skills better, the landscape around those things is changing too. Speaker 2 (00:37:09): Ben, are there any other use cases or things that you think MCP servers are better at than CLIs? Speaker 5 (00:37:22): So I guess as an extension of the previous thing, I think that if an MCP server is designed [00:37:30] correctly, then it should serve the needs of something like 80% of the use cases that people have really well. And that means there will be some slightly higher-level tools, rather than just raw API call wrappers, for the most common tasks that users ask for, to allow users to do most of the things they need to do, maybe with an escape hatch for things that we don't want a direct tool call [00:38:00] for, to avoid having to define hundreds of tools for every esoteric use case that you might have. So one example of this in the Grafana world would be to query your data sources, query Prometheus, or create a dashboard. Those are some really common use cases that we can provide really decent tools for, with good descriptions and decent search parameters, and that allows us to serve those common use cases really well.
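The stderr quirk David describes above can be reproduced in a few lines. This is only a sketch under stated assumptions: the child process is a hypothetical stand-in for a CLI (it is not GCX), and the hint text is invented for illustration. It prints JSON on stdout and a usage hint on stderr, then shows what a downstream JSON parser sees when the two streams are kept apart versus merged the way `2>&1` would merge them.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for a CLI: JSON result on stdout, a usage hint on stderr.
CHILD = (
    "import sys, json\n"
    "print(json.dumps({'alerts': 2}))\n"
    "print('hint: this is an example usage hint', file=sys.stderr)\n"
)


def run(merge_streams: bool) -> str:
    """Run the stand-in CLI and return the text a downstream parser would receive."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        # subprocess.STDOUT sends stderr into the same pipe as stdout (like 2>&1).
        stderr=subprocess.STDOUT if merge_streams else subprocess.PIPE,
        text=True,
        check=True,
    )
    return result.stdout


def try_parse(text: str):
    """Return the parsed JSON, or None when the text is not pure JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


separate = try_parse(run(merge_streams=False))  # stdout only: parses cleanly
merged = try_parse(run(merge_streams=True))     # hint mixed in: parse fails
print(separate, merged)
```

Whichever order the two writes arrive in, the merged text is no longer a single JSON document, which is the failure mode described in the conversation.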
(00:38:28): Less so the case for [00:38:30] things like, I don't know, interacting with some of the less commonly used cloud APIs that we may have, things that you set up once or twice and then never really touch again. Is it worth defining a tool and eating up the context window of every single user who uses it? That's dollars on my conscience that I have to justify every single time someone uses it, just to occasionally serve those users, [00:39:00] and often that is not the case. And I think that is one of the things that I would recommend GCX for without a shadow of a doubt: if you need to do everything, all of the different things that Grafana or Grafana Cloud allows you to do, GCX is a really, really good avenue for that. Speaker 2 (00:39:25): Okay. Maybe let's go to the CLI side of things. And [00:39:30] Ward, do you want to talk about GCX and what it is and why you thought that MCP was not enough? Speaker 4 (00:39:40): Yeah, exactly. I think MCP was enough. Again, I'm not changing my mind. I think it's "and, and, and", not "or", definitely not. Speaker 2 (00:39:54): Well, if he's Team MCP, then you must be Team CLI. Speaker 4 (00:39:58): I don't know what that is. I don't like [00:40:00] that framing. I reject the framing; you can try, but I reject it. So let me just talk about what we tried to do, because CLIs are very effective. That is definitely something that you see when you use Claude Code with a CLI. So for example, I use the GitHub CLI a lot and I love it. It's great. It works very well. It could be that there's a GitHub MCP server that also works great; I haven't used it. But what we notice is that [00:40:30] designing ... Well, let's first start with what GCX is, because we've referred to it quite a few times already. So GCX is the new Grafana Cloud CLI.
So there have been quite a few CLI attempts for Grafana in the past, and these are still great, but we never had a CLI that was focused on all the endpoints, all the solutions, all the products that we have in Grafana Cloud. (00:40:56): So this is our, well, I don't want to say attempt because [00:41:00] it's actually already in public preview. It's going to go GA. It's actually already quite successful, but this is our approach to a CLI optimized for agentic usage for Grafana Cloud. And yeah, like I said, we're now on the road to GA. There are quite a few Grafana users, Grafana Cloud users, that are adopting GCX. So we're now seeing that hundreds of organizations are actually using it. We don't have the number of GitHub stars [00:41:30] yet that the MCP server has. So we're slightly behind. We need to do better. So if people are listening, go to GCX on GitHub and give it a star. Maybe if there is a battle of MCP versus GCX, let's see it in the GitHub stars. Maybe that's the nice way to go about it. I see Ben sweating there. (00:41:52): Let's see. Speaker 5 (00:41:54): I'm sorry. Yeah, I've got a year's advantage, but I don't know if I can maintain that. Speaker 4 (00:41:59): And we [00:42:00] also see in our API usage that people are heavily using GCX already. We see millions of API calls. So that's really exciting and we want to go GA as fast as possible, but there are just a few things that we want to get in order before we officially stamp it GA. So there are a few things that we focused on when we were building GCX. Of course it needed to support the entire Grafana Cloud API, but we also wanted to make it agent [00:42:30] native. And I already spoke a little bit about that, but what it concretely means is that we're detecting whether you're running in agent mode, so Claude Code, Cursor, Copilot, whatever, and then it flips the output.
So we always make sure that when an agent is asking for stuff, we give agent-friendly output, typically JSON. If you, for example, as a user call that API without any specific flags, we typically give you CLI [00:43:00] text that renders nicely; tabular information renders very nicely. (00:43:05): So we also make sure that we add hints to all the commands. For every piece of Grafana, there's basically a command with subcommands, and we make sure that we add hints to them. So the hints say, hey, if you call this command as an agent, you'll probably get medium-sized output or large output, [00:43:30] so that the agent can think, "Okay, this is maybe a command that I should not directly trigger because I could get a huge result back." So the agent hopefully, based on that information, knows what the costs are of calling those APIs, and can maybe narrow down the commands that it needs to run before running them, or plan how to interact with GCX in that sense. [00:44:00] Another thing that we do is spill to a file. So let's say you have a big payload, maybe a very big dashboard or a long list of alerts; you don't want to put that in the context window, like, "Hey, agent, here you have one megabyte of JSON." That makes the agent not very happy. (00:44:19): So we spill to disk in a JSON format and then we tell it what the structure is, so that the agent can then decide how to read that data in an effective way and the context [00:44:30] stays clean. Two other things that we did. So with the hints, we tell it how it can select some fields. So if you are, for example, returning alerts, what is important for alerts? It's maybe the name, maybe some labels, maybe the query that's being executed, maybe the thresholds.
So these are all fields within the JSON, and we allow the agent to specify which fields it wants [00:45:00] to get back, and that's all, again, to make sure that it's as efficient as possible. And then the last thing that we do is we have semantic exit codes and JSON errors. So in place of retry and abort, we can actually make sure that the agent understands what went wrong and, based on that, plans the next action and makes it a little bit better. (00:45:27): And yeah, one of the things that I'm focusing on right [00:45:30] now and thinking a lot about is that we want a feedback loop. So we want to understand how people are using it and where the agents that people are using are using GCX wrong, or where GCX is wrong, and basically use that as a feedback loop to learn how we can improve GCX and make it more effective. So that could be maybe anonymized usage reports or something like that, or maybe people can ship a debugging [00:46:00] or troubleshooting payload to us, but we want to get that feedback loop running to make sure that GCX is as efficient as possible. Anyway, so that's a little bit about GCX. Speaker 2 (00:46:13): Great. I kind of feel like I want to see what it looks like in practice, to really see how you do something in GCX and then how you do something with Grafana MCP. Ben, do you want to go through the demo? Speaker 5 (00:46:30): [00:46:30] Yeah, sure. Okay, let's try it. I'll explain what's going on once you see ... Oh, you got it up. Amazing. Okay. So this quick demo is just two side-by-side Claude Codes, and it's intentional that they look very similar right now. The one on the left has an MCP server configured for this specific Claude Code instance and the one on the right does not, but it has some skills explaining how to use GCX in the local directory.
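Before the demo continues, the agent-native behaviors Ward lists above (flipping output for agents, field selection, and spilling large payloads to disk with a structure summary) can be sketched roughly as follows. Everything here is an assumption made for illustration: the `EXAMPLE_AGENT` environment variable, the 1 KiB threshold, and all function names are hypothetical, and nothing in the conversation says how GCX actually detects an agent or formats its output.

```python
import json
import tempfile
from typing import Any, Optional

SPILL_THRESHOLD_BYTES = 1024  # illustrative value, not taken from GCX


def detect_agent_mode(env: dict) -> bool:
    """Hypothetical heuristic: treat a set EXAMPLE_AGENT variable as agent mode."""
    return env.get("EXAMPLE_AGENT", "") == "1"


def select_fields(items: list, fields: list) -> list:
    """Keep only the requested keys on each item."""
    return [{k: item[k] for k in fields if k in item} for item in items]


def render(items: list, agent_mode: bool, fields: Optional[list] = None) -> str:
    """Return a tab-separated table for humans, or JSON (possibly spilled) for agents."""
    if fields:
        items = select_fields(items, fields)
    if not agent_mode:
        if not items:
            return ""
        headers = list(items[0])
        rows = ["\t".join(headers)]
        rows += ["\t".join(str(item.get(h, "")) for h in headers) for item in items]
        return "\n".join(rows)
    body = json.dumps(items)
    if len(body.encode()) <= SPILL_THRESHOLD_BYTES:
        return body
    # Large payload: write it to disk and return only a short description of it.
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
        f.write(body)
        path = f.name
    summary: dict[str, Any] = {
        "spilled_to": path,
        "count": len(items),
        "fields": sorted(items[0]) if items else [],
    }
    return json.dumps(summary)


alerts = [{"name": f"alert-{i}", "query": "up == 0", "threshold": i} for i in range(200)]
small = render(alerts[:2], agent_mode=True, fields=["name"])
large = json.loads(render(alerts, agent_mode=True))
print(small)
print(large["count"], large["fields"])
```

The point of the summary object is that the agent gets the shape of the data (count and field names) without the payload itself entering the context window; it can then decide how much of the spilled file to read.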
[00:47:00] So what I plan on doing is basically asking each to do the same thing, and in the second one I'm going to say "use GCX" because I haven't told it more about GCX yet. And this task is actually one that the MCP server doesn't have a direct tool for. So it's a little bit disadvantaged, I suppose. We'll see how it goes and we'll see what it gets up to, because we have plenty of other examples of things that the MCP server does have a tool for. (00:47:29): So [00:47:30] let's go. So all I'm going to do is send these, and, I mean, it's going to be impossible to keep track of all of the things it's doing, but the one on the left is now calling MCP Grafana, which is the MCP server that we have, and the one on the right is using GCX. Here, it's discovered that GCX is a CLI and now it's going through all [00:48:00] of the different help that it can find to work out how to create an SLO. I don't know if I even talked about the question here. It's actually "create a Grafana SLO for my assistant metrics, grouped by cluster." So what this should do is use ... They're both talking to the exact same Grafana instance. We're going to look to see how each one achieves the goal. (00:48:23): So actually this is very quick. The one on the right here, the GCX one, almost immediately [00:48:30] has looked through all of the different GCX commands. It's written out what it knows an SLO should look like to a YAML file that we can then check, and then it's done a dry run using some GCX commands here. And then it's gone ahead and actually created it using the SLO command there. That was very quick. So this one cost, I'm just going to zoom in slightly, this one cost 45 cents [00:49:00] to do this. The one on the left is asking me some follow-up questions. Let's talk about this one. Let's go and have a look at what it's done. So it's looked for some data sources using some MCP tools.
It's searched for labels using the MCP tools, it's searched for metrics, and it's querying to figure out how the SLO API should look, because it doesn't have a tool for that. (00:49:24): And then it's going to go ahead and ask me a couple of questions about how to do this. [00:49:30] Hopefully it will now call the Grafana API to create that SLO and we'll see exactly what we wanted. So both of those have achieved this goal. It's managed to create the SLO there. They've just done it in very different ways. One of them has just hit the API directly because it didn't have a tool for it. So I mean, this doesn't necessarily answer the question of which one you should use. Sorry, I'm trying to find out how much this one has cost on the [00:50:00] left, but I've zoomed in really far so I can't actually see the bar at the bottom. Speaker 2 (00:50:07): So then another engineer on the AI team says AI observability could help keep track of what they're doing. Speaker 5 (00:50:19): Very true. Yes. And I need to get both of those configured in the right way. Speaker 4 (00:50:24): You're still on Opus 4.6, man. Didn't you hear? Speaker 5 (00:50:26): Yeah, I didn't like 4.7. Yeah. And then I was actually [00:50:30] on PTO last week, so I haven't got around to switching everything to 4.8. So this is immediately invalid. Everything that I've done here means nothing, but this one cost three cents more. So I guess it's an obvious win for GCX and I need to just retire. Speaker 1 (00:50:47): Or add a tool. Speaker 5 (00:50:49): Or add a tool. Yeah. So yeah, that's a little bit of a demo of how each works. Hopefully that's instructive [00:51:00] at least in some way. I'll zoom back in a bit now. Speaker 2 (00:51:05): Yeah. Okay. Well, they both got there and the price difference is not enough that I would be like, "Yes, definitely use this.
" So it's still kind of like, oh, which one Speaker 5 (00:51:16): Do Speaker 2 (00:51:16): You use for which situation? Speaker 5 (00:51:19): Yeah. And one of the ways that we can help to figure this out is I'd like to write some kind of decision tree and there's some aspects [00:51:30] of that decision where it's really easy to make the choice because if you are using a remote, maybe you're using claude.ai or ChatGPT remote and that doesn't have a shelf. So it can't just use GCX because it can't run GCX commands. In that case, you have to use a remote MCP server, that is the solution. And in others, it's less obvious. [00:52:00] If you're using an IDE locally or you're using Cloud code, it could be as much as a personal preference. It could be whether you are happy running binaries, arbitrary binaries on your machine, no shade on GCX, because they're very good binaries, I'm sure. Or it could be whether you need to perform more like esoteric tasks as we talked about. (00:52:27): Another way we can try to figure [00:52:30] this out is by running evaluations and actually looking at which of these is better for an individual task, whether that's for cost or fertility usage or the speed as we just got a comment I saw. Yes, it could be that speed is what you want to optimize in which case you choose that. So Dav, I'm going to hand over to you because I know you've got things to talk about here. Speaker 3 (00:52:58): Yeah, thanks. So [00:53:00] for our own education as well as for curiosity, what we've been doing is running the Ollie Bench suite, which I'll explain what that is in a minute, both with the MCP tool and with GCX as the door to Grafana is the analogy that we've been using, isn't it? So very briefly, Olliebench is an observability benchmark that Grafana Labs have defined and it contains, this is publicly available [00:53:30] open source. You can go and run it yourself with your own agents if you want to, which is always quite interesting. 
So by default, and this was originally defined to test different models, to see how the different models perform when interacting with Grafana on a variety of observability-related tasks. So originally you fix everything else down and you compare the models, but for our use case, developing GCX [00:54:00] and the skills that we bundle with it, we flip that a little bit. (00:54:04): So we fix the model as well as the harness, and what we adjust slightly in the harness is the way that it interacts with Grafana. So we can run it with MCP, which is the default, and we can also run it with GCX, with the skills that GCX has installed. And so this will provide a bit more balance and a bit more data, maybe, compared to the anecdotal run that Ben had. And it's worth [00:54:30] mentioning as well, for all of the categories of tests here, as you'll see, I think there are MCP tools for the majority of the tasks defined in this benchmark. (00:54:45): So yeah, there are 63 tasks across these five categories. For this particular run here, we're running Opus 4.6 on both of these benchmarks. This one's running MCP, and then I have another tab up here where we're running GCX, but everything else in theory is the same. [00:55:00] So you could spend hours digging through this, but there are some headline numbers at the top. So for MCP, about two thirds of the tasks passed all three times, since we run this with an N of three, and it costs about $50 in tokens to run all of this. And each test has a defined grading mechanism, so there's a score associated with each task too. [00:55:30] And so compare those headlines to running against GCX with skills. It's more expensive for the runs at the minute. The mean score is slightly lower. (00:55:38): There's a slightly lower incidence of all three runs passing.
So we see slightly degraded performance when using GCX. Now this is really useful for us internally, because what we can do is compare these scores. We can go and look inside a particular task [00:56:00] and you'll see the prompt for each task at the top, and it'll show what it tried to do, and then the trajectory, or the path that the agent took through all of these various things: all of the tool calls it made, and you can see the responses and go through that. And so this is really useful for us developing GCX, and you can see what skills it loaded and things like that too. So we can compare maybe the categories where it didn't perform so well compared to MCP, or maybe just performed badly [00:56:30] on its own as well. (00:56:32): And we can go and inspect to see where the agent went wrong, or maybe where the agent misused the tool, or maybe we've designed the tool in a way that's easily misused or doesn't fit the scenario, and things like that. And then we can decide to make changes to GCX or to the skills. And, for transparency, as these are all quite new things, sometimes there are little [00:57:00] quirks within the benchmarking suite as well that we need to adjust. So each one of these tests has a grading rubric, and the whole benchmarking suite was designed with the assumption that it would use MCP as the front door to Grafana. So we've kind of broken that assumption by sticking GCX in there instead. And maybe one of the grading criteria, for example, was originally checking that a specific tool was called, (00:57:30): [00:57:30] that a specific MCP tool was called. So that's never going to be true in the GCX version. So we had to adjust that to be more agnostic of how it got to the end result, that kind of thing.
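The rubric change David describes, moving from "was a specific tool called?" to "was the end result reached?", can be illustrated with a small sketch. The `Trajectory` type, the tool-call strings, and both grading functions are hypothetical names invented here; they are not taken from the benchmark's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical record of one agent run: the calls it made and the state it left."""
    tool_calls: list = field(default_factory=list)
    final_state: dict = field(default_factory=dict)


def grade_tool_specific(run: Trajectory, required_tool: str) -> bool:
    """Original-style criterion: passes only if one named tool was called."""
    return required_tool in run.tool_calls


def grade_outcome(run: Trajectory, expected_state: dict) -> bool:
    """Tool-agnostic criterion: passes if the expected end state was reached."""
    return all(run.final_state.get(k) == v for k, v in expected_state.items())


# Two runs that reach the same result through different front doors.
mcp_run = Trajectory(tool_calls=["mcp:create_resource"], final_state={"slo_created": True})
cli_run = Trajectory(tool_calls=["shell:cli create resource"], final_state={"slo_created": True})

for name, run in [("mcp", mcp_run), ("cli", cli_run)]:
    print(
        name,
        grade_tool_specific(run, "mcp:create_resource"),
        grade_outcome(run, {"slo_created": True}),
    )
```

Under the tool-specific criterion the CLI run can never pass, even though it reached the same end state, which is why the grading had to be made agnostic to the path taken.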
So these kinds of benchmarking runs are really useful for the headline scores and then for digging into each of the paths that the agent took to perform each of those tasks, [00:58:00] so that we can go and learn from that and iterate and improve the GCX tool and the skills that we provide with it. Speaker 2 (00:58:10): Awesome. This is a great way to dogfood too. We needed this for ourselves to be able to test and evaluate our own AI tools more. And now Olliebench is something that's available to everyone and is not just for our stuff, [00:58:30] but it is pretty handy to put GCX and MCP side by side like this and see what one's good at, what the other one's better at, and stuff. But I also want to talk about other Grafana things that might be mentioned in the same breath, like Assistant, which is sometimes conflated with one or both of MCP and CLI. How would you say they're related to Assistant? Speaker 5 (00:59:00): [00:59:00] That's a good question. I think that GCX and MCP are analogous to the tools that the assistant has available. And actually the assistant is more similar to Claude Code, or the actual agent harness. It has the whole harness, from running the agent loop to calling different tools, different ways of managing context, doing tool search, [00:59:30] and all of those things that the MCP host or the agent application normally has. The big advantage that Grafana Assistant has is that it's really well optimized for Grafana-specific tasks, especially things like creating dashboards, where it can actually visualize the dashboards as it goes, and it can interact with the DOM and update things live in front of you as you're working through creating a dashboard. [01:00:00] For example, you can change these things, which is really, really good.
(01:00:03): And it also has some quite bespoke tools there for doing Grafana-specific things across the whole suite, not just doing those raw API calls; it also knows the full workflow all the way through. I always reach for Grafana Assistant if I'm doing anything to do with creating dashboards or debugging issues [01:00:30] that I know involve several different data sources, or where I've started on one page and it can navigate me to others. I find it really, really powerful for that, especially for the investigation side of things. Speaker 4 (01:00:42): Yeah. Speaker 2 (01:00:43): Same for anything visual. I would especially go to Assistant first if I want to see it or if I want to learn something: teach me how to do this thing, not just create an SLO for me, but where is that in the interface? Then Assistant would be better for that. Speaker 4 (01:01:00): [01:01:00] Yeah, exactly. And the assistant also has access to the infra memories, to the knowledge graph, for the folks that are familiar with Grafana Cloud, and to schema context. We put a lot of effort into Assistant Investigations, as part of the assistant, where we have a validated workflow doing root cause analysis based on hypotheses. And there's a lot of value there, because the thing is, [01:01:30] with GCX and with MCP, you definitely have the hands to operate Grafana Cloud, to work with Grafana Cloud, but the agent still needs to improvise based on the knowledge that the LLM has, and the benefit of the assistant is that we basically ... Do you say "the full banana"? Was that the way that it was always mentioned? I don't know what the term is, but we basically do ... It knows Grafana Cloud. It has [01:02:00] great insights about what is actually happening with your services, and not so much on a raw telemetry view or level, but actually real-time outlier detection and that kind of stuff.
(01:02:14): So I think that the assistant is still a very important part of getting valuable insights, and then for the more deterministic things, GCX and MCP are great, but you always [01:02:30] need the assistant for the brains, if you will. Speaker 2 (01:02:35): And I also wanted to mention, because people might get confused, that there's Grafana MCP, but then there's also k6's MCP, and that's a very different thing because they have different goals. So the Grafana MCP is about your observability stack and the k6 MCP is about writing and running your own k6 tests [01:03:00] on your machine. So different things, but there might be confusion there if you look for MCP and they both come up. Just to wrap things up, because we are out of time, I just wanted to do a quick "when to use which" for Grafana. Luckily we had this beforehand, so I was going to read it out. If you have a non-terminal client, so something like Claude Desktop or Cursor, then MCP is better. If you're building [01:03:30] automated pipelines or scripts, if you're token-constrained, or if you need full feature coverage, then the CLI, GCX, is better, because you get the full banana, which is what Ward said. (01:03:46): If you want zero-setup authentication for cloud users, then hosted MCP is better. And then if you have complex multi-step workflows with data transformation, then you would use the CLI [01:04:00] plus jq and pipes. And then if both are available, then start with MCP for discovery, or actually I would say start with Assistant first, honestly. And then go to MCP and get familiar with all the tools available. And then when you're ready for the heavy lifting, switch to GCX. And then someone said, "If you're called Ben, then just MCP all the way." That wasn't me, by the way. Speaker 1 (01:04:29): I had [01:04:30] a question.
So if you're using Assistant, where you can add your own MCP servers, is there a benefit, specifically in Assistant, to adding the open source Grafana MCP, for instance? Does it do anything differently there versus if you just had Assistant by itself? Speaker 5 (01:04:49): That's a good question. Not something we've ever really thought about before, but I can definitely think of some use cases where the assistant itself doesn't have tools for these things. And one of them is just that [01:05:00] it can't make arbitrary calls to Grafana's API, unless I've missed something over the last week, and the MCP server does allow that. So it almost opens up limitless possibilities for the assistant, because actually these agents do understand Grafana's API quite well. I think all of the docs are in the training set, and we just saw Claude Code manage to make the right API call to create an SLO, even. So yeah, you could use it for that, for example. [01:05:30] There are some other tools there that maybe the assistant doesn't quite cover, although I can't think there would be many at this point. I think the assistant has generally got pretty decent coverage of those things. Speaker 4 (01:05:41): Yeah. And it's also a little bit turtles all the way down, because we can also think about maybe running GCX as part of the agent environment of the assistant, so that you ... The assistant now can also execute Python scripts. So it could also call the GCX CLI as part of that. So [01:06:00] you get all the functionality of GCX available within the assistant as well. And then you can actually call the assistant again with the CLI. So it's turtles all the way down. It's infinite recursion. Speaker 2 (01:06:13): It's the Spider-Man meme. It's ... Speaker 4 (01:06:15): Pointing at each other. Yeah. So people will probably leave this podcast more confused than they were before, but ... Speaker 2 (01:06:28): Yeah. Well, join the club.
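The "when to use which" rundown Nicole read out earlier, together with Ben's point that a client without a shell has to use MCP, can be condensed into a rough decision function. This is a sketch only: the field names are invented, and the order in which the rules are checked is an assumption, since the speakers give the rules as a loose list rather than a strict precedence.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    """Hypothetical description of a user's setup and needs."""
    has_shell: bool = True              # False for non-terminal or remote clients
    automated_pipeline: bool = False
    token_constrained: bool = False
    needs_full_coverage: bool = False
    zero_setup_auth: bool = False
    multi_step_transform: bool = False


def recommend(s: Situation) -> str:
    """Return a rough recommendation; rule ordering is an assumption."""
    if not s.has_shell:
        # No shell means CLI commands cannot be run at all.
        return "MCP"
    if s.multi_step_transform:
        return "CLI + jq and pipes"
    if s.automated_pipeline or s.token_constrained or s.needs_full_coverage:
        return "CLI"
    if s.zero_setup_auth:
        return "hosted MCP"
    # Both are available and nothing forces the choice.
    return "Assistant first, then MCP for discovery, then CLI for heavy lifting"


print(recommend(Situation(has_shell=False)))
print(recommend(Situation(needs_full_coverage=True)))
print(recommend(Situation()))
```

As the speakers stress, several of these branches come down to preference rather than hard constraints, so a function like this is a starting point for discussion, not a rule.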
[01:06:30] This is an attempt to try and talk things out in the current state of everything, but the reality is it's not as simple as you should always use this tool. We don't want to do that. We don't want to have to choose as a company which one is the better bet or anything. So we're just kind of saying yes to everything and we'll see what sticks. Speaker 5 (01:06:57): Yeah. It's a game of trade offs as with so many other things [01:07:00] in this field, right? Speaker 1 (01:07:02): Yeah. And we keep advancing so much and changing so much that it could be completely different in a couple months, for instance. Yep. Speaker 4 (01:07:10): It's going to be very close. Speaker 2 (01:07:11): So maybe we'll have you all back again. Speaker 4 (01:07:13): Yeah. Yeah, indeed. Yeah. We can talk about our next project that we're going to launch in these things. Speaker 2 (01:07:19): It has to be a three-letter acronym though. We can't accept anything else. Speaker 4 (01:07:25): A three-letter acronym. Okay. We can already start brainstorming the name. Then we can use [01:07:30] the name and we can figure out the technology that is actually powering it later. Speaker 2 (01:07:34): Okay. Sounds like a good plan. Thank you everyone for joining us. Thank you to the people that are on here now watching us. And then also if you're watching this after the fact and if anyone has any questions, we might not know the answers, but we'll get you someone who does. So just leave them in the comments below. Speaker 1 (01:07:56): Yeah. And then there's also slack.grafana.com. There's [01:08:00] a Grafana assistant channel in there for instance, but yeah, basically it's the community Slack. So if you want to ask questions in there, talk to people, et cetera. Speaker 2 (01:08:09): And the next episode is actually going to be with ... 
It's a recorded one, not because we didn't want to do it live, but because the person that I interviewed, Yaz, is going away for a bit and I wanted to catch him before he did. It's going to be all about evals, with a focus on offline evals since we did online with AI observability [01:08:30] a couple of episodes ago, so watch out for that. But thank you everyone for watching and we'll see you next