---
Title: Canonical Observability With Simon Aronsson and Michele Mancioppi
Author: nicolevanderhoeven.com
Tags: readwise, articles
date: 2024-01-30
---
# Canonical Observability With Simon Aronsson and Michele Mancioppi

URL:: https://nicolevanderhoeven.com/blog/20220121-koh38-canonical-observability
Author:: nicolevanderhoeven.com
## AI-Generated Summary
Today, my former colleague Simon Aronsson brought his new colleague, Michele Mancioppi, to k6 Office Hours to talk about observability, and how they do it at Canonical. It turned into a great conversation about trends in observability from the last decade, and how it relates to performance and reliability testing.
Transcript: Hello everyone, and welcome back to the first k6 Office Hours of the year. Happy 2022, I’m Nicole van der Hoeven.
## Highlights
> I definitely think there is a lot of overlap. One thing that I know that you and I have been pushing forward when we’ve been discussing this previously on the stream is that performance testing is great, but it usually only allows you to do black-box testing, right? You’re able to measure the performance of your system, given certain interfaces or endpoints that you are able to observe the performance of, and pairing that with observability tools, like for instance Loki or Tempo or Prometheus, allows you to really drill back into the stack and figure out what is causing those problems. So I definitely think that it makes sense to combine both to be able to observe the whole system and not just the edges of it. ([View Highlight](https://read.readwise.io/read/01fv078vnpp6r536h1r2td0vnh))
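A hedged sketch of that pairing, not from the episode: drive load at an endpoint from the outside (black box), then query the back end’s Prometheus HTTP API to drill into what happened inside (white box). The target URL, Prometheus address, and metric name are all illustrative assumptions.

```python
# Black-box load plus a white-box follow-up query.
# Assumes `pip install requests`; URLs and metric names are illustrative.
import time
import requests

TARGET = "http://app.example.com/checkout"                 # hypothetical endpoint under test
PROM = "http://prometheus.example.com:9090/api/v1/query"   # hypothetical Prometheus

# Black box: from the edge, all we can measure is end-to-end latency.
latencies = []
for _ in range(100):
    start = time.perf_counter()
    requests.get(TARGET)
    latencies.append(time.perf_counter() - start)
print(f"worst end-to-end latency seen from the edge: {max(latencies):.3f}s")

# White box: drill back into the stack via Prometheus' query API.
query = 'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))'
resp = requests.get(PROM, params={"query": query})
print(resp.json()["data"]["result"])  # server-side p95 during the test window
```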
> Observability is a property that you want your systems to have; monitoring is the way that we have to make most systems observable. ([View Highlight](https://read.readwise.io/read/01fv07aeqpmn5tegk64jm0e6e3))
> automatic distributed tracing ([View Highlight](https://read.readwise.io/read/01fv07b2kt4979db8aybwcjew7))
> Monitoring is the art of collecting data ([View Highlight](https://read.readwise.io/read/01fv07b7h2v8eqemyfarxf9h0n))
> I really take the view that monitoring is a means to the end of observability. ([View Highlight](https://read.readwise.io/read/01fv07bqmc82x4hp56t3a9dbse))
> It is true that in the past 15, 20 years, there has been an increase in active monitoring. ([View Highlight](https://read.readwise.io/read/01fv07d77kmwpaashag58ysjww))
> But I also wanted to say that there’s still some push-based monitoring, but we’re also now seeing more pull-based monitoring systems, and more event-driven ones. ([View Highlight](https://read.readwise.io/read/01fv07e68r92vtwgsqyrahr946))
> there are entire domains where we deploy software that is still very much push-based. For example, IoT: for very basic reasons of network visibility, it’s very hard to reach out onto someone else’s rooftop and go get the data. ([View Highlight](https://read.readwise.io/read/01fv07evjqe12xreqgnqq2tqnw))
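A minimal sketch of the two models using the prometheus_client library, assuming `pip install prometheus-client`; the port, gateway address, job, and metric names are illustrative.

```python
# Pull: the app exposes /metrics and the monitoring system comes to scrape it.
# Push: the app sends metrics out to a gateway it can reach, which suits
# devices the scraper cannot reach (the IoT-sensor-on-a-rooftop case).
from prometheus_client import (
    CollectorRegistry, Counter, Gauge, push_to_gateway, start_http_server,
)

# Pull model: Prometheus scrapes http://<host>:8000/metrics on its own schedule.
requests_handled = Counter("app_requests_total", "Requests handled by the app")
start_http_server(8000)
requests_handled.inc()

# Push model: ship a reading to a Pushgateway, since nobody can scrape the rooftop.
registry = CollectorRegistry()
reading = Gauge("rooftop_sensor_celsius", "Last sensor reading", registry=registry)
reading.set(21.5)
push_to_gateway("pushgateway.example.com:9091", job="rooftop_sensor", registry=registry)
```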
> Yeah, it’s kind of like you mentioned that our applications in general are just getting more and more complex. It reminds me of the trend from monoliths to microservices, and now some people are also going back. It’s never really gone, right? And it’s really difficult to say, no, this is the new way, nobody else is doing or should do the other way. It’s not really like that. A lot of things in tech are cyclical. ([View Highlight](https://read.readwise.io/read/01fv07f5vps90vjbqf47ze1w4n))
> • 10 years ago we used to talk about microservices as well, but then we called it service-oriented architecture and it wasn’t the new cool thing. And then we decided that that was kind of cumbersome and painful to work with, and we started to merge our software together again, and introduced monoliths.
> • But in a service-oriented architecture, we tended to have a very expensive, very large piece of machinery called the enterprise service bus that actually provided, to some extent, the kind of visibility that today we need to get through tracing. ([View Highlight](https://read.readwise.io/read/01fv07gayjp787y8sczbtd66t6))
> It was the classic fallacy of, oh golly, the same infrastructure that is powering my observability system was probably the system that went down. ([View Highlight](https://read.readwise.io/read/01fv08epfrdrr6n4p0v0jv7bb3))
> This is the classic blast radius question. Everything has failure modes, actually several of them, and the more abstractions we deploy, the more failure modes we introduce. For example, Simme and I are building an observability system that runs natively on Kubernetes. Our guidance is: if you deploy it, deploy it to a different cluster on different hardware, because you should never put the monitoring system on the same infrastructure as, or close to, the stuff it monitors. ([View Highlight](https://read.readwise.io/read/01fv08jafe9zp0ysvvsvef0w4v))
> It’s not only that. We also usually recommend that if you deploy our monitoring system, deploy another monitoring system that makes sure that your monitoring system is up, and set them up in a, what do they call it? A dead man’s trigger? ([View Highlight](https://read.readwise.io/read/01fv08hgcd6wx401nf0ywfk5dm))
> The dead man’s switch is the idea that, for example, you know in the movies when the hero goes up to the villain and says, if you kill me, you will be exposed in two days’ time? That’s the classic dead man’s switch: something that happens unless someone periodically goes there and tells it not to. In the case of monitoring: unless your monitoring system is telling you that it’s fine, assume tragedy. ([View Highlight](https://read.readwise.io/read/01fv08jg6s4n1rxz4yj2wgdtb5))
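A minimal sketch of that heartbeat pattern, with every name, port, and threshold a hypothetical stand-in: a tiny second watcher that assumes tragedy the moment the primary monitoring system stops checking in. (The always-firing Watchdog alert in kube-prometheus-style setups works on the same principle.)

```python
# Dead man's switch: the primary monitoring system must POST a heartbeat here
# periodically; when the heartbeats stop, assume tragedy and page someone.
# Port, timeout, and the page_oncall hook are hypothetical illustrations.
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from threading import Thread

TIMEOUT_SECONDS = 120
last_heartbeat = time.monotonic()

def page_oncall(message: str) -> None:
    print("ALERT:", message)  # stand-in for a real pager integration

class HeartbeatHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        global last_heartbeat
        last_heartbeat = time.monotonic()  # "I'm fine" from the monitoring system
        self.send_response(204)
        self.end_headers()

def watch() -> None:
    while True:
        time.sleep(5)
        if time.monotonic() - last_heartbeat > TIMEOUT_SECONDS:
            page_oncall("No heartbeat from the monitoring system; assume tragedy.")

Thread(target=watch, daemon=True).start()
HTTPServer(("", 9199), HeartbeatHandler).serve_forever()
```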
> So let’s talk a bit more about the characteristics of a good observability stack. We’ve already mentioned a few: we’ve mentioned that it should be as independent as possible from the systems it’s trying to monitor. We’ve also said that maybe we should take into account the human side of things rather than just the technological, and maybe tailor the complexity of your observability stack to the size, experience, and structure of your teams. What are some other things that you both could name in your ideal observability stack? ([View Highlight](https://read.readwise.io/read/01fv08kq0qahxrktwfkbap1er9))
> You need it to be simple: just take it out, push it into your leg, and it’s done. You don’t have to think about it. The same goes for monitoring or observability systems as well, I would say. ([View Highlight](https://read.readwise.io/read/01fv08m9m819d6cqdtx950bvt3))
> We did this hackathon project actually for eBPF monitoring. And the whole reason that I got into it was because the promise was: you install this binary, you drop it into your cluster or on your local machine, and then it just kind of magically exposes all of these things that I normally would have had to set up node exporters for. We’re living in a golden age of monitoring and observability, I think. The things that are possible, I can’t even keep up anymore, talking about ergonomics. ([View Highlight](https://read.readwise.io/read/01fv08n3exmf642xayy3mmsxgt))
> eBPF is a fantastic technology. We are fundamentally given a virtual machine to run logic inside the kernel, and that allows us to do, in a centralized place, things that in the past were almost impossible. ([View Highlight](https://read.readwise.io/read/01fv08nm7349shpgtcg0qf3b94))
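A taste of that in-kernel virtual machine, sketched with the bcc Python bindings. This assumes the iovisor/bcc toolkit is installed and root privileges; the probe is the classic hello-world example, not anything from the episode.

```python
# Trace clone() syscalls from inside the kernel with a tiny eBPF program.
# Requires the iovisor/bcc toolkit and root; the probe choice is illustrative.
from bcc import BPF

# This C snippet is compiled to eBPF bytecode and verified by the kernel
# before it runs, which is what makes the approach safe to deploy broadly.
program = r"""
int on_clone(void *ctx) {
    bpf_trace_printk("clone() called\n");
    return 0;
}
"""

b = BPF(text=program)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="on_clone")
print("Tracing clone() syscalls... Ctrl-C to stop")
b.trace_print()  # streams bpf_trace_printk output line by line
```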
> People look at distributed tracing as a way of figuring out what is slow. And it works remarkably well as long as the bottleneck is in between processes. The moment you want to drill down into why this request, inside this process, was taking too long, you see that tracing is not so suited anymore. You end up looking at the leaf instead of the forest, because there are so many more requests going through the system that, unless you have a lot of spans that are very, very detailed (fundamentally attaching thread-level profiling to traces, which costs a huge amount of overhead), you are not going to see it. ([View Highlight](https://read.readwise.io/read/01fv08pwn9gvqrxcxyt75m3crv))
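A small sketch of that limit with the OpenTelemetry Python SDK, assuming `pip install opentelemetry-sdk`; the span names and the hot function are invented. The cross-process picture is cheap, but everything inside `slow_computation` collapses into one opaque span unless you pay the overhead of much finer instrumentation or profiling.

```python
# Tracing shows the time *between* and *around* spans, not inside them:
# the hot loop below is invisible unless it gets its own instrumentation.
# Assumes `pip install opentelemetry-sdk`; names are illustrative.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("demo")

def slow_computation() -> int:
    total = 0
    for i in range(10_000_000):  # the leaf-level work a trace cannot see
        total += i * i
    return total

with tracer.start_as_current_span("handle_request"):        # visible: the forest
    with tracer.start_as_current_span("slow_computation"):  # visible: one leaf
        slow_computation()  # invisible: why the leaf is slow
```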
> Well, actually, okay, can I make a shameless plug for a link? If you go on Google: WWW.ubuntu.com/whatisobservability. There you see an overview of the things that I find most meaningful about monitoring different types of systems. ([View Highlight](https://read.readwise.io/read/01fv08ry87fe8nhz9spbvg0h6c))