Trends in Observability - Fork My Brain

%% date:: "[[2022-08-12]], [[2022-02-25]], [[2022-02-11]], [[2022-02-03]], [[2022-07-18]], [[2022-08-29]], [[2023-11-27]], [[2024-11-07]]" parents:: [[Observability]] %%

# [[Trends in Observability]]

<iframe width="560" height="315" src="https://www.youtube.com/embed/4NPCzt5eF9w" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

*Paul Balogh, Daniel González Lopes, Simme Aronsson, and I talk about trends in observability.*

## Observability maturity levels

Companies are at different levels of observability maturity, depending on which [[Telemetry]] signals they've adopted and to what extent.

![[ryanperry-org-maturity.png]]
Four stages of observability adoption by telemetry signal [^perry]

According to the graphic by Ryan Perry above, there are four stages of organizational maturity per telemetry signal:

- *Play with it:* The company is collecting some data, but in a limited capacity or with limited results. This phase is more exploratory or proof-of-concept.
- *Formalize it:* The company is sold on the significance of the telemetry signal and uses it in production, but mostly for real-time use.
- *Centralize it:* The company sees the benefit of collecting historical data and storing it in a generic [[Database]].
- *Optimize it:* The company stores enough data that it's worth putting the data in a built-for-purpose database.

## Factors driving trends in observability

To understand the latest trends in observability, it may be useful to think about *why* the industry is changing.

### Awareness of the limitations of microservices-based architecture and implementation

We've seen a shift from [[Monolith]]s to [[Microservices]], but now that the dust has settled, we're also beginning to see some of the disadvantages of a microservices-based architecture. [[Containerization]], both in the application and observability stacks, can lead to increased complexity.
[^brewer] We're especially seeing a growing acknowledgement of the increased complexity of microservices, as teams struggle to make these types of systems [[Observability|observable]], [[Reliability|reliable]], [[Availability|available]], and [[ScalabilitY|scalable]]. As an industry, the honeymoon period with microservices is over, and we're more skeptical or cautious about the wholesale adoption of microservices. This healthy skepticism has spurred the development of technologies for making microservices more observable.

### Emergence of observability as a system

The complexities of making a system observable have caused observability to evolve from an ancillary property of a system to a system in itself. Just like any system, observability services and tools are beginning to be scrutinized in terms of their scalability, availability, and performance.

#### Redundant observability

It's become a best practice to build redundancy into observability platforms by deploying replicas that can share the load of observability functions.

#### Available observability

Observability systems themselves can be susceptible to failure and outage, especially if they are not decoupled from core application functions. A good observability stack should be independent of the systems it's trying to monitor [^michele]. A recent example of the danger of shared infrastructure between observability tools and application services is [[Roblox Return to Service 1028-1031 2021|Roblox's post-mortem]] analysis of their 2021 outage.

#### Performant observability

Active, or [[Pull-based monitoring]], methods have grown in popularity, mainly because this approach tends to result in less overhead on the application's servers, less susceptibility to network issues, and lower network congestion. However, active monitoring has still not completely replaced the need for [[Push-based monitoring]].
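
As a rough illustration of the difference between the two models, the sketch below keeps everything in memory. `Service`, `Collector`, `scrape`, and `receive` are hypothetical names made up for this example, not part of any real monitoring tool, and a real setup would make these calls over the network. The only point is to show which side initiates the transfer.

```python
"""Sketch: pull-based vs. push-based metric collection, in memory only."""
from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical application component that counts handled requests."""
    name: str
    requests_handled: int = 0

    def handle_request(self) -> None:
        self.requests_handled += 1

    def metrics(self) -> dict:
        # The service only reports its current state when asked.
        return {"requests_handled": self.requests_handled}


@dataclass
class Collector:
    """Hypothetical metrics backend that stores samples per service name."""
    samples: dict = field(default_factory=dict)

    def scrape(self, service: Service) -> None:
        # Pull: the collector decides when to ask the service for metrics.
        self.samples.setdefault(service.name, []).append(service.metrics())

    def receive(self, service_name: str, sample: dict) -> None:
        # Push: the service (or device) decides when to send a sample.
        self.samples.setdefault(service_name, []).append(sample)


collector = Collector()

# Pull-based: the collector scrapes a long-running service on its own schedule.
api = Service("api")
api.handle_request()
api.handle_request()
collector.scrape(api)

# Push-based: an intermittently connected device reports its own sample.
sensor = Service("sensor")
sensor.handle_request()
collector.receive(sensor.name, sensor.metrics())

print(collector.samples)
```

In the pull case, the service does no work until the collector asks. In the push case, the sender has to know where the collector is and when to report, which suits devices that the collector cannot reach directly.
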
There are still many use cases where passive, or push-based, monitoring is dominant, such as in the case of the [[Internet of things]]. [^michele]

### The complexity of observability leads to a focus on ergonomics

Observability should focus not just on systems, but also on whether observability tools are usable. Observability tools are only usable if the engineers responsible for maintaining a system have the experience and knowledge necessary to identify and remediate issues. New frameworks and protocols used in building applications also increase the complexity of making those applications observable, leading to fragmented monitoring [^brewer]. That fragmentation makes tools that aggregate and simplify all the more valuable.

## Trends

So, what's actually changing to address the concerns above?

### [[Distributed tracing]]

Distributed tracing is necessary in a microservices-based architecture because it is more difficult to track down issues when a request goes through many services. Distributed tracing allows us to follow the path of a request throughout a system, letting us perform root cause analysis more effectively.

### [[eBPF]] monitoring in addition to [[Instrumentation]]

Until recently, observability has required instrumentation as a given; that is, agents must usually be installed to expose metrics from application components. Typically, multiple agents are required in a production-ready system: one for every component. In microservices-based architectures, instrumentation can take up a large part of engineers' time. Instrumentation can also be tricky for non-standard or non-mainstream services or protocols.

eBPF monitoring involves running programs in an operating system [[Kernel]] without having to change the kernel code itself or use kernel modules. These programs could be used to improve observability by automatically exposing data about common protocols, reducing the instrumentation required to make a system observable.
eBPF is still in its infancy, though many companies have begun to release tools in this space:

- [[Pixie]] by [[New Relic]]
- [[Datadog]]'s agent [collects data via eBPF](https://docs.datadoghq.com/network_monitoring/performance/setup/?tab=agentlinux)
- [[Parca]] by [[Polar Signals]]

eBPF monitoring may bring companies closer to [[Unified Observability]].

### [[Continuous Profiling]]

### Commoditization of telemetry data

It's getting cheaper and cheaper to collect and store telemetry data, and as a result, companies' collection and storage requirements keep growing.

### [[Frontend observability]]

While observability has traditionally been firmly in the domain of backend infrastructure, frontend observability tools like [[Faro]] provide visibility into an application's frontend, making information about the [[Frontend performance testing|frontend]] just as important as information about the backend.

## Other resources

### A k6 Office Hours episode about trends in observability

<iframe width="560" height="315" src="https://www.youtube.com/embed/SD5Y4QejEu4" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[^michele]: Mancioppi, M. (2022). *Canonical observability with Simon Aronsson and Michele Mancioppi*. [[Canonical Observability With Simon Aronsson and Michele Mancioppi|My highlights.]]
[^brewer]: Brewer, B. (2022). *4 emerging trends point to changes in the observability landscape*. Retrieved from https://venturebeat.com/2022/01/17/4-emerging-trends-point-to-changes-in-the-observability-landscape-in-2022
[^perry]: Perry, R. (2023). *A Tale of Two Flamegraphs: Unlocking Performance Insights in a Diverse Application Landscape*. KubeCon NA 2023. Retrieved from [YouTube](https://www.youtube.com/watch?v=XES5Irk08qw).