---
Title: Observability — A 3-Year Retrospective - The New Stack
Author: Charity Majors
Tags: readwise, articles
date: 2024-01-30
---

# Observability — A 3-Year Retrospective - The New Stack

![rw-book-cover](https://cdn.thenewstack.io/media/2019/08/4c97de5c-cube-2366515_1280.jpg)

URL:: https://thenewstack.io/observability-a-3-year-retrospective/
Author:: Charity Majors

## AI-Generated Summary

A summary of the observability movement over the past three years.

## Highlights

> Like so many other terms in software engineering, “observability” is a term borrowed from an older physical discipline: in this case, control systems engineering. Observability is the mathematical dual of controllability. ([View Highlight](https://read.readwise.io/read/01ff0xxm9tjs3jjt944r94sy23))

> *“Less formally, this means that one can determine the behavior of the entire system from the system’s outputs. If a system is not observable, this means that the current values of some of its state variables cannot be determined through output* [*sensors*](https://en.wikipedia.org/wiki/Sensors)*.”* ([View Highlight](https://read.readwise.io/read/01ff0xy2jg37d554x9av28d8s7))
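The control-theory sense quoted above has a standard textbook formulation for linear time-invariant systems, sketched below as general background; the state-space symbols (A, B, C, x, u, y, n) are the usual notation and are not taken from the article.

```latex
% Standard state-space setup: \dot{x} = A x + B u (dynamics), y = C x (outputs).
% The pair (A, C) is observable iff the observability matrix has full rank n:
\[
  \mathcal{O} =
  \begin{bmatrix}
    C \\ C A \\ C A^{2} \\ \vdots \\ C A^{\,n-1}
  \end{bmatrix},
  \qquad
  \operatorname{rank}(\mathcal{O}) = n .
\]
% Duality with controllability: (A, C) is observable exactly when
% (A^{\top}, C^{\top}) is controllable, which is the sense in which
% observability is "the mathematical dual of controllability."
```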
> About 10 years ago, a crop of new providers emerged under the APM (Application Performance Management) umbrella. NewRelic, AppDynamics, and others were billed as a better way to understand your application code. Instead of using an agent, you might install their libraries in your code, which would then track requests and report on language internals as well as request specifics. They usefully generated lots of top-10 lists so you could understand where your performance problems were coming from — by endpoint, by query, and so forth. ([View Highlight](https://read.readwise.io/read/01ff0xywt4y2jfgf6kdpa7q2an))

> These tools were a major step forward. They were still mostly metrics-based under the hood, but the perspective shift from third-party observer to the first-person observer allowed for far greater introspection of your software and its behavior. ([View Highlight](https://read.readwise.io/read/01ff0xz0zqqky6ws0477ba1d6p))

> *the kinds of systems we were building were fundamentally different than the systems those tools were developed to understand*. ([View Highlight](https://read.readwise.io/read/01ff0xzkcxhebc1xh877hted6e))

> These older tools, once revolutionary, simply no longer work for our current systems. ([View Highlight](https://read.readwise.io/read/01ff0xzrsrrf7p8a4rgpe58d5j))

> Cardinality and Its Relation to Complex Distributed Systems ([View Highlight](https://read.readwise.io/read/01ff0xzxctbmrpah3dv5g81nfs))

> Cardinality refers to the number of unique items in a set. Any unique ID will always be the highest possible cardinality, and a single value will always be the lowest possible cardinality. If you had a collection of a hundred million user records, you can guess confidently that Social Security numbers will have the highest possible cardinality; first name and last name will be high cardinality, though lower (because some names repeat); gender will be fairly low-cardinality, and “Species: human” will, presumably, be the lowest possible cardinality, should you actually bother to record it. ([View Highlight](https://read.readwise.io/read/01ff0y0d989rypcq2rbw95ydec))

> Why does this matter? Because *high-cardinality information is the most useful data for debugging* or understanding a system (think user IDs, shopping cart IDs, request IDs … basically any IDs and also instances, container, build number, span ID, etc.). Unique IDs will always do the best job of identifying individual needles in a given haystack. ([View Highlight](https://read.readwise.io/read/01ff0y0q710p7x50f0kad3xhzg))

> Yet metrics-based tooling systems can only deal with *low-cardinality dimensions at scale*. ([View Highlight](https://read.readwise.io/read/01ff0y0vw9gaxdymcnwn58qdg6))

> Monitoring tools are effective for systems with a stable set of known-unknowns, and relatively few unknown-unknowns. For a system with predominantly unknown-unknowns, monitoring tools were all but useless. ([View Highlight](https://read.readwise.io/read/01ff0y27gqz1xs7dm7vzpbm6ss))

> I’m explaining this the long, painful, manual way of narrowing down your hypotheses step by step. Debugging this way involves lots of small, verifiable hypotheses, one after another, like breadcrumbs. It works *only* because I can break down by every dimension, including ones with very high cardinality. It works *only* because I have done no pre-aggregation before writing records to disk; in fact, every query I issue aggregates at read time to answer my question. It works *only* because I have gathered the data at the right level of abstraction, oriented around the request and its units of work — because the request is what maps to the user’s real lived experience. ([View Highlight](https://read.readwise.io/read/01ff0y4kkh9gnzfa76gj8a9g7x))
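A rough sketch of what "break down by any dimension, with no pre-aggregation, aggregating at read time" can look like over raw request-scoped events; every field name and value below is invented for illustration and is not from the article or any particular tool.

```python
from collections import defaultdict

# Hypothetical raw, request-scoped events: nothing is pre-aggregated, and
# high-cardinality fields (request_id, user_id, build_id) are kept as-is.
events = [
    {"request_id": "req-9001", "user_id": "u-17", "endpoint": "/cart",
     "build_id": "b-3412", "status": 500, "duration_ms": 1840},
    {"request_id": "req-9002", "user_id": "u-42", "endpoint": "/cart",
     "build_id": "b-3412", "status": 200, "duration_ms": 95},
    {"request_id": "req-9003", "user_id": "u-17", "endpoint": "/checkout",
     "build_id": "b-3411", "status": 500, "duration_ms": 2210},
]

def cardinality(field):
    """Number of unique values seen for a field: IDs are high, status codes low."""
    return len({e[field] for e in events})

def break_down(field, where=None):
    """Aggregate at read time: group raw events by any dimension, even a unique ID."""
    where = where or (lambda e: True)
    groups = defaultdict(list)
    for e in events:
        if where(e):
            groups[e[field]].append(e["duration_ms"])
    return {k: {"count": len(v), "max_ms": max(v)} for k, v in groups.items()}

# Breadcrumb-style narrowing: which users hit errors, and on which builds?
print(cardinality("request_id"), cardinality("status"))
print(break_down("user_id", where=lambda e: e["status"] == 500))
print(break_down("build_id", where=lambda e: e["status"] == 500))
```

The same `break_down` call works for a unique ID and for a two-value status field, because nothing was rolled up before the query ran.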
> By using events and passing along the full context, conversely, I can ask any question of my systems and inspect their internal state; therefore I can understand any state my system has gotten itself into — even if I have never seen it before, never conceived of it before! I can understand anything that is happening inside my system, any state it may be in — without having to ship new code to handle the state. This is key. This is observability. ([View Highlight](https://read.readwise.io/read/01ff0y5h8prc60sy0cxe3nnweq))

> A system is observable to the extent that you can understand new internal system states without having to guess, pattern-match, or ship new code to understand that state. This, to me, is the most useful way to extend the control theory concept to software engineering. ([View Highlight](https://read.readwise.io/read/01ff0y6mx27vd9r44nt1chvznk))

> This is a technical distinction worth preserving and distinguishing from mere telemetry, because the ability to understand unknown-unknown states is what so many teams currently lack, and that lack is actively hurting them every single day. With modern distributed systems and the platform-ification of services, unknown-unknowns are most of what you will have to deal with for the rest of your life. It’s worth getting good at explaining them. It’s worth preserving a technical vocabulary for these solutions. ([View Highlight](https://read.readwise.io/read/01ff0y6wbf312b4ga89hpmxeq0))

> Microservices communities and Kubernetes adopters were also early and quick to embrace observability. Because once you’ve blown up (or decomposed) the monolith, most of your “traditional” debugging tools no longer work. You have to return to first principles and make all these decisions again, and aggregation on the request ID becomes of paramount importance; the hardest part is figuring out where the problem is in your distributed system, not debugging the code itself. ([View Highlight](https://read.readwise.io/read/01ff0y8yqxarj4j9cgd4ffy15d))

> And between 2017 and 2018, literally every vendor in the monitoring, APM and log management market segments added the term observability to their content, sites and marketing language. Which brings us to the next section on how many misuse observability and misguide others. ([View Highlight](https://read.readwise.io/read/01ff0y9md1vea3ysbzbr1tpdjr))

> In early 2018 I noticed that vendors had latched on to “distributed tracing, metrics, and logs” as “three pillars of observability.” Ben Sigelman neatly [debunked](https://lightstep.com/blog/three-pillars-zero-answers-towards-new-scorecard-observability/) this, saying: *it makes no sense because those are just three data types. You may achieve [observability] with all three, or none — what matters is what you do with the data, not the data itself.* ([View Highlight](https://read.readwise.io/read/01ff0ya56n2qnq2r0chkd39yp0))

> If you attend an industry conference today, you’re likely to hear the speakers adhere to the accurate definition — that observability is how you explain unknown-unknowns, that it’s about exploration and debugging instead of dashboards and pattern matching or accessing certain data types. ([View Highlight](https://read.readwise.io/read/01ff0yay3azj93bsr45qstkyxn))
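To make the "events with full context" idea and the emphasis on aggregating by request ID more concrete, here is a minimal hypothetical sketch of two services sharing one request ID and each emitting a wide, request-scoped event; the service names, header name, and helper functions are invented and do not represent the article's or any vendor's API.

```python
import json
import time
import uuid

def emit(event):
    """Hypothetical sink: in practice this would go to an event store, not stdout."""
    print(json.dumps(event))

def handle_checkout(headers, user_id):
    """Front-end service: continues (or starts) a request, then calls a downstream service."""
    request_id = headers.get("x-request-id", str(uuid.uuid4()))  # reuse caller's ID if present
    start = time.monotonic()
    payment_ok = charge_payment({"x-request-id": request_id}, user_id, amount_cents=1299)
    emit({
        "service": "checkout",
        "request_id": request_id,          # the join key across services
        "user_id": user_id,
        "payment_ok": payment_ok,
        "duration_ms": (time.monotonic() - start) * 1000,
    })
    return payment_ok

def charge_payment(headers, user_id, amount_cents):
    """Downstream service: receives the propagated request ID and emits its own wide event."""
    start = time.monotonic()
    ok = amount_cents < 5000  # stand-in for a real payment call
    emit({
        "service": "payments",
        "request_id": headers["x-request-id"],  # same ID, so events aggregate per request
        "user_id": user_id,
        "amount_cents": amount_cents,
        "ok": ok,
        "duration_ms": (time.monotonic() - start) * 1000,
    })
    return ok

handle_checkout({}, user_id="u-17")
```

Because both events carry the same `request_id`, a read-time query can stitch a single request back together across services, which is where the highlight above locates the hard part of distributed debugging.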
> For example, observability is impossible without:
> • raw events
> • high cardinality dimensions
> • no pre-aggregation, no pre-indexing (which lock you into asking predefined questions)
> • read time aggregation
> • arbitrarily wide events
> • schema-less-ness
> • structured data
> • oriented around the lifecycle of the request
> • batched up context
> • not metrics-based
> • static dashboards don’t work, it must be exploratory
> • etc.
> And then we built honeycomb precisely to that spec. ([View Highlight](https://read.readwise.io/read/01ff0ybmptvjn2dhdjg8dmgmgp))

> Engage in Constant Conversation with Your Code ([View Highlight](https://read.readwise.io/read/01ff0ycff04gzd0m0377zpv8y8))

> For Engineers … but also Engineering-adjacent Teams ([View Highlight](https://read.readwise.io/read/01ff0yd1f1rwy2cesjwv4dydvf))
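As a closing illustration of the requirements list above (raw, arbitrarily wide, structured, schema-less, request-oriented events), here is one hypothetical event; all keys and values are invented, and the point is only that new fields, including high-cardinality IDs, can be added without a predefined schema.

```python
# One hypothetical wide event for a single request: structured, schema-less
# (just a map of keys to values), and as wide as the service cares to make it.
wide_event = {
    # request lifecycle
    "request_id": "req-9001",
    "endpoint": "/cart/add",
    "status": 500,
    "duration_ms": 1840.2,
    # high-cardinality context that pre-aggregated metrics would have to drop
    "user_id": "u-17",
    "cart_id": "cart-88412",
    "build_id": "b-3412",
    # batched-up context gathered along the way
    "db_query_count": 7,
    "cache_hit": False,
    "feature_flags": ["new-pricing", "cart-v2"],
    "error": "timeout talking to payments",
}
# New keys can be added per request, with no schema migration and no new metric names.
wide_event["retry_count"] = 2
```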