%%
date:: [[2022-10-28]], [[2022-11-03]], [[2023-11-27]], [[2023-11-30]], [[2023-12-04]], [[2024-01-16]]
%%

# [[Distributed tracing]]

Distributed tracing is the process of tracking a request or transaction as it moves through the different components of a system. It is a pillar of [[Observability]], along with metrics and logs.

Metrics and logs both record events from the point of view of individual application components. Distributed tracing takes a different approach: it follows the path of a single request through the application.

You can think of a trace as a type of structured [[Logs|log]] that includes context, correlation, hierarchy, and other information. [^otel]

<iframe width="560" height="315" src="https://www.youtube.com/embed/zDrA7Ly3ovU?si=A_klewkaYA4KLvDx&amp;start=353&amp;end=950" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

[^goh]

<iframe width="560" height="315" src="https://www.youtube.com/embed/ZirbR0ZJIOs" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

## Why do distributed tracing?

Distributed tracing is driven by the stimuli introduced to an application: it closely examines the effect of a single transaction. It is useful for performance or latency optimization as well as for root cause analysis of identified issues. [^daniel1]

Distributed tracing is useful, if not essential, when testing [[Microservices]] or [[Service-oriented architecture]]s because of the [[Disadvantages of microservices#Increased difficulty in troubleshooting|increased difficulty in troubleshooting]].

It helps us:

- Monitor service health
- Perform [[Root cause analysis]] to get to the bottom of an issue
- Improve application performance [^simme]

### What does a trace show that metrics and logs don't?
A trace shows *where* in a system the operations triggered by a user action went. In a way, it can be seen as a sort of [[Lead indicator]], whereas metrics and logs tend to be [[Lag indicator|lag indicators]]: when

### What's the difference between tracing and [[Continuous Profiling|profiling]]?

## Traces and spans

A span is the smallest unit of a trace: it represents a single operation and records how long that operation took. Spans also appear in a typical network waterfall chart.

![Screenshot including spans from Chrome DevTools](/assets/devtools-span.png)

A trace is a [[Directed acyclic graph (DAG)]] of spans, [^Jaeger] and it is presented as a diagram that flows from one node to the next.

![Spans and traces in Jaeger](/assets/jaegerspans.png)
[^Jaeger]

A trace is like an E-PASS device that tags cars at certain toll points, measuring the time between two points along with metadata like the license plate and ownership details.

## The trace process

### The header

Tracing typically starts by adding a header to a request. The header contains a unique string that is used as the trace ID. The header itself is normally plain text; [[Protobuf]] is more commonly the encoding used when span data is exported.

### The propagator

The propagator determines the format used by the tracing headers. Requests with headers are then sent in batches to a distributed tracing tool. Several formats exist (W3C, Jaeger, B3, OT), but the most common is W3C.

### The exporter

The exporter sends the collected trace data to an endpoint, where it can be visualized.

## [[Distributed tracing protocols]]

## [[Observability Tools#Distributed tracing Traces|Distributed tracing tools]]

## [[Distributed tracing and performance testing]]

## See also

- [[Observability]]

## References

[^Jaeger]: [Jaeger](https://www.jaegertracing.io/docs/1.22/architecture/)
[^daniel1]: [[k6 Tech Talk 20210303]]
[^simme]: Aronsson, S. (2021).
_Intro to distributed tracing with Tempo, OpenTelemetry, and Grafana Cloud._ Retrieved from https://grafana.com/blog/2021/09/23/intro-to-distributed-tracing-with-tempo-opentelemetry-and-grafana-cloud . [[Intro to Distributed Tracing With Tempo, OpenTelemetry, and Grafana Cloud|My highlights]].
[^goh]: Elliott, J., van der Hoeven, N., and Balogh, P. (2023). [[GOH 22 - How to get started with Tempo with Joe Elliot]]
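
As a supplement to "The header" and "The propagator" under "The trace process": the sketch below shows roughly what a W3C-style tracing header looks like and how a service might read it and pass it on. It is a minimal illustration and not how any particular tracing library implements propagation. The header name `traceparent` and its `version-traceid-parentid-flags` layout come from the W3C Trace Context format; the names `TraceContext`, `new_context`, `inject`, `extract`, and `child_of` are hypothetical, and only version `00` is handled.

```python
import re
import secrets
from typing import Dict, NamedTuple, Optional


class TraceContext(NamedTuple):
    """Hypothetical container for the values carried by a traceparent header."""
    trace_id: str   # 32 lowercase hex characters, shared by every span in the trace
    span_id: str    # 16 lowercase hex characters, unique to one span
    sampled: bool   # lowest bit of the flags field


# version "00", then trace ID, parent span ID, and flags, separated by dashes
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_context(sampled: bool = True) -> TraceContext:
    """Start a new trace with a random trace ID and span ID."""
    return TraceContext(secrets.token_hex(16), secrets.token_hex(8), sampled)


def inject(ctx: TraceContext, headers: Dict[str, str]) -> None:
    """Write the context into outgoing request headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(headers: Dict[str, str]) -> Optional[TraceContext]:
    """Read the context from incoming headers; return None if absent or malformed."""
    match = _TRACEPARENT.fullmatch(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are not valid in this format.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 1))


def child_of(parent: TraceContext) -> TraceContext:
    """Create a context for a downstream span: same trace ID, new span ID."""
    return TraceContext(parent.trace_id, secrets.token_hex(8), parent.sampled)
```

A service receiving a request would call `extract` on the incoming headers (or `new_context` if nothing is found), create a `child_of` context for its own work, and `inject` that into any requests it makes downstream. Because the trace ID is unchanged at every hop, the tracing tool can later stitch the spans into one trace.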