Your users are complaining that your application is slow. Sometimes it takes 8 seconds to respond; at other times, two. Yet when you check your metrics, everything looks fine. Average response times are acceptable, all services are healthy, and your dashboards are green. So either your users are idiots or you're not capable of capturing what's actually happening with their requests. Now, I tend to assume users are right, which means I would have to call you... well, I'm not going to do that. Instead, I'm going to show you why you can't see what's really happening. So here's what you're about
to learn. You will see exactly how to
track requests as they flow through
dozens of microservices. Identify which
specific operation is causing delays and
understand why your traditional
observability tools are lying to you. By
the end of this video, you will know how
to implement distributed tracing that
actually shows you what's happening in
your system. Let's start with why this
problem exists in the first place.
Here's a common challenge for
engineering teams. You're shipping
features, reviewing PRs, and running
sprints, but you don't really know where
the bottlenecks are. Why do some PRs sit
in review for days? Why do certain
branches go stale before anyone notices?
Without visibility, those problems
compound. That's where the sponsor of
this video comes in. DevStats is an
engineering analytics platform that
gives leaders visibility into their
software delivery processes. Two things
stand out. First, it helps improve PR
flow by identifying review bottlenecks.
You can see which PRs are waiting too
long, who's overloaded, and where
handoffs are breaking down. Teams using
this ship faster because they catch the
friction early. Second, there's
real-time blocker detection. DevStats spots stale branches and aging issues before they delay releases. Instead of
discovering a problem during the sprint
retrospective, you catch it while
there's still time to fix it. The
platform also tracks DORA metrics, sprint planning accuracy, and lets you
benchmark against industry standards. If
you want to see where your engineering
time is actually going, check out
devstats.com. The link is in the
description. Big thanks to DevStats for
sponsoring this video. And now, let's
get back to the main subject.
Let's say your application is
experiencing intermittent slowdowns.
Sometimes a request takes longer and at
other times it takes less time. The
inconsistency is frustrating because you
cannot predict when it will happen or
why. So what do you do? You check your
metrics and here's the kicker. The
average response time looks acceptable.
All your services report healthy.
According to your traditional
observability tools, everything is fine
but clearly it's not. From the user's
perspective, it looks simple. A user
makes a request to your application and
gets back a response. That's how it
appears on the surface. The problem is
that you're dealing with a complex
system made up of dozens of
microservices or dozens of applications.
It's so complex that you honestly don't know the exact path each request takes through the architecture. You've got an API gateway, an auth service, a payment service, an inventory service, an external payment processor like Stripe, a database, a warehouse service, and many more. Sure, you have a general idea of how things are supposed to flow, but the reality is far more intricate.
So, here's the fundamental issue. What appears to be one request flowing through your system is actually many separate,
independent requests. When a user makes
a request, that single action triggers a
cascade of individual service calls, each one independent from the others. Each service in your architecture receives an incoming request, does its work, and then makes separate outgoing requests to other services. The user's request hits the API gateway, which then calls the auth service, the payment service, and, let's say, the inventory service.
But it doesn't stop there. The payment
service calls Stripe. The inventory
service queries the database and might
also call the warehouse service. It's a
massive web of interconnected calls
right? But here's the catch. Each service only sees its own incoming and outgoing requests; there is no inherent way for a service to know that its work is part of the same logical transaction that started with that user's request. From each service's perspective, it's just handling independent requests. Your logs
from individual services show that
operations completed successfully. The auth service logged that it validated the token. The payment service logged that it processed the payment. The inventory service logged that it updated stock. But you cannot correlate those log entries across services. You cannot tell that
they are all part of the same single
user action. So you're left with
critical questions that remain unanswered. Which microservices are
actually involved in the slow requests?
What is the actual request path through
the system? Which service is causing the
delay? And within that service, which
specific operation is slow? Is it a
database query, an external API call, a
cache lookup or some business logic? Is
it always the same service and operation
causing the problem or does it differ?
Is the bottleneck in inter-service communication or network latency? Are
there cascading timeouts or retry storms
happening? Where exactly are those extra
6 seconds being spent? Is it between
services or within a specific operation?
What you really, really, really need is
the ability to follow a single request
across all service boundaries. You need
to map its complete journey through your
system and identify exactly where the
bottleneck is. And this is where distributed tracing becomes essential. It's the tool that finally gives you visibility into what's really happening across your entire distributed system.
So let's explore how it works and how
you can implement it in your own
applications.
OpenTelemetry tracing gives you distributed tracing capabilities to track requests as they flow through your microservices and distributed systems. It is the solution to the visibility problem that we discussed earlier. So let's
start with the fundamentals. Traces
represent the complete journey of a
logical transaction through your system
connecting all the separate requests
that were triggered by a single user
action. Remember, when a user clicks, let's say, checkout, that's not one request moving through your services.
It's dozens of separate independent
requests. The trace connects them all
together. Now within each trace you have
spans. Spans are individual units of work, like a database query, an HTTP request to another service, or a function call. Each span contains timing data
showing how long that specific operation
took. It has attributes providing
metadata about what happened. It records events marking significant moments like errors or retries. And it maintains
parent and child relationships that show
how spans connect to each other. The
magic that ties all that together is
context propagation. And this has nothing to do with AI context; this is OpenTelemetry context. It ensures that the trace ID and span information flow across service boundaries, so every service knows its work is part of the same logical transaction. Now, how do you
actually add tracing to your
applications? There are two main approaches: automatic instrumentation and manual instrumentation. Let's start with automatic instrumentation, which uses agents or libraries to instrument common frameworks without requiring code changes. It's quick to set up with
minimal code modifications. You can be
up and running in minutes. It covers
common frameworks out of the box like
HTTP servers, databases and messaging
systems. It has less maintenance burden since the instrumentation is handled for you, not by you, and on top of that it provides consistent instrumentation across all your services. The downsides? It gives you limited visibility into custom business logic. It may capture too much or too little data depending on your needs. You get less control over span names and attributes, and it can add overhead to all operations since it instruments everything by default.
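To make the comparison concrete, here's a minimal sketch of what an automatic instrumentation setup looks like for a Node.js service, assuming the standard @opentelemetry/sdk-node and @opentelemetry/auto-instrumentations-node packages; the service name and endpoint are illustrative.

```typescript
// Minimal auto-instrumentation bootstrap for a Node.js service (a sketch, not a complete setup).
// Import this module before any application code so libraries get patched early.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: "checkout-service", // illustrative name; use your own
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4318/v1/traces", // default OTLP/HTTP path; point it at your collector or backend
  }),
  // Instruments common frameworks (HTTP, Express, gRPC, popular database clients, and more) without code changes.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```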
And that's where manual instrumentation kicks in. That's where you write
explicit code to create custom spans for
your business logic. This gives you full
control over what gets traced. You can
add business specific context and
attributes that matter to your domain.
You can trace only critical parts to
reduce overhead and you get better span
naming for your domain operations. And the trade-offs? Well, it requires code changes and ongoing maintenance. It needs developer effort to instrument properly. It risks inconsistent instrumentation across teams if everyone does it differently, and it can be forgotten during, you know, rapid development when people are moving fast.
Here's my take. Auto instrumentation is fine for getting started. It will show you what tracing looks like, help you understand the concepts, and give you something to experiment with. But don't mistake it for the destination. Real observability requires you to understand what you're measuring and why. Auto instrumentation captures a bunch of noise you don't care about while missing the stuff that actually matters to your business logic. Manual instrumentation forces you to think about what's important. So use auto instrumentation to learn, then move to manual instrumentation and instrument what actually matters. That's where the real value is. In other words, auto instrumentation sucks, but use it if you have to. Now let's look at a real
example of manual instrumentation in
action. We will extract some code from a
TypeScript file that shows you how to
create a custom span for tracing tool
executions. This code creates a custom span for tracing tool executions. You get a tracer instance, start a span with a descriptive name that includes the tool being executed, set the span kind to internal since this is business logic within your process, my process actually, and add custom attributes that capture the tool name and its input arguments. That's it. It's
that easy. And this is manual
instrumentation at work. You're
explicitly deciding what to trace and
what context to capture.
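Here's a minimal sketch of that pattern in TypeScript, assuming the standard @opentelemetry/api package; the tracer name, attribute keys, and the executeTool helper are illustrative rather than the actual code from the project.

```typescript
import { trace, SpanKind, SpanStatusCode } from "@opentelemetry/api";

// Hypothetical helper representing the actual tool invocation.
declare function executeTool(name: string, args: Record<string, unknown>): Promise<unknown>;

// Illustrative tracer name; in a real project this identifies the instrumented component.
const tracer = trace.getTracer("tool-executor");

export async function tracedToolExecution(toolName: string, args: Record<string, unknown>) {
  // Descriptive span name that includes the tool being executed.
  const span = tracer.startSpan(`execute_tool ${toolName}`, {
    kind: SpanKind.INTERNAL, // business logic within this process, not an outgoing call
    attributes: {
      "tool.name": toolName,                  // illustrative attribute keys
      "tool.arguments": JSON.stringify(args),
    },
  });

  try {
    const result = await executeTool(toolName, args);
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (err) {
    span.recordException(err as Error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw err;
  } finally {
    span.end(); // always end the span so it gets exported
  }
}
```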
Now let's see what distributed tracing
looks like in action when you actually
run your instrumented application. This
is the Yagger UI, one of the most
popular open source tracing backends.
You can see it found two traces from the AI MCP service. One trace is for a POST
request to remediate that took around 40
seconds. The other is for version, which actually took under two seconds. Both show multiple spans and have some
errors reported. This is your trace
search interface. This is where you find
things you're looking for. Let's start with the version endpoint. This is just a simple function that returns the status of the application. So how many external interactions do you think such a simple function, the one that returns the status, has? Huh? One, two, maybe none.
Let's find out. Ah, look at that. Over 20 spans. That simple version endpoint made around 20 different calls to the rest of the system. It interacted with Qdrant DB, made quite a few calls to the Kubernetes cluster, and it interacted with the Claude LLM as well as the embedding model. This waterfall view shows you
exactly what happened and when. Each bar
represents a span with its duration. The
hierarchy shows parent and child relationships. You can see the main POST version span at the top, then execute tool version as a child, followed by operations like the Qdrant health check and Kubernetes list namespaces, and there are many others. Notice how some spans
have red indicators showing potential
issues related to the interactions made
in those spans. Those are important as
well. So here's the thing. Even if I'm
the one who wrote all that code and I
know it by heart, which is often not the
case, I would still not be able to deduce all of this without tracing. The complexity is just too damn high, even for such a simple function. Now, let's
look at the remediation endpoint, which
is slightly more complex. This function
interacts with an LLM in a loop. In each
iteration, it sends a context to the
LLM. The AI responds with requests to execute some tools. Then, in the next
iteration, the context is augmented with
the outputs of those tools and sent back
to the LLM, and so on and so forth, until the LLM decides it has all the information to provide a solution. Now here's the critical part. I cannot know in advance how many loop iterations will be done. In some cases it might be zero, in others 20, or any other number. So how would I know that? How would I know what's happening without tracing? Now,
this trace shows multiple tool loop iteration spans, each one making Kubernetes API calls like kubectl get pods, kubectl get deployments, and kubectl get services. The waterfall shows when each operation happened and how long it took. Some spans have red error indicators showing that some things are not right. Now, when you click on a
specific span you get detailed
information about that operation. At the
top you can see the service name
duration and start time. The tag section
shows span attributes which are key
value pairs providing metadata about the
operation. Those are custom attributes
we added through the manual
instrumentation. The gen_ai operation name is tool loop. The gen_ai provider name is anthropic. You can see the specific Claude model used and token usage showing it consumed over 20,000 input tokens and produced over 1,000 output tokens. Now notice the span kind is client, since this is an external call to the LLM. The process section shows which host and pod this ran on, the library name, and the actual command that was executed.
Now let's break down the key components that make distributed tracing work.
Every trace has a trace ID, a unique
identifier for the entire logical
transaction across all services. Each
operation has a span ID which is a
unique identifier for that specific
operation. The parent span ID links
spans together to form a trace hierarchy
showing which operations triggered which
other operations. Then there are span
attributes. Those are key value pairs
you saw in the tag section. They provide
metadata like HTTP methods, status codes
or custom tags you define. And span events are timestamped log entries within a span, capturing things like errors, exceptions, or checkpoints during execution.
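For example, with the @opentelemetry/api package, recording events and exceptions on a span looks roughly like this; the span and event names are illustrative.

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const span = trace.getTracer("example").startSpan("process-payment"); // illustrative span
try {
  // A timestamped event marking a significant moment, with its own attributes.
  span.addEvent("cache.miss", { "cache.key": "user:42" });
  // ... do the actual work here ...
} catch (err) {
  span.recordException(err as Error);             // recorded as an exception event on the span
  span.setStatus({ code: SpanStatusCode.ERROR }); // mark the span as failed
  throw err;
} finally {
  span.end();
}
```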
Now, you might have noticed span kind internal in the code earlier and span kind client in the span details. Those are span kinds, which categorize what type of operation the span represents. OpenTelemetry defines five span kinds: internal for internal operations within your application, client for outgoing requests to external services, server for handling incoming requests, producer for sending messages to queues, and consumer for receiving messages from queues. And you remember those gen_ai
attributes you saw throughout the
examples? Well, those aren't arbitrary
names. OpenTelemetry defines semantic conventions, which are standardized attribute names for specific domains. There are conventions for HTTP, for databases, for messaging systems, for GenAI, for Kubernetes, for cloud providers, and many more. Using those conventions means tracing tools can automatically recognize and display domain-specific information consistently, whether you're tracing HTTP calls, database queries, LLM interactions, or anything else.
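As a rough illustration, GenAI-flavored attributes on a client span might look like this. The attribute keys follow the gen_ai semantic conventions, which are still evolving, so treat the exact names as an assumption and check the current spec; the values are placeholders.

```typescript
import { trace, SpanKind } from "@opentelemetry/api";

// Attribute keys based on the gen_ai semantic conventions (verify against the current spec);
// the model name and token counts below are made up.
const span = trace.getTracer("llm-client").startSpan("chat", {
  kind: SpanKind.CLIENT, // outgoing call to an external LLM provider
  attributes: {
    "gen_ai.operation.name": "chat",
    "gen_ai.provider.name": "anthropic",
    "gen_ai.request.model": "example-model",
  },
});

// After the call completes, record token usage so it shows up in the trace details.
span.setAttributes({
  "gen_ai.usage.input_tokens": 20000,
  "gen_ai.usage.output_tokens": 1000,
});
span.end();
```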
Now, and this is important: in high-traffic production systems you cannot capture every single trace. The volume would be massive and the storage cost would kill you. This is where sampling
strategies come in. There is head-based sampling, which makes decisions at trace creation time. You might sample probabilistically, like, hey, you know what, capture 10% of all traces, or use rate limiting, like capture 100 traces per second. The decision is made up front, before you know anything about the trace. Next, there is tail-based sampling,
which is smarter but more complex. It
makes decisions after the trace
completes, based on criteria like errors or latency. You can say something like, hey, keep all traces with errors, or keep all traces over 5 seconds. This captures the interesting stuff while discarding the boring successful requests. Then
there is always on sampling which
captures everything. It's useful for
development but it's expensive and
impractical in production. And adaptive sampling dynamically adjusts based on traffic patterns, increasing or decreasing the sample rate depending on load.
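As an example, head-based probabilistic sampling in the Node SDK might be configured like this; it's a sketch, the 10% ratio is arbitrary, and the package names assume the standard OpenTelemetry JavaScript SDK.

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { ParentBasedSampler, TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

// Sample roughly 10% of new traces at the root, and follow the caller's decision
// for requests that already arrive with trace context, so traces stay complete.
const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
  // ...exporter and instrumentations configured as shown earlier...
});

sdk.start();
```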
And here's a critical piece: context propagation. Remember,
distributed tracing only works if the
trace ID flows through all your
services. When service A calls service B, service B needs to know it's part of the same trace. For that, OpenTelemetry uses the W3C Trace Context standard for interoperability across vendors and tools. For HTTP requests, special headers carry the trace context: traceparent, which contains the trace ID and span ID, and tracestate, which carries vendor-specific data. Within a single process, context propagation uses context objects that flow through your code. This works across both synchronous and asynchronous operations, so your trace doesn't break just because you're using async/await or callbacks.
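Here's a sketch of what that looks like with the @opentelemetry/api propagation API, assuming the SDK has registered the W3C trace context propagator, which it does by default; the function and service names are illustrative.

```typescript
import { context, propagation, trace } from "@opentelemetry/api";

// Outgoing call: inject the active trace context into HTTP headers.
// Auto-instrumented HTTP clients do this for you; this is the manual equivalent.
function outgoingHeaders(): Record<string, string> {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers);
  // headers now contains a traceparent header carrying the trace ID and span ID,
  // plus tracestate if any vendor-specific data is present.
  return headers;
}

// Incoming request: extract the context from the caller's headers and continue the same trace.
function handleIncoming(incomingHeaders: Record<string, string>) {
  const extracted = propagation.extract(context.active(), incomingHeaders);
  const tracer = trace.getTracer("order-service"); // illustrative service name
  context.with(extracted, () => {
    const span = tracer.startSpan("handle-order"); // becomes a child of the caller's span
    // ... handle the request ...
    span.end();
  });
}
```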
What else? Yeah, OpenTelemetry has libraries available for all major languages like Java, Python, Go, JavaScript, .NET, and whatever else. It supports popular frameworks out of the box, including HTTP servers, databases, messaging systems, and gRPC. So whichever language you're using, there's no excuse not to add tracing. So do it. Do it now. Do it right away, because, you'll see, you will thank me later. Now, once
you've instrumented your code and you capture traces, you need to export that data somewhere. OpenTelemetry uses OTLP, the OpenTelemetry Protocol, as the native protocol for exporting trace data. And here's probably the most important thing, the big power of OpenTelemetry as a standard: it's adopted by almost everyone. Everyone. You write your traces once, once and only once, using OpenTelemetry, and you can send them to any backend without changing your instrumentation code. Open source options like Jaeger, Zipkin, and Tempo, and commercial cloud providers like Datadog, New Relic, and Honeycomb, they all support OpenTelemetry. So you're not locked into a vendor. You can switch backends, use multiple backends simultaneously, or migrate from one to another without touching your application code.
Exporters batch and buffer data to optimize network usage. And you can configure them to send to multiple destinations if you want redundancy or need to support different tools.
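As a sketch, explicit exporter wiring in the Node SDK could look like this; the endpoint is illustrative and assumes an OTLP-capable backend such as Jaeger, Tempo, or an OpenTelemetry Collector listening on the default OTLP/HTTP port.

```typescript
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

// Swapping backends is just a URL change; the instrumentation code stays the same.
const exporter = new OTLPTraceExporter({
  url: "http://localhost:4318/v1/traces", // Jaeger, Tempo, a vendor endpoint, or a collector
});

// The batch processor buffers spans and flushes them periodically to optimize network usage.
const provider = new NodeTracerProvider({
  spanProcessors: [
    new BatchSpanProcessor(exporter, {
      maxExportBatchSize: 512,    // spans per export request
      scheduledDelayMillis: 5000, // flush interval in milliseconds
    }),
  ],
});

// Older SDK versions wire the processor with provider.addSpanProcessor(...) instead.
provider.register(); // sets the global tracer provider and default W3C propagators
```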
So let's bring this back to where we
started. Your users were complaining about slow responses, but your metrics said everything was fine. That's the fundamental problem with distributed systems. You cannot see what's really happening when a user action triggers dozens of separate requests flowing through your microservices or
whatever else you have. Distributed
tracing solves this. OpenTelemetry gives you the ability to follow a logical transaction across all service boundaries, to see exactly which operations are slow, to understand the complete path through your system, and to identify bottlenecks you couldn't see before. You get traces that connect
separate requests into one logical
transaction, spans that show timing for
each operation, attributes that capture
the context you care about, and context
propagation that makes it all work
seamlessly across services. So start with auto instrumentation to learn the concepts, then move to manual instrumentation to capture what actually matters to your business logic. Use sampling strategies to keep costs reasonable in production, and leverage the vendor-neutral nature of OpenTelemetry. You're never, ever, ever locked in. Your traditional observability tools
are not lying to you. They're just blind
to what's really happening in
distributed systems. Distributed tracing
gives you the visibility you need. So
implement it. Start today. Your future
self will thank you when you're debugging the next production issue.
Thank you for watching. See you in the
next one. Cheers.
Your users are complaining about slow response times—sometimes 8 seconds, other times 2 seconds—but your metrics show everything is fine. Average response times look acceptable, all services report healthy, and your dashboards are green. So what's really happening? The problem is that what looks like a single user request is actually dozens of separate, independent requests cascading through your microservices. Each service only sees its own operations, with no way to know they're part of the same logical transaction. Your logs show individual services completed successfully, but you can't correlate these entries across services or identify which specific operation is causing the delay. This video shows you exactly how to solve this blindness using distributed tracing with OpenTelemetry. You'll learn the difference between automatic and manual instrumentation, see real examples of tracing implementation in TypeScript, and analyze actual traces using Jaeger to understand request flows through complex systems. We'll cover traces, spans, context propagation, semantic conventions, sampling strategies, and how to export trace data to any backend without vendor lock-in. By the end, you'll understand why traditional observability tools can't see what's happening in distributed systems and how to implement tracing that reveals the complete journey of every request through your architecture. #DistributedTracing #OpenTelemetry #Microservices ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Sponsor: DevStats 🔗 https://devstats.plug.dev/5W1oh9J ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Consider joining the channel: https://www.youtube.com/c/devopstoolkit/join ▬▬▬▬▬▬ 🔗 Additional Info 🔗 ▬▬▬▬▬▬ ➡ Transcript and commands: https://devopstoolkit.live/observability/distributed-tracing-explained-opentelemetry--jaeger-tutorial 🔗 OpenTelemetry: https://opentelemetry.io ▬▬▬▬▬▬ 💰 Sponsorships 💰 ▬▬▬▬▬▬ If you are interested in sponsoring this channel, please visit https://devopstoolkit.live/sponsor for more information. Alternatively, feel free to contact me over Twitter or LinkedIn (see below). ▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬ ➡ BlueSky: https://vfarcic.bsky.social ➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/ ▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬ 🎤 Podcast: https://www.devopsparadox.com/ 💬 Live streams: https://www.youtube.com/c/DevOpsParadox ▬▬▬▬▬▬ ⏱ Timecodes ⏱ ▬▬▬▬▬▬ 00:00 Distributed Tracing with OpenTelemetry (OTEL) 01:18 DevStats (sponsor) 02:34 Microservices Performance Mystery 06:24 OpenTelemetry Distributed Tracing 10:57 Analyzing Traces with Jaeger 14:53 Understanding OpenTelemetry Traces 20:14 Tracing Solves Observability Blindness