Your users are complaining that your application is slow. Sometimes it takes 8 seconds to respond; at other times, two. Yet when you check your metrics, everything looks fine. Average response times are acceptable, all services are healthy, and your dashboards are green. So either your users are idiots or you're not capable of capturing what's actually happening with their requests. Now, I tend to assume users are right, which means I would have to call you... well, I'm not going to do that. Instead, I'm going to show you why you can't see what's really happening. So here's what you're about
to learn. You will see exactly how to
track requests as they flow through
dozens of microservices. Identify which
specific operation is causing delays and
understand why your traditional
observability tools are lying to you. By
the end of this video, you will know how
to implement distributed tracing that
actually shows you what's happening in
your system. Let's start with why this
problem exists in the first place.
Here's a common challenge for
engineering teams. You're shipping
features, reviewing PRs, and running
sprints, but you don't really know where
the bottlenecks are. Why do some PRs sit
in review for days? Why do certain
branches go stale before anyone notices?
Without visibility, those problems
compound. That's where the sponsor of
this video comes in. DevStats is an
engineering analytics platform that
gives leaders visibility into their
software delivery processes. Two things
stand out. First, it helps improve PR
flow by identifying review bottlenecks.
You can see which PRs are waiting too
long, who's overloaded, and where
handoffs are breaking down. Teams using
this ship faster because they catch the
friction early. Second, there's
real-time blocker detection. DevStats spots stale branches and aging issues before they delay releases. Instead of
discovering a problem during the sprint
retrospective, you catch it while
there's still time to fix it. The
platform also tracks DORA metrics, sprint planning accuracy, and lets you
benchmark against industry standards. If
you want to see where your engineering
time is actually going, check out
devstats.com. The link is in the
description. Big thanks to DevStats for
sponsoring this video. And now, let's
get back to the main subject.
Let's say your application is
experiencing intermittent slowdowns.
Sometimes a request takes longer and at
other times it takes less time. The
inconsistency is frustrating because you
cannot predict when it will happen or
why. So what do you do? You check your
metrics and here's the kicker. The
average response time looks acceptable.
All your services report healthy.
According to your traditional
observability tools, everything is fine
but clearly it's not. From the user's
perspective, it looks simple. A user
makes a request to your application and
gets back a response. That's how it
appears on the surface. The problem is
that you're dealing with a complex
system made up of dozens of
microservices or dozens of applications.
It's so complex that you honestly don't know the exact path each request takes through the architecture. You've got an API gateway, an auth service, a payment service, an inventory service, an external payment processor like Stripe, a database, a warehouse service, and many more. Sure, you have a general idea of how things are supposed to flow, but the reality is far more intricate.
So, here's the fundamental issue. What appears to be one request flowing through your system is actually many separate,
independent requests. When a user makes
a request, that single action triggers a
cascade of individual service calls, each one independent from the others. Each service in your architecture receives an incoming request, does its work, and then makes separate outgoing requests to other services. The user's request hits the API gateway, which then calls the auth service, the payment service, and, let's say, the inventory service.
But it doesn't stop there. The payment
service calls Stripe. The inventory
service queries the database and might
also call the warehouse service. It's a
massive web of interconnected calls
right? But here's the catch. Each service only sees its own incoming and outgoing requests; there is no inherent way for a service to know that its work is part of the same logical transaction that started with that user's request. From each service's perspective, it's just handling independent requests. Your logs
from individual services show that
operations completed successfully. The auth service logged that it validated the token. The payment service logged that it processed the payment. The inventory service logged that it updated stock. But you cannot correlate those log entries across services. You cannot tell that
they are all part of the same single
user action. So you're left with
critical questions that remain unanswered. Which microservices are
actually involved in the slow requests?
What is the actual request path through
the system? Which service is causing the
delay? And within that service, which
specific operation is slow? Is it a
database query, an external API call, a
cache lookup or some business logic? Is
it always the same service and operation
causing the problem or does it differ?
Is the bottleneck in inter-service communication or network latency? Are
there cascading timeouts or retry storms
happening? Where exactly are those extra
6 seconds being spent? Is it between
services or within a specific operation?
What you really, really, really need is
the ability to follow a single request
across all service boundaries. You need
to map its complete journey through your
system and identify exactly where the
bottleneck is. And this is where distributed tracing becomes essential. It's the tool that finally gives you visibility into what's really happening across your entire distributed system.
So let's explore how it works and how
you can implement it in your own
applications.
OpenTelemetry tracing gives you distributed tracing capabilities to track requests as they flow through your microservices and distributed systems. It is the solution to the visibility problem that we discussed earlier. So let's
start with the fundamentals. Traces
represent the complete journey of a
logical transaction through your system
connecting all the separate requests
that were triggered by a single user
action. Remember, when a user clicks, let's say, checkout, that's not one request moving through your services.
It's dozens of separate independent
requests. The trace connects them all
together. Now within each trace you have
spans. Spans are individual units of work, like a database query, an HTTP request to another service, or a function call. Each span contains timing data
showing how long that specific operation
took. It has attributes providing
metadata about what happened. It records events marking significant moments like errors or retries. And it maintains
parent and child relationships that show
how spans connect to each other. The
magic that ties all that together is
context propagation. And this has nothing to do with AI context; this is OpenTelemetry context. It ensures that the trace ID and span information flow across service boundaries, so every service knows its work is part of the same logical transaction. Now, how do you
actually add tracing to your
applications? There are two main approaches: automatic instrumentation and manual instrumentation. Let's start with automatic instrumentation, which uses agents or libraries to instrument common frameworks without requiring code changes. It's quick to set up with
minimal code modifications. You can be
up and running in minutes. It covers
common frameworks out of the box like
HTTP servers, databases and messaging
systems. It has less maintenance burden since the instrumentation is handled for you, not by you, and on top of that it provides consistent instrumentation across all your services. The downsides? It gives you limited visibility into custom business logic. It may capture too much or too little data depending on your needs. You get less control over span names and attributes, and it can add overhead to all operations since it instruments everything by default.
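To make the comparison concrete, here's a minimal sketch of what an automatic instrumentation setup looks like for a Node.js service, assuming the standard @opentelemetry/sdk-node and @opentelemetry/auto-instrumentations-node packages; the service name and endpoint are illustrative.

```typescript
// Minimal auto-instrumentation bootstrap for a Node.js service (a sketch, not a complete setup).
// Import this module before any application code so libraries get patched early.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

const sdk = new NodeSDK({
  serviceName: "checkout-service", // illustrative name; use your own
  traceExporter: new OTLPTraceExporter({
    url: "http://localhost:4318/v1/traces", // default OTLP/HTTP path; point it at your collector or backend
  }),
  // Instruments common frameworks (HTTP, Express, gRPC, popular database clients, and more) without code changes.
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();
```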
And that's where manual instrumentation kicks in. That's where you write
explicit code to create custom spans for
your business logic. This gives you full
control over what gets traced. You can
add business specific context and
attributes that matter to your domain.
You can trace only critical parts to
reduce overhead and you get better span
naming for your domain operations. And the trade-offs? Well, it requires code changes and ongoing maintenance. It needs developer effort to instrument properly. It risks inconsistent instrumentation across teams if everyone does it differently, and it can be forgotten during, you know, rapid development when people are moving fast.
Here's my take. Auto instrumentation is fine for getting started. It will show you what tracing looks like, help you understand the concepts, and give you something to experiment with. But don't mistake it for the destination. Real observability requires you to understand what you're measuring and why. Auto instrumentation captures a bunch of noise you don't care about while missing the stuff that actually matters to your business logic. Manual instrumentation forces you to think about what's important. So use auto instrumentation to learn, then move to manual instrumentation and instrument what actually matters. That's where the real value is. In other words, auto instrumentation sucks, but use it if you have to. Now let's look at a real
example of manual instrumentation in
action. We will extract some code from a
TypeScript file that shows you how to
create a custom span for tracing tool
executions. This code creates a custom span for tracing tool executions. You get a tracer instance, start a span with a descriptive name that includes the tool being executed, set the span kind to internal since this is business logic within your process, my process actually, and add custom attributes that capture the tool name and its input arguments. That's it. It's
that easy. And this is manual
instrumentation at work. You're
explicitly deciding what to trace and
what context to capture.
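Here's a minimal sketch of that pattern in TypeScript, assuming the standard @opentelemetry/api package; the tracer name, attribute keys, and the executeTool helper are illustrative rather than the actual code from the project.

```typescript
import { trace, SpanKind, SpanStatusCode } from "@opentelemetry/api";

// Hypothetical helper representing the actual tool invocation.
declare function executeTool(name: string, args: Record<string, unknown>): Promise<unknown>;

// Illustrative tracer name; in a real project this identifies the instrumented component.
const tracer = trace.getTracer("tool-executor");

export async function tracedToolExecution(toolName: string, args: Record<string, unknown>) {
  // Descriptive span name that includes the tool being executed.
  const span = tracer.startSpan(`execute_tool ${toolName}`, {
    kind: SpanKind.INTERNAL, // business logic within this process, not an outgoing call
    attributes: {
      "tool.name": toolName,                  // illustrative attribute keys
      "tool.arguments": JSON.stringify(args),
    },
  });

  try {
    const result = await executeTool(toolName, args);
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (err) {
    span.recordException(err as Error);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw err;
  } finally {
    span.end(); // always end the span so it gets exported
  }
}
```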
Now let's see what distributed tracing
looks like in action when you actually
run your instrumented application. This
is the Yagger UI, one of the most
popular open source tracing backends.
You can see it found two traces from the AI MCP service. One trace is for a POST
request to remediate that took around 40
seconds. The other is for version, which actually took under two seconds. Both show multiple spans and have some
errors reported. This is your trace
search interface. This is where you find
things you're looking for. Let's start with the version endpoint. This is just a simple function that returns the status of the application. So how many external interactions do you think such a simple function, the one that returns the status, has? Huh? One, two, maybe none.
Let's find out. Ah, look at that. Over 20 spans. That simple version endpoint made around 20 different calls to the rest of the system. It interacted with Qdrant DB, made quite a few calls to the Kubernetes cluster, and it interacted with the Claude LLM as well as the embedding model. This waterfall view shows you
exactly what happened and when. Each bar
represents a span with its duration. The
hierarchy shows parent and child relationships. You can see the main POST version span at the top, then execute tool version as a child, followed by operations like the Qdrant health check and Kubernetes list namespaces, and there are many others. Notice how some spans
have red indicators showing potential
issues related to the interactions made
in those spans. Those are important as
well. So here's the thing. Even if I'm
the one who wrote all that code and I
know it by heart, which is often not the
case, I would still not be able to deduce all of this without tracing. The complexity is just too damn high, even for such a simple function. Now, let's
look at the remediation endpoint, which
is slightly more complex. This function
interacts with an LLM in a loop. In each
iteration, it sends a context to the
LLM. The AI responds with requests to execute some tools. Then, in the next
iteration, the context is augmented with
the outputs of those tools and sent back
to the LLM, and so on and so forth, until the LLM decides it has all the information to provide a solution. Now here's the critical part. I cannot know in advance how many loop iterations will be done. In some cases it might be zero, in others 20, or any other number. So how would I know that? How would I know what's happening without tracing? Now,
this trace shows multiple tool loop iteration spans, each one making Kubernetes API calls like kubectl get pods, kubectl get deployments, and kubectl get services. The waterfall shows when each operation happened and how long it took. Some spans have red error indicators showing that some things are not right. Now, when you click on a
specific span you get detailed
information about that operation. At the
top you can see the service name
duration and start time. The tag section
shows span attributes which are key
value pairs providing metadata about the
operation. Those are custom attributes
we added through the manual
instrumentation. The gen_ai operation name is tool loop. The gen_ai provider name is anthropic. You can see the specific Claude model used and token usage showing it consumed over 20,000 input tokens and produced over 1,000 output tokens. Now notice the span kind is client, since this is an external call to the LLM. The process section shows which host and pod this ran on, the library name, and the actual command that was executed.
Now let's break down the key components that make distributed tracing work.
Every trace has a trace ID, a unique
identifier for the entire logical
transaction across all services. Each
operation has a span ID which is a
unique identifier for that specific
operation. The parent span ID links
spans together to form a trace hierarchy
showing which operations triggered which
other operations. Then there are span
attributes. Those are key value pairs
you saw in the tag section. They provide
metadata like HTTP methods, status codes
or custom tags you define. And span events are timestamped log entries within a span, capturing things like errors, exceptions, or checkpoints during execution.
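For example, with the @opentelemetry/api package, recording events and exceptions on a span looks roughly like this; the span and event names are illustrative.

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const span = trace.getTracer("example").startSpan("process-payment"); // illustrative span
try {
  // A timestamped event marking a significant moment, with its own attributes.
  span.addEvent("cache.miss", { "cache.key": "user:42" });
  // ... do the actual work here ...
} catch (err) {
  span.recordException(err as Error);             // recorded as an exception event on the span
  span.setStatus({ code: SpanStatusCode.ERROR }); // mark the span as failed
  throw err;
} finally {
  span.end();
}
```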
Now, you might have noticed span kind internal in the code earlier and span kind client in the span details. Those are span kinds, which categorize what type of operation the span represents. OpenTelemetry defines five span kinds: internal for internal operations within your application, client for outgoing requests to external services, server for handling incoming requests, producer for sending messages to queues, and consumer for receiving messages from queues. And you remember those gen_ai
attributes you saw throughout the
examples? Well, those aren't arbitrary
names. OpenTelemetry defines semantic conventions, which are standardized attribute names for specific domains. There are conventions for HTTP, for databases, for messaging systems, for GenAI, for Kubernetes, for cloud providers, and many more. Using those conventions means tracing tools can automatically recognize and display domain-specific information consistently, whether you're tracing HTTP calls, database queries, LLM interactions, or anything else.
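As a rough illustration, GenAI-flavored attributes on a client span might look like this. The attribute keys follow the gen_ai semantic conventions, which are still evolving, so treat the exact names as an assumption and check the current spec; the values are placeholders.

```typescript
import { trace, SpanKind } from "@opentelemetry/api";

// Attribute keys based on the gen_ai semantic conventions (verify against the current spec);
// the model name and token counts below are made up.
const span = trace.getTracer("llm-client").startSpan("chat", {
  kind: SpanKind.CLIENT, // outgoing call to an external LLM provider
  attributes: {
    "gen_ai.operation.name": "chat",
    "gen_ai.provider.name": "anthropic",
    "gen_ai.request.model": "example-model",
  },
});

// After the call completes, record token usage so it shows up in the trace details.
span.setAttributes({
  "gen_ai.usage.input_tokens": 20000,
  "gen_ai.usage.output_tokens": 1000,
});
span.end();
```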
Now, and this is important: in high-traffic production systems you cannot capture every single trace. The volume would be massive and the storage cost would kill you. This is where sampling
strategies come in. There is head-based sampling, which makes decisions at trace creation time. You might sample probabilistically, like, hey, you know what, capture 10% of all traces, or use rate limiting, like capture 100 traces per second. The decision is made up front, before you know anything about the trace. Next, there is tail-based sampling,
which is smarter but more complex. It
makes decisions after the trace
completes, based on criteria like errors or latency. You can say something like, hey, keep all traces with errors, or keep all traces over 5 seconds. This captures the interesting stuff while discarding the boring successful requests. Then
there is always on sampling which
captures everything. It's useful for
development but it's expensive and
impractical in production. And adaptive sampling dynamically adjusts based on traffic patterns, increasing or decreasing the sample rate depending on load.
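As an example, head-based probabilistic sampling in the Node SDK might be configured like this; it's a sketch, the 10% ratio is arbitrary, and the package names assume the standard OpenTelemetry JavaScript SDK.

```typescript
import { NodeSDK } from "@opentelemetry/sdk-node";
import { ParentBasedSampler, TraceIdRatioBasedSampler } from "@opentelemetry/sdk-trace-base";

// Sample roughly 10% of new traces at the root, and follow the caller's decision
// for requests that already arrive with trace context, so traces stay complete.
const sdk = new NodeSDK({
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(0.1),
  }),
  // ...exporter and instrumentations configured as shown earlier...
});

sdk.start();
```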
And here's a critical piece: context propagation. Remember,
distributed tracing only works if the
trace ID flows through all your
services. When service A calls service B, service B needs to know it's part of the same trace. For that, OpenTelemetry uses the W3C Trace Context standard for interoperability across vendors and tools. For HTTP requests, special headers carry the trace context: traceparent, which contains the trace ID and span ID, and tracestate, which carries vendor-specific data. Within a single process, context propagation uses context objects that flow through your code. This works across both synchronous and asynchronous operations, so your trace doesn't break just because you're using async/await or callbacks.
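Here's a sketch of what that looks like with the @opentelemetry/api propagation API, assuming the SDK has registered the W3C trace context propagator, which it does by default; the function and service names are illustrative.

```typescript
import { context, propagation, trace } from "@opentelemetry/api";

// Outgoing call: inject the active trace context into HTTP headers.
// Auto-instrumented HTTP clients do this for you; this is the manual equivalent.
function outgoingHeaders(): Record<string, string> {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers);
  // headers now contains a traceparent header carrying the trace ID and span ID,
  // plus tracestate if any vendor-specific data is present.
  return headers;
}

// Incoming request: extract the context from the caller's headers and continue the same trace.
function handleIncoming(incomingHeaders: Record<string, string>) {
  const extracted = propagation.extract(context.active(), incomingHeaders);
  const tracer = trace.getTracer("order-service"); // illustrative service name
  context.with(extracted, () => {
    const span = tracer.startSpan("handle-order"); // becomes a child of the caller's span
    // ... handle the request ...
    span.end();
  });
}
```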
What else? Yeah, OpenTelemetry has libraries available for all major languages like Java, Python, Go, JavaScript, .NET, and whatever else. It supports popular frameworks out of the box, including HTTP servers, databases, messaging systems, and gRPC. So whichever language you're using, there's no excuse not to add tracing. So do it. Do it now. Do it right away, because, you'll see, you will thank me later. Now, once
you've instrumented your code and you capture traces, you need to export that data somewhere. OpenTelemetry uses OTLP, the OpenTelemetry Protocol, as the native protocol for exporting trace data. And here's probably the most important thing, the big power of OpenTelemetry as a standard: it's adopted by almost everyone. Everyone. You write your traces once, once and only once, using OpenTelemetry, and you can send them to any backend without changing your instrumentation code. Open source options like Jaeger, Zipkin, and Tempo, and commercial cloud providers like Datadog, New Relic, and Honeycomb, they all support OpenTelemetry. So you're not locked into a vendor. You can switch backends, use multiple backends simultaneously, or migrate from one to another without touching your application code.
Exporters batch and buffer data to optimize network usage. And you can configure them to send to multiple destinations if you want redundancy or need to support different tools.
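As a sketch, explicit exporter wiring in the Node SDK could look like this; the endpoint is illustrative and assumes an OTLP-capable backend such as Jaeger, Tempo, or an OpenTelemetry Collector listening on the default OTLP/HTTP port.

```typescript
import { NodeTracerProvider } from "@opentelemetry/sdk-trace-node";
import { BatchSpanProcessor } from "@opentelemetry/sdk-trace-base";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";

// Swapping backends is just a URL change; the instrumentation code stays the same.
const exporter = new OTLPTraceExporter({
  url: "http://localhost:4318/v1/traces", // Jaeger, Tempo, a vendor endpoint, or a collector
});

// The batch processor buffers spans and flushes them periodically to optimize network usage.
const provider = new NodeTracerProvider({
  spanProcessors: [
    new BatchSpanProcessor(exporter, {
      maxExportBatchSize: 512,    // spans per export request
      scheduledDelayMillis: 5000, // flush interval in milliseconds
    }),
  ],
});

// Older SDK versions wire the processor with provider.addSpanProcessor(...) instead.
provider.register(); // sets the global tracer provider and default W3C propagators
```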
So let's bring this back to where we
started. Your users were complaining about slow responses, but your metrics said everything was fine. That's the fundamental problem with distributed systems. You cannot see what's really happening when a user action triggers dozens of separate requests flowing through your microservices or
whatever else you have. Distributed
tracing solves this. OpenTelemetry gives you the ability to follow a logical transaction across all service boundaries, to see exactly which operations are slow, to understand the complete path through your system, and to identify bottlenecks you couldn't see before. You get traces that connect
separate requests into one logical
transaction, spans that show timing for
each operation, attributes that capture
the context you care about, and context
propagation that makes it all work
seamlessly across services. So start with auto instrumentation to learn the concepts, then move to manual instrumentation to capture what actually matters to your business logic. Use sampling strategies to keep costs reasonable in production, and leverage the vendor-neutral nature of OpenTelemetry. You're never, ever, ever locked in. Your traditional observability tools
are not lying to you. They're just blind
to what's really happening in
distributed systems. Distributed tracing
gives you the visibility you need. So
implement it. Start today. Your future
self will thank you when you're debugging the next production issue.
Thank you for watching. See you in the
next one. Cheers.
Your users are complaining about slow response times—sometimes 8 seconds, other times 2 seconds—but your metrics show everything is fine. Average response times look acceptable, all services report healthy, and your dashboards are green. So what's really happening? The problem is that what looks like a single user request is actually dozens of separate, independent requests cascading through your microservices. Each service only sees its own operations, with no way to know they're part of the same logical transaction. Your logs show individual services completed successfully, but you can't correlate these entries across services or identify which specific operation is causing the delay. This video shows you exactly how to solve this blindness using distributed tracing with OpenTelemetry. You'll learn the difference between automatic and manual instrumentation, see real examples of tracing implementation in TypeScript, and analyze actual traces using Jaeger to understand request flows through complex systems. We'll cover traces, spans, context propagation, semantic conventions, sampling strategies, and how to export trace data to any backend without vendor lock-in. By the end, you'll understand why traditional observability tools can't see what's happening in distributed systems and how to implement tracing that reveals the complete journey of every request through your architecture. #DistributedTracing #OpenTelemetry #Microservices ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Sponsor: DevStats 🔗 https://devstats.plug.dev/5W1oh9J ▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬▬ Consider joining the channel: https://www.youtube.com/c/devopstoolkit/join ▬▬▬▬▬▬ 🔗 Additional Info 🔗 ▬▬▬▬▬▬ ➡ Transcript and commands: https://devopstoolkit.live/observability/distributed-tracing-explained-opentelemetry--jaeger-tutorial 🔗 OpenTelemetry: https://opentelemetry.io ▬▬▬▬▬▬ 💰 Sponsorships 💰 ▬▬▬▬▬▬ If you are interested in sponsoring this channel, please visit https://devopstoolkit.live/sponsor for more information. Alternatively, feel free to contact me over Twitter or LinkedIn (see below). ▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬ ➡ BlueSky: https://vfarcic.bsky.social ➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/ ▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬ 🎤 Podcast: https://www.devopsparadox.com/ 💬 Live streams: https://www.youtube.com/c/DevOpsParadox ▬▬▬▬▬▬ ⏱ Timecodes ⏱ ▬▬▬▬▬▬ 00:00 Distributed Tracing with OpenTelemetry (OTEL) 01:18 DevStats (sponsor) 02:34 Microservices Performance Mystery 06:24 OpenTelemetry Distributed Tracing 10:57 Analyzing Traces with Jaeger 14:53 Understanding OpenTelemetry Traces 20:14 Tracing Solves Observability Blindness