Welcome. My name is Gaurav Saxena. I work for a leading automotive company in the US, and I'm going to talk about how we use observability for our internal developer platform.
Before I go there, some general context about the automotive industry: it is moving toward software-defined vehicles. If you have a Tesla or another recently built car, you know the expectations have gone up. You want your vehicle to perform better every day; that's what OTA deployments are for.
On top of the retail side, there is a tremendous amount of activity in fleet, around getting diagnostics and prognostics from a vehicle. There is a projection by Forrester that by 2032 there will be 100 million vehicles connected to cloud platforms, and the subscription revenue pool is growing 30% year-over-year.
So how do you handle this experience for your drivers?
I'm going to talk a little bit about the IDP first, and then about some of the open-source CNCF tooling used along the way, specifically around Crossplane. There's a spec called the Open Application Model, with an implementation called KubeVela, and then we'll tie everything together with OpenTelemetry. So bear with me; we're going to touch on each of these CNCF projects in the next 20 minutes.
A bit of background on Crossplane. Think about Kubernetes: Kubernetes is well known for container orchestration and management, but it also has a first-class, API-driven model where you can manage not only containers but hyperscaler resources as well, such as AWS and GCP services. So basically we use Crossplane, the CNCF project, as a control plane, and that control plane manages all of your resources: databases, schema management, CI/CD pipelines, and so on. That's the gist of this diagram.
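To make the control-plane idea concrete, here is a minimal Go sketch, using the Kubernetes dynamic client, that creates a Crossplane-style managed resource exactly the way you would create any other Kubernetes object. The RDSInstance group/version/kind and every field value here are illustrative assumptions, not taken from the talk; in practice they would match the Crossplane provider and compositions your platform installs.

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig and build a dynamic client; the control plane is just
	// another Kubernetes API server from the caller's point of view.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical Crossplane managed resource: a database declared as a
	// Kubernetes object that the provider reconciles against the cloud API.
	gvr := schema.GroupVersionResource{
		Group:    "database.aws.crossplane.io",
		Version:  "v1beta1",
		Resource: "rdsinstances",
	}
	db := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "database.aws.crossplane.io/v1beta1",
		"kind":       "RDSInstance",
		"metadata":   map[string]interface{}{"name": "orders-db"},
		"spec": map[string]interface{}{
			"forProvider": map[string]interface{}{
				"region":           "us-east-1",
				"dbInstanceClass":  "db.t3.small",
				"engine":           "postgres",
				"allocatedStorage": int64(20),
			},
			"writeConnectionSecretToRef": map[string]interface{}{
				"name":      "orders-db-conn",
				"namespace": "platform",
			},
		},
	}}

	// Managed resources are cluster-scoped, so no namespace on the create call.
	if _, err := dyn.Resource(gvr).Create(context.Background(), db, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Println("RDSInstance submitted; the Crossplane provider reconciles it into a real database")
}
```

In real platforms this object usually comes from a composition or claim rather than hand-written Go, but the point stands: the cloud resource is just another Kubernetes API object the control plane reconciles.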
So what are we trying to do, what are we trying to achieve here? We are trying to achieve a centralized development environment, and we are trying to have security baked in. When you build and deploy cloud applications, you take the security concerns away from the application teams and bake them into the platform itself.
What's a common example? Say you have a CI/CD pipeline going through build, test, and deploy phases. Can you bake a license-compliance scan, for example a FOSSA scan, into the CI pipeline, so you know whether a dependency breaks your licensing? Can you bake static code analysis into the pipeline itself? You're moving those security concerns from the developers to the platform team. Can your CI pipeline be defined programmatically? For example, application teams today write Dockerfiles. Can you express those Dockerfiles in a programmable way? There's an open-source project called Dagger that gives you programmable CI/CD pipelines. You can maintain a hardened base image and build your application images on top of it, and then cryptographically sign those images. At deployment time, your Kubernetes admission controllers can check the image: if the image you built six months ago has a CVE, the admission controller can reject it, because the team still has to apply the CVE fixes. So you're taking those concerns away from the applications and building them into the platform itself.
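As a sketch of what "programmable CI" can look like, here is a small Dagger pipeline written with its Go SDK that builds an application binary and layers it onto a platform-owned hardened base image before publishing. The base image and registry names are placeholders I made up for illustration; the license scan, signing, and admission-control gates the talk describes would wrap around this step.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"dagger.io/dagger"
)

func main() {
	ctx := context.Background()

	// Connect to the Dagger engine; the pipeline itself is ordinary Go code.
	client, err := dagger.Connect(ctx, dagger.WithLogOutput(os.Stderr))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	src := client.Host().Directory(".")

	// Build the binary in a throwaway build container.
	builder := client.Container().
		From("golang:1.22").
		WithDirectory("/src", src).
		WithWorkdir("/src").
		WithExec([]string{"go", "build", "-o", "/bin/app", "./..."})

	// Assemble the runtime image on top of a hardened base image owned by the
	// platform team (the image reference here is a made-up placeholder).
	app := client.Container().
		From("registry.example.com/platform/hardened-base:latest").
		WithFile("/bin/app", builder.File("/bin/app")).
		WithEntrypoint([]string{"/bin/app"})

	// Publish; signing (e.g. with cosign) and CVE gating would follow this step.
	ref, err := app.Publish(ctx, "registry.example.com/team/app:latest")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("pushed", ref)
}
```

Because the pipeline is code, the platform team can ship the hardened-base and publish steps as a shared function, and application teams only supply the build step.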
Why do you have to do that? Because every team picks its own tooling and technology, which works fine at small scale. But when you're an enterprise company you don't have that luxury, because best practices from one team don't transfer to another. On top of that, you don't have centralized operations. I don't know about you, but I have been on incident bridges at my companies with 100 people on the call, and someone asks, "Was there a deployment 20 minutes ago?" And we don't know, because there is no standardized set of dashboards and metrics to look at to see what happened and what caused the incident. The point I'm trying to make is that when you build enterprise-scale software, you have to have an operations mindset, and the IDP puts that front and center, first and foremost from an efficiency point of view.
I will touch on these common themes for building observability for cloud application teams: auto-instrumentation; logs, metrics, traces, and correlation; network observability; CI/CD observability; and synthetic transactions. Every application team needs these different kinds of observability, and we'll touch on all of them in the context of the IDP.
Before that, let me give you a blueprint of how a commonly deployed architecture of OTel collectors looks. Starting from the left side, you basically have a fleet of collectors, each with its own responsibility, a separation of concerns. On the left you have a collector that runs as a DaemonSet agent on each Kubernetes node and receives all the logs, traces, and metrics; because it runs as a DaemonSet, applications send their telemetry to the collector on their own node. That agent also does scraping: if your application exposes Prometheus metrics, you can configure a Prometheus receiver on this collector that scrapes those data points.
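On the application side, that scraping path is just a normal Prometheus endpoint; the node-local agent's Prometheus receiver does the rest. A minimal Go sketch, assuming the standard client_golang library (the metric name and port are arbitrary choices for illustration):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A counter the DaemonSet collector picks up via its Prometheus receiver.
var requestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "app_http_requests_total",
		Help: "HTTP requests handled, by path.",
	},
	[]string{"path"},
)

func main() {
	http.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		requestsTotal.WithLabelValues(r.URL.Path).Inc()
		w.Write([]byte("ok"))
	})

	// Expose /metrics; the node-local OTel collector scrapes this endpoint,
	// typically discovered via pod annotations or Kubernetes service discovery.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```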
Also, if your services run the Istio service mesh, you can instrument network observability there as well. Think about it: you have an Envoy sidecar proxy next to the service, which has all the data about things like where TLS termination happens. It can capture that data on behalf of the main container and send it along with the service's own telemetry. Once the node-level collector receives the traffic, we send it to a centralized gateway collector. The gateway collector has a fan-out mechanism and forwards the data to other collectors. Think about tail sampling: tail sampling requires that all spans of a trace go to the same collector, because in the end the tail sampler has to decide whether to export the trace to the backend. The backend could be Grafana, Honeycomb, or any vendor of your choice. That decision happens in the tail-sampling processor, and for it to be made, all the spans of a trace have to land on the same collector instance.
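The routing rule behind this can be stated in a few lines of Go. This is only an illustration of the idea; the real load-balancing exporter uses its own consistent-hashing ring and service discovery rather than this toy function.

```go
package routing

import (
	"hash/fnv"

	"go.opentelemetry.io/otel/trace"
)

// CollectorFor shows why all spans of a trace land on the same backend:
// hashing the trace ID, which every span of a trace shares, deterministically
// maps the whole trace to one downstream collector, so the tail sampler there
// sees the complete trace before deciding to keep or drop it.
func CollectorFor(traceID trace.TraceID, collectors []string) string {
	if len(collectors) == 0 {
		return ""
	}
	h := fnv.New32a()
	h.Write(traceID[:])
	return collectors[int(h.Sum32()%uint32(len(collectors)))]
}
```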
Then you can also run span metrics, which derive metrics from the spans. If you want to generate a service graph, and you want SLOs and SLIs based on how each of your applications interacts, you can do that because the same load-balancing exporter can route by service, so one collector receives all the spans from that service and generates a graph of your service topology. We'll see some examples in the following screenshots, but this architecture covers logs, metrics, and traces together. In the bottom-right corner you also have other collectors. For example, if you're running Kafka, Pulsar, or a database like YugabyteDB or Postgres, you want to instrument those managed services as well. You can point an OTel collector at their Prometheus scrape endpoints, and because those collectors feed the same main gateway collector, you get better correlation. So you can observe not only from the application point of view but also from the infrastructure point of view, and combine the two for a holistic view.
So we've talked about how we ingest into the OTel collectors. Let me talk about the best practices I have seen for scaling and improving telemetry.
The first point here is span filtering. The cost of ingestion into your backend depends on how you handle the data at the collector processor layer. If you want to enrich your logs with the trace ID and more context, you can attach that metadata in the collector processors; you can use OpenTelemetry Transformation Language (OTTL) processors to further enrich the data, and you can also drop your noisy data.
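The talk does this filtering at the collector's processor layer, for example with OTTL. Purely as an illustration of the same "drop the noise early" idea at the SDK level, here is a sketch of a custom Go sampler that drops spans for health-check endpoints before they are ever exported; the span names it matches are assumptions.

```go
package telemetry

import (
	"strings"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// noiseFilter drops spans for endpoints we never want to pay to ingest and
// delegates every other decision to a wrapped sampler.
type noiseFilter struct {
	next sdktrace.Sampler
}

func NewNoiseFilter(next sdktrace.Sampler) sdktrace.Sampler {
	return noiseFilter{next: next}
}

func (f noiseFilter) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
	// Assumed convention: health probes show up as spans named /healthz or /livez.
	if strings.HasPrefix(p.Name, "/healthz") || strings.HasPrefix(p.Name, "/livez") {
		return sdktrace.SamplingResult{Decision: sdktrace.Drop}
	}
	return f.next.ShouldSample(p)
}

func (f noiseFilter) Description() string {
	return "noiseFilter{" + f.next.Description() + "}"
}
```

You would plug it in with sdktrace.NewTracerProvider(sdktrace.WithSampler(NewNoiseFilter(sdktrace.ParentBased(sdktrace.AlwaysSample())))); collector-side OTTL rules remain the right place for organization-wide filtering.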
The second point was enriching signals with context. You can add resource attributes; OpenTelemetry has good semantic conventions for which labels to attach. If you're running in a different cloud provider or a different region, you can add those labels at the processor stage. Now, most applications today already have logging capabilities, and OpenTelemetry has something called the Logs Bridge: you can use that instrumentation, apply the same semantic conventions so the logs carry those labels, and send the data to the OTel collector.
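At the SDK end, attaching those semantic-convention labels looks roughly like this Go sketch; the specific values (service name, environment, cloud provider, region) are placeholders, and the same keys can equally be injected collector-side with the resource or k8sattributes processors.

```go
package telemetry

import (
	"context"

	"go.opentelemetry.io/otel/sdk/resource"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

// NewResource builds the resource attached to every span, metric, and log
// record, using OpenTelemetry semantic conventions for the keys.
func NewResource(ctx context.Context) (*resource.Resource, error) {
	return resource.New(ctx,
		resource.WithFromEnv(), // honour OTEL_RESOURCE_ATTRIBUTES set by the platform
		resource.WithAttributes(
			semconv.ServiceName("vehicle-telemetry-api"), // placeholder service name
			semconv.DeploymentEnvironment("prod"),
			semconv.CloudProviderAWS,
			semconv.CloudRegion("us-east-1"),
		),
	)
}
```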
The reason I wanted to mention the correlation between traces, metrics, and logs is that if there's a spike in a metric and you want to see what caused it, you need a relationship between the traces, the metrics, and the logs. The mindset of an operations engineer is: don't just show me the logs with 5xx errors, show me which service caused them and where they came from. By attaching all that metadata in the OTel processors, you get the whole context.
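Concretely, the trace-to-log link is just the trace and span IDs carried on every log line. A minimal Go sketch, assuming structured logging with the standard library's slog:

```go
package telemetry

import (
	"context"
	"log/slog"

	"go.opentelemetry.io/otel/trace"
)

// LogWithTrace emits a structured log line stamped with the IDs of the span
// in ctx, so the backend can pivot from a 5xx log straight to its trace.
func LogWithTrace(ctx context.Context, logger *slog.Logger, msg string, args ...any) {
	sc := trace.SpanContextFromContext(ctx)
	if sc.IsValid() {
		args = append(args,
			"trace_id", sc.TraceID().String(),
			"span_id", sc.SpanID().String(),
		)
	}
	logger.InfoContext(ctx, msg, args...)
}
```

The OpenTelemetry log bridges do this stamping automatically; the sketch just shows what the correlation amounts to.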
Managing observability is about how you treat your telemetry. For example, when we write our infrastructure, say AWS services, we use Terraform or Crossplane or something like that. In my mind the same goes for observability: you have to treat your telemetry as code. You can version your schema, and you can track an instrumentation score, that is, how healthy the telemetry is as you make changes, all managed through source control. That's the key point of this slide. The platform team should also provide preconfigured OTel collector configurations, default exporters, and sampling rules as a best practice, so service teams don't have to reinvent the wheel.
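One way a platform team ships those defaults is as a small shared Go package that application teams import, rather than each team hand-rolling exporter and sampler configuration. This is a hypothetical sketch: the package name, the node-local endpoint convention, and the 10% sampling ratio are all assumptions, not details from the talk.

```go
package platformotel

import (
	"context"
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

// Setup wires the platform defaults: OTLP export to the node-local DaemonSet
// collector, a parent-based 10% sampler, and standard resource attributes.
// It returns a shutdown function to flush spans on exit.
func Setup(ctx context.Context, service string) (func(context.Context) error, error) {
	// NODE_IP is assumed to be injected via the Kubernetes Downward API.
	endpoint := os.Getenv("NODE_IP") + ":4317"

	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(endpoint),
		otlptracegrpc.WithInsecure(), // in-cluster hop to the local agent
	)
	if err != nil {
		return nil, err
	}

	res, err := resource.New(ctx,
		resource.WithFromEnv(),
		resource.WithAttributes(semconv.ServiceName(service)),
	)
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithResource(res),
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.10))),
	)
	otel.SetTracerProvider(tp)
	return tp.Shutdown, nil
}
```

An application team then just calls shutdown, err := platformotel.Setup(ctx, "payments") in main and gets the platform's defaults for free.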
Now, the following slides are screenshots of what I have been describing in terms of correlating signals. In this slide you're seeing network observability first. I mentioned Istio, the service mesh: the mesh has an extension provider that can send traces to an OpenTelemetry collector. That's how you can understand, from the service's point of view, how its Envoy proxy is handling that traffic. So in this screen you're seeing the correlation of logs and traces from the network point of view, because Envoy is capturing those traces on behalf of the service itself.
This is the exemplars feature, for metrics-to-traces correlation. On the left side you're seeing a time-series graph. If you want to understand a spike, you can click on that metric point, and it takes you to the traces and shows you the whole waterfall of what caused that spike to happen.
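If you expose Prometheus metrics directly from Go, exemplars are attached at record time, and the backend renders them as those clickable points. A sketch assuming client_golang and an active OpenTelemetry span; the histogram name and labels are arbitrary.

```go
package telemetry

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"go.opentelemetry.io/otel/trace"
)

var requestDuration = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "app_request_duration_seconds",
		Help:    "Request latency.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"path"},
)

// ObserveWithTraceExemplar records a latency observation and, when the request
// ran inside a sampled span, attaches its trace ID as an exemplar so the
// metric point can link back to the exact trace behind the spike.
func ObserveWithTraceExemplar(ctx context.Context, path string, d time.Duration) {
	obs := requestDuration.WithLabelValues(path)
	sc := trace.SpanContextFromContext(ctx)
	if eo, ok := obs.(prometheus.ExemplarObserver); ok && sc.IsSampled() {
		eo.ObserveWithExemplar(d.Seconds(), prometheus.Labels{"trace_id": sc.TraceID().String()})
		return
	}
	obs.Observe(d.Seconds())
}
```

Note that exemplars are only exposed when the /metrics handler serves the OpenMetrics format (promhttp.HandlerOpts{EnableOpenMetrics: true}).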
The service graph: I mentioned span metrics and the service node graph, and this is the view you get if you correlate all your signals in one gateway collector.
Now, we've talked about the collector; how do you do operations around the collector itself? These are a bunch of metrics for your receivers, exporters, and processors. The key takeaway is that you have to monitor all these signals to understand when to scale your OTel collectors, based on the latency your tail-sampling processor or your load-balancing exporter is experiencing. For example, one of these metrics measures how long a trace waits before it becomes eligible for a sampling decision. If you're seeing high latency there, you have to increase the concurrency or the number of replicas of the OTel collector. When you scale it, you also have to make sure each collector only scrapes the jobs it is responsible for; otherwise you'll get duplicate time series in your TSDB.
A few things from the deployment point of view. Argo Rollouts has integration with OpenTelemetry baked in, so if you're doing a canary deployment, you can have your application's instrumentation go to the same OTel collector deployment and send its signals. You can then correlate whether the new version of the app is as stable as the previous one, and based on that you can either abort the deployment or continue with it. So this is basically showcasing the integration between OpenTelemetry and Argo Rollouts.
On CI/CD observability: we use a framework called Temporal. A Temporal workflow automatically captures state at every step, and in the event of a failure it can pick up exactly where it left off. On the left side you're seeing the CI pipeline, and when the CI pipeline finishes from the GitHub commit, the same trace continues into the CD workflow, which does all the validation and hydration. The point here is that application teams write their code by expressing what they need: if my application needs a load balancer, a database, or a messaging system, the infrastructure team takes those concerns and, using KubeVela, expresses them as Kubernetes objects. The main takeaway is that OpenTelemetry can integrate with each of those steps in the Temporal workflow and attach the spans from each stage into one waterfall, and that's what you get in this dashboard. On the left you have the traces for both CI and CD together across all your environments, and you can find out where your bottleneck is: is it in the queue for your deployments, or is it in the build phase, for example.
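As a rough sketch of what the CD side can look like with Temporal's Go SDK, here is a workflow with validate, hydrate, and deploy phases modeled as activities. The activity names and payloads are made up for illustration, and the OpenTelemetry wiring (Temporal's tracing interceptor plus context propagation from the CI trace) is omitted for brevity.

```go
package cd

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// DeployRequest is a made-up payload describing what the application team asked for.
type DeployRequest struct {
	Service string
	Version string
	Env     string
}

// DeployWorkflow runs the CD phases as Temporal activities. Each activity is
// checkpointed, so a failure resumes exactly where it left off, and each one
// shows up as its own span in the end-to-end CI/CD trace.
func DeployWorkflow(ctx workflow.Context, req DeployRequest) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Minute,
	})

	// Validate the request: quotas, policies, image signature, and so on.
	if err := workflow.ExecuteActivity(ctx, "ValidateRelease", req).Get(ctx, nil); err != nil {
		return err
	}

	// Hydrate: expand the team's intent (database, load balancer, ...) into
	// concrete Kubernetes objects, e.g. via the OAM/KubeVela layer.
	var manifests []string
	if err := workflow.ExecuteActivity(ctx, "HydrateManifests", req).Get(ctx, &manifests); err != nil {
		return err
	}

	// Apply to the target environment and wait for rollout health.
	return workflow.ExecuteActivity(ctx, "ApplyAndVerify", req.Env, manifests).Get(ctx, nil)
}
```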
Synthetic monitoring and testing: this is where k6 comes in, another open-source project, which you can integrate with OpenTelemetry to run synthetic transactions. If I'm doing an OTA deployment to a vehicle and touching 100 services along the way, I want to test it in a way that catches the problem before the end user catches it. This is where OpenTelemetry along with k6 gives you the whole end-to-end workflow, and you can tie it into your own deployment workflow, like the Argo Rollouts setup in the previous screen, to make sure the deployment you're doing is good enough and isn't causing any regression.
The last slide here is about dashboards and alerts as code. The same way we treat all telemetry as code, I want to emphasize that dashboards and alerts should also be treated as code. For example, the same Crossplane that underlies your infrastructure can also manage your tooling of choice: if it's Grafana, as an open-source example, you can have a Crossplane provider for Grafana and use it to manage your alerts and dashboards. That way you have a consistent set of dashboards for all the service teams, no one has siloed dashboards, and everything is version-controlled, so if I'm an SRE on call for a service I don't own, I can still make sense of it because of the semantic conventions and the shared metric registry.
That's all I had. This is the feedback code; I would appreciate your feedback for the session, and I can take any questions you may have at this time.
[applause]
>> Yes.
>> Excuse me, there's a microphone in the middle. If you have a Q&A question, just walk over to the microphone.
>> My question was around the sampling strategy. So all spans firm-wide are being sent to a centralized fleet of OpenTelemetry collectors, which then make the sampling decision. If that's the correct understanding, what is roughly the volume at which these spans are coming in, per second or per day?
>> I can't talk about specific numbers for the industry I work for, but as a general figure we handle close to 3.5 million spans a minute, across roughly 100 services. And the centralized gateway collector is not just one pod; it has multiple replicas and built-in autoscaling. We use KEDA, the Kubernetes Event-Driven Autoscaler, so scaling is not based only on CPU and memory consumption; you can define your own metrics. For example, I talked about some of the metrics we track in the tail-sampling processor and the load-balancing exporter. You can use those metrics to scale your OTel collectors, so you're basically creating a custom metric to scale those collectors rather than relying only on CPU and memory. You also have to make sure those collectors aren't duplicating work: you can shard your applications so that only certain collectors are responsible for scraping certain targets, for example.
>> Right, so this is like a central heartbeat in that case?
>> Yes.
>> Very quick follow-up: how do you handle back pressure?
>> Yeah. I didn't show it in this diagram, but we have Kafka. You can have the OTel exporters talk to Kafka, and then it can replay the data if there's back pressure, or, say, when we're doing maintenance and upgrading the OTel collectors. It's a rolling upgrade strategy, but the data stays in the Kafka queue and can be replayed when the collectors are back online.
>> All right. Thank you so much.
>> I think we are about out of time, but do I have time for one more question, if available?
>> Okay. Yeah.
>> Hello. Thank you for the presentation, by the way. You had a slide about collecting spans and metrics for CI/CD.
>> Yes.
>> No, I think one more back, maybe.
>> Yeah. So for that one, can you explain what tools you used to collect these? Are you collecting these from GitHub Actions?
>> Yeah.
>> Okay.
>> Yeah, sure. So basically we use this tool called Dagger, dagger.io. It's an open-source tool, and Dagger builds your CI pipelines programmatically. When a developer pushes a commit, from that point we start, through the GitHub Actions webhook APIs, and that's integrated with OpenTelemetry instrumentation. We use Golang or Java; it doesn't matter which language, because there's an auto-instrumentation provider for it. From that point we start the parent span, and then with context propagation we're able to push the same trace ID into each and every request. On the CD side we use the open-source Temporal workflow engine. It has the concept of activities, so you can span out each of those build phases. I talked about the Open Application Model, for example: if your application needs, say, a database, it makes sure the database spans also show up in conjunction with your CD pipeline, so you're basically tracing end to end. Did I answer that question?
>> Yes. Thank you.
>> I will be in the hallway if you have any more follow-up questions. Thank you.
Scaling Observability for an Enterprise Internal Developer Portal With OpenTelemetry - Gaurav Saxena, Automotive Company
Building an enterprise-scale Internal Developer Platform (IDP) requires a robust observability strategy to ensure reliability, performance, and developer efficiency. This presentation explores how OpenTelemetry seamlessly integrates into an IDP to provide deep visibility and actionable insights. I'll dive into the benefits of OpenTelemetry and best practices for running collectors at scale to drive business impact.