Welcome. My name is Gaurav Saxena. I work for a leading automotive company in the US, and I'm going to talk about how we use observability for our internal developer platform.
Before I go there, some general context about the automotive industry: it is moving toward software-defined vehicles. If you have a Tesla or another recently built car, you know the expectations have gone up. You want your vehicle to perform better every day; that's what OTA deployments are for.
On top of the retail side, there is a tremendous amount of activity in fleet, around getting diagnostics and prognostics from a vehicle. There is a projection by Forrester that by 2032 there will be 100 million vehicles connected to cloud platforms, and the subscription revenue pool is growing 30% year-over-year.
So how do you handle this experience for your drivers?
I'm going to talk a little bit about the IDP first, and then about some of the open-source CNCF tooling used along the way, specifically around Crossplane. There's a spec called the Open Application Model, with an implementation called KubeVela, and then we'll tie everything together with OpenTelemetry. So bear with me; we're going to touch on each of these CNCF projects in the next 20 minutes.
A bit of background on Crossplane. Think about Kubernetes: Kubernetes is well known for container orchestration and management, but it also has a first-class, API-driven model where you can manage not only containers but hyperscaler resources as well, such as AWS and GCP services. So basically we use Crossplane, the CNCF project, as a control plane, and that control plane manages all of your resources: databases, schema management, CI/CD pipelines, and so on. That's the gist of this diagram.
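To make the control-plane idea concrete, here is a minimal Go sketch, using the Kubernetes dynamic client, that creates a Crossplane-style managed resource exactly the way you would create any other Kubernetes object. The RDSInstance group/version/kind and every field value here are illustrative assumptions, not taken from the talk; in practice they would match the Crossplane provider and compositions your platform installs.

```go
package main

import (
	"context"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig and build a dynamic client; the control plane is just
	// another Kubernetes API server from the caller's point of view.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical Crossplane managed resource: a database declared as a
	// Kubernetes object that the provider reconciles against the cloud API.
	gvr := schema.GroupVersionResource{
		Group:    "database.aws.crossplane.io",
		Version:  "v1beta1",
		Resource: "rdsinstances",
	}
	db := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "database.aws.crossplane.io/v1beta1",
		"kind":       "RDSInstance",
		"metadata":   map[string]interface{}{"name": "orders-db"},
		"spec": map[string]interface{}{
			"forProvider": map[string]interface{}{
				"region":           "us-east-1",
				"dbInstanceClass":  "db.t3.small",
				"engine":           "postgres",
				"allocatedStorage": int64(20),
			},
			"writeConnectionSecretToRef": map[string]interface{}{
				"name":      "orders-db-conn",
				"namespace": "platform",
			},
		},
	}}

	// Managed resources are cluster-scoped, so no namespace on the create call.
	if _, err := dyn.Resource(gvr).Create(context.Background(), db, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Println("RDSInstance submitted; the Crossplane provider reconciles it into a real database")
}
```

In real platforms this object usually comes from a composition or claim rather than hand-written Go, but the point stands: the cloud resource is just another Kubernetes API object the control plane reconciles.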
So what are we trying to do, what are we trying to achieve here? We are trying to achieve a centralized development environment, and we are trying to have security baked in. When you build and deploy cloud applications, you take the security concerns away from the application teams and bake them into the platform itself.
What's a common example? Say you have a CI/CD pipeline going through build, test, and deploy phases. Can you bake a license-compliance scan, for example a FOSSA scan, into the CI pipeline, so you know whether a dependency breaks your licensing? Can you bake static code analysis into the pipeline itself? You're moving those security concerns from the developers to the platform team. Can your CI pipeline be defined programmatically? For example, application teams today write Dockerfiles. Can you express those Dockerfiles in a programmable way? There's an open-source project called Dagger that gives you programmable CI/CD pipelines. You can maintain a hardened base image and build your application images on top of it, and then cryptographically sign those images. At deployment time, your Kubernetes admission controllers can check the image: if the image you built six months ago has a CVE, the admission controller can reject it, because the team still has to apply the CVE fixes. So you're taking those concerns away from the applications and building them into the platform itself.
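As a sketch of what "programmable CI" can look like, here is a small Dagger pipeline written with its Go SDK that builds an application binary and layers it onto a platform-owned hardened base image before publishing. The base image and registry names are placeholders I made up for illustration; the license scan, signing, and admission-control gates the talk describes would wrap around this step.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"dagger.io/dagger"
)

func main() {
	ctx := context.Background()

	// Connect to the Dagger engine; the pipeline itself is ordinary Go code.
	client, err := dagger.Connect(ctx, dagger.WithLogOutput(os.Stderr))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	src := client.Host().Directory(".")

	// Build the binary in a throwaway build container.
	builder := client.Container().
		From("golang:1.22").
		WithDirectory("/src", src).
		WithWorkdir("/src").
		WithExec([]string{"go", "build", "-o", "/bin/app", "./..."})

	// Assemble the runtime image on top of a hardened base image owned by the
	// platform team (the image reference here is a made-up placeholder).
	app := client.Container().
		From("registry.example.com/platform/hardened-base:latest").
		WithFile("/bin/app", builder.File("/bin/app")).
		WithEntrypoint([]string{"/bin/app"})

	// Publish; signing (e.g. with cosign) and CVE gating would follow this step.
	ref, err := app.Publish(ctx, "registry.example.com/team/app:latest")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("pushed", ref)
}
```

Because the pipeline is code, the platform team can ship the hardened-base and publish steps as a shared function, and application teams only supply the build step.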
Why do you have to do that? Because every team picks its own tooling and technology, which works fine at small scale. But when you're an enterprise company you don't have that luxury, because best practices from one team don't transfer to another. On top of that, you don't have centralized operations. I don't know about you, but I have been on incident bridges at my companies with 100 people on the call, and someone asks, "Was there a deployment 20 minutes ago?" And we don't know, because there is no standardized set of dashboards and metrics to look at to see what happened and what caused the incident. The point I'm trying to make is that when you build enterprise-scale software, you have to have an operations mindset, and the IDP puts that front and center, first and foremost from an efficiency point of view.
I will touch on these common themes for building observability for cloud application teams: auto-instrumentation; logs, metrics, traces, and correlation; network observability; CI/CD observability; and synthetic transactions. Every application team needs these different kinds of observability, and we'll touch on all of them in the context of the IDP.
Before that, let me give you a blueprint of how a commonly deployed architecture of OTel collectors looks. Starting from the left side, you basically have a fleet of collectors, each with its own responsibility, a separation of concerns. On the left you have a collector that runs as a DaemonSet agent on each Kubernetes node and receives all the logs, traces, and metrics; because it runs as a DaemonSet, applications send their telemetry to the collector on their own node. That agent also does scraping: if your application exposes Prometheus metrics, you can configure a Prometheus receiver on this collector that scrapes those data points.
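On the application side, that scraping path is just a normal Prometheus endpoint; the node-local agent's Prometheus receiver does the rest. A minimal Go sketch, assuming the standard client_golang library (the metric name and port are arbitrary choices for illustration):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A counter the DaemonSet collector picks up via its Prometheus receiver.
var requestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "app_http_requests_total",
		Help: "HTTP requests handled, by path.",
	},
	[]string{"path"},
)

func main() {
	http.HandleFunc("/hello", func(w http.ResponseWriter, r *http.Request) {
		requestsTotal.WithLabelValues(r.URL.Path).Inc()
		w.Write([]byte("ok"))
	})

	// Expose /metrics; the node-local OTel collector scrapes this endpoint,
	// typically discovered via pod annotations or Kubernetes service discovery.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```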
Also, if your services run the Istio service mesh, you can instrument network observability there as well. Think about it: you have an Envoy sidecar proxy next to the service, which has all the data about things like where TLS termination happens. It can capture that data on behalf of the main container and send it along with the service's own telemetry. Once the node-level collector receives the traffic, we send it to a centralized gateway collector. The gateway collector has a fan-out mechanism and forwards the data to other collectors. Think about tail sampling: tail sampling requires that all spans of a trace go to the same collector, because in the end the tail sampler has to decide whether to export the trace to the backend. The backend could be Grafana, Honeycomb, or any vendor of your choice. That decision happens in the tail-sampling processor, and for it to be made, all the spans of a trace have to land on the same collector instance.
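The routing rule behind this can be stated in a few lines of Go. This is only an illustration of the idea; the real load-balancing exporter uses its own consistent-hashing ring and service discovery rather than this toy function.

```go
package routing

import (
	"hash/fnv"

	"go.opentelemetry.io/otel/trace"
)

// CollectorFor shows why all spans of a trace land on the same backend:
// hashing the trace ID, which every span of a trace shares, deterministically
// maps the whole trace to one downstream collector, so the tail sampler there
// sees the complete trace before deciding to keep or drop it.
func CollectorFor(traceID trace.TraceID, collectors []string) string {
	if len(collectors) == 0 {
		return ""
	}
	h := fnv.New32a()
	h.Write(traceID[:])
	return collectors[int(h.Sum32()%uint32(len(collectors)))]
}
```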
Then you can also run span metrics, which derive metrics from the spans. If you want to generate a service graph, and you want SLOs and SLIs based on how each of your applications interacts, you can do that because the same load-balancing exporter can route by service, so one collector receives all the spans from that service and generates a graph of your service topology. We'll see some examples in the following screenshots, but this architecture covers logs, metrics, and traces together. In the bottom-right corner you also have other collectors. For example, if you're running Kafka, Pulsar, or a database like YugabyteDB or Postgres, you want to instrument those managed services as well. You can point an OTel collector at their Prometheus scrape endpoints, and because those collectors feed the same main gateway collector, you get better correlation. So you can observe not only from the application point of view but also from the infrastructure point of view, and combine the two for a holistic view.
So we've talked about how we ingest into the OTel collectors. Let me talk about the best practices I have seen for scaling and improving telemetry.
The first point here is span filtering. The cost of ingestion into your backend depends on how you handle the data at the collector processor layer. If you want to enrich your logs with the trace ID and more context, you can attach that metadata in the collector processors; you can use OpenTelemetry Transformation Language (OTTL) processors to further enrich the data, and you can also drop your noisy data.
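The talk does this filtering at the collector's processor layer, for example with OTTL. Purely as an illustration of the same "drop the noise early" idea at the SDK level, here is a sketch of a custom Go sampler that drops spans for health-check endpoints before they are ever exported; the span names it matches are assumptions.

```go
package telemetry

import (
	"strings"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// noiseFilter drops spans for endpoints we never want to pay to ingest and
// delegates every other decision to a wrapped sampler.
type noiseFilter struct {
	next sdktrace.Sampler
}

func NewNoiseFilter(next sdktrace.Sampler) sdktrace.Sampler {
	return noiseFilter{next: next}
}

func (f noiseFilter) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
	// Assumed convention: health probes show up as spans named /healthz or /livez.
	if strings.HasPrefix(p.Name, "/healthz") || strings.HasPrefix(p.Name, "/livez") {
		return sdktrace.SamplingResult{Decision: sdktrace.Drop}
	}
	return f.next.ShouldSample(p)
}

func (f noiseFilter) Description() string {
	return "noiseFilter{" + f.next.Description() + "}"
}
```

You would plug it in with sdktrace.NewTracerProvider(sdktrace.WithSampler(NewNoiseFilter(sdktrace.ParentBased(sdktrace.AlwaysSample())))); collector-side OTTL rules remain the right place for organization-wide filtering.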
The second point was enriching signals with context. You can add resource attributes; OpenTelemetry has good semantic conventions for which labels to attach. If you're running in a different cloud provider or a different region, you can add those labels at the processor stage. Now, most applications today already have logging capabilities, and OpenTelemetry has something called the Logs Bridge: you can use that instrumentation, apply the same semantic conventions so the logs carry those labels, and send the data to the OTel collector.
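At the SDK end, attaching those semantic-convention labels looks roughly like this Go sketch; the specific values (service name, environment, cloud provider, region) are placeholders, and the same keys can equally be injected collector-side with the resource or k8sattributes processors.

```go
package telemetry

import (
	"context"

	"go.opentelemetry.io/otel/sdk/resource"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

// NewResource builds the resource attached to every span, metric, and log
// record, using OpenTelemetry semantic conventions for the keys.
func NewResource(ctx context.Context) (*resource.Resource, error) {
	return resource.New(ctx,
		resource.WithFromEnv(), // honour OTEL_RESOURCE_ATTRIBUTES set by the platform
		resource.WithAttributes(
			semconv.ServiceName("vehicle-telemetry-api"), // placeholder service name
			semconv.DeploymentEnvironment("prod"),
			semconv.CloudProviderAWS,
			semconv.CloudRegion("us-east-1"),
		),
	)
}
```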
The reason I wanted to mention the correlation between traces, metrics, and logs is that if there's a spike in a metric and you want to see what caused it, you need a relationship between the traces, the metrics, and the logs. The mindset of an operations engineer is: don't just show me the logs with 5xx errors, show me which service caused them and where they came from. By attaching all that metadata in the OTel processors, you get the whole context.
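Concretely, the trace-to-log link is just the trace and span IDs carried on every log line. A minimal Go sketch, assuming structured logging with the standard library's slog:

```go
package telemetry

import (
	"context"
	"log/slog"

	"go.opentelemetry.io/otel/trace"
)

// LogWithTrace emits a structured log line stamped with the IDs of the span
// in ctx, so the backend can pivot from a 5xx log straight to its trace.
func LogWithTrace(ctx context.Context, logger *slog.Logger, msg string, args ...any) {
	sc := trace.SpanContextFromContext(ctx)
	if sc.IsValid() {
		args = append(args,
			"trace_id", sc.TraceID().String(),
			"span_id", sc.SpanID().String(),
		)
	}
	logger.InfoContext(ctx, msg, args...)
}
```

The OpenTelemetry log bridges do this stamping automatically; the sketch just shows what the correlation amounts to.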
Managing observability is about how you treat your telemetry. For example, when we write our infrastructure, say AWS services, we use Terraform or Crossplane or something like that. In my mind the same goes for observability: you have to treat your telemetry as code. You can version your schema, and you can track an instrumentation score, that is, how healthy the telemetry is as you make changes, all managed through source control. That's the key point of this slide. The platform team should also provide preconfigured OTel collector configurations, default exporters, and sampling rules as a best practice, so service teams don't have to reinvent the wheel.
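One way a platform team ships those defaults is as a small shared Go package that application teams import, rather than each team hand-rolling exporter and sampler configuration. This is a hypothetical sketch: the package name, the node-local endpoint convention, and the 10% sampling ratio are all assumptions, not details from the talk.

```go
package platformotel

import (
	"context"
	"os"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

// Setup wires the platform defaults: OTLP export to the node-local DaemonSet
// collector, a parent-based 10% sampler, and standard resource attributes.
// It returns a shutdown function to flush spans on exit.
func Setup(ctx context.Context, service string) (func(context.Context) error, error) {
	// NODE_IP is assumed to be injected via the Kubernetes Downward API.
	endpoint := os.Getenv("NODE_IP") + ":4317"

	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint(endpoint),
		otlptracegrpc.WithInsecure(), // in-cluster hop to the local agent
	)
	if err != nil {
		return nil, err
	}

	res, err := resource.New(ctx,
		resource.WithFromEnv(),
		resource.WithAttributes(semconv.ServiceName(service)),
	)
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithResource(res),
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.10))),
	)
	otel.SetTracerProvider(tp)
	return tp.Shutdown, nil
}
```

An application team then just calls shutdown, err := platformotel.Setup(ctx, "payments") in main and gets the platform's defaults for free.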
Now, the following slides are screenshots of what I have been describing in terms of correlating signals. In this slide you're seeing network observability first. I mentioned Istio, the service mesh: the mesh has an extension provider that can send traces to an OpenTelemetry collector. That's how you can understand, from the service's point of view, how its Envoy proxy is handling that traffic. So in this screen you're seeing the correlation of logs and traces from the network point of view, because Envoy is capturing those traces on behalf of the service itself.
This is the exemplars feature, for metrics-to-traces correlation. On the left side you're seeing a time-series graph. If you want to understand a spike, you can click on that metric point, and it takes you to the traces and shows you the whole waterfall of what caused that spike to happen.
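If you expose Prometheus metrics directly from Go, exemplars are attached at record time, and the backend renders them as those clickable points. A sketch assuming client_golang and an active OpenTelemetry span; the histogram name and labels are arbitrary.

```go
package telemetry

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"go.opentelemetry.io/otel/trace"
)

var requestDuration = promauto.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "app_request_duration_seconds",
		Help:    "Request latency.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"path"},
)

// ObserveWithTraceExemplar records a latency observation and, when the request
// ran inside a sampled span, attaches its trace ID as an exemplar so the
// metric point can link back to the exact trace behind the spike.
func ObserveWithTraceExemplar(ctx context.Context, path string, d time.Duration) {
	obs := requestDuration.WithLabelValues(path)
	sc := trace.SpanContextFromContext(ctx)
	if eo, ok := obs.(prometheus.ExemplarObserver); ok && sc.IsSampled() {
		eo.ObserveWithExemplar(d.Seconds(), prometheus.Labels{"trace_id": sc.TraceID().String()})
		return
	}
	obs.Observe(d.Seconds())
}
```

Note that exemplars are only exposed when the /metrics handler serves the OpenMetrics format (promhttp.HandlerOpts{EnableOpenMetrics: true}).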
The service graph: I mentioned span metrics and the service node graph, and this is the view you get if you correlate all your signals in one gateway collector.
Now, we've talked about the collector; how do you do operations around the collector itself? These are a bunch of metrics for your receivers, exporters, and processors. The key takeaway is that you have to monitor all these signals to understand when to scale your OTel collectors, based on the latency your tail-sampling processor or your load-balancing exporter is experiencing. For example, one of these metrics measures how long a trace waits before it becomes eligible for a sampling decision. If you're seeing high latency there, you have to increase the concurrency or the number of replicas of the OTel collector. When you scale it, you also have to make sure each collector only scrapes the jobs it is responsible for; otherwise you'll get duplicate time series in your TSDB.
A few things from the deployment point of view. Argo Rollouts has integration with OpenTelemetry baked in, so if you're doing a canary deployment, you can have your application's instrumentation go to the same OTel collector deployment and send its signals. You can then correlate whether the new version of the app is as stable as the previous one, and based on that you can either abort the deployment or continue with it. So this is basically showcasing the integration between OpenTelemetry and Argo Rollouts.
On CI/CD observability: we use a framework called Temporal. A Temporal workflow automatically captures state at every step, and in the event of a failure it can pick up exactly where it left off. On the left side you're seeing the CI pipeline, and when the CI pipeline finishes from the GitHub commit, the same trace continues into the CD workflow, which does all the validation and hydration. The point here is that application teams write their code by expressing what they need: if my application needs a load balancer, a database, or a messaging system, the infrastructure team takes those concerns and, using KubeVela, expresses them as Kubernetes objects. The main takeaway is that OpenTelemetry can integrate with each of those steps in the Temporal workflow and attach the spans from each stage into one waterfall, and that's what you get in this dashboard. On the left you have the traces for both CI and CD together across all your environments, and you can find out where your bottleneck is: is it in the queue for your deployments, or is it in the build phase, for example.
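As a rough sketch of what the CD side can look like with Temporal's Go SDK, here is a workflow with validate, hydrate, and deploy phases modeled as activities. The activity names and payloads are made up for illustration, and the OpenTelemetry wiring (Temporal's tracing interceptor plus context propagation from the CI trace) is omitted for brevity.

```go
package cd

import (
	"time"

	"go.temporal.io/sdk/workflow"
)

// DeployRequest is a made-up payload describing what the application team asked for.
type DeployRequest struct {
	Service string
	Version string
	Env     string
}

// DeployWorkflow runs the CD phases as Temporal activities. Each activity is
// checkpointed, so a failure resumes exactly where it left off, and each one
// shows up as its own span in the end-to-end CI/CD trace.
func DeployWorkflow(ctx workflow.Context, req DeployRequest) error {
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Minute,
	})

	// Validate the request: quotas, policies, image signature, and so on.
	if err := workflow.ExecuteActivity(ctx, "ValidateRelease", req).Get(ctx, nil); err != nil {
		return err
	}

	// Hydrate: expand the team's intent (database, load balancer, ...) into
	// concrete Kubernetes objects, e.g. via the OAM/KubeVela layer.
	var manifests []string
	if err := workflow.ExecuteActivity(ctx, "HydrateManifests", req).Get(ctx, &manifests); err != nil {
		return err
	}

	// Apply to the target environment and wait for rollout health.
	return workflow.ExecuteActivity(ctx, "ApplyAndVerify", req.Env, manifests).Get(ctx, nil)
}
```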
Synthetic monitoring and testing: this is where k6 comes in, another open-source project, which you can integrate with OpenTelemetry to run synthetic transactions. If I'm doing an OTA deployment to a vehicle and touching 100 services along the way, I want to test it in a way that catches the problem before the end user catches it. This is where OpenTelemetry along with k6 gives you the whole end-to-end workflow, and you can tie it into your own deployment workflow, like the Argo Rollouts setup in the previous screen, to make sure the deployment you're doing is good enough and isn't causing any regression.
The last slide here is about dashboards and alerts as code. The same way we treat all telemetry as code, I want to emphasize that dashboards and alerts should also be treated as code. For example, the same Crossplane that underlies your infrastructure can also manage your tooling of choice: if it's Grafana, as an open-source example, you can have a Crossplane provider for Grafana and use it to manage your alerts and dashboards. That way you have a consistent set of dashboards for all the service teams, no one has siloed dashboards, and everything is version-controlled, so if I'm an SRE on call for a service I don't own, I can still make sense of it because of the semantic conventions and the shared metric registry.
That's all I had. This is the feedback code; I would appreciate your feedback for the session, and I can take any questions you may have at this time.
[applause]
>> Yes.
>> Excuse me, there's a microphone in the middle. If you have a Q&A question, just walk over to the microphone.
>> My question was around the sampling strategy. So all spans firm-wide are being sent to a centralized fleet of OpenTelemetry collectors, which then make the sampling decision. If that's the correct understanding, what is roughly the volume at which these spans are coming in, per second or per day?
>> I can't talk about specific numbers for the industry I work for, but as a general figure we handle close to 3.5 million spans a minute, across roughly 100 services. And the centralized gateway collector is not just one pod; it has multiple replicas and built-in autoscaling. We use KEDA, the Kubernetes Event-Driven Autoscaler, so scaling is not based only on CPU and memory consumption; you can define your own metrics. For example, I talked about some of the metrics we track in the tail-sampling processor and the load-balancing exporter. You can use those metrics to scale your OTel collectors, so you're basically creating a custom metric to scale those collectors rather than relying only on CPU and memory. You also have to make sure those collectors aren't duplicating work: you can shard your applications so that only certain collectors are responsible for scraping certain targets, for example.
>> Right, so this is like a central heartbeat in that case?
>> Yes.
>> Very quick follow-up: how do you handle back pressure?
>> Yeah. I didn't show it in this diagram, but we have Kafka. You can have the OTel exporters talk to Kafka, and then it can replay the data if there's back pressure, or, say, when we're doing maintenance and upgrading the OTel collectors. It's a rolling upgrade strategy, but the data stays in the Kafka queue and can be replayed when the collectors are back online.
>> All right. Thank you so much.
>> I think we are about out of time, but do I have time for one more question, if available?
>> Okay. Yeah.
>> Hello. Thank you for the presentation, by the way. You had a slide about collecting spans and metrics for CI/CD.
>> Yes.
>> No, I think one more back, maybe.
>> Yeah. So for that one, can you explain what tools you used to collect these? Are you collecting these from GitHub Actions?
>> Yeah.
>> Okay.
>> Yeah, sure. So basically we use this tool called Dagger, dagger.io. It's an open-source tool, and Dagger builds your CI pipelines programmatically. When a developer pushes a commit, from that point we start, through the GitHub Actions webhook APIs, and that's integrated with OpenTelemetry instrumentation. We use Golang or Java; it doesn't matter which language, because there's an auto-instrumentation provider for it. From that point we start the parent span, and then with context propagation we're able to push the same trace ID into each and every request. On the CD side we use the open-source Temporal workflow engine. It has the concept of activities, so you can span out each of those build phases. I talked about the Open Application Model, for example: if your application needs, say, a database, it makes sure the database spans also show up in conjunction with your CD pipeline, so you're basically tracing end to end. Did I answer that question?
>> Yes. Thank you.
>> I will be in the hallway if you have any more follow-up questions. Thank you.
Scaling Observability for an Enterprise Internal Developer Portal With OpenTelemetry - Gaurav Saxena, Automotive Company
Building an enterprise-scale Internal Developer Platform (IDP) requires a robust observability strategy to ensure reliability, performance, and developer efficiency. This presentation explores how OpenTelemetry seamlessly integrates into an IDP to provide deep visibility and actionable insights. I'll dive into the benefits of OpenTelemetry and best practices for running collectors at scale to drive business impact.