Cool. So thank you all for being here with us today. And by us, I mean me, as we share a meta story about building an IDP while operating our entire engineering organization on that very same IDP.
So I'm Lauren. Andrew was supposed to join me today, but thanks to a series of unfortunate events, it's just me holding it down. For some context, Andrew is the engineering manager for Backstage and Portal platform engineering at Spotify, and I'm the engineering manager for the observability team at Spotify.
So for a quick walk down memory lane, the two of us go way back. Andrew and I started off as senior engineer peers on the observability team together. Then we became EM peers in different orgs. Then Andrew moved to data platform, and eventually to this SRE and observability infra space in a completely new organization, the one that builds Portal and Backstage open source for the world beyond Spotify. So why would we put this on a slide? To emphasize that truly nothing matters more than relationships and trust. What we're going to share with you all today, no matter how great the architecture diagram, could not have come together if we didn't catalyze a partnership between our respective teams.
So the concept of an internal developer portal, or IDP, is about conquering fragmentation, reducing cognitive overload, and minimizing operational risk as you scale modern systems. At Spotify, we've been running our business with Backstage at the very heart of our developer experience for years. This history has yielded many successes, and even more failures. And as I stand here today, we have an engineering culture that is collaborative and efficient, putting more of the information our engineers need to do their jobs well at their fingertips by the day. We're here today to share just one of the success stories in boosting efficiency, achieving standardization, and running through brick walls, all in the name of getting things done and making our engineers happier. You'll leave here today with a quantified success story in platformization, defragmentation, and standardization, all with some awesome open source tech at the very core.
So let's set the stage. Picture a brand new product in a very young organization where the rubber was meeting the road fast, and if you didn't scale up an infra team proactively, a lot of operational risk was looming. This is a classic storyline. At Spotify, we have numerous strategies to conquer these challenges, all of which reduce fragmentation from the jump and reduce cognitive load on your engineers so that they preserve focus on actually building things, and in return, operational risk is managed in a healthy way.
So when Andrew joined the Backstage organization, they were making the shift to building a public-facing SaaS flavor of Backstage. This was a massively different business objective than simply operating Backstage to make Spotify itself work, or even supporting the open source project. Having been at Spotify for almost five years at the time, admittedly nothing had ever felt truly greenfield until this moment for him. So he was offered a challenge: build out a team and function to ensure that as our IDP sells, it can also scale. As someone with an SRE background who loves hiring and bootstrapping teams, he was pretty excited about the opportunity to do this. That said, no matter what, it's always an elegant puzzle to go from zero to one.
So, before crafting the team and getting to know his new reports, it was first important for him to be a sponge. And there were numerous learning curves involved for him. He was not used to working in the B2B space, or even in an organization that is external facing; his entire time at Spotify had been focused on internal platforms. So he had to dust off the consulting background that he had and lean in with critical functions like sales, customer success, marketing, and partnerships as core stakeholders alongside product and engineering. What were the biggest challenges and pain points he was hearing from this diverse set of stakeholders? What were the P0s, P1s, and P2s? And what was going to be his strategy to make folks feel heard? After all, they'd been operating without an infra team for some time, and a pile of debt had accrued. Once he knew what needed to be solved most imminently, it was time to roll up his sleeves and play as an IC for a while alongside this small team, all while growing it into a full-fledged infra team. So, let's get technical. What was a burning P0 that fell on the team's plate?
Observability. So, metrics, alerting, service levels, all pie in the sky for this team. On the one hand, wow, this was one of those engineering greenfield moments. There were no mistakes to recover from, only our own failures to learn from and eventually celebrate. On the other hand, that was pretty daunting, as the earliest testers and adopters of the IDP were in the product on a daily basis, and the team had very little operational data to ensure those users were having a positive experience. Safe to say this encouraged them to put a little pep in their step and accelerate development here.
So, numerous options presented themselves. One was to function like our own completely isolated business. There were benefits here, especially from the perspective of building something exactly the way we wanted it. But metaphorically speaking, the candy shop of pre-existing platform offerings that Spotify had already mastered was available. And in this arena, they largely were already exactly what they wanted: highly available, scaled to the heavens, OTel and Prometheus native. So this is where our partnership came into play. Andrew and I work in completely different organizations in a pretty large business. The last hurdle that remained was strictly organizational and strategic, and we came together and reminded ourselves and our senior leadership of our most important guiding principles.
Despite a unique tech stack relative to the golden technologies that keep the Spotify consumer platforms operating around the clock, we didn't need to build from scratch. We already had a world-class set of standardized platforms at our disposal. And yeah, you guessed it: Backstage was the single pane of glass, standardized and commonplace across our organizations. So the solution was simple: reuse what's proven. This mindset not only accelerated delivery but helped maintain consistency and reliability from day one.
So these principles became the foundation for how we approached further utilizing our observability platform. First, control the chaos. We built guardrails, not gates, for people. We enable teams to take advantage of the infrastructure the observability org supports without reinventing the wheel and needing to maintain this infrastructure themselves. Next, consistency. We want developers to have a unified developer experience, a shared standard language, and reusable tooling across all domains and products. And finally, standardize golden technologies. We focused on curating golden paths rather than encouraging endless customization, meeting people where they are and meeting their needs so they don't have to build, deploy, and customize their own things. These ideas sound simple, but they were transformative for both engineering culture and velocity.
So here's what we ended up with. The fun of it all is that running a B2B business is nothing at all like running Spotify as a consumer business. Every customer is afforded strict tenant isolation, so the scaling patterns are quite different. Given the tenant isolation, we had to get smart about two things. One, anonymizing operational data; Spotify already does a nice job of preventing PII in observability data, so we got this out of the box. And two, shipping our metrics inside the Spotify perimeter so that we could leverage the world-class tooling that our observability team already operates, all while maintaining a SOC 2 compliant posture and maintaining tenant isolation.
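To make the anonymization side a bit more concrete, here's a minimal sketch of the kind of label allowlisting you might apply before telemetry leaves a tenant's boundary. The allowlist, labels, and function name are hypothetical, purely for illustration, not Spotify's actual implementation.

```python
# Hypothetical sketch (not Spotify's actual code): drop any metric label that
# isn't explicitly allowlisted, so nothing resembling PII or customer data
# crosses into the Spotify perimeter.
ALLOWED_LABELS = {"tenant_id", "instance", "region", "route", "status_code"}

def scrub_labels(labels: dict[str, str]) -> dict[str, str]:
    """Keep only allowlisted label keys; everything else is dropped."""
    return {key: value for key, value in labels.items() if key in ALLOWED_LABELS}

# Example: the user email never leaves the tenant boundary.
print(scrub_labels({"tenant_id": "t-1234", "user_email": "someone@example.com"}))
# -> {'tenant_id': 't-1234'}
```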
So this diagram shows our end-to-end observability pipeline and how telemetry flows from our IDP into Spotify's infrastructure, all of which is managed and maintained via the very same IDP that we are ensuring the high availability of. Each IDP instance emits logs, metrics, and eventually traces; that's something we're still working on. Data is sent to the OpenTelemetry collectors via OTLP push or Prometheus scrapes. The collector exports data to Google Pub/Sub, bringing it into Spotify's perimeter. So Pub/Sub is acting as a transport layer here: it defines the secure perimeter and decouples data producers from consumers. The metrics then flow into VictoriaMetrics. The architecture is modular and built entirely on open standards.
So this way we can have consistent
observability across all environments
and it's very scalable and easy to
extend.
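As a rough sketch of the emit side of that pipeline, here's roughly what an OTLP metrics push from an IDP instance could look like with the OpenTelemetry Python SDK. The service name, collector endpoint, tenant label, and metric names are assumptions for illustration, not the actual Portal instrumentation.

```python
# Hypothetical sketch: an IDP instance pushing metrics over OTLP to a local
# OpenTelemetry collector, which would then forward them onward (e.g. via
# Pub/Sub) in its own configuration.
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

# Resource attributes identify the instance; tenant.id is an assumed,
# anonymized identifier rather than anything customer-identifying.
resource = Resource.create({
    "service.name": "portal-backend",
    "deployment.environment": "production",
    "tenant.id": "tenant-1234",
})

exporter = OTLPMetricExporter(endpoint="http://otel-collector:4317", insecure=True)
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=15_000)
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))

# Record a simple counter the way any instrumented service would.
meter = metrics.get_meter("portal.observability")
requests = meter.create_counter(
    "portal.requests", unit="1", description="Requests served per instance"
)
requests.add(1, {"route": "/catalog", "status_code": "200"})
```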
So the beauty of what we've adopted is that it's tech agnostic. Our observability stack scales from Spotify's internal services to our IDP. Different teams use different tech stacks, and that's fine. Our standards ensure observability can work the same way everywhere. And it's less about specific tools and more about shared principles, consistent signals, and reliable outcomes. So now that the stack's in place, let's see what kind of impact it's had on speed, cost, and happiness for our developers.
Ultimately, by unlocking the toolbox in an organization, impact came quickly and naturally. MTTR was measurably reduced, rendering happier engineers and happier customers. And most importantly, delivery velocity increased. The teams were able to ship faster, armed with real, reliable signals. When we say speed and happiness, this is what we mean: engineers are happier when their tools just work and they can trust that their changes are secure. And in perpetuity, we're now able to ship features far more confidently.
So this is the heart of it. We dogfood everything. By running our IDP utilizing standard internal infrastructure, we gain empathy for our users because we are our users. Sharing tools sharpens empathy, common standards reduce MTTR, and unified feedback loops improve both platform and product.
And at this point in our journey, we realized the biggest hurdle wasn't technology. It was just hesitation. So our message to ourselves and to you is: seriously, just don't overthink it. The hardest part is getting started. You'll never have the answers before you begin, but build incrementally, reuse what works, and iterate. And if you have folks that you've worked with before, partner with them and figure out what you can do together.
So, looking ahead, there's still plenty to do. Now that Andrew's team has alerts configured, they're focusing on fine-tuning them to cut the noise. There have been a few adjustments needed since we set all this up. On the observability side, we plan to enable end-to-end tracing for them in the near future. We're also working on auto-instrumentation with smarter sampling guardrails, and exploring anomaly detection and AI-assisted triage for incidents. Each of these steps moves us closer to truly intelligent, proactive observability, so we solve things before you even wake up from the page.
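As one hedged illustration of what a sampling guardrail could look like once that tracing work lands, here's a minimal sketch using the OpenTelemetry Python SDK's ratio-based sampler. The environment variable name and sampling ratio are invented for the example; this is not the actual Portal tracing setup.

```python
# Hypothetical sketch: cap how many new traces an instance keeps so that
# auto-instrumentation can't flood the pipeline. Names here are illustrative.
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Guardrail: sample roughly 5% of new traces by default, overridable per instance.
ratio = float(os.getenv("PORTAL_TRACE_SAMPLE_RATIO", "0.05"))
trace.set_tracer_provider(TracerProvider(sampler=ParentBased(TraceIdRatioBased(ratio))))

tracer = trace.get_tracer("portal.tracing")
with tracer.start_as_current_span("render-catalog-page"):
    pass  # instrumented work would go here
```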
Thanks to the partnership we built, the Portal infra team will be among the first adopters of all of our new solutions across the business, providing valuable feedback and continuing to be a close collaborator as we evolve together.
So yeah, thanks for joining us, and by us I mean me, and for being part of this broader conversation about how platform engineering can drive culture and scale. I'm happy to dive deeper on any of these stories; we actually have a bunch of time, so we can do questions if you want to talk about how you're applying IDP principles in new orgs. We also have a booth, 1143, in the main section downstairs that I think will open tomorrow, and then we'll be there for the rest of the week. So yeah, thanks.
[applause]
>> Awesome. Thanks, Lauren. So we do have a little bit of time for questions. Would anyone like to ask a question? I will run over with the microphone. No questions for... Oh, there's one.
>> Oh, okay. Do you have any stories? I think you said there are other examples. Do you have any stories of startups or small developer shops that have, maybe not built an IDP from scratch, but have built their own existing IDP and then want to move to something greater?
>> Built their own IDP. Not exactly. I mean, this was focused on our internal IDP becoming a SaaS product and then being able to observe that with our tools in-house. So the takeaway is, if you have open standards, you can set them up for anything like that. If that's the question, then being able to utilize the open standards for your own in-house IDP could also totally work, but I'm not sure if I misheard the question. [laughter]
>> Thank you. Anyone else got a question?
>> What type of metrics are you trying to pull off the IDP to inform where it needs to go next? Like, what are the key values that you're looking at to inform what else you need to build?
>> Yeah. So, I kind of wish Andrew was here, because that's his side of everything. But from what I can tell, right now the issue they're having is that their alerts are too sensitive. They've set things up for each of their Portal instances to see fairly basic metrics, I would say, like CPU and memory. I don't have too many details on exactly what his team is looking at, but I do know that the biggest challenge right now is that they configured things out of the box, and now it's about adjusting them to reflect the feedback their customers are giving them. Part of the problem they were having, too, is that folks were coming to them with issues that they weren't able to proactively detect. So the nice thing about setting this up is that they'll proactively see these things first, before their customers even notice.
>> Nice. Any other questions?
>> Yeah. So I just wanted to know how y'all came to the decision to use VictoriaMetrics instead of any other metrics tool.
>> Yeah. So internally we had our own time series database for a long time that was built by a different team many years ago, before we were deploying things with Kubernetes. Essentially, it just couldn't scale anymore. Basically, once there were instances spinning up all the time, the cardinality got way too big and things were falling over, and it was pretty slow. So we tried out a few different vendors, honestly, and VictoriaMetrics was the one that worked best for us. We have a pretty good working relationship with them, so so far, so good. It scales really well, it can handle the cardinality, and they have a lot of features that help us tweak things and disable things that we didn't have before.
>> Any other questions?
>> No? In which case, thanks, Lauren. Thank you.
Platforming the Platform: Running Our IDP With Our IDP - Andrew Sail & Lauren Roshore, Spotify

Scaling modern systems is hard—fragmentation, cognitive overload, and operational risk grow fast. At Spotify, we build and operate Portal, our Backstage-based Internal Developer Platform for external customers. But here’s the twist: we use Portal to run Portal. By dogfooding our own platform and tapping into Spotify’s internal observability and tooling ecosystem, we deliver faster, smarter, and with more joy—for both our teams and our customers. In this talk, Andrew Sail (Lead for Backstage/Portal Platform Engineering & SRE) and Laurén Roshore (Lead for Spotify’s internal Observability & Monitoring platforms) share how building with the same tools we provide sharpens empathy, reduces MTTR, and enables richer, more scalable defaults. You’ll learn how internal-external partnerships help us accelerate delivery and operate with confidence. This is a talk for anyone building platforms who wants fewer fire drills, better feedback loops, and happier engineers at every layer.