Cool. So thank you all for being here with us today. And by us, I mean me, as we share a meta story about building an IDP while operating our entire engineering organization on that very same IDP.
So I'm Lauren. Andrew was supposed to join me today, but thanks to a series of unfortunate events, it's just me holding it down. For some context, Andrew is the engineering manager for Backstage and Portal platform engineering at Spotify, and I'm the engineering manager for the observability team at Spotify.
So for a quick walk down memory lane, the two of us go way back. Andrew and I started off as senior engineer peers on the observability team together. Then we became EM peers in different orgs. Then Andrew moved to data platform, and eventually to this SRE and observability infra space in a completely new organization, the one that builds Portal and Backstage open source for the world beyond Spotify. So why would we put this on a slide? To emphasize that truly nothing matters more than relationships and trust. What we're going to share with you all today, no matter how great the architecture diagram, could not have come together if we didn't catalyze a partnership between our respective teams.
So the concept of an internal developer portal, or IDP, is about conquering fragmentation, reducing cognitive overload, and minimizing operational risk as you scale modern systems. At Spotify, we've been running our business with Backstage at the very heart of our developer experience for years. This history has yielded many successes, and even more failures. And as I stand here today, we have an engineering culture that is collaborative and efficient, putting more of the information our engineers need to do their jobs well at their fingertips by the day. We're here today to share just one of the success stories in boosting efficiency, achieving standardization, and running through brick walls, all in the name of getting things done and making our engineers happier. You'll leave here today with a quantified success story in platformization, defragmentation, and standardization, all with some awesome open source tech at the very core.
So let's set the stage. Picture a brand new product in a very young organization where the rubber was meeting the road fast, and if you didn't scale up an infra team proactively, a lot of operational risk was looming. This is a classic storyline. At Spotify, we have numerous strategies to conquer these challenges, all of which reduce fragmentation from the jump and reduce cognitive load on your engineers so that they preserve focus on actually building things, and in return, operational risk is managed in a healthy way.
So when Andrew joined the Backstage organization, they were making the shift to building a public-facing SaaS flavor of Backstage. This was a massively different business objective than simply operating Backstage to make Spotify itself work, or even supporting the open source project. Having been at Spotify for almost five years at the time, admittedly nothing had ever felt truly greenfield until this moment for him. So he was offered a challenge: build out a team and function to ensure that as our IDP sells, it can also scale. As someone with an SRE background who loves hiring and bootstrapping teams, he was pretty excited about the opportunity to do this. That said, no matter what, it's always an elegant puzzle to go from zero to one.
So, before crafting the team and getting to know his new reports, it was first important for him to be a sponge. And there were numerous learning curves involved for him. He was not used to working in the B2B space, or even in an organization that is external facing; his entire time at Spotify had been focused on internal platforms. So he had to dust off the consulting background that he had and lean in with critical functions like sales, customer success, marketing, and partnerships as core stakeholders alongside product and engineering. What were the biggest challenges and pain points he was hearing from this diverse set of stakeholders? What were the P0s, P1s, and P2s? And what was going to be his strategy to make folks feel heard? After all, they'd been operating without an infra team for some time, and a pile of debt had accrued. Once he knew what needed to be solved most imminently, it was time to roll up his sleeves and play as an IC for a while alongside this small team, all while growing it into a full-fledged infra team. So, let's get technical. What was a burning P0 that fell on the team's plate?
Observability. So, metrics, alerting, service levels, all pie in the sky for this team. On the one hand, wow, this was one of those engineering greenfield moments. There were no mistakes to recover from, only our own failures to learn from and eventually celebrate. On the other hand, that was pretty daunting, as the earliest testers and adopters of the IDP were in the product on a daily basis, and the team had very little operational data to ensure those users were having a positive experience. Safe to say this encouraged them to put a little pep in their step and accelerate development here.
So, numerous options presented themselves. One was to function like our own completely isolated business. There were benefits here, especially from the perspective of building something exactly the way we wanted it. But metaphorically speaking, the candy shop of pre-existing platform offerings that Spotify had already mastered was available. And in this arena, they largely were already exactly what they wanted: highly available, scaled to the heavens, OTel and Prometheus native. So this is where our partnership came into play. Andrew and I work in completely different organizations in a pretty large business. The last hurdle that remained was strictly organizational and strategic, and we came together and reminded ourselves and our senior leadership of our most important guiding principles.
Despite a unique tech stack relative to the golden technologies that keep the Spotify consumer platforms operating around the clock, we didn't need to build from scratch. We already had a world-class set of standardized platforms at our disposal. And yeah, you guessed it: Backstage was the single pane of glass, standardized and commonplace across our organizations. So the solution was simple: reuse what's proven. This mindset not only accelerated delivery but helped maintain consistency and reliability from day one.
So these principles became the foundation for how we approached further utilizing our observability platform. First, control the chaos. We built guardrails, not gates, for people. We enable teams to take advantage of the infrastructure the observability org supports without reinventing the wheel and needing to maintain this infrastructure themselves. Next, consistency. We want developers to have a unified developer experience, a shared standard language, and reusable tooling across all domains and products. And finally, standardize golden technologies. We focused on curating golden paths rather than encouraging endless customization, meeting people where they are and meeting their needs so they don't have to build, deploy, and customize their own things. These ideas sound simple, but they were transformative for both engineering culture and velocity.
So here's what we ended up with. The fun of it all is that running a B2B business is nothing at all like running Spotify as a consumer business. Every customer is afforded strict tenant isolation, so the scaling patterns are quite different. Given the tenant isolation, we had to get smart about two things. One, anonymizing operational data; Spotify already does a nice job of preventing PII in observability data, so we got this out of the box. And two, shipping our metrics inside the Spotify perimeter so that we could leverage the world-class tooling that our observability team already operates, all while maintaining a SOC 2 compliant posture and maintaining tenant isolation.
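To make the anonymization side a bit more concrete, here's a minimal sketch of the kind of label allowlisting you might apply before telemetry leaves a tenant's boundary. The allowlist, labels, and function name are hypothetical, purely for illustration, not Spotify's actual implementation.

```python
# Hypothetical sketch (not Spotify's actual code): drop any metric label that
# isn't explicitly allowlisted, so nothing resembling PII or customer data
# crosses into the Spotify perimeter.
ALLOWED_LABELS = {"tenant_id", "instance", "region", "route", "status_code"}

def scrub_labels(labels: dict[str, str]) -> dict[str, str]:
    """Keep only allowlisted label keys; everything else is dropped."""
    return {key: value for key, value in labels.items() if key in ALLOWED_LABELS}

# Example: the user email never leaves the tenant boundary.
print(scrub_labels({"tenant_id": "t-1234", "user_email": "someone@example.com"}))
# -> {'tenant_id': 't-1234'}
```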
So this diagram shows our end-to-end observability pipeline and how telemetry flows from our IDP into Spotify's infrastructure, all of which is managed and maintained via the very same IDP that we are ensuring the high availability of. Each IDP instance emits logs, metrics, and eventually traces; that's something we're still working on. Data is sent to the OpenTelemetry collectors via OTLP push or Prometheus scrapes. The collector exports data to Google Pub/Sub, bringing it into Spotify's perimeter. So Pub/Sub is acting as a transport layer here: it defines the secure perimeter and decouples data producers from consumers. The metrics then flow into VictoriaMetrics. The architecture is modular and built entirely on open standards.
So this way we can have consistent
observability across all environments
and it's very scalable and easy to
extend.
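As a rough sketch of the emit side of that pipeline, here's roughly what an OTLP metrics push from an IDP instance could look like with the OpenTelemetry Python SDK. The service name, collector endpoint, tenant label, and metric names are assumptions for illustration, not the actual Portal instrumentation.

```python
# Hypothetical sketch: an IDP instance pushing metrics over OTLP to a local
# OpenTelemetry collector, which would then forward them onward (e.g. via
# Pub/Sub) in its own configuration.
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

# Resource attributes identify the instance; tenant.id is an assumed,
# anonymized identifier rather than anything customer-identifying.
resource = Resource.create({
    "service.name": "portal-backend",
    "deployment.environment": "production",
    "tenant.id": "tenant-1234",
})

exporter = OTLPMetricExporter(endpoint="http://otel-collector:4317", insecure=True)
reader = PeriodicExportingMetricReader(exporter, export_interval_millis=15_000)
metrics.set_meter_provider(MeterProvider(resource=resource, metric_readers=[reader]))

# Record a simple counter the way any instrumented service would.
meter = metrics.get_meter("portal.observability")
requests = meter.create_counter(
    "portal.requests", unit="1", description="Requests served per instance"
)
requests.add(1, {"route": "/catalog", "status_code": "200"})
```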
So the beauty of what we've adopted is that it's tech agnostic. Our observability stack scales from Spotify's internal services to our IDP. Different teams use different tech stacks, and that's fine. Our standards ensure observability can work the same way everywhere. And it's less about specific tools and more about shared principles, consistent signals, and reliable outcomes. So now that the stack's in place, let's see what kind of impact it's had on speed, cost, and happiness for our developers.
Ultimately, by unlocking the toolbox in an organization, impact came quickly and naturally. MTTR was measurably reduced, rendering happier engineers and happier customers. And most importantly, delivery velocity increased. The teams were able to ship faster, armed with real, reliable signals. When we say speed and happiness, this is what we mean: engineers are happier when their tools just work and they can trust that their changes are secure. And in perpetuity, we're now able to ship features far more confidently.
So this is the heart of it. We dogfood everything. By running our IDP utilizing standard internal infrastructure, we gain empathy for our users because we are our users. Sharing tools sharpens empathy, common standards reduce MTTR, and unified feedback loops improve both platform and product.
And at this point in our journey, we realized the biggest hurdle wasn't technology. It was just hesitation. So our message to ourselves and to you is: seriously, just don't overthink it. The hardest part is getting started. You'll never have the answers before you begin, but build incrementally, reuse what works, and iterate. And if you have folks that you've worked with before, partner with them and figure out what you can do together.
So, looking ahead, there's still plenty to do. Now that Andrew's team has alerts configured, they're focusing on fine-tuning them to cut the noise. There have been a few adjustments needed since we set all this up. On the observability side, we plan to enable end-to-end tracing for them in the near future. We're also working on auto-instrumentation with smarter sampling guardrails, and exploring anomaly detection and AI-assisted triage for incidents. Each of these steps moves us closer to truly intelligent, proactive observability, so we solve things before you even wake up from the page.
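As one hedged illustration of what a sampling guardrail could look like once that tracing work lands, here's a minimal sketch using the OpenTelemetry Python SDK's ratio-based sampler. The environment variable name and sampling ratio are invented for the example; this is not the actual Portal tracing setup.

```python
# Hypothetical sketch: cap how many new traces an instance keeps so that
# auto-instrumentation can't flood the pipeline. Names here are illustrative.
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Guardrail: sample roughly 5% of new traces by default, overridable per instance.
ratio = float(os.getenv("PORTAL_TRACE_SAMPLE_RATIO", "0.05"))
trace.set_tracer_provider(TracerProvider(sampler=ParentBased(TraceIdRatioBased(ratio))))

tracer = trace.get_tracer("portal.tracing")
with tracer.start_as_current_span("render-catalog-page"):
    pass  # instrumented work would go here
```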
Thanks to the partnership we built, the Portal infra team will be among the first adopters of all of our new solutions across the business, providing valuable feedback and continuing to be a close collaborator as we evolve together.
So yeah, thanks for joining us, and by us I mean me, and for being part of this broader conversation about how platform engineering can drive culture and scale. I'm happy to dive deeper on any of these stories; we actually have a bunch of time, so we can do questions if you want to talk about how you're applying IDP principles in new orgs. We also have a booth, 1143, in the main section downstairs that I think will open tomorrow, and then we'll be there for the rest of the week. So yeah, thanks.
[applause]
>> Awesome. Thanks, Lauren. So we do have a little bit of time for questions. Would anyone like to ask a question? I will run over with the microphone. No questions for... Oh, there's one.
>> Oh, okay. Do you have any stories? I think you said there are other examples. Do you have any stories of startups or small developer shops that have, maybe not built an IDP from scratch, but have built their own existing IDP and then want to move to something greater?
>> Built their own IDP. Not exactly. I mean, this was focused on our internal IDP becoming a SaaS product and then being able to observe that with our tools in-house. So the takeaway is, if you have open standards, you can set them up for anything like that. If that's the question, then being able to utilize the open standards for your own in-house IDP could also totally work, but I'm not sure if I misheard the question. [laughter]
>> Thank you. Anyone else got a question?
>> What type of metrics are you trying to pull off the IDP to inform where it needs to go next? Like, what are the key values that you're looking at to inform what else you need to build?
>> Yeah. So, I kind of wish Andrew was here, because that's his side of everything. But from what I can tell, right now the issue they're having is that their alerts are too sensitive. They've set things up for each of their Portal instances to see fairly basic metrics, I would say, like CPU and memory. I don't have too many details on exactly what his team is looking at, but I do know that the biggest challenge right now is that they configured things out of the box, and now it's about adjusting them to reflect the feedback their customers are giving them. Part of the problem they were having, too, is that folks were coming to them with issues that they weren't able to proactively detect. So the nice thing about setting this up is that they'll proactively see these things first, before their customers even notice.
>> Nice. Any other questions?
>> Yeah. So I just wanted to know how y'all came to the decision to use VictoriaMetrics instead of any other metrics tool.
>> Yeah. So internally we had our own time series database for a long time that was built by a different team many years ago, before we were deploying things with Kubernetes. Essentially, it just couldn't scale anymore. Basically, once there were instances spinning up all the time, the cardinality got way too big and things were falling over, and it was pretty slow. So we tried out a few different vendors, honestly, and VictoriaMetrics was the one that worked best for us. We have a pretty good working relationship with them, so so far, so good. It scales really well, it can handle the cardinality, and they have a lot of features that help us tweak things and disable things that we didn't have before.
>> Any other questions?
>> No? In which case, thanks, Lauren. Thank you.
Platforming the Platform: Running Our IDP With Our IDP - Andrew Sail & Lauren Roshore, Spotify

Scaling modern systems is hard—fragmentation, cognitive overload, and operational risk grow fast. At Spotify, we build and operate Portal, our Backstage-based Internal Developer Platform for external customers. But here’s the twist: we use Portal to run Portal. By dogfooding our own platform and tapping into Spotify’s internal observability and tooling ecosystem, we deliver faster, smarter, and with more joy—for both our teams and our customers. In this talk, Andrew Sail (Lead for Backstage/Portal Platform Engineering & SRE) and Laurén Roshore (Lead for Spotify’s internal Observability & Monitoring platforms) share how building with the same tools we provide sharpens empathy, reduces MTTR, and enables richer, more scalable defaults. You’ll learn how internal-external partnerships help us accelerate delivery and operate with confidence. This is a talk for anyone building platforms who wants fewer fire drills, better feedback loops, and happier engineers at every layer.