Thank you. All right. Hi everyone.
All right. Awesome.
So, let's just jump to it. Uh, so a
little bit about me. My name's Max
Espinoza. I'm a senior platform
architect at Viasat, which is a
telecommunications company that's
satellite based. If you guys have heard
of us, it's probably from residential
services or airline flights. Um, I have
nine years in the industry, originally a
brief stint as a Python application
developer and then got sucked into the
cloud and never crawled my way back out.
Um, I built an internal application
platform, and I'll
mention the nuance between that and what
IDPs are in a little bit, and I'm
building another one now, also at Viasat.
I'm a history buff and I'm lately
trying to get into watercolor
particularly urban sketching. Uh it's
interesting. You can be kind of bad at
it and it fits the aesthetic. But
anyways, that's not why you're here. So
this talk is "There Is No Silver Bullet:
The Complexities of Building an Internal
Developer Platform." So I'll go into what
an IDP is like at a very high level.
There are better talks that go into that
in more detail. I'll talk about a
real-world IDP that we're building out at
Viasat right now. And then I'll get into
the meat of the topic, which is the
complexities that we've faced in doing
this, including service mesh choices,
internal shared ownership,
centralization choices,
and issues with poor UX resulting in more
support. Then a couple of points about
getting started, and I'll end it with
some takeaways.
So what you're looking at here is a
reference diagram. Uh thanks to the good
folks at the platform engineering group,
they published actually a couple of
these in different formats for different
uh cloud providers. Some of them even
like multicloud. But anyways, you can
pick up one of these and it's a really
good starting point to understanding and
implementing your own IDP in your
organization. What you'll see here,
and I hope it's clear enough in the
view over here, is that it's broken up
into several planes. The first plane
is the developer control plane. This
is where I think most of the devs spend
most of their time. It comprises
something like Backstage or a tool that
acts as a service catalog. It also
comprises your source control
mechanisms, GitHub being one of the
most popular ones, which is where the
source code for your application lives.
That is the business logic, and then the
infrastructure as code for the things
that are needed to support that business
logic. Usually this takes shape as
something Terraform related.
And everything that goes on in the developer
control plane basically acts as input
into the integration and delivery plane,
which itself is supposed to
take that and push it out into the
resource plane. The resource plane
is where your services are
running. We're at a KubeCon, so probably
Kubernetes and all the other various
supporting resources for that. And that
integration and delivery plane usually
is the CI/CD pipeline and some form of
platform orchestrator. That's the thing
that's taking that platform code and
actually pushing it out, because you
can't have everyone running Terraform on
their laptop. And then there are the monitoring
and logging plane and the security plane,
for which there's a plethora of tools
for you to pick from. So, you
would think looking at this, okay, cool.
I have an understanding of what an IDP
is. Um, I'm just missing one really,
really essential component.
namely Vim. Sorry. [laughter] And
you call this a day and you want to ship
this out to people. And after your dev
team mutinies on you and puts someone
wiser in your place, they take this
back and realize: hey, actually, for
you to need an IDP you need to have reached
a certain level of scale. So you already
have pre-existing tools at your
organization, you already have some
patterns of working, and you likely
already have pseudo-IDPs lying
around. So you need to figure out how to
basically meet the organization where
it is. And this is kind of our take
on it at Viasat. You'll see here we took
what was the reference architecture and
we added tools that we've already built
a lot of experience in and already have
integrations for. You'll see some
notable differences here from the
reference diagram, in that we're not just
using Backstage but we're also using
LeanIX. One is more developer facing
while the other is more business facing.
Also, you'll notice that we're not
just using Terraform, we're also using
Helm, and I'll talk a little
bit about how we do one versus the
other and the different roles that they
play. You'll also see that on the
resource plane we're using
EKS, but we're also using Cilium and
Apigee. I'll also talk a little bit
about that, but just at a high level,
we like Cilium because we are a
satellite-based company and
satellites are in space, so it's a high
latency environment. So any kind of
performance gain we can get at the
networking level is much appreciated, of
which Cilium has a couple with its
Bandwidth Manager and its Maglev
load balancing.
And I'll speak a little bit more to how
we're using Apigee too once we get there.
So, kind of going into what I mean by
a silver bullet. Most folks here
are probably familiar with this, but
"silver bullet" comes from folklore
where big scary creatures that are
hard to defeat, like werewolves and
vampires, can't be defeated unless
you hit them with this very
particular thing, the magic silver
bullet. In this case, what I'm
referring to as the big scary thing
is building a successful IDP at
your organization. And the silver bullet
that I'm trying to argue against is the
notion that just understanding and
knowing what tech stack you're going to
pick, and even potentially having already
hosted pieces of that tech stack, is
going to be the bulk of the work. In
reality, there are
complexities you're going to face beyond
that.
So jumping into the first complexity, which is
service mesh choices.
So as I mentioned, I built an
application platform at Viasat, and
let me make a note of that: the IDP
itself facilitates us building an
application platform, because the
application platform is kind of a
paved road to deploy apps using all of
those tools that comprise the IDP. So
we built one out and we originally
planned this to use Istio, and
we're using Istio. We have like 100
services on there, so we do have some
experience in managing this. The
idea behind that decision was
that it had a lot of advanced routing
capabilities. But it turned out, for
us at least, and maybe it's just our
particular situation, that for 90% of our
use cases we did not need to look into
any of those
advanced routing features. And then
we also were running Cilium for the
performance benefits that I mentioned
earlier. And we kind of asked ourselves:
can we get away without Istio in this
situation? Can we get the features
of the service mesh through Cilium
alone? We kind of broke things up
into the bare requirements of what
we wanted, which was ingress, pod-to-pod
encryption, telemetry, mesh visualization,
and auth. And we tried to take a look at,
okay, how do we implement these with
Cilium? And this is what we ended up
finding. So for us, we were using the
Gateway API with the HTTPRoute and
Gateway resources, which themselves
are pretty role-delineated, so it's a
pretty nice abstraction. I know Istio
offers this as well, and I know Istio
also has an ambient mesh mode. We
unfortunately haven't had a chance to
explore those options yet. But since we
were already using Cilium and looking
to minimize our stack,
we went down this avenue of
using the Gateway API.
Also, for pod-to-pod encryption,
Cilium supports WireGuard, which was
pretty easy for us to set up. And for
telemetry, one of the key
value adds we found using Istio is just that
L7 telemetry, even as simple as:
okay, this is the status code, these are the
latencies for these services. We were
able to get that also through Hubble
metrics. I don't think that's enabled
by default, but it's pretty
straightforward to enable, and we were
able to get the same dashboards that we
were generating
through Istio with
Cilium there.
And then mesh visualization. So, Istio
offers this really cool dashboard called
Kiali where you can go in and kind of see
what services talk to what other
services. And we found there's no
feature parity here, because
Hubble, at least the one that we were using, is
pretty minimal, but for 90% of our
use cases it was more than fine. We
were able to kind of see those
dependencies and debug issues.
The one place where we actually hit some
problems is with auth*. This is
basically a catchall for authN and
authZ. In Istio we were using the request
authentication and
authorization policy resources, and we didn't
find an equivalent yet in Cilium with
the Gateway API. That's actively
changing; that's under development.
But this is where we actually needed to
leverage Google Apigee to be able to
do that work for us, and we're
actively planning that out now as well.
Additionally, we were at some point
looking to do a cluster-to-cluster
mesh to support higher levels
of reliability. We know Istio has the
multicluster feature, and we've been
reading the docs for
Cilium Cluster Mesh to enable that as
well. So all in all, we found that at least
for like 90% of our use cases we are
able to get by with just using Cilium
alone, which for us was nice because it
reduced our tech stack.
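
As a compact recap of the comparison above, here is a minimal Python sketch that records each bare requirement next to how the talk describes covering it. The `Requirement` class and `gaps` helper are hypothetical names made up for this illustration, and the mutual TLS entry on the Istio side is general Istio knowledge rather than something stated in the talk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Requirement:
    name: str
    istio: str                      # what was used on the Istio side
    cilium: Optional[str]           # Cilium-side replacement, None if none was found
    fallback: Optional[str] = None  # tool used when Cilium alone was not enough


# Entries summarize the talk; they are not an exhaustive feature comparison.
REQUIREMENTS = [
    Requirement("ingress", "Gateway API", "Gateway API (Gateway + HTTPRoute)"),
    Requirement("pod-to-pod encryption", "mutual TLS", "WireGuard"),
    Requirement("telemetry", "L7 metrics dashboards", "Hubble metrics"),
    Requirement("mesh visualization", "Kiali", "Hubble (more minimal)"),
    Requirement("auth", "request authentication and authorization policies",
                None, fallback="Google Apigee"),
]


def gaps(requirements: List[Requirement]) -> List[str]:
    """Return the requirements that Cilium alone did not cover."""
    return [r.name for r in requirements if r.cilium is None]


if __name__ == "__main__":
    for r in REQUIREMENTS:
        covered_by = r.cilium or f"not covered, fall back to {r.fallback}"
        print(f"{r.name}: {covered_by}")
    print("gaps:", gaps(REQUIREMENTS))
```

Writing the requirements down this way makes the one real gap (auth) explicit before committing to dropping a tool from the stack.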
So going into a little bit more
detail about how exactly we're
doing this. What we did is we
broke things up into two pieces,
kind of a common pattern. There are
pieces that the IDP admins, the
operators maintaining the platform, own,
and then there are pieces that the users
of the platform use. And we found that
making sure that the administrator
resources were managed by Terraform
offered us a higher level of stability
and reliability, particularly because
Terraform offers the plan step, so we
could review things before they went in.
This also includes the Gateway resource.
Also note these are not the official
icons for the Gateway and the HTTPRoute
resource. I put a shout-out to who
proposed them. So thank you.
So for the IDP users, everything
they're doing is inside of a Helm
abstraction. We wrote a custom Helm
chart that people use to deploy their
services, and that itself is just a
minimal set of things that they need to
create: their deployment objects, network
policies, those routes, etc. This was
nice. You know, it doesn't have
the plan step necessarily, but at
that level and with multiple
environments we found this stable enough
to offer us a good service. And these are not
system-level components, so it's
not like one bad deployment would bring
down the entire system.
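
To make the split concrete, here is a rough Python sketch of the kind of HTTPRoute a user-facing chart could render, attached to a Gateway that the platform admins own. It is not Viasat's chart: the function name and every value passed to it (service, namespace, hostname, gateway name) are hypothetical. Only the manifest fields themselves follow the Kubernetes Gateway API.

```python
import json


def http_route(service: str, namespace: str, hostname: str, port: int,
               gateway_name: str, gateway_namespace: str) -> dict:
    """Build an HTTPRoute that sends all traffic for `hostname` to `service`."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": service, "namespace": namespace},
        "spec": {
            # The Gateway is owned by the platform admins; the route only refers to it.
            "parentRefs": [{"name": gateway_name, "namespace": gateway_namespace}],
            "hostnames": [hostname],
            "rules": [
                {
                    "matches": [{"path": {"type": "PathPrefix", "value": "/"}}],
                    "backendRefs": [{"name": service, "port": port}],
                }
            ],
        },
    }


if __name__ == "__main__":
    # All values below are made up for illustration.
    route = http_route("checkout", "team-store", "checkout.example.com", 8080,
                       gateway_name="shared-gateway", gateway_namespace="platform")
    print(json.dumps(route, indent=2))
```

The user only supplies a handful of values, while the Gateway itself stays behind a reviewed change on the admin side.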
What you see on my right are
specific configurations that
took us a little while to discover.
The tooling enables what we wanted to
do. We just found we needed to
tweak a couple of knobs in order to get it
to actually work in the way that we
wanted. Particularly those first two
there, like the trusted number of hops.
In our case, the way we were using
Cilium, we wanted it to preserve the
source IP. And we needed to enable these
on the Cilium Helm chart in order
to preserve the source IP and write
network policies that allowed us to do
blocking on certain IPs. And then the
last two are additional things that
we needed for this kind of
system, particularly because
there's no operator here
creating the load balancers. We're
actually orchestrating them directly
through Terraform.
That took us some time to figure out, so
I hope that saves you guys some time
too. On to the next complexity
of building an IDP,
the notion of internal shared
ownership. In this case, even though
you have a system that's self-service,
you could accidentally end up having
assumed ops expectations, which
basically makes you as the
platform owners a de facto SRE for
these services. Having built the
first platform at Viasat, we took some
lessons learned there and we're applying
them to the next platform, including
shipping monitoring and logging,
specifically with paging, by default.
This really helps establish that sense
of ownership, since the issues go to the
respective parties that actually have
the ability to resolve them. Also
providing holistic cost visibility. In
our case, for cost savings, we're
hosting things in a multi-tenant
cluster, so it can kind of be opaque
unless there's something there to look
at the cost of what's happening within
that cluster on a per-namespace or
per-pod level. And then offering rich
self-service docs or onboarding,
particularly around debugging, so
customers can resolve their own issues.
So what you see here on the left is
a quick snapshot of Kubecost, which
is the tool that we're using to be able
to peek into Kubernetes and see the
cloud spend per namespace. We
actually run this in a mode where we can
see this across multiple clusters, and
with labeling we can aggregate
things not just on a per-namespace level
but per specific products and services,
which we found very helpful.
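
The per-namespace versus per-product roll-up can be pictured with a small Python sketch. It does not use any cost tool's API: the records, the `product` label, and the dollar figures are all made up for illustration, and a real setup would read allocation data from whatever cost tool is in place.

```python
from collections import defaultdict

# Made-up allocation records spanning two clusters (illustration only).
ALLOCATIONS = [
    {"cluster": "cluster-a", "namespace": "checkout", "labels": {"product": "store"}, "cost_usd": 10.0},
    {"cluster": "cluster-b", "namespace": "checkout", "labels": {"product": "store"}, "cost_usd": 5.0},
    {"cluster": "cluster-a", "namespace": "search", "labels": {"product": "store"}, "cost_usd": 2.5},
    {"cluster": "cluster-b", "namespace": "billing", "labels": {}, "cost_usd": 4.0},
]


def total_by_namespace(allocations):
    """Sum cost per namespace across every cluster."""
    totals = defaultdict(float)
    for item in allocations:
        totals[item["namespace"]] += item["cost_usd"]
    return dict(totals)


def total_by_label(allocations, label):
    """Sum cost per value of `label`; workloads without it land in 'unlabeled'."""
    totals = defaultdict(float)
    for item in allocations:
        totals[item["labels"].get(label, "unlabeled")] += item["cost_usd"]
    return dict(totals)


if __name__ == "__main__":
    print(total_by_namespace(ALLOCATIONS))
    print(total_by_label(ALLOCATIONS, "product"))
```

The "unlabeled" bucket is a deliberate choice in this sketch: it surfaces spend that nobody has claimed yet, which is the opaque part of a multi-tenant cluster.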
On the right, you'll see that alerting
abstraction. Now, yes, you could have
your developers write the Prometheus
rules. You can have them fiddle with
their Alertmanager configs, but those
all act as friction. And basically, this
abstraction, or some kind of abstraction,
will help them set that up
faster, so that they can feel like
they have ownership over their
application sooner. We found that
particularly helpful.
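
The slide itself is not reproduced in the transcript, so the following is only a guess at the general shape of such an abstraction: a developer supplies a few values and the platform expands them into a Prometheus alerting rule. The function name, the `http_requests_total` metric, the label names, and the thresholds are all assumptions for illustration. The output follows the standard Prometheus rule-file layout (`groups`, `rules`, `alert`, `expr`, `for`, `labels`, `annotations`).

```python
import json


def error_rate_alert(app: str, namespace: str, team: str,
                     threshold: float = 0.05, window: str = "5m") -> dict:
    """Expand a small spec into a Prometheus rule group for a high 5xx rate."""
    selector = f'namespace="{namespace}",app="{app}"'
    # `http_requests_total` is a hypothetical metric name; substitute your own.
    expr = (
        f'sum(rate(http_requests_total{{{selector},status=~"5.."}}[{window}]))'
        f" / sum(rate(http_requests_total{{{selector}}}[{window}]))"
        f" > {threshold}"
    )
    return {
        "groups": [
            {
                "name": f"{app}-alerts",
                "rules": [
                    {
                        "alert": "HighErrorRate",
                        "expr": expr,
                        "for": "10m",
                        # The `team` label is what paging could be routed on, so the
                        # alert reaches the service owners instead of the platform team.
                        "labels": {"severity": "page", "team": team, "app": app},
                        "annotations": {
                            "summary": f"{app} 5xx ratio above {threshold} over {window}",
                        },
                    }
                ],
            }
        ]
    }


if __name__ == "__main__":
    print(json.dumps(error_rate_alert("checkout", "team-store", "store-team"), indent=2))
```

The developer-facing surface here is three required values, which is roughly the friction reduction the talk describes.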
And for documentation, TechDocs has
been particularly helpful for us,
mainly because, one, we can just write
Markdown, and secondly, we can use the
search functionality in there so people
can very quickly find the issues that they
have and debug them.
So another complexity that we faced was
centralization choices. You're likely
going to have to manage multiple
Kubernetes clusters, and as we saw in the
reference architecture, some of
the stuff is likely going to be deployed
in a way that needs to be in each
cluster. So you very quickly run into
the problem: okay, what do we want to
run everywhere in each cluster, and what
do we want to centralize? And the
choices that you make around
centralization have an impact on the
maintainability and cost of the
platform.
So particularly what we found helpful
for us was having a centralized Argo CD,
mainly because it provided an
IDP-wide view of all the services.
Also, having a central Argo CD means that
you reduce the resource cost of managing
and hosting multiple Argo CDs.
Although, and this might just be
through fault of our own, we do have
a manual step to register new clusters
to the central Argo CD. It's probably
something that we will automate
later. We also found we needed
to improve stability, so it wasn't just
having one Argo CD that was helpful. We actually
broke it up into two: one is a
customer Argo CD and the other is an
administrator Argo CD. The
difference here is that the
administrator Argo CD's reliability
needs to be very, very high, because
without it you won't be able to solve
the customer issues. So by breaking it
apart, we kind of also broke up the risk.
And we're using a central Kubecost to
track cost across AWS accounts and
clusters. So while we install Kubecost
tooling in all the clusters, the
primary Kubecost is running inside of
one of those tooling clusters.
And then, yeah, we also were using AMP
and we had a little bit of difficulty
with alerting on AMP. So in some cases
we actually are still running Prometheus
and Alertmanager together on the
clusters themselves.
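
The talk does not say how the manual cluster registration step would be automated. One option, sketched here in Python, is Argo CD's declarative cluster registration: a Kubernetes Secret labeled `argocd.argoproj.io/secret-type: cluster` that a pipeline could generate for each new cluster. The function name and all argument values are hypothetical, and the `awsAuthConfig` block assumes EKS with IAM role based access.

```python
import json


def argocd_cluster_secret(cluster_name: str, server_url: str, role_arn: str,
                          ca_data_b64: str, argocd_namespace: str = "argocd") -> dict:
    """Build the Secret that Argo CD treats as a registered cluster."""
    config = {
        # EKS-style auth: Argo CD assumes this IAM role to talk to the cluster.
        "awsAuthConfig": {"clusterName": cluster_name, "roleARN": role_arn},
        "tlsClientConfig": {"insecure": False, "caData": ca_data_b64},
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": f"cluster-{cluster_name}",
            "namespace": argocd_namespace,
            "labels": {"argocd.argoproj.io/secret-type": "cluster"},
        },
        "type": "Opaque",
        "stringData": {
            "name": cluster_name,
            "server": server_url,
            "config": json.dumps(config),
        },
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    secret = argocd_cluster_secret(
        "workloads-1",
        "https://workloads-1.example.com",
        "arn:aws:iam::111111111111:role/argocd-deployer",
        "BASE64_CA_PLACEHOLDER",
    )
    print(json.dumps(secret, indent=2))
```

Generating this Secret from the same pipeline that provisions the cluster would fold registration into cluster creation, at the cost of that pipeline needing write access to the Argo CD namespace.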
So on to the next complexity, which is
poor user experience resulting in
support.
So if you give someone a highly
complex tool and you give them the
richest documents you could possibly
provide and you expect them to do
self-service, you can actually end up in
a situation where this still results in
high support demands. Docs are
helpful, but your customers may still
need help in navigating that
documentation. I know there are many
things people are doing nowadays to
fix that particular issue, but we found
Backstage software templates and starter
kits alleviated a lot of these pain
points.
So what you'll see here on the left is
uh basically all of our documentation
for just getting started on the platform
like onboarding your first application.
It is something that we wrote over a
couple years and constantly modify
anytime someone's confused about
something. And as you can see, it is
long, has little snippets of code and
screenshots.
And we still get questions about this
all the time. One cool thing about
Backstage and its software templates
is that it kind of changes the user
experience, in that rather than having
them just looking at documentation and
trying to do something on their terminal
or through a UI, you offer this
wizard-like setup to basically run
automations, in this case onboarding an
application. This is useful for more than just
this particular use case, but the
neat thing is that you can take
all the relevant information that a
customer or developer needs and put it
where it's actually important, and
only ask for fields and
information as they're needed. So for
example, a lot of that doc can now be
more or less translated into these
little descriptions under each
particular field. And if customers are
not doing something, they don't need to
see like half of those fields. So it
just makes the experience much, much
cleaner, and at the end it gives you a
little summary of what ran. It's a
framework, not a solution; you
need to configure it. But we
found that having that user experience
and setting up the framework in this way
was really helpful for us.
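
For readers who have not seen a software template, here is a rough Python sketch of what such a template definition might contain, built as a dictionary that could be dumped to YAML. It is not Viasat's onboarding template: the field names, descriptions, and owner are invented. It assumes the `scaffolder.backstage.io/v1beta3` template format, the JSON Schema `dependencies` keyword for showing a field only when it is relevant, and the built-in `debug:log` action as a stand-in for real automation steps.

```python
import json


def onboarding_template() -> dict:
    """Return a minimal, hypothetical Backstage software template definition."""
    return {
        "apiVersion": "scaffolder.backstage.io/v1beta3",
        "kind": "Template",
        "metadata": {
            "name": "onboard-application",
            "title": "Onboard an application",
            "description": "Guided onboarding onto the application platform.",
        },
        "spec": {
            "owner": "platform-team",  # hypothetical owner
            "type": "service",
            "parameters": [
                {
                    "title": "Application basics",
                    "required": ["name", "exposure"],
                    "properties": {
                        "name": {
                            "title": "Application name",
                            "type": "string",
                            # Doc text moves here, next to the field it explains.
                            "description": "Lowercase name used for the namespace and route.",
                        },
                        "exposure": {
                            "title": "Exposure",
                            "type": "string",
                            "enum": ["internal", "external"],
                            "description": "Pick external only if traffic comes from outside.",
                        },
                    },
                    # Only ask for a hostname when the app is external.
                    "dependencies": {
                        "exposure": {
                            "oneOf": [
                                {"properties": {"exposure": {"enum": ["internal"]}}},
                                {
                                    "properties": {
                                        "exposure": {"enum": ["external"]},
                                        "hostname": {
                                            "title": "Public hostname",
                                            "type": "string",
                                            "description": "Hostname the route should serve.",
                                        },
                                    },
                                    "required": ["hostname"],
                                },
                            ]
                        }
                    },
                }
            ],
            "steps": [
                {
                    "id": "summary",
                    "name": "Summarize request",
                    "action": "debug:log",
                    "input": {"message": "Onboarding ${{ parameters.name }}"},
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(onboarding_template(), indent=2))
```

A real template would replace the logging step with actions that create repositories or open pull requests, which is the part that needs configuring per organization.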
So for getting started: if you
look at that reference diagram, depending
on where your company is and how much of
that stuff you're currently supporting,
it could be a particularly big
undertaking, and you could also be at
risk of having slow returns, which in and
of itself results in diminished trust,
which is not good. So for us, we found that
getting started by shoring up some core
services like Argo CD, GitHub Actions
runners, and EKS provisioning helped
us build the foundation for
the next pieces: Terraform
and Terraform infrastructure pipelines,
which are our version of a platform
orchestrator, then Backstage, and eventually
the application platform that I've been
referring to in the talk. Also, having
clear communications around time savings
and tracking the developer experience
along the way.
So what you'll see here in the reference
diagram is in our use case we happen to
have uh we were able to pull investments
into some of those core technologies
that I've highlighted here on the box on
the right and that helped us basically
set up that foundation for the next
piece and eventually the application
platform.
Also, what you see in the upper right-hand
corner is a diagram that
we've used a couple of times to
help communicate the value add of a
platform. Again, it doesn't generate
money directly. It increases
productivity and improves developer
satisfaction. Just being able to
show the time savings and the number of
things that you don't have to do,
we found, was very helpful not only when
talking to stakeholders but also when talking
to developers.
And then clearly tracking the developer
experience. In our case at Viasat, we use
DX to do this. Basically,
surveys go out, and then we
aggregate that data and it helps us
track where we are in terms of
developer satisfaction with our tooling.
So some takeaways. Building an IDP is
more than just hosting the IDP services.
Even if you're buying off-the-shelf
solutions, there are likely gaps between
what the solutions offer and what your
organization needs. There are
considerations that are going to be
custom-tailored to those needs. And the
end goal of the IDP is to improve the
developer experience, and developers are
people. They may have preferences for
certain tools and subject matter
expertise in certain tools, and that's
something that we have to actively keep
in mind when building out IDPs.
So quick shout-out to my colleagues at
Viasat,
Nano Banana for the awesome
[laughter] images, the CSUP communities,
and Platform Engineering Day for
giving me the opportunity to speak. And
if you need to reach me, you can
connect with me at readthinkhack.org
or just check that QR code there.
Cool. Thank you. See you guys around.
[applause]
>> So, you had something on one of your slides about
the developer experience survey. Is
that something that's built into your
IDP, or are you doing like SurveyMonkey or
something separate, offline, for that?
>> I'm just
checking this. Okay, cool. Um, yeah. No,
we're using a tool called DX. Um, it's a
separate tool that is basically built
around tracking. And I'm not an expert
at DX. I only use the survey results
that we get back for the tooling, but it
sends people in your organization
routine surveys that they answer, and it
infers from that and generates
data that you can look at in terms of
the feedback. We use that tool not
just in this case. We use it for many
other tools, and it has a lot of
plugins, connectors, whatever you want to
call them, to a lot of data sources, and can
generate a lot of cool reports for your
organization. Yeah, we found it
helpful in that regard at least.
>> All right, we got a question over here
in this corner.
>> Um, you're hosting this totally in the
cloud on AWS?
>> Oh, sorry. What was that?
>> Are you hosting
this entirely in the cloud on AWS?
>> Uh, correct. Yes. Right now we're using
EKS.
>> Okay.
>> Hi. So, one of the interesting things
that you had on one of the slides was
how the IDP is meant to be built rather
than bought off the shelf.
Earlier today we saw a talk by
Cortex, I believe, about their IDP, which was
essentially saying that the issue with
Backstage is that you end up spending
countless engineering hours fine-tuning
it, and even once it's deployed you still
have to support it on and on and on. I'm
curious if this is the experience that
you have with Backstage, and why you
decided to go for Backstage rather than
another solution like Cortex that tends
to be more off-the-shelf, if you will.
>> Oh, it's a good question. I'm
not the Backstage expert at the company,
but
to speak to the difficulty of
getting adoption of Backstage itself,
we did originally host
Backstage during a first
attempt, and it didn't see that much
adoption. It wasn't until we hosted
it the second time and started
integrating it with more tooling that we
started seeing adoption.
I haven't
reviewed or looked into Cortex or the
other offerings, so this is
Viasat's first time dipping its toe into
something like Backstage. But the
software templates and TechDocs in
particular were really good features to
drive adoption for that tooling,
because everyone was having the problem
of how to make their documentation
discoverable. You had some people
hosting in MkDocs, some people
hosting on Confluence, but
everyone seems to love Markdown, and
TechDocs lets you install
different plugins, so you can
render Mermaid diagrams and
everything. So people started using that,
and then naturally it seemed like they
wanted to start using more, and software
templates was the next main tool
that they started digging into. That
was nice because it offers a really
standardized and sleek UI in front of
whatever thing you're trying to do
behind the scenes. And the way you
write it through the scaffolder, most
of the time you don't even have to
touch the actual Backstage
implementation. You only really have to
write what you want in YAML
with some custom supported actions
and do the thing that you want to do. So
yeah.
>> Um, I was really interested in your
Kubecost dashboard to
show the savings that the product makes.
I have a really similar stack
I'm working on back home. It's very tough
to just say, "Hey, I'm saving everyone X
amount of dollars." You mentioned
talking to developers and people at
the company using it. How are you
phrasing those questions or
getting that data to figure out how
much the product is really saving?
>> Okay, I heard everything up until the
second to last part. There's like these
two delayed echoes. Can you repeat that
again? Just the last part.
>> Yeah. Yeah. Yeah. Um, I'm just kind of
wondering how you are approaching
the developers, the people at the
company using the tool, to figure out
the cost savings of using the
stack that you're creating versus just
the old monolith solutions that
they were using before.
>> So for us, in our case we're
hosting this platform on a couple of AWS
accounts, but mainly there are centralized
AWS accounts that the platform engineers
own, and to just go in there and
look at Cost Explorer doesn't give
you a breakdown. I
know this is changing now, since EKS has
some level of cost visibility into
namespaces, but we had the problem
where the developers would build things
and then not really track their cost
very much, and it would kind of fall on us,
and we'd have to report back to them and
be like, "Hey, you guys are spending a
lot." So Kubecost helps because, one, it
helps centralize things. Second, you can
generate reports, like in PDF or whatnot,
or even just links to pages, and
they can see their cost across different
AWS accounts for their Kubernetes
clusters. It also tracks out-of-cluster
assets. So for example, if
part of their application needs to
use an RDS database, it can
aggregate that cost as well. So
the customer or the developer
can see the full cost of their
applications, not just the Kubernetes
cost.
>> Okay. So you're tracking cost mostly on
like a
>> cloud resource,
>> right?
>> We're tracking cost of the cloud
resources. There's other features in
there too which I personally haven't
delved into too much. like it gives you
like hints for what to optimize. Like it
tells you like I don't know if you saw
there on the slide but it'll tell you
how much of your cost is actually idle
cost which is helpful because it will
kind of more or less hint at what is
overprovisioned.
Um yeah is that do you have any other
specific questions or
>> uh no between the 500 gigabyte gly
instances and how much time I'm saving a
developer for using the product. Uh I
think you did a great job. Thank you.
Okay, cool. We can chat after.
>> Thank you, Max.
>> [applause]
There Is No Silver Bullet: The Complexities of Building IDPs - Max Espinoza, Viasat, Inc.
The allure of an internal developer platform (IDP) is tantalizing, isn't it? You've watched the talks, and you know why having one would make life better for yourself, your team, and your customers. However, using a handful of open-source software (OSS) or Cloud Native Computing Foundation (CNCF) projects in your organization isn't a magic silver bullet. When building an IDP, a lot of factors come into play.
* Do you set up a centralized approach for syncing configurations across dozens of clusters?
* Is a service mesh truly necessary?
* How can you make the most of shared clusters while not becoming the system admin keeping everything together?
* How do you expose services internally and externally in a way that prevents disjoint dev teams from introducing network misconfigurations?
This talk won't prescribe solutions. Instead, it will draw on experiences from building platforms at Viasat, highlighting often-missed considerations for those just starting their platform-building journey.