There Is No Silver Bullet: The Complexities of Building IDPs - Max Espinoza, Viasat, Inc. | DailyDevLists

Full Transcript

Thank you. All right. Hi everyone.

All right. Awesome.

So, let's just jump to it. Uh, so a

little bit about me. Well, name's Max

Espinoza. I'm a senior platform

architect at Viasat, which is a

telecommunications company that's

satellite based. If you guys have heard

of us, it's probably from residential

services or airline flights. Um, I have

nine years in the industry. Originally a

brief stint as a Python application

developer and then got sucked into the

cloud and never crawled my way back out.

Um I built an internal developer plat an

internal application platform and I'll

mention the nuance between that and what

the IDPs are in a little bit and I'm

building another one now, also at Viasat.

I'm a history buff and we're lately

trying to get into watercolor

particularly urban sketching. Uh it's

interesting. You can be kind of bad at

it and it fits the aesthetic. But

anyways, that's not why you're here. So

this talk is there is no silver bullet.

The complexities of building an internal

developer platform. So I'll go into what

an IDP is like at a very high level.

There's better talks that go into that

in more detail. I'll talk about a real

world IDP that we're building out at

Viasat right now. And then I'll get into

the meat of the topic which is the

complexities that we've faced in doing

this, including service mesh choices,

internal shared ownership,

centralization choices,

um issues with poor UX resulting in more

support and then a couple points about

getting started and I'll end it with

some takeaways.

So what you're looking at here is a

reference diagram. Uh thanks to the good

folks at the platform engineering group,

they published actually a couple of

these in different formats for different

uh cloud providers. Some of them even

like multicloud. But anyways, you can

pick up one of these and it's a really

good starting point to understanding and

implementing your own IDP in your

organization. Uh what you'll see here

and I hope it's clear enough in in the

view over here is that it's broken up

into several planes. The first plane

being the developer control plane. This

is where I think most of the devs spend

most of the time here. This comprises of

something like Backstage or a tool that

acts as like a service catalog. And this

comprises of your source control

mechanisms. Um, GitHub being one of the

most popular ones. And also where your

source code for your application lives.

This is the business logic and then the

infrastructure as code for the things

that are needed to support that business

logic. So usually this takes place as

something Terraform related.

And everything that goes on a developer

control plane basically acts as inputs

into the integration and delivery plane

which it it of itself is supposed to

take that and push it out into the

resource plane. So the resource plane

being uh where your services are

running. We're at a KubeCon, so probably

Kubernetes and all the other various

supporting resources for that. And that

integration and delivery plane usually

is the CI/CD pipeline and some form of

platform orchestrator. That's the thing

that's taking that platform code and

actually pushing it out because you

can't have everyone running Terraform on

their laptop. And then the monitoring

and logging plane and the security plane

um of which there's a plethora of tools

for there for you to pick from. So, you

would think looking at this, okay, cool.

I have an understanding of what an IDP

is. Um, I'm just missing one really,

really essential component.

uh namely BIM and sorry [laughter] and

and and

you call this a day and you want to ship

this out to people and after your dev

team mutinies on you and put someone

wiser in your place uh they take this

back and realize hey actually in order

for you to need an IDP you need to reach

a certain level of scale so you already

have pre-existing tools at your

organization you already have some

patterns of working with you likely

already have like pseudo IDPs lying

around. So you need to figure out how to

basically meet the organization where

they are. And this is kind of our take

at it at Viasat. You'll see here we took

what was the reference architecture and

we added tools that we've already built

a lot of experience in and already have

integrations for. Um you'll see some

notable differences here from the

reference diagram in that we're not just

using Backstage, but we're also using

LeanIX. One is more developer facing

while the other is more business facing.

Uh also you'll notice that we're not

just using Terraform, we're also using

Helm, and I'll talk about this a little

bit about like how we do one versus the

other and the different roles that they

play. uh you'll also see that we're

using on the resource plane well using

EKS, but we're also using Cilium and

Apigee, and I'll also talk a little bit

about that but just at a high level uh

we like Cilium because we are a

satellite based uh company and

satellites are in space so it's a high

latency environment so any kind of

performance gains we can get from the

networking level is much appreciated of

which selium has a couple with its uh

bandwidth managers and its mag lev uh

load balancing

and I'll speak a little bit more to how

we're using Apoge 2 once we get there.

So, kind of going into what do I mean by

a silver bullet? Um, mo most folks here

are probably familiar with this, but

Silver Bullet comes from this folklore

where these big scary creatures that are

hard to defeat like werewolves and

vampires can't be defeated unless like

you hit them with this very

particular thing, the magic silver

bullet. In this case, what I'm

referring to as this big scary thing

is building a successful IDP at

your organization? And the silver bullet

that I'm trying to argue against is the

notion that just understanding and

knowing what tech stack you're going to

pick and even potentially having already

hosted pieces of that tech stack is

going to be the bulk of the work. In

reality, there are

complexities you're going to face beyond

that.

So jumping into the first complexity is

service mesh choices.

So as I mentioned uh I built an

application platform at Viasat, and

let me make note of that, that the IDP

itself facilitates us building an

application platform because the

application platform it's kind of a

paved road to deploy apps using all of

those tools that comprise of the IDP. So

we built one out and we originally

planned this to use Istio, and we are,

we're using Istio. We have like a hundred

services on there. So we do have some

experience in managing this, and

the idea behind that decision was

that it had a lot of advanced routing

capabilities. Um but it turned out for

us at least and maybe it's just our

particular situation but for 90% of our

use cases we did not need to look into

any of those uh advanced routing cases

or advanced routing features. And then

we also were running Cilium for the

performance benefits that I mentioned

earlier. And we kind of asked ourselves,

can we get away without Istio here in this

situation? Like, can we get the features

of the service mesh through Cilium

alone? And we kind of broke things up

into like the bare requirements of what

we wanted, which was ingress, pod-to-pod

encryption, telemetry, mesh visualization,

and auth. And we tried to take a look at,

okay, how do we implement these with

Cilium, and this is what we ended up

finding. So for us we were using the

Gateway API with the HTTPRoute and

Gateway resources, which they themselves

are pretty role-delineated, so it's a

pretty nice abstraction. I know Istio

offers this as well and I know Istio

also has an ambient mesh mode. We

unfortunately haven't had a chance to

explore those options yet. But since we

were already using Cilium and looking

to minimize our stack uh we were

basically we went down this avenue of

using the Gateway API.
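To make the Gateway and HTTPRoute split concrete, here is a minimal sketch of the two resources built as plain Python dictionaries. It is an illustration, not the manifests from the talk: the names `shared-gateway`, `platform-ingress`, `team-shop`, the hostname, and the `cilium` gateway class name are all assumptions, and the helper functions `gateway` and `http_route` are hypothetical.

```python
import json


def gateway(name, namespace, hostname):
    """Platform-owned Gateway with one HTTP listener open to routes from any namespace."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "Gateway",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Assumption: the GatewayClass installed by the CNI is named "cilium".
            "gatewayClassName": "cilium",
            "listeners": [
                {
                    "name": "http",
                    "protocol": "HTTP",
                    "port": 80,
                    "hostname": hostname,
                    "allowedRoutes": {"namespaces": {"from": "All"}},
                }
            ],
        },
    }


def http_route(name, namespace, gateway_name, gateway_namespace, service, port, path_prefix="/"):
    """Team-owned HTTPRoute that attaches to the shared Gateway and forwards to a Service."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "HTTPRoute",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "parentRefs": [{"name": gateway_name, "namespace": gateway_namespace}],
            "rules": [
                {
                    "matches": [{"path": {"type": "PathPrefix", "value": path_prefix}}],
                    "backendRefs": [{"name": service, "port": port}],
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    gw = gateway("shared-gateway", "platform-ingress", "*.apps.example.internal")
    route = http_route("checkout", "team-shop", "shared-gateway", "platform-ingress", "checkout", 8080)
    print(json.dumps([gw, route], indent=2))
```

The point of the split is ownership: the Gateway can sit with the platform operators while each team only writes the small HTTPRoute that references it.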

Also, uh, for pod-to-pod encryption, uh,

Cilium supports WireGuard, which was

pretty easy for us to set up and for

telemetry and this is one of the key

value adds we found using Istio is just that

L7 telemetry like even as simple as like

okay this is the status code this is the

latencies for these services um we were

able to get that also through Hubble

metrics. So I don't think that's enabled

by default but it's pretty

straightforward to enable and we were

able to get the same dashboards that we

were generating on our end through

Istio, with

Cilium there.

And then mesh visualization. So, uh, Istio

offers this really cool dashboard called

Kiali where you can go in and kind of see

what services talk to what other

services. And we found there's no

feature parity here, because Hubble,

at least the one that we were using, is

pretty minimal, but for 90% of our

use cases, it was more than fine. We

were able to kind of see those

dependencies and debug issues.

The one place where we actually hit some

problems is with auth*. Um, this is

basically a catchall for authN and

authZ. So in Istio we were using the request

authentication and the request

authorization policy here. And we didn't

find an equivalent yet in Cilium with

the Gateway API. That's actively

changing, that's under development. Um,

but this is where we actually needed to

leverage Google Apigee, uh, to be able to

do that work for us and we're kind of

actively planning that out now as well.

Additionally, we were at some point

looking to do a cluster-to-cluster

mesh to support higher levels

of reliability. And we know Istio has the

multicluster feature, and we'd been

reading the docs for the

Cilium Cluster Mesh to enable that as

well. So all in all we found at least

for like 90% of our use cases we are

able to get by with just using Cilium

alone which for us was nice because it

reduced our tech stack.
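As a rough summary of the mapping just described, the sketch below pairs each requirement with Cilium Helm values that, to my understanding of the chart, turn the corresponding feature on. This is not the configuration shown in the talk, the value names should be checked against the chart version in use, and `REQUIREMENT_TO_SETTINGS` and `flatten` are hypothetical helpers.

```python
# Helm values for the Cilium chart, written as a Python dict.
# Assumption: these key names match the chart version you run. Gateway API
# support may also need kube-proxy replacement and the Gateway API CRDs installed.
CILIUM_VALUES = {
    "gatewayAPI": {"enabled": True},                        # ingress
    "encryption": {"enabled": True, "type": "wireguard"},   # pod-to-pod encryption
    "hubble": {
        "enabled": True,
        "metrics": {"enabled": ["httpV2", "flow", "drop"]},  # telemetry, including L7 HTTP
        "relay": {"enabled": True},
        "ui": {"enabled": True},                             # mesh visualization
    },
    "bandwidthManager": {"enabled": True},                   # performance features
    "loadBalancer": {"algorithm": "maglev"},                 # mentioned earlier in the talk
}

# Requirement list from the talk mapped to dotted Helm keys.
# Auth has no entry: the talk handles it outside Cilium (via Apigee).
REQUIREMENT_TO_SETTINGS = {
    "ingress": ["gatewayAPI.enabled"],
    "pod-to-pod encryption": ["encryption.enabled", "encryption.type"],
    "telemetry": ["hubble.enabled", "hubble.metrics.enabled"],
    "mesh visualization": ["hubble.relay.enabled", "hubble.ui.enabled"],
    "auth": [],
}


def flatten(values, prefix=""):
    """Turn nested Helm values into a flat {dotted.key: value} mapping."""
    flat = {}
    for key, value in values.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


if __name__ == "__main__":
    flat = flatten(CILIUM_VALUES)
    for requirement, keys in REQUIREMENT_TO_SETTINGS.items():
        covered = {k: flat[k] for k in keys} or "handled outside Cilium"
        print(f"{requirement}: {covered}")
```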

So going in a little bit into more

detail about like how exactly we're

doing this. So what we're doing is we

broke things up into two pieces like

kind of a common pattern. There are

pieces that the IDP admins like the

operators maintaining the platform own

and then there are pieces that the users

of the platform use. And we found that

making sure that the administrator

resources were managed by Terraform

offered us a higher level of stability

and reliability particularly because

Terraform offers the the plan step so we

could review things before they went in.

This also includes the gateway resource.

Also note these are not the official

icons for the Gateway and the HTTPRoute

resources. I put a shout out to who

proposed them. So thank you.

Um so for the IDP users uh everything

they're doing is inside of a Helm

abstraction. So we write a custom Helm

chart that people use to deploy their

services and that itself is just a

minimal set of things that they need to

create their deployment objects, network

policies, those routes etc. Um this was

nice because we you know it doesn't have

the plan step necessarily. Uh but at

that level and with multiple

environments we found this stable enough

to offer us a good service and it's not

like system level components. So it's

not like one bad deployment would bring

down the entire system.
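The user-facing side of that split can be pictured as a small set of values expanded into Kubernetes objects. The sketch below imitates what a thin Helm chart does, in Python, and is not the chart from the talk: the `render` function, the value keys (`name`, `image`, `port`, `allowFromNamespaces`), and the example names are all hypothetical. How gateway traffic shows up to a network policy depends on the CNI, so the namespace selector is only an illustrative choice.

```python
def render(values):
    """Expand a minimal values dict into a Deployment and a NetworkPolicy."""
    name = values["name"]
    namespace = values["namespace"]
    port = values["port"]
    labels = {"app.kubernetes.io/name": name}

    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": namespace, "labels": labels},
        "spec": {
            "replicas": values.get("replicas", 2),
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": values["image"],
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }

    # With no allowed namespaces the ingress rule list stays empty, which denies
    # all ingress. An empty "from" list would instead allow every source.
    sources = values.get("allowFromNamespaces", [])
    ingress_rules = []
    if sources:
        ingress_rules.append(
            {
                "from": [
                    {"namespaceSelector": {"matchLabels": {"kubernetes.io/metadata.name": ns}}}
                    for ns in sources
                ],
                "ports": [{"protocol": "TCP", "port": port}],
            }
        )

    network_policy = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"{name}-ingress", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": labels},
            "policyTypes": ["Ingress"],
            "ingress": ingress_rules,
        },
    }
    return [deployment, network_policy]
```

A team would then only supply something like `{"name": "checkout", "namespace": "team-shop", "image": "registry.example.internal/checkout:1.0.0", "port": 8080}`, and the platform-owned template decides everything else.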

Uh what you see on my right is uh

particular specific configurations that

took us a little while to discover. Um

the tooling enables what we wanted to

do. Um, we just found uh we needed to

tweak a couple knobs in order to get it

to actually work in the way that we

wanted. Particularly those first two

there, like the trusted number of hops.

In our case, the way we were using

Cilium, we wanted it to preserve the

source IP. And we needed to enable these

on the Cilium Helm chart in order for

us to preserve the source IP to write

network policies that allowed us to do

blocking on certain IPs. And then the

last two are also additional things that

we needed for this for this kind of

system, particularly because

there's no operator here

creating the load balancers. We're

actually orchestrating them directly

through Terraform.

So that took us some time to figure out. So

I hope that saves you guys some time

too. Um onto the next complexity of the

basically challenges of building an IDP

is the notion of internal shared

ownership. So in this case even though

you have a system that's self-service

you could accidentally end up having

assumed ops expectations which

accidentally basically makes you as the

platform owners a de facto SRE for

these services. So having built the

first platform at Viasat, we took some

lessons learned there and we're applying

them to the next platform. Um including

shipping, monitoring and logging,

specifically with paging by default. Um

this really helps establish that sense

of ownership if the issues go to the

respective parties that actually have

the ability to resolve the issues. Also

providing holistic cost visibility. In

our case, for cost savings, we're

hosting things in a multi-tenant uh

cluster. So, it can kind of be opaque

unless there's something there to look

at the cost of what's happening within

that cluster on a per namespace or per

pod level. And then offering rich

self-service docs or onboarding, uh,

particularly around debugging so

customers can resolve their own issues.

Uh so what you see here is on the left

um, a quick snapshot of Kubecost, which

is the tool that we're using to be able

to peek into Kubernetes and see the

cloud spend per namespace. Uh we we

actually run this in a mode where we can

see this across multiple clusters and we

can with labeling be able to aggregate

things not just on a per namespace level

but per specific products and services

which we found very helpful.
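The label-based roll-up described here can be sketched in a few lines. The example does not call Kubecost; it only shows the aggregation idea on hand-written records, and the record shape, the `product` label key, and the function name `cost_by_product` are assumptions made for illustration.

```python
from collections import defaultdict


def cost_by_product(allocations, namespace_labels, label_key="product"):
    """Sum per-namespace cost rows from several clusters into per-product totals.

    allocations: list of {"cluster": str, "namespace": str, "cost_usd": float}
    namespace_labels: {(cluster, namespace): {label: value}}
    Namespaces without the label are grouped under "unlabeled" so they stay visible.
    """
    totals = defaultdict(float)
    for row in allocations:
        labels = namespace_labels.get((row["cluster"], row["namespace"]), {})
        product = labels.get(label_key, "unlabeled")
        totals[product] += row["cost_usd"]
    return dict(totals)


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    allocations = [
        {"cluster": "prod-a", "namespace": "checkout", "cost_usd": 10.0},
        {"cluster": "prod-b", "namespace": "checkout", "cost_usd": 5.5},
        {"cluster": "prod-a", "namespace": "search", "cost_usd": 2.0},
        {"cluster": "prod-a", "namespace": "scratch", "cost_usd": 1.0},
    ]
    namespace_labels = {
        ("prod-a", "checkout"): {"product": "shop"},
        ("prod-b", "checkout"): {"product": "shop"},
        ("prod-a", "search"): {"product": "discovery"},
    }
    print(cost_by_product(allocations, namespace_labels))
```

Keeping an explicit "unlabeled" bucket is a design choice: it surfaces spend that nobody has claimed yet instead of silently dropping it.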

On the right, you'll see that alerting

abstraction. Now, yes, you could have

your developers write the Prometheus

rules. You can have them fidget with

their Alertmanager configs, but those

all act as friction. And basically, this

abstraction or some kind of abstraction

will help them be able to set that up

faster so that way they can feel like

they have ownership over their

application faster. And we found that

particularly helpful.
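One way to picture such an alerting abstraction is a function that takes a handful of inputs and emits a full PrometheusRule object, with paging as the default. This is a sketch under assumptions rather than the abstraction shown on the slide: the metric name `http_requests_total`, its `code` label, the `severity` and `team` routing labels, and the function name `error_rate_alert` are all hypothetical.

```python
def error_rate_alert(service, namespace, team, error_ratio=0.05, page=True):
    """Build a PrometheusRule that fires when the 5xx ratio stays above a threshold."""
    selector = f'namespace="{namespace}",service="{service}"'
    # Assumption: requests are counted in http_requests_total with a "code" label.
    expr = (
        f'sum(rate(http_requests_total{{{selector},code=~"5.."}}[5m]))'
        f' / sum(rate(http_requests_total{{{selector}}}[5m]))'
        f' > {error_ratio}'
    )
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": f"{service}-alerts", "namespace": namespace},
        "spec": {
            "groups": [
                {
                    "name": f"{service}.availability",
                    "rules": [
                        {
                            "alert": "HighErrorRate",
                            "expr": expr,
                            "for": "5m",
                            # Paging is the default; teams opt out, not in.
                            "labels": {
                                "severity": "page" if page else "ticket",
                                "team": team,
                            },
                            "annotations": {
                                "summary": f"{service} 5xx ratio above {error_ratio}",
                            },
                        }
                    ],
                }
            ],
        },
    }
```

With something like this, a developer supplies a service name and an owning team, and the alert routes to that team by label rather than to the platform operators.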

And for documentation, TechDocs has

been particularly helpful for us. Uh

mainly because one, we can just write

markdown and secondly, we can um use the

search functionality in there and people

can very quickly find issues that they

have and debug.

So another complexity that we faced was

centralization choices. So you're likely

going to have to manage multiple

Kubernetes clusters and as we saw in the

reference architecture, some of

the stuff is going to be deployed likely

in a way that needs to be in that

cluster. So you very quickly run into

the problem okay so what do we want to

run everywhere in each cluster and what

do we want to centralize and those

choices that you make around

centralization have impact on

maintainability and cost of the

platform.

So particularly what we found helpful

for us was having a centralized Argo CD.

Uh mainly because it provided an

IDP-wide view of all the services. Um

also having a central Argo CD means that

you reduce the resource cost of managing

and just hosting multiple Argo CDs.

Although we and this might be just

through fault of our own. Uh we do have

a manual step to register new clusters

to the central Argo CD. Um it's probably

something that we will automate

later. We also found we needed

to improve stability. So not just having

one Argo CD was helpful. We actually

broke it up into two. Uh one is a

customer Argo CD and the other is an

administrator Argo CD. And the

difference here is that that

administrator Argo CD's reliability

needs to be very very high because

without it you won't be able to solve

the customer issues. So by breaking it

apart, we kind of also broke up the risk

and we're using a central Kubecost to

track cost across AWS accounts and the

clusters. So while we install Kubecost

tooling in all the clusters, the

primary Kubecost is running inside of

one of those tooling clusters.

And then uh yeah, we also were using AMP

and we had a little bit of difficulty

with alerting on AMP. So in some cases

we actually are still running Prometheus

and Alertmanager together on the

clusters themselves.
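For the manual cluster registration step mentioned above, one possible automation is to generate the declarative cluster Secret that Argo CD reads. The sketch below is an assumption about how that could look, not what Viasat does: the function name, the `argocd` namespace, and the bearer-token credentials are illustrative, and on EKS an IAM-based auth config would likely replace the token.

```python
import json


def argocd_cluster_secret(name, server, bearer_token, ca_data_b64, argocd_namespace="argocd"):
    """Build the Secret Argo CD uses to discover a remote cluster declaratively."""
    config = {
        "bearerToken": bearer_token,
        "tlsClientConfig": {"insecure": False, "caData": ca_data_b64},
    }
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": f"cluster-{name}",
            "namespace": argocd_namespace,
            # This label is what marks the Secret as a cluster definition.
            "labels": {"argocd.argoproj.io/secret-type": "cluster"},
        },
        "type": "Opaque",
        "stringData": {
            "name": name,
            "server": server,
            "config": json.dumps(config),
        },
    }


if __name__ == "__main__":
    # Placeholder values; real credentials would come from a secret store.
    secret = argocd_cluster_secret(
        "prod-a",
        "https://prod-a.example.internal",
        "<token>",
        "<base64 CA bundle>",
    )
    print(json.dumps(secret, indent=2))
```

The same function could be called once for the customer Argo CD and once for the administrator Argo CD, each in its own namespace or cluster.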

So to the next complexity which is the

poor user experience resulting in

support.

So if you give someone a very, uh, highly

complex like tool and you give them the

richest documents you could possibly

provide and you expect them to do

self-service, you can actually end up in

a situation where this results still in

high support demands. So docs are

helpful, but your customers may still

need help in navigating that

documentation. And I know there's many

things people are doing nowadays to uh

fix that particular issue, but we found

Backstage software templates and starter

kits alleviated a lot of these pain

points.

So what you'll see here on the left is

uh basically all of our documentation

for just getting started on the platform

like onboarding your first application.

It is something that we wrote over a

couple years and constantly modify

anytime someone's confused about

something. And as you can see, it is

long, has little snippets of code and

screenshots.

And we still get questions about this

all the time. And one cool thing about

Backstage and their software templates

is that it kind of changes the user

experience in that rather than having

them just looking at a documentation and

trying to do something on their terminal

or through a UI, you kind of offer this

wizard-like setup to basically have

automations. In this case, onboarding an

application. So this is more than just

for this particular use case, but the

neat thing is that you can kind of take

all the relevant information that a

customer or developer needs and put it

where it's actually important and be

able to only ask for fields and

information as it's needed. So for

example, a lot of that doc can now be

more or less translated into these

little descriptions under each

particular field. And if customers are

not doing something, they don't need to

see like half of those fields. So it

just makes the experience much much

cleaner and at the end gives you like a

little summary of what ran. And

it's a framework. It's not a sol... like,

you need to configure it. But we

found that having that user experience

and setting up the framework in this way

was really helpful for us.
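A rough idea of what such a template contains, written as a Python dictionary in the shape of a Backstage scaffolder Template. It is a sketch, not the template from the talk: the template name, field names, descriptions, and skeleton paths are hypothetical, and the `if` condition on the second step is my understanding of how optional steps are expressed.

```python
def onboarding_template():
    """Scaffolder Template sketch: per-field help text and an optional step."""
    return {
        "apiVersion": "scaffolder.backstage.io/v1beta3",
        "kind": "Template",
        "metadata": {
            "name": "onboard-application",
            "title": "Onboard an application",
        },
        "spec": {
            "owner": "platform-team",
            "type": "service",
            # Each page of the wizard is one entry; the descriptions carry
            # what used to live in a long onboarding document.
            "parameters": [
                {
                    "title": "Service basics",
                    "required": ["name", "owner"],
                    "properties": {
                        "name": {
                            "title": "Service name",
                            "type": "string",
                            "description": "Lowercase name used for the namespace and the deployment.",
                        },
                        "owner": {
                            "title": "Owning team",
                            "type": "string",
                            "description": "Team that receives pages and cost reports for this service.",
                        },
                    },
                },
                {
                    "title": "Networking",
                    "properties": {
                        "exposeRoute": {
                            "title": "Expose an HTTP route",
                            "type": "boolean",
                            "default": False,
                            "description": "Tick only if the service must be reachable through the shared gateway.",
                        },
                    },
                },
            ],
            "steps": [
                {
                    "id": "starter-kit",
                    "name": "Render starter kit",
                    "action": "fetch:template",
                    "input": {
                        "url": "./skeleton",
                        "values": {
                            "name": "${{ parameters.name }}",
                            "owner": "${{ parameters.owner }}",
                        },
                    },
                },
                {
                    "id": "route",
                    "name": "Add HTTP route",
                    # Runs only when the user asked for a route.
                    "if": "${{ parameters.exposeRoute }}",
                    "action": "fetch:template",
                    "input": {
                        "url": "./route-skeleton",
                        "values": {"name": "${{ parameters.name }}"},
                    },
                },
            ],
        },
    }
```

Serialized to YAML, a definition like this is most of what a template author writes; the Backstage implementation itself is left untouched.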

So for getting started this like if you

look at that reference diagram depending

on where your company is and how much of

that stuff you're currently supporting

it could be a particularly big

undertaking and you could also be at

risk of having slow returns which in and

of itself results in diminished trust

which is not good. Um so for us we found

getting started by shoring up some core

services like Argo CD, GitHub Actions

runners, EKS provisioning um that helped

us kind of build the foundation for what

was the next piece which is Terraform

and Terraform infrastructure pipelines

which is our version of a platform

orchestrator, Backstage, and eventually

the application platform that I've been

referring to in the talk. Also having

clear communications around time savings

and tracking the developer experience

along the way.

So what you'll see here in the reference

diagram is in our use case we happen to

have uh we were able to pull investments

into some of those core technologies

that I've highlighted here on the box on

the right and that helped us basically

set up that foundation for the next

piece and eventually the application

platform.

Also, what you see in the upper right

hand corner is just a diagram where

we've we've used this a couple times to

help communicate the value add of a

platform. Um, again, it doesn't generate

money directly. It increases

productivity and improves developer

satisfaction. And just being able to

show the time savings and the number of

things that you don't have to do uh we

found was very helpful not only to

talking to stakeholders but also talking

to developers.

And then clearly tracking the developer

experience. In our case at Viasat, we use

DX to do this. Uh basically

surveys that go out and then we

aggregate that data and it helps us kind

of track to see where we are in terms of

developer satisfaction with our tooling.

So some takeaways, uh building an IDP is

more than just hosting the IDP services.

Um, even if you're buying off-the-shelf

solutions, there's likely gaps between

what the solutions offer and what your

organization needs. Um, there are

considerations that are going to be

custom-tailored to those needs. And the

end goal of the IDP is to improve the

developer experience and developers are

people. They may have preferences for

certain tools and subject matter

expertise in certain tools and that's

something that we have to actively keep

in mind when building out IDPs.

So quick shout out to my colleagues at

Viasat

uh Nano Banana for the awesome

[laughter] images, the CSUP communities

and Platform Engineering Day for

giving me the opportunity to speak and

uh if you need to reach me, you can

connect with me at uh readthinkhack.org

or just check that QR code there.

Cool. Thank you. See you guys around.

[applause]

So, you had on one of your slides about

the developer experience survey. Um, is

that something that's built into your

IDP, or are you doing like SurveyMonkey or

something set offline for that? I'm just

checking this. Okay, cool. Um, yeah. No,

we're using a tool called DX. Um, it's a

separate tool that is basically built

around tracking. And I I'm not an expert

at DX. Um, I only use the the surveys

for the tooling that we get back, but it

asks customers in your organization

routine surveys that they answer and it

infers from that and generates basically

data that you can look at in terms of

the feedback. So we use that tool not

just in this case. We use it for many

other tools and it has a lot of

basically

plugins, connectors, however you want to

call it to a lot of data sources and can

generate a lot of cool reports for your

organization. Yeah, we we we found it

helpful in that regard at least.

>> All right, we got a question over here

in this corner.

>> Um, you're hosting this totally in the

cloud on AWS.

>> Oh, sorry. What was that? you hosting

this entirely in the cloud on AWS?

>> Uh, correct. Yes. Right now we're using

EKS.

>> Okay.

>> Hi. So, one of the interesting things

that you had on one of the slides was

how the IDP is meant to be built rather

than buying something off the shelf. Um

earlier today we did see a talk by

Cortex I believe for their IDP which is

essentially saying that the issue with

Backstage is that you end up spending

countless engineering hours fine-tuning

it and even once it's deployed you still

have to support it on and on and on. I'm

curious if this is the experience that

you have with Backstage and why you

decided to go for Backstage rather than

another solution like Cortex that tends

to be more off-the-shelf, if you will.

Oh. Uh, it's a good question. Um, I I'm

not the Backstage expert at the company,

but

to speak to like the difficulty of like

getting adoption and Backstage itself,

we did originally actually host

Backstage and like during a first

attempt and it didn't see that much

adoption. Um it wasn't until we hosted

it the second time and started

integrating it with more tooling that we

started seeing adoption and it seemed to

be something where at least I haven't

reviewed or looked into Cortex or the

other offerings. So this is

like Viasat's first dipping its toe into

something like Backstage. But the

software templates and the TechDocs in

particular were really good features to

kind of drive adoption for that tooling

because everyone was having the problem

of how do they make their documentation

discoverable. You had some people

hosting in MkDocs, some people were

hosting on like Confluence, but, uh,

everyone seems to love Markdown and the

TechDocs lets you actually install

different plugins. So you could actually

like render Mermaid diagrams and

everything. So people started using that

and then naturally it seemed like they

wanted to start using more and software

templates was like the next main tool

that they started digging into. And that

was nice because it offers a really like

standardized and sleek UI in front of

like whatever thing you're trying to do

behind the scenes. And the the way you

write it through the scaffolder uh most

of the times you don't even have to

touch the actual like Backstage like

implementation. you only really have to

like write what you want like in a YAML

with some like custom supported actions

and do the thing that you want to do. So

yeah.

>> Um, I was really interested in your

uh, it was like a Kubecost dashboard to

show the savings that the product makes.

I have a really similar stack uh I'm

working on back home. Um it's very tough

to just say, "Hey, I'm saving everyone x

amount of dollars." Um, you mentioned

talking to like developers and people at

the company using it. How are you like

kind of phrasing those questions or like

getting that data to figure out like how

much the product is really saving?

>> Okay, I heard everything up until the

second to last part. There's like these

two delayed echoes. Can you repeat that

again? Just the last part.

>> Yeah. Yeah. Yeah. Um, I'm just kind of

wondering like how are you approaching

the developers like the people of the

company using the tool to um figure out

like the cost savings of the using the

stack that you're creating versus just

like the old monolith solutions that

they were doing before.

So for for us uh in in our case we're

hosting this platform on a couple AWS

accounts but mainly there's centralized

AWS accounts that the platform engineers

own and to just like go in there and

look at the Cost Explorer doesn't give

you a breakdown. And I

know this changes now, with like EKS having

some level of cost visibility into

namespaces, but, um, we had the problem

where the developers would build things

and then not really track their cost

very much and it'll kind of fall on us

and we have to report back to them and

be like, "Hey, you guys are spending a

lot." So, Kubecost helps because one, it

helps centralize things. Second, you can

generate reports like in PDF or whatnot,

just even like links to to pages and

they can see their cost across different

AWS accounts uh for their Kubernetes

clusters and also they track out of

cluster assets. So for example if as a

part of their application it needs to

use an RDS database uh it it can

aggregate that cost as well. So

the customer can see, or the developer

can see the full cost of their

applications not just the Kubernetes

cost.

>> Okay. So you're tracking cost mostly on

like a

>> cloud resource,

>> right?

>> We're tracking cost of the cloud

resources. There's other features in

there too which I personally haven't

delved into too much. like it gives you

like hints for what to optimize. Like it

tells you like I don't know if you saw

there on the slide but it'll tell you

how much of your cost is actually idle

cost which is helpful because it will

kind of more or less hint at what is

overprovisioned.

Um yeah is that do you have any other

specific questions or

>> uh no between the 500 gigabyte gly

instances and how much time I'm saving a

developer for using the product. Uh I

think you did a great job. Thank you.

Okay, cool. We can chat after.

>> Thank you, Max.

>> [applause]

CNCF [Cloud Native Computing Foundation]

29:18

Platform Engineering & DevOps Culture

Description

The allure of an internal developer platform (IDP) is tantalizing, isn't it? You've watched the talks, and you know why having one would make life better for yourself, your team, and your customers. However, using a handful of open-source software (OSS) or Cloud Native Computing Foundation (CNCF) projects in your organization isn't a magic silver bullet. When building an IDP, a lot of factors come into play.

* Do you set up a centralized approach for syncing configurations across dozens of clusters?
* Is a service mesh truly necessary?
* How can you make the most of shared clusters while not becoming the system admin keeping everything together?
* How do you expose services internally and externally in a way that prevents disjoint dev teams from introducing network misconfigurations?

This talk won't prescribe solutions. Instead, it will draw on experiences from building platforms at Viasat, highlighting often-missed considerations for those just starting their platform-building journey.

Video Details

Featured Date

November 25, 2025
