In this hands-on Kubernetes operator
course, you'll learn how to extend
Kubernetes by building your own custom
operators and controllers from scratch.
You'll go beyond simply using Kubernetes
and start treating it as a software
development kit. You'll learn how to
build a real-world operator that manages
AWS EC2 instances directly from
Kubernetes, covering everything from the
internal architecture of informers and
caches to advanced concepts like
finalizers and idempotency.
Shubham developed this course.
>> Now, if you already know Kubernetes, you
know that there are concepts and
Kubernetes objects like pods,
deployments, replica sets, stateful
sets, services, and so on and so forth. But
did you know that you can create an
object called EC2Instance?
No? Well, that's the beauty of
Kubernetes: you can extend its
current capabilities and create
something called an operator. You can
create an operator to control things
that are outside of Kubernetes, like an
EC2 instance, which is what we will learn
in this course.
I'm very excited to bring you the
Kubernetes operator course from scratch.
This 6-hour-plus course is brought to you by
Shubham, who has 8-plus years of
experience, works at Trivago, has
trained many engineers on OpenShift, and holds
multiple certifications, including GCP Cloud
Professional and DevOps. This course
comes as an outcome of his work at
Trivago building custom operators in
production. Yes, we'll build a
full-fledged working operator end to end
from scratch, learning why it is even
important, how to do it, and everything
about Kubebuilder, and then building it
end to end. I'm really, really excited
about this course and cannot wait for
you to get started. So before we can
build a custom operator for Kubernetes,
we need to know what an operator is,
right? And before that, there is a term
called "controller" that you really need
to be familiar with. Many of you might
already know what a controller is; you
have heard about the kube-controller-manager.
But what does it really do? What work is
the controller responsible for?
So a controller is nothing but a
forever-running loop. Think of it as a
piece of software, which we will be
writing, that runs forever. If I want to
write a bit of pseudocode for it, it's
kind of like this. You always run it,
and the first thing it does is observe
the state of the resource. Whichever
resource you are writing an operator
for, you will have a controller for it
as well. So whether you work with pods,
deployments, services, or config maps,
there is a controller for each of those
resources. The first thing it does is
keep observing the state of your
resource. If the state is updated, for
whatever reason (maybe in your
deployment you changed the image, maybe
in your config map you edited its data),
the second thing a controller, or an
operator, really does is compare the
current state to the desired state. And
this is where you put your business
logic. This is where you define what to
do in case a drift is recognized and,
most importantly, what not to do if
there is no drift, because it is very
important to make your operators, or at
least your controllers, idempotent.
They have to be idempotent; I cannot
stress this enough. We will talk about
the reconcile loop in just a minute, but
the point is that if your resource needs
no change, nothing should be done on
Kubernetes. You should be able to run
your controller as many times as you
like, and it should not result in a
change if no change was needed. And if
it finds that there is a drift between
the current state and the desired state,
it then does an update; or you can say
it acts on whatever logic you have given
it for the case where a drift was found.
And then we close the loop. So it's a
forever-running loop that never stops
and keeps watching the API server for
the resources you are managing.
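That forever-running observe/compare/act loop can be sketched in a few lines of Go. This is a toy stand-in, not the course's actual code: the `State` type and the returned action strings are invented for illustration, and a real controller would read state from the API server through a client instead.

```go
package main

import "fmt"

// State is a simplified stand-in for a resource's state; a real
// controller would read this from the API server via a client.
type State struct {
	Replicas int
}

// reconcile compares actual vs desired and returns the action to take.
// It is idempotent: when the states already match it does nothing.
func reconcile(actual, desired State) string {
	if actual == desired {
		return "no-op" // happy path: no API calls at all
	}
	if actual.Replicas < desired.Replicas {
		return fmt.Sprintf("create %d pods", desired.Replicas-actual.Replicas)
	}
	return fmt.Sprintf("delete %d pods", actual.Replicas-desired.Replicas)
}

func main() {
	fmt.Println(reconcile(State{Replicas: 3}, State{Replicas: 5})) // drift: scale up
	fmt.Println(reconcile(State{Replicas: 5}, State{Replicas: 5})) // match: do nothing
}
```

Running it forever with a watch on the API server is exactly the loop described above: observe, compare, act, repeat.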
Now, what we are going to build is a
cloud controller, because what we are
building will be a piece of software
that actually runs in your Kubernetes
environment. Let's say this is your
Kubernetes cluster, and there you say: I
want to make a resource of kind
EC2Instance. Let's put it this way: it
goes to Amazon and checks whether an
instance with this name is already there
or not. If it is there, it does nothing.
If it is not there, it creates one. So
it's what we would call a cloud
controller. Think about when you run on
EKS, or when you go to Azure Kubernetes
Service: it is very easy for you to
change the Service definition (the SVC,
for example, in EKS) to get a load
balancer. You just set the Service type
to LoadBalancer, and in your EKS cluster
there is a piece of software running
that abstracts away how to create a load
balancer and how to register your
Service endpoints as backends of that
load balancer. It hides the complexity
from you, and that is what a cloud
controller does. There are many
different controllers that cloud
providers ship in their own Kubernetes
distributions to make your lives easier,
so that you do not have to know the
nitty-gritty details. You just say that
you want a resource, and then you get
one; that is what a cloud controller
manager would be.
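As a concrete illustration of that pattern, this is roughly what such a Service manifest looks like (the app name and ports are made up; the cloud controller in EKS or AKS watches for `type: LoadBalancer` and provisions the actual load balancer):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app          # illustrative name
spec:
  type: LoadBalancer    # this one field asks the cloud controller for an LB
  selector:
    app: my-app
  ports:
    - port: 80          # port the load balancer exposes
      targetPort: 8080  # port the pods listen on
```

The developer only declares what they want; the cloud controller reconciles the how behind the scenes.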
Now, when I was talking about
controllers, we used this term
"idempotent", and this is something I
actually want to explore a little bit
with you. There are a few things your
code should be doing when you write a
controller, when you write the logic for
what to do. The first thing is the happy
path. So what is a happy path? You have
your logic; this is also called the
reconcile loop. It drives the cluster
state to your desired state, and that is
what it reconciles; that is why
Kubernetes is eventually consistent. I
mean that in the sense that you make a
change and eventually, usually within a
very short time, the cluster state is
going to match the desired state you
asked for.
Now let's zoom into this path a little
bit, with case one: you have your logic,
your resource got updated, and your
reconcile function is triggered. This is
the start, the beginning, of your loop.
The first thing you do is get your
object from the request. The way it
works is that when you update a resource
in Kubernetes and there is a controller
watching it, the controller gets a
request, the API request that went to
the API server, and from it the
controller can get the object data. For
example, if you updated a config map,
your reconciliation loop can get the
YAML or the JSON of that config map, so
you can see what has been changed, what
updates the user has made. You get the
object from the request, and then you
observe the desired state from the spec.
For example, you define your config map,
or let's say a pod: you have a pod, and
in its .spec you define your containers.
So you can see what spec there is for a
particular resource, and then you can
compare that spec with the actual state
of the resource. If they match, if the
number of containers in your pod is
exactly what you wanted, then you just
skip it; you don't have to do anything.
This is the happy path: you do nothing.
And it is absolutely important that you
realize you don't have to do anything in
this case. You don't make any API calls.
You just ignore that request to your
reconciliation loop, because the actual
state is equal to the desired state. And
that is how you exit your loop
gracefully. Of course, I'm not saying
you stop the loop, because you have to
keep listening for requests, but you
will not make any changes.
There's also a second thing that can
happen. In this step, your function is
triggered, you get the actual object
from the request, and you see the spec
of the object: which object is being
modified, and what the actual state of
the resource is. And this is where it
gets interesting. If the desired state
is equal to the actual state, you do
nothing; we know this from the happy
path. However, if they do not match, for
example: in your deployment, the current
value stored in etcd is replicas equal
to three. Let's take this example; it's
a nice one. So the current value is
replicas equal to three for a
deployment, and this key is stored in
etcd. Now you do a kubectl edit
deployment, give the name of the
deployment, change the replicas to five,
and save the file. The first thing that
happens is that your reconciliation loop
gets a request: I am watching
deployments, and this deployment has
just been updated. Your current is
three, your desired is five. In your
spec you now have replicas equal to
five. This is what I mean when I say
observe the desired state: you get the
actual object YAML, and from it you
observe the desired state. So you want
five replicas, and you observe the
actual state, which is still three
replicas. There is now a drift: the
current state does not match the desired
state, and this is where your logic
comes into the picture, defining what to
do when the resources do not match what
the user asked for. You calculate the
difference and take some action: a
create, update, or delete on the
resource. In this case you would create
two more pods, because you wanted five,
and 3 + 2 gives the five pods the user
asked for. Then, if your action
succeeded, you update the status field,
and you exit the loop again. This is
very important: every resource in
Kubernetes has a .status. You have a
spec and you have a status, and this is
how the reconciliation loop knows
whether the actual state matches. If for
some reason you could not create the
pods, whatever the reason may be, you
can return an error and then requeue,
retrying that action. And this is what
makes Kubernetes self-healing. It tries
again, and it tries again with a
backoff, which you can configure. Say
you could not create the pod because
there was not enough memory; in that
case the pods would actually be created
but left in the Pending state, so that
is not a great example. But say that for
whatever reason your pods could not be
created; maybe you were missing the
role-based access control in the
namespace where the pods should be
created. The request will now be
requeued, which means it goes back to
the beginning of the reconciliation loop
and is started again. That is what
happens when I say you need to requeue:
requeue means you retry that action. And
this is what Kubernetes self-healing is
about, because once you grant the
role-based access control to the
controller, it will be able to create
the resources. It's not "I tried once
and I couldn't do it"; it keeps trying
again and again. You might have seen
this: if you have a pod which needs a
persistent volume, it goes into Pending
if the PersistentVolume the pod needs
does not exist. But if you create one,
the pod automatically gets scheduled and
started; you do not have to do anything.
And this is the beauty of a loop that
can requeue for your cases. This is
absolutely the brilliance of
self-healing in Kubernetes.
Now, one thing you have to be very
careful about is this: there is also a
sad path, and it is something you
always, always want to avoid when you
are writing a custom controller. The
steps are pretty much the same. You
start your loop; you got a request that
somebody updated the deployment; you
look at what changes they made and
whether there is actually a drift or
not. If the actual state already matches
the desired state, you have to do
absolutely nothing. What I mean by that
is that you must not update the resource
for anything, because of what happens
when you update the resource. Now, this
is interesting. Go back to the case
where there was work to do: you
calculate the difference and you update
your resource. If that action succeeded,
you will actually trigger the reconcile
function again, because you updated the
resource. Kubernetes controllers do not
know what you updated: whether you
updated the spec, the metadata, or the
status. They don't know about that. They
just say, okay, this deployment resource
was updated, so I will rerun the
reconciliation loop. And now, because
you created five pods, your replica
count is actually five, so the loop will
say: I get the object, the user wanted
five replicas, the replicas have already
been created, the state matches, I don't
have to do anything. You have to write
your reconciliation loops to be
idempotent. But suppose you got a
request, you get the object, you observe
the states, and there is no diff; there
is no work needed. Maybe the spec said
five replicas and your actual state was
also five, so you need to do nothing.
But by mistake you update a last-sync
field in the status. You say, okay, it's
just metadata; it does not change my
deployment, it doesn't change my
containers, it doesn't change the image
I'm using or the environment variables.
I'm just recording, like a good person,
when this was last synchronized. So you
decide that whenever a request comes in,
even if you make no changes, you update
status.lastSync, which triggers an API
call. And you see, whenever you update
your resource, it goes back to the
beginning of the reconciliation loop,
and this is where you end up with a
forever-running loop. A request comes
in; you get the object; you observe the
desired state from the spec; let me zoom
in a little bit: there was actually no
need for any change on the resource, but
by mistake you update the status. So
Kubernetes says, okay, the object the
controller is watching has been updated,
and it goes back to the beginning of the
loop, and then you update the last sync
again. Kubernetes says, I got a new
update, back to the beginning, and this
loop will continue forever. Your
resource will keep on updating without
ever stopping. So it is very, very
important that you are careful not to
make any changes if no changes are
required.
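The difference between the sad path and an idempotent reconciler can be shown with a toy model. Everything here (the `Object` type, the write counter standing in for API-server calls, the lastSync field name) is invented for illustration:

```go
package main

import "fmt"

// Object is a toy resource with a spec, an observed state, and a write
// counter that stands in for calls to the API server.
type Object struct {
	SpecReplicas   int
	ActualReplicas int
	LastSync       string
	apiWrites      int
}

// badReconcile updates status.lastSync on every run, so every run
// emits a new watch event and retriggers itself: the loop never settles.
func badReconcile(o *Object, now string) {
	if o.ActualReplicas != o.SpecReplicas {
		o.ActualReplicas = o.SpecReplicas
		o.apiWrites++
	}
	o.LastSync = now // unconditional write = infinite reconcile loop
	o.apiWrites++
}

// goodReconcile only writes when there is drift; rerunning it on a
// matching object is a pure no-op.
func goodReconcile(o *Object, now string) {
	if o.ActualReplicas == o.SpecReplicas {
		return // nothing to do: no API call, no new watch event
	}
	o.ActualReplicas = o.SpecReplicas
	o.LastSync = now
	o.apiWrites++
}

func main() {
	o := &Object{SpecReplicas: 5, ActualReplicas: 5}
	for i := 0; i < 100; i++ {
		goodReconcile(o, "t0")
	}
	// prints 0: a hundred runs on a matching object touch nothing
	fmt.Println("API writes after 100 runs:", o.apiWrites)
}
```

The bad version writes once per run even when nothing changed, and each of those writes is exactly the kind of update that wakes the controller up again.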
Now, this is actually the foundation of
how to write an operator. The controller
is the actual logic you have to have.
There are a couple of things to keep in
mind when you are writing an operator,
and I think this is absolutely important
and worth reading. The most important
question you, or rather your controller,
should be asking is: is there anything
for me to do? That means: if the current
state is equal to the desired state,
exit immediately and do nothing. There
is a golden rule you should follow as
well: only write to the API server when
the actual state differs from the
desired state. In the case we saw, you
say: I know the actual state is equal to
the desired state; I make no calls to
the API server; I do not update my
resource. But by mistake you update the
last sync, which is again a request to
the API server to modify the resource.
Then the reconciliation loop sees that
there is an update, goes back, and
reruns the loop, and that is a problem.
So you always have to make sure that you
only make changes to the resource when
it differs from the desired state. This
is also what idempotent means: you can
run your loop 100 times, and if the
cluster is already in that state, you
should not be doing anything. It doesn't
break anything and it doesn't change
anything if the cluster state is equal
to the desired state. That is absolutely
important to take into account. And here
is what is interesting, what makes these
operators resilient: they are stateless.
They don't remember what they did with
your resource in the last request. They
don't remember whether the pod replica
count was three or five or seven. They
don't remember whether you had an
environment variable or not. They always
check the resources; they always go to
their source of truth. Maybe you are
writing a cloud operator, so they go to
the cloud; maybe you are writing a
database operator which creates
databases, so it goes to the database
and always runs the query. And this is
why they are stateless. So your
controller can actually be killed, or
the node it was running on can be
deleted, or the container can crash; the
controller will come up on another node
and simply start from there. It doesn't
need a persistent volume to store state.
And this is why it can crash, restart,
and still figure out whether it needs to
do something on a particular resource or
not, because that is what you have made
it do: it always observes, it always
checks the desired state against the
current state, and if there is anything
to be done, it does it; otherwise it
says, cool, the resource is already in
the state the user wanted.
Now, this was about controllers, but
what is an operator? I think you might
already know about operators in a way,
because you want to write your own, but
let's just go through it quickly.
Imagine you want a house. Let's say you
are living in India; this is an example
I like very much. You have a house
already; maybe your parents own one. And
one day you decide to move to Germany.
The place is completely new to you. You
have never been to Germany before, and
you don't speak the language; you don't
know German. Now you need a place to
stay, a house to stay in. You call a
company, and the company says: hello
sir, you're moving to Germany; we will
help make sure your move is easy and
simple. We have two options. One, we can
give you a fully furnished house. Or, as
the other option, you can get a simple,
unfurnished house. You can choose
whichever you want, and we would be
happy to give you the key when you land
in Germany, once you sign the forms and
everything. The company also says one
more thing: sir, while we are giving you
the furnished house, we also give you a
helper.
Now you ask: what is this helper? What
is it going to help me with? The company
says: at some point you might break a
tap, maybe your water filter breaks,
maybe you spill something on the carpet,
maybe your bathroom tap is broken, maybe
you break a window; you never know. You
don't know anyone in Germany. Will you
fix it yourself, or will you have the
helper do these things for you? You
don't know the nitty-gritty details of
where the hardware store is or whom to
call if you lose the keys to the house.
Let the helper do it for you. So the
helper is someone who has full knowledge
of this house and full knowledge of how
to fix things when they go wrong. You
just have to tell the helper. Maybe you
lost your keys; just tell the helper: go
get me a key. He knows where the store
is; he has the knowledge of where to go
and how to ask, in German, the person
who can cut you a key, and he gets you
one. If you have a broken pipe, he knows
how to fix it. So think of this helper
as the actual operator. Now, if you want
to port this into software terms, think
about a database called MySQL.
Installation is easy nowadays, because
you have a container: you simply run it
and you get your app, your software. But
what about day-2 operations? What about
a database migration of your schema?
Maybe you want to take a backup, or take
incremental backups on a particular
schedule. That knowledge needs to sit
either with you or with someone who can
do it for you. And this is where MySQL
not only gives you the database itself
but also ships an operator for you. This
operator is actually a controller
running internally, and this controller
has all that logic: if the user asks me
to create a database, I know how to do
it; I know how to log into the DB and
create the database. It knows about it,
so you just have to say what to do. In
the analogy, the helper was the
operator, and the MySQL database was the
product we were actually looking for.
That makes your life a lot easier,
because you don't have to worry about
the lower-level details. Now, an
operator has two things: one is a custom
resource definition (CRD), and the other
is a custom resource (CR).
You know how you can do kubectl get
pods: you get a response whether you
have pods or not; it either lists the
pods or says no pods found in the
namespace. But if you do kubectl get
apple, Kubernetes does not know what
this resource called apple is, because
Kubernetes has its own vocabulary. It
has the API resources it has been told
to remember, and those are the internal
resources native to Kubernetes, like
pods, deployments, secrets, and
services. These are the resources
Kubernetes knows about. But what if you
want to create your own resource, which
in our case is going to be an EC2
instance? I might also want to create an
S3 bucket. In that case, I need to
expand Kubernetes's vocabulary: here is
a resource called EC2Instance, and if
somebody gives you a YAML with kind
EC2Instance, you know what it is. What
to do with it is a different story, but
at least you recognize it. So if
somebody runs kubectl create on that
file, you don't just say "I don't know
what this resource is"; you know about
it, because I have given you the schema
of what an EC2 instance is. I have given
you the custom resource definition, so
whatever the user gives you with this
kind, you accept it, because your
vocabulary has been extended. And
whenever you instantiate a custom
resource definition, the result is
called a custom resource. For example,
once you have created the custom
resource definition for EC2Instance and
then create one, you can do kubectl get
ec2instance.
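To make this concrete, a minimal sketch of such a CRD might look like the following. The API group compute.example.com and the schema fields are invented for illustration, not necessarily what we will use in the course:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  # name must be <plural>.<group>
  name: ec2instances.compute.example.com
spec:
  group: compute.example.com
  scope: Namespaced
  names:
    kind: EC2Instance
    plural: ec2instances
    singular: ec2instance
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                instanceType:
                  type: string
                amiId:
                  type: string
```

After applying this, kubectl get ec2instances stops being an unknown word; it simply returns an empty list until someone creates a custom resource of that kind.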
What you receive is an instantiation of
the definition, that is, a custom
resource, and, very importantly, this is
what your operator, your controller,
acts upon. Your controller knows when a
resource of type EC2Instance has been
created or deleted. If you create a
resource called EC2Instance, it knows
that there was an update on this
resource, namely a create, and the
controller will create that resource for
you. If you delete it, the controller
says: on this resource which I am
watching, a delete operation was
performed by the user, so it goes ahead
and deletes it for you. Without the
controller, your custom resources are
nothing; Kubernetes merely knows about
them. It does not react to them, and it
does not acknowledge "I'm going to do
what you want me to do", because it does
not have the knowledge. So while you use
the CR and the CRD to say what you want,
the controller behind them is the how
part of it: how do I do that? And this
is what we are going to build: a cloud
controller for creating EC2 instances on
Amazon. This is what we will be working
toward.
There is also something else you need to
know: Kubernetes is not just a platform
anymore; it is a complete operating
system. So let's talk about how
Kubernetes is actually extensible and
how you can use Kubernetes as an SDK.
What is very important with Kubernetes
is to look at it not just as a platform
where you can run your applications, but
to ask how you can use Kubernetes as a
software development kit and what you
can do with that on other platforms. The
first reason Kubernetes is so widely
adopted, by cloud providers,
on-premises, and by other software, is
its extensibility.
Let me get a different color. It is
because of the extensibility: because of
these custom resources, these operators,
and these controllers, which is what we
just talked about. Kubernetes also has
an API-first approach. Everything in
Kubernetes has an API: your pod is an
API, your service is an API, and the API
server has APIs for all of these things,
which makes it very easy to write code
against. There are client libraries for
this, and that makes it very, very easy;
you have SDKs you can build your
controllers on for Kubernetes. There is
Go, Python, Java, and there is a
JavaScript integration with Kubernetes,
because there are client libraries for
that as well. Kubernetes also has
backward compatibility: it does not just
delete API resources, it deprecates them
first. It gives you enough time to move
to a different API version, and it
versions its APIs. You might have seen
pods/v1, or something like
networking/v1beta1. This is the version
of the Kubernetes API, and it makes it
easy to develop new APIs without
breaking the existing ones, which makes
it really helpful, I would not say
simple, but helpful, to extend the APIs.
And everything is a plug-in: for
networking you can bring your own CNI
and choose from different CNIs. A very
popular one is Cilium, from Isovalent,
a company acquired by Cisco. You also
have different options for storage and
for runtimes, and there are webhooks
where you can intercept everything with
an admission controller, which can
either validate your request or mutate
it. I think these webhooks deserve an
entirely different course of their own;
I would not do them justice by just
saying "there is an admission controller
which can validate and mutate", so that
is probably something for us to look at
in the future.
And this is why, because of this
extensibility of Kubernetes, different
cloud providers offer different flavors,
and there are a thousand-plus tools you
can use on top of Kubernetes. There is
OpenShift from Red Hat, there is SUSE
Rancher, and there is Tanzu from VMware.
Then there is software built on top of
that, such as Kubeflow and Knative,
which are quite popular nowadays, and
that is what makes developers happy,
because they get to say what, not how.
Now, suppose you are working in a
platform engineering team. A developer
wants a machine in Amazon; he or she
wants an EC2 instance, and you manage
the cloud. Say you are the cloud admin
who will give them the EC2 instance.
They come to you, you run some commands,
and you hand them the instance. That's
okay, but it is a very old approach.
What you can do instead, and this is
what internal developer platforms help
you with, or you can build your own, is
say: listen, if you want an EC2
instance, you don't have to come to me.
Just give me this YAML. You can explain
it to them, and you can have a Helm
chart around it that says: I want an
instance; here is the number of
instances, maybe two; the instance type
you want; the AMI ID of the machine
image to use; and maybe the port numbers
that should be open. They give you this
in YAML format, and you pass it to your
controller, perhaps after a pull request
review. So the definition is stored in
GitHub, they open a pull request, and
then they get an EC2 instance. With
this, they get to say what they want.
They don't care about how to create
resources in EC2, they don't care about
VPCs, they don't care about anything.
And because you now have a GitOps
workflow, you can have Argo CD deploying
these resources, and the controller
takes care of creating the EC2 instance.
Everything is code, and you can manage
these resources very simply. This
platform-as-a-product mindset is what
platform engineering is all about. You
have declarative options, and you can
use Helm to make developers' lives easy:
they just give you this information, you
render the resource, and your controller
takes care of it. I cannot stress enough
how much simpler this makes our lives.
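Sketching what that developer-facing YAML could look like (every field name and value here is illustrative; the real schema is whatever your CRD defines):

```yaml
apiVersion: compute.example.com/v1alpha1   # illustrative API group
kind: EC2Instance
metadata:
  name: dev-box
spec:
  instanceType: t3.micro
  amiId: ami-0123456789abcdef0             # placeholder AMI ID
  openPorts:
    - 22
    - 443
```

The developer commits this file, the pull request is reviewed and merged, Argo CD applies it, and the controller does the rest.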
Now, the thing is, you can run
Kubernetes anywhere, and the reason you
can run Kubernetes anywhere is
standardization. You can run it in any
cloud, you can run it on the edge, you
can run AI workloads on top of it,
anywhere. Kubernetes is a standard
because it has one pattern, the
controller pattern, that rules them all.
DNS just works; again, it can be
problematic, but every pod knows where
every other pod is. It has its own
challenges depending on how many
services and how many pods you have in a
cluster, and scalability can be another
issue, but for a cluster you have
bootstrapped, it just works fine. And
then you have config management for your
developers, which I don't think I need
to talk about.
The point I'm trying to make here is
that it's not just a container
orchestrator. It is a complete operating
system. You want networking, it has it.
You want memory management, it has that.
You want compute management, CPU,
storage, disk management, it has it. So
you can actually build, package, and
ship software that runs on top of
Kubernetes, any sort of software. You're
not just using Kubernetes; you can
extend it with all of these controllers
and operator frameworks that we are
talking about, and this is why I love
Kubernetes a lot. All right, so this was
about how you use Kubernetes as an SDK.
Now let's talk about how you bootstrap a
Kubernetes operator with a tool called
Kubebuilder, and this is where our
journey begins. So let's go on and do
some hands-on on writing an operator.
So before we can build our own
Kubernetes operator, we need a place to
run this operator on, and that is going
to be Kubernetes. Now, you could build a
Kubernetes cluster in GKE, you could use
Amazon's managed service, or you could
build your own cluster with kubeadm.
Whichever way you want to do it is fine,
because the operator you are building
will be built into a container image,
and that container image can run on any
Kubernetes cluster. In our case, we want
to keep it simple, so I'm going to build
the operator and test it on my cluster
running locally, and create instances on
Amazon, which is external to the
cluster, just to show that you can
manage infrastructure that is external
to your Kubernetes environment. This is
why Kubernetes is so popular: it lets
you use it as an SDK, as an operating
system of the cloud, which we will also
talk about later. So K3D is a Kubernetes
tool by Rancher, which has many other
distributions, like K3s, a very simple,
lightweight Kubernetes distribution, and
RKE2, which is more hardened, for
security, if you are working in
government. K3D lets you create
Kubernetes clusters inside containers.
If you have kind, you can use kind. If
you have K3D, you can use K3D. If you
have a sandbox cluster somewhere, you
can use that as well. The reason I'm
running this locally is that it's very
lightweight, it does not cost me lots of
resources, it's free of course, and it's
very fast because it's running on my
computer.
So for K3D, installation is very simple.
Just go to the installation script and
download it with either curl or wget. I
would suggest you go with the latest
version. Once you have it installed, you
can run k3d version, and I've got the
latest version of K3D, which is 5.8.3.
The Kubernetes version I'll be using
when I build a cluster with K3D is going
to be 1.31.5.
But there is a newer version of
Kubernetes; what if I want to use that
instead? We are DevOps engineers, cloud
engineers. We like to have a single
source of truth for all of our
applications, which is why we do GitOps,
right? And wouldn't it be nice if you
could version control your clusters as
well? Right now I have one cluster with
two agents; maybe I want to increase
that, so let me put it into GitHub. That
is exactly what K3D allows you to do,
with a very simple cluster config file.
It has lots of options, which you can
look up in the K3D documentation;
however, I've kept it very simple. This
one gives me one master. K3D allows you
to create multi-master, multi-node
clusters, but I'm just going with one
because I don't need high availability.
Second, I'm going to use two agents
here, which are going to be the worker
nodes. And this bit tells me the version
of Kubernetes that I want to use, and
that's the one we will be using.
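The config file being described might look like this (a minimal sketch; the cluster name and image tag are placeholders, and the full schema is in the k3d docs):

```yaml
# k3d cluster config: one master (server), two agents, pinned Kubernetes version
apiVersion: k3d.io/v1alpha5
kind: Simple
metadata:
  name: dev-cluster
servers: 1    # one master, no HA needed
agents: 2     # two worker nodes
image: rancher/k3s:v1.31.5-k3s1
```

You would then create the cluster from it with `k3d cluster create --config <file>`.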
You also need Docker, because K3D
creates containers in which it runs your
Kubernetes cluster, which itself runs
containers; there's a whole inception
going on there. These are the two things
I'll be using; if you have any other
distribution of Kubernetes, you can very
simply use that. So I've got Docker
running on my machine; actually I've got
OrbStack, which gives me a Docker
runtime in the background. To talk about
K3D, its architecture is fairly simple,
and this is what it looks like. You have
your laptop or computer on which you
want to create multiple Kubernetes
clusters. As a developer, I might need
different clusters for different
applications; I might want to promote
them through dev, testing, and QA, just
to have a pipeline for a complete
software development life cycle. That's
possible too, and that is where K3D
shines. When you make a cluster in K3D,
it creates a separate Docker network for
each one, so they are completely
isolated from each other and each has
its own network to talk in. Here you can
see I've got one cluster which is blue
and one which is green, cluster A and
cluster B; this is the master node, and
these are the agents, our workers,
because that's where the actual work
gets done; and we have these Docker
networks created. Right now, if you do
docker network list, you see the
standard Docker networks that are
created when you install Docker.
However, when you do k3d cluster create
with this config file, which is our
source of truth, a new network gets
created, which I just showed you; we
will see that in a moment. Once you ask
it to create a cluster, not only does it
create the cluster, set up a gateway,
and create your workers, it also updates
the kubeconfig, or rather it can help
you get the kubeconfig, and here you can
see my context is automatically set for
kubectl. You can verify it with kubectl
cluster-info, and if I do that, that's
where my cluster is running. Now, if you
do docker ps, you will see a couple of
containers that have just started, and
this is our K3D infrastructure: two
agents, which are our worker nodes, one
server, and also this nginx proxy
container, which is there for a reason.
The reason is for you to talk to your
API server: because you can use K3D to
create multiple masters, you need a load
balancer, and rather than you having to
set one up, K3D does it for you. It
creates a container that is listening on
a port on your computer, which is 5745,
and forwards the traffic to 6443 of the
master, or masters in case you have
multiple. That's why you see that the
Kubernetes control plane is running on
5745 on all the IP addresses of your
computer. If you go to this port, you
will be talking to Kubernetes, to the
kube-apiserver.
Now, what can you do? Every time you
have a new cluster, it's good to do some
smoke testing, something very simple. So
we can do kubectl get nodes. There you
go: you have one control plane, one
master, and two agents which are ready.
You can do kubectl get service, and
kubectl get pods. Some of what you see
is CoreDNS, which is very simple; it
comes with a metrics server; it comes
with Traefik, which lets you expose your
services outside, or work as an ingress
if you will; and it has a local-path
provisioner, which is for storage. I
talked about the metrics server already.
Now let's try to do some smoke tests. If
you do kubectl create deployment, or k
create deploy, it's going to create a
deployment and a pod. k get pods, and
here you can see it's ContainerCreating.
If I do k logs on my deployment, this is
a log for nginx; that is fairly simple,
and if you have used nginx before, this
should be nothing new. You can also
expose your deployment. We want to check
the network connectivity between our
applications; if one pod can talk to
another application in the cluster,
let's just validate that. So I want to
expose my deployment, called
my-deployment, on port 80. Here you can
see it's a Service resource in
Kubernetes, and it has got a cluster IP.
Now, if I want one application to talk
to another application in my Kubernetes
cluster, I can use this cluster IP, and
that's exactly what we will do. So here
we have a pod in our new cluster, for
which we just created a service. I want
to test the networking in my K3D
cluster, so I would create a new pod,
try to curl this service, and I should
get a response from that pod. I should
be able to curl it because it is HTTP; I
know, because I just ran an nginx
server, and this should work because it
is a single cluster. By default you
cannot expose your service IP addresses
outside the cluster; however, inside, it
should work fine.
And that is where we can use our trusty
curl image, which lets you do a curl to
any other IP address or hostname. So we
can do k run: I want to create a curl
container with the name curl, this is my
image, and I want to connect to the IP
address of my service. That's that.
Let's look at the pod: it's
ContainerCreating, and it's completed
already; CrashLoopBackOff. That's fine;
let's check what happened. If I do logs
for curl, it wasn't really crashing; it
just started, exited, started, exited,
because it's not a job that runs to
completion. But you can see here the
response that you get from the service,
which is nginx, and that tells me my
cluster is ready for connections, ready
for me to build applications. You can
also check from your cluster whether you
have external connectivity, because we
will be talking to Amazon.
Might as well check that. So we can do k
run again, let's call it google, and
curl https://www.google.com.
Do I have a pod now? The google
container is creating, and that looks
like Google to me. Looks fine, right?
So we have connectivity between our
applications, and we also have
connectivity to external environments,
and this is going to be the foundation
on which we will be building our
application.
You will also need Go on your computer,
which we talked about, plus Docker and
Git, the standard developer tools. So
that's it; this will be our setup. Now I
think we should talk about what you are
really going to build in this course,
and what a reconciliation loop is. How
does Kubernetes know what you want it to
do? What even is a controller in the
first place? How does it know that the
user has asked for something and it
should act on it? How does it know that
the current state of the cluster does
not match the desired state? Let's learn
that now. If you want to build an
operator, the best thing to use is an
already available framework, which is
called Kubebuilder.
There are also other frameworks that
help you build Kubernetes operators,
like the Operator SDK; however,
Kubebuilder is one of the most popular
operator frameworks that lets you write
your own controllers for Kubernetes.
This is for people who are using
Kubernetes and want to develop an
in-depth knowledge of how Kubernetes
reacts to certain resources, how the
operator loop functions, how you
actually compare the current state to
the desired state, what a webhook is and
how it works, and how you implement
versioning with a Kubernetes operator.
All of that is built in and very simple
with Kubebuilder. It gives you a
starting point without spending so much
time on questions like: what is my
project structure going to be? How do I
structure my code and my test cases?
How do I generate my metrics? How do I
add logging to my controllers? Am I
going to have leader election, and how
do I implement it? On what port do I
expose the metrics? All of that is taken
care of by Kubebuilder. What it does is
give you a directory structure with the
boilerplate code for building Kubernetes
operators already there, thousands of
lines, so instead of having to write it
yourself, you can focus on the business
logic: what the specification of your
custom resources is going to be, and how
to react when there is a change in those
custom resources. That's what it lets
you do, instead of wondering how to even
start an operator in the first place. It
also lets you generate the role-based
access control, and it generates the
Kustomize resources as well, in case you
want to deploy your operator into
different places. It also lets you wrap
your operator into a Helm chart for its
own deployment, so it can be used in any
cluster regardless of whether you are
running in the cloud or on-prem,
wherever you are running. It allows you
to version control your APIs as well. So
for us, let's get started with that. The
first thing you can do is quickly
install Kubebuilder. Let's go to the
installation and setup page, or maybe
I'll look on GitHub, where there are
releases you can download. There are
many different ways of installing it;
you can download from the releases
whichever build works for you. I'm using
a Mac, so I've got the ARM64 build,
because that's my architecture. I think
you may also be able to use Homebrew;
I'm not sure. But as I'll show you,
there you go: you can install
Kubebuilder with a very simple command.
Now, the first thing you do with
Kubebuilder is create a project. Think
of a project as a collection of the APIs
that you will be building; it's a simple
directory structure that lets you
initialize your APIs, so let's do that
now. I already have Kubebuilder
available, version 4.5.1; I think the
latest is 4.7.1, so I'm not too far
behind, but that's okay. So I've got
Kubebuilder, and the first thing we will
do is create a project where we will be
hosting, or building, our API. The first
command is kubebuilder init, and here is
the important thing: when you are
building your custom operator, let's say
you are working at a company called
example, you want to build your custom
resources under a certain domain, which
makes it easy to tell where this
operator is coming from.
If you do kubectl api-resources and pipe
it into less, you can see that every
resource in Kubernetes is its own
identifiable API resource. For example,
if I look at, let's say, these services
here, something like
hub.traefik.io/v1alpha1. We will talk
about what group, version, and kind are,
but just so you know, you can define the
domain under which your API should be
declared and built. So for example, with
kubebuilder init I want to build things
related to cloud, and if I worked at,
say, Netflix, my products would live
under the domain netflix.com. In this
case I'm using cloud.com, plus the
repository in which my code will be
hosted, just as a project description.
What
it does is write the Kustomize manifests
for you, so you can have it deployed in
different clusters based on your
requirements. It writes a lot of
scaffolding code for you, and it creates
a directory structure. It writes you a
Dockerfile, which you can use to build
your operator into a deployable image.
It creates a Makefile that you can use
to generate your custom resource
definitions. Maybe I'll open this in VS
Code; or maybe I'll open it here in
Cursor. That would make more sense. So
it gives you a Makefile that lets you
generate your RBAC and your custom
resource definitions, and it helps you
deploy those into a cluster and
uninstall them from the cluster. If you
are doing local testing, this Makefile
is really helpful. And this is your
PROJECT file, the project information:
the name of the project, the domain
under which your project is defined, and
the version of the Kubebuilder layout
you are using. Apart from that, and this
was the Dockerfile we were talking
about, it gives you this cmd directory.
It has already created a lot of files
and folders for you, so let's quickly go
through them.
First, cmd/main.go is the entry point of
your operator, of your controller, and
this is already done for you. Otherwise
you would have to worry about which Go
libraries to import to build a custom
operator. Whenever I say operator, I'm
also talking about the controller,
because that is the loop that actually
does the job for us. You would be
wondering which library you are supposed
to import; for example, take the
client-go auth package. This auth
package is the one that allows you to
authenticate to Kubernetes; it imports
all the client-go auth plugins in case
you are using GCP or Azure and want to
talk to those clusters; it lets you get
the kubeconfig, and this is the package
you work with. You also have a package
importing the Kubernetes API machinery,
which we will talk about in a bit; this
gives you the runtimes needed to define
a Kubernetes scheme. How do you declare
a health endpoint? How do you do logging
for your operator? It generates a lot of
that codebase, and main.go is the main
file, the entry point for your code; we
will talk about it when we write it. You
also have the config folder, which has
some defaults for Kubernetes objects
like your services and your Kustomize
files. It has a Kustomization that lets
you deploy your operator to different
clusters and namespaces, and a
Kustomization for your manager, which
creates the Deployment and the namespace
in which you want it deployed; it's a
fairly straightforward Kustomization
file. It also lets you create role-based
access control: cluster roles and
cluster role bindings, so it's easier to
run your operators. Otherwise, let's say
you write an operator that watches a
resource called EC2Instance but doesn't
have the permission to list EC2Instances
in a namespace; you would not be able to
manage those resources in that
namespace. So without you worrying about
what your role-based access control
should look like, it generates that
boilerplate, including the RBAC, for
you. It also gives you end-to-end
testing, so you don't have to write your
own test scaffolding; it helps you with
that as well. And
the one thing that is interesting, which
I was looking for; where did that go?
Where is my cmd, config, hack? Am I
missing something? Oh yes, because this
is just the project boilerplate that
Kubebuilder scaffolds for you. The next
thing we can do with Kubebuilder is
actually create an API, and this part is
amazing. This is going to be our custom
resource that we will be creating. So
what we have just done is declare a
project under the domain cloud.com.
Now, with cloud you have many resources
to manage: things like compute, things
like storage, things like network. In
compute, it could be your EC2 instances,
your AMI images, or your security
groups, for example. In storage, it
could be an EBS volume or an S3 bucket
you want to manage. In network, you
might want to manage a VPC, or perhaps a
firewall rule. The point I'm trying to
make is that you can create multiple
APIs in a single project, in a single
domain, and this is what we are going to
do: we will build our own API, which is
going to live in the compute group, and
it's going to be our EC2 resource. That
is what Kubebuilder allows us to do, to
create our own little API. So let's do
that: kubebuilder create api, the group
is going to be compute, and the kind is
going to be EC2Instance.
I want to create the resource: yes. So
this has created the custom resource and
the custom resource definition
scaffolding for me and written it to
disk. And yes, I want it to create the
controller as well. It downloads many
different Go packages, and it also
creates a directory called api/v1. This
is absolutely important: this is the
version of our API, and inside it we get
a file called ec2instance_types.go, and
that is where we define our EC2 types.
Now let's take a look at
ec2instance_types.go and see what it
looks like; this is where the actual
business logic will go for us, where the
actual specification of our API lives.
But before you can build your own
operator for EC2, let's see what using
it would actually look like: what YAML
would you use
for that? So for an EC2 instance, I
would say the kind is EC2Instance. It
would have some metadata: I would give
it a name, my-instance, and the
namespace would be default. The API
version is defined under
compute.cloud.com, and this is version
v1 of our EC2 instance API. And then I
would have two things: almost every
resource has a spec, and then a status
field, and this is very, very important.
When you write a custom resource, you
have to define what the resource is
going to look like: what goes into the
spec of your resource, and what goes
into its status. This is what the file
api/v1/ec2instance_types.go helps us do;
it lets us declare the spec for the
resource we are building. For example,
my spec would have an amiID, and this is
going to be my dummy AMI ID; I would
have an sshKey, the key pair I want to
use on Amazon; and I would have a type,
maybe t3.micro. Then you could have
storage, where you might say: I want a
standard disk, or maybe a fast disk,
which translates to one of the faster
block device types, because all you're
doing is making the developer's life
easy. You're abstracting the actual
details away from the developers, so
they can say: I'll go for a standard
disk of maybe 10 gigs, or a fast one of
50 gigs; that is the data I need. This
would be a minimal spec, and the spec
you're giving, which every resource has,
is defined for Kubernetes as a struct in
Go. So if I look at this EC2 operator,
let's keep it simple: we're going to
keep just these three fields, amiID,
sshKey, and type; these are the things I
want to use. Let me just copy this and
comment it out. Where did that go? There
we go. So
I define the spec: this is the spec for
my operator, and my EC2InstanceSpec will
contain an amiID, an sshKey, and also
the type of instance I want to use. Now,
it's very important to give these JSON
tags, because when you send Kubernetes a
request about a kind of EC2Instance, it
needs to unmarshal your request; it
needs to understand what each key is and
what to do with it: this key is amiID,
this key is sshKey, this key is type.
These JSON tags are required for
serialization, so that Kubernetes knows
which key maps to which field.
Then you also have the status for your
EC2 instance. Here you might want to
report a phase, say Running, whether
your EC2 instance is running or not, and
maybe things like the public IP; this is
what you put in the status field. So for
my EC2 instance I want a phase, which is
of type string; I want the instance ID
as well; and I can simply add a public
IP. Those are the three things I want to
have. Now, when you are building
resources like this, an AI editor really
helps; you can see I'm using Cursor, and
it really speeds up your development.
Again, you are the one doing the
thinking, coming up with the spec and
with what should be shown in the status;
however, it serves as a very good
helper.
Now you've got the spec and you've got
the status, because these two things
absolutely need to be in a resource. So
what does your overall resource look
like? The EC2Instance would have the
type metadata and the object metadata.
When you see any Kubernetes resource,
the kind and apiVersion actually come
from the TypeMeta. This metav1 is a
package in Kubernetes; this Go package
defines the metadata of any Kubernetes
resource, and it has two structs in it.
So the kind and apiVersion that we see
on all Kubernetes resources are actually
defined in a struct in Kubernetes called
TypeMeta, and this is what the
EC2Instance will look like: it will have
TypeMeta. If I copy this, it will
probably make more sense.
Let me just copy that all the way here,
and there you go. So let's comment that
out.
Now, this is the EC2Instance type, the
kind EC2Instance, so I've got that.
There we go. The first thing this
resource has is the apiVersion and the
kind, and these two are defined by the
TypeMeta. Then we have the metadata of
the object itself, defined by the
ObjectMeta, which contains the name of
the object, its generateName, the
namespace, the UID, the resourceVersion,
the creation timestamp; every object has
these two structs declared inside it:
one defining what kind of object it is,
and one defining the object's metadata.
Then you have the spec, which you have
defined, and then the status, which
defines the status of the resource. And
this is how an API is created; this is
how you declare what resources are going
to be in your API. Now I don't have to
tell my developers: guys, you need to
raise me a ticket so I can create you a
resource in Amazon. Oh, you wanted 10
gigs? I probably gave you 15; maybe I
didn't hear that correctly; let me
delete and recreate it, or resize it.
You do not have to do that. If I just
give this to my developers, it is so
much easier for them. Maybe I can give
them a simple UI that lets them declare
the name of the instance, the count, the
storage they want, and it automatically
creates this manifest for me. And
because I already have a Kubernetes
operator, a controller, listening on top
of that, it is very easy for me to track
every request a developer makes for
these instances, because they can all be
put into a version control system, into
GitHub, and you can use Argo CD, which
makes developers' lives so easy. They do
not need to know what fast storage is;
they don't need to worry about what
standard storage is. Of course, they
need to know its benchmarks, but they
don't need to know it is a persistent
disk, or the different types of storage
Amazon has to offer. It is all offloaded
from them, and that is what makes this
very, very simple.
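Putting the pieces together, a complete manifest for this custom resource might look like this (all field names and values are illustrative):

```yaml
apiVersion: compute.cloud.com/v1
kind: EC2Instance
metadata:
  name: my-instance
  namespace: default
spec:
  amiID: ami-0abc123    # dummy AMI ID
  sshKey: my-keypair    # key pair name in AWS
  type: t3.micro
status:                 # written by the controller, not the user
  phase: Running
  instanceID: i-0123456789
  publicIP: 1.2.3.4
```

The developer authors only metadata and spec; the controller fills in status as it reconciles.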
Now, the things you see here, these
lines like +kubebuilder:object:root=true,
are called Kubebuilder markers, and they
are there for code generation; they
drive the generation of the custom
resource definitions for you. For
example, this one says this is actually
a root Kubernetes object, so somebody
could do kubectl get ec2instance; and
here, for example, is what somebody gets
back when they ask for a list of
instances; this type defines what is
returned. It also says the resource has
a subresource called status, which we
defined above. So this is what
Kubebuilder helps you with, and at the
end we register our EC2Instance and
EC2InstanceList with the Kubernetes
scheme. This function takes the types we
just created, the APIs we just declared,
and registers them with the scheme; it
comes from a file called
groupversion_info.go.
Now, this one is a very simple file. It uses the Kubernetes runtime package from apimachinery and the controller-runtime. What these packages let you do is declare your APIs and kinds to Kubernetes. Here you are declaring a schema GroupVersion: the group is called compute.cloud.com. So your domain was cloud.com, your group was compute, which gives you compute.cloud.com, your version is v1, and your kind is EC2Instance. This is how every resource in Kubernetes is identified. Think of it like a URL: every object on the web has its own uniquely identifiable URL, and in the same way every Kubernetes resource is declared in a group, has a version, and has a kind.
Every resource does that; every resource has it. Pod, Service. If I run kubectl explain service, you can see its kind is Service and its version is v1. If you do not see a group, that's because it is in the core group of Kubernetes, which doesn't have a name but is called the core group. The same is true for Pod: if you look here, you can see Pod is v1.
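To make the group/version/kind triple concrete, here is a minimal manifest for the custom resource we're building next to a built-in one; the field names under spec are illustrative placeholders, since we haven't defined the spec yet:

```yaml
# Hypothetical EC2Instance manifest; spec field names are illustrative.
apiVersion: compute.cloud.com/v1   # group "compute.cloud.com" + version "v1"
kind: EC2Instance                  # the kind registered with the scheme
metadata:
  name: demo-instance
spec:
  amiId: ami-12345678              # placeholder AMI ID
  instanceType: t3.micro
---
# A built-in resource for comparison: Pod lives in the unnamed "core" group,
# so apiVersion carries only the version.
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
```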
So this is why you now understand that when you write kind: Pod and apiVersion: v1, you are telling Kubernetes that the YAML you are giving it is a resource of kind Pod, declared in this group, and that you want version v1 of this resource. Every resource has a group, version, and kind, and this code is adding your declared group and types into Kubernetes. It's loading your custom resource declaration into Kubernetes, so when you give it a YAML of an EC2Instance it knows what spec this resource has: what the AMI ID is, what the phase is going to be, what public IP is going to be returned. It knows your spec and status. That is what we are doing here: we create a SchemeBuilder so that we can add our own types, and then AddToScheme adds the types in your group version to Kubernetes. That's where the magic actually happens; this is where you declare what is going to be in your resources.
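Put together, the generated groupversion_info.go is roughly this sketch, assuming the group compute.cloud.com and version v1 used in this course:

```go
// Sketch of a Kubebuilder-generated api/v1/groupversion_info.go.
package v1

import (
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/scheme"
)

var (
	// GroupVersion identifies this API group: compute.cloud.com/v1.
	GroupVersion = schema.GroupVersion{Group: "compute.cloud.com", Version: "v1"}

	// SchemeBuilder collects the Go types we want to register under GroupVersion.
	SchemeBuilder = &scheme.Builder{GroupVersion: GroupVersion}

	// AddToScheme registers those collected types into a runtime.Scheme.
	AddToScheme = SchemeBuilder.AddToScheme
)
```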
Once you have that, you can also look into another directory that it has created for you, called internal/controller, and that is where the reconciliation logic lives; that is where you get the logic of what to do. So the previous part was about the custom resource, but what do you do on top of that custom resource? That's given in the controller package in the internal/controller directory, in a file called your-API-name_controller.go.
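A sketch of what that generated controller file looks like, with names following the EC2Instance example and an assumed module path; the Reconcile body is the part you fill in:

```go
// Sketch of a Kubebuilder-scaffolded internal/controller/ec2instance_controller.go.
package controller

import (
	"context"

	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	computev1 "example.com/ec2-operator/api/v1" // assumed module path
)

// EC2InstanceReconciler reconciles an EC2Instance object.
type EC2InstanceReconciler struct {
	client.Client                  // talks to the API server
	Scheme *runtime.Scheme         // converts between YAML and registered Go types
}

// Reconcile drives the actual cluster state toward the desired state.
func (r *EC2InstanceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Your business logic goes here.
	return ctrl.Result{}, nil
}

// SetupWithManager registers this controller with the manager.
func (r *EC2InstanceReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&computev1.EC2Instance{}).
		Complete(r)
}
```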
What this file does is declare its own package and then create your reconciler. The reconciler struct embeds two things: the client, which gives you an actual Kubernetes client you can use to talk to the cluster, and the scheme, which we use to convert between the YAML you provide and what Kubernetes knows about, the declared resources. Then you have some markers for role-based access control, and this is where the actual reconciliation loop happens. This is the logic that makes sure your cluster state equals the desired state: it reacts to the cluster state, looks at the desired state, and this is where your logic will go. This is the heart of your controller, the heart of what you are writing, and at the end you return a Result
and an error. We will talk about those two as well; right now I'm just running you through the code, and when we write our own as an example we will look into them. Once you have the reconciliation logic, it registers the controller with the controller manager: SetupWithManager uses the controller manager to add your controller to it. I think it makes sense to talk a little bit about the architecture of Kubebuilder, because that will make everything much clearer.
When you run a Kubernetes controller, the first thing that runs is the main.go program. If you remember, this is cmd/main.go, which is the file here. The main.go file is the entry point when you build your operator into a binary; the main function is the entry point of the operator. So let's take a look at the main file from the beginning. It's part of the main package, and it imports quite a few built-in packages from Go.
However, for it to really work as an operator, many more packages are imported, and those come from Kubernetes itself. So let's take a look at those packages. The first one that we see here is the client-go auth plugin package. This lets your operator use exec credential plugins, or OIDC in case you're using that for authentication, to talk to your EKS or GKE cluster API server. It's responsible for making sure your operator can use the kubeconfig and its exec entry points to talk to your Kubernetes cluster.
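In the generated main.go this package is pulled in with a blank import, purely for its side effects:

```go
import (
	// Blank import: registers client-go's auth plugins (exec, OIDC, GCP, ...)
	// so kubeconfigs for managed clusters like EKS or GKE work out of the box.
	_ "k8s.io/client-go/plugin/pkg/client/auth"
)
```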
The runtime package from apimachinery: you understand YAML, but Kubernetes does not understand YAML. It understands objects, which are Go structs. This package defines the scheme and the objects that help convert your YAML into Kubernetes-understandable constructs, Kubernetes-understandable objects. When you do kubectl get pods, the YAML that you get is actually converted from the Pod object in Kubernetes using the runtime package.
We also have the util package from apimachinery. This might look like the same package again, but the first one is defined in pkg/runtime in apimachinery, while this one is defined in util as utilruntime. It's more of a utility package that helps your operator stay stable in case of a panic, which is a kind of fatal error your operator got. Instead of completely crashing the process, it lets you log that particular panic and continue with the operator process, so it doesn't just completely crash on you.
We then have the client-go package, which is the Go SDK for Kubernetes, and here we are importing its scheme package. This lets you register the core constructs of Kubernetes, the Pods and Services, with your operator. Think of it this way: it gives your operator the knowledge of the predefined Kubernetes resources like Pod, Deployment, Secret, and Service, and alongside it your operator registers the EC2 instance custom resource that we are creating.
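In the generated main.go, that registration looks roughly like this sketch, where computev1 stands for the course's api/v1 package under an assumed module path:

```go
// Sketch of the scheme setup in a Kubebuilder-generated cmd/main.go.
package main

import (
	"k8s.io/apimachinery/pkg/runtime"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"

	computev1 "example.com/ec2-operator/api/v1" // assumed module path
)

var scheme = runtime.NewScheme()

func init() {
	// Teach the operator the built-in types: Pod, Deployment, Service, ...
	utilruntime.Must(clientgoscheme.AddToScheme(scheme))
	// ...and our own custom type, EC2Instance.
	utilruntime.Must(computev1.AddToScheme(scheme))
}
```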
We also have the controller-runtime package, and this one right here is the secret sauce. It is responsible for giving you a manager that helps you with clients, caches, and leader election. Controller-runtime gives you the tools to construct controllers that listen for changes on your custom resources; they handle the caches, they handle the clients that talk to the API server, and leader election, if you want it, is also done by controller-runtime. So let me talk a little bit about the architecture of how this controller looks. We have the process, which again is started from main.go, and this main.go has a manager. You will see this coming up ahead, but the manager manages two things: it has a client, which is used to communicate with the kube-apiserver, and it also handles the caches of the custom resources that were updated.
Imagine you want to write an operator that reacts to a change to the EC2 instance object; the cache is where that object's YAML, or spec, will be stored. We're going to talk about the cache much more later in the video, not right now, as it wouldn't make sense yet. For explaining the manager, it's enough that it has the client, used to talk to the API server, and then the cache. And here's where the interesting thing comes into the picture. This green bit right here is our user-provided logic, which is what we write in the Reconcile function. The controller is responsible for reacting to changes and eventually running the reconciler, which is our logic that says what to do if the EC2 instance object was changed, or whatever change you made to it; this is the logic that is going to run. In the manager, alongside your controller, you can also have a webhook, similar to validating and mutating webhooks; if you want your operator to also serve those webhooks, it's possible to do so.
Now, we also have a couple of packages for certificate watching. Let me rather draw this; it will make more sense. Say you are using something like cert-manager, and you have an admission webhook in your operator, a mutating webhook. You register this webhook with the API server, and the API server can then talk to this mutating webhook. I'm not going to explain mutating or validating webhooks here, because that is not part of this course and there is very good documentation you can read about them. However, when your API server talks to any of these webhooks, whether mutating or validating, the webhook has to have a valid certificate.
It does not talk over plain HTTP; you have to have a valid certificate. And a lot of times you would use cert-manager to issue certificates for the service of the controller that is hosting the mutating webhook. cert-manager, again, is used to issue certificates for your webhook, and by default it rotates your certificates, I think every 90 days. If your certificate has changed, maybe you are storing that certificate in a Secret which is mounted into the pod, you would need to restart your controller pod so that the new certificate is loaded and the next HTTPS request uses the certificate renewed by cert-manager. That causes downtime, and to fix it we have the certwatcher package. It creates a watcher for the changed certificates and reloads them on the fly, without you having to restart your controller. So you have no downtime when you, or cert-manager, update your certificates.
We also have the healthz package, which lets you expose the liveness and readiness probes that you can use for your operator. It exposes the health and readiness endpoints which you can use in the Deployment when you are deploying this operator, saying check this endpoint every now and then. It's a standard Kubernetes liveness and readiness probe. We also have the zap package, which is used for logging.
We then have the filters package and the metrics package here. I think it makes sense to first talk about the metrics and then the filters. When you are writing your operator with Kubebuilder, it doesn't just let you focus on the reconciler; that is where your business logic is, and that's what you are supposed to be writing. With Kubebuilder, your operator, which is running in a pod, by default exposes an endpoint called /metrics. And
this might look familiar to you, because it's something we use a lot with Prometheus. When you write a Prometheus ServiceMonitor or a scrape config, you give three things to the Prometheus server: the IP or service name, the port number, and the scrape path. You can use the same with your Kubebuilder operator. When you build an operator, Kubebuilder exposes the metrics endpoint, and it exposes a couple of Prometheus-readable metrics, like the success rate of your operator: how many times the reconciler has executed, how many times it resulted in an error, how many times it resulted in success. It doesn't tell you how many EC2 instances you have created; these are metrics about the operator itself. And then,
if you have a requirement that your operator can create EC2 instances but you also want to know how many it has created successfully, you can expose your own metrics: you can instrument your code with the Prometheus Go packages, and as soon as you were able to create a VM on Amazon (we'll look into the code in further parts of the video) you can increment your AWS instance count, because you were able to create one more instance, and expose that on the metrics endpoint. The thing I'm trying to explain here is that this is already done for you by Kubebuilder, and by default there is no username or password; it is open to everyone. You can then use Prometheus with a scrape config to scrape these operator-related metrics into Prometheus and show them in Grafana.
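If you do want an operator-specific metric like that, the instrumentation could look like this sketch (the counter name is made up), using controller-runtime's global metrics registry so it shows up on the same /metrics endpoint:

```go
package controller

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// ec2InstancesCreated counts VMs this operator successfully created on AWS.
var ec2InstancesCreated = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "ec2_instances_created_total", // hypothetical metric name
	Help: "Number of EC2 instances created by this operator.",
})

func init() {
	// Register with controller-runtime's registry, which backs /metrics.
	metrics.Registry.MustRegister(ec2InstancesCreated)
}

// In the reconciler, after a successful AWS create call, you would run:
//     ec2InstancesCreated.Inc()
```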
However, you can also use the filters package. This lets you define some sort of authentication so that the metrics endpoint is not publicly accessible: I only want to allow someone with this username and password, I want some sort of authentication on this metrics endpoint. The filters package provides the functions that let us gate our metrics behind authentication.
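Wired into the metrics server options, that gating is a single field; a fragment sketched from the current Kubebuilder scaffold:

```go
import (
	"sigs.k8s.io/controller-runtime/pkg/metrics/filters"
	metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
)

// Sketch: serve /metrics over TLS and require Kubernetes authn/authz
// (TokenReview / SubjectAccessReview) before anyone can scrape it.
metricsOpts := metricsserver.Options{
	BindAddress:    ":8443",
	SecureServing:  true,
	FilterProvider: filters.WithAuthenticationAndAuthorization,
}
```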
We then have the webhook package, which is responsible for creating those validating and mutating webhooks. There are many videos available; we also did a livestream on Kube Simplify about creating your own validating webhook, and you can definitely take a look at that. I'll put the link in the description. This package helps you declare your validating and mutating webhooks. These packages are the core heart of your operator: without Kubebuilder pulling them together, it would be very, very difficult for you to build an operator. So Kubebuilder is really good at scaffolding your project. When I say scaffolding, I mean it gives you a very good blueprint, a lot of boilerplate code which you can refactor, but to begin with you only focus on your reconciler logic, and for me that's amazing.
Now, here is where main.go imports my api/v1 package, where I am pulling in the custom resource definition which I declared. You remember we had api, then v1, and then the EC2 instance right here; this was our spec of the EC2 instance, and that's what we are importing in main.go. So I import my v1 package under the name computev1, and I also import my actual controller package, which has, or rather will have, my reconciler logic.
Coming forward, we have a couple of variables. The setupLog one is fairly simple: it sets up a logger for our controller. And the scheme that you see here, think of it as a phone book. It's an instantiation of the NewScheme function, and it acts as a registry where you record all of the objects that you want Kubernetes, or rather your operator, to know about. That's what we do in the init function, using utilruntime, which comes from that util runtime package.
And here we have the Must function. What it does is: in case registration returned an error, in case it was not able to register and there was a panic, the program stops right there, because your operator is completely useless if it doesn't know about the core API types like Pod and Deployment, or about your own EC2 instance. So we register the default core API types, and you can look at those using kubectl. Let me increase the font a little bit; we can do kubectl api-resources here. You see? Think of the phone book, which is our scheme: we are adding all of this to our phone book. We are telling our operator, this is everything we have available in our Kubernetes cluster, and then we also add our own custom resources, which we import from api/v1. Essentially, the scheme we declared over here starts as an empty book, and into that empty book, using the AddToScheme function given to us by the client-go scheme package, we add the built-in types, so our operator knows what built-in types are available in Kubernetes, and we also add our custom type, the EC2 instance. Then our registry, the scheme, is a complete catalog, and that's what our operator will be able to use.
Now here's the main function; this is where everything starts for any Go program. We define a couple of variables, for example the metrics address: on which IP and port the metrics endpoint will be listening. Once we have defined these variables, we also define some command-line flags, so when you build this with go build and run the binary, you can pass flags like the metrics address, the probe address, leader election, and so on. So we define the address on which our metrics should be served. We declare some variables for the paths of our metrics certificates, because just like webhooks can be served over a certificate, we can declare that our metrics endpoint needs a TLS config as well, and these variables define the path of the certificate, the name of the certificate, and the key we want to use for our metrics. The same goes for our webhook.
Now, there's a very good concept that operators can help you with, or rather one that matters whenever you run distributed systems like etcd, and especially when you look at the kube-controller-manager. That is also a controller, like the one we are writing; it's a collection of multiple controllers, but it runs as multiple pods in your cluster, one on each control-plane node. The thing is, when you are writing a controller, it is very important to think about how copies of it run in parallel and whether they all make changes or not. For instance, take a case where I was running two copies of my EC2 controller. So this is
one controller, and this is another controller. There was an update: I created an object of the EC2Instance kind, and that update was seen by both of my controllers, controller number one and controller number two. They are both going to go and create an EC2 instance for me, and that is not what I want. I do want high availability, but it should be active-passive: there should be one leader. There can be multiple replicas for high availability, but only one should be acting at any time. That is what leader election gives you, and Kubebuilder makes it very easy: it lets you enable leader election with a simple boolean. In that case both replicas are running, but only one is the leader, so if an update or an event comes from the API server, only the leader sees it and only one instance is created, which is what we want. The other replica is there, but it's not the leader; if the leader is no longer running, the other one automatically becomes the leader and starts serving your requests for EC2 instance custom resource changes. That is what leader election means; you can enable it if you want it and run your operator in high availability. We
define the probe address on which your health probes are available. Remember the healthz package, where you declare your health endpoint and ready endpoint: this is the port number they are exposed on, by default 8081, via the health-probe-bind-address command-line flag, and this is the variable responsible for it. Do you want to use secure metrics or not? This secure-metrics variable and the metrics cert path, name, and key are related, because you can say you want your metrics exposed over TLS; if you do, you then define your metrics certificate path, the certificate name, and the key, and otherwise there's no need for them. You can also say whether your operator enables HTTP/2 or not, and then we have a list of functions that are our TLS options.
I'll explain this more simply as we go ahead. So we declare a couple of variables and a couple of command-line flags. We define some options for our logging, with Development set to true. When you say development, it gives you stack traces on warnings as well and does no sampling. If you go for production, it only gives you stack traces on errors, and it does do sampling for you. So if you are deploying this to production, you should always consider setting Development to false.
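The logging setup just described is only a few lines in the scaffolded main.go; a fragment sketching it:

```go
import (
	"flag"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

// Development: true => stack traces on warnings, no sampling.
// Flip it to false for production builds.
opts := zap.Options{Development: true}
opts.BindFlags(flag.CommandLine) // exposes --zap-log-level, --zap-devel, ...
flag.Parse()
ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))
```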
Now we set up a logger: we pass all of the CLI flags that were given by the user, we define our options for logging, and we create a new logger. Essentially, what we're doing in this line is setting up a new logger with our zap logging options.
Now, this TLS config is basically a list of options you can apply. One option here: if you do not enable HTTP/2, if you are disabling it, you can append that to your TLS options. So we say, in this case I did not enable HTTP/2, so in the TLS options I disable HTTP/2 and only enable HTTP/1.1, because I'm disabling HTTP/2.
Now here's where you create some watchers for your certificates. Remember we talked about these certificates: there could be a certificate for the metrics and a certificate for the webhook, because you can expose both of them over TLS. So essentially what happens is what I already explained: you have cert-manager, cert-manager renews your certificates on disk, and this watcher detects those changes to the certificates and loads them into memory in the current pod, in the current operator. It does not restart the operator; it does it on its own. There's no downtime and zero manual intervention. Otherwise, you would have to restart your operator because your certificate was updated by cert-manager.
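The pattern described above shows up in the scaffolded main.go roughly like this fragment; the file paths are placeholders, and webhookTLSOpts is assumed to be the slice of TLS option functions built earlier:

```go
import (
	"crypto/tls"

	"sigs.k8s.io/controller-runtime/pkg/certwatcher"
)

// Watch the certificate files that cert-manager renews on disk;
// reload them on change, with no restart of the operator.
webhookCertWatcher, err := certwatcher.New(
	"/tmp/k8s-webhook-server/serving-certs/tls.crt", // placeholder path
	"/tmp/k8s-webhook-server/serving-certs/tls.key", // placeholder path
)
if err != nil {
	// We wanted TLS for the webhook but couldn't get it: exit.
}

// Serve whatever certificate the watcher currently holds.
webhookTLSOpts = append(webhookTLSOpts, func(config *tls.Config) {
	config.GetCertificate = webhookCertWatcher.GetCertificate
})
```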
We define our TLS options, which again is a list of functions that each adjust a TLS config, and we instantiate a new variable. It's kind of like creating an alias: by this point the webhook TLS options are just the default TLS options, and we declare a new variable and set them as its value. This way we can customize the TLS configuration for our webhook server, depending on whether we want to use a watcher and certificates or not, so it's easier for us to customize.
Now, if you really gave a webhook certificate path here, that means you want your webhook to actually be served over TLS. So if the length of that variable is greater than zero, we log "initializing webhook certificate watcher" and we will use TLS with that certificate. We define an error variable, and here we create a new watcher for the certificate path and the certificate key. If there was an error, you simply exit with code one, because you wanted a TLS config for your webhook but couldn't get one, so it makes sense to stop right there. And here we append a new option to our webhook TLS options. Up to this point that variable only has one TLS option, disabling HTTP/2, which is what we did above. If you did give a webhook certificate path, we append another option: GetCertificate, the function on the TLS config that returns the certificate we want to use for our webhooks. And here's where we are creating a new webhook server with these TLS options.
A similar thing happens when you are working with the metrics server options. These are the options where we define the bind address on which our metrics are going to be exposed. Do you want to use secure metrics or not, and what are the TLS options? Again, by this point we are just disabling HTTP/2; we don't have any TLS yet, because if you don't set secure metrics, which is a boolean, there are no TLS options: you only work with HTTP/1.1 and disable HTTP/2. But if you did enable secure metrics, you will be using some sort of authentication, so that your metrics endpoint is reachable but not openly accessible; there is authentication and authorization, and only authorized users and service accounts can access your metrics.
So those were the metrics server options we started with: if you did want secure metrics, you get some sort of authentication, and then the same logic we used for the webhook certificate path applies. If you do give a metrics certificate path, you create a watcher, like we did for our webhook; there's a watcher for the metrics certificates, and then we append a TLS option with the certificate information to the metrics server options. Essentially: if the certificate path you gave me is not empty, I am going to run your metrics server with a TLS option that serves that certificate. So don't get confused about what this is doing: if you give the path of your metrics certificate, it exposes your metrics endpoint with the certificate you have given. The same happened above: if you gave a certificate for your webhook, it exposes your validating or mutating webhook with that certificate information.
And here's the one that is quite interesting: this is the manager from controller-runtime, the one I just showed you. This lets us create a manager, and within the manager you can have multiple controllers. It looks something like this: here in my main.go file, this is my operator; in here I have a manager, and within my manager I will have my controller, and I can have multiple controllers in a manager. So main.go is responsible for creating a manager using controller-runtime, and then it is our responsibility to register our controllers with the manager,
and that's essentially what we're doing here. Once we have declared all the variables, given all the flags, defined all of our TLS options, configured whether we want TLS for our webhooks and metrics, and decided whether we want authentication on our metrics or not, once all of that is sorted, we call the NewManager function, which returns us a new manager; this variable holds our manager with all these options. What is the scheme, so our controller knows about all the resources, custom and core, available in Kubernetes? What are our metrics server options: do you want secure metrics or not, what is the port number and IP address your metrics bind to, what is the endpoint, which is /metrics by default, and have you given any certificates? The same goes for our webhook server: is it secured, in the sense of whether you have given certificates to it or not? We then declare the health probe endpoint, the probe address, which was 8081; this is what your liveness and readiness probes will hit on the container when they do a probe.
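All of those options come together in the single ctrl.NewManager call; a fragment sketching it, where the variable names follow the scaffold and the LeaderElectionID string is made up:

```go
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Scheme:                 scheme,               // our "phone book" of known types
	Metrics:                metricsServerOptions, // bind address, TLS, auth filters
	WebhookServer:          webhookServer,        // with its own TLS options
	HealthProbeBindAddress: probeAddr,            // e.g. ":8081"
	LeaderElection:         enableLeaderElection, // active-passive HA
	LeaderElectionID:       "ec2instance.compute.cloud.com", // hypothetical ID
})
if err != nil {
	setupLog.Error(err, "unable to start manager")
	os.Exit(1)
}
```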
And here's the leader election, because when you are creating a manager, the manager needs to know whether you want leader election, and you should definitely enable it when you are building an operator that runs in multiple replicas, in multiple pods: both of them are running, but only one is active at any time. So it is absolutely your responsibility to enable leader election. And then, if you could not create the manager, because NewManager returns the manager and also an error, you log the error that you were unable to start the manager and simply exit, because if you don't have a manager, you don't have anything; you don't have a controller, since the manager is the one that looks after your controllers. If there's no manager, there's no reason to continue: just exit right there. And that's
why we use the OS package. Now once your
manager is created we need to register
our controller which is the the custom
resource which is uh what we need to do
here. So if you were able to create the
manager we are using you know um we from
the EC2 instance reconiler we define the
client and we use the manager.get schema
which tells our manager what is the
schema of our EC2 instance. Essentially
we use a function called setup with
manager and this one sets up our EC2
instance custom resource with the
manager and which is available here in
the EC2 instancecontroller.go file. This
is this is the one uh which is where our
reconciling logic is and where our
reconider logic will be. So it sets up
our controller in here in the main.go
go. At this point once we started our
manager, we set up or we add our um you
know we add our um custom resource or we
add our controller to our manager. So
the manager knows that I have this
particular controller. This is what I
need to listen on to if any changes are
done to this custom resources and this
is what the logic is what I need to run
with uh with the operator.
It is also interesting here that if you had certificate watchers, if that value was not nil, you would add the certificate watcher to the manager for your metrics server and for your webhooks. We do not use certificates right now, and I am not using any webhooks for mutating or validating admission, so I am not going to do anything there; for me it stays empty. Otherwise you would add the certificate watcher to your manager. So the manager holds a couple of things. It has controllers, and you can have more than one. It can have a watcher for the webhook certificates, and a watcher for the metrics certificate, and it watches and renews or reloads the certificates on the fly, so you do not have to restart your operator. From the manager we also get a function for adding health checks, and this is where the health checking is set up: we add two endpoints, one at /healthz and one at /readyz, and that gives you a fairly simple Kubernetes health check that lets you see whether your operator is healthy and whether it is ready. And the manager returned by the NewManager function has another function called Start, and this is the one that starts our manager. It is a bit like a car that has an engine and an airbag mechanism: as soon as you turn the key, the whole thing starts. First the engine starts, and it begins sending power to the other components. Starting this manager is a similar story: the manager starts, then it starts the controllers inside the process, it starts the watchers, and everything comes to life.
Of course, if you were not able to create the manager above, or not able to start it, we simply exit, because without the manager nothing is available.
So that was the whole main.go file. What I also wanted to show you is that we import quite a lot of Go packages here. One of them is compute v1 from our api/v1 directory, and the spec there matches the CRD under config/crd/bases. Whatever you put into your custom resource's spec gets reflected into a resource called a CustomResourceDefinition, and that is what you are declaring: this YAML file gets installed into Kubernetes, and it is a resource that tells Kubernetes about another resource; a definition that tells Kubernetes about your custom resource. So you are telling Kubernetes: I am describing another custom resource to you, and it looks something like this.
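A trimmed-down sketch of what such a CustomResourceDefinition looks like; the field names follow this course's EC2Instance example, and the real file is generated for you by make manifests, so treat this as illustrative only:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: ec2instances.compute.cloud.com
spec:
  group: compute.cloud.com
  names:
    kind: EC2Instance
    listKind: EC2InstanceList
    plural: ec2instances
    singular: ec2instance
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                amiID: {type: string}
                instanceType: {type: string}
                sshKey: {type: string}
                subnet: {type: string}
              required: [amiID, instanceType, sshKey, subnet]
```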
Its version is v1, its kind is EC2Instance, it is a namespace-scoped object, it lives under this particular group, and here is the spec for your EC2 instance. You can see the same one-to-one mapping: we have the AMI ID, the instance type, the SSH key, and the storage, which is given here. Now, at any time while writing the spec for the API, you might want to change something. Imagine you want to add a tag; say you add a department, a simple string you can use for tagging, so that when you create an instance you use this department value to add it as a tag on your EC2 instance. And it is very important that whenever you change your specification, you run the make command, more precisely make manifests, because your CRD is not aware that you just changed the specification. The CRD is still the older one.
Think of it as outdated compared to the spec where we added a new value. Do I have a department here? I do not; I cannot even search for it. Okay. So once you make changes, we run make manifests, and as soon as we do, you see a new department property is added, of type string. We can also add, say, a project, which will be another tag, and again I need to run make manifests; as soon as I do, you can see on the right side that project was added as well. So any time you change your spec, your CRD needs to be updated on disk, which is make manifests, and you also need to update the CRD inside Kubernetes, because the flow looks something like this.
Here is you, here is the spec, and you make changes to the spec. All of this is happening on your computer, right here. So you changed the spec, and your CRD changed on disk. However, the CRD does not only need to be updated on the computer where you are developing. You also have this Kubernetes cluster, which is where you need to have the CRD as well, and from which you can then create a custom resource. We talked about this: first the CRD, and from that you create a custom resource.
Now you made the change and it is updated here; it is version two of the CRD, but you are still using the older one, version one. So you can run make manifests from the Makefile that Kubebuilder gives you, and it updates the CRD on disk. And if you are pointing at the right Kubernetes cluster through your kubeconfig (the environment variable), you can then run make manifests and make install, which applies the same CRD that was generated from the spec change all the way to your Kubernetes cluster as well, so they are always in sync. Otherwise you end up thinking: I made the changes to my spec, my CRD is updated, but when I try to use this new change, say a new field called project, it says there is no field called project, even though I see it right here. That is probably because you did not update your CRD in the cluster; you only updated it on your disk, and that is not going to cut it. So whenever I make changes to my spec, I run make manifests and then make install: I update the CRD on disk and I also install it. Make sure you are connected to the right cluster, because otherwise, if it is a different cluster, the custom resource definition either gets installed there if it does not exist, or gets updated if it does, and you could be introducing breaking changes. So be very careful when you do this.
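That habit can be sketched as commands, assuming the standard Makefile targets that Kubebuilder scaffolds and a kubeconfig pointing at the intended cluster:

```
# Regenerate the CRD YAML on disk from the Go spec types
make manifests

# Double-check which cluster the kubeconfig currently points at
kubectl config current-context

# Apply the regenerated CRDs to that cluster so disk and cluster stay in sync
make install
```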
All right, that was the whole explanation of main.go, which is probably a file you will not use or change a lot, but it is absolutely important to know all these options: what the webhook certificate watcher does, and why so many packages are involved. You can expose your metrics endpoint securely, and when I say securely I mean with authentication, and optionally with TLS; both are optional, and the same goes for the webhook endpoints. It is also important to know that you can enable leader election, and that this is the main function where your operator starts.
So now that you have a good idea of main.go, the file that starts everything, let's see how the reconciler works; let's see the reconciler in action. We will make changes to some custom resources, see how our operator receives those changes, and then see what we can do on top of that. This is the foundation we will lay for creating our operator, which reacts to changes on the EC2 instance object, and then we will move ahead from there. Okay. So whenever you want to write your own custom operator, the first thing you need to ask yourself is: what kind of resource are you going to manage? In our case, it is going to be an EC2 instance over on Amazon. We are building a custom operator that goes to Amazon on our behalf and creates an EC2 instance for you. It looks something like this: you have your Kubernetes cluster, in which your operator is running, and there is a human, a certain someone, who gives you a YAML file, because we talk to Kubernetes via YAML. The interesting thing about this YAML is the kind you have declared, and the API version you have declared using Kubebuilder, which is compute.cloud.com/v1 for this API resource, and the kind in this case is EC2Instance. In the end, you give this YAML to the Kubernetes API server, and it knows about the EC2 instance because we will have deployed our custom resource definitions to Kubernetes.
On this resource change, say you want to create a resource of this kind, the controller will see the change, get the data from the API server, and it is the one responsible for going to Amazon and creating your EC2 instance. It is the one responsible for authenticating with EC2, and for providing the minimal set of inputs you need to give Amazon when you want to create an instance. That could be the instance type, which is absolutely required; it could be a security group, which I believe is also required; and you would definitely need to say how much storage your machine needs. Some things are required and some are optional. Tags, for instance, are completely optional; you can provide them or not, it is up to you. So when you write an operator and a custom resource like this, it is on you to settle on at least the minimum required inputs you want to send to Amazon. That is what you are doing when you design your spec, because the YAML you are given has a kind and an API version, then a spec, and then a status. The spec here matches exactly what you put in your YAML, just as with other resources. So in this case your YAML would have something like amiID (that is the name of the key), then sshKey, instanceType, and subnet. We use these JSON tags so that Kubernetes can unmarshal the request: it knows that this particular thing you are giving it is called an AMI ID, and what to do with that key in the request coming to the API server.
We might want to extend this; say, for example, storage. I also have an option called tags in my YAML, which is going to be a map of string to string, and you can also create your own custom struct types. For example, if I add storage with a custom object here: Go knows what a string is, what an integer is, what a boolean is, but it does not know this embedded storage config type, and that is the problem; it says it is undefined. So you define another struct, type StorageConfig, and there you can give the size of the volume (and that is what I love about these AI editors), the type of the volume you want (Amazon has different volume types there), and, if you want, a device name. I do not want that; the only things I would like are a size and the volume type, one of the Amazon-provided ones. You can then also have additional storage: in this case one entry is the root disk, and then you can have additional devices, and this is where omitempty comes into the picture. It is very handy, and the same can be done for our tags, because sometimes these options are optional: a YAML that creates an EC2 instance may or may not have tags or additional storage, but you absolutely need the instance type and you absolutely need the AMI ID. Marking those omitempty would be wrong, because they have to be required fields in your YAML manifest. So when you build your spec, you choose which fields are which. In this case, additionalStorage is a list of StorageConfig, so you can add extra storage configurations; in my case, I will keep additional storage simple, just a type and a size.
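To make the omitempty behavior concrete, here is a small, dependency-free sketch of such spec types. The field names mirror this course's example, but the real types live in api/v1 and also embed the Kubernetes metadata, so this is an illustration, not the project's actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// StorageConfig is a custom embedded type, like the one defined in the video.
type StorageConfig struct {
	SizeGiB    int    `json:"sizeGiB"`
	VolumeType string `json:"volumeType"`
}

// EC2InstanceSpec: amiID/instanceType/sshKey/subnet stay required (no
// omitempty), while tags and additionalStorage are optional and disappear
// from the serialized JSON when they are empty.
type EC2InstanceSpec struct {
	AMIID             string            `json:"amiID"`
	InstanceType      string            `json:"instanceType"`
	SSHKey            string            `json:"sshKey"`
	Subnet            string            `json:"subnet"`
	Tags              map[string]string `json:"tags,omitempty"`
	AdditionalStorage []StorageConfig   `json:"additionalStorage,omitempty"`
}

// specJSON marshals a spec so we can see which keys survive serialization.
func specJSON(s EC2InstanceSpec) string {
	b, _ := json.Marshal(s)
	return string(b)
}

func main() {
	minimal := EC2InstanceSpec{AMIID: "ami-0abc", InstanceType: "t3.micro", SSHKey: "dev-key", Subnet: "subnet-1"}
	fmt.Println(specJSON(minimal)) // tags and additionalStorage are omitted entirely

	withExtras := minimal
	withExtras.Tags = map[string]string{"department": "platform"}
	fmt.Println(specJSON(withExtras)) // now a "tags" key appears in the output
}
```

The two Println lines show the contrast: the minimal spec serializes without tags or additionalStorage keys at all, while the required fields always appear, which is exactly the distinction the CRD's required list encodes.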
The same thing happens for your status. When you run kubectl get on the resource with -o yaml, what you see under status is the phase it is in, the instance ID, and the public IP; this is the information you get back from Amazon. Imagine a developer gives you a YAML; say you are building an internal developer platform and you want developers to query the resource they created. They run kubectl create -f with an EC2 instance, and when they later run kubectl get ec2instance, you need to give them some sort of information. Probably the first thing you want the user to know is the state: whether the instance failed, is running, is pending, whatever state it is in. Then you probably want to give them the public IP of the instance, if your organization allows instances to have a public IP. The only place you can get this information is Amazon. So when you create your instance, you poll to see whether it is running within a certain time; if it is not, you fail the operation. Otherwise, you get back some information, and for us that is the state of the instance and the public IP. Many other things come back from the create-instance operation, but that is not what we care about; we want the user to see, in their status, the phase, which is a string, the instance ID, also a string, and the public IP, also a string in our case. Now, whenever we make changes to our API spec, I told you it is absolutely important to run the make command from the root of this Kubebuilder project so that it generates the custom resource definition for you.
All right. In api/v1 we have the EC2 instance types file, and the actual custom resource definition that gets created is under config/crd/bases, in the compute.cloud.com EC2 instance file. When you make changes to your spec, like here, and run the make command, the Kubebuilder tooling knows how to write the custom resource definition as boilerplate. This is where you define the group for your resource, the kind, and then the version. Notice that for a single kind you can have multiple versions, because what you see is a list of available versions: you could have compute.cloud.com/v1, to which this schema applies, and compute.cloud.com/v2, to which another version of the schema applies. This is why you may have seen that a particular key is only available in a newer version of your YAML, or only in an older version because it was removed in version two. The important part is the spec in here: we have our properties, an AMI ID, our instance type, our SSH key, and our subnet, and these are all required because we did not add omitempty. But because we made changes to our spec, I absolutely have to regenerate these manifests, and for that I can simply run make manifests.
What that does is update your custom resource definition with a few more parameters. For example, one of them is additionalStorage: it was not there before, but now it is. There is also a new option called storage, and another one called tags, which is a map of strings. Every time, it also updates the required list, because some fields do not have omitempty and are absolutely needed. This is how Kubernetes knows that a particular key is not present in the YAML and that it has to complain about it: it cannot accept the user's request, because the custom resource definition marks that key in the YAML as required and the user has not provided it.
This is what you give to your Kubernetes cluster. Before you can create an EC2 instance, before you can do anything with the operator, the first thing you need to do is hand this to Kubernetes, because if you do not, then when a developer creates a YAML of kind EC2Instance with that API version, Kubernetes has no idea what resource the user is talking about: what is this group called compute.cloud.com, and in version one I do not have a resource called EC2Instance. You can either kubectl apply this custom resource definition YAML yourself, or use the make install command with the Makefile, which does it for you: you can see we use kustomize to build our custom resource definition and then apply it with kubectl apply -f on standard input. And with that, we now have our custom resource definition.
First, let me run make uninstall and see what happens. If I run kubectl get ec2instances.compute.cloud.com, or simply ask how many EC2 instances I have, Kubernetes says it does not have that resource. But if I then deploy the custom resource definition again, it no longer says it does not know what that resource is: if instances exist, it shows them to the user, and if none exist, it just says there are none found, instead of saying it has no idea what resource you are talking about. So that is what happens when you run make install: it creates your custom resource definition, and you can run a get on your CRDs and see the custom resource definition I have, which you can also inspect directly; it is the same thing I just showed you in Cursor, only here in the terminal.
So that is how you would update or create the custom resource definitions. In my case, what would the YAML look like? I would probably ask my AI: take this spec and give me an updated YAML for this resource, and it will just spit out what the YAML would look like. I am just going to accept that. Here we go: this is how your YAML would look. It is of kind EC2Instance, and you can see the spec fields: it tells you there is an AMI ID, the SSH key, and the instance type. When you give this to your developers, you might want to make it a lot simpler, or at least rename things like instanceType to something like vmPreset, if they are more familiar with those words. The SSH key makes total sense as it is. That would have mattered more in other examples, but in this case the YAML is perfectly fine.
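For instance, an EC2Instance manifest matching this spec could look like the following; all values are placeholders:

```yaml
apiVersion: compute.cloud.com/v1
kind: EC2Instance
metadata:
  name: dev-box
  namespace: default
spec:
  amiID: ami-0abcdef1234567890
  instanceType: t3.micro
  sshKey: dev-key
  subnet: subnet-0123456789abcdef0
  tags:
    department: platform
  storage:
    sizeGiB: 20
    volumeType: gp3
```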
So, that was what I wanted to show you as YAML. The next thing to do, once you have your YAML and everything else defined, is to look into the reconcile loop. By this time, Kubernetes knows it has a custom resource under our compute group, and that in version one there is a resource called EC2Instance. Now, if someone gives it an EC2 instance, what should it do? If someone gives it a YAML saying please create me an EC2 instance, what happens, and how does it react? This is where we look into our reconciler, so let's get started and see how we will build one.
Our reconcile loop looks something like this; it is under internal/controller/ec2instance_controller.go. This is where the magic happens: whenever you make changes to your custom resource, this is the place the change comes to, and this is where you provide the logic that operates on the resource that was changed, your custom resource. It is in the controller package, and it imports quite a few things. One of them is controller-runtime, which is absolutely important: it handles the runtime of your controller. You can also see that it pulls in our own EC2 instance types: it goes to github.com, to this operator repo, api/v1, and imports it as compute v1. Essentially, your controller needs access to the spec of your custom resource. It could also simply have referenced api/v1 locally, but I think this is better because I already have it on GitHub. So if I go to github.com and open my repository (let me go to GitHub properly; I should have copied the link, otherwise it takes me to the wrong repo), here you can see api/v1, and this is where it looks for the EC2Instance type. This is where our code is, and it is what the Kubernetes operator and the controller use to map your request onto a known data type, the known spec and status of the custom resource. It is the heart of the object you are creating.
The file creates a reconciler, which uses the client from controller-runtime to help you communicate with the Kubernetes API server, and a Scheme object that registers your schema. There are also a couple of Kubebuilder markers; these generate the RBAC rules so that you can work with the custom resources, because when you run an operator, it runs in its own namespace, but if a custom resource is created in a different namespace, the operator needs access to see into that namespace as well, and this is where the RBAC is extremely helpful. Now, what can we do with this? This is the Reconcile function, the one all of your requests are going to land in: whenever you run kubectl apply and change the resource, this is where the change gets looked at. The function returns two things: first, the result of the reconciliation, and second, an error, if there was one; if there was no error, it simply returns nil. Now, this is the beauty of Kubernetes self-healing. You know how, if you create a pod that has a persistent volume claim, but that PVC is not yet bound to a PV, the pod stays in a pending state? It keeps on being pending, but as soon as you create a PV and it is bound to the persistent volume claim, the pod starts automatically, because there was a requeue going on for that particular pod. The reconciliation logic for the pod kept checking whether the requirements were fulfilled; if they were not, it returned an error and put the pod back in the queue to reconcile. This is the beauty of self-healing: it will be done eventually, once all the conditions are met, and you do not have to trigger another run of the reconcile loop yourself; Kubernetes does it for you. And this is where you will provide your logic. The very first thing you will do is operate on that instance.
For example, when you say that users can create an EC2 instance of their chosen type, you want the user to give the name of the instance; you might want them to give the tags of the instance; you might want them to give the storage config. You need to extract this information from the API server request that came to the reconcile loop, so that you can use it to talk to Amazon, in our case, because this is a cloud operator. So you need to get all the fields the user provided in this YAML into variables, so you can work on top of them. This is very important: the user is creating a resource of kind EC2Instance, so you also need a variable of kind EC2Instance, so that you can use the Kubernetes schema to store the actual keys in your variable. Think of it like this: the user is sending a circle, so you need a mold that can hold a circle. If the user is sending a triangle, you need a mold that can hold a triangle. If a user is sending int data, you need a variable of type int. The same thing happens here: the user is sending data of kind EC2Instance, so you need a variable of kind EC2Instance.
Let's declare that first. The first thing I do is create an EC2 instance object using compute v1: that is where the EC2Instance type is declared, and you can see it is essentially what we created in our types.go. It has a spec and a status, but fundamentally it is the root of our Kubernetes object: the EC2Instance has type metadata, object metadata, the spec, and then the status. That is the type of the variable we create in our reconciler. Then, because the controller listens for any changes made to your custom resource, you can act on them. So you create a variable of type EC2Instance (I will keep the name simple), and then we can use the Get function. What Get does is use the context of your request and, more importantly, the namespace and name under which this resource was changed (I am not saying created; the namespace of the resource where the update happened), and the actual in-flight YAML, the actual content, is then stored in this EC2 instance object. Think of it as taking the YAML from the user and giving it to your reconciler: now it knows the namespace in which this object was updated, the instance type this YAML had, the storage type it had, the tags the user wanted, and you can build on top of that. There is also a logger we can use; here you see a log function, and with log.Info I can log all of my requests. So: I create an object of type EC2Instance, I fill it from the in-flight request, and I log "reconciling EC2 instance", and you can see I can include the name of that particular resource. Rather than Info, let's just print it for now.
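Put together, the fetch-and-log step described so far looks roughly like this sketch. It assumes the Kubebuilder-generated project (EC2InstanceReconciler, the compute v1 types, and controller-runtime on the import path, with log being sigs.k8s.io/controller-runtime/pkg/log), so it compiles only inside that module:

```go
// Sketch of the Reconcile method discussed above; not the course's exact code.
func (r *EC2InstanceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)

	// The "mold": a variable of the same kind the user is sending.
	var ec2Instance computev1.EC2Instance

	// Fill the mold with the object that triggered this reconcile, looked up
	// by the namespace/name carried in the request.
	if err := r.Get(ctx, req.NamespacedName, &ec2Instance); err != nil {
		// The object may already be gone; in that case there is nothing to do.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	logger.Info("reconciling EC2 instance",
		"namespace", req.Namespace,
		"name", ec2Instance.Name,
		"instanceType", ec2Instance.Spec.InstanceType)

	// We did not modify the object: no error, and the zero-value Result
	// means "do not requeue".
	return ctrl.Result{}, nil
}
```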
So I create my EC2 instance variable and then print: I got a request for an EC2 instance in this namespace, and the EC2 instance is this one. You could also do an fmt.Println of the entire spec; I do not want to print the whole spec, so I will just print the namespace and the instance name. You can see all the options the editor suggests here. Let me do that again; I just want to see the data that was given to me, so I print: I got a request for an EC2 instance in this namespace, and the EC2 instance name is ec2Instance.Name.
Uh and I would say then the instance
FMT.
LM instance
type is
uh EC2 instance.spec. And you see this
is this is the beauty of uh the AI
editors again. Now you can see I am able
to get all the information which was
sent which was you know watched by my
reconciler by my operator under E2
instance spec and you can see AMI ID SSH
key subnet tag store regation storage
this is essentially what you were
building in the spec of your custom
resource this is a onetoone mapping that
is why we created a variable of type EC2
instance and then we got the you know
the inflight request that we received
from the API server and then I'm saying
I got a request for blah blah blah the
only thing I do not have: see, I have the instance ID, the AMI ID, SSH key, subnet, tags, everything, but I don't have a name for my instance. Actually, this is the name of the Kubernetes object that I'm giving, but maybe I want the user to also give the name of the instance, so I would simply add instanceName, because the instance name could be different from the Kubernetes object's metadata name in the YAML that you give; they could be different. And here I can say my name would be in the spec: spec.instanceName (a capital Spec in the Go code). Now the important thing is, I just added another field to my spec, and the custom resource definition that is currently in Kubernetes has no idea about this new instance name. So I have to do my magic again: I run make generate, make manifests, and then make install, so that my Kubernetes is updated that there's a new key in the struct for the spec called instanceName. And that's it.
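To make the shape of this concrete, here is a minimal sketch of what the reconciler is printing. The struct types are simplified stand-ins for the generated API types (field names like InstanceName mirror the spec we just built; the real function receives a ctrl.Request and fills the object via the client's Get call):

```go
package main

import "fmt"

// Simplified stand-ins for the generated API types (assumption:
// field names mirror the EC2InstanceSpec we built earlier).
type EC2InstanceSpec struct {
	InstanceName string
	InstanceType string
	AMIID        string
}

type EC2Instance struct {
	Namespace string
	Name      string
	Spec      EC2InstanceSpec
}

// printRequest mirrors what the reconciler logs: the namespace of the
// request plus fields read from the decoded spec.
func printRequest(inst EC2Instance) string {
	return fmt.Sprintf(
		"I got a request for an EC2 instance in namespace %s; instance name: %s; type: %s",
		inst.Namespace, inst.Spec.InstanceName, inst.Spec.InstanceType)
}

func main() {
	inst := EC2Instance{
		Namespace: "default",
		Name:      "my-ec2-instance",
		Spec: EC2InstanceSpec{
			InstanceName: "web-server-1",
			InstanceType: "t3.medium",
			AMIID:        "ami-12345",
		},
	}
	fmt.Println(printRequest(inst))
}
```

The point is the one-to-one mapping: whatever keys the user writes under spec in the YAML arrive as fields on the decoded Go struct.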
Now, once you get the data, you iterate on top of it. In my case I'm just printing it right now, but as we move forward we will use this data to create an actual EC2 instance; that is where your real business logic goes. Then, once you have used the data (in my case I'm not changing anything in the object: a resource was created, I got information about it, but I'm not updating that resource), you return a result. If you look at controller-runtime's Go docs, this Result contains two things, including whether to requeue or not, and that defaults to false. This is very important. When you exit your reconciler function, you need to tell Kubernetes two things: whether there was an error in the reconciler function, and whether there is a requirement to rerun this reconcile loop. Remember, we talked about this in the previous part of the video: you only rerun the reconcile loop if you have updated the API object. We are not doing that right now, so we do not need to send any requeue boolean; by default it is false. So in our case we don't have an error, and we are also sending a false for the requeue, so
this reconcile loop will not run again.
It's kind of like this: when you enter the reconcile loop, you have the reconcile function, and the request came over to this function. You applied your change, your business logic, whatever you wanted to do. In our case, I'm just printing things; I'm not creating an EC2 instance, and I'm not updating my custom resource with the status of the EC2 instance creation. So the question here is: was a change made?
So in my case, did I make a change to the custom resource whose request came to me, which is the EC2 instance? I would say no (in other cases you could say yes). If you have no changes, you simply return nil for the error and false for the requeue. If you did make some changes, then you have to return true for the requeue, and if there was an error, you return the error; if the error was nil, you return nil. That part we will talk about shortly. But for now, I'm not changing anything in the EC2 instance object, so I'm just returning a false.
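In Go, the tail of the reconciler then looks roughly like this sketch (Result here is a simplified stand-in for controller-runtime's ctrl.Result, whose Requeue field likewise defaults to false):

```go
package main

import "fmt"

// Result is a simplified stand-in for controller-runtime's ctrl.Result
// (the real type also carries a RequeueAfter duration).
type Result struct {
	Requeue bool
}

// reconcile sketches the exit of our loop: we only printed the spec and
// changed nothing in the object, so we return an empty Result and nil.
func reconcile() (Result, error) {
	// ... business logic would go here ...
	return Result{}, nil // no error, no requeue: wait for the next event
}

func main() {
	res, err := reconcile()
	fmt.Println(res.Requeue, err) // false <nil>
}
```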
Now at this point, guys, let me add a log line here that says: reconciling EC2 instance, plus the name, and this is the name of my instance. Now
let's get a YAML and see how this will function. Now is the time we run our operator against Kubernetes. We could build a container image, push it to a registry, and pull it from there. But the good thing about using Kubebuilder is that when you have a working development environment and a kubeconfig that points to your Kubernetes cluster, you can just run the main program locally, and it behaves as if it were running inside your Kubernetes cluster. And I will again call my trusted AI to use the spec and give me a dummy YAML so we can create that.
This is the spec. Let me quickly grab it, and then I tell the AI: please undo everything, I don't need that change. Cool. So, do I have a folder called kubernetes? No. Let me make an example folder: example/instance.yaml. And there is our spec.
Before the spec we have an apiVersion, then a kind, then metadata, and then you see we have the spec. The apiVersion is compute.cloud.com/v1, the kind is EC2Instance, and the metadata holds the name of the Kubernetes object for our EC2 instance.
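Put together, a minimal manifest along those lines might look like this (the apiVersion and kind are the ones shown on screen; the spec field names are assumptions mirroring the spec we defined):

```yaml
apiVersion: compute.cloud.com/v1
kind: EC2Instance
metadata:
  name: my-ec2-instance
spec:
  instanceName: web-server-1
  instanceType: t3.medium
  amiId: ami-12345
```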
And there we go. This is simply what we have, and then I would say: let's run our operator.
Now we can do go run cmd/main.go, because in the cmd folder you have your main program. This is the entry point; in any Go project, your entry point is always the main Go file. This is the one that registers your scheme with Kubernetes, the one that creates a client so you can talk to Kubernetes, and it registers some booleans, some flags if you will. We will clean this up later because we don't need a lot of that; we have already gone through this code. The most important thing it does is start the manager. Enable, enable, enable... but I think we did see somewhere that it was starting the manager. Wait a second, this is the NewManager call; that just creates the manager. Where was that?
Ah, here. So we're going to have a log of "starting manager", because we did not work with webhooks, and we don't have any readiness check or liveness check, nothing. So we should just see "starting manager", and then we will see whether we get any request to our controller. So here I export the kubeconfig. Let me increase the font a little bit, and here I run my main function. If the font is a bit small, please bear with me; I hope it is readable. Essentially, I'm running the main function now, so we will be running our operator.
Do I have any EC2 instances? No. Do I have them in any namespace? No. How does our example look? I need to go to the operator folder here, and then I can do kubectl create -f example/instance.yaml with a dry run, to see whether our YAML is good. And there you can see the YAML was fine.
Then I simply run my program first, and this is how your Go code will be running. You see, this is all what Kubebuilder does for you: you do not have to set up your authentication with the API server, and you do not have to figure out how to run your controllers, how to run the multiple reconcile loops you have for different API versions. It does that for you. It starts an event source, which is kind of like the listener for your object in Kubernetes, and here it starts a worker: there is a controller for EC2Instance, with this group and this kind. So you have one controller for one resource; it is a one-to-one mapping. You can run multiple instances of that controller, and in that case you would use leader election, because if one instance is handling the requests for that custom resource, the others should not. In our case we only have one replica, but we still have one controller per object. If I were creating more custom resources: say right now I have an EC2Instance, that is my custom resource, and for that I have a controller, which you can see is also called EC2Instance. If I were to create another custom resource, say a StorageBucket (maybe I want users to be able to create buckets in my Amazon account very easily), there would be another controller for StorageBucket. They could be running in the same manager, in the same manager or operator pod. This is where you can review the earlier part of the video where we talked about what is inside an operator: there is a manager, and within the manager you have multiple controllers, but each controller maps one-to-one to its object. Think of it like this: if anything happens to this resource, this code will run; if anything happens to that resource, that particular code will run.
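As a toy illustration of that one-to-one mapping, here is a mock dispatch table. This is not the real controller-runtime API (there, each Reconciler is wired to one kind via SetupWithManager); it only shows the routing idea:

```go
package main

import "fmt"

// A mock of the mapping a manager holds: one reconcile function per
// kind. In real controller-runtime code, each controller is registered
// for exactly one resource type.
var reconcilers = map[string]func(name string) string{
	"EC2Instance":   func(name string) string { return "EC2 logic for " + name },
	"StorageBucket": func(name string) string { return "bucket logic for " + name },
}

// dispatch routes an event for a kind to that kind's controller logic.
func dispatch(kind, name string) string {
	if fn, ok := reconcilers[kind]; ok {
		return fn(name)
	}
	return "no controller watches " + kind
}

func main() {
	fmt.Println(dispatch("EC2Instance", "my-ec2-instance"))
	fmt.Println(dispatch("StorageBucket", "my-bucket"))
}
```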
Now is the moment of truth. Would something happen if I simply say: please create me an instance.yaml? I should see something here; that's what I am most interested in. So let's create that. Of course it's invalid. Oh, there you go: it says the kind is invalid, it must be EC2Instance. Of course, in the client-side dry run my sample kind was wrong.
And then if I do a create again, you see: there is my request. I know that the instance name is my EC2 instance. It's not that Kubernetes knows about this; it's our operator that knows about it. So Kubebuilder started the worker, and this, from here till here, is our code. We get the log "reconciling EC2 instance" plus the name, and then all of our program executes: I got a request for an EC2 instance in the namespace. You see it gives you the namespace, default, and the object name as well,
which is req.Namespace. This is telling you the namespace as well as the name of the object that you have. "The EC2 instance name is...": this is now reading the spec, and you can see the tags. It gives you a map of environment: dev, owner: alice, which is exactly what your example YAML looks like. So essentially, whatever the user gave, my program (our operator, our controller, most importantly) knows about it. My storage would be size 50 and type gp2; right now it's just printing that as an object map, but we can do better for storage. Let's make some changes: I want to print "storage size is 50 and type is gp2". So in the format string I print the storage size and the type from the spec. You can obviously access any object that was in your spec like this, spec.storage.size, because that is how you address the YAML fields, so I would say Spec.Storage.Size, and the same for the type.
You can also do a delete. Now, see, this is very, very important. This bit executed when we created the resource. When I delete that resource, you see my reconciliation loop started again from the very beginning. This is absolutely important.
Whenever you make any changes to your object, the reconciler starts from the very beginning. It does not know whether you created the resource, deleted the resource, or updated some metadata annotation. It makes no distinction about what you did; it only knows that an update happened. And this is where it is your duty, as someone writing the operator, someone writing the controller logic, to handle that: your reconciliation loop can run many times, but if no change was required, no change is actually made to your object. In this case you can see that, because the resource was deleted, we don't have any EC2 instance name, we don't have any instance type, nothing, but the loop still ran completely,
and here it says reconciled EC2 instance, and so on. To make it more evident, let me get rid of all these extra prints, because I want to keep it simple. Instead I would say fmt.Println: "an update was made to the EC2 instance resource". I'm not printing the name or anything; I'm just saying that an update was made and that is why I am reconciling.
Now I will run this again. I stop my program, and this is the beauty of stopping the program when you are building with Kubebuilder: it has a graceful shutdown. It doesn't just stop the program abruptly; it cleanly shuts down your manager. Because I made some logic changes, I start main.go again, and then I do the kubectl create again. Here you can see it says: an update was made to the EC2 instance resource, and this is why I am reconciling it. That's the main logic here. And then I got the instance type, which is t3.medium. If I now make
some changes to this EC2 instance, say I want to add a label to the metadata: I add labels with hello: world, save and exit. You see, I got another line. It's not that I created the object; I only updated it. I was not creating that resource; I just edited it, and that was only a simple metadata change, the labels, but my code ran again from the very beginning. What if I add some annotations to my object? Let's go to the annotations and add hello-again: world. You see, the whole reconcile loop runs again. The thing I'm trying to tell you is: whenever you make any changes to your object, to your custom resource, the whole reconciliation loop always runs. What if I remove the label I added, or remove the annotation? Same thing again; you see it running again. Kubernetes does not differentiate whether it was a metadata change or a spec change. It simply goes ahead and says: okay, you changed the resource, and this is the update.
This is why, when you make changes to, say, your instance name or instance type, the reconciler sees them. This is the beauty of how a reconciler works. Whenever you make a change, say you change the instance type from t3.medium to t2.micro, the reconciler has no state. It does not remember that before it was t3.medium and now the user has asked for t2.micro; it does not remember the past request. It only knows the state right now. I mean, the previous value is in etcd: when you change your instance type from t3.medium to t2.micro, the "before" is stored in etcd, that is correct.
But the reconcile loop that runs will have no idea that the user previously asked for t3.medium; it is completely stateless. What the reconcile loop does is take your request and run your logic, the logic you would add that allows the user to change the instance type in the spec, or maybe dynamically change the tags they want to give. So if the user has updated the type key in the YAML of the EC2 resource, it is your responsibility to go to Amazon, see whether a t3.medium instance exists, and if it does, delete it and create a t2.micro, because you can't change the instance type in place, as far as I remember; if you can, that's on the Amazon side and a different story. The point I'm making is that your operator, your controller, the reconcile loop, will not remember the past request. It always has to check: the current state is t3.medium, the desired state is t3.medium, nothing needs to be done. But if the current state in the cluster is t3.medium and the desired state is t2.micro, it goes to Amazon: okay, this needs to go away and that needs to come up. This is how you get self-healing, or eventual consistency, and then you update the object, which we will see in the next sessions.
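That stateless compare-and-act idea can be sketched like this (plan is a hypothetical helper; the real logic would call the AWS API to look up the live instance rather than receive the current type as a string):

```go
package main

import "fmt"

// plan compares the current instance type (what exists in AWS) with the
// desired one (what the spec asks for) and returns the actions to take.
// It is stateless: it never consults any previous request.
func plan(current, desired string) []string {
	switch {
	case current == desired:
		return nil // already converged, nothing to do
	case current == "":
		return []string{"create " + desired} // nothing exists yet
	default:
		// The instance type can't be changed in place here, so replace it.
		return []string{"delete " + current, "create " + desired}
	}
}

func main() {
	fmt.Println(plan("t3.medium", "t3.medium")) // []
	fmt.Println(plan("", "t3.medium"))          // [create t3.medium]
	fmt.Println(plan("t3.medium", "t2.micro"))  // [delete t3.medium create t2.micro]
}
```

Every invocation recomputes the diff from scratch, which is exactly why the loop can run any number of times without side effects once the states match.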
So this is how you build an operator that knows how to watch the API server for your custom resource changes and run the reconciliation logic whenever the object changes. This gives you a very good idea, not quite a beginner idea, but a good enough idea, for you to build your own operators and run them on Kubernetes.
The next thing we are going to learn (I already have it available) is how to write an operator that actually creates an EC2 instance for us. The next parts of this video are going to be more about how to use the AWS SDK in Golang to create an EC2 instance on Amazon, because from the Kubernetes point of view, from the operator point of view, we now know how to write an operator, how to write the spec, how to install the custom resource definition, and how to react to changes in our custom resource. Now it's about what you do with that change. In my case, I'm just printing it. In the actual course, we will use these changes and build on top of Amazon: we will create an EC2 instance. Up to this point you know how to write your operator, you know you can get requests, and you know the reconcile loop does that for you. So in the next part we will use the AWS SDK in Golang to create an EC2 instance, and then we will see that if a request was successful we don't need to reconcile again; we will also talk about finalizers, but that is all coming up. So let's look at how we can create EC2 instances with our operator using Golang.
Okay, before we can actually get started with the code, there is something that is absolutely important for you to understand. We have been working with the reconcile loop, and we talked about how the reconciler is the one that takes your request and runs it through your logic; that's where you make your changes so that the current state becomes equal to the desired state. However, this reconciler is expected to return two values: one of them is the Result, and the other is an error (or nil). These two return values are what Kubernetes needs in order to know what should be done with your current reconcile request. So imagine your reconciler got a request, and you made some changes to your environment, to the resources you need to change; then you have to tell Kubernetes whether you want to rerun the reconciler for the same request or just wait for new requests.
In this case, you did not return any error. Based on these values of the result and the error, Kubernetes decides: do I need to rerun your existing request through the reconciler again? And this is how we get things like self-healing.
If you want, you can give this a try. Get yourself a pod that is in a Pending state because of CPU or memory: ask for resources that are not available in your cluster, and the pod will stay Pending. Then go ahead and add a new node that is able to host that particular pod. Once that node is active and available, the pod goes from Pending into Running. You didn't have to do anything; you didn't have to tell Kubernetes: hey, I got a new node, please put my pending pod on it. It doesn't work that way; it is self-healing. The first time Kubernetes tries to place your pod on a node, it says: okay, there is no node available, I'm going to put this in a Pending state. Think of this as a reconciler: the decision was made to put the pod in the Pending state.
And the controller responsible for scheduling your pod returns an error, essentially saying: for the request that came to me, I was not able to process it properly. This is how Kubernetes knows it has to retry that request, and that is how self-healing works. While Kubernetes was retrying and retrying with an exponential backoff, you happened to add a new node, and when the logic ran again, the pod was no longer unschedulable. The reconciler said: okay, you asked for eight CPUs, and I now have a node with 20 cores available; I'm not returning an error, I'm returning nil, and the pod was scheduled and eventually went into a Running state. This is something that
Kubernetes does for you. And as the developer of this reconciler, it is absolutely your responsibility to tell Kubernetes whether your reconcile function was okay, whether you got an error, and whether you would like Kubernetes to retry that particular request. This could happen for EC2 instances: imagine your reconcile function was calling the AWS API to create an EC2 instance and you were not able to. You had the right credentials and the right IAM access for the user you are using, but maybe there was a network timeout, or anything else that stopped your request from being processed. You would like to retry, right? Maybe after 10 seconds or 20 seconds, or whatever your interval is. In this case, you tell Kubernetes there was an error: my reconcile function is returning an error, please retry. And based on the result and the error values, Kubernetes decides: do I retry this particular request, or do I wait for new events, new updates to the custom resource this reconciler is listening on? So there is a very simple set of conditions that your reconciler can return, and they are in priority order.
First: if your reconciler is returning an error, meaning the error is present, the Result is completely ignored. Whatever you send in the Result is ignored, and Kubernetes retries with an exponential backoff. A little about the Result: what you actually send in it is two things. First, whether you want to requeue or not, which is a boolean. Second, a duration for the requeue: RequeueAfter.
If there is an error present from your reconciler, the Result is completely ignored and Kubernetes will always retry. It says: okay, the reconciler is giving me an error, which means it could not properly process the request that came in, so I will retry it. This is where the self-healing loop comes into the picture. The second case: you are not returning any error (you think everything is fine, you have processed your request) and you do set a custom RequeueAfter. This RequeueAfter is the time after which your reconciler should run again, and it behaves like a forever-running loop.
So imagine this: you create an instance, and it's okay, but you probably want to check on the instance every 10 or 20 seconds; maybe you are doing some sort of drift detection. If you looked at your instance and everything was fine, you have no errors for it, but you still want to check again after 20 seconds, and that is what you send. You are not sending an error, because you did not have one, but you are sending a fixed time: you are telling Kubernetes that there was no error in my request, but I want you to rerun this reconciler every 20 seconds. This is like a forever-running loop; it never stops, because you don't have any errors but you always want to run it again and again. What could be the reason for it? I just told you: maybe some sort of drift detection.
The third case: you are not returning any error, you want to requeue, but your custom RequeueAfter is not set. This is similar: no errors, requeue requested, but no interval, which means you are asking Kubernetes: hey, my reconciler was okay, I want you to run it again, but I'm not telling you at what frequency. It is like the second case in that you're not sending any errors, but there you tell Kubernetes how frequently to retry, and here you don't; you leave that to Kubernetes, and that is going to be an exponential backoff. Kubernetes says: okay, the user said there is no error, the function ran properly, fine, but they are not asking me to run this on a fixed interval, so I'm going to use an exponential backoff. It reruns your request with increasing delays, starting at a few milliseconds and roughly doubling each time, up to a maximum limit (I think the default cap is around 1,000 seconds), after which the delay stops growing.
And the last case you can return from your reconciler: you do not have any errors and you also did not set any requeue flags. You returned an empty Result and a nil error. This is where Kubernetes says: okay, everything was fine, I'm not doing anything; I'll just wait for a new update, a new event where the custom resource has been changed, kind of like waiting for new requests. This is absolutely critical for you to understand; otherwise you might see your reconciler making changes again and again, or running again and again, because you did not return the right set of values for Kubernetes to understand what to do.
Now, once this is understood, as I was saying, we will look into the Go code. Let me pull that up. On your screen you can see that I've made some changes to our instance spec. Before, it was a very simple one.
It was just having an instance type, an
AMI ID, probably a key pair and a
security group. But when you want to
make things more robust and when you
want to make things more production
ready, you have to think from an overall
point of view. When you want to create
an EC2 instance, there could be many
things that you have to give. You
definitely have to give the instance
type whether you want to use a T2 micro,
T3 micro or any other instance family.
Then you absolutely have to give an AMI
ID which is going to be the the AMI ID
you want to use. You have to give the
region as well under which your instance
should be created. You need to give the
availability zone. You have to give the
key pair so that you can log into the
instance. You need to give the list of security groups, and the subnets in which your instance should run. Also, when you want to provision the machines with your changes as soon as they boot up, we usually use Amazon's user data, and that too is something you can give. You can give tags as well, you have to give storage, and there is a boolean for whether you want the instance to have a public IP or not. Now, on the right side you can see this omitempty. This is how you control which fields in the YAML of your EC2 instance spec are optional and which are required. For example, tags could be optional and user data is optional, but storage is absolutely needed, the AMI ID is absolutely needed, and the instance type is required. So omitempty lets people define only the important, required fields, and the other ones can simply be skipped.
So here you can see I have a storage field of a new struct type called StorageConfig, and in StorageConfig we define a root volume and, optionally, some additional volumes. This is an example where you give your root disk as 100 gigs, and maybe you want a VM for a database, so you add a bigger disk to the instance; that is done via the additional volumes. Both of them are of type VolumeConfig, and additional volumes is a list of VolumeConfig, because it is a list of extra disks you can attach to your instance. VolumeConfig itself is very simple: you define the size of the disk, the type of the disk, the device name that will be visible in the instance when you attach it, and an encrypted boolean in case you want the disk to be encrypted, because Amazon allows you to encrypt your disks if you want that.
So think about the EC2 instance more holistically: what do you want to allow the users to declare in their spec and metadata? In this case, you are allowing developers not just to create an EC2 instance, but also to log in with their key pair, and to use user data that Amazon applies while creating the instance, so the machine is preconfigured before they even log in and the VMs are exactly how they want them to be.
So that was a bit of a change in our EC2 instance spec, to move it from development toward something more production-ready. I also made some changes to the status, so that when you do kubectl get ec2instance you see the spec and also the status. In the status I would like to see the instance ID, so it is easy for people to see what exists on Amazon, and what Kubernetes knows about the state of that instance: whether it is running, terminated, unknown, or stopped, all of those Amazon EC2 instance states. Another very important thing is the public IP, because when I do kubectl get ec2instances I should see enough to let me log in to that particular instance, and that's the public IP, which is what I want to show there.
And then again we have the standard struct of our EC2Instance, which contains the type metadata, the object metadata, and our spec and status, plus the list type for when you get a list of instances. This is how Kubernetes knows what an EC2Instance looks like for you.
Now, I have already made the changes, and I told you: whenever you make changes, you have to run the make manifests command and then install the result to your Kubernetes cluster. So my Kubernetes cluster already has this custom resource definition. If I do kubectl get crd ec2instances.compute.cloud.com -o yaml and look at it, you can see the name, the list kind, the plural, the singular. It is a namespace-scoped resource.
And there you can see I have a couple of things, such as the kind, and there is my spec: the AMI ID; the associate-public-IP boolean; the availability zone I want to run my instances in; things like my security groups, which is an array type because you can give multiple security groups; and then my storage configuration, where I give one root volume plus additional volumes, which is an array of objects, so you can give multiple additional volumes, but you can only have one root block device.
Now, once we have defined our spec properly, there is going to be some code that actually uses this and creates an EC2 instance. So let's see that. Once I have my instance type, I can go to my EC2 controller, and this is where everything starts. This is where you will see the reconcile loop. We saw this before: we use the reconciler to see what happens when I get a request, and the "your to-do list" comment is where my logic starts. I have created a context-aware logger for this context, and you can use l.Info, which is going to print messages while your operator is running. It makes things more verbose, and you can see what is happening with your controller. It prints an info message that the reconcile loop has started, along with the namespace from which you got a request. So req is the request that comes to your reconciler, and then you send a result back to Kubernetes. The request came from this namespace, and the name of the request was req.Name; that's the name of the object that we are working with.
Then there are some comments; while I was building this, I put comments in for us to understand it easily. But you know what we are doing: we are creating a variable of type EC2Instance so that we can unmarshal the object that comes to us in this reconcile loop from Kubernetes into a variable, and then we can easily work on top of that. We get the object into our EC2 instance variable from this namespace, and if you could not get the object — and this is absolutely very important — there may be any number of reasons. Maybe you have a wrong YAML. Maybe you were supposed to give a string but gave a boolean for one of the keys. Or maybe you did kubectl delete on the object. That's right: even deleting the object is an update to the custom resource.
Then this reconciler is going to be started again, and you have to check whether the error you got while trying to get the object was "is not found". This is one of the errors from Kubernetes: Kubernetes has a package called errors, and let me show you here — you can see it has all these errors defined for you, which makes it easier to declare what the error was in my case. See, sometimes when you create an object, it tells you the object already exists. That's an error, but you can actually see what kind of error it was. If I were just saying "get me the object, and if the error is not nil, say okay, there was an error, please try again", the user would never know what the error was. In this case, I can say: "Hey, I was trying to get your object into my variable, but I got an error while trying to get it, and the error was IsNotFound." IsNotFound returns true if the condition was that the object could not be found. This is probably when you are deleting the object, and the reconciler runs again. It looks something like this: you have the object here.
Whatever change you make on this object, the reconciler will be running again. The change could be that you added an annotation — that's an update, and the reconciler runs. The change could be that you added a label on top of the object — again the reconciler would run. It is your responsibility to write this reconciler in a way that, if the change you made to the object doesn't require a change, it should not be changing the actual resources. For example, your object could be an EC2 instance, and maybe you just want to give this Kubernetes object a label. That doesn't mean you have to change something external on the Amazon instance — that should not happen. So this is something you have to code into your reconciler.
Even when you say kubectl delete — when you say delete, the object is deleted, there was a change on the object, and then another run of the reconciler happens. So you have to check: when you were trying to get your object, you could not get it, there was an error, and the error was actually IsNotFound. Then you will simply say the object does not exist, or there's no need to reconcile because the object was deleted, and you just return an empty result and a nil error. Remember, this is one of the return combinations you have to use. What you're telling Kubernetes is: everything is fine, there was no error from my side, because the object does not exist in our case; please wait for the new requests coming to the reconciler — this one request that came in is all good. If you could not get the object for any reason other than IsNotFound — maybe you did not have the right RBAC to get that object in that namespace — for whatever reason you could not get the object of the request, you will then return an error. And you see, the moment you return an error, you are telling Kubernetes: please retry this object, please retry running this reconciler, please retry the whole reconcile loop. And this is where the self-healing comes in. Maybe you had some problem where you could not get the object, but you try again, and if it works, then you're happy because you don't have any errors anymore; otherwise you return an error again, and this goes ahead with exponential backoff. So when the first request comes in and there was an error, but it was not an IsNotFound error, you return an error. It goes back to the reconciler, runs again, there was another error which was returned, it goes back to the reconciler, and this happens with exponential backoff. This is where the return values of the Reconcile function are absolutely critical. Absolutely critical.
Now, the next thing: whenever you delete an object, a deletion timestamp gets set, and I will talk to you about this later; we are not going to cover it right now. That part is for when you delete the object — we will first learn how to create one, and then we will delete one. This block also is about deletion; it's the deletion logic, and I'll tell you later why it is here. This is the logic for checking whether the instance is already there, because you want to be idempotent: you don't want to create the same instance with the same instance ID if it already exists. And this is also logic which I will talk to you about later.
So here is where we start in our loop. The first thing you do is start your reconciler. You create an object, you try to get the object into your EC2 instance variable — the custom type that you have created — and then you say: okay, I'm starting completely fresh, I have no ID, I have no instances on my Amazon, and I'm going to create a new instance.
The first thing you do when you create an instance — or really when you create an object — is that it's a very good idea to add a finalizer. You might have seen this in Kubernetes: when you do kubectl get -o yaml, you might see this finalizer in the metadata section of your object. What this finalizer actually does is very interesting. So let's say you created this object in Kubernetes, which was an EC2 instance, and there you added a finalizer. A finalizer is nothing but an entry in a list of strings under the metadata; let's say I add a finalizer called hello. Then this object was created, and your reconciler actually went to Amazon and gave you a new AWS instance.
All right, everything is happy. You got the instance. Now, the thing that happens with a finalizer is when you say "I want to delete this object". When you say "I don't need this instance anymore, I want to delete this particular object", you can delete it from Kubernetes; however, that's not the only place you need to delete it from. You also have to delete it from Amazon. So how do you tell Kubernetes: wait, I am deleting something on an external resource, on an external platform; do not delete this object from Kubernetes; only when this instance is completely gone from Amazon can you delete this particular object? That is the role of a finalizer. Finalizers will hold the deletion of the object in Kubernetes until the actual cleanup has happened. And once you have cleaned up the external resource, you then remove the finalizer, and then Kubernetes will allow the deletion of this particular EC2 instance Kubernetes object that we have created. As a good practice, the moment you create the object in Kubernetes is when you should add the finalizer.
And this is extremely important: this finalizer is being added to your Kubernetes object, which is the EC2 instance, and this too is an update, so it is also going to rerun the reconcile loop. Any update to the object — whether you are adding a label, adding metadata, adding anything really, or updating the status of the object in your code — will start a new reconcile loop. So this is where you have to be very careful about idempotency in your code. And you see what happens: we print a message saying "I'm about to add a finalizer", and I use this append function, because it's just an entry I'm adding to my EC2 instance's finalizers. Because I've already unmarshaled this using r.Get, my EC2 instance variable actually holds the YAML of the request that was given to me, and I'm appending my finalizer, called ec2instance.compute.example.com, to the finalizers there. I'll show you how it looks: it just creates a new key under the metadata of your object and then adds this as an entry in the list, because you can have multiple finalizers on your object.
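The add and remove operations on that finalizer list can be sketched with plain string slices. This is a stdlib-only illustration of the idempotent pattern, not the course's actual helper functions (controller-runtime ships similar helpers in its controllerutil package):

```go
package main

import (
	"fmt"
	"slices"
)

// The finalizer name used in the course.
const finalizer = "ec2instance.compute.example.com"

// addFinalizer appends the finalizer only if it is not already present,
// so a rerun of the reconciler does not keep growing the list (and does
// not trigger yet another update/reconcile).
func addFinalizer(finalizers []string) []string {
	if slices.Contains(finalizers, finalizer) {
		return finalizers // already there: nothing to do
	}
	return append(finalizers, finalizer)
}

// removeFinalizer drops the finalizer after the external cleanup is done,
// which is what lets Kubernetes finish deleting the object.
func removeFinalizer(finalizers []string) []string {
	return slices.DeleteFunc(slices.Clone(finalizers), func(f string) bool {
		return f == finalizer
	})
}

func main() {
	f := addFinalizer(nil)
	f = addFinalizer(f) // idempotent: still only one entry
	fmt.Println(f)
	fmt.Println(removeFinalizer(f))
}
```

Note the guard in addFinalizer: checking for presence before appending is exactly the kind of idempotency the reconciler needs, since it will run again after its own update.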
Once you have declared that you want to add a finalizer, the actual way of updating your object is r.Update. There are a couple of functions we get with the reconciler: Get lets us pull the YAML of the object into our variable, and then you can also use r.Update, which updates the object the reconciler is working on right now. So in this case I'm updating my EC2 instance, where I'm adding some finalizers, and we will see this when you create the instance: it will show you the finalizers as soon as the instance is created. And because you made an update on the object, it will start a reconciler again — not right now, though. This is very important to remember. Let's say the reconciler is running and here you made an update to the object, maybe you updated the annotations. That schedules another reconcile, but not immediately: you move ahead in your code, and maybe you make another update — in this case you updated the labels of your object. Kubernetes also records this as a second time it has to run.
Then you do a return with a nil error. What's going to happen is Kubernetes will run this reconciler twice, because you made updates to the object twice. It's as if it remembers: here was an update, I have to rerun the reconciler; here was another update, I have to rerun the reconciler. It is a golden rule of reconcilers that any update to the custom resource — whether it was done by you with kubectl commands or by your reconciler itself — will start another reconcile loop. It will not stop the current reconciler; it's not like it gets an update and jumps away from where it is. It will finish the current execution, and then, based on how many updates you made, the reconcile loop runs again. And it is your responsibility to make sure that when it runs again, that update does not happen again, that label is not applied again, because they're already there. Then you can say: okay, I did not make any changes throughout my reconciler on this object, I need to do nothing, I don't need to start the reconciler again for this particular custom resource. If you make a new custom resource, then yes, the reconciler will be started again, and the loop keeps on running.
Extremely extremely important to know
about the return types of the
reconciler.
Now, once I have updated my object, I'm telling Kubernetes: please add this finalizer to my object. And if I got an error — actually, if I got any error — I print an error that says "failed to add finalizer", I say please requeue, and I return an error. So this is where you are returning another return combination. And you see, whenever you get an error — whether you are trying to get the object, delete the object, or update the object — you want to retry, and that is when you return an error. And whenever you return an error, the whole result is completely ignored. It's absolutely important to understand this: you see here that this Reconcile function returns two values, one is the result and the second is an error. If you return an error, the result is completely forgotten. Kubernetes says: you know what, the reconciler had an error, I'm going to retry that again with exponential backoff. And this is how the self-healing works in Kubernetes.
Now, once you have added the finalizer — and this prints an info log which says the finalizer was added; that schedules a new reconcile loop execution, but the current one continues — this is where we create an EC2 instance. I'm just printing a log: continuing with the EC2 instance creation in the current reconciler. And this is the beauty, what we were waiting for; this is what happens when we want to write an operator that talks to something outside our Kubernetes cluster. Guys, this is where it all comes down to. We have our spec of the custom resource. We have the logic that listens on updates to our custom resource. We have the logic to get the manifest — think of it as getting the YAML of what the user has given — into my EC2 instance variable. Now I need to create an EC2 instance. This is absolutely important.
Now it's going to be so much fun. When you want to create an Amazon instance, you know what you need: you absolutely need an Amazon account, and you need the credentials. You need to give the credentials to your operator or controller so it can go on your behalf and work on Amazon. And this is exactly what we will be doing. Before we can actually go ahead and create an instance, we need to figure out the authentication. Then we will use a client created with this authentication, give it this particular YAML, and ask it to go ahead and give me an instance on Amazon. And this is exactly what's happening now in this createEC2Instance function. So I've created a function called createEC2Instance, and I pass it the user's requested YAML — the user's manifest, the EC2 instance — and let's see what this function actually does.
This function, createEC2Instance, accepts a value of type EC2Instance, which is the whole YAML from the user, and it returns two things. First, it returns a value of type createdInstanceInfo. See, when you create an EC2 instance, you get a lot of output, a lot of data, but we don't want all of that. When a user creates this instance, they probably care about the state — whether it is running or not. They care about whether it was created or not, true or false. They also care about the public IP: is there one or not? For this information I have created a new struct, createdInstanceInfo, and it helps me send back the data from my create-instance function. What I send back is an instance ID, which is important so people can see which instance IDs exist on Amazon using kubectl get; I also send a public IP, and I can send a private IP, a public DNS, a private DNS, and the state. That is all I send from my function.
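That trimming step — keeping only the fields users care about out of everything AWS returns — can be sketched like this. The field names are assumptions modeled on the narration, and mockAWSInstance stands in for the much larger Instance struct the AWS SDK actually returns:

```go
package main

import "fmt"

// createdInstanceInfo is a sketch of the small result struct described in
// the course: just what a user of `kubectl get` would want to see.
type createdInstanceInfo struct {
	InstanceID string
	PublicIP   string
	PrivateIP  string
	PublicDNS  string
	PrivateDNS string
	State      string
}

// mockAWSInstance stands in for the AWS SDK's Instance type; the real
// code would read these fields from *ec2.RunInstancesOutput, which
// carries dozens of other fields we deliberately ignore.
type mockAWSInstance struct {
	InstanceID string
	PrivateIP  string
	State      string
}

// summarize copies over only the fields worth reporting in the status.
func summarize(in mockAWSInstance) createdInstanceInfo {
	return createdInstanceInfo{
		InstanceID: in.InstanceID,
		PrivateIP:  in.PrivateIP,
		State:      in.State,
	}
}

func main() {
	info := summarize(mockAWSInstance{InstanceID: "i-0abc", PrivateIP: "10.0.0.5", State: "pending"})
	fmt.Printf("%+v\n", info)
}
```

This is also the struct whose fields end up surfaced in the object's status, which is what makes kubectl get ec2instance useful.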
And that is the return type. My function which creates the EC2 instance returns two things: first, the struct with the information about the created instance, and second, an error, with which I can tell the user that I tried to create the instance but something went wrong — maybe the authentication was a problem, maybe you don't have enough quota in your region or your account. I want to send them something so they are aware of what really happened, why the request to create the instance failed. So we create a logger
named createEC2Instance. This is good: you can have your logs with a custom log name, and it's easier to know which file and which function created a particular log entry when you do kubectl logs on your operator.
Then I'm printing an info message saying I'm starting the EC2 instance creation: this is going to be the AMI ID I'm going to use, taken from what the user has given in the spec; this is the instance type; and this is the region in which I'm going to create my instance. So the first thing I have to do is create an EC2 client, guys. This has nothing to do with Kubernetes; it is purely how you create an Amazon instance in Go. You already have the instance YAML — or rather the instance info, because the EC2Instance object describes the whole instance that should be created — and now you are doing the generic work of creating instances on Amazon. The first thing you do is create the EC2 client with this AWS client function. What it is essentially doing is reading the AWS access key and the secret key from your OS environment variables, because you need to give some sort of authentication to talk to Amazon. I'm using the access key and the secret key, and then I'm using the config package from Amazon to load the default config with these credentials. And if I could not create the config, I return an error.
Otherwise, if you have the access key and the secret key, you are able to create a config, and then you return a new client. Think of this function as just returning an EC2 client, and this is absolutely important: you will use this client to talk to Amazon. So you use your access key and your secret key to create an EC2 client, and up to this point you have the authentication to Amazon.
Now you need to say: hey Amazon, please create me an instance with this key, in this subnet; this is the minimum count, this is the maximum number of such instances I want, this is the instance type I want you to create, and this is the image AMI ID I want to use. These are the instance input parameters. You are creating an instance, and Amazon expects you to give certain instance inputs; these are some of them. There are many other instance inputs you can give. If I show you: you can tell Amazon the maximum count of instances you want, the minimum count you need, any block device mappings you have, the capacity reservation specification, the CPU options, whether it is a dry run or not, whether it is EBS-optimized or not. So there are many, many different options you can give when creating an Amazon instance; this is simply the information you supply when you create one. You can also give security group IDs, in case you want those security groups to be used for this instance. I'm keeping it very simple so that we know what's really happening: we are building our request to create an EC2 instance with these inputs. And once I have my input declared, I'm using my EC2 client, which I created above, and there's a function called RunInstances. This is the function from the AWS Go SDK that launches the specified number of instances using an AMI for which you have permissions, and this is the call that actually creates the instance for you. So far you created the client, you created the instance input, and now you have created your actual instance here. Now, this RunInstances call returns two things. First, it returns the actual output.
See, I told you: when you create the instance, you get a lot of output, and this is what's going to be returned. If you look at the RunInstancesOutput type in the Go SDK, you can see what is returned to you: you get the instances that were created, plus the metadata. Within the instances you have the instance metadata — the private IP, the instance ID that was created for you, the region and the zone it is running in. Think of that as the metadata of your instance at creation time, and that's what we are saving in the result. If there was any error — because RunInstances does return an error as well — you will say "failed to create the EC2 instance", and then you return the error back to the caller and say: this was the actual error because of which I could not create the instance. There could be many reasons why you could not create one: perhaps you did not have the permissions in that region, perhaps you did not have quota in that region, perhaps you used a wrong AMI ID which doesn't exist — it could be a typo or anything. You just return this to the user; it's a good thing to give them the reason why it failed.
So that's what you're checking: if the number of instances returned is zero, you just say there were no instances returned. And if we have no error up to here, we have an instance for ourselves. This is what has happened so far in our code: we had the client, then we used the RunInstances function to actually create an instance, and it gave us some output back.
At this point the output contains things like the state the instance was in when it was created, the region, the metadata, the private IP, the private DNS name. By this time there is no public IP. However, there's one important thing: when you make an API call to Amazon with this RunInstances function, what you essentially get back in the output is the state of the instance at the time AWS received that request. It might not be running. You know how, when you create an EC2 instance, it goes into pending and initializing before it eventually reaches the running state. At this time you have an instance created for you, but it might not be running yet; it might take some time. And this is what you want: you want to wait until the VM is running. So when you execute the RunInstances function, it creates the instance and gives you back the metadata. What it does not give you, however, is the public IP, or confirmation that the state is running. It's as if you say: hey Amazon, create me an instance; Amazon says: cool, I'll give you one, here's some metadata — but you don't get the public IP, and the instance might not be in a running state when you created it.
But this is what you check in your result: the Instances field gives you back the list of instances, and you are checking that there were no errors and that an actual instance was created. Then you just log an info message that says: okay, I was able to create the instance successfully, and this is the instance ID that was returned to me. You store the output that was given to you, and there you can access things like the instance ID, private IP, and public IP, because it is all returned by Amazon to you.
And now we wait for the instance to be running. See, it's good that you created the instance, but imagine this: there's a developer, and he comes to you and says, "Can you give me an EC2 instance?" You go to Amazon and say, "Please create me the instance," and you get back the private IP. In your company, you are using a bastion host, and through this bastion you can talk to the VM on its private IP — because you might not have a public IP; you might have disabled public IPs. So essentially what happened: the developer asked for a VM, you said "hey Amazon, create me a VM", you got the private IP, and you gave it to him. You never waited to see whether the instance actually reached the running state. Maybe the instance was created but never reached running; maybe there were problems in that Amazon region, or maybe the instance malfunctioned. Whatever happened, you were not waiting for the instance to be running. You gave it to him, and he logs in to the bastion only to find out that this instance is not running, so he or she cannot use it. And that's where the problem is. As an operator, it is your responsibility that when you create that particular resource — the instance — you wait for it to reach the state that you want; in our case, running. So what I could have done is have a for loop that keeps polling Amazon: hey, is this instance running now? Is it running now? Maybe every 5 seconds. That's also doable.
Think of it as a while-true loop: check the instance, and that's it. The function keeps running; I give it an input to describe the instance, I describe the instance with that call, and I get a response back. If the state name is running — think of this as pseudocode — then you break; otherwise you keep polling, say every 5 or 10 seconds. That is a doable option, but it's not a good idea, because Amazon gives you waiters that can do this for you more gracefully.
A waiter is nothing but a construct from the Go package of the AWS SDK that waits, up to a certain time, for the instance to reach a certain state. In my case, there's NewInstanceRunningWaiter. If I go to that and show you, there is a NewInstanceRunningWaiter. What it does is define a waiter for an instance: this one has the logic to wait for the instance to be in the running state, and it does the polling more efficiently than me writing that logic in my own code. So you can define a waiter that waits for the instance to reach the running state. You can also give the maximum time you want to wait, because if Amazon takes forever to create your instance, you don't want the checking loop to run forever; you have to give some feedback to the user. Typically you give the max wait time, which is going to be three times time.Minute. So you're giving three minutes for the instance to reach the running state; depending on your requirements, you can increase or decrease this. With every request it makes, the waiter backs off exponentially: it starts at something like every 10 seconds and then increases the interval, up to your given time. It does this a lot better.
You create a waiter and then you use the Wait function to ask it to wait on this instance ID. You are telling the waiter: please wait for this instance, up to this maximum time, to reach the running state. If by that time it has not reached the running state, there will be an error, and if the error is not nil at this point, you just say: failed to wait for the instance to be running; the instance could not reach the running state in 3 minutes. Now, this is important: you do say that the maximum time you want to wait is 3 minutes. However, if the instance reaches the running state in the first 30 seconds, the wait stops. It's not going to sit there for a dedicated 3 minutes even though the instance reached the running state earlier; it doesn't work like that. This is why waiters are quite interesting: they have the logic for all of this, so you don't have to deal with it. This kind of thing comes with experience using the SDK, and it's also something you can Google: how do I make my code more efficient, how do I use waiters — and you will find it.
Now, by the time the instance has been created, we get the remaining information back. We got the state, because we were waiting for it to be running, and we only stop waiting when the running state is reached within 3 minutes. Actually — okay, let me correct myself: at this point we are just waiting for the instance to be running; we don't have the public IP yet. This is where I skipped ahead. When you are using a waiter, you were only waiting for the instance to be running, and once the instance is running, Amazon will give you a public IP. So you have it running, but you don't have the public IP yet, because it was not given to you in the output when you created the instance. This is where you use another Amazon function: DescribeInstances.
Now you say: okay, I created the instance, I waited for it to be in the running state, and now I'm describing this particular instance, so now I'm going to get my public IP as well, provided, of course, that you have public IPs enabled in your Amazon account. This is where another request goes to Amazon. So we waited for the instance to be running within 3 minutes, and then I'm calling the Amazon describe instance API to give me the instance
details. I tell Amazon that I want to describe the instance whose ID I got back when the instance was created, and I store the result of the describe call in a describeResult variable. If I could not describe the instance, again with a very simple Go error check, I say I failed to describe the instance, for whatever reason, and return the error. If I could describe it, the result is in describeResult, which is of type DescribeInstancesOutput.
Now when you describe the instance, you get a struct back from Amazon. You get the data in a specific struct, which we can see here.
You do get some output from DescribeInstances, and this is the DescribeInstancesOutput. What you get is information about your reservations, which represent the instances on Amazon, and within these reservations you have the instance information. So if you look at a reservation there, you have the instances that were described for you.
So I can index into the reservations, because I know I only created one instance; the list of reservations is only going to have one element. And for these instances, which is only one, the public DNS name is accessed like this. So I print the public IP, and then I say the state is State.Name.
Again, this is returned by the Instance struct of the Go library, because if I show you here, if I go to Instance, it has a public DNS field. There you go: the Instance struct returns a public DNS, it returns a public IP address, and it also returns the state. Here the state is of type InstanceState, which in turn has an InstanceStateName. So they have created types for each of these, and here you see you have the name. So
you have asked: I want to describe my instance, and the input is this instance ID. I store the result: all the reservations that were returned to me by Amazon and all the instances inside them. I know I only have one instance, so I can access it with index zero and read the public DNS name and the state of the VM.
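The nesting described above can be sketched with local stand-in types that mirror the shape of the SDK's DescribeInstancesOutput (reservations at the top, instances inside each reservation). These are illustrative types, not the real SDK structs:

```go
package main

import "fmt"

// Local stand-ins mirroring the nesting of DescribeInstancesOutput:
// reservations at the top, instances inside each reservation.
type instanceState struct{ Name string }

type instanceInfo struct {
	InstanceId      *string
	PublicDnsName   *string
	PublicIpAddress *string
	State           instanceState
}

type reservation struct{ Instances []instanceInfo }

type describeOutput struct{ Reservations []reservation }

func main() {
	id, dns, ip := "i-0abc123", "ec2-3-120-0-1.example.com", "3.120.0.1"
	out := describeOutput{Reservations: []reservation{{Instances: []instanceInfo{{
		InstanceId: &id, PublicDnsName: &dns, PublicIpAddress: &ip,
		State: instanceState{Name: "running"},
	}}}}}

	// We created exactly one instance, so index 0 of both lists is safe here.
	inst := out.Reservations[0].Instances[0]
	fmt.Println("public DNS:", *inst.PublicDnsName, "state:", inst.State.Name)
}
```

Note that the SDK fields are pointers, which is why the nil-safe dereferencing discussed later matters.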
Now here's the interesting thing. By now you have all the information you need for your EC2 instance: you've got the private IP, the public IP, the instance state, the name of the instance that was created, and the key name that was used. By this point you have all the details about your Amazon VM that need to be given back to the developer who asked for this instance.
The next thing you could do is fetch the instance information again, but that is not needed, because we already requested it. What we are doing instead is returning it back, and this is extremely important.
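The return shape being described, a pointer to an info struct plus an error, can be sketched like this. `createdInstanceInfo` and the stubbed values are illustrative stand-ins for the course's API types, with the actual AWS calls left out:

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for the course's CreatedInstanceInfo API struct: the data
// the create function hands back to the controller.
type createdInstanceInfo struct {
	InstanceID, PublicIP, PublicDNS, State string
}

// createEC2Instance returns either the info or an error, never both.
// The AWS calls are stubbed out; only the return shape matters here.
func createEC2Instance(fail bool) (*createdInstanceInfo, error) {
	if fail {
		return nil, errors.New("failed to create instance")
	}
	return &createdInstanceInfo{
		InstanceID: "i-0abc123",
		PublicIP:   "3.120.0.1",
		PublicDNS:  "ec2-3-120-0-1.example.com",
		State:      "running",
	}, nil
}

func main() {
	info, err := createEC2Instance(false)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", *info)
}
```

The caller checks the error first and only touches the info when the error is nil, which is the idiomatic Go pattern the controller relies on.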
See, when you have all the information about the instance that you created, we want it to be returned back to the actual controller, so that the controller has the instance information.
This function returns a type called CreatedInstanceInfo, and I just showed you CreatedInstanceInfo here. Where did that go in my API? This one. So here's the struct, CreatedInstanceInfo, and that is what my createEC2Instance function should be returning, and this is what I'm preparing now. I got all my instance information into a variable called instance from the describe result, and then I prepare my return value, because this function returns two things: first an error, if there was any, and second the CreatedInstanceInfo, which contains the public IP, the private IP, the public DNS, the private DNS, the state, and the instance ID. That is what I've prepared now. Once this is done, we simply say I have created my EC2 instance, and there I'm returning my return values: because I did not have any errors when creating this instance, I'm returning a nil, and I'm returning the information of my instance which was created. What might be interesting is this function called derefString.
What this does is dereference my pointer. The reason we are dereferencing the pointer is that when you talk to the Amazon SDK, it returns things like the public IP address as pointer types, and the value might not be available at that time, so you might be handed a nil pointer, and that's a problem. It is important that we can distinguish between an empty string and a nil value, because if it was indeed nil and you tried to dereference a nil pointer, that's going to be a problem.
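A nil-safe dereference helper along these lines might look like the following. `derefString` is our own name for the idea; the AWS SDK for Go v2 ships `aws.ToString` for the same purpose:

```go
package main

import "fmt"

// derefString safely dereferences a *string: a nil pointer becomes "",
// so callers never hit a nil-pointer panic. (The real SDK provides
// aws.ToString for this; derefString is our own illustrative name.)
func derefString(p *string) string {
	if p == nil {
		return ""
	}
	return *p
}

func main() {
	var missing *string // e.g. a public IP Amazon has not assigned yet
	ip := "3.120.0.1"
	fmt.Printf("%q %q\n", derefString(missing), derefString(&ip))
}
```

With this helper, a field that Amazon has not populated yet simply comes back as an empty string instead of crashing the reconciler.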
This is essentially why we waited so long for the instance to actually have a public IP. The dereference function just dereferences my string pointer to return a plain string, which I can give back to my main function. By this time I have an EC2 instance that was created for me. Now, the createEC2Instance function doesn't just create an instance on Amazon; it also returns two values. One is an error; it's a good idea for your function to return an error if there was any, or nil otherwise, so that you can use that error in the further steps. For example, we are
using it here to say if there was an
error, we want to put that error as a
log entry in our reconciler. So when people are looking at the logs of our application, which is the reconciler in this case, they will know why there was an error that stopped you from creating an EC2 instance. You can then also use this error as the reconciler's return value, because, as you remember, the reconciler is expected to return two things: first the result of the reconcile function, and second any error from that reconcile.
Now, it depends on how you are writing your reconciler. Maybe you want to retry creating that EC2 instance after waiting some time, and this is why you can return an error from the reconcile function. What you're essentially telling Kubernetes is: I tried to do an operation, which in my case was to create an EC2 instance, and I could not do it; whatever the problem was, I want you to take some time and retry that process, that function, again. And this is where Kubernetes will say: okay, I'm going to retry running that reconcile loop, so I'm going to retry creating that EC2 instance for you. This is done with an exponential backoff: it tries, it fails, it waits a little, it tries again; if it fails again, it waits a bit longer. This is how Kubernetes does its exponential backoff with your request, so you're not getting rate limited by Amazon by asking for an EC2 instance every 2 or 3 minutes or whatever your reconcile interval is; it waits in between, and it's an exponential backoff. Now once you
have the EC2 instance, once you were able to create it, I'm returning the info as well, which, if you remember, is a struct that we created probably somewhere around here.
This is the struct that we created, and this is the information I want from my createEC2Instance function, because this is something I want to give to my users when they do a kubectl get on the EC2 instance. They should know the instance ID and, most importantly, the public IP of the instance, so they can always log in and start working there. You would also probably want to give them the state of that instance: what is it right now on Amazon? Is it running? Is it stopped? Is it terminated? Any other state, you want to update them on as well. So we do a very small log saying that, okay, I was able to create the instance, and now I will update my status. If you remember,
every EC2 object that we create, every
EC2 instance that we create has a spec
and it also has a status field, much like any other Kubernetes object, and here you have the flexibility to define what the status should be. In our case, we are reporting the instance ID and the public IP.
In our case, we are reporting the state of this instance: whether it is running, stopped, terminated, or any of the other states an Amazon instance can have. And that is where we update the actual object. This EC2 instance here, if
you remember we actually create a
variable for it and we got the object
from the request that came into our
reconciler. So the user asks to do something with the EC2 instance custom resource. We got that YAML; think of it as getting the YAML from the user with the r.Get method and storing it into a variable. Now this EC2 instance has a spec, which you read and use to do your work. This is where the user gives the instance type they want, what they want for storage, what user data they want: that is the spec. We use the spec to do our operation; we use the spec of the EC2 instance to create the instance that the user is asking for.
And then the status is for us as a
Kubernetes developer to tell what
happened with this particular object.
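The spec/status split can be sketched with illustrative stand-in types for the course's API: the user-provided spec stays untouched, while the controller copies the create result into the status.

```go
package main

import "fmt"

// Illustrative stand-ins for the course's API types: users fill in Spec,
// the controller fills in Status after doing the work.
type ec2Spec struct {
	InstanceType string
	Region       string
}

type ec2Status struct {
	InstanceID string
	PublicIP   string
	State      string
}

type ec2Object struct {
	Spec   ec2Spec
	Status ec2Status
}

func main() {
	obj := ec2Object{Spec: ec2Spec{InstanceType: "t3.medium", Region: "eu-central-1"}}

	// Copy the create result into the status; the spec is never touched.
	obj.Status = ec2Status{InstanceID: "i-0abc123", PublicIP: "3.120.0.1", State: "running"}
	fmt.Printf("spec: %+v\nstatus: %+v\n", obj.Spec, obj.Status)
}
```

Keeping the two directions separate, users write spec, controllers write status, is the core contract of every Kubernetes resource.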
And the EC2 instance status is where we can tell what the instance ID is, the state, the public IP, the private IP, the public DNS, and the private DNS. This is all what we have defined in here. You can see our EC2 instance also has a status struct, because the way our actual EC2 instance looks is: it has the metadata for the object and the type, then it has a spec, and then it has a status, and that is what we will be updating now, because we already did an operation. Maybe it failed, maybe it was successful. If it failed, we handled it and asked Kubernetes to rerun the reconciler; but if it was successful, you want to update the status, and that's what we are doing. For the status, the instance ID and the state I'm actually getting from this function: the instance ID is given to me from the create function via the createdInstanceInfo variable, which holds an instance ID, and then we set up the state, the public IP, and so on. Everything on the right side of these assignments is given to me by that function, and I'm updating the status of my custom resource, which was picked up by the reconciler.
Now you have assigned the output of the function to the status of this EC2 instance variable.
However, this alone doesn't update the object in the cluster; you need to use a function called r.Status().Update, because you actually want to update the status. If you look here, the reconciler has got a couple of functions, because r is of type EC2InstanceReconciler. The first one is r.Get.
This lets you get the actual object which is coming to the reconciler; in our case, think of it as getting the YAML of the EC2 instance object which the user created. Then there is also r.Update. In this case you are doing an update on the EC2 instance object itself; we use this when adding the finalizer, where you update the object, the EC2 instance, with the finalizer. And then there is the status function, for when you want to update the status of the object. You are not updating the metadata here, you are not updating the spec; you are only updating the status. That's why we tell our reconciler that we want to work with the status of our object, and essentially we want to update the status with this information that we have just added here.
So to sum that up again: we use the createEC2Instance function, we get some information from it, and then we update our object's status with this information. If you were able to update it, everything is fine. But if there was indeed an error when you were trying to update the status of this object, you just say I could not update the object, and then you return an error, which will trigger the reconcile again and retry updating your status. If everything is fine, we reach the end of our loop and we just say: all done, nothing needs to be done, keep looking for object updates, and this reconciler run is finished. However, if you remember, I did tell you a couple of things. If you remember, any update that you do to the object, in our
case, the object that the reconciler is looking for is an EC2 instance. If you update it or a user updates it with kubectl edit, it does not matter; if there is any update at all to this object,
there's going to be another run of the reconciler. It's absolutely important for us to understand this. So the way your reconciler works right now is: first, it gets the object.
Second, it tries to create an instance.
It then updates the finalizer
on the object.
Then if the instance creation was okay,
it goes ahead and it updates the status
of the custom resource. And then we
reach the end of the loop.
The problem however is you updated the
object here. So there was a change on
the object here at this place because
when you update or even if you want to
add a finalizer, maybe you add a label
to your object, maybe you add an
annotation to the object, it does not
matter. Kubernetes does not
differentiate what kind of change you
did on the object. It says: okay, the reconciler looks for an update, and you did an update here. Also, when you were able to create the instance, you got some instance information back, and that is where the status was updated. So there was an update here as well.
The way our reconciler works is it starts from here. It sees that right now the instance the user is asking for is not there, because it's a new instance. Then it categorizes this as a new instance, because this object, and when I say object I mean the EC2 instance object, does not exist in my Kubernetes etcd. It's a new object. So it creates you an instance, and then you add the finalizer, and this is where you update the status. When you have updated the status, you actually put the instance ID in the status, and you can only put the instance ID there if you have one, and you will only have the instance ID when the instance was created for you. So think of what's happening here.
You created the instance, and then you are updating your status with an instance ID, and this will trigger a new reconcile event, because, and I'm telling you again and again, every time you make an update to the EC2 instance object, it does not matter whether you update a label, whether you update things in the metadata, in the spec, or in the status, the reconciler will start again. This is where it is your responsibility to make sure your reconciler is idempotent, because what's going to happen when you reach the end of the loop is that it's not just going to wait for a new EC2 instance object.
Kubernetes remembers it: while it is running through your reconciler, it marks that this particular operation, updating the finalizer, was an update, so it will rerun the reconciler. It doesn't just stop the current execution; the current one goes ahead. Think of this like a handler in Ansible, if you know about that: it says, this particular step asked me to update my object, the EC2 instance, so I will run the reconciler once again. Then it goes on to the fourth step, and here as well you update the status of the custom resource. The same thing happens: it says, okay, this operation is also updating my custom resource, so I will rerun the reconciler again. So in this case, after your first execution, your reconciler will start again, and this will happen two times, because you have updated the object here and here: two times.
When you update the object, the current execution does not stop. It's not like you reach step three and it starts over; it doesn't happen that way. You will run through the entire reconciler. Kubernetes keeps noting which operations updated the custom resource, and for however many times it was updated, the reconciler will run again. And now it is your responsibility to make it idempotent. Imagine, guys: you
created this instance, you got the
instance ID and when you run it again,
you create one more instance and you update the custom resource; then again you will create a new instance, update the finalizer, update the status, and so on, and you will keep creating instances. This is effectively a forever loop.
And the reason it's a forever loop is that the reconciler has no state. It does not remember that in the last run it already created the instance; it has no memory of what happened previously. So
it is your responsibility that, once I have executed through my reconciler, when I updated the object here, let's say, and I also updated the object here, then when the request comes again to my reconciler, I should be checking the current state: you have to check the current state and check whether it is meeting the desired state. In our case, for the EC2 instance, we have to check that. See, here we update the status, and in the status there is an instance ID.
When you make a new EC2 instance object, it will not have an instance ID, because it is brand new. But once it has run through the reconciler, you create an instance, you add the finalizer, and then you update the custom resource object's status, which now has an instance ID. That is what you can use in your reconciler: you check whether the object in the request coming in has an instance ID in its status. If there is an instance ID, I have already worked with this instance before; I do not need to create a new one, because this object already has an instance ID. Then you work with that instance: see if it is running or stopped, you know, do the drift detection. But at least a new instance doesn't need to be created. And this is essentially what's happening in our loop once we make a couple of updates; in this case, we are updating the status and also updating our finalizer.
Here we reach the end, so we say: okay, I'm done with the reconciler. I would have waited for a new object, but because in the reconciler I did update my status, I'm going to go to the very top of my reconciler and run it again. It's going to start from the very beginning, two times, because your reconciler updates the object twice. This is absolutely important; without this, you would be creating a reconciler that keeps on working and doesn't really stop, or doesn't really know what to do. So to understand this a bit better, let's see how your reconciler can go into a loop and do the same things again and again, and how you can stop that in our controller. So this is kind
of like the request that you give.
Imagine this is your EC2 instance
request that you are giving. You give
things such as your kind. Oh, wait a
second. So you give your kind here. You
define your object's metadata and what
you are also defining is the spec.
That's what you want the instance to be created as. And right now there would be a status field, but it is actually empty, because you are creating this object in Kubernetes for the first time; it will not have any status. It will only have a status once the reconciler has run through its logic, since the reconciler is what updates the status.
So you feed in your object information when you do a kubectl create -f on this object. That is then sent to the reconciler, and the reconciler logic says: I will be creating an instance now. Imagine what's happening here: it goes to Amazon, Amazon creates you a VM, the VM is running, and we get back the public IP and the state of the instance, which we care most about being running; that's why we have a waiter that waits until this VM is running. This is the information we get back from Amazon; it's a very simple description of what we are doing in our code. Once we get this, and once we were able to create the instance, what we then do is update this particular object; in our case, we are updating it for the finalizer.
So we update the object to add the
finalizer here and then we update the
status of that particular object.
Eventually the object will be exiting your reconciler. This whole thing here is kind of like the reconciler; this is your reconciler logic. It makes sense for me to increase the line thickness so you can see it: this is the actual reconciler that's happening. Let me increase the size a little bit there. So the reconciler creates an instance, it updates the object, and this is the output of your reconciler, apart from the Amazon VM that has been created on Amazon: you get your spec back,
and also one interesting thing is this
bit. Now your object has a status
because we updated on the status. It
will also have a finalizer which I have
not added there because I want to keep
it simple. But we have a status now. And
because you updated the object here, and this is very important, you will be passing the same object back to the reconciler.
And then what's going to happen is you
will be creating the instance. And then
what's going to happen? You will be
updating the object. And then what's
going to happen? This is how you will be
reaching a forever running loop which is
then going to be problematic because the
reconciler has no idea that it has already created the object, that it has already created the VM on Amazon; it has no correlation between what it did and what remains to be done, because it has no state. So essentially what you
would be looking for is: once you have made changes to your object, which in our case means this status has been added, I will change my logic a little bit. What I say is: create the instance only if object.Status.InstanceID is blank. If it's empty, it's a new object that the user has created; and when I say object, I mean this particular YAML. If it does not have a status, or at least does not have an instance ID, that means it has never run through the reconciler, and the reconciler has never created an EC2 instance for it; that is when you should create a new EC2 instance. Otherwise, if it is not blank, just skip; do not make any changes whatsoever to the object, because as soon as you make a change to the object, the loop starts again. Any change your reconciler makes to the Kubernetes object will trigger a new execution, a new loop. And that's exactly the sort of idempotency implemented by this particular function. See, we already get the object coming into the reconciler using r.Get; we get the YAML, the spec, and the status of my EC2 instance in this variable, and then I'm checking whether the status field is populated and the instance ID is not blank. You remember now, right? It
will be blank if it was a new object, and for a new object I need to create a new instance. But if the object is not new, it will have a status, it will have an instance ID, and then I'm saying: if the instance ID is not empty, we simply log that the requested object already exists in Kubernetes and we are not creating a new instance, because I have already created an instance on Amazon; that's why you have the instance ID. Nothing more is needed; I simply return a nil and wait for a new update on this EC2 instance object. Nothing to be done. This bit makes our code idempotent.
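The idempotency guard boils down to one check on the status. A minimal stand-alone sketch, with `needsCreate` and `ec2Status` as our own illustrative names:

```go
package main

import "fmt"

// Minimal stand-in for the custom resource's status subresource.
type ec2Status struct{ InstanceID string }

// needsCreate is the idempotency guard: a blank InstanceID means the
// reconciler has never created an instance for this object; a populated
// one means the work is already done and we must not create another VM.
func needsCreate(s ec2Status) bool {
	return s.InstanceID == ""
}

func main() {
	fresh := ec2Status{}                       // brand-new object, no status yet
	seen := ec2Status{InstanceID: "i-0abc123"} // already reconciled once
	fmt.Println(needsCreate(fresh), needsCreate(seen)) // true false
}
```

Because the decision is derived from the object itself rather than from anything the reconciler remembers, the guard works no matter how many times the loop reruns.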
Now here you can be a little fancier if you want; you can complicate things a bit and introduce drift detection. The thing is, imagine you did create the EC2 instance on Amazon. So here is AWS.
You did create the instance and it was
in a running state. You get that back
and you update your status of the object
here. And when somebody will do cubectl
get EC2 instance, they will see the
instance ID, they will see the public IP
and then they will see the status as
well which is running that matches your
Amazon instance. The problem is if you
go outside and you stop your instance,
if you stop your instance, it does not
update the status of your object because
Kubernetes does not know what you did to
your instance outside.
It just doesn't know that. So it could be a feature in your software where you say: if this instance ID is not empty, that means I do have an instance on Amazon; it might be running, it might be stopped, it might be in some other state, I don't know, but I have it. Then I will go to Amazon and check whether it is indeed running or not. So a request goes to the reconciler, and you say: the instance ID is not empty, so I will not make a new instance, but I will go to Amazon and see, for this instance ID, what the state is. You go there, you find it is stopped, and you then update your state from running to stopped. So you can have drift detection: if the instance was stopped there, you also update your status here.
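A drift-detection step along these lines could be sketched like this, with the Amazon lookup stubbed out (in the real controller it would be a DescribeInstances call); `detectDrift` and the lookup are illustrative names:

```go
package main

import "fmt"

// detectDrift compares the state recorded in the object's status with
// the state the cloud reports for that instance ID, and returns the
// actual state plus whether they diverged. cloudState is a stand-in
// for a DescribeInstances call.
func detectDrift(statusState, id string, cloudState func(id string) string) (string, bool) {
	actual := cloudState(id)
	return actual, actual != statusState
}

func main() {
	// Hypothetical: Kubernetes still records "running", but someone
	// stopped the VM from the AWS console.
	lookup := func(id string) string { return "stopped" }
	state, drifted := detectDrift("running", "i-0abc123", lookup)
	fmt.Println(state, drifted) // stopped true
}
```

When drift is detected, the controller would write the actual state back into the status, just as it did after the initial create.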
This is kind of a fancy thing you can do. In my case, I'm keeping it simple: I'm saying the instance has already been operated on, it is there on Amazon and it's also in Kubernetes, so I will not do anything; I will not create a new instance. But you can add drift detection as well, where you take the instance ID, describe the instance, get the state, and update the state; we know how to update the state, we did it here. That is going to be your own little drift detection, and I think it's going to be interesting to build. So if you have followed the course till here, I would really encourage you guys to add this functionality as well, which I'm deliberately leaving out because I don't want to make things too complicated. And since the instance ID was not empty in our case, we already have an object, so I'm not creating a new instance; my program will just back off from here, and it will not create new instances.
So now I'm going to show you a bit of a demo of how all of this looks. This is where we will actually deploy this to our Kubernetes environment, and then let's try to create something and see it in action. I already have this running, but the first thing you can do is make manifests, and you know about this, we already did it when we were building the API: this creates the custom resource definition. Then you can do a make install; it
installs or creates those custom
resource definitions in your Kubernetes
cluster. I already have that, because if I do k explain ec2instances, you can see it knows about my resource, which is in compute.cloud.com; this is the group, this is the version, and these are the fields my EC2 instance can have. I can also do kubectl get ec2instance, and you can see it doesn't say "I don't know what this object is", because I did a make manifests. You can do these things together: make manifests and make install create the CRD and install it on the cluster.
And with what we have built so far, we can run our reconciler using go run cmd/main.go; I'm in the root of my project. Now you can see my reconciler is running. If you go through the logs a little bit, it starts the manager; the manager manages multiple controllers, and you can have more than one worker per controller. That is what you see here: we start the manager, we start the controllers, and then there are a couple of workers. I only have one, but you can read about multiple workers and create more than one if you have heavier workloads; for us, one worker will be enough.
Now to actually create an EC2 instance
on Amazon, let me just quickly open up
my AWS console and show you how would
this look like.
And right now I don't think I have any
instances which is running. That makes
sense because I didn't make any
instances.
Let it load... Okay, let me try that again. All right, that was probably my Tailscale acting a bit weird, but as you can see, right now I do not have any running instances.
Now what we can do: I have an object which looks like the spec you would expect. Here you can see I want to create an EC2 instance; this is the name, this is the namespace, and I've given my t3.medium instance type and the AMI ID that I want to use, which is the Amazon Linux 2 AMI for this region. The region is eu-central-1, plus the availability zone. I already have this key pair, this security group, and this subnet in my Amazon account, because you need these things before you can make a VM. And this is how my instance is going to be created. Now, because we already have our controller running, as soon as I say please create me this EC2 instance, you will see your logs acting up. Let me just do it here so you can see it better.
Now let me do a kubectl create. You see, automatically, as soon as you gave the create instruction, your reconcile loop started; that's the logic we wrote at the very beginning, and that's the log we are seeing: reconcile loop started. This is where you get the object: you create a variable of that type, you get the object, and then you check whether the instance ID is there or not, and whether it has a deletion timestamp. Nothing is there, because it's a new object, and then you will see "creating new instance" and "adding the finalizer"; everything we went through will be happening now. So let's go through the logs a little bit. You can see the reconcile loop was started, then I see the log that it's creating a new instance, and then you add the finalizer.
It's interesting to see this in the object: if you do k get ec2instance, you see the eventual output you're going to get, and there was an instance created on Amazon for me from my Kubernetes cluster. That is exactly what we were working towards: we are able to create an EC2 instance from our Kubernetes environment using the controller that you have just written, and you're able to get the information: the state is running, and the public IP is the same public IP that you see here, 35.159…, and it is the same instance ID. This time I got an instance ID because the VM was created. But what I'm more interested to show you is this thing, which is the finalizer.
You see we get the log on the left side:
it says "about to add finalizer". And
this is the update that we did to our
object; if you look at the code here,
this will make sense. You add the
finalizer, which is
ec2instances.computecloud.com,
and then you do an actual Update call on
the object, and that log is the result
of this update function.
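What the controller does here is essentially what controller-runtime's controllerutil.AddFinalizer helper does under the hood. A minimal self-contained sketch, with plain types standing in for the real Kubernetes ones (this is not the exact course code):

```go
package main

import "fmt"

// The finalizer is just a string stored in the object's metadata.
const finalizerName = "ec2instances.computecloud.com"

// ObjectMeta is a stand-in for metav1.ObjectMeta.
type ObjectMeta struct {
	Finalizers []string
}

func containsFinalizer(m *ObjectMeta, f string) bool {
	for _, s := range m.Finalizers {
		if s == f {
			return true
		}
	}
	return false
}

// addFinalizer appends f if it is not already present and reports
// whether the object changed (an Update call to the API server is
// only needed when it returns true).
func addFinalizer(m *ObjectMeta, f string) bool {
	if containsFinalizer(m, f) {
		return false
	}
	m.Finalizers = append(m.Finalizers, f)
	return true
}

func main() {
	meta := &ObjectMeta{}
	fmt.Println(addFinalizer(meta, finalizerName)) // true: added, update needed
	fmt.Println(addFinalizer(meta, finalizerName)) // false: already there
	fmt.Println(meta.Finalizers)
}
```

Because the second call is a no-op, re-running the reconciler does not pile up duplicate finalizers on the object.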
Then, once the object was created, let's
go forward and see how the logs went. We
say we are creating a new instance, we
are about to add a finalizer, we add the
finalizer, and it says this update will
trigger a new reconcile loop, but the
current one will continue. As I've
already told you, when we do an update,
it gets registered so that we will come
back and run the loop again, but we keep
going; we don't break the existing
execution right there. So you add a
finalizer to your Kubernetes object, and
then you continue with the EC2 instance
creation in the current reconcile run,
and that's where we actually create the
instance.
Once you make the call to Amazon to
create the instance, you can see here
that we call the RunInstances API. The
EC2 instance creation was successful,
and this is where, for the first time,
we get the instance ID; you will only
get the instance ID if the instance was
actually created. Once the instance is
running, we make another call to Amazon
to get the public IP, because you don't
get the public IP the moment you create
the VM; it takes a little time for it to
be populated. So we log "calling the
Amazon describe API to get the instance
details", and this is where we just
print the response; this was some
debugging I was doing with this code.
Here you can see we get the private IP,
172.31.25.250, which is the same as what
you see here; that is my output. Then
there is the public domain name, the
instance IP, the region, and the rest of
the metadata returned by the
DescribeInstances API. You can see the
name of the key pair that you used when
creating this instance, and here you now
have the public IP.
Very important: up to this point we have
made one update to our object from the
reconciler, which was adding the
finalizer. Now we update the status as
well; this is the status update you see
here. In the status update we set the
instance ID, the private DNS, the public
IP, the public DNS, and the state, which
is running. Out of these five things we
are only showing four: the public IP,
the state, the instance type, and the
instance ID. Now, it is absolutely
important to remember that we made
changes to our object, so the reconcile
loop will be starting again. And there
you can see that after you made the
changes to your status, the reconciler
started again. However, this time we saw
"requested object already exists in
Kubernetes, not creating a new
instance". This is where the idempotency
we introduced in our controller kicks
in. It is missing one log, however,
which is a bit misleading: you might
think that because we made updates to
our object twice, this should have run
twice, and that is a fair expectation. I
think it's missing a log, so let's look
at that again.
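The guard that makes that second run a no-op can be sketched like this; the type and field names are illustrative stand-ins for the course's generated API types, not the exact code:

```go
package main

import "fmt"

// EC2InstanceStatus is a stand-in for the CRD's status subresource.
type EC2InstanceStatus struct {
	InstanceID string
}

// reconcileAction is the idempotency check: if the status already
// carries an instance ID, the external EC2 instance exists and we
// must not create it again.
func reconcileAction(status EC2InstanceStatus) string {
	if status.InstanceID != "" {
		return "skip: requested object already exists"
	}
	return "create new instance"
}

func main() {
	fmt.Println(reconcileAction(EC2InstanceStatus{}))                         // first run: create
	fmt.Println(reconcileAction(EC2InstanceStatus{InstanceID: "i-0abc123"})) // later runs: skip
}
```

Without this check, every reconcile triggered by our own status updates would launch yet another VM.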
Now, what might look like a missing log
entry could contradict what I said:
whenever you update an object in
Kubernetes, the reconcile loop runs that
many times. We updated our custom
resource once with the finalizer and
once with the status, so the reconciler
should have run twice. But we only see
the log entry "reconcile loop started,
requested object already exists" once.
We should be seeing it twice, because we
updated the resource twice, but we
don't, and I think it will be a lot
clearer once we see the internals of how
the operator knows there was a change
and what really happens inside.

So let's say you are working on this
Kubernetes resource, our custom
resource, and you do an update, or you
add this resource; whatever you do, you
have triggered a change on this custom
resource. Now, for any change you make
to any resource in Kubernetes, the first
component that gets to know about it is
the API server. This is where you have
your authentication and your
authorization. Once you have gone
through authentication, authorization,
and also the admission controllers and
webhooks, the API server persists your
change into etcd. This is the point
where you have made a commit to etcd,
and etcd is your single source of truth;
that's where you have recorded your
desired state. Now, as soon as the API
server makes that update to etcd, it,
not really broadcasting, but you can
think of it that way, tells everyone:
hey, there was an update to this custom
resource of kind EC2Instance.
And there are many, many controllers in
Kubernetes responsible for different
resources. For example, there is a pod
controller responsible for changes to
pods, a deployment controller
responsible for changes to deployments,
and a service controller responsible for
services. What do they do? They only
react to the resources they have been
programmed for. In our case, we have a
controller that is only listening for
the EC2Instance type of resource. So the
API server tells everyone; think of it
as a broadcast, although it is not
really broadcasting. It sends an event
to anyone who is watching that custom
resource. The pod controller watches
pods, the deployment controller watches
deployments, and in our case, this
EC2Instance controller is watching the
custom resource of kind EC2Instance. So
this update is picked up by this
controller. Think of it as your
controller subscribing to the API
server: it tells the API server,
whenever there is a change to the kind
EC2Instance, tell me. The API server
registers that this controller is
watching and listening for it, and then,
whenever it emits such an event, the
EC2Instance controller that is watching
gets notified about it.
Now, if we zoom into this controller a
little bit, let me draw what's really
happening inside, because this will make
more sense. I just said the EC2Instance
controller gets the event, but what
really happens is that within this
controller, the component responsible
for doing the watch against the API
server is called an informer. Think of
an informer as a piece of software that
opens a long-lived stream to the API
server and always catches these updates:
okay, there was an update, I am now
notified about it. This is part of your
controller: your controller has an
informer, which opens a watch on the
custom resource you are interested in.
Now, as soon as the API server says
there is an update, it sends the event
as well as the object, and the watcher
consumes it. When I say watcher, it is
the informer: the informer gets the
actual update event and the actual
object, which is like the YAML of our
Kubernetes resource. Now this informer
has a couple of things to do. The first
thing it does is store the object that
came with this update event into a
cache. This cache is managed inside the
controller itself; you don't have to do
that. Kubebuilder has already
bootstrapped these things for you via
controller-runtime, whose packages
manage the caches for you. This cache is
where the whole object that arrived with
the update event is stored. The first
thing the informer does is always add it
to the cache.
Then the informer has a couple of things
called handlers, also known as event
handlers or resource event handlers.
Think of these as functions that run
when there is an add operation, an
update operation (maybe you updated your
resource), or a delete. These functions
don't do much; the only thing they do,
once the informer has stored the object
into the cache, is add that object's key
into the work queue, the infamous work
queue we have been talking about. Now,
what's really added to the work queue is
not the whole object of your custom
resource. The thing that is added is the
namespace, and within that the name of
your object, the metadata.name of the
EC2Instance kind. That's what's added.
It does not add the entire object; the
spec and the status are not added there.
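So the enqueued key is just the namespace and name joined together, much like the namespace/name keys client-go produces; a tiny sketch:

```go
package main

import "fmt"

// objectKey builds the key the event handler enqueues: only the
// namespace and name, never the spec or status.
func objectKey(namespace, name string) string {
	if namespace == "" {
		return name // cluster-scoped objects have no namespace
	}
	return namespace + "/" + name
}

func main() {
	fmt.Println(objectKey("default", "ec2")) // "default/ec2"
}
```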
Now, once your handler has added this
key to the queue, your controller has
workers, and here we have the reconcile
loop we have been working with, with a
worker running it. This worker keeps
looking at the work queue, and as soon
as there is a key in the work queue,
added by the resource event handler in
this case, the worker runs the reconcile
logic. And this, eventually, is how your
controller, your operator, gets to know
there was a change and that it has to
run its worker. Now, during our
reconcile run we updated our object two
times. One was for the finalizer.
So look at what's going to happen once
you update the object for the finalizer.
The same process begins from here,
because whether you update your custom
resource yourself or an application does
it, the API server does not
differentiate who made the change; all
it knows is that there was a change, and
that's all it cares about. Now, the
first time, when we added the finalizer
and updated the object, the same thing
happened: the update was sent to the API
server, it was stored in etcd, and then
your controller was notified, because it
has the same machinery. The informer was
there; the informer triggered a handler,
and the handler added this particular
key to the work queue. Let's say the
namespace is default, so the key is
default/ec2; that's my kind's
metadata.name, the name of my object in
Kubernetes.
That happened, and while your reconcile
loop was running, the second thing you
did was update the status: we updated
things like the instance ID, and then
the public IP. Whenever there's an
update, the same thing happens. The only
detail is that there is a single work
queue per controller; it's shared for
that controller.
So the handler adds the key here, and
the same thing happens again. Say you
update the status: this update to the
custom resource is seen by the API
server, which writes it to etcd; then
the informers, which are watching, first
update the cache locally, and then the
handler is called. Now, here is where
the explanation for the single log line
comes in. Your handler is responsible
for adding the namespace and the name of
the object to the work queue. But when
this handler wanted to add it, it said:
I want to add an object in the default
namespace whose name is ec2. The work
queue is smart: it says, an object with
the same name in the same namespace is
already in the queue, so I'm going to do
something called deduplication.
And this is such a powerful mechanism,
because without it you would be running
a reconciler storm. Imagine that while
running your reconciler you made changes
to your object 10 or 20 times; you do
not want to run the reconciler 10 or 20
times. Just one run, because the same
key is already in the work queue; it
will use the latest spec of that object
and be done with it. This deduplication
is handled by the workqueue package in
Go, which we get via controller-runtime.
We don't see it, but that's what is
happening in the background. So we do
not add the key again, or rather, you
can say we add it but it gets
deduplicated.
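The deduplication described above can be modeled in a few lines; this is a simplified stand-in for k8s.io/client-go/util/workqueue, not its real implementation (the real one also handles rate limiting and re-adds during processing):

```go
package main

import "fmt"

// workQueue is a toy deduplicating queue of object keys.
type workQueue struct {
	items []string
	dirty map[string]bool // keys currently waiting in the queue
}

// Add enqueues a key; adding a key that is already queued is a no-op.
func (q *workQueue) Add(key string) {
	if q.dirty == nil {
		q.dirty = map[string]bool{}
	}
	if q.dirty[key] {
		return // already queued: deduplicated
	}
	q.dirty[key] = true
	q.items = append(q.items, key)
}

// Get pops the next key, or reports false when the queue is empty.
func (q *workQueue) Get() (string, bool) {
	if len(q.items) == 0 {
		return "", false
	}
	key := q.items[0]
	q.items = q.items[1:]
	delete(q.dirty, key)
	return key, true
}

func (q *workQueue) Len() int { return len(q.items) }

func main() {
	var q workQueue
	q.Add("default/ec2") // finalizer update enqueues the key
	q.Add("default/ec2") // status update: same key, deduplicated
	fmt.Println(q.Len()) // 1
	key, _ := q.Get()
	fmt.Println(key)
}
```

Two updates to the same object while the key is still queued collapse into one reconcile, which is exactly why we saw only one extra log line.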
Now, once you finish your reconcile run,
look at what happened: you update the
finalizer, then you create the EC2
instance on Amazon (if you haven't
forgotten the flow, this is what happens
in our reconcile logic), and once the
instance is created you update the
status. And once we have updated the
status, we are done; the worker is now
free. The only thing the worker is ever
responsible for is looking at this work
queue as soon as it gets free. When I
say it gets free, I mean the reconcile
loop has run successfully, and now the
worker is looking for any other changes,
that is, whether there is a key in the
work queue.
These are the reasons why the reconcile
loop would start, why the worker would
run the reconciler again. First, there
was a change made by the user to the
custom resource, which is the whole
process I just walked you through: you
make a change, the API server persists
it, the key lands in the work queue, and
the worker sees it and starts. Second,
your requeue interval has elapsed and
it's time to try again; if nothing
actually changed in the custom resource,
the reconciler shouldn't do anything, it
should simply exit. And the third reason
it could run after finishing is that
there is already another key waiting in
the work queue.
It's kind of like this: imagine you are
moving bricks from point A to point B.
There is a brick at point A, your job is
to move it to point B, and the worker is
you. If there is no brick, you wait, and
every 10 seconds you check whether there
is one; there's no brick, okay, you do
nothing. That's the requeue interval:
every X seconds or minutes you look, but
there's nothing there, so you do
nothing. The second case is that your
manager, or whoever it is, places a
brick there and tells you: I have added
a brick. Now you get active, pick up the
brick, and move it to point B. That's
when somebody has made changes to the
custom resource manually.
The third reason you might move a brick
is this: there was a brick at point A,
you were watching, and you are carrying
it over to point B. While that brick is
in transit, another one shows up at
point A. As soon as you finish moving
the first brick, your work is done, and
you immediately look back: there's
another one, so you move it too. In
other words, there was already a key
pending: while you were finishing your
work, someone put another brick there,
and as soon as you are done with the
brick in your hands, you look back, see
another one, and move that. That's the
work queue we are talking about.
As soon as there's a key, your worker
starts again. And this is why we only
see one log entry: when the worker
finished creating the instance and
updating the status, it was done and
about to exit, and then it saw there was
a new key in the work queue. In our
case, the third scenario applies. The
worker had started on a key that was
already in the queue, but something
added another key while it was busy. So
as soon as it was done with the current
run, the instance creation, it looked at
the queue and ran the reconciler once
more, because there was a key waiting.
And this is why you only see that log
once: we already have the idempotency
check that says if status.instanceID is
not empty, we have already handled this
object. This is the idempotency you have
to add to your operators so that they
don't keep running in a forever loop.
We talked about this already, but this
is why you only see it once: even though
you made changes to your custom resource
twice, the deduplication happened in the
work queue and there was only one key.
As soon as the worker takes it, the work
queue becomes empty because the worker
is processing the only available key; it
reconciles, the idempotency check kicks
in, and then there's nothing left in the
work queue. The worker is now waiting;
the worker is happy. From here, the
worker will trigger the reconciler again
if the user makes a change, or when your
requeue interval elapses, that is, if
you have configured your reconciler to
run again after a given interval, which
I think we have here.
Actually, looking at it, I don't think
we configured a requeue interval here.
So here is another small piece of
information for when you write your
reconciler. I'm not sure if I've
explained this already; I think I did,
but let's talk about it again. In your
reconciler you can define a
RequeueAfter. If I show the Result again
here, we can return two things: one is
Requeue, whether you want the operator
to requeue after the worker is done, to
try that again, and the other is the
duration after which to do it. In simple
words, you might want your controller to
run again every one minute, even if
there is no key in the work queue. You
can configure that when returning the
Result: you could set RequeueAfter to,
say, one minute. This is kind of like
turning your operator into a cron job;
you can make it behave like one by using
RequeueAfter. You are saying: I'm not
returning an error, so no retry is
needed; however, after this duration, a
requeue should happen. Imagine you want
to delete the pods that are stuck in a
ContainerCreating error state: you scan
through all the pods, delete the stuck
ones as a cleanup, and rerun the process
after X seconds or minutes. That's like
a cron job, and your operator can do
that as well with RequeueAfter.
So now you have a very good idea of how
the internals of your operator work.
Where is the work queue managed again?
The work queue is managed in the
operator itself, in the memory of the
controller. The informer is part of the
controller, the handlers are part of the
controller, the cache is part of the
controller, the work queue is part of
the controller, and so are the worker
and the reconcile loop; all of that is
what makes up the controller, and that's
important.
Now, about this cache, you might wonder
why it exists at all. What is the reason
for this cache? Think of it like this:
when the API server persists a change,
it sends out the update event saying it
has made a change to the custom
resource, and it also sends the actual
object, the whole YAML of the custom
resource. So it sends an update event
together with the whole custom resource
object, and this event is seen by the
controller. What happens once the
controller sees the event? The informer
inside the controller is the component
that watches for it, and it adds this
whole object into the controller's
cache.
And then, you know what happens: there's
the event handler, which adds the key to
the work queue, and there is a worker
that keeps looking at that queue. As
soon as it finds a key, added by the
event handler, and remember, the event
handler adds only the name and namespace
of the object, not the whole YAML, the
worker starts. Now, the worker needs to
access the spec; that's what we are
doing here. See, I'm checking whether
this object has a deletion timestamp,
I'm reading that object, I'm checking
whether its status has an instance ID or
not. So whenever the reconciler logic
wants to read the object, it does not
read it from etcd; it does not go to the
API server and read from etcd. It reads
it from the cache, which is orders of
magnitude faster than going to the API
server. And this cache is always up to
date: as soon as an add, update, or
delete event comes in, the first thing
the informer does is replace the old
copy with the new one, so your
reconciler always gets the latest state
of your custom resource object from the
cache. That is how it reads the object:
the worker reads the spec, the worker
reads the status; whatever you do with
the object is done from this cache, and
that makes it really, really fast. You
don't have to go out of your process,
out of the controller, to the API server
and then to etcd to read the spec or the
status; it's right there whenever you
need it. So I think this was the right
time to explain why we only saw one run
of the reconciler, not two, and I hope
it's clear to you how the API server
sends an update, how the informer (the
watcher and the informer are one and the
same thing) watches for it, how the
handler adds the key, and how the worker
reads the work queue and then runs the
reconcile logic. Now that you know all
this, it gives me a good opening to tell
you how Kubernetes handles object
deletion. We talked about how it creates
one; we worked through the caches, the
informers, the handlers, the work queue,
how the reconciler runs, and how
deduplication works. So now let's talk
about how Kubernetes handles the
deletion of objects.
This is where the deletion timestamp
comes into the picture. When you have an
object and you run kubectl delete on it,
the reconciler does not know whether you
asked to delete the object, update it,
update its finalizer, or update its
status; it has absolutely no idea. So
how would you tell that a particular
update was meant to delete the object?
How does Kubernetes know the user is
asking for deletion? It knows by adding
something called a deletion timestamp.
See, if you look at my object right now,
it does not have a deletion timestamp;
what it has is a creation timestamp. I
can run this in a loop; let me show you:
kubectl get … and you can see it here. I
can just show you the metadata. This is
my current object, the one I created,
and my reconciler is happy: it knows
this object has already been created and
is on Amazon. You can also run kubectl
get ec2instances and see the public IP
and everything; that's all fine. But
when you want to delete it, the way
Kubernetes marks it as a deletion
operation is by adding a deletion
timestamp, and that is what our program
can actually look for. You can see here
what the program checks.
Here is where you check whether it was a
deletion request. As soon as your
reconciler starts, it doesn't know why
it started: was it an update, or a
delete operation on the object? This is
where you use the deletion timestamp. If
the deletion timestamp is zero, it's not
a deletion request. However, if the
deletion timestamp is not zero, the user
has actually requested a deletion and
the instance is now being deleted. Then
you can call Amazon to delete your
instance and clean up properly, and only
once the deletion is done do you remove
your finalizer. This is how finalizers
are used when deleting an object: as
soon as you request a deletion, the
finalizer holds the deletion of the
Kubernetes object until the actual
external resource has been terminated.
And if you were not able to finish the
cleanup, you do not remove the finalizer
and the object is not deleted from
Kubernetes; you just try again, and once
the instance is gone, you finally let
the finalizer be removed, and the object
actually goes away.
Now, this is what you will see in the
logs. Let me go to the directory and run
delete on this EC2Instance; I hope you
can read this at this font size. What I
essentially want you to look at is this:
as soon as we send a delete request to
Kubernetes, the object gets a new
deletion timestamp, and because that is
an update to our object, the reconciler
starts. The reconciler then sees the
deletion timestamp and knows: I
basically need to delete my instance on
Amazon and then let the object be
deleted from Kubernetes. So if I do
kubectl delete, see what happens. The
object got a deletion timestamp
automatically, my object was updated, my
reconciler started again, and it saw: oh
wait, it has a deletion timestamp. And
you see my object is not deleted yet;
kubectl is waiting, it hasn't returned,
because I have a finalizer that is
holding the deletion. So you see the log
"has deletion timestamp, instance is
being deleted"; then we call the
deleteEC2Instance function, which I'll
show you in a moment. The instance
termination was initiated, and we are
waiting for the instance to be
terminated. If you go here, you will see
the instance is no longer running; it is
shutting down right now. This is the
instance we just created: it's shutting
down, and my object waits until it goes
into the terminated state, like the
other instances that are terminated. We
wait for the instance to be properly
terminated, and once it is, you can see
it's terminated. We're waiting for the
instance to be terminated; let's give it
a little time. The maximum time it will
wait for deletion is 5 minutes. The
controller will then update the object
and remove the finalizer for me.
And eventually everything gets cleaned
up. And that's what happened: I was only
able to delete my object once the
instance was terminated. You see here
"waiting for instance to be terminated":
it was waiting, and once the terminated
state came back, we log "EC2 instance
successfully terminated". And again,
because you changed this object, it is
another update for the reconciler, so
you can see the reconciler has started
again. See what happens, this is very
interesting: any update you make starts
it. So what happened is: you have an
object, you request a deletion on it,
the object gets a deletion timestamp,
and this is seen as an update by the
reconciler.
The reconciler runs and sees: okay, it
does have a deletion timestamp, so I
need to talk to Amazon to delete this
particular instance. And once it is
deleted, I will remove the finalizer and
let the object deletion complete. And
here you're making an update again. Now,
if I go back to our code to see how it
works, you can see we delete the EC2
instance. This is a very simple
deleteEC2Instance function: it logs
"deleting the instance", runs the
terminate-instances call to terminate
the actual instance, and then waits for
it to be terminated. This is quite
similar to how we used the running
waiter: we have a terminated waiter, we
wait for it to actually return
terminated, and then we say the instance
was terminated just fine. If we were not
able to terminate it, if there was an
error, we try again; but in our case
there was no error, which means we
terminated the EC2 instance and the
cleanup has happened. Now I can remove
my finalizer, and this is again an
update. This will again start the
reconciler.
Absolutely uh important. Any update that
you make, it's going to start the
reconciler.
So you remove the finalizer using controllerutil from controller-runtime: you say please remove this finalizer, and then you update the object. And as soon as you update the object, Kubernetes says: cool, I will go back and run the reconciler again. At this point the instance is terminated and the finalizer is removed, and we go back to the beginning of our reconciler; that's what you see now in the logs once we were able to terminate the instance.
Because we updated the finalizer at this location, the reconcile loop started again for that particular object. But here is something quite interesting: between the time when you remove the finalizer, which is registered as an update to the object, and the new run of the reconciler, your object in Kubernetes is actually gone. It's deleted.
Think about it: the custom resource for which you removed the finalizer has a UID, because every resource in Kubernetes has a UID; say its UID was abc. You removed the finalizer, which is registered as an update, and in that window your object was actually deleted. But the reconciler says: because you updated, I'm going to start the reconcile again from the beginning, and I will start it for the object whose UID is abc. It's not a new object, because you updated an object which existed and then did not exist anymore; the reconciler does not know that it is deleted. It just says: I'll start the reconcile again for the same
object. And this is where you have to tell the reconciler that even if you are starting again, if you try to get the object, you will get an error, because no object exists with the ID you are asking for. But if the error is "is not found" (you know, like when you do kubectl get pod xyz and you get back that the pod is not found, it's that kind of error), simply say: okay, it was a cleanup, I will not do anything, and I'll just wait for the next request. And that's what's happening in our logs. It started again: you removed the finalizer, the object was deleted, the update was registered, and the reconcile was started for the same object, but it is gone now. And that's where we are handling it; otherwise the reconciler would have kept saying "not found, not found, not found". We want it handled by saying the instance was deleted: if you cannot find this object, it's okay, because "not found" is exactly the error you get. And that is the entire end-to-end functionality of our controller.
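The not-found handling described above is typically the first few lines of Reconcile; a sketch (apierrors is k8s.io/apimachinery/pkg/api/errors, and `EC2Instance` stands in for this project's type):

```go
// Fetch the object for this request. If it no longer exists, this
// reconcile was triggered by our own finalizer-removing update and
// the object is already gone: nothing to do, don't treat it as an error.
instance := &EC2Instance{}
if err := r.Get(ctx, req.NamespacedName, instance); err != nil {
	if apierrors.IsNotFound(err) {
		log.Info("EC2 instance object deleted, nothing to reconcile")
		return ctrl.Result{}, nil
	}
	return ctrl.Result{}, err // a real error: requeue and retry
}
```

controller-runtime also ships a shorthand for exactly this case, `return ctrl.Result{}, client.IgnoreNotFound(err)`, which swallows the not-found error and returns any other error as-is.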
By this time you know how to create resources on Amazon. Your controller is idempotent, so it doesn't keep running the loop again and again and again. It can handle the termination of instances. It can handle the finalizer as well: only when you remove the finalizer will it let Kubernetes clean up the object. And it quietly ignores the case where an update was registered, the rerun of the reconciler happens, but the object is gone. We handle that by checking, when we try to get the object, whether the error we get is "is not found". Kubernetes has a package called errors where all these errors are defined; you see IsAlreadyExists or IsNotFound.
Sometimes you create an object and Kubernetes says it already exists; you can use these error types in your program. I'm using the one called IsNotFound, and this is what it says: IsNotFound returns true if the specified error was created by NewNotFound. So you are trying to get an object, but you get an error because it doesn't exist, and that error is "is not found". That's what we are doing: if it's not found, simply say okay, the instance is deleted, no need to reconcile, and wait for new objects to come in. This loop will now only run when a new object of the EC2 instance type is created or updated. Let's try that again: I will ask for a new instance, because I no longer have this object in Kubernetes, and it's going to create a new VM for me. So let's do that and you will see this in action again.
Create, and you see on the right side the reconcile has started, because there was an update, the create of our instance, and the same thing happens: we see that this object is new, so we create the instance. We add the finalizer, and once the finalizer was added I call the Amazon API to get me an instance. I describe the instance, get the public IP, and I update my status.
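That status write is its own API call against the status subresource; roughly like this (a sketch; the status field names `InstanceID` and `PublicIP` are assumptions based on what the course stores):

```go
// Record what we learned from AWS on the custom resource's status.
// This update also retriggers Reconcile, which must then notice the
// existing instance ID and do nothing (that's the idempotency part).
instance.Status.InstanceID = instanceID
instance.Status.PublicIP = publicIP
if err := r.Status().Update(ctx, instance); err != nil {
	return ctrl.Result{}, err
}
```

Using `r.Status().Update` rather than a plain `r.Update` is deliberate: it writes only the status subresource, so it cannot clobber the spec the user wrote.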
You go back to Amazon and you will now see that there is an instance running. There we go. This is the instance which is now in the running state; we had a waiter that waits for the instance to reach the running state. And this is the same public IP and the same instance ID that we see in here. Now, I want to show you something, which is again what's happening here: we do get the object, and as soon as we update the status, the reconcile starts again and says the requested object already exists, because you have an instance ID, so I will not do anything.
Try the deletion again just to see it's working. I do a deletion; this time the object gets the deletion timestamp and we clean up the Amazon instance. Then we remove the finalizer, the object is removed, and we handle the next run of the reconciler by saying: the object was deleted, it's okay, nothing needs to be done. So you see it got started again. It found the deletion timestamp, which means the instance is being deleted. See, you will not have a deletion timestamp if you do not delete an object; I showed you in the previous examples it only had a creation timestamp. But as soon as a deletion timestamp is added, that is a cue that you are trying to delete the object, and that's what we're looking for: if your object has a deletion timestamp, it should be terminating. And there you can see my instance state changing from running to shutting down, which is what we saw
to shutting down, which is what we saw
before as well. It will wait for some
time for that to be terminated because
we're using a waiter. Golang uh the Go
SDK for for Amazon. lets you wait using
these waiters. Instead of you waiting
for two seconds, then pulling again,
then pulling again, there's a waiter
that lets you do it very very easily. Or
you can also do uh kind of like long
polling if Amazon supports it. I'm not
sure, but you can do a periodic polling
that you wait for 5 seconds up to 5
minutes and then you pull if the
instance state is terminated and blah
blah blah. But you see here my object is
holding the deletion or Kubernetes is
holding the deletion of my object. I'm
waiting I'm waiting for the finalizer to
be removed and the finalizer will only
be removed when the object is is cleaned
up because of our logic in here. We wait
for the object to be deleted. If the
object was properly deleted then only I
remove the finalizer. Otherwise, I just
send an error and I go back to the
beginning and try that again. Try the
deletion again. And because I would be
able to terminate my instance, it's
waiting a little bit longer. And these
are um this is the beauty of using a
waiter.
It's not like you are waiting a dedicated 5 minutes: using a waiter with the Go SDK for Amazon, if the instance is terminated earlier, you don't wait the entire 5 minutes. It's a more efficient way of waiting for resources. And then you can see it is now terminated successfully. So: you remove the finalizer, which registers an update; your object is deleted; that registered update starts the reconcile loop again; and then you say cleaned up, no need to reconcile. This is the beauty of Kubernetes operators. What this shows you is that you are able to manage your Amazon instances right from your Kubernetes environment, and that is what this course was about. You can make this fancier: as I said, you can add a bit of drift detection. In my case I didn't, because I want you to build it in your own program, where you can say: if the instance is stopped on Amazon, update the status of the Kubernetes object from running to stopped, clean the public IP, and all that. But this is a very good example of using Kubernetes as a platform to manage other platforms. You can use Kubernetes not just as a destination platform for your applications; you can use it as a platform to manage your resources on any other platform, which is the beauty of Kubernetes, and extending it with operators is how you do it. So this was the entire demo, and this was the entire code, which I want you guys to try yourselves. Now let's see how you can package this properly with Helm and how you can actually run it inside of Kubernetes, because right now it is running on my local computer, using the KUBECONFIG environment variable to connect to my Kubernetes. Let's package this using Helm, see how you can ship it and run it inside of Kubernetes, and let's get started there. So now that we
have seen our application, our controller, running end to end and able to create the EC2 instances, let's look at how it is running right now. There's my computer; I've installed Go there, and I'm running it with go run. Here's where the reconciler is running, connecting to the Kubernetes cluster using the KUBECONFIG environment variable. There are a couple of other environment variables as well, the AWS access key and the secret key, which are used to authenticate to Amazon and eventually create an instance; from there we are able to get ourselves an instance, and this gets reflected in my Kubernetes cluster. The thing, however, is that you're not going to run this application here. Essentially, an operator runs inside your Kubernetes cluster: it runs as a generic pod that has access to the credentials needed to talk to your Amazon environment, and this pod runs in Kubernetes with a service
account. Now, it's quite important to understand how RBAC plays a role whenever you are writing an operator. Imagine a namespace, give it any name, and then another namespace under which your operator is running. Let's say the first namespace is called bidding: you have a team whose name is bidding, they do bidding for clients, and they are using your custom resource, the EC2 instance. So they create an EC2 instance object. Now, for your operator to know that in this bidding namespace there has been a change to an EC2 instance object (because your operator listens for these objects and their changes), you actually need to give access to the operator pod, which is running with a Kubernetes service account. You need to give the service account access in this namespace to be able to list, get, and so on: the basic Kubernetes RBAC verbs. You need the service account to have access to these namespaces for the object called EC2 instance.
And you will need to give this service account access both to read this object and also to write to it, because you need to update the status of this EC2 instance.
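In a Kubebuilder project, this read-and-write access is declared with RBAC marker comments above the Reconcile method, and `make manifests` generates the matching Role/ClusterRole YAML from them. A sketch (the API group `compute.example.com` is a placeholder, not this project's actual group):

```go
// RBAC markers for the EC2Instance custom resource: read and write
// the objects themselves, their status subresource, and their
// finalizers. Kubebuilder turns these comments into RBAC manifests.
// +kubebuilder:rbac:groups=compute.example.com,resources=ec2instances,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=compute.example.com,resources=ec2instances/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=compute.example.com,resources=ec2instances/finalizers,verbs=update
```

Because these generate a ClusterRole, the operator can watch EC2 instance objects in every namespace, which is exactly the cross-namespace access being described here.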
So both of them are needed, and this is how your operator will be able to manage this namespace, or at least manage the EC2 instance objects in this namespace: go to Amazon, create an instance there, and eventually update the EC2 instance status, giving the bidding team the public IP of the instance which was just created for them. For this operator to actually run inside the Kubernetes cluster, we need to build a Docker image. This is a no-brainer; you saw it coming miles ago. We need to build this image, and you will be pushing it to a repository. From there you will create a deployment in Kubernetes that uses this image, and then you will deploy this pod, which is the operator pod. You also need the credentials here: you need the AWS access key and you also need the secret key.
You will then also need to create a secret in Kubernetes, reference the secret in this particular deployment, and roll out that pod. So eventually the pod has the logic for creating our instances and managing them on Amazon, and it will also have the right authentication artifacts needed to talk to Amazon. For building your image, there is a Makefile available from Kubebuilder, which is very simple. So let's see how this works. This is the Makefile in the project, which comes from Kubebuilder. The first thing you will need to change here is the URL of your image: where you want Docker to tag the image and eventually push it for you. For me it's my Docker Hub repository, and I think I'm keeping it public, so anybody who wants to use it can. And this is the Docker
image and it has lots of targets
available. You have already used this Makefile for creating your manifests: when you updated the API spec you had to regenerate the manifests, creating the custom resource definitions. You also have some targets for testing your application, and some linters available. And here's where things get interesting. Instead of what we have been doing, go run cmd/main.go, we can also say make run; it's kind of like an alias. It generates the manifests and the boilerplate code, formats your Go code, and also runs go vet against your code. What we are looking
for is the docker build. It has a target called docker-build. Essentially, all it does is run your container tool; in my case it's Docker. So it runs docker build --tag with the image tag that I declared above, and it builds me a Docker image with that particular tag. Then I can do a docker push to push my image to a registry. Now, it goes without saying that your Kubernetes cluster will need access to this container registry, because without that it will not be able to pull the
images. You can also build images for multiple architectures. Right now I'm only building for ARM, because I'm running this on a Mac and my Kubernetes cluster is also running on the Mac, so it's all ARM for me. But you might build this on your Mac and want to run the operator on an AMD64 machine. You can use the docker buildx target to build it for different platforms, generate a single manifest, and deploy it there. This Makefile makes it very simple to build your images for the platform you are running on, or cross-platform as well. So
let's do that. I will run make docker-build, and what this does is build the Docker image from this repository, the EC2 operator. This is where it takes a little bit of time. You see it's building for Linux, but for the arm64 architecture, and this is where our source code is compiled into an executable binary which is going to be called manager. Let's wait for that to finish, and once it's done you can see the Go version we are using is 1.23.
If you want to see the Dockerfile, it is very minimal. You use golang:1.23 as your builder; you copy your go.mod and go.sum, set up a working directory, copy your api and internal folders and your main.go, and eventually build your manager, because this is the binary that runs your controller manager. Then, from the distroless base image, you just execute this manager binary, which you have built with Go, as user 65532. It's a non-root user, which is a good thing: you almost always want to run your container images with a non-root user, for security reasons. Once the image is built, I can simply say make docker-push, and this pushes my image to the registry. A few layers were already pushed from when I was preparing this course, and now it is pushing your image to the registry. If I want to see this, let's go to Docker Hub. Can I see that here?
Of course. So hub.docker.com, and there will be the image. Search for my username and there will be a couple of images I have. What was the name of the image? EC2 Kubernetes operator. Here it is; this one only has the latest tag. You can have a CI/CD pipeline: if you are storing your code in GitHub, you can use GitHub Actions to always update your images in case your API spec changes, your main.go changes, or your internal folder, which contains the actual controller logic, is updated. You can trigger a new build and then trigger a new deployment.
With that aside, we have our image. However, the other artifacts are still needed: we need a deployment, we need the secret for this deployment to work, and we need a namespace. So building the image wasn't the hard part, because you still need quite a few resources here. You also need the Kubernetes artifacts for role-based access control: you need to give the service account running this pod access to the EC2 instance resources at the cluster level, because it should be able to work in any namespace, at least for this object. So this RBAC is also required. You should see where we are going with this: we need something to ship this application to other customers, and that's what we will be using Helm for.
You can create a Helm chart, and one of the things that I wanted to do with this course is a Helm chart which shows an end-to-end delivery of this application. You could do that yourself: the helm create command makes it very simple to scaffold a Helm chart. Then you would update your deployment to set the environment variables from the secret, and you would create the secret. It's simple. However, there are two ways in which Kubebuilder can actually help you. The first
one is make build-installer. There's a target called build-installer, and what it does is read your Makefile, take the image that you have configured there, and generate a file called dist/install.yaml.
And if I show you what this file looks like (it's a new file), you see it has all of the artifacts needed for your application to be deployed in Kubernetes. It creates a namespace called ec2-operator-system. Then it has the custom resource definition, which is our EC2 instance. Then it has the service account which is going to be running our actual pod. Then you have a couple of roles, some cluster roles, and some cluster role bindings; among other things, they let you create, update, and delete the resources in this API group.
Kubebuilder really helps you bootstrap your deployment strategy: with this Makefile target you can create a single deployable unit. And here's the important thing: it gives you a service as well as a deployment. You can see it's using the image that we just pushed, and it has a liveness probe and a readiness probe. In our program, we did not create an endpoint at /healthz; we do not have a liveness probe and we don't have a readiness probe, because I wanted to keep it simple and focus on the operator. So you will probably remove them: get rid of the liveness probe, here, and then get rid of the readiness probe. When you are actually deploying this for production, though, these things are really good to have, so you can check the health of your operator.
Now, this container also needs some environment variables. So you add an env section here: you can have the access key environment variable and the secret key environment variable, coming from a secret called aws-credentials, and then at the end of the file you can append the Secret itself, apiVersion v1, kind Secret. There you go; some random data is being spilled, but that's okay.
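The appended pieces might look roughly like this (a sketch; the secret name `aws-credentials`, the standard AWS variable names, and the namespace are assumptions, and the Secret values must be base64-encoded):

```yaml
# Environment variables on the manager container, read from a Secret
env:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: AWS_ACCESS_KEY_ID
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: AWS_SECRET_ACCESS_KEY
---
# The Secret itself, appended at the end of dist/install.yaml
apiVersion: v1
kind: Secret
metadata:
  name: aws-credentials
  namespace: ec2-operator-system
type: Opaque
data:
  AWS_ACCESS_KEY_ID: <base64-encoded access key id>
  AWS_SECRET_ACCESS_KEY: <base64-encoded secret access key>
```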
This gives you a complete deployable YAML file which you can just kubectl apply -f to deploy the application. However, there's no version control over this file the way there would be with a Helm chart, and somebody who wants to deploy this controller needs to know the lower-level details: where to create the secret, where to put the AWS access key and secret key, where to update the controller parameters in the deployment. So that is still a problem, and that's where Kubebuilder helps you further, instead of giving you this one big file as a single deployable unit. You can use the kubebuilder edit command and tell Kubebuilder that you want to use the plugin helm/v1-alpha.
Essentially this generates a Helm chart to deploy your operator. And if I do that, you can see "generating Helm chart to distribute the project". We don't have any webhooks created, which we discussed in the beginning, so it doesn't generate those; but it gives you all the role-based access controls and all the templates for your deployment, your service account, your services, everything, in the dist/chart folder. And this is how your Helm chart looks; it was created here. The name of your chart is ec2-operator, which is also the name of your project. And
once you have this, you can look at the templates, where the actual Kubernetes constructs are created. You have one for the CRD, the custom resource definition, which is essentially what we want to deploy. We also have a cert-manager template, which gives us an issuer and a certificate; this is part of the webhook setup, in case we were using any webhooks and wanted cert-manager for that. Here's where the interesting part is: this is the deployment of your manager, and this is where you use a values file to define the values for your resources.
You can also see that if you have metrics available from your operator, in case you are exposing them to Kubernetes, there is a service for them, you can create network policies, and you can have service monitors. And then there's a plethora of RBAC rules which are created for you. So this makes it very simple to ship your project without running helm create and building all these resources yourself, letting you easily control the behavior of your operator through a single Helm chart with the values
file. So, the values file shipped with the Helm chart we just built with Kubebuilder controls the deployment of the controller manager that ships with my operator. It defines how many replicas of my controller manager I want and where the image comes from; this is what you pushed with docker push, so let's take that image and change it here. Then there are a couple of arguments available to your controller manager. The first one is --leader-elect. This is something you would use when you have leader election, running multiple replicas of your controller; in our case we are just running one, so it doesn't do much for us. And there are two more arguments, the metrics bind address and the health probe bind address, which we will talk about
in a little bit. This is a standard
Kubernetes concept where we define the CPU and memory limits the application needs. And here's where it is really interesting. Usually, when you are building an application, it is your responsibility to run an HTTP server inside it if you want to use the HTTP GET type of liveness or readiness probe, and it is your responsibility to create the endpoint, in this case the health endpoint, and the ready endpoint as well. Usually you write the application, make these API endpoints available, and then tell Kubernetes: check my application on this port and this path, and see if you get a 200 response within this period and after the initial delay; if so, my application is live, my application is ready. Otherwise, do what you need to do when an application fails its readiness or liveness probe, which is either stop sending traffic to it or kill the container and redeploy it. We did not create any sort of health or readiness API ourselves, and that is the beauty of the operator framework we are using, Kubebuilder: these endpoints are already available in your controller, so you can use them right out of the box with the readiness and liveness probes. And this is where I'm configuring that my health probe bind address is any address in the container, on port 8081, which is where my health and ready probes are served. It's also important to understand that you already get some metrics out of the controller manager when you use Kubebuilder to write your own operators. You don't have to implement the logic of exporting metrics from your application; it is done for you by Kubebuilder. Of course, it's a limited set of metrics, which we will explore, but it makes total sense from the controller's point of view: whether it's working properly or not, how many times it has reconciled, how many times it has failed, how many times the reconcile loop was successful. All of that is right out of the box for you to use in your operator.
Then there are a couple of security-related contexts, because we don't want to run our container as the root user, and the service account name we want to use. And here's where things get interesting.
See, when you are working with Kubernetes, you usually, in the same cluster (let's say this is our Kubernetes cluster), create a namespace for your operator, and then there are customer namespaces. Let's say I call the operator namespace ec2-operator; this is where my operator usually lives, running as a pod. And over in a customer namespace is where I will be creating my object, the EC2 instance. Now, if a developer creates this object in their namespace, my controller should be able to react to that change, because that's the object my operator is listening for; in fact, for any number of namespaces anywhere in the cluster, if an EC2 instance is created, deleted, or updated, my operator should be able to see that change. And this is why my operator pod runs with a service account: I need to give this service account a role and a role binding, or a cluster role and a cluster role binding, which allow the service account to list, get, update, patch, watch, and delete, to see the changes happening on this particular Kubernetes object. That is absolutely important.
Otherwise, you will only be able to create your instances in the same namespace, and that's not what we want to do. A common pattern for Kubernetes is that you create your operator in a dedicated namespace and let users use that operator from their own namespaces, by creating the object the operator is listening for. And that's what we are doing here: we want to enable all the role-based access control needed, which is again coming from the templates and RBAC here. All these RBAC roles, role bindings, cluster roles, and cluster role bindings are required for my operator, running with this service account, to be able to list, get, patch, and update all those Kubernetes constructs through the API, and I want that to be allowed. Otherwise it would be you who has to figure out which roles and role bindings to give to my operator, and what to do at the cluster level to allow it access to the EC2 operator resources across namespaces.
So this Helm chart from Kubebuilder really makes it simpler for you. You can also control whether you want to enable the custom resources: this Helm chart does not just deploy the controller, it also deploys the custom resource definitions for you. Here's where you can set enable to true, saying yes, I want to deploy the custom resource definition as well, and I want to keep it in case someone does a helm uninstall of my chart. See, you will be using this Helm chart to deploy this operator. Now you might decide to uninstall, but what to do with the CRD? Would you like the CRD to stay available, so that somebody could still deploy an operator manually, creating a deployment, and at least your cluster would understand the custom resource definition? Or do you want to clean it up as well? This is the flag that decides whether it keeps the custom resource definition or deletes it.
There are also metrics available. As I said, the operator that you have written with Kubebuilder comes with pre-built metrics. We will explore these metrics, and you can say that you want them to be exposed, that is, accessible from outside the pod. For that, the chart creates a Service in the namespace. So if I go to my templates and show you the metrics template, here is what it is doing: if the metrics value is enabled, all it does is create a Service-type resource in Kubernetes, where the port on the Service is 8443 and the target port is also 8443. However, in our case the metrics endpoint is listening on 8080, so I will change the target port here to 8080. This will create me a Kubernetes Service listening on port 8443 and forwarding to my pod on port 8080.
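The rendered Service would look roughly like this (the name and selector labels are assumptions; only the port mapping comes from the narration):

```yaml
# Hypothetical rendering of the chart's metrics Service template.
apiVersion: v1
kind: Service
metadata:
  name: ec2-operator-metrics        # assumed name
  namespace: ec2-operator
spec:
  selector:
    control-plane: controller-manager  # assumed label from the scaffolding
  ports:
    - name: metrics
      port: 8443        # port exposed by the Service
      targetPort: 8080  # port the controller's metrics endpoint listens on
```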
This is used by the Prometheus ServiceMonitor. Again, my cluster does not have Prometheus installed, but if it were installed, enabling the Prometheus key would create a ServiceMonitor, which then uses this particular Service to reach the pod and scrape the metrics from it, to show them in Prometheus. You can then have a dashboard on top of that, using Prometheus as a data source in your Grafana. Pretty straightforward stuff. And here is where we control whether we want cert-manager injection for our webhooks or not. Right now we are not working with any webhooks, so I am just going to keep that disabled. I am also not using any network policies, so I will disable those as well. You probably want to enable this if you want the metrics to be scraped only by a Prometheus running in a certain namespace; you can use network policies to control that behavior.
Now, once you have this, we can deploy this Helm chart, but it is missing one thing. See, your pod is responsible for going to Amazon and creating resources there, which happen to be EC2 instances. You need access to Amazon; in other words, you need authentication.
Now, when we were doing this locally, which I explained to you, I had my Amazon environment variables already exported, but right now my pod does not have them. The code reads them from environment variables, but I also need to set them in my pod. So you have to set environment variables for the AWS access key and secret key so that you can authenticate to Amazon. That is what I had already done in my shell: if I show you env for AWS, you can see my access key ID and my secret key, which, by the time you are watching this, I will already have disabled, because there is no way I want those keys exposed publicly.
Now, there are two options for us to pass these environment variables into our controller. The controller is created by a deployment, which happens to be here. This is the deployment responsible for deploying our controller; it uses the image that we have given, which runs the manager command set by the Dockerfile when we build the container image. And here is where we can define some environment variables. Option one: you can create a secret. So I would create a file, say aws-secret.yaml, of kind Secret. You can see it creates a secret called aws-credentials in that namespace, with type Opaque, holding your AWS access key and secret key; of course you will put them in as plain text here. And then you can reference that secret, for example here, for the access key ID and the secret access key.
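Sketched out, that Secret and the matching reference from the deployment's container spec might look like this (the secret name, key names, and variable names are assumptions, not copied from the repo):

```yaml
# aws-secret.yaml — holds the AWS credentials (values are placeholders)
apiVersion: v1
kind: Secret
metadata:
  name: aws-credentials
  namespace: ec2-operator
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "<your-access-key-id>"
  AWS_SECRET_ACCESS_KEY: "<your-secret-access-key>"
---
# Excerpt of the controller deployment's container spec:
# pull both values from the Secret instead of hardcoding them.
env:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: AWS_ACCESS_KEY_ID
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: AWS_SECRET_ACCESS_KEY
```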
That is one way of doing it, and it is probably the better way, because you keep your sensitive data in a Secret, you use that Secret in the deployment, and then you deploy the application. Eventually it gets the sensitive data, the access key ID and the secret access key, from your Secret, and the code runs fine. The other way, which I am going to use here, is quite wrong; we should not do it, but for the demo I am doing it anyway. The deployment created by the Helm chart reads environment variables from values under controllerManager.container.env, so I can actually set them right there: this is my AWS access key ID and this is my AWS secret access key. Of course, this is something you would create a secret for and then reference, which I just showed you, but to keep it simple I am showing you that there is another way, which is a bad way. You have been warned: be very careful about controlling your access key IDs and secret keys. You should never put them in plain text in your code, in your Helm charts, or in your values file. Never. By the time you are on a journey to build an operator, you probably already know about the External Secrets Operator project, and that is what you would use to read these secrets from a secrets manager like Vault or Google Secret Manager; it has integrations with other backends as well.
Now, once we have this, once we have our controller, we have defined the number of replicas I want, the environment variables, the right image repository, the liveness probe, the readiness probe, and everything else, it is time to deploy this Helm chart to our cluster. For that I can just go to this chart, because that is where it was created, and this is my values file. Let me just check whether I have any errors somewhere in the file. Probably not; it looks good to me. This is the range, again, and this is the end of the if condition. Looks good. So let's do it now with helm install. First, I have no EC2 instances. I already have the custom resource definition, because I was trying this Helm chart earlier, but let's delete that as well. Delete the custom resource definition; that is deleted, and now my Kubernetes cluster does not understand the resource anymore: it could not find the requested resources. And now let's do it. So: helm install, install my ec2-operator Helm chart, which is the name of my chart, and create the namespace for me. The namespace is named ec2-operator, which already exists, but if it is there the flag does nothing, so there is no problem; it is idempotent, in a way. Then comes the values file, and dot is the Helm chart that I have just created.
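Put together, the install command narrated above would look something like this (the release name, namespace, and values file name are assumptions taken from the narration):

```shell
# Install the operator chart from the current directory (.),
# creating the target namespace if it does not exist yet.
helm install ec2-operator . \
  --namespace ec2-operator \
  --create-namespace \
  --values values.yaml
```

The --create-namespace flag is safe to leave in place: if the namespace already exists, Helm simply proceeds.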
Now let me look at the pods, and you can see the pod is running, using our image. If I describe that pod, we can see we have the liveness probe, we have the readiness probe, and we have the environment variables as well. And if I do k get pods, I can see it is running on this particular node. Now, I am using k3d and I have some nodes available: there is one control plane node, the master, and then there are two worker nodes. All of them are actually Docker containers. If you remember, when we were setting up the development environment we used k3d, and if I do docker ps and grep for agent-0, you can see k3d-ec2-operator-agent-0. That is the same name as the worker on which this pod is running. What I want to show you is, if I exec into it
if I exec uh into
uh docker exec husband it uh sh if I
exec into this container
um I can get the IP of my pod and I can
say port 8080 / health. It doesn't have
cur but if I do wget um that also fails
saying on this IP port 8080 there's no
health. Let me check what was the
endpoint for my health checks
for the health probe. It was 8081.
That's correct. So I need to look on
port 8081 / health to find out if my pod
is healthy or not. And uh it says health
already file exists. So let me do a
little cleanup. Uh health ready metrics
cuz I was trying this before. So it was
already there. Let's start from the
beginning.
I want to see whether there is a health endpoint on port 8081 inside my controller pod, and you see the file health is saved. If I cat health, it says ok. This tells me my health probe, the liveness probe, is working fine, because on this port I have this API endpoint and it returns me a value of ok. The same goes for the ready endpoint; maybe let me increase the font a little bit. Here you can see the file ready is saved as well, and if I cat ready, that is also ok. So on both of these endpoints, for my liveness probe and my readiness probe, on this port, I get an ok. These endpoints were created for us by controller-runtime so that we can do health checks.
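From inside the node container, the checks narrated above boil down to something like this. The pod IP is a placeholder, and the paths follow the narration; note that the Kubebuilder scaffolding's defaults are /healthz and /readyz, so check which paths your main.go actually registers:

```shell
# Probe the controller's probe endpoints from inside the k3d node container.
# POD_IP is whatever IP your controller pod got (see: kubectl get pod -o wide).
POD_IP=10.42.0.15                    # placeholder

wget "http://$POD_IP:8081/health"    # liveness endpoint; wget saves the body to ./health
cat health                           # should print: ok

wget "http://$POD_IP:8081/ready"     # readiness endpoint; saved to ./ready
cat ready                            # should print: ok
```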
It is also interesting that we have some metrics as well, on port 8080. So if I do wget on port 8080 at the metrics endpoint, you see there is something available on this endpoint too. And if I do less on metrics, these are all the Prometheus metrics that are already built in and exposed by the application you have built. This is not something you wrote; it is already given to you by the controller runtime. And here you can see controller_runtime_reconcile_total: how many times the controller has reconciled and resulted in an error, and how many times it reconciled and resulted in a requeue.
You can use all of this information, such as how many errors you have had in total so far, to show a dashboard of how your operator is doing. With these metrics, if you see that the errors are going up, you can make changes to your operator, changes to your code, and eventually end up in a better state than before, because you have metrics, you have insight into how things are going. You can even do your own code instrumentation for additional metrics; for example, you can report how many EC2 instances have been created by this particular operator. Wiring Prometheus into your code and exporting those metrics in a format Prometheus understands and can scrape is a different topic altogether, but if you know it would be nice to have this instrumented in your code, you can export how many EC2 instances were created or deleted, so you can know how much people are using your operator.
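As a taste of what such instrumentation could look like, here is a minimal Go sketch using controller-runtime's global metrics registry; the metric name and the place it is incremented are my assumptions, not code from this course:

```go
package controller

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// ec2InstancesCreated counts EC2 instances this operator has created.
// The metric name is a made-up example.
var ec2InstancesCreated = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "ec2_operator_instances_created_total",
	Help: "Total number of EC2 instances created by the operator.",
})

func init() {
	// Registering with controller-runtime's registry makes the counter
	// appear on the same /metrics endpoint as the built-in metrics.
	metrics.Registry.MustRegister(ec2InstancesCreated)
}

// Then, in the reconcile path, after a successful create call:
//     ec2InstancesCreated.Inc()
```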
So it looks like my pod is running, my pod has the right health endpoints, it has the right liveness and readiness probes, and I have the metrics available. But now I want to create an instance, because that is what it should be doing. Fine, everything is happy; but is it really doing what it is supposed to do? So let's try it. I am going to look at the logs of my EC2 operator. You can see it is starting the workers, it is all healthy, it is waiting for my resources. Now I am going to do the same thing I did when we were running this outside the cluster, on our computer: I am going to create an EC2 instance, which looks something like this. This is my EC2 instance; I give the instance type, the AMI ID, the region I am using, the availability zone, and the key pair. This is something we had already used before, but now I want the operator running in my Kubernetes cluster to create this instance for me, because eventually that is where you will be running it: inside the Kubernetes cluster. This is the moment of truth, what we have been working towards so far. To keep it simple, you can see in the AWS console that I do not have any instances running; there is nothing there. So let's do it, and I will do a create.
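The manifest being applied looks roughly like this; the kind and the listed fields follow the narration, but the API group, version, and exact field names are assumptions you should match to your generated API:

```yaml
# Hypothetical EC2Instance custom resource; adjust apiVersion and field
# names to whatever your kubebuilder-generated types define.
apiVersion: compute.example.com/v1
kind: EC2Instance
metadata:
  name: my-instance
  namespace: default
spec:
  instanceType: t2.micro          # EC2 instance type
  amiId: ami-0123456789abcdef0    # AMI ID (placeholder)
  region: us-east-1               # AWS region
  availabilityZone: us-east-1a    # availability zone
  keyPair: my-keypair             # SSH key pair name
```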
Now, as soon as I did that, this output should look familiar to you. This is where we got a request; it says the request is new, so we are creating a new instance. We add a finalizer, then we go to Amazon and wait for the instance to be running, and you can see the instance is already created. If I list the EC2 instances in the default namespace, here you can see it. Can I ping this instance? Of course. And that is the beauty of my operator: I got an EC2 instance which I can access by its public IP right from my computer. I just do kubectl get ec2instances, grab the IP, log in, and start working there. The important thing is that this EC2Instance object is in the default namespace, while my operator is actually running in the ec2-operator namespace. These are different, and this is what I was talking about here.
This namespace is default, and that is where the object was created, while the pod running in my ec2-operator namespace went to Amazon, because that is where the instance lives; you can see it is running now. And once it was done, my EC2 operator went back to this object and updated its status. So my operator needs access not just to read the object but also to write to it, so that it can update the status: the state of the instance, the public IP, and the instance ID that was created on Amazon. And that is what we have been looking forward to, so that we can go ahead and create our instances and manage our Amazon environment, including cleaning it up. So I can show you that it actually also deletes resources; let me clear the screen a bit. So, starting from
when we delete our instance: what happens? As soon as I do a delete, my reconcile loop starts, because there is an update to the object, and I see it has a deletion timestamp, meaning the instance has been marked for deletion. So we print that we are now deleting the EC2 instance, then we use the Amazon API to send a terminate request to our instance, and then we wait for the instance to be terminated. That is essentially what is happening here. So you see it is not running
anymore. It is now terminated. This was
the instance which was terminated.
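The deletion path described here typically sits at the top of Reconcile and looks something like this; a minimal Go sketch in controller-runtime style, where the finalizer string and the deleteEC2Instance helper are assumed names, not code from this course:

```go
// Sketch of the deletion branch of the reconcile loop.
const finalizerName = "ec2.example.com/finalizer" // assumed name

if !instance.ObjectMeta.DeletionTimestamp.IsZero() {
	// The object has a deletion timestamp: it is being deleted.
	if controllerutil.ContainsFinalizer(&instance, finalizerName) {
		// Terminate the EC2 instance on AWS and wait until it is gone.
		if err := r.deleteEC2Instance(ctx, &instance); err != nil {
			return ctrl.Result{}, err // requeue and retry later
		}
		// Only after the cleanup succeeds do we remove the finalizer,
		// which lets Kubernetes actually delete the object.
		controllerutil.RemoveFinalizer(&instance, finalizerName)
		if err := r.Update(ctx, &instance); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}
```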
And as soon as it got terminated, my waiter said, okay, it is all fine, it is now terminated, and eventually I was able to delete my EC2Instance object in the Kubernetes cluster. The deletion was pending: Kubernetes did not delete the object until the actual resource on Amazon was cleaned up and the finalizer was removed from my object, and only then was the object actually deleted. So this is how you will build an application: you test it locally, you build it into a container image, and then you ship it to the different clusters where you want it, deploying with a Helm chart. Essentially, what you were able to do locally is now all happening in your Kubernetes clusters, because that is where the operator should be running. So this is what I wanted to show you: an end-to-end journey, starting from bootstrapping the project, then building it, testing it, eventually making it work, then deploying it and running it in the cluster with all the proper role-based access control, the proper liveness and readiness probes, and also the metrics, which are nice to have to see how your controller is really doing; and when everything is good, there is no need to reconcile, all is happy. So I think it makes a lot of sense to write your own operators, and I want you to try this out, see how it works, and let me know if you have any questions; I will be happy to help. Let's move on.
All right, so this course on building an operator for the cloud, one that manages your EC2 instances, is now coming to an end. And I have to admit, it is quite a lot. But trust me, what you have just done with this course is understand one of the most advanced concepts in Kubernetes: the reconciler, how to write applications that are self-healing, how to write an operator. While you know the basics now, while you have a very good understanding of how to work with operators, there is no limit to it. Think of this course as a launchpad that enables you to go ahead and build cool stuff that runs on top of Kubernetes: not just container images running on Kubernetes, but software that runs on Kubernetes and manages your other infrastructure, which is what we did with Amazon. You might be using Azure; try to make developers' lives easier by managing Azure resources through Kubernetes. Maybe you are doing this on GCP. The sky is the limit for you. Now you not only know how to use Kubernetes, you also know how to write applications native to Kubernetes that manage your other environments. You know how to work with reconcile loops, how to design the APIs, what the controller logic is, and what the integration looks like when you are working with the cloud. And you are very employable now, or you are probably already working as a platform engineer, a site reliability engineer, or a DevOps engineer, but now you know how to extend and build on Kubernetes. We have looked at Kubebuilder as a project in good detail in this course; we have really used it to bootstrap an operator headed towards production, though I would not say it is production-ready right now. When you build an operator, you run it, it fails, people complain about it, you refine it, and eventually it becomes production-ready. But you have the tools now to go on that journey yourself. Next, try to build your own operator. Try to extend this particular operator to have more metrics available. Try to write your own operator that manages S3 buckets, for example, or one that manages EFS file systems on Amazon, or maybe anything else; it does not have to be limited to the cloud. So go ahead, have fun building new operators, and have fun building new tools that run on top of Kubernetes. And if you have any questions, let me know.
In this hands-on Kubernetes Operator course, you will learn how to extend Kubernetes by building your own custom operators and controllers from scratch. You'll go beyond simply using Kubernetes and start treating it as a Software Development Kit (SDK). You will learn how to build a real-world operator that manages AWS EC2 instances directly from Kubernetes, covering everything from the internal architecture of Informers and Caches to advanced concepts like Finalizers and Idempotency.

💻 Code & Resources: https://github.com/shkatara/kubernetes-ec2-operator
Shubham: https://www.linkedin.com/in/shubhamkatara/
Saiyam: https://www.linkedin.com/in/saiyampathak
Kubesimplify: https://www.youtube.com/@kubesimplify

Course Curriculum & Timestamps

Part 1: The Theory of Controllers
- 0:00:00 Introduction & Prerequisites
- 0:01:55 What is a Controller? (The Observe-Compare-Act Loop)
- 0:06:45 Idempotency in Controllers
- 0:07:55 Deep Dive: The Reconcile Loop (Happy Path, Sad Path, & Error Handling)
- 0:19:45 The Foundation of Writing Operators
- 0:23:05 What is an Operator? (The "Helper" Analogy)
- 0:27:35 CRDs (Custom Resource Definitions) and CRs (Custom Resources)

Part 2: Kubernetes Extensibility
- 0:31:35 Kubernetes as an SDK & Extensibility
- 0:34:00 Networking, Storage, & Admission Controllers
- 0:35:20 Internal Developer Platforms (IDP) & Platform Engineering
- 0:39:50 Bootstrapping with Kubebuilder

Part 3: Setting Up the Environment
- 0:41:05 Setting up the Local Environment (K3D, Docker)
- 0:52:15 Introduction to the Kubebuilder Framework
- 0:56:35 Project Initialization (kubebuilder init)
- 1:00:30 Exploring Scaffolding (Makefiles, Dockerfiles, main.go)

Part 4: Building the API & Logic
- 1:04:15 Creating your first API (kubebuilder create api)
- 1:06:45 Defining EC2 Instance Types & Specs in Go
- 1:13:05 Understanding TypeMeta and ObjectMeta
- 1:21:45 Internal Controller Logic Breakdown
- 1:24:05 Deep Dive: Manager Architecture & Controller-Runtime
- 1:31:05 Cert Watchers, Health Checks, & Prometheus Metrics
- 1:52:55 Initializing the Manager in main.go

Part 5: Hands-on Development
- 2:07:35 Implementing the Reconcile Loop Logic
- 2:22:35 Custom Resource Definitions (CRDs) in Action
- 2:46:45 Running the Operator Locally
- 3:01:45 AWS SDK Integration in Go
- 3:22:55 Using Finalizers for Cleanup Logic
- 3:36:55 Creating EC2 Instances on AWS via the Operator
- 3:53:20 Implementing Waiters for Instance State (Running/Terminated)

Part 6: Advanced Internals & Deployment
- 4:13:45 Idempotency & Reconciler Loop Internals
- 4:46:35 How Informers, Caches, and WorkQueues Work
- 5:11:20 Handling Object Deletion & Timestamps
- 5:32:05 Packaging the Operator with Helm
- 5:43:05 Deploying to Kubernetes (RBAC & Service Accounts)
- 6:16:20 Conclusion & Future Steps