In this hands-on Kubernetes operator
course, you'll learn how to extend
Kubernetes by building your own custom
operators and controllers from scratch.
You'll go beyond simply using Kubernetes
and start treating it as a software
development kit. You'll learn how to
build a real-world operator that manages
AWS EC2 instances directly from
Kubernetes, covering everything from the
internal architecture of informers and
caches to advanced concepts like
finalizers and idempotency.
Shubham developed this course.
>> Now, if you already know Kubernetes, you
know that there are concepts and
Kubernetes objects like pods,
deployments, replica sets, stateful
sets, services, and so on and so forth. But
did you know that you can create an
object called EC2Instance?
No? Well, that's the beauty of
Kubernetes: you can extend its
current capabilities and create
something called an operator. You can
create an operator to control things
that are outside of Kubernetes, like an
EC2 instance, which is what we will learn
in this course.
I'm very excited to bring you the
Kubernetes operator course from scratch.
This 6-hour-plus course is brought to you by
Shubham, who has 8-plus years of
experience, works at Trivago, has
trained many engineers on OpenShift, and holds
multiple certifications, including GCP Cloud
Professional and DevOps. This course
comes as an outcome of his work at
Trivago building custom operators in
production. Yes, we'll build a
full-fledged working operator end to end
from scratch, learning why it is even
important, how to do it, and everything
about Kubebuilder, and then building it
end to end. I'm really, really excited
about this course and cannot wait for
you to get started. So before we can
build a custom operator for Kubernetes,
we need to know what an operator is,
right? And before that, there is a term
called "controller" that you really need
to be familiar with. Many of you might
already know what a controller is; you
have heard about the kube-controller-manager.
But what does it really do? What work is
the controller responsible for?
So a controller is nothing but a
forever-running loop. Think of it as a
piece of software, which we will be
writing, that runs forever. If I want to
write a bit of pseudocode for it, it's
kind of like this. You always run it,
and the first thing it does is observe
the state of the resource. Whichever
resource you are writing an operator
for, you will have a controller for it
as well. So whether you work with pods,
deployments, services, or config maps,
there is a controller for each of those
resources. The first thing it does is
keep observing the state of your
resource. If the state is updated, for
whatever reason (maybe in your
deployment you changed the image, maybe
in your config map you edited its data),
the second thing a controller, or an
operator, really does is compare the
current state to the desired state. And
this is where you put your business
logic. This is where you define what to
do in case a drift is recognized and,
most importantly, what not to do if
there is no drift, because it is very
important to make your operators, or at
least your controllers, idempotent.
They have to be idempotent; I cannot
stress this enough. We will talk about
the reconcile loop in just a minute, but
the point is that if your resource needs
no change, nothing should be done on
Kubernetes. You should be able to run
your controller as many times as you
like, and it should not result in a
change if no change was needed. And if
it finds that there is a drift between
the current state and the desired state,
it then does an update; or you can say
it acts on whatever logic you have given
it for the case where a drift was found.
And then we close the loop. So it's a
forever-running loop that never stops
and keeps watching the API server for
the resources you are managing.
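That forever-running observe/compare/act loop can be sketched in a few lines of Go. This is a toy stand-in, not the course's actual code: the `State` type and the returned action strings are invented for illustration, and a real controller would read state from the API server through a client instead.

```go
package main

import "fmt"

// State is a simplified stand-in for a resource's state; a real
// controller would read this from the API server via a client.
type State struct {
	Replicas int
}

// reconcile compares actual vs desired and returns the action to take.
// It is idempotent: when the states already match it does nothing.
func reconcile(actual, desired State) string {
	if actual == desired {
		return "no-op" // happy path: no API calls at all
	}
	if actual.Replicas < desired.Replicas {
		return fmt.Sprintf("create %d pods", desired.Replicas-actual.Replicas)
	}
	return fmt.Sprintf("delete %d pods", actual.Replicas-desired.Replicas)
}

func main() {
	fmt.Println(reconcile(State{Replicas: 3}, State{Replicas: 5})) // drift: scale up
	fmt.Println(reconcile(State{Replicas: 5}, State{Replicas: 5})) // match: do nothing
}
```

Running it forever with a watch on the API server is exactly the loop described above: observe, compare, act, repeat.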
Now, what we are going to build is a
cloud controller, because what we are
building will be a piece of software
that actually runs in your Kubernetes
environment. Let's say this is your
Kubernetes cluster, and there you say: I
want to make a resource of kind
EC2Instance. Let's put it this way: it
goes to Amazon and checks whether an
instance with this name is already there
or not. If it is there, it does nothing.
If it is not there, it creates one. So
it's what we would call a cloud
controller. Think about when you run on
EKS, or when you go to Azure Kubernetes
Service: it is very easy for you to
change the Service definition (the SVC,
for example, in EKS) to get a load
balancer. You just set the Service type
to LoadBalancer, and in your EKS cluster
there is a piece of software running
that abstracts away how to create a load
balancer and how to register your
Service endpoints as backends of that
load balancer. It hides the complexity
from you, and that is what a cloud
controller does. There are many
different controllers that cloud
providers ship in their own Kubernetes
distributions to make your lives easier,
so that you do not have to know the
nitty-gritty details. You just say that
you want a resource, and then you get
one; that is what a cloud controller
manager would be.
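As a concrete illustration of that pattern, this is roughly what such a Service manifest looks like (the app name and ports are made up; the cloud controller in EKS or AKS watches for `type: LoadBalancer` and provisions the actual load balancer):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app          # illustrative name
spec:
  type: LoadBalancer    # this one field asks the cloud controller for an LB
  selector:
    app: my-app
  ports:
    - port: 80          # port the load balancer exposes
      targetPort: 8080  # port the pods listen on
```

The developer only declares what they want; the cloud controller reconciles the how behind the scenes.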
Now, when I was talking about
controllers, we used this term
"idempotent", and this is something I
actually want to explore a little bit
with you. There are a few things your
code should be doing when you write a
controller, when you write the logic for
what to do. The first thing is the happy
path. So what is a happy path? You have
your logic; this is also called the
reconcile loop. It drives the cluster
state to your desired state, and that is
what it reconciles; that is why
Kubernetes is eventually consistent. I
mean that in the sense that you make a
change and eventually, usually within a
very short time, the cluster state is
going to match the desired state you
asked for.
Now let's zoom into this path a little
bit, with case one: you have your logic,
your resource got updated, and your
reconcile function is triggered. This is
the start, the beginning, of your loop.
The first thing you do is get your
object from the request. The way it
works is that when you update a resource
in Kubernetes and there is a controller
watching it, the controller gets a
request, the API request that went to
the API server, and from it the
controller can get the object data. For
example, if you updated a config map,
your reconciliation loop can get the
YAML or the JSON of that config map, so
you can see what has been changed, what
updates the user has made. You get the
object from the request, and then you
observe the desired state from the spec.
For example, you define your config map,
or let's say a pod: you have a pod, and
in its .spec you define your containers.
So you can see what spec there is for a
particular resource, and then you can
compare that spec with the actual state
of the resource. If they match, if the
number of containers in your pod is
exactly what you wanted, then you just
skip it; you don't have to do anything.
This is the happy path: you do nothing.
And it is absolutely important that you
realize you don't have to do anything in
this case. You don't make any API calls.
You just ignore that request to your
reconciliation loop, because the actual
state is equal to the desired state. And
that is how you exit your loop
gracefully. Of course, I'm not saying
you stop the loop, because you have to
keep listening for requests, but you
will not make any changes.
There's also a second thing that can
happen. In this step, your function is
triggered, you get the actual object
from the request, and you see the spec
of the object: which object is being
modified, and what the actual state of
the resource is. And this is where it
gets interesting. If the desired state
is equal to the actual state, you do
nothing; we know this from the happy
path. However, if they do not match, for
example: in your deployment, the current
value stored in etcd is replicas equal
to three. Let's take this example; it's
a nice one. So the current value is
replicas equal to three for a
deployment, and this key is stored in
etcd. Now you do a kubectl edit
deployment, give the name of the
deployment, change the replicas to five,
and save the file. The first thing that
happens is that your reconciliation loop
gets a request: I am watching
deployments, and this deployment has
just been updated. Your current is
three, your desired is five. In your
spec you now have replicas equal to
five. This is what I mean when I say
observe the desired state: you get the
actual object YAML, and from it you
observe the desired state. So you want
five replicas, and you observe the
actual state, which is still three
replicas. There is now a drift: the
current state does not match the desired
state, and this is where your logic
comes into the picture, defining what to
do when the resources do not match what
the user asked for. You calculate the
difference and take some action: a
create, update, or delete on the
resource. In this case you would create
two more pods, because you wanted five,
and 3 + 2 gives the five pods the user
asked for. Then, if your action
succeeded, you update the status field,
and you exit the loop again. This is
very important: every resource in
Kubernetes has a .status. You have a
spec and you have a status, and this is
how the reconciliation loop knows
whether the actual state matches. If for
some reason you could not create the
pods, whatever the reason may be, you
can return an error and then requeue,
retrying that action. And this is what
makes Kubernetes self-healing. It tries
again, and it tries again with a
backoff, which you can configure. Say
you could not create the pod because
there was not enough memory; in that
case the pods would actually be created
but left in the Pending state, so that
is not a great example. But say that for
whatever reason your pods could not be
created; maybe you were missing the
role-based access control in the
namespace where the pods should be
created. The request will now be
requeued, which means it goes back to
the beginning of the reconciliation loop
and is started again. That is what
happens when I say you need to requeue:
requeue means you retry that action. And
this is what Kubernetes self-healing is
about, because once you grant the
role-based access control to the
controller, it will be able to create
the resources. It's not "I tried once
and I couldn't do it"; it keeps trying
again and again. You might have seen
this: if you have a pod which needs a
persistent volume, it goes into Pending
if the PersistentVolume the pod needs
does not exist. But if you create one,
the pod automatically gets scheduled and
started; you do not have to do anything.
And this is the beauty of a loop that
can requeue for your cases. This is
absolutely the brilliance of
self-healing in Kubernetes.
Now, one thing you have to be very
careful about is this: there is also a
sad path, and it is something you
always, always want to avoid when you
are writing a custom controller. The
steps are pretty much the same. You
start your loop; you got a request that
somebody updated the deployment; you
look at what changes they made and
whether there is actually a drift or
not. If the actual state already matches
the desired state, you have to do
absolutely nothing. What I mean by that
is that you must not update the resource
for anything, because of what happens
when you update the resource. Now, this
is interesting. Go back to the case
where there was work to do: you
calculate the difference and you update
your resource. If that action succeeded,
you will actually trigger the reconcile
function again, because you updated the
resource. Kubernetes controllers do not
know what you updated: whether you
updated the spec, the metadata, or the
status. They don't know about that. They
just say, okay, this deployment resource
was updated, so I will rerun the
reconciliation loop. And now, because
you created five pods, your replica
count is actually five, so the loop will
say: I get the object, the user wanted
five replicas, the replicas have already
been created, the state matches, I don't
have to do anything. You have to write
your reconciliation loops to be
idempotent. But suppose you got a
request, you get the object, you observe
the states, and there is no diff; there
is no work needed. Maybe the spec said
five replicas and your actual state was
also five, so you need to do nothing.
But by mistake you update a last-sync
field in the status. You say, okay, it's
just metadata; it does not change my
deployment, it doesn't change my
containers, it doesn't change the image
I'm using or the environment variables.
I'm just recording, like a good person,
when this was last synchronized. So you
decide that whenever a request comes in,
even if you make no changes, you update
status.lastSync, which triggers an API
call. And you see, whenever you update
your resource, it goes back to the
beginning of the reconciliation loop,
and this is where you end up with a
forever-running loop. A request comes
in; you get the object; you observe the
desired state from the spec; let me zoom
in a little bit: there was actually no
need for any change on the resource, but
by mistake you update the status. So
Kubernetes says, okay, the object the
controller is watching has been updated,
and it goes back to the beginning of the
loop, and then you update the last sync
again. Kubernetes says, I got a new
update, back to the beginning, and this
loop will continue forever. Your
resource will keep on updating without
ever stopping. So it is very, very
important that you are careful not to
make any changes if no changes are
required.
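The difference between the sad path and an idempotent reconciler can be shown with a toy model. Everything here (the `Object` type, the write counter standing in for API-server calls, the lastSync field name) is invented for illustration:

```go
package main

import "fmt"

// Object is a toy resource with a spec, an observed state, and a write
// counter that stands in for calls to the API server.
type Object struct {
	SpecReplicas   int
	ActualReplicas int
	LastSync       string
	apiWrites      int
}

// badReconcile updates status.lastSync on every run, so every run
// emits a new watch event and retriggers itself: the loop never settles.
func badReconcile(o *Object, now string) {
	if o.ActualReplicas != o.SpecReplicas {
		o.ActualReplicas = o.SpecReplicas
		o.apiWrites++
	}
	o.LastSync = now // unconditional write = infinite reconcile loop
	o.apiWrites++
}

// goodReconcile only writes when there is drift; rerunning it on a
// matching object is a pure no-op.
func goodReconcile(o *Object, now string) {
	if o.ActualReplicas == o.SpecReplicas {
		return // nothing to do: no API call, no new watch event
	}
	o.ActualReplicas = o.SpecReplicas
	o.LastSync = now
	o.apiWrites++
}

func main() {
	o := &Object{SpecReplicas: 5, ActualReplicas: 5}
	for i := 0; i < 100; i++ {
		goodReconcile(o, "t0")
	}
	// prints 0: a hundred runs on a matching object touch nothing
	fmt.Println("API writes after 100 runs:", o.apiWrites)
}
```

The bad version writes once per run even when nothing changed, and each of those writes is exactly the kind of update that wakes the controller up again.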
Now, this is actually the foundation of
how to write an operator. The controller
is the actual logic you have to have.
There are a couple of things to keep in
mind when you are writing an operator,
and I think this is absolutely important
and worth reading. The most important
question you, or rather your controller,
should be asking is: is there anything
for me to do? That means: if the current
state is equal to the desired state,
exit immediately and do nothing. There
is a golden rule you should follow as
well: only write to the API server when
the actual state differs from the
desired state. In the case we saw, you
say: I know the actual state is equal to
the desired state; I make no calls to
the API server; I do not update my
resource. But by mistake you update the
last sync, which is again a request to
the API server to modify the resource.
Then the reconciliation loop sees that
there is an update, goes back, and
reruns the loop, and that is a problem.
So you always have to make sure that you
only make changes to the resource when
it differs from the desired state. This
is also what idempotent means: you can
run your loop 100 times, and if the
cluster is already in that state, you
should not be doing anything. It doesn't
break anything and it doesn't change
anything if the cluster state is equal
to the desired state. That is absolutely
important to take into account. And here
is what is interesting, what makes these
operators resilient: they are stateless.
They don't remember what they did with
your resource in the last request. They
don't remember whether the pod replica
count was three or five or seven. They
don't remember whether you had an
environment variable or not. They always
check the resources; they always go to
their source of truth. Maybe you are
writing a cloud operator, so they go to
the cloud; maybe you are writing a
database operator which creates
databases, so it goes to the database
and always runs the query. And this is
why they are stateless. So your
controller can actually be killed, or
the node it was running on can be
deleted, or the container can crash; the
controller will come up on another node
and simply start from there. It doesn't
need a persistent volume to store state.
And this is why it can crash, restart,
and still figure out whether it needs to
do something on a particular resource or
not, because that is what you have made
it do: it always observes, it always
checks the desired state against the
current state, and if there is anything
to be done, it does it; otherwise it
says, cool, the resource is already in
the state the user wanted.
Now, this was about controllers, but
what is an operator? I think you might
already know about operators in a way,
because you want to write your own, but
let's just go through it quickly.
Imagine you want a house. Let's say you
are living in India; this is an example
I like very much. You have a house
already; maybe your parents own one. And
one day you decide to move to Germany.
The place is completely new to you. You
have never been to Germany before, and
you don't speak the language; you don't
know German. Now you need a place to
stay, a house to stay in. You call a
company, and the company says: hello
sir, you're moving to Germany; we will
help make sure your move is easy and
simple. We have two options. One, we can
give you a fully furnished house. Or, as
the other option, you can get a simple,
unfurnished house. You can choose
whichever you want, and we would be
happy to give you the key when you land
in Germany, once you sign the forms and
everything. The company also says one
more thing: sir, while we are giving you
the furnished house, we also give you a
helper.
Now you ask: what is this helper? What
is it going to help me with? The company
says: at some point you might break a
tap, maybe your water filter breaks,
maybe you spill something on the carpet,
maybe your bathroom tap is broken, maybe
you break a window; you never know. You
don't know anyone in Germany. Will you
fix it yourself, or will you have the
helper do these things for you? You
don't know the nitty-gritty details of
where the hardware store is or whom to
call if you lose the keys to the house.
Let the helper do it for you. So the
helper is someone who has full knowledge
of this house and full knowledge of how
to fix things when they go wrong. You
just have to tell the helper. Maybe you
lost your keys; just tell the helper: go
get me a key. He knows where the store
is; he has the knowledge of where to go
and how to ask, in German, the person
who can cut you a key, and he gets you
one. If you have a broken pipe, he knows
how to fix it. So think of this helper
as the actual operator. Now, if you want
to port this into software terms, think
about a database called MySQL.
Installation is easy nowadays, because
you have a container: you simply run it
and you get your app, your software. But
what about day-2 operations? What about
a database migration of your schema?
Maybe you want to take a backup, or take
incremental backups on a particular
schedule. That knowledge needs to sit
either with you or with someone who can
do it for you. And this is where MySQL
not only gives you the database itself
but also ships an operator for you. This
operator is actually a controller
running internally, and this controller
has all that logic: if the user asks me
to create a database, I know how to do
it; I know how to log into the DB and
create the database. It knows about it,
so you just have to say what to do. In
the analogy, the helper was the
operator, and the MySQL database was the
product we were actually looking for.
That makes your life a lot easier,
because you don't have to worry about
the lower-level details. Now, an
operator has two things: one is a custom
resource definition (CRD), and the other
is a custom resource (CR).
You know how you can do kubectl get
pods: you get a response whether you
have pods or not; it either lists the
pods or says no pods found in the
namespace. But if you do kubectl get
apple, Kubernetes does not know what
this resource called apple is, because
Kubernetes has its own vocabulary. It
has the API resources it has been told
to remember, and those are the internal
resources native to Kubernetes, like
pods, deployments, secrets, and
services. These are the resources
Kubernetes knows about. But what if you
want to create your own resource, which
in our case is going to be an EC2
instance? I might also want to create an
S3 bucket. In that case, I need to
expand Kubernetes's vocabulary: here is
a resource called EC2Instance, and if
somebody gives you a YAML with kind
EC2Instance, you know what it is. What
to do with it is a different story, but
at least you recognize it. So if
somebody runs kubectl create on that
file, you don't just say "I don't know
what this resource is"; you know about
it, because I have given you the schema
of what an EC2 instance is. I have given
you the custom resource definition, so
whatever the user gives you with this
kind, you accept it, because your
vocabulary has been extended. And
whenever you instantiate a custom
resource definition, the result is
called a custom resource. For example,
once you have created the custom
resource definition for EC2Instance and
then create one, you can do kubectl get
ec2instance.
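To make this concrete, a minimal sketch of such a CRD might look like the following. The API group compute.example.com and the schema fields are invented for illustration, not necessarily what we will use in the course:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  # name must be <plural>.<group>
  name: ec2instances.compute.example.com
spec:
  group: compute.example.com
  scope: Namespaced
  names:
    kind: EC2Instance
    plural: ec2instances
    singular: ec2instance
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                instanceType:
                  type: string
                amiId:
                  type: string
```

After applying this, kubectl get ec2instances stops being an unknown word; it simply returns an empty list until someone creates a custom resource of that kind.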
What you receive is an instantiation of
the definition, that is, a custom
resource, and, very importantly, this is
what your operator, your controller,
acts upon. Your controller knows when a
resource of type EC2Instance has been
created or deleted. If you create a
resource called EC2Instance, it knows
that there was an update on this
resource, namely a create, and the
controller will create that resource for
you. If you delete it, the controller
says: on this resource which I am
watching, a delete operation was
performed by the user, so it goes ahead
and deletes it for you. Without the
controller, your custom resources are
nothing; Kubernetes merely knows about
them. It does not react to them, and it
does not acknowledge "I'm going to do
what you want me to do", because it does
not have the knowledge. So while you use
the CR and the CRD to say what you want,
the controller behind them is the how
part of it: how do I do that? And this
is what we are going to build: a cloud
controller for creating EC2 instances on
Amazon. This is what we will be working
toward.
There is also something else you need to
know: Kubernetes is not just a platform
anymore; it is a complete operating
system. So let's talk about how
Kubernetes is actually extensible and
how you can use Kubernetes as an SDK.
What is very important with Kubernetes
is to look at it not just as a platform
where you can run your applications, but
to ask how you can use Kubernetes as a
software development kit and what you
can do with that on other platforms. The
first reason Kubernetes is so widely
adopted, by cloud providers,
on-premises, and by other software, is
its extensibility.
Let me get a different color. It is
because of the extensibility: because of
these custom resources, these operators,
and these controllers, which is what we
just talked about. Kubernetes also has
an API-first approach. Everything in
Kubernetes has an API: your pod is an
API, your service is an API, and the API
server has APIs for all of these things,
which makes it very easy to write code
against. There are client libraries for
this, and that makes it very, very easy;
you have SDKs you can build your
controllers on for Kubernetes. There is
Go, Python, Java, and there is a
JavaScript integration with Kubernetes,
because there are client libraries for
that as well. Kubernetes also has
backward compatibility: it does not just
delete API resources, it deprecates them
first. It gives you enough time to move
to a different API version, and it
versions its APIs. You might have seen
pods/v1, or something like
networking/v1beta1. This is the version
of the Kubernetes API, and it makes it
easy to develop new APIs without
breaking the existing ones, which makes
it really helpful, I would not say
simple, but helpful, to extend the APIs.
And everything is a plug-in: for
networking you can bring your own CNI
and choose from different CNIs. A very
popular one is Cilium, from Isovalent,
a company acquired by Cisco. You also
have different options for storage and
for runtimes, and there are webhooks
where you can intercept everything with
an admission controller, which can
either validate your request or mutate
it. I think these webhooks deserve an
entirely different course of their own;
I would not do them justice by just
saying "there is an admission controller
which can validate and mutate", so that
is probably something for us to look at
in the future.
And this is why, because of this
extensibility of Kubernetes, different
cloud providers offer different flavors,
and there are a thousand-plus tools you
can use on top of Kubernetes. There is
OpenShift from Red Hat, there is SUSE
Rancher, and there is Tanzu from VMware.
Then there is software built on top of
that, such as Kubeflow and Knative,
which are quite popular nowadays, and
that is what makes developers happy,
because they get to say what, not how.
Now, suppose you are working in a
platform engineering team. A developer
wants a machine in Amazon; he or she
wants an EC2 instance, and you manage
the cloud. Say you are the cloud admin
who will give them the EC2 instance.
They come to you, you run some commands,
and you hand them the instance. That's
okay, but it is a very old approach.
What you can do instead, and this is
what internal developer platforms help
you with, or you can build your own, is
say: listen, if you want an EC2
instance, you don't have to come to me.
Just give me this YAML. You can explain
it to them, and you can have a Helm
chart around it that says: I want an
instance; here is the number of
instances, maybe two; the instance type
you want; the AMI ID of the machine
image to use; and maybe the port numbers
that should be open. They give you this
in YAML format, and you pass it to your
controller, perhaps after a pull request
review. So the definition is stored in
GitHub, they open a pull request, and
then they get an EC2 instance. With
this, they get to say what they want.
They don't care about how to create
resources in EC2, they don't care about
VPCs, they don't care about anything.
And because you now have a GitOps
workflow, you can have Argo CD deploying
these resources, and the controller
takes care of creating the EC2 instance.
Everything is code, and you can manage
these resources very simply. This
platform-as-a-product mindset is what
platform engineering is all about. You
have declarative options, and you can
use Helm to make developers' lives easy:
they just give you this information, you
render the resource, and your controller
takes care of it. I cannot stress enough
how much simpler this makes our lives.
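Sketching what that developer-facing YAML could look like (every field name and value here is illustrative; the real schema is whatever your CRD defines):

```yaml
apiVersion: compute.example.com/v1alpha1   # illustrative API group
kind: EC2Instance
metadata:
  name: dev-box
spec:
  instanceType: t3.micro
  amiId: ami-0123456789abcdef0             # placeholder AMI ID
  openPorts:
    - 22
    - 443
```

The developer commits this file, the pull request is reviewed and merged, Argo CD applies it, and the controller does the rest.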
Now, the thing is, you can run
Kubernetes anywhere, and the reason you
can run Kubernetes anywhere is
standardization. You can run it in any
cloud, you can run it on the edge, you
can run AI workloads on top of it,
anywhere. Kubernetes is a standard
because it has one pattern, the
controller pattern, that rules them all.
DNS just works; again, it can be
problematic, but every pod knows where
every other pod is. It has its own
challenges depending on how many
services and how many pods you have in a
cluster, and scalability can be another
issue, but for a cluster you have
bootstrapped, it just works fine. And
then you have config management for your
developers, which I don't think I need
to talk about.
The point I'm trying to make here is
that it's not just a container
orchestrator. It is a complete operating
system. You want networking, it has it.
You want memory management, it has that.
You want compute management, CPU,
storage, disk management, it has it. So
you can actually build, package, and
ship software that runs on top of
Kubernetes, any sort of software. You're
not just using Kubernetes; you can
extend it with all of these controllers
and operator frameworks that we are
talking about, and this is why I love
Kubernetes a lot. All right, so this was
about how you use Kubernetes as an SDK.
Now let's talk about how you bootstrap a
Kubernetes operator with a tool called
Kubebuilder, and this is where our
journey begins. So let's go on and do
some hands-on on writing an operator.
So before we can build our own
Kubernetes operator, we need a place to
run this operator on, and that is going
to be Kubernetes. Now, you could build a
Kubernetes cluster in GKE, you could use
Amazon's managed service, or you could
build your own cluster with kubeadm.
Whichever way you want to do it is fine,
because the operator you are building
will be built into a container image,
and that container image can run on any
Kubernetes cluster. In our case, we want
to keep it simple, so I'm going to build
the operator and test it on my cluster
running locally, and create instances on
Amazon, which is external to the
cluster, just to show that you can
manage infrastructure that is external
to your Kubernetes environment. This is
why Kubernetes is so popular: it lets
you use it as an SDK, as an operating
system of the cloud, which we will also
talk about later. So K3D is a Kubernetes
tool by Rancher, which has many other
distributions, like K3s, a very simple,
lightweight Kubernetes distribution, and
RKE2, which is more hardened, for
security, if you are working in
government. K3D lets you create
Kubernetes clusters inside containers.
If you have kind, you can use kind. If
you have K3D, you can use K3D. If you
have a sandbox cluster somewhere, you
can use that as well. The reason I'm
running this locally is that it's very
lightweight, it does not cost me lots of
resources, it's free of course, and it's
very fast because it's running on my
computer.
So for K3D, installation is very simple.
Just go to the installation script and
download it with either curl or wget. I
would suggest you go with the latest
version. Once you have it installed, you
can run k3d version, and I've got the
latest version of K3D, which is 5.8.3.
The Kubernetes version I'll be using
when I build a cluster with K3D is going
to be 1.31.5.
But there is a newer version of
Kubernetes; what if I want to use that
instead? We are DevOps engineers, cloud
engineers. We like to have a single
source of truth for all of our
applications, which is why we do GitOps,
right? And wouldn't it be nice if you
could version control your clusters as
well? Right now I have one cluster with
two agents; maybe I want to increase
that, so let me put it into GitHub. That
is exactly what K3D allows you to do,
with a very simple cluster config file.
It has lots of options, which you can
look up in the K3D documentation;
however, I've kept it very simple. This
one gives me one master. K3D allows you
to create multi-master, multi-node
clusters, but I'm just going with one
because I don't need high availability.
Second, I'm going to use two agents
here, which are going to be the worker
nodes. And this bit tells me the version
of Kubernetes that I want to use, and
that's the one we will be using.
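The config file being described might look like this (a minimal sketch; the cluster name and image tag are placeholders, and the full schema is in the k3d docs):

```yaml
# k3d cluster config: one master (server), two agents, pinned Kubernetes version
apiVersion: k3d.io/v1alpha5
kind: Simple
metadata:
  name: dev-cluster
servers: 1    # one master, no HA needed
agents: 2     # two worker nodes
image: rancher/k3s:v1.31.5-k3s1
```

You would then create the cluster from it with `k3d cluster create --config <file>`.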
You also need Docker, because K3D
creates containers in which it runs your
Kubernetes cluster, which itself runs
containers; there's a whole inception
going on there. These are the two things
I'll be using; if you have any other
distribution of Kubernetes, you can very
simply use that. So I've got Docker
running on my machine; actually I've got
OrbStack, which gives me a Docker
runtime in the background. To talk about
K3D, its architecture is fairly simple,
and this is what it looks like. You have
your laptop or computer on which you
want to create multiple Kubernetes
clusters. As a developer, I might need
different clusters for different
applications; I might want to promote
them through dev, testing, and QA, just
to have a pipeline for a complete
software development life cycle. That's
possible too, and that is where K3D
shines. When you make a cluster in K3D,
it creates a separate Docker network for
each one, so they are completely
isolated from each other and each has
its own network to talk in. Here you can
see I've got one cluster which is blue
and one which is green, cluster A and
cluster B; this is the master node, and
these are the agents, our workers,
because that's where the actual work
gets done; and we have these Docker
networks created. Right now, if you do
docker network list, you see the
standard Docker networks that are
created when you install Docker.
However, when you do k3d cluster create
with this config file, which is our
source of truth, a new network gets
created, which I just showed you; we
will see that in a moment. Once you ask
it to create a cluster, not only does it
create the cluster, set up a gateway,
and create your workers, it also updates
the kubeconfig, or rather it can help
you get the kubeconfig, and here you can
see my context is automatically set for
kubectl. You can verify it with kubectl
cluster-info, and if I do that, that's
where my cluster is running. Now, if you
do docker ps, you will see a couple of
containers that have just started, and
this is our K3D infrastructure: two
agents, which are our worker nodes, one
server, and also this nginx proxy
container, which is there for a reason.
The reason is for you to talk to your
API server: because you can use K3D to
create multiple masters, you need a load
balancer, and rather than you having to
set one up, K3D does it for you. It
creates a container that is listening on
a port on your computer, which is 5745,
and forwards the traffic to 6443 of the
master, or masters in case you have
multiple. That's why you see that the
Kubernetes control plane is running on
5745 on all the IP addresses of your
computer. If you go to this port, you
will be talking to Kubernetes, to the
kube-apiserver.
Now, what can you do? Every time you
have a new cluster, it's good to do some
smoke testing, something very simple. So
we can do kubectl get nodes. There you
go: you have one control plane, one
master, and two agents which are ready.
You can do kubectl get service, and
kubectl get pods. Some of what you see
is CoreDNS, which is very simple; it
comes with a metrics server; it comes
with Traefik, which lets you expose your
services outside, or work as an ingress
if you will; and it has a local-path
provisioner, which is for storage. I
talked about the metrics server already.
Now let's try to do some smoke tests. If
you do kubectl create deployment, or k
create deploy, it's going to create a
deployment and a pod. k get pods, and
here you can see it's ContainerCreating.
If I do k logs on my deployment, this is
a log for nginx; that is fairly simple,
and if you have used nginx before, this
should be nothing new. You can also
expose your deployment. We want to check
the network connectivity between our
applications; if one pod can talk to
another application in the cluster,
let's just validate that. So I want to
expose my deployment, called
my-deployment, on port 80. Here you can
see it's a Service resource in
Kubernetes, and it has got a cluster IP.
Now, if I want one application to talk
to another application in my Kubernetes
cluster, I can use this cluster IP, and
that's exactly what we will do. So here
we have a pod in our new cluster, for
which we just created a service. I want
to test the networking in my K3D
cluster, so I would create a new pod,
try to curl this service, and I should
get a response from that pod. I should
be able to curl it because it is HTTP; I
know, because I just ran an nginx
server, and this should work because it
is a single cluster. By default you
cannot expose your service IP addresses
outside the cluster; however, inside, it
should work fine.
And that is where we can use our trusty
curl image, which lets you do a curl to
any other IP address or hostname. So we
can do k run: I want to create a curl
container with the name curl, this is my
image, and I want to connect to the IP
address of my service. That's that.
Let's look at the pod: it's
ContainerCreating, and it's completed
already; CrashLoopBackOff. That's fine;
let's check what happened. If I do logs
for curl, it wasn't really crashing; it
just started, exited, started, exited,
because it's not a job that runs to
completion. But you can see here the
response that you get from the service,
which is nginx, and that tells me my
cluster is ready for connections, ready
for me to build applications. You can
also check from your cluster whether you
have external connectivity, because we
will be talking to Amazon.
Might as well check that. So we can do k
run again, let's call it google, and
curl https://www.google.com.
Do I have a pod now? The google
container is creating, and that looks
like Google to me. Looks fine, right?
So we have connectivity between our
applications, and we also have
connectivity to external environments,
and this is going to be the foundation
on which we will be building our
application.
You will also need Go on your computer,
which we talked about, plus Docker and
Git, the standard developer tools. So
that's it; this will be our setup. Now I
think we should talk about what you are
really going to build in this course,
and what a reconciliation loop is. How
does Kubernetes know what you want it to
do? What even is a controller in the
first place? How does it know that the
user has asked for something and it
should act on it? How does it know that
the current state of the cluster does
not match the desired state? Let's learn
that now. If you want to build an
operator, the best thing to use is an
already available framework, which is
called Kubebuilder.
There are also other frameworks that
help you build Kubernetes operators,
like the Operator SDK; however,
Kubebuilder is one of the most popular
operator frameworks that lets you write
your own controllers for Kubernetes.
This is for people who are using
Kubernetes and want to develop an
in-depth knowledge of how Kubernetes
reacts to certain resources, how the
operator loop functions, how you
actually compare the current state to
the desired state, what a webhook is and
how it works, and how you implement
versioning with a Kubernetes operator.
All of that is built in and very simple
with Kubebuilder. It gives you a
starting point without spending so much
time on questions like: what is my
project structure going to be? How do I
structure my code and my test cases?
How do I generate my metrics? How do I
add logging to my controllers? Am I
going to have leader election, and how
do I implement it? On what port do I
expose the metrics? All of that is taken
care of by Kubebuilder. What it does is
give you a directory structure with the
boilerplate code for building Kubernetes
operators already there, thousands of
lines, so instead of having to write it
yourself, you can focus on the business
logic: what the specification of your
custom resources is going to be, and how
to react when there is a change in those
custom resources. That's what it lets
you do, instead of wondering how to even
start an operator in the first place. It
also lets you generate the role-based
access control, and it generates the
Kustomize resources as well, in case you
want to deploy your operator into
different places. It also lets you wrap
your operator into a Helm chart for its
own deployment, so it can be used in any
cluster regardless of whether you are
running in the cloud or on-prem,
wherever you are running. It allows you
to version control your APIs as well. So
for us, let's get started with that. The
first thing you can do is quickly
install Kubebuilder. Let's go to the
installation and setup page, or maybe
I'll look on GitHub, where there are
releases you can download. There are
many different ways of installing it;
you can download from the releases
whichever build works for you. I'm using
a Mac, so I've got the ARM64 build,
because that's my architecture. I think
you may also be able to use Homebrew;
I'm not sure. But as I'll show you,
there you go: you can install
Kubebuilder with a very simple command.
Now, the first thing you do with
Kubebuilder is create a project. Think
of a project as a collection of the APIs
that you will be building; it's a simple
directory structure that lets you
initialize your APIs, so let's do that
now. I already have Kubebuilder
available, version 4.5.1; I think the
latest is 4.7.1, so I'm not too far
behind, but that's okay. So I've got
Kubebuilder, and the first thing we will
do is create a project where we will be
hosting, or building, our API. The first
command is kubebuilder init, and here is
the important thing: when you are
building your custom operator, let's say
you are working at a company called
example, you want to build your custom
resources under a certain domain, which
makes it easy to tell where this
operator is coming from.
If you do kubectl api-resources and pipe
it into less, you can see that every
resource in Kubernetes is its own
identifiable API resource. For example,
if I look at, let's say, these services
here, something like
hub.traefik.io/v1alpha1. We will talk
about what group, version, and kind are,
but just so you know, you can define the
domain under which your API should be
declared and built. So for example, with
kubebuilder init I want to build things
related to cloud, and if I worked at,
say, Netflix, my products would live
under the domain netflix.com. In this
case I'm using cloud.com, plus the
repository in which my code will be
hosted, just as a project description.
What
it does is write the Kustomize manifests
for you, so you can have it deployed in
different clusters based on your
requirements. It writes a lot of
scaffolding code for you, and it creates
a directory structure. It writes you a
Dockerfile, which you can use to build
your operator into a deployable image.
It creates a Makefile that you can use
to generate your custom resource
definitions. Maybe I'll open this in VS
Code; or maybe I'll open it here in
Cursor. That would make more sense. So
it gives you a Makefile that lets you
generate your RBAC and your custom
resource definitions, and it helps you
deploy those into a cluster and
uninstall them from the cluster. If you
are doing local testing, this Makefile
is really helpful. And this is your
PROJECT file, the project information:
the name of the project, the domain
under which your project is defined, and
the version of the Kubebuilder layout
you are using. Apart from that, and this
was the Dockerfile we were talking
about, it gives you this cmd directory.
It has already created a lot of files
and folders for you, so let's quickly go
through them.
First, cmd/main.go is the entry point of
your operator, of your controller, and
this is already done for you. Otherwise
you would have to worry about which Go
libraries to import to build a custom
operator. Whenever I say operator, I'm
also talking about the controller,
because that is the loop that actually
does the job for us. You would be
wondering which library you are supposed
to import; for example, take the
client-go auth package. This auth
package is the one that allows you to
authenticate to Kubernetes; it imports
all the client-go auth plugins in case
you are using GCP or Azure and want to
talk to those clusters; it lets you get
the kubeconfig, and this is the package
you work with. You also have a package
importing the Kubernetes API machinery,
which we will talk about in a bit; this
gives you the runtimes needed to define
a Kubernetes scheme. How do you declare
a health endpoint? How do you do logging
for your operator? It generates a lot of
that codebase, and main.go is the main
file, the entry point for your code; we
will talk about it when we write it. You
also have the config folder, which has
some defaults for Kubernetes objects
like your services and your Kustomize
files. It has a Kustomization that lets
you deploy your operator to different
clusters and namespaces, and a
Kustomization for your manager, which
creates the Deployment and the namespace
in which you want it deployed; it's a
fairly straightforward Kustomization
file. It also lets you create role-based
access control: cluster roles and
cluster role bindings, so it's easier to
run your operators. Otherwise, let's say
you write an operator that watches a
resource called EC2Instance but doesn't
have the permission to list EC2Instances
in a namespace; you would not be able to
manage those resources in that
namespace. So without you worrying about
what your role-based access control
should look like, it generates that
boilerplate, including the RBAC, for
you. It also gives you end-to-end
testing, so you don't have to write your
own test scaffolding; it helps you with
that as well. And
the one thing that is interesting, which
I was looking for; where did that go?
Where is my cmd, config, hack? Am I
missing something? Oh yes, because this
is just the project boilerplate that
Kubebuilder scaffolds for you. The next
thing we can do with Kubebuilder is
actually create an API, and this part is
amazing. This is going to be our custom
resource that we will be creating. So
what we have just done is declare a
project under the domain cloud.com.
Now, with cloud you have many resources
to manage: things like compute, things
like storage, things like network. In
compute, it could be your EC2 instances,
your AMI images, or your security
groups, for example. In storage, it
could be an EBS volume or an S3 bucket
you want to manage. In network, you
might want to manage a VPC, or perhaps a
firewall rule. The point I'm trying to
make is that you can create multiple
APIs in a single project, in a single
domain, and this is what we are going to
do: we will build our own API, which is
going to live in the compute group, and
it's going to be our EC2 resource. That
is what Kubebuilder allows us to do, to
create our own little API. So let's do
that: kubebuilder create api, the group
is going to be compute, and the kind is
going to be EC2Instance.
I want to create the resource: yes. So
this has created the custom resource and
the custom resource definition
scaffolding for me and written it to
disk. And yes, I want it to create the
controller as well. It downloads many
different Go packages, and it also
creates a directory called api/v1. This
is absolutely important: this is the
version of our API, and inside it we get
a file called ec2instance_types.go, and
that is where we define our EC2 types.
Now let's take a look at
ec2instance_types.go and see what it
looks like; this is where the actual
business logic will go for us, where the
actual specification of our API lives.
But before you can build your own
operator for EC2, let's see what using
it would actually look like: what YAML
would you use
for that? So for an EC2 instance, I
would say the kind is EC2Instance. It
would have some metadata: I would give
it a name, my-instance, and the
namespace would be default. The API
version is defined under
compute.cloud.com, and this is version
v1 of our EC2 instance API. And then I
would have two things: almost every
resource has a spec, and then a status
field, and this is very, very important.
When you write a custom resource, you
have to define what the resource is
going to look like: what goes into the
spec of your resource, and what goes
into its status. This is what the file
api/v1/ec2instance_types.go helps us do;
it lets us declare the spec for the
resource we are building. For example,
my spec would have an amiID, and this is
going to be my dummy AMI ID; I would
have an sshKey, the key pair I want to
use on Amazon; and I would have a type,
maybe t3.micro. Then you could have
storage, where you might say: I want a
standard disk, or maybe a fast disk,
which translates to one of the faster
block device types, because all you're
doing is making the developer's life
easy. You're abstracting the actual
details away from the developers, so
they can say: I'll go for a standard
disk of maybe 10 gigs, or a fast one of
50 gigs; that is the data I need. This
would be a minimal spec, and the spec
you're giving, which every resource has,
is defined for Kubernetes as a struct in
Go. So if I look at this EC2 operator,
let's keep it simple: we're going to
keep just these three fields, amiID,
sshKey, and type; these are the things I
want to use. Let me just copy this and
comment it out. Where did that go? There
we go. So
I define the spec: this is the spec for
my operator, and my EC2InstanceSpec will
contain an amiID, an sshKey, and also
the type of instance I want to use. Now,
it's very important to give these JSON
tags, because when you send Kubernetes a
request about a kind of EC2Instance, it
needs to unmarshal your request; it
needs to understand what each key is and
what to do with it: this key is amiID,
this key is sshKey, this key is type.
These JSON tags are required for
serialization, so that Kubernetes knows
which key maps to which field.
Then you also have the status for your
EC2 instance. Here you might want to
report a phase, say Running, whether
your EC2 instance is running or not, and
maybe things like the public IP; this is
what you put in the status field. So for
my EC2 instance I want a phase, which is
of type string; I want the instance ID
as well; and I can simply add a public
IP. Those are the three things I want to
have. Now, when you are building
resources like this, an AI editor really
helps; you can see I'm using Cursor, and
it really speeds up your development.
Again, you are the one doing the
thinking, coming up with the spec and
with what should be shown in the status;
however, it serves as a very good
helper.
Now you've got the spec and you've got
the status, because these two things
absolutely need to be in a resource. So
what does your overall resource look
like? The EC2Instance would have the
type metadata and the object metadata.
When you see any Kubernetes resource,
the kind and apiVersion actually come
from the TypeMeta. This metav1 is a
package in Kubernetes; this Go package
defines the metadata of any Kubernetes
resource, and it has two structs in it.
So the kind and apiVersion that we see
on all Kubernetes resources are actually
defined in a struct in Kubernetes called
TypeMeta, and this is what the
EC2Instance will look like: it will have
TypeMeta. If I copy this, it will
probably make more sense.
Let me just copy that all the way here,
and there you go. So let's comment that
out.
Now, this is the EC2Instance type, the
kind EC2Instance, so I've got that.
There we go. The first thing this
resource has is the apiVersion and the
kind, and these two are defined by the
TypeMeta. Then we have the metadata of
the object itself, defined by the
ObjectMeta, which contains the name of
the object, its generateName, the
namespace, the UID, the resourceVersion,
the creation timestamp; every object has
these two structs declared inside it:
one defining what kind of object it is,
and one defining the object's metadata.
Then you have the spec, which you have
defined, and then the status, which
defines the status of the resource. And
this is how an API is created; this is
how you declare what resources are going
to be in your API. Now I don't have to
tell my developers: guys, you need to
raise me a ticket so I can create you a
resource in Amazon. Oh, you wanted 10
gigs? I probably gave you 15; maybe I
didn't hear that correctly; let me
delete and recreate it, or resize it.
You do not have to do that. If I just
give this to my developers, it is so
much easier for them. Maybe I can give
them a simple UI that lets them declare
the name of the instance, the count, the
storage they want, and it automatically
creates this manifest for me. And
because I already have a Kubernetes
operator, a controller, listening on top
of that, it is very easy for me to track
every request a developer makes for
these instances, because they can all be
put into a version control system, into
GitHub, and you can use Argo CD, which
makes developers' lives so easy. They do
not need to know what fast storage is;
they don't need to worry about what
standard storage is. Of course, they
need to know its benchmarks, but they
don't need to know it is a persistent
disk, or the different types of storage
Amazon has to offer. It is all offloaded
from them, and that is what makes this
very, very simple.
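Putting the pieces together, a complete manifest for this custom resource might look like this (all field names and values are illustrative):

```yaml
apiVersion: compute.cloud.com/v1
kind: EC2Instance
metadata:
  name: my-instance
  namespace: default
spec:
  amiID: ami-0abc123    # dummy AMI ID
  sshKey: my-keypair    # key pair name in AWS
  type: t3.micro
status:                 # written by the controller, not the user
  phase: Running
  instanceID: i-0123456789
  publicIP: 1.2.3.4
```

The developer authors only metadata and spec; the controller fills in status as it reconciles.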
Now, the things you see here, these
lines like +kubebuilder:object:root=true,
are called Kubebuilder markers, and they
are there for code generation; they
drive the generation of the custom
resource definitions for you. For
example, this one says this is actually
a root Kubernetes object, so somebody
could do kubectl get ec2instance; and
here, for example, is what somebody gets
back when they ask for a list of
instances; this type defines what is
returned. It also says the resource has
a subresource called status, which we
defined above. So this is what
Kubebuilder helps you with, and at the
end we register our EC2Instance and
EC2InstanceList with the Kubernetes
scheme. This function takes the types we
just created, the APIs we just declared,
and registers them with the scheme; it
comes from a file called
groupversion_info.go.
Now, this one is a very simple file. It uses the Kubernetes runtime package from apimachinery and the controller-runtime. What these packages let you do is declare your APIs and kinds to Kubernetes. Here you are declaring a schema GroupVersion: the group is called compute.cloud.com. So your domain was cloud.com, your group was compute, which gives you compute.cloud.com, your version is v1, and your kind is EC2Instance. This is how every resource in Kubernetes is identified. Think of it like a URL: every object on the web has its own uniquely identifiable URL, and in the same way every Kubernetes resource is declared in a group, has a version, and has a kind.
Every resource does that; every resource has it. Pod, Service. If I run kubectl explain service, you can see its kind is Service and its version is v1. If you do not see a group, that's because it is in the core group of Kubernetes, which doesn't have a name but is called the core group. The same is true for Pod: if you look here, you can see Pod is v1.
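To make the group/version/kind triple concrete, here is a minimal manifest for the custom resource we're building next to a built-in one; the field names under spec are illustrative placeholders, since we haven't defined the spec yet:

```yaml
# Hypothetical EC2Instance manifest; spec field names are illustrative.
apiVersion: compute.cloud.com/v1   # group "compute.cloud.com" + version "v1"
kind: EC2Instance                  # the kind registered with the scheme
metadata:
  name: demo-instance
spec:
  amiId: ami-12345678              # placeholder AMI ID
  instanceType: t3.micro
---
# A built-in resource for comparison: Pod lives in the unnamed "core" group,
# so apiVersion carries only the version.
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
```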
So this is why you now understand that when you write kind: Pod and apiVersion: v1, you are telling Kubernetes that the YAML you are giving it is a resource of kind Pod, declared in this group, and that you want version v1 of this resource. Every resource has a group, version, and kind, and this code is adding your declared group and types into Kubernetes. It's loading your custom resource declaration into Kubernetes, so when you give it a YAML of an EC2Instance it knows what spec this resource has: what the AMI ID is, what the phase is going to be, what public IP is going to be returned. It knows your spec and status. That is what we are doing here: we create a SchemeBuilder so that we can add our own types, and then AddToScheme adds the types in your group version to Kubernetes. That's where the magic actually happens; this is where you declare what is going to be in your resources.
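Put together, the generated groupversion_info.go is roughly this sketch, assuming the group compute.cloud.com and version v1 used in this course:

```go
// Sketch of a Kubebuilder-generated api/v1/groupversion_info.go.
package v1

import (
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/scheme"
)

var (
	// GroupVersion identifies this API group: compute.cloud.com/v1.
	GroupVersion = schema.GroupVersion{Group: "compute.cloud.com", Version: "v1"}

	// SchemeBuilder collects the Go types we want to register under GroupVersion.
	SchemeBuilder = &scheme.Builder{GroupVersion: GroupVersion}

	// AddToScheme registers those collected types into a runtime.Scheme.
	AddToScheme = SchemeBuilder.AddToScheme
)
```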
Once you have that, you can also look into another directory that it has created for you, called internal/controller, and that is where the reconciliation logic lives; that is where you get the logic of what to do. So the previous part was about the custom resource, but what do you do on top of that custom resource? That's given in the controller package in the internal/controller directory, in a file called your-API-name_controller.go.
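A sketch of what that generated controller file looks like, with names following the EC2Instance example and an assumed module path; the Reconcile body is the part you fill in:

```go
// Sketch of a Kubebuilder-scaffolded internal/controller/ec2instance_controller.go.
package controller

import (
	"context"

	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	computev1 "example.com/ec2-operator/api/v1" // assumed module path
)

// EC2InstanceReconciler reconciles an EC2Instance object.
type EC2InstanceReconciler struct {
	client.Client                  // talks to the API server
	Scheme *runtime.Scheme         // converts between YAML and registered Go types
}

// Reconcile drives the actual cluster state toward the desired state.
func (r *EC2InstanceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Your business logic goes here.
	return ctrl.Result{}, nil
}

// SetupWithManager registers this controller with the manager.
func (r *EC2InstanceReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&computev1.EC2Instance{}).
		Complete(r)
}
```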
What this file does is declare its own package and then create your reconciler. The reconciler struct embeds two things: the client, which gives you an actual Kubernetes client you can use to talk to the cluster, and the scheme, which we use to convert between the YAML you provide and what Kubernetes knows about, the declared resources. Then you have some markers for role-based access control, and this is where the actual reconciliation loop happens. This is the logic that makes sure your cluster state equals the desired state: it reacts to the cluster state, looks at the desired state, and this is where your logic will go. This is the heart of your controller, the heart of what you are writing, and at the end you return a Result
and an error. We will talk about those two as well; right now I'm just running you through the code, and when we write our own as an example we will look into them. Once you have the reconciliation logic, it registers the controller with the controller manager: SetupWithManager uses the controller manager to add your controller to it. I think it makes sense to talk a little bit about the architecture of Kubebuilder, because that will make everything much clearer.
When you run a Kubernetes controller, the first thing that runs is the main.go program. If you remember, this is cmd/main.go, which is the file here. The main.go file is the entry point when you build your operator into a binary; the main function is the entry point of the operator. So let's take a look at the main file from the beginning. It's part of the main package, and it imports quite a few built-in packages from Go.
However, for it to really work as an operator, many more packages are imported, and those come from Kubernetes itself. So let's take a look at those packages. The first one that we see here is the client-go auth plugin package. This lets your operator use exec credential plugins, or OIDC in case you're using that for authentication, to talk to your EKS or GKE cluster API server. It's responsible for making sure your operator can use the kubeconfig and its exec entry points to talk to your Kubernetes cluster.
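In the generated main.go this package is pulled in with a blank import, purely for its side effects:

```go
import (
	// Blank import: registers client-go's auth plugins (exec, OIDC, GCP, ...)
	// so kubeconfigs for managed clusters like EKS or GKE work out of the box.
	_ "k8s.io/client-go/plugin/pkg/client/auth"
)
```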
The runtime package from apimachinery: you understand YAML, but Kubernetes does not understand YAML. It understands objects, which are Go structs. This package defines the scheme and the objects that help convert your YAML into Kubernetes-understandable constructs, Kubernetes-understandable objects. When you do kubectl get pods, the YAML that you get is actually converted from the Pod object in Kubernetes using the runtime package.
We also have the util package from apimachinery. This might look like the same package again, but the first one is defined in pkg/runtime in apimachinery, while this one is defined in util as utilruntime. It's more of a utility package that helps your operator stay stable in case of a panic, which is a kind of fatal error your operator got. Instead of completely crashing the process, it lets you log that particular panic and continue with the operator process, so it doesn't just completely crash on you.
We then have the client-go package, which is the Go SDK for Kubernetes, and here we are importing its scheme package. This lets you register the core constructs of Kubernetes, the Pods and Services, with your operator. Think of it this way: it gives your operator the knowledge of the predefined Kubernetes resources like Pod, Deployment, Secret, and Service, and alongside it your operator registers the EC2 instance custom resource that we are creating.
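In the generated main.go, that registration looks roughly like this sketch, where computev1 stands for the course's api/v1 package under an assumed module path:

```go
// Sketch of the scheme setup in a Kubebuilder-generated cmd/main.go.
package main

import (
	"k8s.io/apimachinery/pkg/runtime"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"

	computev1 "example.com/ec2-operator/api/v1" // assumed module path
)

var scheme = runtime.NewScheme()

func init() {
	// Teach the operator the built-in types: Pod, Deployment, Service, ...
	utilruntime.Must(clientgoscheme.AddToScheme(scheme))
	// ...and our own custom type, EC2Instance.
	utilruntime.Must(computev1.AddToScheme(scheme))
}
```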
We also have the controller-runtime package, and this one right here is the secret sauce. It is responsible for giving you a manager that helps you with clients, caches, and leader election. Controller-runtime gives you the tools to construct controllers that listen for changes on your custom resources; they handle the caches, they handle the clients that talk to the API server, and leader election, if you want it, is also done by controller-runtime. So let me talk a little bit about the architecture of how this controller looks. We have the process, which again is started from main.go, and this main.go has a manager. You will see this coming up ahead, but the manager manages two things: it has a client, which is used to communicate with the kube-apiserver, and it also handles the caches of the custom resources that were updated.
Imagine you want to write an operator that reacts to a change to the EC2 instance object; the cache is where that object's YAML, or spec, will be stored. We're going to talk about the cache much more later in the video, not right now, as it wouldn't make sense yet. For explaining the manager, it's enough that it has the client, used to talk to the API server, and then the cache. And here's where the interesting thing comes into the picture. This green bit right here is our user-provided logic, which is what we write in the Reconcile function. The controller is responsible for reacting to changes and eventually running the reconciler, which is our logic that says what to do if the EC2 instance object was changed, or whatever change you made to it; this is the logic that is going to run. In the manager, alongside your controller, you can also have a webhook, similar to validating and mutating webhooks; if you want your operator to also serve those webhooks, it's possible to do so.
Now, we also have a couple of packages for certificate watching. Let me rather draw this; it will make more sense. Say you are using something like cert-manager, and you have an admission webhook in your operator, a mutating webhook. You register this webhook with the API server, and the API server can then talk to this mutating webhook. I'm not going to explain mutating or validating webhooks here, because that is not part of this course and there is very good documentation you can read about them. However, when your API server talks to any of these webhooks, whether mutating or validating, the webhook has to have a valid certificate.
It does not talk over plain HTTP; you have to have a valid certificate. And a lot of times you would use cert-manager to issue certificates for the service of the controller that is hosting the mutating webhook. cert-manager, again, is used to issue certificates for your webhook, and by default it rotates your certificates, I think every 90 days. If your certificate has changed, maybe you are storing that certificate in a Secret which is mounted into the pod, you would need to restart your controller pod so that the new certificate is loaded and the next HTTPS request uses the certificate renewed by cert-manager. That causes downtime, and to fix it we have the certwatcher package. It creates a watcher for the changed certificates and reloads them on the fly, without you having to restart your controller. So you have no downtime when you, or cert-manager, update your certificates.
We also have the healthz package, which lets you expose the liveness and readiness probes that you can use for your operator. It exposes the health and readiness endpoints which you can use in the Deployment when you are deploying this operator, saying check this endpoint every now and then. It's a standard Kubernetes liveness and readiness probe. We also have the zap package, which is used for logging.
We then have the filters package and the metrics package here. I think it makes sense to first talk about the metrics and then the filters. When you are writing your operator with Kubebuilder, it doesn't just let you focus on the reconciler; that is where your business logic is, and that's what you are supposed to be writing. With Kubebuilder, your operator, which is running in a pod, by default exposes an endpoint called /metrics. And
this might look familiar to you, because it's something we use a lot with Prometheus. When you write a Prometheus ServiceMonitor or a scrape config, you give three things to the Prometheus server: the IP or service name, the port number, and the scrape path. You can use the same with your Kubebuilder operator. When you build an operator, Kubebuilder exposes the metrics endpoint, and it exposes a couple of Prometheus-readable metrics, like the success rate of your operator: how many times the reconciler has executed, how many times it resulted in an error, how many times it resulted in success. It doesn't tell you how many EC2 instances you have created; these are metrics about the operator itself. And then,
if you have a requirement that your operator can create EC2 instances but you also want to know how many it has created successfully, you can expose your own metrics: you can instrument your code with the Prometheus Go packages, and as soon as you were able to create a VM on Amazon (we'll look into the code in further parts of the video) you can increment your AWS instance count, because you were able to create one more instance, and expose that on the metrics endpoint. The thing I'm trying to explain here is that this is already done for you by Kubebuilder, and by default there is no username or password; it is open to everyone. You can then use Prometheus with a scrape config to scrape these operator-related metrics into Prometheus and show them in Grafana.
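If you do want an operator-specific metric like that, the instrumentation could look like this sketch (the counter name is made up), using controller-runtime's global metrics registry so it shows up on the same /metrics endpoint:

```go
package controller

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// ec2InstancesCreated counts VMs this operator successfully created on AWS.
var ec2InstancesCreated = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "ec2_instances_created_total", // hypothetical metric name
	Help: "Number of EC2 instances created by this operator.",
})

func init() {
	// Register with controller-runtime's registry, which backs /metrics.
	metrics.Registry.MustRegister(ec2InstancesCreated)
}

// In the reconciler, after a successful AWS create call, you would run:
//     ec2InstancesCreated.Inc()
```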
However, you can also use the filters package. This lets you define some sort of authentication so that the metrics endpoint is not publicly accessible: I only want to allow someone with this username and password, I want some sort of authentication on this metrics endpoint. The filters package provides the functions that let us gate our metrics behind authentication.
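Wired into the metrics server options, that gating is a single field; a fragment sketched from the current Kubebuilder scaffold:

```go
import (
	"sigs.k8s.io/controller-runtime/pkg/metrics/filters"
	metricsserver "sigs.k8s.io/controller-runtime/pkg/metrics/server"
)

// Sketch: serve /metrics over TLS and require Kubernetes authn/authz
// (TokenReview / SubjectAccessReview) before anyone can scrape it.
metricsOpts := metricsserver.Options{
	BindAddress:    ":8443",
	SecureServing:  true,
	FilterProvider: filters.WithAuthenticationAndAuthorization,
}
```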
We then have the webhook package, which is responsible for creating those validating and mutating webhooks. There are many videos available; we also did a livestream on Kube Simplify about creating your own validating webhook, and you can definitely take a look at that. I'll put the link in the description. This package helps you declare your validating and mutating webhooks. These packages are the core heart of your operator: without Kubebuilder pulling them together, it would be very, very difficult for you to build an operator. So Kubebuilder is really good at scaffolding your project. When I say scaffolding, I mean it gives you a very good blueprint, a lot of boilerplate code which you can refactor, but to begin with you only focus on your reconciler logic, and for me that's amazing.
Now, here is where main.go imports my api/v1 package, where I am pulling in the custom resource definition which I declared. You remember we had api, then v1, and then the EC2 instance right here; this was our spec of the EC2 instance, and that's what we are importing in main.go. So I import my v1 package under the name computev1, and I also import my actual controller package, which has, or rather will have, my reconciler logic.
Coming forward, we have a couple of variables. The setupLog one is fairly simple: it sets up a logger for our controller. And the scheme that you see here, think of it as a phone book. It's an instantiation of the NewScheme function, and it acts as a registry where you record all of the objects that you want Kubernetes, or rather your operator, to know about. That's what we do in the init function, using utilruntime, which comes from that util runtime package.
And here we have the Must function. What it does is: in case registration returned an error, in case it was not able to register and there was a panic, the program stops right there, because your operator is completely useless if it doesn't know about the core API types like Pod and Deployment, or about your own EC2 instance. So we register the default core API types, and you can look at those using kubectl. Let me increase the font a little bit; we can do kubectl api-resources here. You see? Think of the phone book, which is our scheme: we are adding all of this to our phone book. We are telling our operator, this is everything we have available in our Kubernetes cluster, and then we also add our own custom resources, which we import from api/v1. Essentially, the scheme we declared over here starts as an empty book, and into that empty book, using the AddToScheme function given to us by the client-go scheme package, we add the built-in types, so our operator knows what built-in types are available in Kubernetes, and we also add our custom type, the EC2 instance. Then our registry, the scheme, is a complete catalog, and that's what our operator will be able to use.
Now here's the main function; this is where everything starts for any Go program. We define a couple of variables, for example the metrics address: on which IP and port the metrics endpoint will be listening. Once we have defined these variables, we also define some command-line flags, so when you build this with go build and run the binary, you can pass flags like the metrics address, the probe address, leader election, and so on. So we define the address on which our metrics should be served. We declare some variables for the paths of our metrics certificates, because just like webhooks can be served over a certificate, we can declare that our metrics endpoint needs a TLS config as well, and these variables define the path of the certificate, the name of the certificate, and the key we want to use for our metrics. The same goes for our webhook.
Now, there's a very good concept that operators can help you with, or rather one that matters whenever you run distributed systems like etcd, and especially when you look at the kube-controller-manager. That is also a controller, like the one we are writing; it's a collection of multiple controllers, but it runs as multiple pods in your cluster, one on each control-plane node. The thing is, when you are writing a controller, it is very important to think about how copies of it run in parallel and whether they all make changes or not. For instance, take a case where I was running two copies of my EC2 controller. So this is
one controller, and this is another controller. There was an update: I created an object of the EC2Instance kind, and that update was seen by both of my controllers, controller number one and controller number two. They are both going to go and create an EC2 instance for me, and that is not what I want. I do want high availability, but it should be active-passive: there should be one leader. There can be multiple replicas for high availability, but only one should be acting at any time. That is what leader election gives you, and Kubebuilder makes it very easy: it lets you enable leader election with a simple boolean. In that case both replicas are running, but only one is the leader, so if an update or an event comes from the API server, only the leader sees it and only one instance is created, which is what we want. The other replica is there, but it's not the leader; if the leader is no longer running, the other one automatically becomes the leader and starts serving your requests for EC2 instance custom resource changes. That is what leader election means; you can enable it if you want it and run your operator in high availability. We
define the probe address on which your health probes are available. Remember the healthz package, where you declare your health endpoint and ready endpoint: this is the port number they are exposed on, by default 8081, via the health-probe-bind-address command-line flag, and this is the variable responsible for it. Do you want to use secure metrics or not? This secure-metrics variable and the metrics cert path, name, and key are related, because you can say you want your metrics exposed over TLS; if you do, you then define your metrics certificate path, the certificate name, and the key, and otherwise there's no need for them. You can also say whether your operator enables HTTP/2 or not, and then we have a list of functions that are our TLS options.
I'll explain this more simply as we go ahead. So we declare a couple of variables and a couple of command-line flags. We define some options for our logging, with Development set to true. When you say development, it gives you stack traces on warnings as well and does no sampling. If you go for production, it only gives you stack traces on errors, and it does do sampling for you. So if you are deploying this to production, you should always consider setting Development to false.
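The logging setup just described is only a few lines in the scaffolded main.go; a fragment sketching it:

```go
import (
	"flag"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

// Development: true => stack traces on warnings, no sampling.
// Flip it to false for production builds.
opts := zap.Options{Development: true}
opts.BindFlags(flag.CommandLine) // exposes --zap-log-level, --zap-devel, ...
flag.Parse()
ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))
```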
Now we set up a logger: we pass all of the CLI flags that were given by the user, we define our options for logging, and we create a new logger. Essentially, what we're doing in this line is setting up a new logger with our zap logging options.
Now, this TLS config is basically a list of options you can apply. One option here: if you do not enable HTTP/2, if you are disabling it, you can append that to your TLS options. So we say, in this case I did not enable HTTP/2, so in the TLS options I disable HTTP/2 and only enable HTTP/1.1, because I'm disabling HTTP/2.
Now here's where you create some watchers for your certificates. Remember we talked about these certificates: there could be a certificate for the metrics and a certificate for the webhook, because you can expose both of them over TLS. So essentially what happens is what I already explained: you have cert-manager, cert-manager renews your certificates on disk, and this watcher detects those changes to the certificates and loads them into memory in the current pod, in the current operator. It does not restart the operator; it does it on its own. There's no downtime and zero manual intervention. Otherwise, you would have to restart your operator because your certificate was updated by cert-manager.
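The pattern described above shows up in the scaffolded main.go roughly like this fragment; the file paths are placeholders, and webhookTLSOpts is assumed to be the slice of TLS option functions built earlier:

```go
import (
	"crypto/tls"

	"sigs.k8s.io/controller-runtime/pkg/certwatcher"
)

// Watch the certificate files that cert-manager renews on disk;
// reload them on change, with no restart of the operator.
webhookCertWatcher, err := certwatcher.New(
	"/tmp/k8s-webhook-server/serving-certs/tls.crt", // placeholder path
	"/tmp/k8s-webhook-server/serving-certs/tls.key", // placeholder path
)
if err != nil {
	// We wanted TLS for the webhook but couldn't get it: exit.
}

// Serve whatever certificate the watcher currently holds.
webhookTLSOpts = append(webhookTLSOpts, func(config *tls.Config) {
	config.GetCertificate = webhookCertWatcher.GetCertificate
})
```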
We define our TLS options, which again is a list of functions that each adjust a TLS config, and we instantiate a new variable. It's kind of like creating an alias: by this point the webhook TLS options are just the default TLS options, and we declare a new variable and set them as its value. This way we can customize the TLS configuration for our webhook server, depending on whether we want to use a watcher and certificates or not, so it's easier for us to customize.
Now, if you really gave a webhook certificate path here, that means you want your webhook to actually be served over TLS. So if the length of that variable is greater than zero, we log "initializing webhook certificate watcher" and we will use TLS with that certificate. We define an error variable, and here we create a new watcher for the certificate path and the certificate key. If there was an error, you simply exit with code one, because you wanted a TLS config for your webhook but couldn't get one, so it makes sense to stop right there. And here we append a new option to our webhook TLS options. Up to this point that variable only has one TLS option, disabling HTTP/2, which is what we did above. If you did give a webhook certificate path, we append another option: GetCertificate, the function on the TLS config that returns the certificate we want to use for our webhooks. And here's where we are creating a new webhook server with these TLS options.
A similar thing happens when you are working with the metrics server options. These are the options where we define the bind address on which our metrics are going to be exposed. Do you want to use secure metrics or not, and what are the TLS options? Again, by this point we are just disabling HTTP/2; we don't have any TLS yet, because if you don't set secure metrics, which is a boolean, there are no TLS options: you only work with HTTP/1.1 and disable HTTP/2. But if you did enable secure metrics, you will be using some sort of authentication, so that your metrics endpoint is reachable but not openly accessible; there is authentication and authorization, and only authorized users and service accounts can access your metrics.
So those were the metrics server options we started with: if you did want secure metrics, you get some sort of authentication, and then the same logic we used for the webhook certificate path applies. If you do give a metrics certificate path, you create a watcher, like we did for our webhook; there's a watcher for the metrics certificates, and then we append a TLS option with the certificate information to the metrics server options. Essentially: if the certificate path you gave me is not empty, I am going to run your metrics server with a TLS option that serves that certificate. So don't get confused about what this is doing: if you give the path of your metrics certificate, it exposes your metrics endpoint with the certificate you have given. The same happened above: if you gave a certificate for your webhook, it exposes your validating or mutating webhook with that certificate information.
And here's the one that is quite interesting: this is the manager from controller-runtime, the one I just showed you. This lets us create a manager, and within the manager you can have multiple controllers. It looks something like this: here in my main.go file, this is my operator; in here I have a manager, and within my manager I will have my controller, and I can have multiple controllers in a manager. So main.go is responsible for creating a manager using controller-runtime, and then it is our responsibility to register our controllers with the manager,
and that's essentially what we're doing here. Once we have declared all the variables, given all the flags, defined all of our TLS options, configured whether we want TLS for our webhooks and metrics, and decided whether we want authentication on our metrics or not, once all of that is sorted, we call the NewManager function, which returns us a new manager; this variable holds our manager with all these options. What is the scheme, so our controller knows about all the resources, custom and core, available in Kubernetes? What are our metrics server options: do you want secure metrics or not, what is the port number and IP address your metrics bind to, what is the endpoint, which is /metrics by default, and have you given any certificates? The same goes for our webhook server: is it secured, in the sense of whether you have given certificates to it or not? We then declare the health probe endpoint, the probe address, which was 8081; this is what your liveness and readiness probes will hit on the container when they do a probe.
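All of those options come together in the single ctrl.NewManager call; a fragment sketching it, where the variable names follow the scaffold and the LeaderElectionID string is made up:

```go
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
	Scheme:                 scheme,               // our "phone book" of known types
	Metrics:                metricsServerOptions, // bind address, TLS, auth filters
	WebhookServer:          webhookServer,        // with its own TLS options
	HealthProbeBindAddress: probeAddr,            // e.g. ":8081"
	LeaderElection:         enableLeaderElection, // active-passive HA
	LeaderElectionID:       "ec2instance.compute.cloud.com", // hypothetical ID
})
if err != nil {
	setupLog.Error(err, "unable to start manager")
	os.Exit(1)
}
```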
And here's the leader election, because when you are creating a manager, the manager needs to know whether you want leader election, and you should definitely enable it when you are building an operator that runs in multiple replicas, in multiple pods: both of them are running, but only one is active at any time. So it is absolutely your responsibility to enable leader election. And then, if you could not create the manager, because NewManager returns the manager and also an error, you log the error that you were unable to start the manager and simply exit, because if you don't have a manager, you don't have anything; you don't have a controller, since the manager is the one that looks after your controllers. If there's no manager, there's no reason to continue: just exit right there. And that's
why we use the OS package. Now once your
manager is created we need to register
our controller which is the the custom
resource which is uh what we need to do
here. So if you were able to create the
manager we are using you know um we from
the EC2 instance reconiler we define the
client and we use the manager.get schema
which tells our manager what is the
schema of our EC2 instance. Essentially
we use a function called setup with
manager and this one sets up our EC2
instance custom resource with the
manager and which is available here in
the EC2 instancecontroller.go file. This
is this is the one uh which is where our
reconciling logic is and where our
reconider logic will be. So it sets up
our controller in here in the main.go
go. At this point once we started our
manager, we set up or we add our um you
know we add our um custom resource or we
add our controller to our manager. So
the manager knows that I have this
particular controller. This is what I
need to listen on to if any changes are
done to this custom resources and this
is what the logic is what I need to run
with uh with the operator.
It is also interesting here that if you had certificate watchers, if that value was not nil, you would add the certificate watcher to the manager for your metrics server and for your webhooks. We do not use certificates right now, and I am not using any webhooks for mutating or validating admission, so I am not going to do anything there; for me it stays empty. Otherwise you would add the certificate watcher to your manager. So the manager holds a couple of things. It has controllers, and you can have more than one. It can have a watcher for the webhook certificates, and a watcher for the metrics certificate, and it watches and renews or reloads the certificates on the fly, so you do not have to restart your operator. From the manager we also get a function for adding health checks, and this is where the health checking is set up: we add two endpoints, one at /healthz and one at /readyz, and that gives you a fairly simple Kubernetes health check that lets you see whether your operator is healthy and whether it is ready. And the manager returned by the NewManager function has another function called Start, and this is the one that starts our manager. It is a bit like a car that has an engine and an airbag mechanism: as soon as you turn the key, the whole thing starts. First the engine starts, and it begins sending power to the other components. Starting this manager is a similar story: the manager starts, then it starts the controllers inside the process, it starts the watchers, and everything comes to life.
Of course, if you were not able to create the manager above, or not able to start it, we simply exit, because without the manager nothing is available.
So that was the whole main.go file. What I also wanted to show you is that we import quite a lot of Go packages here. One of them is compute v1 from our api/v1 directory, and the spec there matches the CRD under config/crd/bases. Whatever you put into your custom resource's spec gets reflected into a resource called a CustomResourceDefinition, and that is what you are declaring: this YAML file gets installed into Kubernetes, and it is a resource that tells Kubernetes about another resource; a definition that tells Kubernetes about your custom resource. So you are telling Kubernetes: I am describing another custom resource to you, and it looks something like this.
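A trimmed-down sketch of what such a CustomResourceDefinition looks like; the field names follow this course's EC2Instance example, and the real file is generated for you by make manifests, so treat this as illustrative only:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: ec2instances.compute.cloud.com
spec:
  group: compute.cloud.com
  names:
    kind: EC2Instance
    listKind: EC2InstanceList
    plural: ec2instances
    singular: ec2instance
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                amiID: {type: string}
                instanceType: {type: string}
                sshKey: {type: string}
                subnet: {type: string}
              required: [amiID, instanceType, sshKey, subnet]
```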
Its version is v1, its kind is EC2Instance, it is a namespace-scoped object, it lives under this particular group, and here is the spec for your EC2 instance. You can see the same one-to-one mapping: we have the AMI ID, the instance type, the SSH key, and the storage, which is given here. Now, at any time while writing the spec for the API, you might want to change something. Imagine you want to add a tag; say you add a department, a simple string you can use for tagging, so that when you create an instance you use this department value to add it as a tag on your EC2 instance. And it is very important that whenever you change your specification, you run the make command, more precisely make manifests, because your CRD is not aware that you just changed the specification. The CRD is still the older one.
Think of it as outdated compared to the spec where we added a new value. Do I have a department here? I do not; I cannot even search for it. Okay. So once you make changes, we run make manifests, and as soon as we do, you see a new department property is added, of type string. We can also add, say, a project, which will be another tag, and again I need to run make manifests; as soon as I do, you can see on the right side that project was added as well. So any time you change your spec, your CRD needs to be updated on disk, which is make manifests, and you also need to update the CRD inside Kubernetes, because the flow looks something like this.
Here is you, here is the spec, and you make changes to the spec. All of this is happening on your computer, right here. So you changed the spec, and your CRD changed on disk. However, the CRD does not only need to be updated on the computer where you are developing. You also have this Kubernetes cluster, which is where you need to have the CRD as well, and from which you can then create a custom resource. We talked about this: first the CRD, and from that you create a custom resource.
Now you made the change and it is updated here; it is version two of the CRD, but you are still using the older one, version one. So you can run make manifests from the Makefile that Kubebuilder gives you, and it updates the CRD on disk. And if you are pointing at the right Kubernetes cluster through your kubeconfig (the environment variable), you can then run make manifests and make install, which applies the same CRD that was generated from the spec change all the way to your Kubernetes cluster as well, so they are always in sync. Otherwise you end up thinking: I made the changes to my spec, my CRD is updated, but when I try to use this new change, say a new field called project, it says there is no field called project, even though I see it right here. That is probably because you did not update your CRD in the cluster; you only updated it on your disk, and that is not going to cut it. So whenever I make changes to my spec, I run make manifests and then make install: I update the CRD on disk and I also install it. Make sure you are connected to the right cluster, because otherwise, if it is a different cluster, the custom resource definition either gets installed there if it does not exist, or gets updated if it does, and you could be introducing breaking changes. So be very careful when you do this.
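That habit can be sketched as commands, assuming the standard Makefile targets that Kubebuilder scaffolds and a kubeconfig pointing at the intended cluster:

```
# Regenerate the CRD YAML on disk from the Go spec types
make manifests

# Double-check which cluster the kubeconfig currently points at
kubectl config current-context

# Apply the regenerated CRDs to that cluster so disk and cluster stay in sync
make install
```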
All right, that was the whole explanation of main.go, which is probably a file you will not use or change a lot, but it is absolutely important to know all these options: what the webhook certificate watcher does, and why so many packages are involved. You can expose your metrics endpoint securely, and when I say securely I mean with authentication, and optionally with TLS; both are optional, and the same goes for the webhook endpoints. It is also important to know that you can enable leader election, and that this is the main function where your operator starts.
So now that you have a good idea of main.go, the file that starts everything, let's see how the reconciler works; let's see the reconciler in action. We will make changes to some custom resources, see how our operator receives those changes, and then see what we can do on top of that. This is the foundation we will lay for creating our operator, which reacts to changes on the EC2 instance object, and then we will move ahead from there. Okay. So whenever you want to write your own custom operator, the first thing you need to ask yourself is: what kind of resource are you going to manage? In our case, it is going to be an EC2 instance over on Amazon. We are building a custom operator that goes to Amazon on our behalf and creates an EC2 instance for you. It looks something like this: you have your Kubernetes cluster, in which your operator is running, and there is a human, a certain someone, who gives you a YAML file, because we talk to Kubernetes via YAML. The interesting thing about this YAML is the kind you have declared, and the API version you have declared using Kubebuilder, which is compute.cloud.com/v1 for this API resource, and the kind in this case is EC2Instance. In the end, you give this YAML to the Kubernetes API server, and it knows about the EC2 instance because we will have deployed our custom resource definitions to Kubernetes.
On this resource change, say you want to create a resource of this kind, the controller will see the change, get the data from the API server, and it is the one responsible for going to Amazon and creating your EC2 instance. It is the one responsible for authenticating with EC2, and for providing the minimal set of inputs you need to give Amazon when you want to create an instance. That could be the instance type, which is absolutely required; it could be a security group, which I believe is also required; and you would definitely need to say how much storage your machine needs. Some things are required and some are optional. Tags, for instance, are completely optional; you can provide them or not, it is up to you. So when you write an operator and a custom resource like this, it is on you to settle on at least the minimum required inputs you want to send to Amazon. That is what you are doing when you design your spec, because the YAML you are given has a kind and an API version, then a spec, and then a status. The spec here matches exactly what you put in your YAML, just as with other resources. So in this case your YAML would have something like amiID (that is the name of the key), then sshKey, instanceType, and subnet. We use these JSON tags so that Kubernetes can unmarshal the request: it knows that this particular thing you are giving it is called an AMI ID, and what to do with that key in the request coming to the API server.
We might want to extend this; say, for example, storage. I also have an option called tags in my YAML, which is going to be a map of string to string, and you can also create your own custom struct types. For example, if I add storage with a custom object here: Go knows what a string is, what an integer is, what a boolean is, but it does not know this embedded storage config type, and that is the problem; it says it is undefined. So you define another struct, type StorageConfig, and there you can give the size of the volume (and that is what I love about these AI editors), the type of the volume you want (Amazon has different volume types there), and, if you want, a device name. I do not want that; the only things I would like are a size and the volume type, one of the Amazon-provided ones. You can then also have additional storage: in this case one entry is the root disk, and then you can have additional devices, and this is where omitempty comes into the picture. It is very handy, and the same can be done for our tags, because sometimes these options are optional: a YAML that creates an EC2 instance may or may not have tags or additional storage, but you absolutely need the instance type and you absolutely need the AMI ID. Marking those omitempty would be wrong, because they have to be required fields in your YAML manifest. So when you build your spec, you choose which fields are which. In this case, additionalStorage is a list of StorageConfig, so you can add extra storage configurations; in my case, I will keep additional storage simple, just a type and a size.
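To make the omitempty behavior concrete, here is a small, dependency-free sketch of such spec types. The field names mirror this course's example, but the real types live in api/v1 and also embed the Kubernetes metadata, so this is an illustration, not the project's actual code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// StorageConfig is a custom embedded type, like the one defined in the video.
type StorageConfig struct {
	SizeGiB    int    `json:"sizeGiB"`
	VolumeType string `json:"volumeType"`
}

// EC2InstanceSpec: amiID/instanceType/sshKey/subnet stay required (no
// omitempty), while tags and additionalStorage are optional and disappear
// from the serialized JSON when they are empty.
type EC2InstanceSpec struct {
	AMIID             string            `json:"amiID"`
	InstanceType      string            `json:"instanceType"`
	SSHKey            string            `json:"sshKey"`
	Subnet            string            `json:"subnet"`
	Tags              map[string]string `json:"tags,omitempty"`
	AdditionalStorage []StorageConfig   `json:"additionalStorage,omitempty"`
}

// specJSON marshals a spec so we can see which keys survive serialization.
func specJSON(s EC2InstanceSpec) string {
	b, _ := json.Marshal(s)
	return string(b)
}

func main() {
	minimal := EC2InstanceSpec{AMIID: "ami-0abc", InstanceType: "t3.micro", SSHKey: "dev-key", Subnet: "subnet-1"}
	fmt.Println(specJSON(minimal)) // tags and additionalStorage are omitted entirely

	withExtras := minimal
	withExtras.Tags = map[string]string{"department": "platform"}
	fmt.Println(specJSON(withExtras)) // now a "tags" key appears in the output
}
```

The two Println lines show the contrast: the minimal spec serializes without tags or additionalStorage keys at all, while the required fields always appear, which is exactly the distinction the CRD's required list encodes.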
The same thing happens for your status. When you run kubectl get on the resource with -o yaml, what you see under status is the phase it is in, the instance ID, and the public IP; this is the information you get back from Amazon. Imagine a developer gives you a YAML; say you are building an internal developer platform and you want developers to query the resource they created. They run kubectl create -f with an EC2 instance, and when they later run kubectl get ec2instance, you need to give them some sort of information. Probably the first thing you want the user to know is the state: whether the instance failed, is running, is pending, whatever state it is in. Then you probably want to give them the public IP of the instance, if your organization allows instances to have a public IP. The only place you can get this information is Amazon. So when you create your instance, you poll to see whether it is running within a certain time; if it is not, you fail the operation. Otherwise, you get back some information, and for us that is the state of the instance and the public IP. Many other things come back from the create-instance operation, but that is not what we care about; we want the user to see, in their status, the phase, which is a string, the instance ID, also a string, and the public IP, also a string in our case. Now, whenever we make changes to our API spec, I told you it is absolutely important to run the make command from the root of this Kubebuilder project so that it generates the custom resource definition for you.
All right. In api/v1 we have the EC2 instance types file, and the actual custom resource definition that gets created is under config/crd/bases, in the compute.cloud.com EC2 instance file. When you make changes to your spec, like here, and run the make command, the Kubebuilder tooling knows how to write the custom resource definition as boilerplate. This is where you define the group for your resource, the kind, and then the version. Notice that for a single kind you can have multiple versions, because what you see is a list of available versions: you could have compute.cloud.com/v1, to which this schema applies, and compute.cloud.com/v2, to which another version of the schema applies. This is why you may have seen that a particular key is only available in a newer version of your YAML, or only in an older version because it was removed in version two. The important part is the spec in here: we have our properties, an AMI ID, our instance type, our SSH key, and our subnet, and these are all required because we did not add omitempty. But because we made changes to our spec, I absolutely have to regenerate these manifests, and for that I can simply run make manifests.
What that does is update your custom resource definition with a few more parameters. For example, one of them is additionalStorage: it was not there before, but now it is. There is also a new option called storage, and another one called tags, which is a map of strings. Every time, it also updates the required list, because some fields do not have omitempty and are absolutely needed. This is how Kubernetes knows that a particular key is not present in the YAML and that it has to complain about it: it cannot accept the user's request, because the custom resource definition marks that key in the YAML as required and the user has not provided it.
This is what you give to your Kubernetes cluster. Before you can create an EC2 instance, before you can do anything with the operator, the first thing you need to do is hand this to Kubernetes, because if you do not, then when a developer creates a YAML of kind EC2Instance with that API version, Kubernetes has no idea what resource the user is talking about: what is this group called compute.cloud.com, and in version one I do not have a resource called EC2Instance. You can either kubectl apply this custom resource definition YAML yourself, or use the make install command with the Makefile, which does it for you: you can see we use kustomize to build our custom resource definition and then apply it with kubectl apply -f on standard input. And with that, we now have our custom resource definition.
First, let me run make uninstall and see what happens. If I run kubectl get ec2instances.compute.cloud.com, or simply ask how many EC2 instances I have, Kubernetes says it does not have that resource. But if I then deploy the custom resource definition again, it no longer says it does not know what that resource is: if instances exist, it shows them to the user, and if none exist, it just says there are none found, instead of saying it has no idea what resource you are talking about. So that is what happens when you run make install: it creates your custom resource definition, and you can run a get on your CRDs and see the custom resource definition I have, which you can also inspect directly; it is the same thing I just showed you in Cursor, only here in the terminal.
So that is how you would update or create the custom resource definitions. In my case, what would the YAML look like? I would probably ask my AI: take this spec and give me an updated YAML for this resource, and it will just spit out what the YAML would look like. I am just going to accept that. Here we go: this is how your YAML would look. It is of kind EC2Instance, and you can see the spec fields: it tells you there is an AMI ID, the SSH key, and the instance type. When you give this to your developers, you might want to make it a lot simpler, or at least rename things like instanceType to something like vmPreset, if they are more familiar with those words. The SSH key makes total sense as it is. That would have mattered more in other examples, but in this case the YAML is perfectly fine.
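For instance, an EC2Instance manifest matching this spec could look like the following; all values are placeholders:

```yaml
apiVersion: compute.cloud.com/v1
kind: EC2Instance
metadata:
  name: dev-box
  namespace: default
spec:
  amiID: ami-0abcdef1234567890
  instanceType: t3.micro
  sshKey: dev-key
  subnet: subnet-0123456789abcdef0
  tags:
    department: platform
  storage:
    sizeGiB: 20
    volumeType: gp3
```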
So, that was what I wanted to show you as YAML. The next thing to do, once you have your YAML and everything else defined, is to look into the reconcile loop. By this time, Kubernetes knows it has a custom resource under our compute group, and that in version one there is a resource called EC2Instance. Now, if someone gives it an EC2 instance, what should it do? If someone gives it a YAML saying please create me an EC2 instance, what happens, and how does it react? This is where we look into our reconciler, so let's get started and see how we will build one.
Our reconcile loop looks something like this; it is under internal/controller/ec2instance_controller.go. This is where the magic happens: whenever you make changes to your custom resource, this is the place the change comes to, and this is where you provide the logic that operates on the resource that was changed, your custom resource. It is in the controller package, and it imports quite a few things. One of them is controller-runtime, which is absolutely important: it handles the runtime of your controller. You can also see that it pulls in our own EC2 instance types: it goes to github.com, to this operator repo, api/v1, and imports it as compute v1. Essentially, your controller needs access to the spec of your custom resource. It could also simply have referenced api/v1 locally, but I think this is better because I already have it on GitHub. So if I go to github.com and open my repository (let me go to GitHub properly; I should have copied the link, otherwise it takes me to the wrong repo), here you can see api/v1, and this is where it looks for the EC2Instance type. This is where our code is, and it is what the Kubernetes operator and the controller use to map your request onto a known data type, the known spec and status of the custom resource. It is the heart of the object you are creating.
The file creates a reconciler, which uses the client from controller-runtime to help you communicate with the Kubernetes API server, and a Scheme object that registers your schema. There are also a couple of Kubebuilder markers; these generate the RBAC rules so that you can work with the custom resources, because when you run an operator, it runs in its own namespace, but if a custom resource is created in a different namespace, the operator needs access to see into that namespace as well, and this is where the RBAC is extremely helpful. Now, what can we do with this? This is the Reconcile function, the one all of your requests are going to land in: whenever you run kubectl apply and change the resource, this is where the change gets looked at. The function returns two things: first, the result of the reconciliation, and second, an error, if there was one; if there was no error, it simply returns nil. Now, this is the beauty of Kubernetes self-healing. You know how, if you create a pod that has a persistent volume claim, but that PVC is not yet bound to a PV, the pod stays in a pending state? It keeps on being pending, but as soon as you create a PV and it is bound to the persistent volume claim, the pod starts automatically, because there was a requeue going on for that particular pod. The reconciliation logic for the pod kept checking whether the requirements were fulfilled; if they were not, it returned an error and put the pod back in the queue to reconcile. This is the beauty of self-healing: it will be done eventually, once all the conditions are met, and you do not have to trigger another run of the reconcile loop yourself; Kubernetes does it for you. And this is where you will provide your logic. The very first thing you will do is operate on that instance.
For example, when you say that users can create an EC2 instance of their chosen type, you want the user to give the name of the instance; you might want them to give the tags of the instance; you might want them to give the storage config. You need to extract this information from the API server request that came to the reconcile loop, so that you can use it to talk to Amazon, in our case, because this is a cloud operator. So you need to get all the fields the user provided in this YAML into variables, so you can work on top of them. This is very important: the user is creating a resource of kind EC2Instance, so you also need a variable of kind EC2Instance, so that you can use the Kubernetes schema to store the actual keys in your variable. Think of it like this: the user is sending a circle, so you need a mold that can hold a circle. If the user is sending a triangle, you need a mold that can hold a triangle. If a user is sending int data, you need a variable of type int. The same thing happens here: the user is sending data of kind EC2Instance, so you need a variable of kind EC2Instance.
Let's declare that first. The first thing I do is create an EC2 instance object using compute v1: that is where the EC2Instance type is declared, and you can see it is essentially what we created in our types.go. It has a spec and a status, but fundamentally it is the root of our Kubernetes object: the EC2Instance has type metadata, object metadata, the spec, and then the status. That is the type of the variable we create in our reconciler. Then, because the controller listens for any changes made to your custom resource, you can act on them. So you create a variable of type EC2Instance (I will keep the name simple), and then we can use the Get function. What Get does is use the context of your request and, more importantly, the namespace and name under which this resource was changed (I am not saying created; the namespace of the resource where the update happened), and the actual in-flight YAML, the actual content, is then stored in this EC2 instance object. Think of it as taking the YAML from the user and giving it to your reconciler: now it knows the namespace in which this object was updated, the instance type this YAML had, the storage type it had, the tags the user wanted, and you can build on top of that. There is also a logger we can use; here you see a log function, and with log.Info I can log all of my requests. So: I create an object of type EC2Instance, I fill it from the in-flight request, and I log "reconciling EC2 instance", and you can see I can include the name of that particular resource. Rather than Info, let's just print it for now.
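Put together, the fetch-and-log step described so far looks roughly like this sketch. It assumes the Kubebuilder-generated project (EC2InstanceReconciler, the compute v1 types, and controller-runtime on the import path, with log being sigs.k8s.io/controller-runtime/pkg/log), so it compiles only inside that module:

```go
// Sketch of the Reconcile method discussed above; not the course's exact code.
func (r *EC2InstanceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)

	// The "mold": a variable of the same kind the user is sending.
	var ec2Instance computev1.EC2Instance

	// Fill the mold with the object that triggered this reconcile, looked up
	// by the namespace/name carried in the request.
	if err := r.Get(ctx, req.NamespacedName, &ec2Instance); err != nil {
		// The object may already be gone; in that case there is nothing to do.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	logger.Info("reconciling EC2 instance",
		"namespace", req.Namespace,
		"name", ec2Instance.Name,
		"instanceType", ec2Instance.Spec.InstanceType)

	// We did not modify the object: no error, and the zero-value Result
	// means "do not requeue".
	return ctrl.Result{}, nil
}
```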
So I create my EC2 instance variable and then print: I got a request for an EC2 instance in this namespace, and the EC2 instance is this one. You could also do an fmt.Println of the entire spec; I do not want to print the whole spec, so I will just print the namespace and the instance name. You can see all the options the editor suggests here. Let me do that again; I just want to see the data that was given to me, so I print: I got a request for an EC2 instance in this namespace, and the EC2 instance name is ec2Instance.Name.
Uh and I would say then the instance
FMT.
LM instance
type is
uh EC2 instance.spec. And you see this
is this is the beauty of uh the AI
editors again. Now you can see I am able
to get all the information which was
sent which was you know watched by my
reconciler by my operator under E2
instance spec and you can see AMI ID SSH
key subnet tag store regation storage
this is essentially what you were
building in the spec of your custom
resource this is a onetoone mapping that
is why we created a variable of type EC2
instance and then we got the you know
the inflight request that we received
from the API server and then I'm saying
I got a request for blah blah blah the
only thing I do not have: see, I have the instance ID, the AMI ID, SSH key, subnet, tags, everything, but I don't have a name for my instance. Actually, this is the name of the Kubernetes object that I'm giving, but maybe I want the user to also give the name of the instance, so I would simply add instanceName, because the instance name could be different from the Kubernetes object's metadata name in the YAML that you give; they could be different. And here I can say my name would be in the spec: spec.instanceName (a capital Spec in the Go code). Now the important thing is, I just added another field to my spec, and the custom resource definition that is currently in Kubernetes has no idea about this new instance name. So I have to do my magic again: I run make generate, make manifests, and then make install, so that my Kubernetes is updated that there's a new key in the struct for the spec called instanceName. And that's it.
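To make the shape of this concrete, here is a minimal sketch of what the reconciler is printing. The struct types are simplified stand-ins for the generated API types (field names like InstanceName mirror the spec we just built; the real function receives a ctrl.Request and fills the object via the client's Get call):

```go
package main

import "fmt"

// Simplified stand-ins for the generated API types (assumption:
// field names mirror the EC2InstanceSpec we built earlier).
type EC2InstanceSpec struct {
	InstanceName string
	InstanceType string
	AMIID        string
}

type EC2Instance struct {
	Namespace string
	Name      string
	Spec      EC2InstanceSpec
}

// printRequest mirrors what the reconciler logs: the namespace of the
// request plus fields read from the decoded spec.
func printRequest(inst EC2Instance) string {
	return fmt.Sprintf(
		"I got a request for an EC2 instance in namespace %s; instance name: %s; type: %s",
		inst.Namespace, inst.Spec.InstanceName, inst.Spec.InstanceType)
}

func main() {
	inst := EC2Instance{
		Namespace: "default",
		Name:      "my-ec2-instance",
		Spec: EC2InstanceSpec{
			InstanceName: "web-server-1",
			InstanceType: "t3.medium",
			AMIID:        "ami-12345",
		},
	}
	fmt.Println(printRequest(inst))
}
```

The point is the one-to-one mapping: whatever keys the user writes under spec in the YAML arrive as fields on the decoded Go struct.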
Now, once you get the data, you iterate on top of it. In my case I'm just printing it right now, but as we move forward we will use this data to create an actual EC2 instance; that is where your real business logic goes. Then, once you have used the data (in my case I'm not changing anything in the object: a resource was created, I got information about it, but I'm not updating that resource), you return a result. If you look at controller-runtime's Go docs, this Result contains two things, including whether to requeue or not, and that defaults to false. This is very important. When you exit your reconciler function, you need to tell Kubernetes two things: whether there was an error in the reconciler function, and whether there is a requirement to rerun this reconcile loop. Remember, we talked about this in the previous part of the video: you only rerun the reconcile loop if you have updated the API object. We are not doing that right now, so we do not need to send any requeue boolean; by default it is false. So in our case we don't have an error, and we are also sending a false for the requeue, so
this reconcile loop will not run again.
It's kind of like this: when you enter the reconcile loop, you have the reconcile function, and the request came over to this function. You applied your change, your business logic, whatever you wanted to do. In our case, I'm just printing things; I'm not creating an EC2 instance, and I'm not updating my custom resource with the status of the EC2 instance creation. So the question here is: was a change made?
So in my case, did I make a change to the custom resource whose request came to me, which is the EC2 instance? I would say no (in other cases you could say yes). If you have no changes, you simply return nil for the error and false for the requeue. If you did make some changes, then you have to return true for the requeue, and if there was an error, you return the error; if the error was nil, you return nil. That part we will talk about shortly. But for now, I'm not changing anything in the EC2 instance object, so I'm just returning a false.
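In Go, the tail of the reconciler then looks roughly like this sketch (Result here is a simplified stand-in for controller-runtime's ctrl.Result, whose Requeue field likewise defaults to false):

```go
package main

import "fmt"

// Result is a simplified stand-in for controller-runtime's ctrl.Result
// (the real type also carries a RequeueAfter duration).
type Result struct {
	Requeue bool
}

// reconcile sketches the exit of our loop: we only printed the spec and
// changed nothing in the object, so we return an empty Result and nil.
func reconcile() (Result, error) {
	// ... business logic would go here ...
	return Result{}, nil // no error, no requeue: wait for the next event
}

func main() {
	res, err := reconcile()
	fmt.Println(res.Requeue, err) // false <nil>
}
```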
Now at this point, guys, let me add a log line here that says: reconciling EC2 instance, plus the name, and this is the name of my instance. Now
let's get a YAML and see how this will function. Now is the time we run our operator against Kubernetes. We could build a container image, push it to a registry, and pull it from there. But the good thing about using Kubebuilder is that when you have a working development environment and a kubeconfig that points to your Kubernetes cluster, you can just run the main program locally, and it behaves as if it were running inside your Kubernetes cluster. And I will again call my trusted AI to use the spec and give me a dummy YAML so we can create that.
This is the spec. Let me quickly grab it, and then I tell the AI: please undo everything, I don't need that change. Cool. So, do I have a folder called kubernetes? No. Let me make an example folder: example/instance.yaml. And there is our spec.
Before the spec we have an apiVersion, then a kind, then metadata, and then you see we have the spec. The apiVersion is compute.cloud.com/v1, the kind is EC2Instance, and the metadata holds the name of the Kubernetes object for our EC2 instance.
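Put together, a minimal manifest along those lines might look like this (the apiVersion and kind are the ones shown on screen; the spec field names are assumptions mirroring the spec we defined):

```yaml
apiVersion: compute.cloud.com/v1
kind: EC2Instance
metadata:
  name: my-ec2-instance
spec:
  instanceName: web-server-1
  instanceType: t3.medium
  amiId: ami-12345
```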
And there we go. This is simply what we have, and then I would say: let's run our operator.
Now we can do go run cmd/main.go, because in the cmd folder you have your main program. This is the entry point; in any Go project, your entry point is always the main Go file. This is the one that registers your scheme with Kubernetes, the one that creates a client so you can talk to Kubernetes, and it registers some booleans, some flags if you will. We will clean this up later because we don't need a lot of that; we have already gone through this code. The most important thing it does is start the manager. Enable, enable, enable... but I think we did see somewhere that it was starting the manager. Wait a second, this is the NewManager call; that just creates the manager. Where was that?
Ah, here. So we're going to have a log of "starting manager", because we did not work with webhooks, and we don't have any readiness check or liveness check, nothing. So we should just see "starting manager", and then we will see whether we get any request to our controller. So here I export the kubeconfig. Let me increase the font a little bit, and here I run my main function. If the font is a bit small, please bear with me; I hope it is readable. Essentially, I'm running the main function now, so we will be running our operator.
Do I have any EC2 instances? No. Do I have them in any namespace? No. How does our example look? I need to go to the operator folder here, and then I can do kubectl create -f example/instance.yaml with a dry run, to see whether our YAML is good. And there you can see the YAML was fine.
Then I simply run my program first, and this is how your Go code will be running. You see, this is all what Kubebuilder does for you: you do not have to set up your authentication with the API server, and you do not have to figure out how to run your controllers, how to run the multiple reconcile loops you have for different API versions. It does that for you. It starts an event source, which is kind of like the listener for your object in Kubernetes, and here it starts a worker: there is a controller for EC2Instance, with this group and this kind. So you have one controller for one resource; it is a one-to-one mapping. You can run multiple instances of that controller, and in that case you would use leader election, because if one instance is handling the requests for that custom resource, the others should not. In our case we only have one replica, but we still have one controller per object. If I were creating more custom resources: say right now I have an EC2Instance, that is my custom resource, and for that I have a controller, which you can see is also called EC2Instance. If I were to create another custom resource, say a StorageBucket (maybe I want users to be able to create buckets in my Amazon account very easily), there would be another controller for StorageBucket. They could be running in the same manager, in the same manager or operator pod. This is where you can review the earlier part of the video where we talked about what is inside an operator: there is a manager, and within the manager you have multiple controllers, but each controller maps one-to-one to its object. Think of it like this: if anything happens to this resource, this code will run; if anything happens to that resource, that particular code will run.
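As a toy illustration of that one-to-one mapping, here is a mock dispatch table. This is not the real controller-runtime API (there, each Reconciler is wired to one kind via SetupWithManager); it only shows the routing idea:

```go
package main

import "fmt"

// A mock of the mapping a manager holds: one reconcile function per
// kind. In real controller-runtime code, each controller is registered
// for exactly one resource type.
var reconcilers = map[string]func(name string) string{
	"EC2Instance":   func(name string) string { return "EC2 logic for " + name },
	"StorageBucket": func(name string) string { return "bucket logic for " + name },
}

// dispatch routes an event for a kind to that kind's controller logic.
func dispatch(kind, name string) string {
	if fn, ok := reconcilers[kind]; ok {
		return fn(name)
	}
	return "no controller watches " + kind
}

func main() {
	fmt.Println(dispatch("EC2Instance", "my-ec2-instance"))
	fmt.Println(dispatch("StorageBucket", "my-bucket"))
}
```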
Now is the moment of truth. Would something happen if I simply say: please create me an instance.yaml? I should see something here; that's what I am most interested in. So let's create that. Of course it's invalid. Oh, there you go: it says the kind is invalid, it must be EC2Instance. Of course, in the client-side dry run my sample kind was wrong.
And then if I do a create again, you see: there is my request. I know that the instance name is my EC2 instance. It's not that Kubernetes knows about this; it's our operator that knows about it. So Kubebuilder started the worker, and this, from here till here, is our code. We get the log "reconciling EC2 instance" plus the name, and then all of our program executes: I got a request for an EC2 instance in the namespace. You see it gives you the namespace, default, and the object name as well,
which is req.Namespace. This is telling you the namespace as well as the name of the object that you have. "The EC2 instance name is...": this is now reading the spec, and you can see the tags. It gives you a map of environment: dev, owner: alice, which is exactly what your example YAML looks like. So essentially, whatever the user gave, my program (our operator, our controller, most importantly) knows about it. My storage would be size 50 and type gp2; right now it's just printing that as an object map, but we can do better for storage. Let's make some changes: I want to print "storage size is 50 and type is gp2". So in the format string I print the storage size and the type from the spec. You can obviously access any object that was in your spec like this, spec.storage.size, because that is how you address the YAML fields, so I would say Spec.Storage.Size, and the same for the type.
You can also do a delete. Now, see, this is very, very important. This bit executed when we created the resource. When I delete that resource, you see my reconciliation loop started again from the very beginning. This is absolutely important.
Whenever you make any changes to your object, the reconciler starts from the very beginning. It does not know whether you created the resource, deleted the resource, or updated some metadata annotation. It makes no distinction about what you did; it only knows that an update happened. And this is where it is your duty, as someone writing the operator, someone writing the controller logic, to handle that: your reconciliation loop can run many times, but if no change was required, no change is actually made to your object. In this case you can see that, because the resource was deleted, we don't have any EC2 instance name, we don't have any instance type, nothing, but the loop still ran completely,
and here it says reconciled EC2 instance, and so on. To make it more evident, let me get rid of all these extra prints, because I want to keep it simple. Instead I would say fmt.Println: "an update was made to the EC2 instance resource". I'm not printing the name or anything; I'm just saying that an update was made and that is why I am reconciling.
Now I will run this again. I stop my program, and this is the beauty of stopping the program when you are building with Kubebuilder: it has a graceful shutdown. It doesn't just stop the program abruptly; it cleanly shuts down your manager. Because I made some logic changes, I start main.go again, and then I do the kubectl create again. Here you can see it says: an update was made to the EC2 instance resource, and this is why I am reconciling it. That's the main logic here. And then I got the instance type, which is t3.medium. If I now make
some changes to this EC2 instance, say I want to add a label to the metadata: I add labels with hello: world, save and exit. You see, I got another line. It's not that I created the object; I only updated it. I was not creating that resource; I just edited it, and that was only a simple metadata change, the labels, but my code ran again from the very beginning. What if I add some annotations to my object? Let's go to the annotations and add hello-again: world. You see, the whole reconcile loop runs again. The thing I'm trying to tell you is: whenever you make any changes to your object, to your custom resource, the whole reconciliation loop always runs. What if I remove the label I added, or remove the annotation? Same thing again; you see it running again. Kubernetes does not differentiate whether it was a metadata change or a spec change. It simply goes ahead and says: okay, you changed the resource, and this is the update.
This is why, when you make changes to, say, your instance name or instance type, the reconciler sees them. This is the beauty of how a reconciler works. Whenever you make a change, say you change the instance type from t3.medium to t2.micro, the reconciler has no state. It does not remember that before it was t3.medium and now the user has asked for t2.micro; it does not remember the past request. It only knows the state right now. I mean, the previous value is in etcd: when you change your instance type from t3.medium to t2.micro, the "before" is stored in etcd, that is correct.
But the reconcile loop that runs will have no idea that the user previously asked for t3.medium; it is completely stateless. What the reconcile loop does is take your request and run your logic, the logic you would add that allows the user to change the instance type in the spec, or maybe dynamically change the tags they want to give. So if the user has updated the type key in the YAML of the EC2 resource, it is your responsibility to go to Amazon, see whether a t3.medium instance exists, and if it does, delete it and create a t2.micro, because you can't change the instance type in place, as far as I remember; if you can, that's on the Amazon side and a different story. The point I'm making is that your operator, your controller, the reconcile loop, will not remember the past request. It always has to check: the current state is t3.medium, the desired state is t3.medium, nothing needs to be done. But if the current state in the cluster is t3.medium and the desired state is t2.micro, it goes to Amazon: okay, this needs to go away and that needs to come up. This is how you get self-healing, or eventual consistency, and then you update the object, which we will see in the next sessions.
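That stateless compare-and-act idea can be sketched like this (plan is a hypothetical helper; the real logic would call the AWS API to look up the live instance rather than receive the current type as a string):

```go
package main

import "fmt"

// plan compares the current instance type (what exists in AWS) with the
// desired one (what the spec asks for) and returns the actions to take.
// It is stateless: it never consults any previous request.
func plan(current, desired string) []string {
	switch {
	case current == desired:
		return nil // already converged, nothing to do
	case current == "":
		return []string{"create " + desired} // nothing exists yet
	default:
		// The instance type can't be changed in place here, so replace it.
		return []string{"delete " + current, "create " + desired}
	}
}

func main() {
	fmt.Println(plan("t3.medium", "t3.medium")) // []
	fmt.Println(plan("", "t3.medium"))          // [create t3.medium]
	fmt.Println(plan("t3.medium", "t2.micro"))  // [delete t3.medium create t2.micro]
}
```

Every invocation recomputes the diff from scratch, which is exactly why the loop can run any number of times without side effects once the states match.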
So this is how you build an operator that knows how to watch the API server for your custom resource changes and run the reconciliation logic whenever the object changes. This gives you a very good idea, not quite a beginner idea, but a good enough idea, for you to build your own operators and run them on Kubernetes.
The next thing we are going to learn (I already have it available) is how to write an operator that actually creates an EC2 instance for us. The next parts of this video are going to be more about how to use the AWS SDK in Golang to create an EC2 instance on Amazon, because from the Kubernetes point of view, from the operator point of view, we now know how to write an operator, how to write the spec, how to install the custom resource definition, and how to react to changes in our custom resource. Now it's about what you do with that change. In my case, I'm just printing it. In the actual course, we will use these changes and build on top of Amazon: we will create an EC2 instance. Up to this point you know how to write your operator, you know you can get requests, and you know the reconcile loop does that for you. So in the next part we will use the AWS SDK in Golang to create an EC2 instance, and then we will see that if a request was successful we don't need to reconcile again; we will also talk about finalizers, but that is all coming up. So let's look at how we can create EC2 instances with our operator using Golang.
Okay, before we can actually get started with the code, there is something that is absolutely important for you to understand. We have been working with the reconcile loop, and we talked about how the reconciler is the one that takes your request and runs it through your logic; that's where you make your changes so that the current state becomes equal to the desired state. However, this reconciler is expected to return two values: one of them is the Result, and the other is an error (or nil). These two return values are what Kubernetes needs in order to know what should be done with your current reconcile request. So imagine your reconciler got a request, and you made some changes to your environment, to the resources you need to change; then you have to tell Kubernetes whether you want to rerun the reconciler for the same request or just wait for new requests.
In this case, you did not return any error. Based on these values of the result and the error, Kubernetes decides: do I need to rerun your existing request through the reconciler again? And this is how we get things like self-healing.
If you want, you can give this a try. Get yourself a pod that is in a Pending state because of CPU or memory: ask for resources that are not available in your cluster, and the pod will stay Pending. Then go ahead and add a new node that is able to host that particular pod. Once that node is active and available, the pod goes from Pending into Running. You didn't have to do anything; you didn't have to tell Kubernetes: hey, I got a new node, please put my pending pod on it. It doesn't work that way; it is self-healing. The first time Kubernetes tries to place your pod on a node, it says: okay, there is no node available, I'm going to put this in a Pending state. Think of this as a reconciler: the decision was made to put the pod in the Pending state.
And the controller responsible for scheduling your pod returns an error, essentially saying: for the request that came to me, I was not able to process it properly. This is how Kubernetes knows it has to retry that request, and that is how self-healing works. While Kubernetes was retrying and retrying with an exponential backoff, you happened to add a new node, and when the logic ran again, the pod was no longer unschedulable. The reconciler said: okay, you asked for eight CPUs, and I now have a node with 20 cores available; I'm not returning an error, I'm returning nil, and the pod was scheduled and eventually went into a Running state. This is something that
Kubernetes does for you. And as the developer of this reconciler, it is absolutely your responsibility to tell Kubernetes whether your reconcile function was okay, whether you got an error, and whether you would like Kubernetes to retry that particular request. This could happen for EC2 instances: imagine your reconcile function was calling the AWS API to create an EC2 instance and you were not able to. You had the right credentials and the right IAM access for the user you are using, but maybe there was a network timeout, or anything else that stopped your request from being processed. You would like to retry, right? Maybe after 10 seconds or 20 seconds, or whatever your interval is. In this case, you tell Kubernetes there was an error: my reconcile function is returning an error, please retry. And based on the result and the error values, Kubernetes decides: do I retry this particular request, or do I wait for new events, new updates to the custom resource this reconciler is listening on? So there is a very simple set of conditions that your reconciler can return, and they are in priority order.
First: if your reconciler is returning an error, meaning the error is present, the Result is completely ignored. Whatever you send in the Result is ignored, and Kubernetes retries with an exponential backoff. A little about the Result: what you actually send in it is two things. First, whether you want to requeue or not, which is a boolean. Second, a duration for the requeue: RequeueAfter.
If there is an error present from your reconciler, the Result is completely ignored and Kubernetes will always retry. It says: okay, the reconciler is giving me an error, which means it could not properly process the request that came in, so I will retry it. This is where the self-healing loop comes into the picture. The second case: you are not returning any error (you think everything is fine, you have processed your request) and you do set a custom RequeueAfter. This RequeueAfter is the time after which your reconciler should run again, and it behaves like a forever-running loop.
So imagine this: you create an instance, and it's okay, but you probably want to check on the instance every 10 or 20 seconds; maybe you are doing some sort of drift detection. If you looked at your instance and everything was fine, you have no errors for it, but you still want to check again after 20 seconds, and that is what you send. You are not sending an error, because you did not have one, but you are sending a fixed time: you are telling Kubernetes that there was no error in my request, but I want you to rerun this reconciler every 20 seconds. This is like a forever-running loop; it never stops, because you don't have any errors but you always want to run it again and again. What could be the reason for it? I just told you: maybe some sort of drift detection.
The third case: you are not returning any error, you want to requeue, but your custom RequeueAfter is not set. This is similar: no errors, requeue requested, but no interval, which means you are asking Kubernetes: hey, my reconciler was okay, I want you to run it again, but I'm not telling you at what frequency. It is like the second case in that you're not sending any errors, but there you tell Kubernetes how frequently to retry, and here you don't; you leave that to Kubernetes, and that is going to be an exponential backoff. Kubernetes says: okay, the user said there is no error, the function ran properly, fine, but they are not asking me to run this on a fixed interval, so I'm going to use an exponential backoff. It reruns your request with increasing delays, starting at a few milliseconds and roughly doubling each time, up to a maximum limit (I think the default cap is around 1,000 seconds), after which the delay stops growing.
And the last case you can return from your reconciler: you do not have any errors and you also did not set any requeue flags. You returned an empty Result and a nil error. This is where Kubernetes says: okay, everything was fine, I'm not doing anything; I'll just wait for a new update, a new event where the custom resource has been changed, kind of like waiting for new requests. This is absolutely critical for you to understand; otherwise you might see your reconciler making changes again and again, or running again and again, because you did not return the right set of values for Kubernetes to understand what to do.
Now, once this is understood, as I was saying, we will look into the Go code. Let me pull that up. On your screen you can see that I've made some changes to our instance spec. Before, it was a very simple one.
It was just having an instance type, an
AMI ID, probably a key pair and a
security group. But when you want to
make things more robust and when you
want to make things more production
ready, you have to think from an overall
point of view. When you want to create
an EC2 instance, there could be many
things that you have to give. You
definitely have to give the instance
type whether you want to use a T2 micro,
T3 micro or any other instance family.
Then you absolutely have to give an AMI
ID which is going to be the the AMI ID
you want to use. You have to give the
region as well under which your instance
should be created. You need to give the
availability zone. You have to give the
key pair so that you can log into the
instance. You need to give the list of security groups, and the subnets in which your instance should run. Also, when you want to provision the machines with your changes as soon as they boot up, we usually use Amazon's user data, and that too is something you can give. You can give tags as well, you have to give storage, and there is a boolean for whether you want the instance to have a public IP or not. Now, on the right side you can see this omitempty. This is how you control which fields in the YAML of your EC2 instance spec are optional and which are required. For example, tags could be optional and user data is optional, but storage is absolutely needed, the AMI ID is absolutely needed, and the instance type is required. So omitempty lets people define only the important, required fields, and the other ones can simply be skipped.
So here you can see I have a storage field of a new struct type called StorageConfig, and in StorageConfig we define a root volume and, optionally, some additional volumes. This is an example where you give your root disk as 100 gigs, and maybe you want a VM for a database, so you add a bigger disk to the instance; that is done via the additional volumes. Both of them are of type VolumeConfig, and additional volumes is a list of VolumeConfig, because it is a list of extra disks you can attach to your instance. VolumeConfig itself is very simple: you define the size of the disk, the type of the disk, the device name that will be visible in the instance when you attach it, and an encrypted boolean in case you want the disk to be encrypted, because Amazon allows you to encrypt your disks if you want that.
So think about the EC2 instance more holistically: what do you want to allow the users to declare in their spec and metadata? In this case, you are allowing developers not just to create an EC2 instance, but also to log in with their key pair, and to use user data that Amazon applies while creating the instance, so the machine is preconfigured before they even log in and the VMs are exactly how they want them to be.
So that was a bit of a change in our EC2 instance spec, to move it from development toward something more production-ready. I also made some changes to the status, so that when you do kubectl get ec2instance you see the spec and also the status. In the status I would like to see the instance ID, so it is easy for people to see what exists on Amazon, and what Kubernetes knows about the state of that instance: whether it is running, terminated, unknown, or stopped, all of those Amazon EC2 instance states. Another very important thing is the public IP, because when I do kubectl get ec2instances I should see enough to let me log in to that particular instance, and that's the public IP, which is what I want to show there.
And then again we have the standard struct of our EC2Instance, which contains the type metadata, the object metadata, and our spec and status, plus the list type for when you get a list of instances. This is how Kubernetes knows what an EC2Instance looks like for you.
Now, I have already made the changes, and I told you: whenever you make changes, you have to run the make manifests command and then install the result to your Kubernetes cluster. So my Kubernetes cluster already has this custom resource definition. If I do kubectl get crd ec2instances.compute.cloud.com -o yaml and look at it, you can see the name, the list kind, the plural, the singular. It is a namespace-scoped resource.
And there you can see I have a couple of things, such as the kind, and there is my spec: the AMI ID; the associate-public-IP boolean; the availability zone I want to run my instances in; things like my security groups, which is an array type because you can give multiple security groups; and then my storage configuration, where I give one root volume plus additional volumes, which is an array of objects, so you can give multiple additional volumes, but you can only have one root block device.
Now, once we have defined our spec properly, there is going to be some code that actually uses this and creates an EC2 instance. So let's see that. Once I have my instance type, I can go to my EC2 controller, and this is where everything starts. This is where you will see the reconcile loop. We saw this before: we use the reconciler to see what happens when I get a request, and the "your to-do list" comment is where my logic starts. I have created a context-aware logger for this context, and you can use l.Info, which is going to print messages while your operator is running. It makes things more verbose, and you can see what is happening with your controller. It prints an info message that the reconcile loop has started, along with the namespace from which you got a request. So req is the request that comes to your reconciler, and then you send a result back to Kubernetes. The request came from this namespace, and the name of the request was req.Name; that's the name of the object that we are working with.
Then there are some comments; while I was building this, I put comments in for us to understand it easily. But you know what we are doing: we are creating a variable of type EC2Instance so that we can unmarshal the object that comes to us in this reconcile loop from Kubernetes into a variable, and then we can easily work on top of that. We get the object into our EC2 instance variable from this namespace, and if you could not get the object — and this is absolutely very important — there may be any number of reasons. Maybe you have a wrong YAML. Maybe you were supposed to give a string but gave a boolean for one of the keys. Or maybe you did kubectl delete on the object. That's right: even deleting the object is an update to the custom resource.
Then this reconciler is going to be started again, and you have to check whether the error you got while trying to get the object was "is not found". This is one of the errors from Kubernetes: Kubernetes has a package called errors, and let me show you here — you can see it has all these errors defined for you, which makes it easier to declare what the error was in my case. See, sometimes when you create an object, it tells you the object already exists. That's an error, but you can actually see what kind of error it was. If I were just saying "get me the object, and if the error is not nil, say okay, there was an error, please try again", the user would never know what the error was. In this case, I can say: "Hey, I was trying to get your object into my variable, but I got an error while trying to get it, and the error was IsNotFound." IsNotFound returns true if the condition was that the object could not be found. This is probably when you are deleting the object, and the reconciler runs again. It looks something like this: you have the object here.
Whatever change you make on this object, the reconciler will be running again. The change could be that you added an annotation — that's an update, and the reconciler runs. The change could be that you added a label on top of the object — again the reconciler would run. It is your responsibility to write this reconciler in a way that, if the change you made to the object doesn't require a change, it should not be changing the actual resources. For example, your object could be an EC2 instance, and maybe you just want to give this Kubernetes object a label. That doesn't mean you have to change something external on the Amazon instance — that should not happen. So this is something you have to code into your reconciler.
Even when you say kubectl delete — when you say delete, the object is deleted, there was a change on the object, and then another run of the reconciler happens. So you have to check: when you were trying to get your object, you could not get it, there was an error, and the error was actually IsNotFound. Then you will simply say the object does not exist, or there's no need to reconcile because the object was deleted, and you just return an empty result and a nil error. Remember, this is one of the return combinations you have to use. What you're telling Kubernetes is: everything is fine, there was no error from my side, because the object does not exist in our case; please wait for the new requests coming to the reconciler — this one request that came in is all good. If you could not get the object for any reason other than IsNotFound — maybe you did not have the right RBAC to get that object in that namespace — for whatever reason you could not get the object of the request, you will then return an error. And you see, the moment you return an error, you are telling Kubernetes: please retry this object, please retry running this reconciler, please retry the whole reconcile loop. And this is where the self-healing comes in. Maybe you had some problem where you could not get the object, but you try again, and if it works, then you're happy because you don't have any errors anymore; otherwise you return an error again, and this goes ahead with exponential backoff. So when the first request comes in and there was an error, but it was not an IsNotFound error, you return an error. It goes back to the reconciler, runs again, there was another error which was returned, it goes back to the reconciler, and this happens with exponential backoff. This is where the return values of the Reconcile function are absolutely critical. Absolutely critical.
Now, the next thing: whenever you delete an object, a deletion timestamp gets set, and I will talk to you about this later; we are not going to cover it right now. That part is for when you delete the object — we will first learn how to create one, and then we will delete one. This block also is about deletion; it's the deletion logic, and I'll tell you later why it is here. This is the logic for checking whether the instance is already there, because you want to be idempotent: you don't want to create the same instance with the same instance ID if it already exists. And this is also logic which I will talk to you about later.
So here is where we start in our loop. The first thing you do is start your reconciler. You create an object, you try to get the object into your EC2 instance variable — the custom type that you have created — and then you say: okay, I'm starting completely fresh, I have no ID, I have no instances on my Amazon, and I'm going to create a new instance.
The first thing you do when you create an instance — or really when you create an object — is that it's a very good idea to add a finalizer. You might have seen this in Kubernetes: when you do kubectl get -o yaml, you might see this finalizer in the metadata section of your object. What this finalizer actually does is very interesting. So let's say you created this object in Kubernetes, which was an EC2 instance, and there you added a finalizer. A finalizer is nothing but an entry in a list of strings under the metadata; let's say I add a finalizer called hello. Then this object was created, and your reconciler actually went to Amazon and gave you a new AWS instance.
All right, everything is happy. You got the instance. Now, the thing that happens with a finalizer is when you say "I want to delete this object". When you say "I don't need this instance anymore, I want to delete this particular object", you can delete it from Kubernetes; however, that's not the only place you need to delete it from. You also have to delete it from Amazon. So how do you tell Kubernetes: wait, I am deleting something on an external resource, on an external platform; do not delete this object from Kubernetes; only when this instance is completely gone from Amazon can you delete this particular object? That is the role of a finalizer. Finalizers will hold the deletion of the object in Kubernetes until the actual cleanup has happened. And once you have cleaned up the external resource, you then remove the finalizer, and then Kubernetes will allow the deletion of this particular EC2 instance Kubernetes object that we have created. As a good practice, the moment you create the object in Kubernetes is when you should add the finalizer.
And this is extremely important: this finalizer is being added to your Kubernetes object, which is the EC2 instance, and this too is an update, so it is also going to rerun the reconcile loop. Any update to the object — whether you are adding a label, adding metadata, adding anything really, or updating the status of the object in your code — will start a new reconcile loop. So this is where you have to be very careful about idempotency in your code. And you see what happens: we print a message saying "I'm about to add a finalizer", and I use this append function, because it's just an entry I'm adding to my EC2 instance's finalizers. Because I've already unmarshaled this using r.Get, my EC2 instance variable actually holds the YAML of the request that was given to me, and I'm appending my finalizer, called ec2instance.compute.example.com, to the finalizers there. I'll show you how it looks: it just creates a new key under the metadata of your object and then adds this as an entry in the list, because you can have multiple finalizers on your object.
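The add and remove operations on that finalizer list can be sketched with plain string slices. This is a stdlib-only illustration of the idempotent pattern, not the course's actual helper functions (controller-runtime ships similar helpers in its controllerutil package):

```go
package main

import (
	"fmt"
	"slices"
)

// The finalizer name used in the course.
const finalizer = "ec2instance.compute.example.com"

// addFinalizer appends the finalizer only if it is not already present,
// so a rerun of the reconciler does not keep growing the list (and does
// not trigger yet another update/reconcile).
func addFinalizer(finalizers []string) []string {
	if slices.Contains(finalizers, finalizer) {
		return finalizers // already there: nothing to do
	}
	return append(finalizers, finalizer)
}

// removeFinalizer drops the finalizer after the external cleanup is done,
// which is what lets Kubernetes finish deleting the object.
func removeFinalizer(finalizers []string) []string {
	return slices.DeleteFunc(slices.Clone(finalizers), func(f string) bool {
		return f == finalizer
	})
}

func main() {
	f := addFinalizer(nil)
	f = addFinalizer(f) // idempotent: still only one entry
	fmt.Println(f)
	fmt.Println(removeFinalizer(f))
}
```

Note the guard in addFinalizer: checking for presence before appending is exactly the kind of idempotency the reconciler needs, since it will run again after its own update.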
Once you have declared that you want to add a finalizer, the actual way of updating your object is r.Update. There are a couple of functions we get with the reconciler: Get lets us pull the YAML of the object into our variable, and then you can also use r.Update, which updates the object the reconciler is working on right now. So in this case I'm updating my EC2 instance, where I'm adding some finalizers, and we will see this when you create the instance: it will show you the finalizers as soon as the instance is created. And because you made an update on the object, it will start a reconciler again — not right now, though. This is very important to remember. Let's say the reconciler is running and here you made an update to the object, maybe you updated the annotations. That schedules another reconcile, but not immediately: you move ahead in your code, and maybe you make another update — in this case you updated the labels of your object. Kubernetes also records this as a second time it has to run.
Then you do a return with a nil error. What's going to happen is Kubernetes will run this reconciler twice, because you made updates to the object twice. It's as if it remembers: here was an update, I have to rerun the reconciler; here was another update, I have to rerun the reconciler. It is a golden rule of reconcilers that any update to the custom resource — whether it was done by you with kubectl commands or by your reconciler itself — will start another reconcile loop. It will not stop the current reconciler; it's not like it gets an update and jumps away from where it is. It will finish the current execution, and then, based on how many updates you made, the reconcile loop runs again. And it is your responsibility to make sure that when it runs again, that update does not happen again, that label is not applied again, because they're already there. Then you can say: okay, I did not make any changes throughout my reconciler on this object, I need to do nothing, I don't need to start the reconciler again for this particular custom resource. If you make a new custom resource, then yes, the reconciler will be started again, and the loop keeps on running.
Extremely extremely important to know
about the return types of the
reconciler.
Now, once I have updated my object, I'm telling Kubernetes: please add this finalizer to my object. And if I got an error — actually, if I got any error — I print an error that says "failed to add finalizer", I say please requeue, and I return an error. So this is where you are returning another return combination. And you see, whenever you get an error — whether you are trying to get the object, delete the object, or update the object — you want to retry, and that is when you return an error. And whenever you return an error, the whole result is completely ignored. It's absolutely important to understand this: you see here that this Reconcile function returns two values, one is the result and the second is an error. If you return an error, the result is completely forgotten. Kubernetes says: you know what, the reconciler had an error, I'm going to retry that again with exponential backoff. And this is how the self-healing works in Kubernetes.
Now, once you have added the finalizer — and this prints an info log which says the finalizer was added; that schedules a new reconcile loop execution, but the current one continues — this is where we create an EC2 instance. I'm just printing a log: continuing with the EC2 instance creation in the current reconciler. And this is the beauty, what we were waiting for; this is what happens when we want to write an operator that talks to something outside our Kubernetes cluster. Guys, this is where it all comes down to. We have our spec of the custom resource. We have the logic that listens on updates to our custom resource. We have the logic to get the manifest — think of it as getting the YAML of what the user has given — into my EC2 instance variable. Now I need to create an EC2 instance. This is absolutely important.
Now it's going to be so much fun. When you want to create an Amazon instance, you know what you need: you absolutely need an Amazon account, and you need the credentials. You need to give the credentials to your operator or controller so it can go on your behalf and work on Amazon. And this is exactly what we will be doing. Before we can actually go ahead and create an instance, we need to figure out the authentication. Then we will use a client created with this authentication, give it this particular YAML, and ask it to go ahead and give me an instance on Amazon. And this is exactly what's happening now in this createEC2Instance function. So I've created a function called createEC2Instance, and I pass it the user's requested YAML — the user's manifest, the EC2 instance — and let's see what this function actually does.
This function, createEC2Instance, accepts a value of type EC2Instance, which is the whole YAML from the user, and it returns two things. First, it returns a value of type createdInstanceInfo. See, when you create an EC2 instance, you get a lot of output, a lot of data, but we don't want all of that. When a user creates this instance, they probably care about the state — whether it is running or not. They care about whether it was created or not, true or false. They also care about the public IP: is there one or not? For this information I have created a new struct, createdInstanceInfo, and it helps me send back the data from my create-instance function. What I send back is an instance ID, which is important so people can see which instance IDs exist on Amazon using kubectl get; I also send a public IP, and I can send a private IP, a public DNS, a private DNS, and the state. That is all I send from my function.
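That trimming step — keeping only the fields users care about out of everything AWS returns — can be sketched like this. The field names are assumptions modeled on the narration, and mockAWSInstance stands in for the much larger Instance struct the AWS SDK actually returns:

```go
package main

import "fmt"

// createdInstanceInfo is a sketch of the small result struct described in
// the course: just what a user of `kubectl get` would want to see.
type createdInstanceInfo struct {
	InstanceID string
	PublicIP   string
	PrivateIP  string
	PublicDNS  string
	PrivateDNS string
	State      string
}

// mockAWSInstance stands in for the AWS SDK's Instance type; the real
// code would read these fields from *ec2.RunInstancesOutput, which
// carries dozens of other fields we deliberately ignore.
type mockAWSInstance struct {
	InstanceID string
	PrivateIP  string
	State      string
}

// summarize copies over only the fields worth reporting in the status.
func summarize(in mockAWSInstance) createdInstanceInfo {
	return createdInstanceInfo{
		InstanceID: in.InstanceID,
		PrivateIP:  in.PrivateIP,
		State:      in.State,
	}
}

func main() {
	info := summarize(mockAWSInstance{InstanceID: "i-0abc", PrivateIP: "10.0.0.5", State: "pending"})
	fmt.Printf("%+v\n", info)
}
```

This is also the struct whose fields end up surfaced in the object's status, which is what makes kubectl get ec2instance useful.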
And that is the return type. My function which creates the EC2 instance returns two things: first, the struct with the information about the created instance, and second, an error, with which I can tell the user that I tried to create the instance but something went wrong — maybe the authentication was a problem, maybe you don't have enough quota in your region or your account. I want to send them something so they are aware of what really happened, why the request to create the instance failed. So we create a logger
named createEC2Instance. This is good: you can have your logs with a custom log name, and it's easier to know which file and which function created a particular log entry when you do kubectl logs on your operator.
Then I'm printing an info message saying I'm starting the EC2 instance creation: this is going to be the AMI ID I'm going to use, taken from what the user has given in the spec; this is the instance type; and this is the region in which I'm going to create my instance. So the first thing I have to do is create an EC2 client, guys. This has nothing to do with Kubernetes; it is purely how you create an Amazon instance in Go. You already have the instance YAML — or rather the instance info, because the EC2Instance object describes the whole instance that should be created — and now you are doing the generic work of creating instances on Amazon. The first thing you do is create the EC2 client with this AWS client function. What it is essentially doing is reading the AWS access key and the secret key from your OS environment variables, because you need to give some sort of authentication to talk to Amazon. I'm using the access key and the secret key, and then I'm using the config package from Amazon to load the default config with these credentials. And if I could not create the config, I return an error.
Otherwise, if you have the access key and the secret key, you are able to create a config, and then you return a new client. Think of this function as just returning an EC2 client, and this is absolutely important: you will use this client to talk to Amazon. So you use your access key and your secret key to create an EC2 client, and up to this point you have the authentication to Amazon.
Now you need to say: hey Amazon, please create me an instance with this key, in this subnet; this is the minimum count, this is the maximum number of such instances I want, this is the instance type I want you to create, and this is the image AMI ID I want to use. These are the instance input parameters. You are creating an instance, and Amazon expects you to give certain instance inputs; these are some of them. There are many other instance inputs you can give. If I show you: you can tell Amazon the maximum count of instances you want, the minimum count you need, any block device mappings you have, the capacity reservation specification, the CPU options, whether it is a dry run or not, whether it is EBS-optimized or not. So there are many, many different options you can give when creating an Amazon instance; this is simply the information you supply when you create one. You can also give security group IDs, in case you want those security groups to be used for this instance. I'm keeping it very simple so that we know what's really happening: we are building our request to create an EC2 instance with these inputs. And once I have my input declared, I'm using my EC2 client, which I created above, and there's a function called RunInstances. This is the function from the AWS Go SDK that launches the specified number of instances using an AMI for which you have permissions, and this is the call that actually creates the instance for you. So far you created the client, you created the instance input, and now you have created your actual instance here. Now, this RunInstances call returns two things. First, it returns the actual output.
See, I told you: when you create the instance, you get a lot of output, and this is what's going to be returned. If you look at the RunInstancesOutput type in the Go SDK, you can see what is returned to you: you get the instances that were created, plus the metadata. Within the instances you have the instance metadata — the private IP, the instance ID that was created for you, the region and the zone it is running in. Think of that as the metadata of your instance at creation time, and that's what we are saving in the result. If there was any error — because RunInstances does return an error as well — you will say "failed to create the EC2 instance", and then you return the error back to the caller and say: this was the actual error because of which I could not create the instance. There could be many reasons why you could not create one: perhaps you did not have the permissions in that region, perhaps you did not have quota in that region, perhaps you used a wrong AMI ID which doesn't exist — it could be a typo or anything. You just return this to the user; it's a good thing to give them the reason why it failed.
So that's what you're checking: if the number of instances returned is zero, you just say there were no instances returned. And if we have no error up to here, we have an instance for ourselves. This is what has happened so far in our code: we had the client, then we used the RunInstances function to actually create an instance, and it gave us some output back.
At this point the output contains things like the state the instance was in when it was created, the region, the metadata, the private IP, the private DNS name. By this time there is no public IP. However, there's one important thing: when you make an API call to Amazon with this RunInstances function, what you essentially get back in the output is the state of the instance at the time AWS received that request. It might not be running. You know how, when you create an EC2 instance, it goes into pending and initializing before it eventually reaches the running state. At this time you have an instance created for you, but it might not be running yet; it might take some time. And this is what you want: you want to wait until the VM is running. So when you execute the RunInstances function, it creates the instance and gives you back the metadata. What it does not give you, however, is the public IP, or confirmation that the state is running. It's as if you say: hey Amazon, create me an instance; Amazon says: cool, I'll give you one, here's some metadata — but you don't get the public IP, and the instance might not be in a running state when you created it.
But this is what you check in your result: the Instances field gives you back the list of instances, and you are checking that there were no errors and that an actual instance was created. Then you just log an info message that says: okay, I was able to create the instance successfully, and this is the instance ID that was returned to me. You store the output that was given to you, and there you can access things like the instance ID, private IP, and public IP, because it is all returned by Amazon to you.
And now we wait for the instance to be running. See, it's good that you created the instance, but imagine this: there's a developer, and he comes to you and says, "Can you give me an EC2 instance?" You go to Amazon and say, "Please create me the instance," and you get back the private IP. In your company, you are using a bastion host, and through this bastion you can talk to the VM on its private IP — because you might not have a public IP; you might have disabled public IPs. So essentially what happened: the developer asked for a VM, you said "hey Amazon, create me a VM", you got the private IP, and you gave it to him. You never waited to see whether the instance actually reached the running state. Maybe the instance was created but never reached running; maybe there were problems in that Amazon region, or maybe the instance malfunctioned. Whatever happened, you were not waiting for the instance to be running. You gave it to him, and he logs in to the bastion only to find out that this instance is not running, so he or she cannot use it. And that's where the problem is. As an operator, it is your responsibility that when you create that particular resource — the instance — you wait for it to reach the state that you want; in our case, running. So what I could have done is have a for loop that keeps polling Amazon: hey, is this instance running now? Is it running now? Maybe every 5 seconds. That's also doable.
Think of it as a while-true loop: check the instance, and that's it. The function keeps running; I give it an input to describe the instance, I describe the instance with that call, and I get a response back. If the state name is running — think of this as pseudocode — then you break; otherwise you keep polling, say every 5 or 10 seconds. That is a doable option, but it's not a good idea, because Amazon gives you waiters that can do this for you more gracefully.
A waiter is nothing but a construct from the Go package of the AWS SDK that waits, up to a certain time, for the instance to reach a certain state. In my case, there's NewInstanceRunningWaiter. If I go to that and show you, there is a NewInstanceRunningWaiter. What it does is define a waiter for an instance: this one has the logic to wait for the instance to be in the running state, and it does the polling more efficiently than me writing that logic in my own code. So you can define a waiter that waits for the instance to reach the running state. You can also give the maximum time you want to wait, because if Amazon takes forever to create your instance, you don't want the checking loop to run forever; you have to give some feedback to the user. Typically you give the max wait time, which is going to be three times time.Minute. So you're giving three minutes for the instance to reach the running state; depending on your requirements, you can increase or decrease this. With every request it makes, the waiter backs off exponentially: it starts at something like every 10 seconds and then increases the interval, up to your given time. It does this a lot better.
You create a waiter and then you use the Wait function to ask it to wait on this instance ID. You are telling the waiter: please wait for this instance, up to this maximum time, to reach the running state. If by that time it has not reached the running state, there will be an error, and if the error is not nil at this point, you just say: failed to wait for the instance to be running; the instance could not reach the running state in 3 minutes. Now, this is important: you do say that the maximum time you want to wait is 3 minutes. However, if the instance reaches the running state in the first 30 seconds, the wait stops. It's not going to sit there for a dedicated 3 minutes even though the instance reached the running state earlier; it doesn't work like that. This is why waiters are quite interesting: they have the logic for all of this, so you don't have to deal with it. This kind of thing comes with experience using the SDK, and it's also something you can Google: how do I make my code more efficient, how do I use waiters — and you will find it.
Now, by the time the instance has been created, we get the remaining information back. We got the state, because we were waiting for it to be running, and we only stop waiting when the running state is reached within 3 minutes. Actually — okay, let me correct myself: at this point we are just waiting for the instance to be running; we don't have the public IP yet. This is where I skipped ahead. When you are using a waiter, you were only waiting for the instance to be running, and once the instance is running, Amazon will give you a public IP. So you have it running, but you don't have the public IP yet, because it was not given to you in the output when you created the instance. This is where you use another Amazon function: DescribeInstances.
Now you say: okay, I created the instance, I waited for it to be in the running state, and now I'm describing this particular instance, so now I'm going to get my public IP as well, provided, of course, that you have public IPs enabled in your Amazon account. This is where another request goes to Amazon. So we waited for the instance to be running within 3 minutes, and then I'm calling the Amazon describe instance API to give me the instance
details. I tell Amazon that I want to describe the instance whose ID I got back when the instance was created, and I store the result of the describe call in a describeResult variable. If I could not describe the instance, again with a very simple Go error check, I say I failed to describe the instance, for whatever reason, and return the error. If I could describe it, the result is in describeResult, which is of type DescribeInstancesOutput.
Now when you describe the instance, you get a struct back from Amazon. You get the data in a specific struct, which we can see here.
You do get some output from DescribeInstances, and this is the DescribeInstancesOutput. What you get is information about your reservations, which represent the instances on Amazon, and within these reservations you have the instance information. So if you look at a reservation there, you have the instances that were described for you.
So I can index into the reservations, because I know I only created one instance; the list of reservations is only going to have one element. And for these instances, which is only one, the public DNS name is accessed like this. So I print the public IP, and then I say the state is State.Name.
Again, this is returned by the Instance struct of the Go library, because if I show you here, if I go to Instance, it has a public DNS field. There you go: the Instance struct returns a public DNS, it returns a public IP address, and it also returns the state. Here the state is of type InstanceState, which in turn has an InstanceStateName. So they have created types for each of these, and here you see you have the name. So
you have asked: I want to describe my instance, and the input is this instance ID. I store the result: all the reservations that were returned to me by Amazon and all the instances inside them. I know I only have one instance, so I can access it with index zero and read the public DNS name and the state of the VM.
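The nesting described above can be sketched with local stand-in types that mirror the shape of the SDK's DescribeInstancesOutput (reservations at the top, instances inside each reservation). These are illustrative types, not the real SDK structs:

```go
package main

import "fmt"

// Local stand-ins mirroring the nesting of DescribeInstancesOutput:
// reservations at the top, instances inside each reservation.
type instanceState struct{ Name string }

type instanceInfo struct {
	InstanceId      *string
	PublicDnsName   *string
	PublicIpAddress *string
	State           instanceState
}

type reservation struct{ Instances []instanceInfo }

type describeOutput struct{ Reservations []reservation }

func main() {
	id, dns, ip := "i-0abc123", "ec2-3-120-0-1.example.com", "3.120.0.1"
	out := describeOutput{Reservations: []reservation{{Instances: []instanceInfo{{
		InstanceId: &id, PublicDnsName: &dns, PublicIpAddress: &ip,
		State: instanceState{Name: "running"},
	}}}}}

	// We created exactly one instance, so index 0 of both lists is safe here.
	inst := out.Reservations[0].Instances[0]
	fmt.Println("public DNS:", *inst.PublicDnsName, "state:", inst.State.Name)
}
```

Note that the SDK fields are pointers, which is why the nil-safe dereferencing discussed later matters.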
Now here's the interesting thing. By now you have all the information you need for your EC2 instance: you've got the private IP, the public IP, the instance state, the name of the instance that was created, and the key name that was used. By this point you have all the details about your Amazon VM that need to be given back to the developer who asked for this instance.
The next thing you could do is fetch the instance information again, but that is not needed, because we already requested it. What we are doing instead is returning it back, and this is extremely important.
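The return shape being described, a pointer to an info struct plus an error, can be sketched like this. `createdInstanceInfo` and the stubbed values are illustrative stand-ins for the course's API types, with the actual AWS calls left out:

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for the course's CreatedInstanceInfo API struct: the data
// the create function hands back to the controller.
type createdInstanceInfo struct {
	InstanceID, PublicIP, PublicDNS, State string
}

// createEC2Instance returns either the info or an error, never both.
// The AWS calls are stubbed out; only the return shape matters here.
func createEC2Instance(fail bool) (*createdInstanceInfo, error) {
	if fail {
		return nil, errors.New("failed to create instance")
	}
	return &createdInstanceInfo{
		InstanceID: "i-0abc123",
		PublicIP:   "3.120.0.1",
		PublicDNS:  "ec2-3-120-0-1.example.com",
		State:      "running",
	}, nil
}

func main() {
	info, err := createEC2Instance(false)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", *info)
}
```

The caller checks the error first and only touches the info when the error is nil, which is the idiomatic Go pattern the controller relies on.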
See, when you have all the information about the instance that you created, we want it to be returned back to the actual controller, so that the controller has the instance information.
This function returns a type called CreatedInstanceInfo, and I just showed you CreatedInstanceInfo here. Where did that go in my API? This one. So here's the struct, CreatedInstanceInfo, and that is what my createEC2Instance function should be returning, and this is what I'm preparing now. I got all my instance information into a variable called instance from the describe result, and then I prepare my return value, because this function returns two things: first an error, if there was any, and second the CreatedInstanceInfo, which contains the public IP, the private IP, the public DNS, the private DNS, the state, and the instance ID. That is what I've prepared now. Once this is done, we simply say I have created my EC2 instance, and there I'm returning my return values: because I did not have any errors when creating this instance, I'm returning a nil, and I'm returning the information of my instance which was created. What might be interesting is this function called derefString.
What this does is dereference my pointer. The reason we are dereferencing the pointer is that when you talk to the Amazon SDK, it returns things like the public IP address as pointer types, and the value might not be available at that time, so you might be handed a nil pointer, and that's a problem. It is important that we can distinguish between an empty string and a nil value, because if it was indeed nil and you tried to dereference a nil pointer, that's going to be a problem.
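A nil-safe dereference helper along these lines might look like the following. `derefString` is our own name for the idea; the AWS SDK for Go v2 ships `aws.ToString` for the same purpose:

```go
package main

import "fmt"

// derefString safely dereferences a *string: a nil pointer becomes "",
// so callers never hit a nil-pointer panic. (The real SDK provides
// aws.ToString for this; derefString is our own illustrative name.)
func derefString(p *string) string {
	if p == nil {
		return ""
	}
	return *p
}

func main() {
	var missing *string // e.g. a public IP Amazon has not assigned yet
	ip := "3.120.0.1"
	fmt.Printf("%q %q\n", derefString(missing), derefString(&ip))
}
```

With this helper, a field that Amazon has not populated yet simply comes back as an empty string instead of crashing the reconciler.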
This is essentially why we waited so long for the instance to actually have a public IP. The dereference function just dereferences my string pointer to return a plain string, which I can give back to my main function. By this time I have an EC2 instance that was created for me. Now, the createEC2Instance function doesn't just create an instance on Amazon; it also returns two values. One is an error; it's a good idea for your function to return an error if there was any, or nil otherwise, so that you can use that error in the further steps. For example, we are
using it here to say if there was an
error, we want to put that error as a
log entry in our reconciler. So when people are looking at the logs of our application, which is the reconciler in this case, they will know why there was an error that stopped you from creating an EC2 instance. You can then also use this error as the reconciler's return value, because, as you remember, the reconciler is expected to return two things: first the result of the reconcile function, and second any error from that reconcile.
Now, it depends on how you are writing your reconciler. Maybe you want to retry creating that EC2 instance after waiting some time, and this is why you can return an error from the reconcile function. What you're essentially telling Kubernetes is: I tried to do an operation, which in my case was to create an EC2 instance, and I could not do it; whatever the problem was, I want you to take some time and retry that process, that function, again. And this is where Kubernetes will say: okay, I'm going to retry running that reconcile loop, so I'm going to retry creating that EC2 instance for you. This is done with an exponential backoff: it tries, it fails, it waits a little, it tries again; if it fails again, it waits a bit longer. This is how Kubernetes does its exponential backoff with your request, so you're not getting rate limited by Amazon by asking for an EC2 instance every 2 or 3 minutes or whatever your reconcile interval is; it waits in between, and it's an exponential backoff. Now once you
have the EC2 instance, once you were able to create it, I'm returning the info as well, which, if you remember, is a struct that we created probably somewhere around here.
This is the struct that we created, and this is the information I want from my createEC2Instance function, because this is something I want to give to my users when they do a kubectl get on the EC2 instance. They should know the instance ID and, most importantly, the public IP of the instance, so they can always log in and start working there. You would also probably want to give them the state of that instance: what is it right now on Amazon? Is it running? Is it stopped? Is it terminated? Any other state, you want to update them on as well. So we do a very small log saying that, okay, I was able to create the instance, and now I will update my status. If you remember,
every EC2 object that we create, every
EC2 instance that we create has a spec
and it also has a status field, much like any other Kubernetes object, and here you have the flexibility to define what the status should be. In our case, we are reporting the instance ID and the public IP.
In our case, we are reporting the state of this instance: whether it is running, stopped, terminated, or any of the other states an Amazon instance can have. And that is where we update the actual object. This EC2 instance here, if
you remember we actually create a
variable for it and we got the object
from the request that came into our
reconciler. So the user asks to do something with the EC2 instance custom resource. We got that YAML; think of it as getting the YAML from the user with the r.Get method and storing it into a variable. Now this EC2 instance has a spec, which you read and use to do your work. This is where the user gives the instance type they want, what they want for storage, what user data they want: that is the spec. We use the spec to do our operation; we use the spec of the EC2 instance to create the instance that the user is asking for.
And then the status is for us as a
Kubernetes developer to tell what
happened with this particular object.
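The spec/status split can be sketched with illustrative stand-in types for the course's API: the user-provided spec stays untouched, while the controller copies the create result into the status.

```go
package main

import "fmt"

// Illustrative stand-ins for the course's API types: users fill in Spec,
// the controller fills in Status after doing the work.
type ec2Spec struct {
	InstanceType string
	Region       string
}

type ec2Status struct {
	InstanceID string
	PublicIP   string
	State      string
}

type ec2Object struct {
	Spec   ec2Spec
	Status ec2Status
}

func main() {
	obj := ec2Object{Spec: ec2Spec{InstanceType: "t3.medium", Region: "eu-central-1"}}

	// Copy the create result into the status; the spec is never touched.
	obj.Status = ec2Status{InstanceID: "i-0abc123", PublicIP: "3.120.0.1", State: "running"}
	fmt.Printf("spec: %+v\nstatus: %+v\n", obj.Spec, obj.Status)
}
```

Keeping the two directions separate, users write spec, controllers write status, is the core contract of every Kubernetes resource.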
And the EC2 instance status is where we can tell what the instance ID is, the state, the public IP, the private IP, the public DNS, and the private DNS. This is all what we have defined in here. You can see our EC2 instance also has a status struct, because the way our actual EC2 instance looks is: it has the metadata for the object and the type, then it has a spec, and then it has a status, and that is what we will be updating now, because we already did an operation. Maybe it failed, maybe it was successful. If it failed, we handled it and asked Kubernetes to rerun the reconciler; but if it was successful, you want to update the status, and that's what we are doing. For the status, the instance ID and the state I'm actually getting from this function: the instance ID is given to me from the create function via the createdInstanceInfo variable, which holds an instance ID, and then we set up the state, the public IP, and so on. Everything on the right side of these assignments is given to me by that function, and I'm updating the status of my custom resource, which was picked up by the reconciler.
Now you have assigned the output of the function to the status of this EC2 instance variable.
However, this alone doesn't update the object in the cluster; you need to use a function called r.Status().Update, because you actually want to update the status. If you look here, the reconciler has got a couple of functions, because r is of type EC2InstanceReconciler. The first one is r.Get.
This lets you get the actual object which is coming to the reconciler; in our case, think of it as getting the YAML of the EC2 instance object which the user created. Then there is also r.Update. In this case you are doing an update on the EC2 instance object itself; we use this when adding the finalizer, where you update the object, the EC2 instance, with the finalizer. And then there is the status function, for when you want to update the status of the object. You are not updating the metadata here, you are not updating the spec; you are only updating the status. That's why we tell our reconciler that we want to work with the status of our object, and essentially we want to update the status with this information that we have just added here.
So to sum that up again: we use the createEC2Instance function, we get some information from it, and then we update our object's status with this information. If you were able to update it, everything is fine. But if there was indeed an error when you were trying to update the status of this object, you just say I could not update the object, and then you return an error, which will trigger the reconcile again and retry updating your status. If everything is fine, we reach the end of our loop and we just say: all done, nothing needs to be done, keep looking for object updates, and this reconciler run is finished. However, if you remember, I did tell you a couple of things. If you remember, any update that you do to the object, in our
case, the object that the reconciler is looking for is an EC2 instance. If you update it or a user updates it with kubectl edit, it does not matter; if there is any update at all to this object,
there's going to be another run of the reconciler. It's absolutely important for us to understand this. So the way your reconciler works right now is: first, it gets the object.
Second, it tries to create an instance.
It then updates the finalizer
on the object.
Then if the instance creation was okay,
it goes ahead and it updates the status
of the custom resource. And then we
reach the end of the loop.
The problem however is you updated the
object here. So there was a change on
the object here at this place because
when you update or even if you want to
add a finalizer, maybe you add a label
to your object, maybe you add an
annotation to the object, it does not
matter. Kubernetes does not
differentiate what kind of change you
did on the object. It says: okay, the reconciler looks for an update, and you did an update here. Also, when you were able to create the instance, you got some instance information back, and that is where the status was updated. So there was an update here as well.
The way our reconciler works is it starts from here. It sees that right now the instance the user is asking for is not there, because it's a new instance. Then it categorizes this as a new instance, because this object, and when I say object I mean the EC2 instance object, does not exist in my Kubernetes etcd. It's a new object. So it creates you an instance, and then you add the finalizer, and this is where you update the status. When you have updated the status, you actually put the instance ID in the status, and you can only put the instance ID there if you have one, and you will only have the instance ID when the instance was created for you. So think of what's happening here.
You created the instance, and then you are updating your status with an instance ID, and this will trigger a new reconcile event, because, and I'm telling you again and again, every time you make an update to the EC2 instance object, it does not matter whether you update a label, whether you update things in the metadata, in the spec, or in the status, the reconciler will start again. This is where it is your responsibility to make sure your reconciler is idempotent, because what's going to happen when you reach the end of the loop is that it's not just going to wait for a new EC2 instance object.
Kubernetes remembers it: while it is running through your reconciler, it marks that this particular operation, updating the finalizer, was an update, so it will rerun the reconciler. It doesn't just stop the current execution; the current one goes ahead. Think of this like a handler in Ansible, if you know about that: it says, this particular step asked me to update my object, the EC2 instance, so I will run the reconciler once again. Then it goes on to the fourth step, and here as well you update the status of the custom resource. The same thing happens: it says, okay, this operation is also updating my custom resource, so I will rerun the reconciler again. So in this case, after your first execution, your reconciler will start again, and this will happen two times, because you have updated the object here and here: two times.
When you update the object, the current execution does not stop. It's not like you reach step three and it starts over; it doesn't happen that way. You will run through the entire reconciler. Kubernetes keeps noting which operations updated the custom resource, and for however many times it was updated, the reconciler will run again. And now it is your responsibility to make it idempotent. Imagine, guys: you
created this instance, you got the
instance ID and when you run it again,
you create one more instance and you update the custom resource; then again you will create a new instance, update the finalizer, update the status, and so on, and you will keep creating instances. This is effectively a forever loop.
And the reason it's a forever loop is that the reconciler has no state. It does not remember that in the last run it already created the instance; it has no memory of what happened previously. So
it is your responsibility that, once I have executed through my reconciler, when I updated the object here, let's say, and I also updated the object here, then when the request comes again to my reconciler, I should be checking the current state: you have to check the current state and check whether it is meeting the desired state. In our case, for the EC2 instance, we have to check that. See, here we update the status, and in the status there is an instance ID.
When you make a new EC2 instance object, it will not have an instance ID, because it is brand new. But once it has run through the reconciler, you create an instance, you add the finalizer, and then you update the custom resource object's status, which now has an instance ID. That is what you can use in your reconciler: you check whether the object in the request coming in has an instance ID in its status. If there is an instance ID, I have already worked with this instance before; I do not need to create a new one, because this object already has an instance ID. Then you work with that instance: see if it is running or stopped, you know, do the drift detection. But at least a new instance doesn't need to be created. And this is essentially what's happening in our loop once we make a couple of updates; in this case, we are updating the status and also updating our finalizer.
Here we reach the end, so we say: okay, I'm done with the reconciler. I would have waited for a new object, but because in the reconciler I did update my status, I'm going to go to the very top of my reconciler and run it again. It's going to start from the very beginning, two times, because your reconciler updates the object twice. This is absolutely important; without this, you would be creating a reconciler that keeps on working and doesn't really stop, or doesn't really know what to do. So to understand this a bit better, let's see how your reconciler can go into a loop and do the same things again and again, and how you can stop that in our controller. So this is kind
of like the request that you give.
Imagine this is your EC2 instance
request that you are giving. You give
things such as your kind. Oh, wait a
second. So you give your kind here. You
define your object's metadata and what
you are also defining is the spec.
That's what you want the instance to be created as. And right now there would be a status field, but it is actually empty, because you are creating this object in Kubernetes for the first time; it will not have any status. It will only have a status once the reconciler has run through its logic, since the reconciler is what updates the status.
So you feed in your object information when you do a kubectl create -f on this object. That is then sent to the reconciler, and the reconciler logic says: I will be creating an instance now. Imagine what's happening here: it goes to Amazon, Amazon creates you a VM, the VM is running, and we get back the public IP and the state of the instance, which we care most about being running; that's why we have a waiter that waits until this VM is running. This is the information we get back from Amazon; it's a very simple description of what we are doing in our code. Once we get this, and once we were able to create the instance, what we then do is update this particular object; in our case, we are updating it for the finalizer.
So we update the object to add the
finalizer here and then we update the
status of that particular object.
Eventually the object will be exiting your reconciler. This whole thing here is kind of like the reconciler; this is your reconciler logic. It makes sense for me to increase the line thickness so you can see it: this is the actual reconciler that's happening. Let me increase the size a little bit there. So the reconciler creates an instance, it updates the object, and this is the output of your reconciler, apart from the Amazon VM that has been created on Amazon: you get your spec back,
and also one interesting thing is this
bit. Now your object has a status
because we updated on the status. It
will also have a finalizer which I have
not added there because I want to keep
it simple. But we have a status now. And
because you updated the object here, and this is very important, you will be passing the same object back to the reconciler.
And then what's going to happen is you
will be creating the instance. And then
what's going to happen? You will be
updating the object. And then what's
going to happen? This is how you will be
reaching a forever running loop which is
then going to be problematic because the
reconciler has no idea that it has already created the object, that it has already created the VM on Amazon; it has no correlation between what it did and what remains to be done, because it has no state. So essentially what you
would be looking for is: once you have made changes to your object, which in our case means this status has been added, I will change my logic a little bit. What I say is: create the instance only if object.Status.InstanceID is blank. If it's empty, it's a new object that the user has created; and when I say object, I mean this particular YAML. If it does not have a status, or at least does not have an instance ID, that means it has never run through the reconciler, and the reconciler has never created an EC2 instance for it; that is when you should create a new EC2 instance. Otherwise, if it is not blank, just skip; do not make any changes whatsoever to the object, because as soon as you make a change to the object, the loop starts again. Any change your reconciler makes to the Kubernetes object will trigger a new execution, a new loop. And that's exactly the sort of idempotency implemented by this particular function. See, we already get the object coming into the reconciler using r.Get; we get the YAML, the spec, and the status of my EC2 instance in this variable, and then I'm checking whether the status field is populated and the instance ID is not blank. You remember now, right? It
will be blank if it was a new object, and for a new object I need to create a new instance. But if the object is not new, it will have a status, it will have an instance ID, and then I'm saying: if the instance ID is not empty, we simply log that the requested object already exists in Kubernetes and we are not creating a new instance, because I have already created an instance on Amazon; that's why you have the instance ID. Nothing more is needed; I simply return a nil and wait for a new update on this EC2 instance object. Nothing to be done. This bit makes our code idempotent.
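The idempotency guard boils down to one check on the status. A minimal stand-alone sketch, with `needsCreate` and `ec2Status` as our own illustrative names:

```go
package main

import "fmt"

// Minimal stand-in for the custom resource's status subresource.
type ec2Status struct{ InstanceID string }

// needsCreate is the idempotency guard: a blank InstanceID means the
// reconciler has never created an instance for this object; a populated
// one means the work is already done and we must not create another VM.
func needsCreate(s ec2Status) bool {
	return s.InstanceID == ""
}

func main() {
	fresh := ec2Status{}                       // brand-new object, no status yet
	seen := ec2Status{InstanceID: "i-0abc123"} // already reconciled once
	fmt.Println(needsCreate(fresh), needsCreate(seen)) // true false
}
```

Because the decision is derived from the object itself rather than from anything the reconciler remembers, the guard works no matter how many times the loop reruns.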
Now here you can be a little fancier if you want; you can complicate things a bit and introduce drift detection. The thing is, imagine you did create the EC2 instance on Amazon. So here is AWS.
You did create the instance and it was
in a running state. You get that back
and you update your status of the object
here. And when somebody will do cubectl
get EC2 instance, they will see the
instance ID, they will see the public IP
and then they will see the status as
well which is running that matches your
Amazon instance. The problem is if you
go outside and you stop your instance,
if you stop your instance, it does not
update the status of your object because
Kubernetes does not know what you did to
your instance outside.
It just doesn't know that. So it could be a feature in your software where you say: if this instance ID is not empty, that means I do have an instance on Amazon; it might be running, it might be stopped, it might be in some other state, I don't know, but I have it. Then I will go to Amazon and check whether it is indeed running or not. So a request goes to the reconciler, and you say: the instance ID is not empty, so I will not make a new instance, but I will go to Amazon and see, for this instance ID, what the state is. You go there, you find it is stopped, and you then update your state from running to stopped. So you can have drift detection: if the instance was stopped there, you also update your status here.
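A drift-detection step along these lines could be sketched like this, with the Amazon lookup stubbed out (in the real controller it would be a DescribeInstances call); `detectDrift` and the lookup are illustrative names:

```go
package main

import "fmt"

// detectDrift compares the state recorded in the object's status with
// the state the cloud reports for that instance ID, and returns the
// actual state plus whether they diverged. cloudState is a stand-in
// for a DescribeInstances call.
func detectDrift(statusState, id string, cloudState func(id string) string) (string, bool) {
	actual := cloudState(id)
	return actual, actual != statusState
}

func main() {
	// Hypothetical: Kubernetes still records "running", but someone
	// stopped the VM from the AWS console.
	lookup := func(id string) string { return "stopped" }
	state, drifted := detectDrift("running", "i-0abc123", lookup)
	fmt.Println(state, drifted) // stopped true
}
```

When drift is detected, the controller would write the actual state back into the status, just as it did after the initial create.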
This is kind of a fancy thing you can do. In my case, I'm keeping it simple: I'm saying the instance has already been operated on, it is there on Amazon and it's also in Kubernetes, so I will not do anything; I will not create a new instance. But you can add drift detection as well, where you take the instance ID, describe the instance, get the state, and update the state; we know how to update the state, we did it here. That is going to be your own little drift detection, and I think it's going to be interesting to build. So if you have followed the course till here, I would really encourage you guys to add this functionality as well, which I'm deliberately leaving out because I don't want to make things too complicated. And since the instance ID was not empty in our case, we already have an object, so I'm not creating a new instance; my program will just back off from here, and it will not create new instances.
So now I'm going to show you a bit of a demo of how all of this looks. This is where we will actually deploy this to our Kubernetes environment, and then let's try to create something and see it in action. I already have this running, but the first thing you can do is make manifests, and you know about this, we already did it when we were building the API: this creates the custom resource definition. Then you can do a make install; it
installs or creates those custom
resource definitions in your Kubernetes
cluster. I already have that, because if I do k explain ec2instances, you can see it knows about my resource, which is in compute.cloud.com; this is the group, this is the version, and these are the fields my EC2 instance can have. I can also do kubectl get ec2instance, and you can see it doesn't say "I don't know what this object is", because I did a make manifests. You can do these things together: make manifests and make install create the CRD and install it on the cluster.
And with what we have built so far, we can run our reconciler using go run cmd/main.go; I'm in the root of my project. Now you can see my reconciler is running. If you go through the logs a little bit, it starts the manager; the manager manages multiple controllers, and you can have more than one worker per controller. That is what you see here: we start the manager, we start the controllers, and then there are a couple of workers. I only have one, but you can read about multiple workers and create more than one if you have heavier workloads; for us, one worker will be enough.
Now to actually create an EC2 instance
on Amazon, let me just quickly open up
my AWS console and show you how would
this look like.
And right now I don't think I have any
instances which is running. That makes
sense because I didn't make any
instances.
Let it load... Okay, let me try that again. All right, that was probably my Tailscale acting a bit weird, but as you can see, right now I do not have any running instances.
Now what we can do: I have an object which looks like the spec you would expect. Here you can see I want to create an EC2 instance; this is the name, this is the namespace, and I've given my t3.medium instance type and the AMI ID that I want to use, which is the Amazon Linux 2 AMI for this region. The region is eu-central-1, plus the availability zone. I already have this key pair, this security group, and this subnet in my Amazon account, because you need these things before you can make a VM. And this is how my instance is going to be created. Now, because we already have our controller running, as soon as I say please create me this EC2 instance, you will see your logs acting up. Let me just do it here so you can see it better.
Now let me do a kubectl create. You see, automatically, as soon as you gave the create instruction, your reconcile loop started; that's the logic we wrote at the very beginning, and that's the log we are seeing: reconcile loop started. This is where you get the object: you create a variable of that type, you get the object, and then you check whether the instance ID is there or not, and whether it has a deletion timestamp. Nothing is there, because it's a new object, and then you will see "creating new instance" and "adding the finalizer"; everything we went through will be happening now. So let's go through the logs a little bit. You can see the reconcile loop was started, then I see the log that it's creating a new instance, and then you add the finalizer.
It's interesting to see this in the object: if you do k get ec2instance, you see the eventual output you're going to get, and there was an instance created on Amazon for me from my Kubernetes cluster. That is exactly what we were working towards: we are able to create an EC2 instance from our Kubernetes environment using the controller that you have just written, and you're able to get the information: the state is running, and the public IP is the same public IP that you see here, 35.159…, and it is the same instance ID. This time I got an instance ID because the VM was created. But what I'm more interested to show you is this thing, which is the finalizer.
You see we get the log on the left side:
it says "about to add finalizer". And
this is the update that we did to our
object; if you look at the code here,
this will make sense. You add the
finalizer, which is
ec2instances.computecloud.com,
and then you do an actual Update call on
the object, and that log is the result
of this update function.
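What the controller does here is essentially what controller-runtime's controllerutil.AddFinalizer helper does under the hood. A minimal self-contained sketch, with plain types standing in for the real Kubernetes ones (this is not the exact course code):

```go
package main

import "fmt"

// The finalizer is just a string stored in the object's metadata.
const finalizerName = "ec2instances.computecloud.com"

// ObjectMeta is a stand-in for metav1.ObjectMeta.
type ObjectMeta struct {
	Finalizers []string
}

func containsFinalizer(m *ObjectMeta, f string) bool {
	for _, s := range m.Finalizers {
		if s == f {
			return true
		}
	}
	return false
}

// addFinalizer appends f if it is not already present and reports
// whether the object changed (an Update call to the API server is
// only needed when it returns true).
func addFinalizer(m *ObjectMeta, f string) bool {
	if containsFinalizer(m, f) {
		return false
	}
	m.Finalizers = append(m.Finalizers, f)
	return true
}

func main() {
	meta := &ObjectMeta{}
	fmt.Println(addFinalizer(meta, finalizerName)) // true: added, update needed
	fmt.Println(addFinalizer(meta, finalizerName)) // false: already there
	fmt.Println(meta.Finalizers)
}
```

Because the second call is a no-op, re-running the reconciler does not pile up duplicate finalizers on the object.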
Then, once the object was created, let's
go forward and see how the logs went. We
say we are creating a new instance, we
are about to add a finalizer, we add the
finalizer, and it says this update will
trigger a new reconcile loop, but the
current one will continue. As I've
already told you, when we do an update,
it gets registered so that we will come
back and run the loop again, but we keep
going; we don't break the existing
execution right there. So you add a
finalizer to your Kubernetes object, and
then you continue with the EC2 instance
creation in the current reconcile run,
and that's where we actually create the
instance.
Once you make the call to Amazon to
create the instance, you can see here
that we call the RunInstances API. The
EC2 instance creation was successful,
and this is where, for the first time,
we get the instance ID; you will only
get the instance ID if the instance was
actually created. Once the instance is
running, we make another call to Amazon
to get the public IP, because you don't
get the public IP the moment you create
the VM; it takes a little time for it to
be populated. So we log "calling the
Amazon describe API to get the instance
details", and this is where we just
print the response; this was some
debugging I was doing with this code.
Here you can see we get the private IP,
172.31.25.250, which is the same as what
you see here; that is my output. Then
there is the public domain name, the
instance IP, the region, and the rest of
the metadata returned by the
DescribeInstances API. You can see the
name of the key pair that you used when
creating this instance, and here you now
have the public IP.
Very important: up to this point we have
made one update to our object from the
reconciler, which was adding the
finalizer. Now we update the status as
well; this is the status update you see
here. In the status update we set the
instance ID, the private DNS, the public
IP, the public DNS, and the state, which
is running. Out of these five things we
are only showing four: the public IP,
the state, the instance type, and the
instance ID. Now, it is absolutely
important to remember that we made
changes to our object, so the reconcile
loop will be starting again. And there
you can see that after you made the
changes to your status, the reconciler
started again. However, this time we saw
"requested object already exists in
Kubernetes, not creating a new
instance". This is where the idempotency
we introduced in our controller kicks
in. It is missing one log, however,
which is a bit misleading: you might
think that because we made updates to
our object twice, this should have run
twice, and that is a fair expectation. I
think it's missing a log, so let's look
at that again.
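The guard that makes that second run a no-op can be sketched like this; the type and field names are illustrative stand-ins for the course's generated API types, not the exact code:

```go
package main

import "fmt"

// EC2InstanceStatus is a stand-in for the CRD's status subresource.
type EC2InstanceStatus struct {
	InstanceID string
}

// reconcileAction is the idempotency check: if the status already
// carries an instance ID, the external EC2 instance exists and we
// must not create it again.
func reconcileAction(status EC2InstanceStatus) string {
	if status.InstanceID != "" {
		return "skip: requested object already exists"
	}
	return "create new instance"
}

func main() {
	fmt.Println(reconcileAction(EC2InstanceStatus{}))                         // first run: create
	fmt.Println(reconcileAction(EC2InstanceStatus{InstanceID: "i-0abc123"})) // later runs: skip
}
```

Without this check, every reconcile triggered by our own status updates would launch yet another VM.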
Now, what might look like a missing log
entry could contradict what I said:
whenever you update an object in
Kubernetes, the reconcile loop runs that
many times. We updated our custom
resource once with the finalizer and
once with the status, so the reconciler
should have run twice. But we only see
the log entry "reconcile loop started,
requested object already exists" once.
We should be seeing it twice, because we
updated the resource twice, but we
don't, and I think it will be a lot
clearer once we see the internals of how
the operator knows there was a change
and what really happens inside.

So let's say you are working on this
Kubernetes resource, our custom
resource, and you do an update, or you
add this resource; whatever you do, you
have triggered a change on this custom
resource. Now, for any change you make
to any resource in Kubernetes, the first
component that gets to know about it is
the API server. This is where you have
your authentication and your
authorization. Once you have gone
through authentication, authorization,
and also the admission controllers and
webhooks, the API server persists your
change into etcd. This is the point
where you have made a commit to etcd,
and etcd is your single source of truth;
that's where you have recorded your
desired state. Now, as soon as the API
server makes that update to etcd, it,
not really broadcasting, but you can
think of it that way, tells everyone:
hey, there was an update to this custom
resource of kind EC2Instance.
And there are many, many controllers in
Kubernetes responsible for different
resources. For example, there is a pod
controller responsible for changes to
pods, a deployment controller
responsible for changes to deployments,
and a service controller responsible for
services. What do they do? They only
react to the resources they have been
programmed for. In our case, we have a
controller that is only listening for
the EC2Instance type of resource. So the
API server tells everyone; think of it
as a broadcast, although it is not
really broadcasting. It sends an event
to anyone who is watching that custom
resource. The pod controller watches
pods, the deployment controller watches
deployments, and in our case, this
EC2Instance controller is watching the
custom resource of kind EC2Instance. So
this update is picked up by this
controller. Think of it as your
controller subscribing to the API
server: it tells the API server,
whenever there is a change to the kind
EC2Instance, tell me. The API server
registers that this controller is
watching and listening for it, and then,
whenever it emits such an event, the
EC2Instance controller that is watching
gets notified about it.
Now, if we zoom into this controller a
little bit, let me draw what's really
happening inside, because this will make
more sense. I just said the EC2Instance
controller gets the event, but what
really happens is that within this
controller, the component responsible
for doing the watch against the API
server is called an informer. Think of
an informer as a piece of software that
opens a long-lived stream to the API
server and always catches these updates:
okay, there was an update, I am now
notified about it. This is part of your
controller: your controller has an
informer, which opens a watch on the
custom resource you are interested in.
Now, as soon as the API server says
there is an update, it sends the event
as well as the object, and the watcher
consumes it. When I say watcher, it is
the informer: the informer gets the
actual update event and the actual
object, which is like the YAML of our
Kubernetes resource. Now this informer
has a couple of things to do. The first
thing it does is store the object that
came with this update event into a
cache. This cache is managed inside the
controller itself; you don't have to do
that. Kubebuilder has already
bootstrapped these things for you via
controller-runtime, whose packages
manage the caches for you. This cache is
where the whole object that arrived with
the update event is stored. The first
thing the informer does is always add it
to the cache.
Then the informer has a couple of things
called handlers, also known as event
handlers or resource event handlers.
Think of these as functions that run
when there is an add operation, an
update operation (maybe you updated your
resource), or a delete. These functions
don't do much; the only thing they do,
once the informer has stored the object
into the cache, is add that object's key
into the work queue, the infamous work
queue we have been talking about. Now,
what's really added to the work queue is
not the whole object of your custom
resource. The thing that is added is the
namespace, and within that the name of
your object, the metadata.name of the
EC2Instance kind. That's what's added.
It does not add the entire object; the
spec and the status are not added there.
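So the enqueued key is just the namespace and name joined together, much like the namespace/name keys client-go produces; a tiny sketch:

```go
package main

import "fmt"

// objectKey builds the key the event handler enqueues: only the
// namespace and name, never the spec or status.
func objectKey(namespace, name string) string {
	if namespace == "" {
		return name // cluster-scoped objects have no namespace
	}
	return namespace + "/" + name
}

func main() {
	fmt.Println(objectKey("default", "ec2")) // "default/ec2"
}
```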
Now, once your handler has added this
key to the queue, your controller has
workers, and here we have the reconcile
loop we have been working with, with a
worker running it. This worker keeps
looking at the work queue, and as soon
as there is a key in the work queue,
added by the resource event handler in
this case, the worker runs the reconcile
logic. And this, eventually, is how your
controller, your operator, gets to know
there was a change and that it has to
run its worker. Now, during our
reconcile run we updated our object two
times. One was for the finalizer.
So look at what's going to happen once
you update the object for the finalizer.
The same process begins from here,
because whether you update your custom
resource yourself or an application does
it, the API server does not
differentiate who made the change; all
it knows is that there was a change, and
that's all it cares about. Now, the
first time, when we added the finalizer
and updated the object, the same thing
happened: the update was sent to the API
server, it was stored in etcd, and then
your controller was notified, because it
has the same machinery. The informer was
there; the informer triggered a handler,
and the handler added this particular
key to the work queue. Let's say the
namespace is default, so the key is
default/ec2; that's my kind's
metadata.name, the name of my object in
Kubernetes.
That happened, and while your reconcile
loop was running, the second thing you
did was update the status: we updated
things like the instance ID, and then
the public IP. Whenever there's an
update, the same thing happens. The only
detail is that there is a single work
queue per controller; it's shared for
that controller.
So the handler adds the key here, and
the same thing happens again. Say you
update the status: this update to the
custom resource is seen by the API
server, which writes it to etcd; then
the informers, which are watching, first
update the cache locally, and then the
handler is called. Now, here is where
the explanation for the single log line
comes in. Your handler is responsible
for adding the namespace and the name of
the object to the work queue. But when
this handler wanted to add it, it said:
I want to add an object in the default
namespace whose name is ec2. The work
queue is smart: it says, an object with
the same name in the same namespace is
already in the queue, so I'm going to do
something called deduplication.
And this is such a powerful mechanism,
because without it you would be running
a reconciler storm. Imagine that while
running your reconciler you made changes
to your object 10 or 20 times; you do
not want to run the reconciler 10 or 20
times. Just one run, because the same
key is already in the work queue; it
will use the latest spec of that object
and be done with it. This deduplication
is handled by the workqueue package in
Go, which we get via controller-runtime.
We don't see it, but that's what is
happening in the background. So we do
not add the key again, or rather, you
can say we add it but it gets
deduplicated.
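The deduplication described above can be modeled in a few lines; this is a simplified stand-in for k8s.io/client-go/util/workqueue, not its real implementation (the real one also handles rate limiting and re-adds during processing):

```go
package main

import "fmt"

// workQueue is a toy deduplicating queue of object keys.
type workQueue struct {
	items []string
	dirty map[string]bool // keys currently waiting in the queue
}

// Add enqueues a key; adding a key that is already queued is a no-op.
func (q *workQueue) Add(key string) {
	if q.dirty == nil {
		q.dirty = map[string]bool{}
	}
	if q.dirty[key] {
		return // already queued: deduplicated
	}
	q.dirty[key] = true
	q.items = append(q.items, key)
}

// Get pops the next key, or reports false when the queue is empty.
func (q *workQueue) Get() (string, bool) {
	if len(q.items) == 0 {
		return "", false
	}
	key := q.items[0]
	q.items = q.items[1:]
	delete(q.dirty, key)
	return key, true
}

func (q *workQueue) Len() int { return len(q.items) }

func main() {
	var q workQueue
	q.Add("default/ec2") // finalizer update enqueues the key
	q.Add("default/ec2") // status update: same key, deduplicated
	fmt.Println(q.Len()) // 1
	key, _ := q.Get()
	fmt.Println(key)
}
```

Two updates to the same object while the key is still queued collapse into one reconcile, which is exactly why we saw only one extra log line.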
Now, once you finish your reconcile run,
look at what happened: you update the
finalizer, then you create the EC2
instance on Amazon (if you haven't
forgotten the flow, this is what happens
in our reconcile logic), and once the
instance is created you update the
status. And once we have updated the
status, we are done; the worker is now
free. The only thing the worker is ever
responsible for is looking at this work
queue as soon as it gets free. When I
say it gets free, I mean the reconcile
loop has run successfully, and now the
worker is looking for any other changes,
that is, whether there is a key in the
work queue.
These are the reasons why the reconcile
loop would start, why the worker would
run the reconciler again. First, there
was a change made by the user to the
custom resource, which is the whole
process I just walked you through: you
make a change, the API server persists
it, the key lands in the work queue, and
the worker sees it and starts. Second,
your requeue interval has elapsed and
it's time to try again; if nothing
actually changed in the custom resource,
the reconciler shouldn't do anything, it
should simply exit. And the third reason
it could run after finishing is that
there is already another key waiting in
the work queue.
It's kind of like this: imagine you are
moving bricks from point A to point B.
There is a brick at point A, your job is
to move it to point B, and the worker is
you. If there is no brick, you wait, and
every 10 seconds you check whether there
is one; there's no brick, okay, you do
nothing. That's the requeue interval:
every X seconds or minutes you look, but
there's nothing there, so you do
nothing. The second case is that your
manager, or whoever it is, places a
brick there and tells you: I have added
a brick. Now you get active, pick up the
brick, and move it to point B. That's
when somebody has made changes to the
custom resource manually.
The third reason you might move a brick
is this: there was a brick at point A,
you were watching, and you are carrying
it over to point B. While that brick is
in transit, another one shows up at
point A. As soon as you finish moving
the first brick, your work is done, and
you immediately look back: there's
another one, so you move it too. In
other words, there was already a key
pending: while you were finishing your
work, someone put another brick there,
and as soon as you are done with the
brick in your hands, you look back, see
another one, and move that. That's the
work queue we are talking about.
As soon as there's a key, your worker
starts again. And this is why we only
see one log entry: when the worker
finished creating the instance and
updating the status, it was done and
about to exit, and then it saw there was
a new key in the work queue. In our
case, the third scenario applies. The
worker had started on a key that was
already in the queue, but something
added another key while it was busy. So
as soon as it was done with the current
run, the instance creation, it looked at
the queue and ran the reconciler once
more, because there was a key waiting.
And this is why you only see that log
once: we already have the idempotency
check that says if status.instanceID is
not empty, we have already handled this
object. This is the idempotency you have
to add to your operators so that they
don't keep running in a forever loop.
We talked about this already, but this
is why you only see it once: even though
you made changes to your custom resource
twice, the deduplication happened in the
work queue and there was only one key.
As soon as the worker takes it, the work
queue becomes empty because the worker
is processing the only available key; it
reconciles, the idempotency check kicks
in, and then there's nothing left in the
work queue. The worker is now waiting;
the worker is happy. From here, the
worker will trigger the reconciler again
if the user makes a change, or when your
requeue interval elapses, that is, if
you have configured your reconciler to
run again after a given interval, which
I think we have here.
Actually, looking at it, I don't think
we configured a requeue interval here.
So here is another small piece of
information for when you write your
reconciler. I'm not sure if I've
explained this already; I think I did,
but let's talk about it again. In your
reconciler you can define a
RequeueAfter. If I show the Result again
here, we can return two things: one is
Requeue, whether you want the operator
to requeue after the worker is done, to
try that again, and the other is the
duration after which to do it. In simple
words, you might want your controller to
run again every one minute, even if
there is no key in the work queue. You
can configure that when returning the
Result: you could set RequeueAfter to,
say, one minute. This is kind of like
turning your operator into a cron job;
you can make it behave like one by using
RequeueAfter. You are saying: I'm not
returning an error, so no retry is
needed; however, after this duration, a
requeue should happen. Imagine you want
to delete the pods that are stuck in a
ContainerCreating error state: you scan
through all the pods, delete the stuck
ones as a cleanup, and rerun the process
after X seconds or minutes. That's like
a cron job, and your operator can do
that as well with RequeueAfter.
So now you have a very good idea of how
the internals of your operator work.
Where is the work queue managed again?
The work queue is managed in the
operator itself, in the memory of the
controller. The informer is part of the
controller, the handlers are part of the
controller, the cache is part of the
controller, the work queue is part of
the controller, and so are the worker
and the reconcile loop; all of that is
what makes up the controller, and that's
important.
Now, about this cache, you might wonder
why it exists at all. What is the reason
for this cache? Think of it like this:
when the API server persists a change,
it sends out the update event saying it
has made a change to the custom
resource, and it also sends the actual
object, the whole YAML of the custom
resource. So it sends an update event
together with the whole custom resource
object, and this event is seen by the
controller. What happens once the
controller sees the event? The informer
inside the controller is the component
that watches for it, and it adds this
whole object into the controller's
cache.
And then, you know what happens: there's
the event handler, which adds the key to
the work queue, and there is a worker
that keeps looking at that queue. As
soon as it finds a key, added by the
event handler, and remember, the event
handler adds only the name and namespace
of the object, not the whole YAML, the
worker starts. Now, the worker needs to
access the spec; that's what we are
doing here. See, I'm checking whether
this object has a deletion timestamp,
I'm reading that object, I'm checking
whether its status has an instance ID or
not. So whenever the reconciler logic
wants to read the object, it does not
read it from etcd; it does not go to the
API server and read from etcd. It reads
it from the cache, which is orders of
magnitude faster than going to the API
server. And this cache is always up to
date: as soon as an add, update, or
delete event comes in, the first thing
the informer does is replace the old
copy with the new one, so your
reconciler always gets the latest state
of your custom resource object from the
cache. That is how it reads the object:
the worker reads the spec, the worker
reads the status; whatever you do with
the object is done from this cache, and
that makes it really, really fast. You
don't have to go out of your process,
out of the controller, to the API server
and then to etcd to read the spec or the
status; it's right there whenever you
need it. So I think this was the right
time to explain why we only saw one run
of the reconciler, not two, and I hope
it's clear to you how the API server
sends an update, how the informer (the
watcher and the informer are one and the
same thing) watches for it, how the
handler adds the key, and how the worker
reads the work queue and then runs the
reconcile logic. Now that you know all
this, it gives me a good opening to tell
you how Kubernetes handles object
deletion. We talked about how it creates
one; we worked through the caches, the
informers, the handlers, the work queue,
how the reconciler runs, and how
deduplication works. So now let's talk
about how Kubernetes handles the
deletion of objects.
This is where the deletion timestamp
comes into the picture. When you have an
object and you run kubectl delete on it,
the reconciler does not know whether you
asked to delete the object, update it,
update its finalizer, or update its
status; it has absolutely no idea. So
how would you tell that a particular
update was meant to delete the object?
How does Kubernetes know the user is
asking for deletion? It knows by adding
something called a deletion timestamp.
See, if you look at my object right now,
it does not have a deletion timestamp;
what it has is a creation timestamp. I
can run this in a loop; let me show you:
kubectl get … and you can see it here. I
can just show you the metadata. This is
my current object, the one I created,
and my reconciler is happy: it knows
this object has already been created and
is on Amazon. You can also run kubectl
get ec2instances and see the public IP
and everything; that's all fine. But
when you want to delete it, the way
Kubernetes marks it as a deletion
operation is by adding a deletion
timestamp, and that is what our program
can actually look for. You can see here
what the program checks.
Here is where you check whether it was a
deletion request. As soon as your
reconciler starts, it doesn't know why
it started: was it an update, or a
delete operation on the object? This is
where you use the deletion timestamp. If
the deletion timestamp is zero, it's not
a deletion request. However, if the
deletion timestamp is not zero, the user
has actually requested a deletion and
the instance is now being deleted. Then
you can call Amazon to delete your
instance and clean up properly, and only
once the deletion is done do you remove
your finalizer. This is how finalizers
are used when deleting an object: as
soon as you request a deletion, the
finalizer holds the deletion of the
Kubernetes object until the actual
external resource has been terminated.
And if you were not able to finish the
cleanup, you do not remove the finalizer
and the object is not deleted from
Kubernetes; you just try again, and once
the instance is gone, you finally let
the finalizer be removed, and the object
actually goes away.
Now, this is what you will see in the
logs. Let me go to the directory and run
delete on this EC2Instance; I hope you
can read this at this font size. What I
essentially want you to look at is this:
as soon as we send a delete request to
Kubernetes, the object gets a new
deletion timestamp, and because that is
an update to our object, the reconciler
starts. The reconciler then sees the
deletion timestamp and knows: I
basically need to delete my instance on
Amazon and then let the object be
deleted from Kubernetes. So if I do
kubectl delete, see what happens. The
object got a deletion timestamp
automatically, my object was updated, my
reconciler started again, and it saw: oh
wait, it has a deletion timestamp. And
you see my object is not deleted yet;
kubectl is waiting, it hasn't returned,
because I have a finalizer that is
holding the deletion. So you see the log
"has deletion timestamp, instance is
being deleted"; then we call the
deleteEC2Instance function, which I'll
show you in a moment. The instance
termination was initiated, and we are
waiting for the instance to be
terminated. If you go here, you will see
the instance is no longer running; it is
shutting down right now. This is the
instance we just created: it's shutting
down, and my object waits until it goes
into the terminated state, like the
other instances that are terminated. We
wait for the instance to be properly
terminated, and once it is, you can see
it's terminated. We're waiting for the
instance to be terminated; let's give it
a little time. The maximum time it will
wait for deletion is 5 minutes. The
controller will then update the object
and remove the finalizer for me.
And eventually everything gets cleaned
up. And that's what happened: I was only
able to delete my object once the
instance was terminated. You see here
"waiting for instance to be terminated":
it was waiting, and once the terminated
state came back, we log "EC2 instance
successfully terminated". And again,
because you changed this object, it is
another update for the reconciler, so
you can see the reconciler has started
again. See what happens, this is very
interesting: any update you make starts
it. So what happened is: you have an
object, you request a deletion on it,
the object gets a deletion timestamp,
and this is seen as an update by the
reconciler.
The reconciler runs and sees: okay, it
does have a deletion timestamp, so I
need to talk to Amazon to delete this
particular instance. And once it is
deleted, I will remove the finalizer and
let the object deletion complete. And
here you're making an update again. Now,
if I go back to our code to see how it
works, you can see we delete the EC2
instance. This is a very simple
deleteEC2Instance function: it logs
"deleting the instance", runs the
terminate-instances call to terminate
the actual instance, and then waits for
it to be terminated. This is quite
similar to how we used the running
waiter: we have a terminated waiter, we
wait for it to actually return
terminated, and then we say the instance
was terminated just fine. If we were not
able to terminate it, if there was an
error, we try again; but in our case
there was no error, which means we
terminated the EC2 instance and the
cleanup has happened. Now I can remove
my finalizer, and this is again an
update. This will again start the
reconciler.
Absolutely uh important. Any update that
you make, it's going to start the
reconciler.
So you remove the finalizer using controllerutil from controller-runtime: you say please remove this finalizer, and then you update the object. And as soon as you update the object, Kubernetes says: cool, I will go back and run the reconciler again. At this point the instance is terminated and the finalizer is removed, and we go back to the beginning of our reconciler; that's what you see now in the logs once we were able to terminate the instance.
Because we updated the finalizer at this location, the reconcile loop started again for that particular object. But here is something quite interesting: between the time when you remove the finalizer, which is registered as an update to the object, and the new run of the reconciler, your object in Kubernetes is actually gone. It's deleted.
Think about it: the custom resource for which you removed the finalizer has a UID, because every resource in Kubernetes has a UID; say its UID was abc. You removed the finalizer, which is registered as an update, and in that window your object was actually deleted. But the reconciler says: because you updated, I'm going to start the reconcile again from the beginning, and I will start it for the object whose UID is abc. It's not a new object, because you updated an object which existed and then did not exist anymore; the reconciler does not know that it is deleted. It just says: I'll start the reconcile again for the same
object. And this is where you have to tell the reconciler that even if you are starting again, if you try to get the object, you will get an error, because no object exists with the ID you are asking for. But if the error is "is not found" (you know, like when you do kubectl get pod xyz and you get back that the pod is not found, it's that kind of error), simply say: okay, it was a cleanup, I will not do anything, and I'll just wait for the next request. And that's what's happening in our logs. It started again: you removed the finalizer, the object was deleted, the update was registered, and the reconcile was started for the same object, but it is gone now. And that's where we are handling it; otherwise the reconciler would have kept saying "not found, not found, not found". We want it handled by saying the instance was deleted: if you cannot find this object, it's okay, because "not found" is exactly the error you get. And that is the entire end-to-end functionality of our controller.
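The not-found handling described above is typically the first few lines of Reconcile; a sketch (apierrors is k8s.io/apimachinery/pkg/api/errors, and `EC2Instance` stands in for this project's type):

```go
// Fetch the object for this request. If it no longer exists, this
// reconcile was triggered by our own finalizer-removing update and
// the object is already gone: nothing to do, don't treat it as an error.
instance := &EC2Instance{}
if err := r.Get(ctx, req.NamespacedName, instance); err != nil {
	if apierrors.IsNotFound(err) {
		log.Info("EC2 instance object deleted, nothing to reconcile")
		return ctrl.Result{}, nil
	}
	return ctrl.Result{}, err // a real error: requeue and retry
}
```

controller-runtime also ships a shorthand for exactly this case, `return ctrl.Result{}, client.IgnoreNotFound(err)`, which swallows the not-found error and returns any other error as-is.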
By this time you know how to create resources on Amazon. Your controller is idempotent, so it doesn't keep running the loop again and again and again. It can handle the termination of instances. It can handle the finalizer as well: only when you remove the finalizer will it let Kubernetes clean up the object. And it quietly ignores the case where an update was registered, the rerun of the reconciler happens, but the object is gone. We handle that by checking, when we try to get the object, whether the error we get is "is not found". Kubernetes has a package called errors where all these errors are defined; you see IsAlreadyExists or IsNotFound.
Sometimes you create an object and Kubernetes says it already exists; you can use these error types in your program. I'm using the one called IsNotFound, and this is what it says: IsNotFound returns true if the specified error was created by NewNotFound. So you are trying to get an object, but you get an error because it doesn't exist, and that error is "is not found". That's what we are doing: if it's not found, simply say okay, the instance is deleted, no need to reconcile, and wait for new objects to come in. This loop will now only run when a new object of the EC2 instance type is created or updated. Let's try that again: I will ask for a new instance, because I no longer have this object in Kubernetes, and it's going to create a new VM for me. So let's do that and you will see this in action again.
Create, and you see on the right side the reconcile has started, because there was an update, the create of our instance, and the same thing happens: we see that this object is new, so we create the instance. We add the finalizer, and once the finalizer was added I call the Amazon API to get me an instance. I describe the instance, get the public IP, and I update my status.
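That status write is its own API call against the status subresource; roughly like this (a sketch; the status field names `InstanceID` and `PublicIP` are assumptions based on what the course stores):

```go
// Record what we learned from AWS on the custom resource's status.
// This update also retriggers Reconcile, which must then notice the
// existing instance ID and do nothing (that's the idempotency part).
instance.Status.InstanceID = instanceID
instance.Status.PublicIP = publicIP
if err := r.Status().Update(ctx, instance); err != nil {
	return ctrl.Result{}, err
}
```

Using `r.Status().Update` rather than a plain `r.Update` is deliberate: it writes only the status subresource, so it cannot clobber the spec the user wrote.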
You go back to Amazon and you will now see that there is an instance running. There we go. This is the instance which is now in the running state; we had a waiter that waits for the instance to reach the running state. And this is the same public IP and the same instance ID that we see in here. Now, I want to show you something, which is again what's happening here: we do get the object, and as soon as we update the status, the reconcile starts again and says the requested object already exists, because you have an instance ID, so I will not do anything.
Try the deletion again just to see it's working. I do a deletion; this time the object gets the deletion timestamp and we clean up the Amazon instance. Then we remove the finalizer, the object is removed, and we handle the next run of the reconciler by saying: the object was deleted, it's okay, nothing needs to be done. So you see it got started again. It found the deletion timestamp, which means the instance is being deleted. See, you will not have a deletion timestamp if you do not delete an object; I showed you in the previous examples it only had a creation timestamp. But as soon as a deletion timestamp is added, that is a cue that you are trying to delete the object, and that's what we're looking for: if your object has a deletion timestamp, it should be terminating. And there you can see my instance state changing from running to shutting down, which is what we saw
to shutting down, which is what we saw
before as well. It will wait for some
time for that to be terminated because
we're using a waiter. Golang uh the Go
SDK for for Amazon. lets you wait using
these waiters. Instead of you waiting
for two seconds, then pulling again,
then pulling again, there's a waiter
that lets you do it very very easily. Or
you can also do uh kind of like long
polling if Amazon supports it. I'm not
sure, but you can do a periodic polling
that you wait for 5 seconds up to 5
minutes and then you pull if the
instance state is terminated and blah
blah blah. But you see here my object is
holding the deletion or Kubernetes is
holding the deletion of my object. I'm
waiting I'm waiting for the finalizer to
be removed and the finalizer will only
be removed when the object is is cleaned
up because of our logic in here. We wait
for the object to be deleted. If the
object was properly deleted then only I
remove the finalizer. Otherwise, I just
send an error and I go back to the
beginning and try that again. Try the
deletion again. And because I would be
able to terminate my instance, it's
waiting a little bit longer. And these
are um this is the beauty of using a
waiter.
It's not like you are waiting a dedicated 5 minutes: using a waiter with the Go SDK for Amazon, if the instance is terminated earlier, you don't wait the entire 5 minutes. It's a more efficient way of waiting for resources. And then you can see it is now terminated successfully. So: you remove the finalizer, which registers an update; your object is deleted; that registered update starts the reconcile loop again; and then you say cleaned up, no need to reconcile. This is the beauty of Kubernetes operators. What this shows you is that you are able to manage your Amazon instances right from your Kubernetes environment, and that is what this course was about. You can make this fancier: as I said, you can add a bit of drift detection. In my case I didn't, because I want you to build it in your own program, where you can say: if the instance is stopped on Amazon, update the status of the Kubernetes object from running to stopped, clean the public IP, and all that. But this is a very good example of using Kubernetes as a platform to manage other platforms. You can use Kubernetes not just as a destination platform for your applications; you can use it as a platform to manage your resources on any other platform, which is the beauty of Kubernetes, and extending it with operators is how you do it. So this was the entire demo, and this was the entire code, which I want you guys to try yourselves. Now let's see how you can package this properly with Helm and how you can actually run it inside of Kubernetes, because right now it is running on my local computer, using the KUBECONFIG environment variable to connect to my Kubernetes. Let's package this using Helm, see how you can ship it and run it inside of Kubernetes, and let's get started there. So now that we
have seen our application, our controller, running end to end and able to create the EC2 instances, let's look at how it is running right now. There's my computer; I've installed Go there, and I'm running it with go run. Here's where the reconciler is running, connecting to the Kubernetes cluster using the KUBECONFIG environment variable. There are a couple of other environment variables as well, the AWS access key and the secret key, which are used to authenticate to Amazon and eventually create an instance; from there we are able to get ourselves an instance, and this gets reflected in my Kubernetes cluster. The thing, however, is that you're not going to run this application here. Essentially, an operator runs inside your Kubernetes cluster: it runs as a generic pod that has access to the credentials needed to talk to your Amazon environment, and this pod runs in Kubernetes with a service
account. Now, it's quite important to understand how RBAC plays a role whenever you are writing an operator. Imagine a namespace, give it any name, and then another namespace under which your operator is running. Let's say the first namespace is called bidding: you have a team whose name is bidding, they do bidding for clients, and they are using your custom resource, the EC2 instance. So they create an EC2 instance object. Now, for your operator to know that in this bidding namespace there has been a change to an EC2 instance object (because your operator listens for these objects and their changes), you actually need to give access to the operator pod, which is running with a Kubernetes service account. You need to give the service account access in this namespace to be able to list, get, and so on: the basic Kubernetes RBAC verbs. You need the service account to have access to these namespaces for the object called EC2 instance.
And you will need to give this service account access both to read this object and also to write to it, because you need to update the status of this EC2 instance.
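In a Kubebuilder project, this read-and-write access is declared with RBAC marker comments above the Reconcile method, and `make manifests` generates the matching Role/ClusterRole YAML from them. A sketch (the API group `compute.example.com` is a placeholder, not this project's actual group):

```go
// RBAC markers for the EC2Instance custom resource: read and write
// the objects themselves, their status subresource, and their
// finalizers. Kubebuilder turns these comments into RBAC manifests.
// +kubebuilder:rbac:groups=compute.example.com,resources=ec2instances,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=compute.example.com,resources=ec2instances/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=compute.example.com,resources=ec2instances/finalizers,verbs=update
```

Because these generate a ClusterRole, the operator can watch EC2 instance objects in every namespace, which is exactly the cross-namespace access being described here.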
So both of them are needed, and this is how your operator will be able to manage this namespace, or at least manage the EC2 instance objects in this namespace: go to Amazon, create an instance there, and eventually update the EC2 instance status, giving the bidding team the public IP of the instance which was just created for them. For this operator to actually run inside the Kubernetes cluster, we need to build a Docker image. This is a no-brainer; you saw it coming miles ago. We need to build this image, and you will be pushing it to a repository. From there you will create a deployment in Kubernetes that uses this image, and then you will deploy this pod, which is the operator pod. You also need the credentials here: you need the AWS access key and you also need the secret key.
You will then also need to create a secret in Kubernetes, reference the secret in this particular deployment, and roll out that pod. So eventually the pod has the logic for creating our instances and managing them on Amazon, and it will also have the right authentication artifacts needed to talk to Amazon. For building your image, there is a Makefile available from Kubebuilder, which is very simple. So let's see how this works. This is the Makefile in the project, which comes from Kubebuilder. The first thing you will need to change here is the URL of your image: where you want Docker to tag the image and eventually push it for you. For me it's my Docker Hub repository, and I think I'm keeping it public, so anybody who wants to use it can. And this is the Docker
image and it has lots of targets
available. You have already used this Makefile for creating your manifests: when you updated the API spec you had to regenerate the manifests, creating the custom resource definitions. You also have some targets for testing your application, and some linters available. And here's where things get interesting. Instead of what we have been doing, go run cmd/main.go, we can also say make run; it's kind of like an alias. It generates the manifests and the boilerplate code, formats your Go code, and also runs go vet against your code. What we are looking
for is the docker build. It has a target called docker-build. Essentially, all it does is run your container tool; in my case it's Docker. So it runs docker build --tag with the image tag that I declared above, and it builds me a Docker image with that particular tag. Then I can do a docker push to push my image to a registry. Now, it goes without saying that your Kubernetes cluster will need access to this container registry, because without that it will not be able to pull the
images. You can also build images for multiple architectures. Right now I'm only building for ARM, because I'm running this on a Mac and my Kubernetes cluster is also running on the Mac, so it's all ARM for me. But you might build this on your Mac and want to run the operator on an AMD64 machine. You can use the docker buildx target to build it for different platforms, generate a single manifest, and deploy it there. This Makefile makes it very simple to build your images for the platform you are running on, or cross-platform as well. So
let's do that. I will run make docker-build, and what this does is build the Docker image from this repository, the EC2 operator. This is where it takes a little bit of time. You see it's building for Linux, but for the arm64 architecture, and this is where our source code is compiled into an executable binary which is going to be called manager. Let's wait for that to finish, and once it's done you can see the Go version we are using is 1.23.
If you want to see the Dockerfile, it is very minimal. You use golang:1.23 as your builder; you copy your go.mod and go.sum, set up a working directory, copy your api and internal folders and your main.go, and eventually build your manager, because this is the binary that runs your controller manager. Then, from the distroless base image, you just execute this manager binary, which you have built with Go, as user 65532. It's a non-root user, which is a good thing: you almost always want to run your container images with a non-root user, for security reasons. Once the image is built, I can simply say make docker-push, and this pushes my image to the registry. A few layers were already pushed from when I was preparing this course, and now it is pushing your image to the registry. If I want to see this, let's go to Docker Hub. Can I see that here?
Of course. So hub.docker.com, and there will be the image. Search for my username and there will be a couple of images I have. What was the name of the image? EC2 Kubernetes operator. Here it is; this one only has the latest tag. You can have a CI/CD pipeline: if you are storing your code in GitHub, you can use GitHub Actions to always update your images in case your API spec changes, your main.go changes, or your internal folder, which contains the actual controller logic, is updated. You can trigger a new build and then trigger a new deployment.
With that aside, we have our image. However, the other artifacts are still needed: we need a deployment, we need the secret for this deployment to work, and we need a namespace. So building the image wasn't the hard part, because you still need quite a few resources here. You also need the Kubernetes artifacts for role-based access control: you need to give the service account running this pod access to the EC2 instance resources at the cluster level, because it should be able to work in any namespace, at least for this object. So this RBAC is also required. You should see where we are going with this: we need something to ship this application to other customers, and that's what we will be using Helm for.
You can create a Helm chart, and one of the things that I wanted to do with this course is a Helm chart which shows an end-to-end delivery of this application. You could do that yourself: the helm create command makes it very simple to scaffold a Helm chart. Then you would update your deployment to set the environment variables from the secret, and you would create the secret. It's simple. However, there are two ways in which Kubebuilder can actually help you. The first
one is make build-installer. There's a target called build-installer, and what it does is read your Makefile, take the image that you have configured there, and generate a file called dist/install.yaml.
And if I show you what this file looks like (it's a new file), you see it has all of the artifacts needed for your application to be deployed in Kubernetes. It creates a namespace called ec2-operator-system. Then it has the custom resource definition, which is our EC2 instance. Then it has the service account which is going to be running our actual pod. Then you have a couple of roles, some cluster roles, and some cluster role bindings; among other things, they let you create, update, and delete the resources in this API group.
Kubebuilder really helps you bootstrap your deployment strategy: with this Makefile target you can create a single deployable unit. And here's the important thing: it gives you a service as well as a deployment. You can see it's using the image that we just pushed, and it has a liveness probe and a readiness probe. In our program, we did not create an endpoint at /healthz; we do not have a liveness probe and we don't have a readiness probe, because I wanted to keep it simple and focus on the operator. So you will probably remove them: get rid of the liveness probe, here, and then get rid of the readiness probe. When you are actually deploying this for production, though, these things are really good to have, so you can check the health of your operator.
Now, this container also needs some environment variables. So you add an env section here: you can have the access key environment variable and the secret key environment variable, coming from a secret called aws-credentials, and then at the end of the file you can append the Secret itself, apiVersion v1, kind Secret. There you go; some random data is being spilled, but that's okay.
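The appended pieces might look roughly like this (a sketch; the secret name `aws-credentials`, the standard AWS variable names, and the namespace are assumptions, and the Secret values must be base64-encoded):

```yaml
# Environment variables on the manager container, read from a Secret
env:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: AWS_ACCESS_KEY_ID
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: AWS_SECRET_ACCESS_KEY
---
# The Secret itself, appended at the end of dist/install.yaml
apiVersion: v1
kind: Secret
metadata:
  name: aws-credentials
  namespace: ec2-operator-system
type: Opaque
data:
  AWS_ACCESS_KEY_ID: <base64-encoded access key id>
  AWS_SECRET_ACCESS_KEY: <base64-encoded secret access key>
```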
This gives you a complete deployable YAML file which you can just kubectl apply -f to deploy the application. However, there's no version control over this file the way there would be with a Helm chart, and somebody who wants to deploy this controller needs to know the lower-level details: where to create the secret, where to put the AWS access key and secret key, where to update the controller parameters in the deployment. So that is still a problem, and that's where Kubebuilder helps you further, instead of giving you this one big file as a single deployable unit. You can use the kubebuilder edit command and tell Kubebuilder that you want to use the plugin helm/v1-alpha.
Essentially this generates a Helm chart to deploy your operator. And if I do that, you can see "generating Helm chart to distribute the project". We don't have any webhooks created, which we discussed in the beginning, so it doesn't generate those; but it gives you all the role-based access controls and all the templates for your deployment, your service account, your services, everything, in the dist/chart folder. And this is how your Helm chart looks; it was created here. The name of your chart is ec2-operator, which is also the name of your project. And
once you have this, you can look at the templates, where the actual Kubernetes constructs are created. You have one for the CRD, the custom resource definition, which is essentially what we want to deploy. We also have a cert-manager template, which gives us an issuer and a certificate; this is part of the webhook setup, in case we were using any webhooks and wanted cert-manager for that. Here's where the interesting part is: this is the deployment of your manager, and this is where you use a values file to define the values for your resources.
You can also see that if you have metrics available from your operator, in case you are exposing them to Kubernetes, there is a service for them, you can create network policies, and you can have service monitors. And then there's a plethora of RBAC rules which are created for you. So this makes it very simple to ship your project without running helm create and building all these resources yourself, letting you easily control the behavior of your operator through a single Helm chart with the values
file. So, the values file shipped with the Helm chart we just built with Kubebuilder controls the deployment of the controller manager that ships with my operator. It defines how many replicas of my controller manager I want and where the image comes from; this is what you pushed with docker push, so let's take that image and change it here. Then there are a couple of arguments available to your controller manager. The first one is --leader-elect. This is something you would use when you have leader election, running multiple replicas of your controller; in our case we are just running one, so it doesn't do much for us. And there are two more arguments, the metrics bind address and the health probe bind address, which we will talk about
in a little bit. This is a standard
Kubernetes concept where we define the CPU and memory limits the application needs. And here's where it is really interesting. Usually, when you are building an application, it is your responsibility to run an HTTP server inside it if you want to use the HTTP GET type of liveness or readiness probe, and it is your responsibility to create the endpoint, in this case the health endpoint, and the ready endpoint as well. Usually you write the application, make these API endpoints available, and then tell Kubernetes: check my application on this port and this path, and see if you get a 200 response within this period and after the initial delay; if so, my application is live, my application is ready. Otherwise, do what you need to do when an application fails its readiness or liveness probe, which is either stop sending traffic to it or kill the container and redeploy it. We did not create any sort of health or readiness API ourselves, and that is the beauty of the operator framework we are using, Kubebuilder: these endpoints are already available in your controller, so you can use them right out of the box with the readiness and liveness probes. And this is where I'm configuring that my health probe bind address is any address in the container, on port 8081, which is where my health and ready probes are served. It's also important to understand that you already get some metrics out of the controller manager when you use Kubebuilder to write your own operators. You don't have to implement the logic of exporting metrics from your application; it is done for you by Kubebuilder. Of course, it's a limited set of metrics, which we will explore, but it makes total sense from the controller's point of view: whether it's working properly or not, how many times it has reconciled, how many times it has failed, how many times the reconcile loop was successful. All of that is right out of the box for you to use in your operator.
Then there are a couple of security-related contexts, because we don't want to run our container as the root user, and the service account name we want to use. And here's where things get interesting.
See, when you are working with Kubernetes, you usually, in the same cluster (let's say this is our Kubernetes cluster), create a namespace for your operator, and then there are customer namespaces. Let's say I call the operator namespace ec2-operator; this is where my operator usually lives, running as a pod. And over in a customer namespace is where I will be creating my object, the EC2 instance. Now, if a developer creates this object in their namespace, my controller should be able to react to that change, because that's the object my operator is listening for; in fact, for any number of namespaces anywhere in the cluster, if an EC2 instance is created, deleted, or updated, my operator should be able to see that change. And this is why my operator pod runs with a service account: I need to give this service account a role and a role binding, or a cluster role and a cluster role binding, which allow the service account to list, get, update, patch, watch, and delete, to see the changes happening on this particular Kubernetes object. That is absolutely important.
Otherwise, you will only be able to create your instances in the same namespace, and that's not what we want to do. A common pattern for Kubernetes is that you create your operator in a dedicated namespace and let users use that operator from their own namespaces, by creating the object the operator is listening for. And that's what we are doing here: we want to enable all the role-based access control needed, which is again coming from the templates and RBAC here. All these RBAC roles, role bindings, cluster roles, and cluster role bindings are required for my operator, running with this service account, to be able to list, get, patch, and update all those Kubernetes constructs through the API, and I want that to be allowed. Otherwise it would be you who has to figure out which roles and role bindings to give to my operator, and what to do at the cluster level to allow it access to the EC2 operator resources across namespaces.
So this Helm chart from Kubebuilder really makes it simpler for you. You can also control whether you want to enable the custom resources: this Helm chart does not just deploy the controller, it also deploys the custom resource definitions for you. Here's where you can set enable to true, saying yes, I want to deploy the custom resource definition as well, and I want to keep it in case someone does a helm uninstall of my chart. See, you will be using this Helm chart to deploy this operator. Now you might decide to uninstall, but what to do with the CRD? Would you like the CRD to stay available, so that somebody could still deploy an operator manually, creating a deployment, and at least your cluster would understand the custom resource definition? Or do you want to clean it up as well? This is the flag that decides whether it keeps the custom resource definition or deletes it.
There are also metrics available. As I said, the operator that you have written with Kubebuilder comes with pre-built metrics. We will explore these metrics, and you can say that you want them to be exposed, that is, accessible from outside the pod. For that, the chart creates a Service in the namespace. So if I go to my templates and show you the metrics template, here is what it is doing: if the metrics value is enabled, all it does is create a Service-type resource in Kubernetes, where the port on the Service is 8443 and the target port is also 8443. However, in our case the metrics endpoint is listening on 8080, so I will change the target port here to 8080. This will create me a Kubernetes Service listening on port 8443 and forwarding to my pod on port 8080.
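The rendered Service would look roughly like this (the name and selector labels are assumptions; only the port mapping comes from the narration):

```yaml
# Hypothetical rendering of the chart's metrics Service template.
apiVersion: v1
kind: Service
metadata:
  name: ec2-operator-metrics        # assumed name
  namespace: ec2-operator
spec:
  selector:
    control-plane: controller-manager  # assumed label from the scaffolding
  ports:
    - name: metrics
      port: 8443        # port exposed by the Service
      targetPort: 8080  # port the controller's metrics endpoint listens on
```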
This is used by the Prometheus ServiceMonitor. Again, my cluster does not have Prometheus installed, but if it were installed, enabling the Prometheus key would create a ServiceMonitor, which then uses this particular Service to reach the pod and scrape the metrics from it, to show them in Prometheus. You can then have a dashboard on top of that, using Prometheus as a data source in your Grafana. Pretty straightforward stuff. And here is where we control whether we want cert-manager injection for our webhooks or not. Right now we are not working with any webhooks, so I am just going to keep that disabled. I am also not using any network policies, so I will disable those as well. You probably want to enable this if you want the metrics to be scraped only by a Prometheus running in a certain namespace; you can use network policies to control that behavior.
Now, once you have this, we can deploy this Helm chart, but it is missing one thing. See, your pod is responsible for going to Amazon and creating resources there, which happen to be EC2 instances. You need access to Amazon; in other words, you need authentication.
Now, when we were doing this locally, which I explained to you, I had my Amazon environment variables already exported, but right now my pod does not have them. The code reads them from environment variables, but I also need to set them in my pod. So you have to set environment variables for the AWS access key and secret key so that you can authenticate to Amazon. That is what I had already done in my shell: if I show you env for AWS, you can see my access key ID and my secret key, which, by the time you are watching this, I will already have disabled, because there is no way I want those keys exposed publicly.
Now, there are two options for us to pass these environment variables into our controller. The controller is created by a deployment, which happens to be here. This is the deployment responsible for deploying our controller; it uses the image that we have given, which runs the manager command set by the Dockerfile when we build the container image. And here is where we can define some environment variables. Option one: you can create a secret. So I would create a file, say aws-secret.yaml, of kind Secret. You can see it creates a secret called aws-credentials in that namespace, with type Opaque, holding your AWS access key and secret key; of course you will put them in as plain text here. And then you can reference that secret, for example here, for the access key ID and the secret access key.
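Sketched out, that Secret and the matching reference from the deployment's container spec might look like this (the secret name, key names, and variable names are assumptions, not copied from the repo):

```yaml
# aws-secret.yaml — holds the AWS credentials (values are placeholders)
apiVersion: v1
kind: Secret
metadata:
  name: aws-credentials
  namespace: ec2-operator
type: Opaque
stringData:
  AWS_ACCESS_KEY_ID: "<your-access-key-id>"
  AWS_SECRET_ACCESS_KEY: "<your-secret-access-key>"
---
# Excerpt of the controller deployment's container spec:
# pull both values from the Secret instead of hardcoding them.
env:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: AWS_ACCESS_KEY_ID
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: aws-credentials
        key: AWS_SECRET_ACCESS_KEY
```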
That is one way of doing it, and it is probably the better way, because you keep your sensitive data in a Secret, you use that Secret in the deployment, and then you deploy the application. Eventually it gets the sensitive data, the access key ID and the secret access key, from your Secret, and the code runs fine. The other way, which I am going to use here, is quite wrong; we should not do it, but for the demo I am doing it anyway. The deployment created by the Helm chart reads environment variables from values under controllerManager.container.env, so I can actually set them right there: this is my AWS access key ID and this is my AWS secret access key. Of course, this is something you would create a secret for and then reference, which I just showed you, but to keep it simple I am showing you that there is another way, which is a bad way. You have been warned: be very careful about controlling your access key IDs and secret keys. You should never put them in plain text in your code, in your Helm charts, or in your values file. Never. By the time you are on a journey to build an operator, you probably already know about the External Secrets Operator project, and that is what you would use to read these secrets from a secrets manager like Vault or Google Secret Manager; it has integrations with other backends as well.
Now, once we have this, once we have our controller, we have defined the number of replicas I want, the environment variables, the right image repository, the liveness probe, the readiness probe, and everything else, it is time to deploy this Helm chart to our cluster. For that I can just go to this chart, because that is where it was created, and this is my values file. Let me just check whether I have any errors somewhere in the file. Probably not; it looks good to me. This is the range, again, and this is the end of the if condition. Looks good. So let's do it now with helm install. First, I have no EC2 instances. I already have the custom resource definition, because I was trying this Helm chart earlier, but let's delete that as well. Delete the custom resource definition; that is deleted, and now my Kubernetes cluster does not understand the resource anymore: it could not find the requested resources. And now let's do it. So: helm install, install my ec2-operator Helm chart, which is the name of my chart, and create the namespace for me. The namespace is named ec2-operator, which already exists, but if it is there the flag does nothing, so there is no problem; it is idempotent, in a way. Then comes the values file, and dot is the Helm chart that I have just created.
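Put together, the install command narrated above would look something like this (the release name, namespace, and values file name are assumptions taken from the narration):

```shell
# Install the operator chart from the current directory (.),
# creating the target namespace if it does not exist yet.
helm install ec2-operator . \
  --namespace ec2-operator \
  --create-namespace \
  --values values.yaml
```

The --create-namespace flag is safe to leave in place: if the namespace already exists, Helm simply proceeds.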
Now let me look at the pods, and you can see the pod is running, using our image. If I describe that pod, we can see we have the liveness probe, we have the readiness probe, and we have the environment variables as well. And if I do k get pods, I can see it is running on this particular node. Now, I am using k3d and I have some nodes available: there is one control plane node, the master, and then there are two worker nodes. All of them are actually Docker containers. If you remember, when we were setting up the development environment we used k3d, and if I do docker ps and grep for agent-0, you can see k3d-ec2-operator-agent-0. That is the same name as the worker on which this pod is running. What I want to show you is, if I exec into it
if I exec uh into
uh docker exec husband it uh sh if I
exec into this container
um I can get the IP of my pod and I can
say port 8080 / health. It doesn't have
cur but if I do wget um that also fails
saying on this IP port 8080 there's no
health. Let me check what was the
endpoint for my health checks
for the health probe. It was 8081.
That's correct. So I need to look on
port 8081 / health to find out if my pod
is healthy or not. And uh it says health
already file exists. So let me do a
little cleanup. Uh health ready metrics
cuz I was trying this before. So it was
already there. Let's start from the
beginning.
I want to see whether there is a health endpoint on port 8081 inside my controller pod, and you see the file health is saved. If I cat health, it says ok. This tells me my health probe, the liveness probe, is working fine, because on this port I have this API endpoint and it returns me a value of ok. The same goes for the ready endpoint; maybe let me increase the font a little bit. Here you can see the file ready is saved as well, and if I cat ready, that is also ok. So on both of these endpoints, for my liveness probe and my readiness probe, on this port, I get an ok. These endpoints were created for us by controller-runtime so that we can do health checks.
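From inside the node container, the checks narrated above boil down to something like this. The pod IP is a placeholder, and the paths follow the narration; note that the Kubebuilder scaffolding's defaults are /healthz and /readyz, so check which paths your main.go actually registers:

```shell
# Probe the controller's probe endpoints from inside the k3d node container.
# POD_IP is whatever IP your controller pod got (see: kubectl get pod -o wide).
POD_IP=10.42.0.15                    # placeholder

wget "http://$POD_IP:8081/health"    # liveness endpoint; wget saves the body to ./health
cat health                           # should print: ok

wget "http://$POD_IP:8081/ready"     # readiness endpoint; saved to ./ready
cat ready                            # should print: ok
```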
It is also interesting that we have some metrics as well, on port 8080. So if I do wget on port 8080 at the metrics endpoint, you see there is something available on this endpoint too. And if I do less on metrics, these are all the Prometheus metrics that are already built in and exposed by the application you have built. This is not something you wrote; it is already given to you by the controller runtime. And here you can see controller_runtime_reconcile_total: how many times the controller has reconciled and resulted in an error, and how many times it reconciled and resulted in a requeue.
You can use all of this information, such as how many errors you have had in total so far, to show a dashboard of how your operator is doing. With these metrics, if you see that the errors are going up, you can make changes to your operator, changes to your code, and eventually end up in a better state than before, because you have metrics, you have insight into how things are going. You can even do your own code instrumentation for additional metrics; for example, you can report how many EC2 instances have been created by this particular operator. Wiring Prometheus into your code and exporting those metrics in a format Prometheus understands and can scrape is a different topic altogether, but if you know it would be nice to have this instrumented in your code, you can export how many EC2 instances were created or deleted, so you can know how much people are using your operator.
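As a taste of what such instrumentation could look like, here is a minimal Go sketch using controller-runtime's global metrics registry; the metric name and the place it is incremented are my assumptions, not code from this course:

```go
package controller

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

// ec2InstancesCreated counts EC2 instances this operator has created.
// The metric name is a made-up example.
var ec2InstancesCreated = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "ec2_operator_instances_created_total",
	Help: "Total number of EC2 instances created by the operator.",
})

func init() {
	// Registering with controller-runtime's registry makes the counter
	// appear on the same /metrics endpoint as the built-in metrics.
	metrics.Registry.MustRegister(ec2InstancesCreated)
}

// Then, in the reconcile path, after a successful create call:
//     ec2InstancesCreated.Inc()
```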
So it looks like my pod is running, my pod has the right health endpoints, it has the right liveness and readiness probes, and I have the metrics available. But now I want to create an instance, because that is what it should be doing. Fine, everything is happy; but is it really doing what it is supposed to do? So let's try it. I am going to look at the logs of my EC2 operator. You can see it is starting the workers, it is all healthy, it is waiting for my resources. Now I am going to do the same thing I did when we were running this outside the cluster, on our computer: I am going to create an EC2 instance, which looks something like this. This is my EC2 instance; I give the instance type, the AMI ID, the region I am using, the availability zone, and the key pair. This is something we had already used before, but now I want the operator running in my Kubernetes cluster to create this instance for me, because eventually that is where you will be running it: inside the Kubernetes cluster. This is the moment of truth, what we have been working towards so far. To keep it simple, you can see in the AWS console that I do not have any instances running; there is nothing there. So let's do it, and I will do a create.
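The manifest being applied looks roughly like this; the kind and the listed fields follow the narration, but the API group, version, and exact field names are assumptions you should match to your generated API:

```yaml
# Hypothetical EC2Instance custom resource; adjust apiVersion and field
# names to whatever your kubebuilder-generated types define.
apiVersion: compute.example.com/v1
kind: EC2Instance
metadata:
  name: my-instance
  namespace: default
spec:
  instanceType: t2.micro          # EC2 instance type
  amiId: ami-0123456789abcdef0    # AMI ID (placeholder)
  region: us-east-1               # AWS region
  availabilityZone: us-east-1a    # availability zone
  keyPair: my-keypair             # SSH key pair name
```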
Now, as soon as I did that, this output should look familiar to you. This is where we got a request; it says the request is new, so we are creating a new instance. We add a finalizer, then we go to Amazon and wait for the instance to be running, and you can see the instance is already created. If I list the EC2 instances in the default namespace, here you can see it. Can I ping this instance? Of course. And that is the beauty of my operator: I got an EC2 instance which I can access by its public IP right from my computer. I just do kubectl get ec2instances, grab the IP, log in, and start working there. The important thing is that this EC2Instance object is in the default namespace, while my operator is actually running in the ec2-operator namespace. These are different, and this is what I was talking about here.
This namespace is default, and that is where the object was created, while the pod running in my ec2-operator namespace went to Amazon, because that is where the instance lives; you can see it is running now. And once it was done, my EC2 operator went back to this object and updated its status. So my operator needs access not just to read the object but also to write to it, so that it can update the status: the state of the instance, the public IP, and the instance ID that was created on Amazon. And that is what we have been looking forward to, so that we can go ahead and create our instances and manage our Amazon environment, including cleaning it up. So I can show you that it actually also deletes resources; let me clear the screen a bit. So, starting from
when we delete our instance: what happens? As soon as I do a delete, my reconcile loop starts, because there is an update to the object, and I see it has a deletion timestamp, meaning the instance has been marked for deletion. So we print that we are now deleting the EC2 instance, then we use the Amazon API to send a terminate request to our instance, and then we wait for the instance to be terminated. That is essentially what is happening here. So you see it is not running
anymore. It is now terminated. This was
the instance which was terminated.
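The deletion path described here typically sits at the top of Reconcile and looks something like this; a minimal Go sketch in controller-runtime style, where the finalizer string and the deleteEC2Instance helper are assumed names, not code from this course:

```go
// Sketch of the deletion branch of the reconcile loop.
const finalizerName = "ec2.example.com/finalizer" // assumed name

if !instance.ObjectMeta.DeletionTimestamp.IsZero() {
	// The object has a deletion timestamp: it is being deleted.
	if controllerutil.ContainsFinalizer(&instance, finalizerName) {
		// Terminate the EC2 instance on AWS and wait until it is gone.
		if err := r.deleteEC2Instance(ctx, &instance); err != nil {
			return ctrl.Result{}, err // requeue and retry later
		}
		// Only after the cleanup succeeds do we remove the finalizer,
		// which lets Kubernetes actually delete the object.
		controllerutil.RemoveFinalizer(&instance, finalizerName)
		if err := r.Update(ctx, &instance); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}
```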
And as soon as it got terminated, my waiter said, okay, it is all fine, it is now terminated, and eventually I was able to delete my EC2Instance object in the Kubernetes cluster. The deletion was pending: Kubernetes did not delete the object until the actual resource on Amazon was cleaned up and the finalizer was removed from my object, and only then was the object actually deleted. So this is how you will build an application: you test it locally, you build it into a container image, and then you ship it to the different clusters where you want it, deploying with a Helm chart. Essentially, what you were able to do locally is now all happening in your Kubernetes clusters, because that is where the operator should be running. So this is what I wanted to show you: an end-to-end journey, starting from bootstrapping the project, then building it, testing it, eventually making it work, then deploying it and running it in the cluster with all the proper role-based access control, the proper liveness and readiness probes, and also the metrics, which are nice to have to see how your controller is really doing; and when everything is good, there is no need to reconcile, all is happy. So I think it makes a lot of sense to write your own operators, and I want you to try this out, see how it works, and let me know if you have any questions; I will be happy to help. Let's move on.
All right, so this course on building an operator for the cloud, one that manages your EC2 instances, is now coming to an end. And I have to admit, it is quite a lot. But trust me, what you have just done with this course is understand one of the most advanced concepts in Kubernetes: the reconciler, how to write applications that are self-healing, how to write an operator. While you know the basics now, while you have a very good understanding of how to work with operators, there is no limit to it. Think of this course as a launchpad that enables you to go ahead and build cool stuff that runs on top of Kubernetes: not just container images running on Kubernetes, but software that runs on Kubernetes and manages your other infrastructure, which is what we did with Amazon. You might be using Azure; try to make developers' lives easier by managing Azure resources through Kubernetes. Maybe you are doing this on GCP. The sky is the limit for you. Now you not only know how to use Kubernetes, you also know how to write applications native to Kubernetes that manage your other environments. You know how to work with reconcile loops, how to design the APIs, what the controller logic is, and what the integration looks like when you are working with the cloud. And you are very employable now, or you are probably already working as a platform engineer, a site reliability engineer, or a DevOps engineer, but now you know how to extend and build on Kubernetes. We have looked at Kubebuilder as a project in good detail in this course; we have really used it to bootstrap an operator headed towards production, though I would not say it is production-ready right now. When you build an operator, you run it, it fails, people complain about it, you refine it, and eventually it becomes production-ready. But you have the tools now to go on that journey yourself. Next, try to build your own operator. Try to extend this particular operator to have more metrics available. Try to write your own operator that manages S3 buckets, for example, or one that manages EFS file systems on Amazon, or maybe anything else; it does not have to be limited to the cloud. So go ahead, have fun building new operators, and have fun building new tools that run on top of Kubernetes. And if you have any questions, let me know.
In this hands-on Kubernetes Operator course, you will learn how to extend Kubernetes by building your own custom operators and controllers from scratch. You'll go beyond simply using Kubernetes and start treating it as a Software Development Kit (SDK). You will learn how to build a real-world operator that manages AWS EC2 instances directly from Kubernetes, covering everything from the internal architecture of Informers and Caches to advanced concepts like Finalizers and Idempotency.

💻 Code & Resources: https://github.com/shkatara/kubernetes-ec2-operator
Shubham: https://www.linkedin.com/in/shubhamkatara/
Saiyam: https://www.linkedin.com/in/saiyampathak
Kubesimplify: https://www.youtube.com/@kubesimplify

Course Curriculum & Timestamps

Part 1: The Theory of Controllers
- 0:00:00 Introduction & Prerequisites
- 0:01:55 What is a Controller? (The Observe-Compare-Act Loop)
- 0:06:45 Idempotency in Controllers
- 0:07:55 Deep Dive: The Reconcile Loop (Happy Path, Sad Path, & Error Handling)
- 0:19:45 The Foundation of Writing Operators
- 0:23:05 What is an Operator? (The "Helper" Analogy)
- 0:27:35 CRDs (Custom Resource Definitions) and CRs (Custom Resources)

Part 2: Kubernetes Extensibility
- 0:31:35 Kubernetes as an SDK & Extensibility
- 0:34:00 Networking, Storage, & Admission Controllers
- 0:35:20 Internal Developer Platforms (IDP) & Platform Engineering
- 0:39:50 Bootstrapping with Kubebuilder

Part 3: Setting Up the Environment
- 0:41:05 Setting up the Local Environment (K3D, Docker)
- 0:52:15 Introduction to the Kubebuilder Framework
- 0:56:35 Project Initialization (kubebuilder init)
- 1:00:30 Exploring Scaffolding (Makefiles, Dockerfiles, main.go)

Part 4: Building the API & Logic
- 1:04:15 Creating your first API (kubebuilder create api)
- 1:06:45 Defining EC2 Instance Types & Specs in Go
- 1:13:05 Understanding TypeMeta and ObjectMeta
- 1:21:45 Internal Controller Logic Breakdown
- 1:24:05 Deep Dive: Manager Architecture & Controller-Runtime
- 1:31:05 Cert Watchers, Health Checks, & Prometheus Metrics
- 1:52:55 Initializing the Manager in main.go

Part 5: Hands-on Development
- 2:07:35 Implementing the Reconcile Loop Logic
- 2:22:35 Custom Resource Definitions (CRDs) in Action
- 2:46:45 Running the Operator Locally
- 3:01:45 AWS SDK Integration in Go
- 3:22:55 Using Finalizers for Cleanup Logic
- 3:36:55 Creating EC2 Instances on AWS via the Operator
- 3:53:20 Implementing Waiters for Instance State (Running/Terminated)

Part 6: Advanced Internals & Deployment
- 4:13:45 Idempotency & Reconciler Loop Internals
- 4:46:35 How Informers, Caches, and WorkQueues Work
- 5:11:20 Handling Object Deletion & Timestamps
- 5:32:05 Packaging the Operator with Helm
- 5:43:05 Deploying to Kubernetes (RBAC & Service Accounts)
- 6:16:20 Conclusion & Future Steps