alerts related to a persistent volume filling up, or this CPU overcommitment infra alert. Okay, now I want to investigate this alert, because I don't really know what it means at this point. A CPU overcommitment infra alert is something that's new to me as an engineer. As I mentioned before, I'm not as well-versed with Kubernetes as I ideally should be, but I'm learning on the job. So there are some things I'm not completely aware of, and it helps that I can interact with the agent and ask it a question as well.
So before it actually goes in and investigates, okay, let it finish the investigation, my question is: hey, can you explain this alert to me, what it means, and whether I should be concerned?
It seems like it has already identified an issue with a certain pod's resource requests and limits. From those commands, it's able to get to that. It's also given me some commands to run here, which it has run. While it's investigating... okay, it says it has completed the investigation of the CPU overcommitment. Here's what it has learned.
It lists the most CPU-intensive pods. Why did this alert fire? Okay, this cluster has 32% more CPU limits allocated than is physically available. What that means is there's a risk of resource starvation: if multiple pods spike to their limits, they'll compete for CPU, causing throttling. Oh yeah, that makes sense.
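To make that number concrete, here is a minimal sketch of how you could compute the same cluster-wide ratio yourself with the official Kubernetes Python client. This is not the agent's actual command (those aren't shown verbatim in the video), and it assumes a local kubeconfig with access to the cluster:

```python
from kubernetes import client, config

config.load_kube_config()  # assumes a local kubeconfig with cluster access
v1 = client.CoreV1Api()

def parse_cpu(quantity: str) -> float:
    """Convert Kubernetes CPU quantities like '500m' or '2' to cores."""
    return float(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)

# Total CPU the nodes can actually provide
allocatable = sum(parse_cpu(n.status.allocatable["cpu"]) for n in v1.list_node().items)

# Total CPU limits declared across all containers in all pods
limits = 0.0
for pod in v1.list_pod_for_all_namespaces().items:
    for c in pod.spec.containers:
        if c.resources and c.resources.limits and "cpu" in c.resources.limits:
            limits += parse_cpu(c.resources.limits["cpu"])

print(f"CPU limits: {limits:.1f} cores, allocatable: {allocatable:.1f} cores")
print(f"Commitment: {limits / allocatable:.0%} of capacity")  # ~132% would match the 32% overcommitment reported here
```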
Basically, I have multiple pods in the cluster, and there's a CPU overcommitment issue on certain nodes, actually on all nodes according to this: two of them at 5-6x and one at 2x. It's telling me I need to look into the pod that has an absurdly high limit. So one of these pods has a very high limit that I need to review and potentially fix (see the sketch after this paragraph), because the risk here is, say I have six or seven services in six or seven pods on one of my nodes, and some of those pods have higher CPU utilization, then there could be competition. Oh, and there are pod restarts happening as well, so it looks like that could be correlated.
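To track down that pod with the absurdly high limit, something like the following sketch (again using the Python client, and again only an approximation of what the agent ran) ranks pods by their declared CPU limits so the outlier stands out:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def parse_cpu(quantity: str) -> float:
    return float(quantity[:-1]) / 1000 if quantity.endswith("m") else float(quantity)

ranked = []
for pod in v1.list_pod_for_all_namespaces().items:
    total = sum(
        parse_cpu(c.resources.limits["cpu"])
        for c in pod.spec.containers
        if c.resources and c.resources.limits and "cpu" in c.resources.limits
    )
    if total:
        ranked.append((total, pod.metadata.namespace, pod.metadata.name))

# Highest declared CPU limits first; the pod to review should sit at the top
for total, namespace, name in sorted(ranked, reverse=True)[:10]:
    print(f"{total:6.1f} cores  {namespace}/{name}")
```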
It's also given me a dashboard here. Okay, now I think it also queried a certain dashboard, right? Yes, it actually queried this dashboard. Let me see which dashboard it queried: the Kubernetes compute resources cluster dashboard. It's actually the same link it just shared here, except that in this dashboard only the time range is different. Interesting. Here the time range is the time the alert fired, and this one is for the current time. So when I go to this dashboard, I can see that this is the dashboard it analyzed some of the metrics from and based its analysis on. It's actually analyzed quite a few metrics, and I can see that here.
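Dashboards like this one are typically driven by kube-state-metrics series in Prometheus, so a similar cluster-level commitment figure could be pulled straight from the Prometheus HTTP API. A rough sketch, where the endpoint URL is a placeholder and the metric names assume a standard kube-state-metrics setup:

```python
import requests

PROM_URL = "http://prometheus.example.com"  # placeholder; point at your Prometheus

# Ratio of declared CPU limits to allocatable CPU, cluster-wide
query = (
    'sum(kube_pod_container_resource_limits{resource="cpu"})'
    ' / sum(kube_node_status_allocatable{resource="cpu"})'
)
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=30)
resp.raise_for_status()

value = float(resp.json()["data"]["result"][0]["value"][1])
print(f"Cluster CPU limit commitment: {value:.0%}")  # anything over 100% is overcommitted
```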
Okay, so I think I have clarity here now. Cool, so I don't really need to ask it what this alert means; I understand what it means. Thanks for this. What it means is that I need to review the allocation. Hey Shvari, I need to review the allocation of, um, was it memory that you mentioned I need to review? Right, no, the allocation of cores to each of these pods. Got it. How should I go about assigning cores to them, and what distribution would you recommend?
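While it works on that, one common way to answer the "what distribution" question is to size requests and limits from observed usage rather than guesses. A hedged sketch, assuming Prometheus with cAdvisor container metrics is reachable at a placeholder URL and that the namespace name is illustrative:

```python
import requests

PROM_URL = "http://prometheus.example.com"  # placeholder Prometheus endpoint
NAMESPACE = "default"                       # illustrative namespace

# p95 of per-container CPU usage (in cores) over the last day
query = (
    'quantile_over_time(0.95, '
    f'rate(container_cpu_usage_seconds_total{{namespace="{NAMESPACE}", container!=""}}[5m])'
    '[1d:5m])'
)
resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": query}, timeout=30)

for series in resp.json()["data"]["result"]:
    pod = series["metric"].get("pod", "<unknown>")
    p95 = float(series["value"][1])
    # A simple heuristic, not a rule: request near observed p95, limit with ~50% headroom
    print(f"{pod}: p95 usage {p95:.2f} cores -> request ~= {p95:.2f}, limit ~= {p95 * 1.5:.2f}")
```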
Now this will help me, because when I share this investigation with my teammates I can also tell them that the core issue identified here is resource starvation: pods are competing for CPU, and it seems to be happening because of an uneven distribution. Let me just, is it still ongoing? I think we can come back to this in a minute. So yeah, let me just see if it has actually investigated this as well. Not yet. Okay.
Yes. Now, where is that URL again? If I go to this URL, I can see what the investigation has done so far. Uh-huh. Let me ask it again.
Thanks for this. And Shar, what we are doing here is: we've seen a Kubernetes alert related to CPU overcommitment in one of the clusters, and we're trying to understand it. The agent has helped me, since I'm not completely well-versed with Kubernetes. It went in and analyzed some Kubernetes metrics, like some of the Grafana dashboards related to Kubernetes, ran some commands in the cluster, and told me there's an uneven distribution where some of the pods have extremely high CPU limits set, which lets them hog the CPU, or potentially hog it in some cases. So I asked it what the best practices would be here, and it has told me how I could distribute resources across the different components or services so that I can do a better job of it.
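Once the team agrees on numbers, applying them comes down to updating each workload's resource requests and limits. A sketch of what that could look like with the Python client, using entirely hypothetical deployment, namespace, container, and resource values:

```python
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

# Hypothetical values; the container name must match the one in the Deployment spec
patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "api",
                        "resources": {
                            "requests": {"cpu": "500m", "memory": "512Mi"},
                            "limits": {"cpu": "1", "memory": "1Gi"},
                        },
                    }
                ]
            }
        }
    }
}

apps.patch_namespaced_deployment(name="api", namespace="payments", body=patch)
print("Updated resource requests/limits; pods will roll to pick up the change.")
```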
So now, when I send this to my team... wait, this is super helpful. Thank you, agent Droid. Thanks, agent. So I just now...
Deep dive into how an AI agent handles a real Kubernetes CPU overcommitment alert from start to finish. Watch the complete investigation process that would typically take 30+ minutes of manual work.

📊 What the AI agent does:
- Automatically identifies alert context and severity
- Queries Grafana dashboards and cluster metrics
- Runs kubectl commands to analyze pod resources
- Discovers dangerous resource allocation (32% overcommitment!)
- Explains why this creates risk for production stability
- Provides actionable remediation with specific CPU/memory recommendations

🎯 Key insights revealed:
- Some pods allocated 5-6x available CPU resources
- Risk of resource starvation during traffic spikes
- Best practices for Kubernetes resource management

Perfect for: DevOps engineers, SREs, platform teams, and anyone managing Kubernetes in production.

No more 2 AM debugging sessions. This is what intelligent incident response looks like.

#Kubernetes #DevOps #AIOps #IncidentResponse #SRE #ProductionDebugging #InfrastructureMonitoring