Did you know that if I woke you up and then asked you to work on a problem, like resolving an
IT system anomaly, it would take you on average something like 22 minutes to move from a sleeping
state to a cognitive state where you're actually productive? And with every minute of
downtime costing potentially thousands of dollars, that can be an expensive result of sleep inertia.
Fortunately, agentic AI can help with anomaly detection and with resolution. But before I get
to how, let's discuss where the magic bullet of "let's use AI" could actually lead you astray. So,
imagine it's 2:00 a.m. and an observability tool just detected a business-impacting incident.
Something pretty bad, like your authentication service rejecting 90% of user logins, or maybe a
super laggy payment gateway. So, now it's up to our sleepy SRE to come to the rescue. That's site
reliability engineer. And what they need to do is first of all identify what the specific
problem is. When they've done that, they need to figure out the cause of the problem. And
once they figured that out, then they need to come to a resolution. And on the face of it,
this is an area where AI is well suited to help because IT environments, they generate massive
volumes of telemetry data like logs and traces. And traditional incident response requires SREs
to manually sift through all of this noisy data and diagnose the probable root cause. So, lots of
data. We're looking for a needle in a haystack. Gosh, why not just send all of this telemetry data
into a large language model and then ask it to sift through everything and figure out the root
cause for us? Well, many LLMs have pretty huge context windows, but they're not bottomless pits,
and a single node cluster can crank out gigabytes of log lines per hour. So, if you pipe that fire
hose straight into the large language model and then ask it to come up with a cause, well,
welcome to hallucination city. Because LLMs, they rely on statistical patterns and if you overfeed
them unrelated noise, they will confidently fabricate causal links that just don't exist.
Because the LLM's goal is to predict plausible words rather than to verify facts, it will happily
stitch together all sorts of coincidences, like CPU blips and benign restarts and old warning logs,
into a neat but entirely imaginary narrative. So,
if we want to use AI in anomaly detection and resolution, this brute force dump it all into the
model kind of approach isn't going to get us very far. What we actually need is context curation.
And that's the first of several areas where agentic AI can help. So we have here a whole bunch
of collected data. We've got metrics and events. We've got logs. And we've got traces. That's
actually better known as MELT. And then we've got this thing that I mentioned already called context
curation, which is going to take a look at all of this MELT data. And instead of just dumping all
of that collected data straight into our AI model here, we're actually going to do an intermediary
step, which is we are going to strategically feed it only the signals that actually matter
for the incident at hand. And we're going to do this through topology-aware correlation. Now,
an observability platform maintains a real-time map of how all the services connect and depend on
each other, and it knows that an authentication service talks to a user database and
sits behind a load balancer, which connects to, well, a load of other stuff, and you get the
picture. But when an incident fires, the agent doesn't just grab random logs from everywhere.
It uses this dependency graph to pull in telemetry data only from the components that are
actually involved. So when that authentication service starts rejecting logins at 2 a.m., well,
the agent knows to check the user database it depends on and maybe the Redis cache it uses
for sessions and any recent deployments to those specific services. What it's not doing is wasting
time analyzing logs from a completely unrelated reporting microservice that just happens to be
running on the same cluster.
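As a rough illustration of what that topology-aware filtering looks like, here's a minimal Python sketch. The dependency map, service names, and log lines are all invented for the example; a real observability platform would supply the graph and the telemetry itself.

```python
from collections import deque

# Hypothetical dependency graph: service -> services it depends on.
# Names are invented for the example, not from any real platform.
DEPENDENCIES = {
    "auth-service": ["user-db", "redis-sessions"],
    "user-db": ["storage-volume"],
    "redis-sessions": [],
    "storage-volume": [],
    "reporting-service": ["reporting-db"],  # unrelated, same cluster
    "reporting-db": [],
}

def blast_radius(root: str, graph: dict[str, list[str]]) -> set[str]:
    """Walk the dependency graph from the failing service and collect every
    component that could plausibly be involved in the incident."""
    seen = {root}
    queue = deque([root])
    while queue:
        for dep in graph.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

def curate_context(incident_service: str, all_logs: dict[str, list[str]]) -> dict[str, list[str]]:
    """Keep only the telemetry for services in the incident's blast radius,
    instead of dumping every log line into the model."""
    scope = blast_radius(incident_service, DEPENDENCIES)
    return {svc: logs for svc, logs in all_logs.items() if svc in scope}

# The unrelated reporting microservice gets filtered out entirely.
logs = {
    "auth-service": ["ERROR 90% of logins rejected"],
    "user-db": ["WARN connection pool exhausted"],
    "reporting-service": ["INFO nightly report generated"],
}
print(curate_context("auth-service", logs))
```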
So, let's work through how an AI incident investigation agent might work using this contextually correlated data. Now, the process starts when an anomaly triggers
an alert. So, the incoming thing that starts this all off is an incident alert coming in.
And to be clear, this is a post-incident detection scenario here. We're not predicting when something
might go wrong. We're diagnosing what happened so that we can fix it. So, from the incident alert,
the agentic AI considers this curated context that we have been talking about. Now,
that is specific to the actual problem we're trying to solve. We've talked on this channel
before about how AI agents work. This basically works in a number of different phases. So,
we start by agents perceiving their environment. And once they've perceived their environment,
they can reason about the best next steps that they should take. Once they've reasoned,
they can actually act on that action plan that they've built and then they observe the results
of that action. And then round and round we go, back to perceive, in a feedback loop.
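Here's a minimal sketch of that perceive-reason-act-observe loop in Python. The `ToyInvestigator` and its stub phases are purely illustrative stand-ins for whatever perception, reasoning, and tooling a real agent platform would plug in.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    statement: str
    confident: bool = False
    evidence: list[str] = field(default_factory=list)

class ToyInvestigator:
    """Stand-in agent: every phase is a stub so the control flow runs end to end."""

    def perceive(self, incident):
        # Gather the curated context for the incident (stubbed).
        return {"incident": incident, "evidence": []}

    def reason(self, context):
        # Pretend we're confident once two pieces of evidence are in hand.
        confident = len(context["evidence"]) >= 2
        return Hypothesis("database connection failure", confident, list(context["evidence"]))

    def act(self, hypothesis):
        # In reality: fetch more telemetry, run a diagnostic, and so on.
        return f"fetched telemetry related to: {hypothesis.statement}"

    def observe(self, context, result):
        # Fold the new observation back into the working context.
        context["evidence"].append(result)
        return context

def run_loop(agent, incident, max_iterations=10):
    """Perceive -> reason -> act -> observe, round and round, until the agent
    is confident in a hypothesis or runs out of iterations."""
    context = agent.perceive(incident)
    hypothesis = agent.reason(context)
    for _ in range(max_iterations):
        if hypothesis.confident:
            break
        result = agent.act(hypothesis)
        context = agent.observe(context, result)
        hypothesis = agent.reason(context)
    return hypothesis

print(run_loop(ToyInvestigator(), "auth-service rejecting 90% of logins"))
```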
And that is what's happening here, where we're going to form a hypothesis as to what the actual problem
is. That's using causal AI to analyze those disparate MELT data sources. Now, the agent
will systematically request additional data and evidence needed to validate or refine its
hypothesis. So, we might have more curated context coming in here. So, for example,
if a web service is slow, well, the agent might go and fetch some related logs. Then
it might notice that there's an error connecting to the database. So that prompts it to retrieve
some database metrics and then it realizes the database was recently updated. So now it's going
to prompt a check of configuration changes, and so on and on we go.
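To make that evidence-gathering chain concrete, here's a small sketch with the telemetry responses hard-coded so it runs on its own. In practice each lookup would be a query against the observability platform, and the reasoning step would be the model, not a lookup table.

```python
# Toy "observability platform": each query returns a canned finding.
TELEMETRY = {
    "web-service logs": "error connecting to database",
    "database metrics": "database was recently updated",
    "configuration changes": "connection pool size lowered in last deploy",
}

# Which finding prompts which follow-up query: the agent's reasoning step,
# massively simplified into a lookup table for illustration.
NEXT_QUERY = {
    "web-service is slow": "web-service logs",
    "error connecting to database": "database metrics",
    "database was recently updated": "configuration changes",
}

def investigate(initial_symptom: str) -> list[str]:
    """Follow the chain of evidence until no follow-up is suggested,
    then return the trail that leads toward the probable root cause."""
    trail = [initial_symptom]
    finding = initial_symptom
    while finding in NEXT_QUERY:
        query = NEXT_QUERY[finding]
        finding = TELEMETRY[query]
        trail.append(f"{query} -> {finding}")
    return trail

for step in investigate("web-service is slow"):
    print(step)
```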
And ultimately, this leads to the big moment: the identification of the probable root cause. So this is what we
think was the problem all along. That's where the agent pinpoints the most likely root cause of the
incident. But it doesn't stop there. We also have explainability. Now, explainability actually puts
some weight behind this root cause because the agent's reasoning process, how it arrived at the
probable root cause, can be made transparent to the human operator, showing its chain of thought.
And the agent can also provide some supporting evidence as well for that probable root cause that
led to that conclusion. And together, the chain-of-thought reasoning and the supporting evidence
can be reviewed by an SRE to supervise and validate the agent's analysis.
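One way to picture that transparency is a structured finding that bundles the probable root cause with the chain of thought and the evidence behind it. The field names, services, timestamps, and numbers below are all invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class RootCauseFinding:
    """Hypothetical structure an agent could hand to the on-call SRE."""
    probable_root_cause: str
    confidence: float                                    # e.g. 0.0 - 1.0
    reasoning: list[str] = field(default_factory=list)   # chain of thought
    evidence: list[str] = field(default_factory=list)    # supporting data points

finding = RootCauseFinding(
    probable_root_cause="Connection pool misconfiguration introduced by last database update",
    confidence=0.85,
    reasoning=[
        "Auth service errors began at 02:03, right after the database update",
        "Database metrics show connection pool exhaustion from 02:03 onward",
        "Config diff shows pool size lowered from 200 to 20",
    ],
    evidence=[
        "auth-service log: 'could not obtain connection' x 4,812",
        "user-db metric: active_connections pinned at 20",
        "change record: db-config v2.14 deployed 02:01",
    ],
)
print(finding.probable_root_cause)
```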
So, we've figured out the probable root cause. Great. But the ultimate goal here is to actually resolve the incident, and
agentic AI can assist an SRE in four ways to do that. So one way that they can assist is in
validation. So rather than taking a probable root cause at face value, agentic AI can generate steps
to help an SRE validate that the identified root cause is actually correct. We still want
human input into a production system before remediation, so providing verification steps
is a step in the right direction. Now, upon validating that we're on the right path, an
agentic AI can then produce a step-by-step runbook. Now, that is kind of an action plan to fix the
issue. This runbook's essentially an ordered list of recommended remediation steps. So for example,
if the root cause is a full disk causing a database to crash, well, the runbook might first of
all archive old log files on the DB server to free space. Then it might restart the database service
and then it might monitor disk usage growth and configure an alert if it exceeds a certain
capacity. The idea is that the SRE can follow the script in this runbook here quickly and even if
they're not deeply familiar with that component, it will still guide them through what to do.
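To make the runbook idea concrete, here's the full-disk example rendered as an ordered, structured list in Python; the steps, thresholds, and verification checks are illustrative, and a real agent would tailor them to the environment.

```python
# A hypothetical runbook for the full-disk example, rendered as ordered steps.
# Actions, thresholds, and verification checks are invented for illustration.
RUNBOOK = [
    {
        "step": 1,
        "action": "Archive old log files on the DB server to free space",
        "verify": "df -h shows the data volume below 80% used",
    },
    {
        "step": 2,
        "action": "Restart the database service",
        "verify": "service reports healthy and accepts connections",
    },
    {
        "step": 3,
        "action": "Monitor disk usage growth and configure an alert at 85% capacity",
        "verify": "alert rule visible in the monitoring system",
    },
]

for item in RUNBOOK:
    print(f"{item['step']}. {item['action']} (verify: {item['verify']})")
```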
Agentic AI can also take these suggested actions and it can build automation scripts, or what we can
call workflows to help as well. So for example, the agent could turn each of these runbook steps
into a Bash script or an Ansible playbook snippet. And here the AI provides the exact command syntax
and the exact parameters for each step.
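A Bash script or Ansible playbook would be a natural fit here; to stay consistent with the earlier sketches, here's the same idea as a small Python workflow that shells out. The commands, paths, and service names are placeholders, and the dry-run default keeps the SRE in the loop before anything actually executes.

```python
import subprocess

# Illustrative commands for the full-disk runbook; paths and service names
# are placeholders and would come from the agent's curated context.
WORKFLOW = [
    # Step 1: remove log files older than 30 days to free space.
    ["find", "/var/log/db", "-name", "*.log", "-mtime", "+30", "-delete"],
    # Step 2: restart the database service.
    ["systemctl", "restart", "postgresql"],
    # Step 3: check remaining disk usage on the data volume.
    ["df", "-h", "/var/lib/postgresql"],
]

def run_workflow(commands, dry_run=True):
    """Execute each remediation command in order, stopping on the first failure.
    With dry_run=True (the default) the commands are only printed, so nothing
    touches production until a human approves."""
    for cmd in commands:
        print("would run:" if dry_run else "running:", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)

run_workflow(WORKFLOW)
```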
Now, another helpful byproduct of these AI agents is the automatic documentation that comes from all of this. So, after resolution, the AI can generate a
summary incident report, essentially writing the post-incident review for you. And agentic AI can
also document an ongoing summary of the incident in progress so that new people working on the
incident are brought up to speed.
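As one more sketch, assuming a structured finding like the one above, the post-incident summary can be little more than a template filled in from the investigation record; all inputs below are illustrative.

```python
def incident_report(service: str, root_cause: str, evidence: list[str], remediation: list[str]) -> str:
    """Assemble a plain-text post-incident summary from whatever the agent
    collected along the way."""
    lines = [
        f"Incident summary for {service}",
        f"Probable root cause: {root_cause}",
        "Supporting evidence:",
        *[f"  - {item}" for item in evidence],
        "Remediation steps taken:",
        *[f"  - {step}" for step in remediation],
    ]
    return "\n".join(lines)

print(incident_report(
    "auth-service",
    "full disk on the database server",
    ["disk usage at 100% on /var/lib/postgresql", "database crashed at 02:03"],
    ["archived old log files", "restarted the database", "added 85% disk usage alert"],
))
```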
So, agentic AI can help redefine how IT teams handle anomalies and outages. And these agents operate under human oversight. They're augmenting rather than
replacing human decision makers. And an SRE can verify the AI's findings and then focus energy on
the remediation steps, many of which the AI might have already helped set up. And all of this leads
to a substantial reduction in the all-important MTTR. That's mean time to repair. Not to mention
less operational stress and a bit less sleep inertia when those 2 a.m. system alerts come in.