Did you know that if I woke you up and then asked you to work on a problem, like resolving an
IT system anomaly, it would take you on average something like 22 minutes to move from a sleeping
state to a cognitive state where you're actually productive? And with every minute of
downtime costing potentially thousands of dollars, that can be an expensive result of sleep inertia.
Fortunately, agentic AI can help with anomaly detection and with resolution. But before I get
to how, let's discuss where the magic bullet of "let's use AI" could actually lead you astray. So,
imagine it's 2:00 a.m. and an observability tool just detected a business-impacting incident.
Something pretty bad, like your authentication service rejecting 90% of user logins, or maybe a
super laggy payment gateway. So, now it's up to our sleepy SRE to come to the rescue. That's site
reliability engineer. And what they need to do is first of all identify what the specific
problem is. When they've done that, they need to figure out the cause of the problem. And
once they figured that out, then they need to come to a resolution. And on the face of it,
this is an area where AI is well suited to help because IT environments, they generate massive
volumes of telemetry data like logs and traces. And traditional incident response requires SREs
to manually sift through all of this noisy data and diagnose the probable root cause. So, lots of
data. We're looking for a needle in a haystack. Gosh, why not just send all of this telemetry data
into a large language model and then ask it to sift through everything and figure out the root
cause for us? Well, many LLMs have pretty huge context windows, but they're not bottomless pits,
and a single node cluster can crank out gigabytes of log lines per hour. So, if you pipe that fire
hose straight into the large language model and then ask it to come up with a cause, well,
welcome to hallucination city. Because LLMs, they rely on statistical patterns and if you overfeed
them unrelated noise, they will confidently fabricate causal links that just don't exist.
Because the LLM's goal is to predict plausible words rather than to verify facts, it will happily
stitch together all sorts of coincidences, like CPU blips and benign restarts and old warning logs,
into a neat but entirely imaginary narrative. So,
if we want to use AI in anomaly detection and resolution, this brute force dump it all into the
model kind of approach isn't going to get us very far. What we actually need is context curation.
And that's the first of several areas where agentic AI can help. So we have here a whole bunch
of collected data. We've got metrics and events. We've got logs. And we've got traces. That's
actually better known as MELT. And then we've got this thing that I mentioned already called context
curation, which is going to take a look at all of this MELT data. And instead of just dumping all
of that collected data straight into our AI model here, we're actually going to do an intermediary
step, which is we are going to strategically feed it only the signals that actually matter
for the incident at hand. And we're going to do this through topology-aware correlation. Now,
an observability platform maintains a real-time map of how all the services connect and depend on
each other, and it knows that an authentication service talks to a user database and
sits behind a load balancer, which connects to, well, a load of other stuff, and you get the
picture. But when an incident fires, the agent doesn't just grab random logs from everywhere.
It uses this dependency graph to pull in telemetry data only from the components that are
actually involved. So when that authentication service starts rejecting logins at 2 a.m., well,
the agent knows to check the user database it depends on and maybe the Redis cache it uses
for sessions and any recent deployments to those specific services. What it's not doing is wasting
time analyzing logs from a completely unrelated reporting microservice that just happens to be
running on the same cluster.
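As a rough illustration of what that topology-aware filtering looks like, here's a minimal Python sketch. The dependency map, service names, and log lines are all invented for the example; a real observability platform would supply the graph and the telemetry itself.

```python
from collections import deque

# Hypothetical dependency graph: service -> services it depends on.
# Names are invented for the example, not from any real platform.
DEPENDENCIES = {
    "auth-service": ["user-db", "redis-sessions"],
    "user-db": ["storage-volume"],
    "redis-sessions": [],
    "storage-volume": [],
    "reporting-service": ["reporting-db"],  # unrelated, same cluster
    "reporting-db": [],
}

def blast_radius(root: str, graph: dict[str, list[str]]) -> set[str]:
    """Walk the dependency graph from the failing service and collect every
    component that could plausibly be involved in the incident."""
    seen = {root}
    queue = deque([root])
    while queue:
        for dep in graph.get(queue.popleft(), []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

def curate_context(incident_service: str, all_logs: dict[str, list[str]]) -> dict[str, list[str]]:
    """Keep only the telemetry for services in the incident's blast radius,
    instead of dumping every log line into the model."""
    scope = blast_radius(incident_service, DEPENDENCIES)
    return {svc: logs for svc, logs in all_logs.items() if svc in scope}

# The unrelated reporting microservice gets filtered out entirely.
logs = {
    "auth-service": ["ERROR 90% of logins rejected"],
    "user-db": ["WARN connection pool exhausted"],
    "reporting-service": ["INFO nightly report generated"],
}
print(curate_context("auth-service", logs))
```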
So, let's work through how an AI incident investigation agent might work using this contextually correlated data. Now, the process starts when an anomaly triggers
an alert. So, the incoming thing that starts this all off is an incident alert coming in.
And to be clear, this is a post-incident detection scenario here. We're not predicting when something
might go wrong. We're diagnosing what happened so that we can fix it. So, from the incident alert,
the agentic AI considers this curated context that we have been talking about. Now,
that is specific to the actual problem we're trying to solve. We've talked on this channel
before about how AI agents work. This basically works in a number of different phases. So,
we start by agents perceiving their environment. And once they've perceived their environment,
they can reason about the best next steps that they should take. Once they've reasoned,
they can actually act on that action plan that they've built and then they observe the results
of that action. And then round and round we go, back to perceive, in a feedback loop.
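Here's a minimal sketch of that perceive-reason-act-observe loop in Python. The `ToyInvestigator` and its stub phases are purely illustrative stand-ins for whatever perception, reasoning, and tooling a real agent platform would plug in.

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    statement: str
    confident: bool = False
    evidence: list[str] = field(default_factory=list)

class ToyInvestigator:
    """Stand-in agent: every phase is a stub so the control flow runs end to end."""

    def perceive(self, incident):
        # Gather the curated context for the incident (stubbed).
        return {"incident": incident, "evidence": []}

    def reason(self, context):
        # Pretend we're confident once two pieces of evidence are in hand.
        confident = len(context["evidence"]) >= 2
        return Hypothesis("database connection failure", confident, list(context["evidence"]))

    def act(self, hypothesis):
        # In reality: fetch more telemetry, run a diagnostic, and so on.
        return f"fetched telemetry related to: {hypothesis.statement}"

    def observe(self, context, result):
        # Fold the new observation back into the working context.
        context["evidence"].append(result)
        return context

def run_loop(agent, incident, max_iterations=10):
    """Perceive -> reason -> act -> observe, round and round, until the agent
    is confident in a hypothesis or runs out of iterations."""
    context = agent.perceive(incident)
    hypothesis = agent.reason(context)
    for _ in range(max_iterations):
        if hypothesis.confident:
            break
        result = agent.act(hypothesis)
        context = agent.observe(context, result)
        hypothesis = agent.reason(context)
    return hypothesis

print(run_loop(ToyInvestigator(), "auth-service rejecting 90% of logins"))
```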
And that is what's happening here, where we're going to form a hypothesis as to what the actual problem
is. That's using causal AI to analyze those disparate MELT data sources. Now, the agent
will systematically request additional data and evidence needed to validate or refine its
hypothesis. So, we might have more curated context coming in here. So, for example,
if a web service is slow, well, the agent might go and fetch some related logs. Then
it might notice that there's an error connecting to the database. So that prompts it to retrieve
some database metrics and then it realizes the database was recently updated. So now it's going
to prompt a check of configuration changes, and so on and on we go.
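To make that evidence-gathering chain concrete, here's a small sketch with the telemetry responses hard-coded so it runs on its own. In practice each lookup would be a query against the observability platform, and the reasoning step would be the model, not a lookup table.

```python
# Toy "observability platform": each query returns a canned finding.
TELEMETRY = {
    "web-service logs": "error connecting to database",
    "database metrics": "database was recently updated",
    "configuration changes": "connection pool size lowered in last deploy",
}

# Which finding prompts which follow-up query: the agent's reasoning step,
# massively simplified into a lookup table for illustration.
NEXT_QUERY = {
    "web-service is slow": "web-service logs",
    "error connecting to database": "database metrics",
    "database was recently updated": "configuration changes",
}

def investigate(initial_symptom: str) -> list[str]:
    """Follow the chain of evidence until no follow-up is suggested,
    then return the trail that leads toward the probable root cause."""
    trail = [initial_symptom]
    finding = initial_symptom
    while finding in NEXT_QUERY:
        query = NEXT_QUERY[finding]
        finding = TELEMETRY[query]
        trail.append(f"{query} -> {finding}")
    return trail

for step in investigate("web-service is slow"):
    print(step)
```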
And ultimately, this leads to the big moment: the identification of the probable root cause. So this is what we
think was the problem all along. That's where the agent pinpoints the most likely root cause of the
incident. But it doesn't stop there. We also have explainability. Now, explainability actually puts
some weight behind this root cause because the agent's reasoning process, how it arrived at the
probable root cause, can be made transparent to the human operator, showing its chain of thought.
And the agent can also provide some supporting evidence as well for that probable root cause that
led to that conclusion. And together, the chain-of-thought reasoning and the supporting evidence
can be reviewed by an SRE to supervise and validate the agent's analysis.
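One way to picture that transparency is a structured finding that bundles the probable root cause with the chain of thought and the evidence behind it. The field names, services, timestamps, and numbers below are all invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class RootCauseFinding:
    """Hypothetical structure an agent could hand to the on-call SRE."""
    probable_root_cause: str
    confidence: float                                    # e.g. 0.0 - 1.0
    reasoning: list[str] = field(default_factory=list)   # chain of thought
    evidence: list[str] = field(default_factory=list)    # supporting data points

finding = RootCauseFinding(
    probable_root_cause="Connection pool misconfiguration introduced by last database update",
    confidence=0.85,
    reasoning=[
        "Auth service errors began at 02:03, right after the database update",
        "Database metrics show connection pool exhaustion from 02:03 onward",
        "Config diff shows pool size lowered from 200 to 20",
    ],
    evidence=[
        "auth-service log: 'could not obtain connection' x 4,812",
        "user-db metric: active_connections pinned at 20",
        "change record: db-config v2.14 deployed 02:01",
    ],
)
print(finding.probable_root_cause)
```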
So, we've figured out the probable root cause. Great. But the ultimate goal here is to actually resolve the incident, and
agentic AI can assist an SRE in four ways to do that. So one way that they can assist is in
validation. So rather than taking a probable root cause at face value, agentic AI can generate steps
to help an SRE validate that the identified root cause is actually correct. We still want
human input into a production system before remediation, so providing verification steps
is a step in the right direction. Now, upon validating that we're on the right path, an
agentic AI can then produce a step-by-step runbook. Now, that is kind of an action plan to fix the
issue. This runbook's essentially an ordered list of recommended remediation steps. So for example,
if the root cause is a full disk causing a database to crash, well, the runbook might first of
all archive old log files on the DB server to free space. Then it might restart the database service
and then it might monitor disk usage growth and configure an alert if it exceeds a certain
capacity. The idea is that the SRE can follow the script in this runbook here quickly and even if
they're not deeply familiar with that component, it will still guide them through what to do.
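To make the runbook idea concrete, here's the full-disk example rendered as an ordered, structured list in Python; the steps, thresholds, and verification checks are illustrative, and a real agent would tailor them to the environment.

```python
# A hypothetical runbook for the full-disk example, rendered as ordered steps.
# Actions, thresholds, and verification checks are invented for illustration.
RUNBOOK = [
    {
        "step": 1,
        "action": "Archive old log files on the DB server to free space",
        "verify": "df -h shows the data volume below 80% used",
    },
    {
        "step": 2,
        "action": "Restart the database service",
        "verify": "service reports healthy and accepts connections",
    },
    {
        "step": 3,
        "action": "Monitor disk usage growth and configure an alert at 85% capacity",
        "verify": "alert rule visible in the monitoring system",
    },
]

for item in RUNBOOK:
    print(f"{item['step']}. {item['action']} (verify: {item['verify']})")
```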
Agentic AI can also take these suggested actions and it can build automation scripts, or what we can
call workflows to help as well. So for example, the agent could turn each of these runbook steps
into a Bash script or an Ansible playbook snippet. And here the AI provides the exact command syntax
and the exact parameters for each step.
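A Bash script or Ansible playbook would be a natural fit here; to stay consistent with the earlier sketches, here's the same idea as a small Python workflow that shells out. The commands, paths, and service names are placeholders, and the dry-run default keeps the SRE in the loop before anything actually executes.

```python
import subprocess

# Illustrative commands for the full-disk runbook; paths and service names
# are placeholders and would come from the agent's curated context.
WORKFLOW = [
    # Step 1: remove log files older than 30 days to free space.
    ["find", "/var/log/db", "-name", "*.log", "-mtime", "+30", "-delete"],
    # Step 2: restart the database service.
    ["systemctl", "restart", "postgresql"],
    # Step 3: check remaining disk usage on the data volume.
    ["df", "-h", "/var/lib/postgresql"],
]

def run_workflow(commands, dry_run=True):
    """Execute each remediation command in order, stopping on the first failure.
    With dry_run=True (the default) the commands are only printed, so nothing
    touches production until a human approves."""
    for cmd in commands:
        print("would run:" if dry_run else "running:", " ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)

run_workflow(WORKFLOW)
```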
Now, another helpful byproduct of these AI agents is the automatic documentation that comes from all of this. So, after resolution, the AI can generate a
summary incident report, essentially writing the post-incident review for you. And agentic AI can
also document an ongoing summary of the incident in progress so that new people working on the
incident are brought up to speed.
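As one more sketch, assuming a structured finding like the one above, the post-incident summary can be little more than a template filled in from the investigation record; all inputs below are illustrative.

```python
def incident_report(service: str, root_cause: str, evidence: list[str], remediation: list[str]) -> str:
    """Assemble a plain-text post-incident summary from whatever the agent
    collected along the way."""
    lines = [
        f"Incident summary for {service}",
        f"Probable root cause: {root_cause}",
        "Supporting evidence:",
        *[f"  - {item}" for item in evidence],
        "Remediation steps taken:",
        *[f"  - {step}" for step in remediation],
    ]
    return "\n".join(lines)

print(incident_report(
    "auth-service",
    "full disk on the database server",
    ["disk usage at 100% on /var/lib/postgresql", "database crashed at 02:03"],
    ["archived old log files", "restarted the database", "added 85% disk usage alert"],
))
```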
So, agentic AI can help redefine how IT teams handle anomalies and outages. And these agents operate under human oversight. They're augmenting rather than
replacing human decision makers. And an SRE can verify the AI's findings and then focus energy on
the remediation steps, many of which the AI might have already helped set up. And all of this leads
to a substantial reduction in the all-important MTTR. That's mean time to repair. Not to mention
less operational stress and a bit less sleep inertia when those 2 a.m. system alerts come in.