You're a software engineer, right? Maybe
you're doing DevOps, SRE, platform
engineering, or infrastructure work.
You're using large language models, or
at least you should be. But which ones?
How do you know which model to pick?
Now, I was in the same situation. I made
choices based on gut feelings or
benchmark scores that didn't mean
anything in production or marketing
claims. I thought that I should change
that. So I ran 10 models from Google,
Anthropic, OpenAI, xAI, DeepSeek, and Mistral
through real agent workflows: Kubernetes
operations, cluster analysis, policy
generation, and systematic troubleshooting.
Production scenarios with actual timeout
constraints. And the results were
shocking, at least when compared to what
benchmarks and marketing promised. 70%
of models couldn't finish their work in
reasonable time. A model that cost $120
per million output tokens failed more
evaluations than it passed. Premium
reasoning models timed out on tasks that
cheaper models handled easily. Models
everyone's talking about couldn't
deliver reliable results. And the
cheapest model, it delivered better
value than options costing 20 times
more. By the end of this video, you will
know exactly which models actually work
for engineering and operations tasks,
which ones are unreliable, which ones
burn your money without delivering
results, and which ones can't do what
they're supposed to do.
In this video, I'm comparing large
language models. So, LLMs from Google,
Anthropic, OpenAI, xAI, DeepSeek, and
Mistral. Some of these companies have
multiple models in the comparison. So
we'll see how the different versions
stack up against each other. Now, you
might be wondering, hey, why am I
creating my own comparison instead of
just relying on existing benchmarks? The
primary reason is simple. I need to know
which models work best for the agents I
am building. But there's a secondary
reason that's probably more important. I
want to truly know which model performs
better and under which conditions. This
cannot be based on my feelings or
personal experience. It needs to be
based on data. We are engineers, damn
it. We're supposed to make decisions
based on data, not how we feel. Well,
some of us, at least. I also don't trust
the standard benchmarks because models
are often trained on them to game the
results. So how am I doing this? I have
a set of tests that validate whether my
agents work correctly. I run those same
tests for each of the models. Since
these are actual tests, I know whether
the output is acceptable or not. When
tests fail, I analyze what's wrong and
add that information to the data sets.
Those data sets include duration, input
and output tokens used, the prompts
themselves, error responses, pass and fail
statuses, and so on and so forth. All
the data is then injected into prompts
that run the actual evaluations and
provide scoring. Now, this approach is
slightly different from standard AI
evaluations. Most evals have to figure
out both whether something worked and
how well it worked. Mine doesn't. The
functional tests already determined if
the output is acceptable. The evals are
scoring how well it worked with full
context about what passed or failed and
why. It's a more grounded approach
because the scoring is based on
real-world agent performance, not trying
to guess if something is correct. Now
all the models were compared using the
same agent based on the Vercel AI SDK,
or however that's pronounced. That means I'm
ignoring the differences in performance
you would get from specific agents like
Claude Code. Those differences can be
huge, but that's not what I'm testing
here though. This comparison is about
the models themselves. Now, as for what
I'm actually testing, well, I'm not
measuring code generation directly since
that's more subjective. Instead, all the
tests are based on Kubernetes, which is
well understood by all these models. I
expect the results would be similar for
any type of development task, but these
specific results are fine-tuned for
DevOps, ops, and SRE type of work. Now,
let's break down exactly what criteria I
used to evaluate those models and what
scenarios they were tested against.
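To make that setup concrete, here's a minimal sketch of the kind of harness I just described: run the same functional test against each model, record duration, tokens, and pass or fail status, and compute a rough cost per run. The names, types, and prices are illustrative, not the actual dot-ai evaluation code, and runAgent is a stand-in for whatever invokes the Vercel AI SDK-based agent.

```typescript
// Minimal sketch of the evaluation harness described above (illustrative only).

type ModelSpec = {
  name: string;
  inputPricePerMTok: number;  // USD per million input tokens
  outputPricePerMTok: number; // USD per million output tokens
};

type RunResult = {
  model: string;
  scenario: string;
  passed: boolean;
  durationMs: number;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
  failureNotes?: string;
};

// Assumed shape: the agent returns its output plus token usage.
type AgentRunner = (
  model: string,
  prompt: string,
) => Promise<{ output: string; inputTokens: number; outputTokens: number }>;

async function evaluateScenario(
  models: ModelSpec[],
  scenario: { name: string; prompt: string; check: (output: string) => boolean },
  runAgent: AgentRunner,
): Promise<RunResult[]> {
  const results: RunResult[] = [];
  for (const model of models) {
    const start = Date.now();
    const { output, inputTokens, outputTokens } = await runAgent(model.name, scenario.prompt);
    const passed = scenario.check(output); // the functional test decides pass/fail, not an LLM judge
    results.push({
      model: model.name,
      scenario: scenario.name,
      passed,
      durationMs: Date.now() - start,
      inputTokens,
      outputTokens,
      // Rough cost of this run, e.g. $0.20/M input plus $0.50/M output.
      costUsd:
        (inputTokens / 1e6) * model.inputPricePerMTok +
        (outputTokens / 1e6) * model.outputPricePerMTok,
      failureNotes: passed ? undefined : "failure analyzed manually and added to the dataset",
    });
  }
  return results; // these records are what gets injected into the scoring prompts
}
```

The point is that pass or fail comes from deterministic checks; the LLM-based scoring only grades how well a run went, with those records already in its context.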
Now, I believe understanding the
criteria and scenarios is important for
making sense of the results. But hey, if
you're anxious to see which models won
and lost, feel free to skip ahead to the
next section with the results. I won't
be offended. I promise. So, I evaluated
these 10 models across five key
dimensions. First is overall
performance. Basically, quality scores
measuring how well each model handles
different types of agent instructions.
Second is reliability. Can the model
actually complete evaluation sessions
without crapping out? This is measured
by participation rate and completion
success. Third is consistency.
How predictable is the model's
performance across different tools? You
do not want a model that excels in one
area but completely fails in another.
Fourth is cost-performance value: the
quality score relative to pricing per
million tokens. Raw performance doesn't
mean much if it costs 50 times more than
alternatives. And the last one is
context window efficiency. How well
models handle large context loads. Some
scenarios send over 200,000 tokens to
the models. Having a massive context
window doesn't guarantee good
performance if the model cannot actually
use it effectively. Now, timeout
constraints are critical here. If a
model cannot deliver results in
reasonable time, it's not useful in
production. Real world agent workflows
have time budgets like 5 minutes for
quick pattern creation, 45 minutes for
comprehensive cluster analysis. When I
say an evaluation failed, I mean the
timeout was exceeded. It isn't about
models working indefinitely. It's about
delivering what we need in reasonable
periods of time. So let's look at the
five evaluation scenarios. Each one tests
different aspects of what makes AI
agents actually useful in production
environments. First is capability
analysis. This is basically an endurance
test that puts a model through about 100
consecutive AI interactions over a
45-minute period. The goal is to discover
and analyze every single Kubernetes
resource in the cluster. What makes this
challenging is maintaining quality
throughout those marathon sessions while
demonstrating deep Kubernetes knowledge
for each resource type. The question
we're asking is, can the model sustain
performance without degrading over
extended workflows? This matters because
in production you need models that don't
get sloppy or confused after dozens of
interactions. Interestingly,
70% of the models we tested
failed at this completely. Second is
pattern recognition. This evaluates how
well models handle multi-step
interactive workflows within a tight
five-minute timeout. The workflow goes
something like this: expand trigger
keywords into comprehensive lists,
abstract those specific requirements
into reusable organizational patterns
and create templates that get stored in
a vector database. Think of it as
capturing your team's development best
practices so they can be automatically
applied to future deployments. The
challenge here is speed combined with
abstraction. Transforming specific
requirements into general patterns
quickly. What we're really testing is
how well the model handles rapid
interactive refinement workflows. This
matters because in production you often
need quick answers for pattern matching,
not deep philosophical analysis. The
third is policy compliance. This tests
whether models can perform schema
analysis of cluster resources within a
15-minute timeout. The task involves
expanding policy triggers into
comprehensive resource categories and
generating complete syntactically
correct Kybero policies with cell
validation expressions. This is about
proactive governance integrating
security and compliance requirements
directly into AI recommendations rather
than blocking manifests after they're
created. The challenge is in time
pressure. Comprehensive schema analysis
cannot be rushed, but you also cannot
take forever. The question is, can the
model balance depth of analysis with
time constraints?
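Since every scenario runs under a hard time budget (5, 15, 20, or 45 minutes depending on the scenario), a failed evaluation usually just means that budget ran out. Here's a minimal sketch of how such a budget can be enforced; generatePolicies is a hypothetical stand-in for the actual agent call, and the real harness may do this differently.

```typescript
// Illustrative time-budget wrapper: a run that exceeds its budget counts as a failure,
// even if the model would eventually have produced a good answer.

class TimeBudgetExceeded extends Error {}

async function withTimeBudget<T>(
  budgetMs: number,
  run: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), budgetMs);
  try {
    return await Promise.race([
      run(controller.signal),
      new Promise<never>((_, reject) =>
        controller.signal.addEventListener("abort", () =>
          reject(new TimeBudgetExceeded(`exceeded ${budgetMs} ms budget`)),
        ),
      ),
    ]);
  } finally {
    clearTimeout(timer); // don't leave the timer running after an early finish
  }
}

// Usage (hypothetical): the policy compliance scenario gets a 15-minute budget.
// await withTimeBudget(15 * 60 * 1000, (signal) => generatePolicies(prompt, { signal }));
```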
30% of models completely failed this
scenario because they didn't have large
enough context windows to handle the
schema complexity. Fourth is
recommendations. This is manifest
generation under extreme context
pressure within a 20 minute timeout. The
context load here is brutal. Up to 50
large Kubernetes resource schemas
totaling over 100,000 to 200,000 tokens. The
process involves transforming user
intent into production-ready YAML
manifests through targeted clarification
questions. What makes this different
from generic deployment tutorials is
that it needs to understand your
specific cluster's capabilities, your
organization's patterns, and your
governance policies. The challenge is
processing this massive schema context
while still being accurate. The question
we're testing is, hey, how efficiently
does the model actually utilize context
windows for complex generation? This is
where we learned that having a massive
context window doesn't mean the model
can use it effectively. 50% of models
failed at manifest generation given
the timeout constraints. Fifth and final
is remediation. This one is different
from the previous scenarios. The earlier
four were one-shot interactions. Hey, send
a prompt, get the response. Remediation
runs as an investigation loop. The model
receives the issue description along
with a list of tools it can use. It then
decides which kubectl commands to
run to gather data. The agent executes
those commands and adds the output back
into the context. The model analyzes the
new information and decides whether to
request more tool executions or conclude
with, hey, here's the problem and here's
how to solve it. The model in this
scenario is completely free in how it
investigates. It could decide it has
enough information without executing any
tools at all. Or it could loop up to 30
iterations, gathering more and more
data. In practice, successful
investigations typically run five to
eight cycles, all within timeout
constraints for that specific scenario.
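In code, that investigation loop has roughly this shape. This is a sketch, not the real implementation: askModel and runKubectl are assumed helpers, and the real agent constrains which kubectl commands the model may request.

```typescript
// Rough shape of the remediation investigation loop (illustrative sketch).
// The model keeps requesting kubectl commands until it's confident about the root cause.

type ModelTurn =
  | { kind: "run_tools"; commands: string[] } // e.g. ["kubectl describe pod my-app"]
  | { kind: "conclude"; rootCause: string; remediation: string; risk: string };

async function investigate(
  issue: string,
  askModel: (context: string[]) => Promise<ModelTurn>, // sends accumulated context to the LLM
  runKubectl: (command: string) => Promise<string>,    // executes a read-only kubectl command
  maxIterations = 30,
) {
  const context: string[] = [`Issue reported: ${issue}`];

  for (let i = 0; i < maxIterations; i++) {
    const turn = await askModel(context);

    if (turn.kind === "conclude") {
      // Successful investigations typically land here after five to eight cycles.
      return turn;
    }

    // Execute the requested commands and feed their output back into the context.
    for (const command of turn.commands) {
      const output = await runKubectl(command);
      context.push(`$ ${command}\n${output}`);
    }
  }

  throw new Error("investigation did not converge within the iteration limit");
}
```

Whether the model concludes after zero tool calls or loops thirty times is entirely its decision; the harness only enforces the iteration cap and the scenario's time budget.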
What makes this challenging is that the
model must conduct intelligent
systematic investigation. Kubernetes
failures present symptoms, not causes.
For example, a pod that will not start
might actually be failing because of a
missing persistent volume claim or a
network policy blocking traffic. The
model needs to decide which kubectl
commands to run next based on what it
learned from previous commands, maintain
investigation context throughout all
those iterations, understand cross-resource
dependencies, and know when it has enough
information to provide the actual root
cause and remediation steps with proper
risk assessment. The performance variance
here was extreme. We saw a 40 times
difference between the fastest and the
slowest models ranging from 2.5 seconds
to over 22 minutes for the same
investigation. So those are five
scenarios testing different aspects of
what makes AI agents useful in
production. Endurance, speed, attention
to detail, context handling, and
systematic investigation. Now let's see
how these 10 models actually perform.
Let's start with what honestly shocked
me the most. Seven out of 10 models,
that's 70%, couldn't complete the
capability analysis within 45 minutes.
They just ran out of time trying to work
through those 100 consecutive
interactions. Half of the models
exceeded timeout constraints for
manifest generation. Three out of 10
failed at policy compliance because they
couldn't finish the work within the
15-minute window. Now, this is important to
understand. When I say a model failed, I
don't mean it crashed or threw errors.
These are timeout failures. The models
were still working, still trying to
complete the task, but they couldn't
deliver results within reasonable
production time frames. And that's what
matters in real deployments. If your
agent takes 2 hours to analyze a cluster
when you need answers in 45 minutes or
45 seconds, that model isn't useful to
you, no matter how good its eventual
output might be. But some models had
truly truly disastrous reliability
issues that go beyond just being slow.
GPT-5 Pro was the real shocker here. This
model couldn't complete nearly half of
all evaluations. Only a 52%
participation rate. It exceeded timeouts
so often that it failed more tests than
it passed. Pattern recognition? Zero.
Couldn't finish within the time budget.
Recommendations? Also zero. Nothing. GPT-5
Pro is supposed to be the advanced
version of GPT-5. It is positioned as the
model for complex reasoning tasks. And
here's the thing, these aren't
impossibly hard tasks. We are talking
about Kubernetes operations that
production agents need to handle. It's
like having a brilliant mathematician
who might excel at complicated algebra
but takes hours to answer, hey, what's 2
+ 2? Being brilliant doesn't mean much
if you cannot answer straightforward
questions relatively quickly. Then
there's Mistral Large. Now, I expected
Mistral might struggle. I did. It's
generally not considered among the
absolute top-tier models. I included it
anyway because, hell, we need to support
European AI development as well. Right?
But the results were rough. This model
couldn't finish remediation
investigations at all. Zero score. It
would start the investigation loop but
couldn't complete the systematic
troubleshooting process within the time
constraints. Only a 65% participation rate
overall, meaning it failed to complete a
third of all evaluations. Now, let's
talk about cost because this is where
things get interesting. The cheapest
model isn't always the best value and
the most expensive definitely isn't
either. For example, Grok 4 Fast
Reasoning absolutely dominates the
cost-performance category. It accomplished a
value score more than three and a half
times better than the next competitor at
35 cents per million tokens. That's 20
cents for input and 50 cents for output.
It's the cheapest model in this
evaluation. But here's the catch.
Remember that policy compliance test I
mentioned earlier? Grok 4 Fast
Reasoning couldn't complete it within
the 15-minute window. It scored around
40%, which is catastrophically
low. This makes it production-dangerous
for policy generation workflows. So
yeah, it's cheap and fast for most
tasks, but you absolutely cannot use it
for generating, let's say, governance
policies. On the other hand, Gemini 2.5
Flash offers a more balanced value
proposition. At 30 cents for input and
$2.50 for output per million tokens, it
delivers a solid 78% overall performance
with no critical failures. It's about
twice as cheap as some premium options
while still being production ready
across all scenarios. Now, about those
premium models, some of them cost
anywhere from 8 to 20 times more than
budget options. The question is, do they
deliver proportional value? We'll see
that one later. And then there's GPT-5
Pro. Remember that model that couldn't
complete half of its evaluations? The
one with zero scores in pattern
recognition and recommendations? That one
costs almost $68
per million tokens on average: $15 for input and
$120
for output. It's so ridiculously
expensive compared to everything else
that it actually screwed up my cost
versus quality graph that you see on the
screen. Now, looking at tool-specific
performance reveals something
interesting. There's no universal winner
that dominates everything. Instead, we
see specialization. Some models excel at
certain tasks but struggle with others.
In capability analysis, the performance
spread is massive from 90% at the top
down to around 64% at the bottom.
There's a significant difference in how
well models handle these 100 consecutive
interactions over 45 minutes. Pattern
recognition shows even more dramatic
variance. Some models excel at this
rapid five-minute workflow while others,
remember GPT-5 Pro, yeah, scored an absolute
zero. Couldn't complete it at all.
They're not even in this graph. Policy
compliance ranges from 83% down to 40%.
That bottom score is Grok 4 Fast Reasoning
failing to complete the comprehensive
schema analysis within 15 minutes. For
recommendations, we saw a 42 times speed
difference between the fastest and
slowest models. Same task, same schemas,
widely different completion times. Some
models process those 100,000 plus tokens
efficiently, others choke on them. And
remediation shows the widest gap of all.
Complete failures at zero versus near
perfect scores at 95%. This systematic
investigation workflow really separates
the capable models from the ones that
just cannot handle iterative problem
solving. And here's where things get
really interesting. Context window size
matters, but efficiency matters even
more. The pattern is clear. Models with
larger context windows generally score
higher in recommendations. Makes sense
right? That scenario sends 100,000 or
more tokens of context in each
interaction, including up to 50 large
Kubernetes schemas. Bigger window means
more room to work, right? Well, not
really. Look at Claude Haiku. That one
completely breaks the pattern. This
model has only a 200,000-token context window,
yet it scored 93%, the highest score in
the entire recommendations category. It's
processing these massive context loads
with 50 plus schemas more effectively
than models with five times larger
windows.
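One way to put a number on that, and this is my illustrative framing rather than a metric from the report, is to compare how full a model's window actually was with the score it earned:

```typescript
// Illustrative context-efficiency check (not a metric from the actual report).
// It contrasts how full the window was with how well the model scored.

type ContextRun = {
  model: string;
  windowTokens: number;      // advertised context window
  contextSentTokens: number; // what the scenario actually sent (roughly 100k-200k here)
  score: number;             // 0-100 from the recommendations evaluation
};

function contextEfficiency(run: ContextRun) {
  return {
    model: run.model,
    // 1.0 means the window was completely full; small values mean lots of unused headroom.
    windowUtilization: run.contextSentTokens / run.windowTokens,
    // Score earned per 100k tokens of context the model had to digest.
    scorePer100kTokens: run.score / (run.contextSentTokens / 100_000),
  };
}

// Hypothetical numbers: a 200k-window model running nearly full can still outscore
// a 2M-window model that never finishes within the time budget.
console.log(contextEfficiency({ model: "small-window", windowTokens: 200_000, contextSentTokens: 180_000, score: 93 }));
console.log(contextEfficiency({ model: "huge-window", windowTokens: 2_000_000, contextSentTokens: 180_000, score: 0 }));
```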
That's not about having more space.
That's about using the space you have
more efficiently. Now, on the flip side,
Grok 4 Fast Reasoning has a massive,
massive 2-million-token context window
but could not complete tasks within the
timeout constraints. All that capacity
but it cannot process the information
fast enough to actually matter. The
lesson here is clear. Context size alone
doesn't guarantee performance. It's
about how efficiently the model utilizes
what it has. Now when I synthesize all
this data, the reliability issues, cost
trade-offs, tool-specific performance,
and context efficiency,
clear performance tiers emerged. Three
distinct categories separated by real
performance gaps. Let's start at the
bottom. The bottom tier includes models
scoring below 70% overall or below 80%
reliability.
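As a rough sketch, the tier cutoffs used throughout this section translate to something like this; the thresholds are approximations of what's described in the narration, not an exact formula from the report:

```typescript
// Approximate tier cutoffs as described in this section (sketch, not the report's exact formula).

type ModelScore = { name: string; overall: number; reliability: number }; // both 0-100

type Tier = "top" | "mid" | "bottom";

function tierOf({ overall, reliability }: ModelScore): Tier {
  if (overall < 70 || reliability < 80) return "bottom"; // critical failures or poor completion
  if (overall > 85) return "top";                         // in practice, also high consistency
  return "mid";                                           // dependable, but not the highest level
}

// Hypothetical example values:
// tierOf({ name: "example-a", overall: 87, reliability: 90 }) // "top"
// tierOf({ name: "example-b", overall: 78, reliability: 85 }) // "mid"
// tierOf({ name: "example-c", overall: 55, reliability: 52 }) // "bottom"
```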
Four models landed there. Now, two of
these I expected to struggle. Mistral
Large and DeepSeek Reasoner aren't generally
considered top-tier models, so their
results were not surprising. DeepSeek
struggled with pattern recognition and
endurance testing while Mistral couldn't
complete remediation investigations.
Not great, but not shocking either. Now
the real disappointments were GPT-5 Pro
and Grok 4 Fast Reasoning. These were
supposed to be competitive options, but
they failed in ways I didn't anticipate.
The primary issue: they take unreasonably
long to analyze tasks, consistently
hitting my timeout constraints. GPT-5 Pro
could not complete nearly half of all
evaluations. That catastrophic 52%
participation rate with zero scores in
pattern recognition and recommendations.
It's not that it got the answers wrong.
It's that it could not deliver answers
within specific time frames.
And Grok 4 Fast Reasoning, with fast
literally in its name, has that
production-dangerous 40% policy
compliance score because it could not,
and I repeat, could not complete the
comprehensive schema analysis within 15
minutes. The irony of calling it fast
when it consistently exceeded timeouts
is kind of funny. It's not lost on me.
The bottom line, these models have
critical failures that make them
unsuitable for production use or at
minimum you would need to be very very
careful about which scenarios you use
them for. Now, moving up to the mid tier,
we have four models that are actually
production ready. They score between 70
and 80% overall with at least 80%
reliability. This includes Gemini 2.5
Flash, our balanced value champion at 30
cents input and $2.50 output per million
tokens. GPT-5 without Pro, with its solid
consistency in pattern recognition.
Gemini 2.5 Pro with strong capability
analysis despite that weak remediation
score, and Grok 4 with good reliability
and remediation performance. These are
dependable models with some tool-specific
weaknesses but nothing
catastrophic. They will get the job done
just not at the absolute highest level.
And then we have the top tier. What's
interesting is that the data naturally
separated into those groups. There's a
real performance gap between the
mid-tier models clustering around 75 to 80%
and the top performers. The top-tier
models consistently score above 85%
overall with significantly higher
reliability and consistency scores.
They're not just a little better, they're
noticeably better across the board. Only
two models made it here. Claude Haiku,
which was released in mid-October,
very recently, scored 87% overall,
actually the highest overall score in
the entire evaluation. It leads in four
out of five tool categories: capability
analysis, pattern recognition,
recommendations, and remediation. This
is the model with only a 200,000-token context
window that somehow achieved the highest
recommendation score, 93%. Its
efficiency is remarkable and it costs $1
input and $5 for output per million
tokens. Claude Sonnet also scored 87% overall,
with slightly lower raw performance than
Haiku, but it has the highest
reliability and consistency scores in
this entire comparison. 98% reliability
and 97% consistency. It achieved 100%
participation rate across all
evaluations, never failed to complete a
test, never exceeded timeouts. When you
absolutely absolutely need your agent to
finish what it started, this is your
model, but that reliability comes at a
premium. $3 input and $15 output per
million tokens. That's three times more
expensive than Haiku. So, here's the
trade-off between these two top
performers. Haiku gives you the best raw
performance across most scenarios at a
reasonable price, but with around 90%
reliability. Sonnet gives you a near-perfect
98%. It will never fail you, but
at three times the cost and with
slightly lower performance in most
categories. Both are excellent choices.
The question is whether you're
optimizing for maximum performance or
maximum reliability. Okay, now I've
shown you the data, the failures, the
costs, the performance patterns, the
tiers. But data alone doesn't help you
make decisions. You need concrete
recommendations. Which model should you
actually use? Let's break this down from
worst to best so you know exactly what
to avoid and what to adopt.
Now, before we dive into the rankings,
the final summary, there is an important
note. This ranking is specific to ops,
DevOps, SRE, and software engineering
tasks. Models that scored low here might
excel at other types of work like
creative writing, general conversation
or completely different domains.
Similarly, models that scored high here
might not be the best choice for
everything. This comparison is focused
on agent-based technical workflows.
Keep that in mind as we go through the
rankings. 10th place, GPT-5 Pro. $15
per million input tokens, $120 per
million output tokens. You want me to
pay premium prices for a model that
cannot finish half the tests? This thing
failed more evaluations than it
completed. When it does manage somehow
to finish something, it takes forever.
You want us to pay those prices for
this? Hell no. Now, ninth place, Mistral
Large. This is marketed as Europe's
answer to OpenAI. Europe's shot at
independence.
It cannot even troubleshoot a Kubernetes
cluster. Zero in remediation. Failed a
third of all tests. Come on, Europe.
This is embarrassing. Eighth place, DeepSeek
Reasoner. The model everyone's
talking about, trained for pocket change.
OpenAI thinks they stole their data.
Great benchmark scores, and it cannot
handle real agent workflows. Below 70%
overall. Turns out you can game
benchmarks, but you cannot fake
production performance.
Seventh place, Grok 4 Fast Reasoning.
Oh, fast reasoning, except it's neither
fast nor good at reasoning. Exceeded
timeouts on complex analysis, 40% on
tasks requiring systematic thinking.
What's the point of putting reasoning in
your name if you cannot reason? Cheapest
model? That's true, at 20 cents input,
50 cents output. You get what you pay
for, I guess. These were the models you
should avoid. Now, let's look at the
middle of the pack models that actually
work but will not blow your mind. Sixth
place, Gemini 2.5 Pro. Mid-tier
performance. Nothing exciting, nothing
terrible. The model you choose when you
have no strong opinion. It works fine.
It will not impress anyone. Then at the
fifth place, Grok 4. Mid-tier.
Reliable, boring. It will not blow your
mind, but it will also not let you down.
It's Elon's safe option. Then, in
fourth place, GPT-5
without Pro, the normal version, not the
expensive disaster. Mid-tier,
consistent, predictable. It's like the
Toyota Camry of AI models. Gets the job
done without drama. And now we're
getting to the good stuff. The top three
models that actually deliver. In the
third place, Gemini 2.5 Flash. It's the
best value for money. 30 cents input,
$2.50 output, 78% overall performance with
no catastrophic failures. When your CFO
asks why you're spending money on AI,
show them this model. Good performance,
reasonable price, smart choice. Then, in
second place, Claude Sonnet. $3
input, $15 output, 87% overall, 98%
reliability, the highest in this entire
comparison. It never failed a single
test. When you absolutely cannot afford
your agent to crap out, this is your
model. Enterprise-grade reliability,
three times more expensive than Haiku.
Worth it if failures cost you money. And
now the winner, first place, Claude
Haiku. $1 input, $5 output, 87% overall,
tied with Sonnet, but wins on raw
performance. Leads in four out of five
categories. And here's the kicker. Only
200,000 context window, yet it achieved
the highest score for processing massive
context loads. That's not about having
more space. That's about using what you
have efficiently.
Best price performance ratio. This is
the model to beat. Use this. Use it,
unless price isn't a concern and
you can afford Sonnet's premium for
maximum reliability. So here's the
bottom line. Start with Claude Haiku for most
work. Switch to Sonnet when you need that
98% reliability and you can afford the
premium. Use Gemini 2.5 Flash when
budget matters and avoid the bottom tier
entirely unless you have very very
specific use cases where their
weaknesses do not matter. So here's the
question. Was this useful? Should I
repeat those evaluations every time a
new model is released? Which criteria do
you think is missing? What should I
include in future evaluations when new
models appear? Let me know in the
comments. The full report with all the
detailed data is available at the
address over there. Thank you for
watching. Hey, it's in the description
as well. Thank you for watching. See you
in the next one. Cheers.
A comprehensive, data-driven comparison of 10 leading large language models (LLMs) from Google, Anthropic, OpenAI, xAI, DeepSeek, and Mistral, specifically tested for DevOps, SRE, and platform engineering workflows. Instead of relying on traditional benchmarks or marketing claims, this evaluation runs real agent workflows through production scenarios: Kubernetes operations, cluster analysis, policy generation, manifest creation, and systematic troubleshooting, all with actual timeout constraints. The results reveal shocking gaps between benchmark promises and production reality: 70% of models couldn't complete tasks in reasonable timeframes, premium "reasoning" models failed on tasks cheaper alternatives handled easily, and the most expensive model ($120 per million output tokens) failed more tests than it passed.

The evaluation measures five key dimensions: overall performance quality, reliability and completion rates, consistency across different tasks, cost-performance value, and context window efficiency. Five distinct test scenarios push models through endurance tests (100+ consecutive interactions), rapid pattern recognition (5-minute workflows), comprehensive policy compliance analysis, extreme context pressure (100,000+ token loads), and systematic investigation loops requiring intelligent troubleshooting.

The rankings reveal clear performance tiers, with Claude Haiku emerging as the overall winner for its exceptional efficiency and price-performance ratio, while Claude Sonnet takes the reliability crown with 98% completion rates. The video provides specific recommendations on which models to use, which to avoid, and why cost doesn't always correlate with capability in production environments.

#LLMComparison #DevOps #AIforEngineers

Consider joining the channel: https://www.youtube.com/c/devopstoolkit/join

▬▬▬▬▬▬ 🔗 Additional Info 🔗 ▬▬▬▬▬▬
➡ Transcript and commands: https://devopstoolkit.live/ai/best-ai-models-for-devops--sre-real-world-agent-testing
🔗 DevOps AI Toolkit: https://github.com/vfarcic/dot-ai
🎬 Analysis report: https://github.com/vfarcic/dot-ai/blob/main/eval/analysis/platform/synthesis-report.md

▬▬▬▬▬▬ 💰 Sponsorships 💰 ▬▬▬▬▬▬
If you are interested in sponsoring this channel, please visit https://devopstoolkit.live/sponsor for more information. Alternatively, feel free to contact me over Twitter or LinkedIn (see below).

▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬
➡ BlueSky: https://vfarcic.bsky.social
➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/

▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬
🎤 Podcast: https://www.devopsparadox.com/
💬 Live streams: https://www.youtube.com/c/DevOpsParadox

▬▬▬▬▬▬ ⏱ Timecodes ⏱ ▬▬▬▬▬▬
00:00 Large Language Models (LLMs) Compared
01:54 How I Compare Large Language Models
05:01 LLM Evaluation Criteria and Test Scenarios
13:23 AI Model Benchmark Results
27:34 AI Model Rankings and Recommendations