You're a software engineer, right? Maybe
you're doing DevOps, SRE, platform
engineering, or infrastructure work.
You're using large language models, or
at least you should be. But which ones?
How do you know which model to pick?
Now, I was in the same situation. I made
choices based on gut feelings or
benchmark scores that didn't mean
anything in production or marketing
claims. I thought that I should change
that. So I ran 10 models from Google,
Anthropic, OpenAI, xAI, DeepSeek, and Mistral
through real agent workflows: Kubernetes
operations, cluster analysis, policy
generation, and systematic troubleshooting.
Production scenarios with actual timeout
constraints. And the results were
shocking, at least when compared to what
benchmarks and marketing promised. 70%
of models couldn't finish their work in
reasonable time. A model that cost $120
per million output tokens failed more
evaluations than it passed. Premium
reasoning models timed out on tasks that
cheaper models handled easily. Models
everyone's talking about couldn't
deliver reliable results. And the
cheapest model, it delivered better
value than options costing 20 times
more. By the end of this video, you will
know exactly which models actually work
for engineering and operations tasks,
which ones are unreliable, which ones
burn your money without delivering
results, and which ones can't do what
they're supposed to do.
In this video, I'm comparing large
language models. So, LLMs from Google,
Anthropic, OpenAI, xAI, DeepSeek, and
Mistral. Some of these companies have
multiple models in the comparison. So
we'll see how the different versions
stack up against each other. Now, you
might be wondering, hey, why am I
creating my own comparison instead of
just relying on existing benchmarks? The
primary reason is simple. I need to know
which models work best for the agents I
am building. But there's a secondary
reason that's probably more important. I
want to truly know which model performs
better and under which conditions. This
cannot be based on my feelings or
personal experience. It needs to be
based on data. We are engineers, damn
it. We're supposed to make decisions
based on data, not how we feel. Well,
some of us, at least. I also don't trust
the standard benchmarks because models
are often trained on them to game the
results. So how am I doing this? I have
a set of tests that validate whether my
agents work correctly. I run those same
tests for each of the models. Since
these are actual tests, I know whether
the output is acceptable or not. When
tests fail, I analyze what's wrong and
add that information to the data sets.
Those data sets include duration, input
and output tokens used, the prompts
themselves, error responses, pass and fail
statuses, and so on and so forth. All
the data is then injected into prompts
that run the actual evaluations and
provide scoring. Now, this approach is
slightly different from standard AI
evaluations. Most evals have to figure
out both whether something worked and
how well it worked. Mine doesn't. The
functional tests already determined if
the output is acceptable. The evals are
scoring how well it worked with full
context about what passed or failed and
why. It's a more grounded approach
because the scoring is based on
real-world agent performance, not trying
to guess if something is correct. Now
all the models were compared using the
same agent based on the Vercel AI SDK,
or however that's pronounced. That means I'm
ignoring the differences in performance
you would get from specific agents like
Claude Code. Those differences can be
huge, but that's not what I'm testing
here though. This comparison is about
the models themselves. Now, as for what
I'm actually testing, well, I'm not
measuring code generation directly since
that's more subjective. Instead, all the
tests are based on Kubernetes, which is
well understood by all these models. I
expect the results would be similar for
any type of development task, but these
specific results are fine-tuned for
DevOps, ops, and SRE type of work. Now,
let's break down exactly what criteria I
used to evaluate those models and what
scenarios they were tested against.
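To make that setup concrete, here's a minimal sketch of the kind of harness I just described: run the same functional test against each model, record duration, tokens, and pass or fail status, and compute a rough cost per run. The names, types, and prices are illustrative, not the actual dot-ai evaluation code, and runAgent is a stand-in for whatever invokes the Vercel AI SDK-based agent.

```typescript
// Minimal sketch of the evaluation harness described above (illustrative only).

type ModelSpec = {
  name: string;
  inputPricePerMTok: number;  // USD per million input tokens
  outputPricePerMTok: number; // USD per million output tokens
};

type RunResult = {
  model: string;
  scenario: string;
  passed: boolean;
  durationMs: number;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
  failureNotes?: string;
};

// Assumed shape: the agent returns its output plus token usage.
type AgentRunner = (
  model: string,
  prompt: string,
) => Promise<{ output: string; inputTokens: number; outputTokens: number }>;

async function evaluateScenario(
  models: ModelSpec[],
  scenario: { name: string; prompt: string; check: (output: string) => boolean },
  runAgent: AgentRunner,
): Promise<RunResult[]> {
  const results: RunResult[] = [];
  for (const model of models) {
    const start = Date.now();
    const { output, inputTokens, outputTokens } = await runAgent(model.name, scenario.prompt);
    const passed = scenario.check(output); // the functional test decides pass/fail, not an LLM judge
    results.push({
      model: model.name,
      scenario: scenario.name,
      passed,
      durationMs: Date.now() - start,
      inputTokens,
      outputTokens,
      // Rough cost of this run, e.g. $0.20/M input plus $0.50/M output.
      costUsd:
        (inputTokens / 1e6) * model.inputPricePerMTok +
        (outputTokens / 1e6) * model.outputPricePerMTok,
      failureNotes: passed ? undefined : "failure analyzed manually and added to the dataset",
    });
  }
  return results; // these records are what gets injected into the scoring prompts
}
```

The point is that pass or fail comes from deterministic checks; the LLM-based scoring only grades how well a run went, with those records already in its context.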
Now, I believe understanding the
criteria and scenarios is important for
making sense of the results. But hey, if
you're anxious to see which models won
and lost, feel free to skip ahead to the
next section with the results. I won't
be offended. I promise. So, I evaluated
these 10 models across five key
dimensions. First is overall
performance. Basically, quality scores
measuring how well each model handles
different types of agent instructions.
Second is reliability. Can the model
actually complete evaluation sessions
without crapping out? This is measured
by participation rate and completion
success. Third is consistency.
How predictable is the model's
performance across different tools? You
do not want a model that excels in one
area but completely fails in another.
Fourth is cost-performance value: the
quality score relative to pricing per
million tokens. Raw performance doesn't
mean much if it costs 50 times more than
alternatives. And the last one is
context window efficiency. How well
models handle large context loads. Some
scenarios send over 200,000 tokens to
the models. Having a massive context
window doesn't guarantee good
performance if the model cannot actually
use it effectively. Now, timeout
constraints are critical here. If a
model cannot deliver results in
reasonable time, it's not useful in
production. Real world agent workflows
have time budgets like 5 minutes for
quick pattern creation, 45 minutes for
comprehensive cluster analysis. When I
say an evaluation failed, I mean the
timeout was exceeded. It isn't about
models working indefinitely. It's about
delivering what we need in reasonable
periods of time. So let's look at the
five evaluation scenarios. Each one tests
different aspects of what makes AI
agents actually useful in production
environments. First is capability
analysis. This is basically an endurance
test that puts a model through about 100
consecutive AI interactions over a
45-minute period. The goal is to discover
and analyze every single Kubernetes
resource in the cluster. What makes this
challenging is maintaining quality
throughout those marathon sessions while
demonstrating deep Kubernetes knowledge
for each resource type. The question
we're asking is, can the model sustain
performance without degrading over
extended workflows? This matters because
in production you need models that don't
get sloppy or confused after dozens of
interactions. Interestingly,
70% of the models we tested
failed at this completely. Second is
pattern recognition. This evaluates how
well models handle multi-step
interactive workflows within a tight
five-minute timeout. The workflow goes
something like this: expand trigger
keywords into comprehensive lists,
abstract those specific requirements
into reusable organizational patterns
and create templates that get stored in
a vector database. Think of it as
capturing your team's development best
practices so they can be automatically
applied to future deployments. The
challenge here is speed combined with
abstraction. Transforming specific
requirements into general patterns
quickly. What we're really testing is
how well the model handles rapid
interactive refinement workflows. This
matters because in production you often
need quick answers for pattern matching,
not deep philosophical analysis. The
third is policy compliance. This tests
whether models can perform schema
analysis of cluster resources within a
15-minute timeout. The task involves
expanding policy triggers into
comprehensive resource categories and
generating complete syntactically
correct Kybero policies with cell
validation expressions. This is about
proactive governance integrating
security and compliance requirements
directly into AI recommendations rather
than blocking manifests after they're
created. The challenge is in time
pressure. Comprehensive schema analysis
cannot be rushed, but you also cannot
take forever. The question is, can the
model balance depth of analysis with
time constraints?
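Since every scenario runs under a hard time budget (5, 15, 20, or 45 minutes depending on the scenario), a failed evaluation usually just means that budget ran out. Here's a minimal sketch of how such a budget can be enforced; generatePolicies is a hypothetical stand-in for the actual agent call, and the real harness may do this differently.

```typescript
// Illustrative time-budget wrapper: a run that exceeds its budget counts as a failure,
// even if the model would eventually have produced a good answer.

class TimeBudgetExceeded extends Error {}

async function withTimeBudget<T>(
  budgetMs: number,
  run: (signal: AbortSignal) => Promise<T>,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), budgetMs);
  try {
    return await Promise.race([
      run(controller.signal),
      new Promise<never>((_, reject) =>
        controller.signal.addEventListener("abort", () =>
          reject(new TimeBudgetExceeded(`exceeded ${budgetMs} ms budget`)),
        ),
      ),
    ]);
  } finally {
    clearTimeout(timer); // don't leave the timer running after an early finish
  }
}

// Usage (hypothetical): the policy compliance scenario gets a 15-minute budget.
// await withTimeBudget(15 * 60 * 1000, (signal) => generatePolicies(prompt, { signal }));
```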
30% of models completely failed this
scenario because they didn't have large
enough context windows to handle the
schema complexity. Fourth is
recommendations. This is manifest
generation under extreme context
pressure within a 20 minute timeout. The
context load here is brutal. Up to 50
large Kubernetes resource schemas
totaling over 100,000 to 200,000 tokens. The
process involves transforming user
intent into production-ready YAML
manifests through targeted clarification
questions. What makes this different
from generic deployment tutorials is
that it needs to understand your
specific cluster's capabilities, your
organization's patterns, and your
governance policies. The challenge is
processing this massive schema context
while still being accurate. The question
we're testing is, hey, how efficiently
does the model actually utilize context
windows for complex generation? This is
where we learned that having a massive
context window doesn't mean the model
can use it effectively. 50% of models
failed at manifest generation given
the timeout constraints. Fifth and final
is remediation. This one is different
from the previous scenarios. The earlier
four were one-shot interactions. Hey, send
a prompt, get the response. Remediation
runs as an investigation loop. The model
receives the issue description along
with a list of tools it can use. It then
decides which kubectl commands to
run to gather data. The agent executes
those commands and adds the output back
into the context. The model analyzes the
new information and decides whether to
request more tool executions or conclude
with, hey, here's the problem and here's
how to solve it. The model in this
scenario is completely free in how it
investigates. It could decide it has
enough information without executing any
tools at all. Or it could loop up to 30
iterations, gathering more and more
data. In practice, successful
investigations typically run five to
eight cycles, all within timeout
constraints for that specific scenario.
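In code, that investigation loop has roughly this shape. This is a sketch, not the real implementation: askModel and runKubectl are assumed helpers, and the real agent constrains which kubectl commands the model may request.

```typescript
// Rough shape of the remediation investigation loop (illustrative sketch).
// The model keeps requesting kubectl commands until it's confident about the root cause.

type ModelTurn =
  | { kind: "run_tools"; commands: string[] } // e.g. ["kubectl describe pod my-app"]
  | { kind: "conclude"; rootCause: string; remediation: string; risk: string };

async function investigate(
  issue: string,
  askModel: (context: string[]) => Promise<ModelTurn>, // sends accumulated context to the LLM
  runKubectl: (command: string) => Promise<string>,    // executes a read-only kubectl command
  maxIterations = 30,
) {
  const context: string[] = [`Issue reported: ${issue}`];

  for (let i = 0; i < maxIterations; i++) {
    const turn = await askModel(context);

    if (turn.kind === "conclude") {
      // Successful investigations typically land here after five to eight cycles.
      return turn;
    }

    // Execute the requested commands and feed their output back into the context.
    for (const command of turn.commands) {
      const output = await runKubectl(command);
      context.push(`$ ${command}\n${output}`);
    }
  }

  throw new Error("investigation did not converge within the iteration limit");
}
```

Whether the model concludes after zero tool calls or loops thirty times is entirely its decision; the harness only enforces the iteration cap and the scenario's time budget.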
What makes this challenging is that the
model must conduct intelligent
systematic investigation. Kubernetes
failures present symptoms, not causes.
For example, a pod that will not start
might actually be failing because of a
missing persistent volume claim or a
network policy blocking traffic. The
model needs to decide which kubectl
commands to run next based on what it
learned from previous commands, maintain
investigation context throughout all
those iterations, understand cross-resource
dependencies, and know when it has enough
information to provide the actual root
cause and remediation steps with proper
risk assessment. The performance variance
here was extreme. We saw a 40 times
difference between the fastest and the
slowest models ranging from 2.5 seconds
to over 22 minutes for the same
investigation. So those are five
scenarios testing different aspects of
what makes AI agents useful in
production. Endurance, speed, attention
to detail, context handling, and
systematic investigation. Now let's see
how these 10 models actually perform.
Let's start with what honestly shocked
me the most. Seven out of 10 models,
that's 70%, couldn't complete the
capability analysis within 45 minutes.
They just ran out of time trying to work
through those 100 consecutive
interactions. Half of the models
exceeded timeout constraints for
manifest generation. Three out of 10
failed at policy compliance because they
couldn't finish the work within the
15-minute window. Now, this is important to
understand. When I say a model failed, I
don't mean it crashed or threw errors.
These are timeout failures. The models
were still working, still trying to
complete the task, but they couldn't
deliver results within reasonable
production time frames. And that's what
matters in real deployments. If your
agent takes 2 hours to analyze a cluster
when you need answers in 45 minutes or
45 seconds, that model isn't useful to
you, no matter how good its eventual
output might be. But some models had
truly truly disastrous reliability
issues that go beyond just being slow.
GPT-5 Pro was the real shocker here. This
model couldn't complete nearly half of
all evaluations. Only a 52%
participation rate. It exceeded timeouts
so often that it failed more tests than
it passed. Pattern recognition? Zero.
Couldn't finish within the time budget.
Recommendations? Also zero. Nothing. GPT-5
Pro is supposed to be the advanced
version of GPT-5. It is positioned as the
model for complex reasoning tasks. And
here's the thing, these aren't
impossibly hard tasks. We are talking
about Kubernetes operations that
production agents need to handle. It's
like having a brilliant mathematician
who might excel at complicated algebra
but takes hours to answer, hey, what's 2
+ 2? Being brilliant doesn't mean much
if you cannot answer straightforward
questions relatively quickly. Then
there's Mistral Large. Now, I expected
Mistral might struggle. I did. It's
generally not considered among the
absolute top-tier models. I included it
anyway because, hell, we need to support
European AI development as well. Right?
But the results were rough. This model
couldn't finish remediation
investigations at all. Zero score. It
would start the investigation loop but
couldn't complete the systematic
troubleshooting process within the time
constraints. Only a 65% participation rate
overall, meaning it failed to complete a
third of all evaluations. Now, let's
talk about cost because this is where
things get interesting. The cheapest
model isn't always the best value and
the most expensive definitely isn't
either. For example, Grok 4 Fast
Reasoning absolutely dominates the
cost-performance category. It accomplished a
value score more than three and a half
times better than the next competitor at
35 cents per million tokens. That's 20
cents for input and 50 cents for output.
It's the cheapest model in this
evaluation. But here's the catch.
Remember that policy compliance test I
mentioned earlier? Grok 4 Fast
Reasoning couldn't complete it within
the 15-minute window. It scored around
40%, which is catastrophically
low. This makes it production-dangerous
for policy generation workflows. So
yeah, it's cheap and fast for most
tasks, but you absolutely cannot use it
for generating, let's say, governance
policies. On the other hand, Gemini 2.5
Flash offers a more balanced value
proposition. At 30 cents for input and
$2.50 for output per million tokens, it
delivers a solid 78% overall performance
with no critical failures. It's about
twice as cheap as some premium options
while still being production ready
across all scenarios. Now, about those
premium models, some of them cost
anywhere from 8 to 20 times more than
budget options. The question is, do they
deliver proportional value? We'll see
that one later. And then there's GPT-5
Pro. Remember that model that couldn't
complete half of its evaluations? The
one with zero scores in pattern
recognition and recommendations? That one
costs almost $68
per million tokens on average: $15 for input and
$120
for output. It's so ridiculously
expensive compared to everything else
that it actually screwed up my cost
versus quality graph that you see on the
screen. Now, looking at tool-specific
performance reveals something
interesting. There's no universal winner
that dominates everything. Instead, we
see specialization. Some models excel at
certain tasks but struggle with others.
In capability analysis, the performance
spread is massive from 90% at the top
down to around 64% at the bottom.
There's a significant difference in how
well models handle these 100 consecutive
interactions over 45 minutes. Pattern
recognition shows even more dramatic
variance. Some models excel at this
rapid five-minute workflow while others,
remember GPT-5 Pro, yeah, scored an absolute
zero. Couldn't complete it at all.
They're not even in this graph. Policy
compliance ranges from 83% down to 40%.
That bottom score is Grok 4 Fast Reasoning
failing to complete the comprehensive
schema analysis within 15 minutes. For
recommendations, we saw a 42 times speed
difference between the fastest and
slowest models. Same task, same schemas,
widely different completion times. Some
models process those 100,000 plus tokens
efficiently, others choke on them. And
remediation shows the widest gap of all.
Complete failures at zero versus near
perfect scores at 95%. This systematic
investigation workflow really separates
the capable models from the ones that
just cannot handle iterative problem
solving. And here's where things get
really interesting. Context window size
matters, but efficiency matters even
more. The pattern is clear. Models with
larger context windows generally score
higher in recommendations. Makes sense
right? That scenario sends 100,000 or
more tokens of context in each
interaction, including up to 50 large
Kubernetes schemas. Bigger window means
more room to work, right? Well, not
really. Look at Claude Haiku. That one
completely breaks the pattern. This
model has only a 200,000-token context window,
yet it scored 93%, the highest score in
the entire recommendations category. It's
processing these massive context loads
with 50 plus schemas more effectively
than models with five times larger
windows.
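One way to put a number on that, and this is my illustrative framing rather than a metric from the report, is to compare how full a model's window actually was with the score it earned:

```typescript
// Illustrative context-efficiency check (not a metric from the actual report).
// It contrasts how full the window was with how well the model scored.

type ContextRun = {
  model: string;
  windowTokens: number;      // advertised context window
  contextSentTokens: number; // what the scenario actually sent (roughly 100k-200k here)
  score: number;             // 0-100 from the recommendations evaluation
};

function contextEfficiency(run: ContextRun) {
  return {
    model: run.model,
    // 1.0 means the window was completely full; small values mean lots of unused headroom.
    windowUtilization: run.contextSentTokens / run.windowTokens,
    // Score earned per 100k tokens of context the model had to digest.
    scorePer100kTokens: run.score / (run.contextSentTokens / 100_000),
  };
}

// Hypothetical numbers: a 200k-window model running nearly full can still outscore
// a 2M-window model that never finishes within the time budget.
console.log(contextEfficiency({ model: "small-window", windowTokens: 200_000, contextSentTokens: 180_000, score: 93 }));
console.log(contextEfficiency({ model: "huge-window", windowTokens: 2_000_000, contextSentTokens: 180_000, score: 0 }));
```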
That's not about having more space.
That's about using the space you have
more efficiently. Now, on the flip side,
Grok 4 Fast Reasoning has a massive,
massive 2-million-token context window
but could not complete tasks within the
timeout constraints. All that capacity
but it cannot process the information
fast enough to actually matter. The
lesson here is clear. Context size alone
doesn't guarantee performance. It's
about how efficiently the model utilizes
what it has. Now when I synthesize all
this data, the reliability issues, cost
trade-offs, tool-specific performance,
and context efficiency,
clear performance tiers emerged. Three
distinct categories separated by real
performance gaps. Let's start at the
bottom. The bottom tier includes models
scoring below 70% overall or below 80%
reliability.
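As a rough sketch, the tier cutoffs used throughout this section translate to something like this; the thresholds are approximations of what's described in the narration, not an exact formula from the report:

```typescript
// Approximate tier cutoffs as described in this section (sketch, not the report's exact formula).

type ModelScore = { name: string; overall: number; reliability: number }; // both 0-100

type Tier = "top" | "mid" | "bottom";

function tierOf({ overall, reliability }: ModelScore): Tier {
  if (overall < 70 || reliability < 80) return "bottom"; // critical failures or poor completion
  if (overall > 85) return "top";                         // in practice, also high consistency
  return "mid";                                           // dependable, but not the highest level
}

// Hypothetical example values:
// tierOf({ name: "example-a", overall: 87, reliability: 90 }) // "top"
// tierOf({ name: "example-b", overall: 78, reliability: 85 }) // "mid"
// tierOf({ name: "example-c", overall: 55, reliability: 52 }) // "bottom"
```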
Four models landed there. Now, two of
these I expected to struggle. Mistral
Large and DeepSeek Reasoner aren't generally
considered top-tier models, so their
results were not surprising. DeepSeek
struggled with pattern recognition and
endurance testing while Mistral couldn't
complete remediation investigations.
Not great, but not shocking either. Now
the real disappointments were GPT-5 Pro
and Grok 4 Fast Reasoning. These were
supposed to be competitive options, but
they failed in ways I didn't anticipate.
The primary issue: they take unreasonably
long to analyze tasks, consistently
hitting my timeout constraints. GPT-5 Pro
could not complete nearly half of all
evaluations. That catastrophic 52%
participation rate with zero scores in
pattern recognition and recommendations.
It's not that it got the answers wrong.
It's that it could not deliver answers
within specific time frames.
And Grok 4 Fast Reasoning, with fast
literally in its name, has that
production-dangerous 40% policy
compliance score because it could not,
and I repeat, could not complete the
comprehensive schema analysis within 15
minutes. The irony of calling it fast
when it consistently exceeded timeouts
is kind of funny. It's not lost on me.
The bottom line, these models have
critical failures that make them
unsuitable for production use or at
minimum you would need to be very very
careful about which scenarios you use
them for. Now, moving up to the mid tier,
we have four models that are actually
production ready. They score between 70
and 80% overall with at least 80%
reliability. This includes Gemini 2.5
Flash, our balanced value champion at 30
cents input and $2.50 output per million
tokens. GPT-5 without Pro, with its solid
consistency in pattern recognition.
Gemini 2.5 Pro with strong capability
analysis despite that weak remediation
score, and Grok 4 with good reliability
and remediation performance. These are
dependable models with some tool-specific
weaknesses but nothing
catastrophic. They will get the job done
just not at the absolute highest level.
And then we have the top tier. What's
interesting is that the data naturally
separated into those groups. There's a
real performance gap between the
mid-tier models clustering around 75 to 80%
and the top performers. The top-tier
models consistently score above 85%
overall with significantly higher
reliability and consistency scores.
They're not just a little better, they're
noticeably better across the board. Only
two models made it here. Claude Haiku,
which was released in mid-October,
very recently, scored 87% overall,
actually the highest overall score in
the entire evaluation. It leads in four
out of five tool categories: capability
analysis, pattern recognition,
recommendations, and remediation. This
is the model with only a 200,000-token context
window that somehow achieved the highest
recommendation score, 93%. Its
efficiency is remarkable and it costs $1
input and $5 for output per million
tokens. Claude Sonnet also scored 87% overall,
with slightly lower raw performance than
Haiku, but it has the highest
reliability and consistency scores in
this entire comparison. 98% reliability
and 97% consistency. It achieved 100%
participation rate across all
evaluations, never failed to complete a
test, never exceeded timeouts. When you
absolutely absolutely need your agent to
finish what it started, this is your
model, but that reliability comes at a
premium. $3 input and $15 output per
million tokens. That's three times more
expensive than Haiku. So, here's the
trade-off between these two top
performers. Haiku gives you the best raw
performance across most scenarios at a
reasonable price, but with around 90%
reliability. Sonnet gives you a near-perfect
98%. It will never fail you, but
at three times the cost and with
slightly lower performance in most
categories. Both are excellent choices.
The question is whether you're
optimizing for maximum performance or
maximum reliability. Okay, now I've
shown you the data, the failures, the
costs, the performance patterns, the
tiers. But data alone doesn't help you
make decisions. You need concrete
recommendations. Which model should you
actually use? Let's break this down from
worst to best so you know exactly what
to avoid and what to adopt.
Now, before we dive into the rankings,
the final summary, there is an important
note. This ranking is specific to ops,
DevOps, SRE, and software engineering
tasks. Models that scored low here might
excel at other types of work like
creative writing, general conversation
or completely different domains.
Similarly, models that scored high here
might not be the best choice for
everything. This comparison is focused
on agent-based technical workflows.
Keep that in mind as we go through the
rankings. 10th place, GPT-5 Pro. $15
per million input tokens, $120 per
million output tokens. You want me to
pay premium prices for a model that
cannot finish half the tests? This thing
failed more evaluations than it
completed. When it does manage somehow
to finish something, it takes forever.
You want us to pay those prices for
this? Hell no. Now, ninth place, Mistral
Large. This is marketed as Europe's
answer to OpenAI. Europe's shot at
independence.
It cannot even troubleshoot a Kubernetes
cluster. Zero in remediation. Failed a
third of all tests. Come on, Europe.
This is embarrassing. Eighth place, DeepSeek
Reasoner. The model everyone's
talking about, trained for pocket change.
OpenAI thinks they stole their data.
Great benchmark scores, and it cannot
handle real agent workflows. Below 70%
overall. Turns out you can game
benchmarks, but you cannot fake
production performance.
Seventh place, Grok 4 Fast Reasoning.
Oh, fast reasoning, except it's neither
fast nor good at reasoning. Exceeded
timeouts on complex analysis, 40% on
tasks requiring systematic thinking.
What's the point of putting reasoning in
your name if you cannot reason? Cheapest
model? That's true, at 20 cents input,
50 cents output. You get what you pay
for, I guess. These were the models you
should avoid. Now, let's look at the
middle of the pack models that actually
work but will not blow your mind. Sixth
place, Gemini 2.5 Pro. Mid-tier
performance. Nothing exciting, nothing
terrible. The model you choose when you
have no strong opinion. It works fine.
It will not impress anyone. Then at the
fifth place, Grok 4. Mid-tier.
Reliable, boring. It will not blow your
mind, but it will also not let you down.
It's Elon's safe option. Then, in
fourth place, GPT-5
without Pro, the normal version, not the
expensive disaster. Mid-tier,
consistent, predictable. It's like the
Toyota Camry of AI models. Gets the job
done without drama. And now we're
getting to the good stuff. The top three
models that actually deliver. In the
third place, Gemini 2.5 Flash. It's the
best value for money. 30 cents input,
$2.50 output, 78% overall performance with
no catastrophic failures. When your CFO
asks why you're spending money on AI,
show them this model. Good performance,
reasonable price, smart choice. Then, in
second place, Claude Sonnet. $3
input, $15 output, 87% overall, 98%
reliability, the highest in this entire
comparison. It never failed a single
test. When you absolutely cannot afford
your agent to crap out, this is your
model. Enterprise-grade reliability,
three times more expensive than Haiku.
Worth it if failures cost you money. And
now the winner, first place, Claude
Haiku. $1 input, $5 output, 87% overall,
tied with Sonnet, but wins on raw
performance. Leads in four out of five
categories. And here's the kicker. Only
200,000 context window, yet it achieved
the highest score for processing massive
context loads. That's not about having
more space. That's about using what you
have efficiently.
Best price performance ratio. This is
the model to beat. Use this. Use it,
unless price isn't a concern and
you can afford Sonnet's premium for
maximum reliability. So here's the
bottom line. Start with Claude Haiku for most
work. Switch to Sonnet when you need that
98% reliability and you can afford the
premium. Use Gemini 2.5 Flash when
budget matters and avoid the bottom tier
entirely unless you have very very
specific use cases where their
weaknesses do not matter. So here's the
question. Was this useful? Should I
repeat those evaluations every time a
new model is released? Which criteria do
you think is missing? What should I
include in future evaluations when new
models appear? Let me know in the
comments. The full report with all the
detailed data is available at the
address over there. Thank you for
watching. Hey, it's in the description
as well. Thank you for watching. See you
in the next one. Cheers.
A comprehensive, data-driven comparison of 10 leading large language models (LLMs) from Google, Anthropic, OpenAI, xAI, DeepSeek, and Mistral, specifically tested for DevOps, SRE, and platform engineering workflows. Instead of relying on traditional benchmarks or marketing claims, this evaluation runs real agent workflows through production scenarios: Kubernetes operations, cluster analysis, policy generation, manifest creation, and systematic troubleshooting, all with actual timeout constraints. The results reveal shocking gaps between benchmark promises and production reality: 70% of models couldn't complete tasks in reasonable timeframes, premium "reasoning" models failed on tasks cheaper alternatives handled easily, and the most expensive model ($120 per million output tokens) failed more tests than it passed.

The evaluation measures five key dimensions: overall performance quality, reliability and completion rates, consistency across different tasks, cost-performance value, and context window efficiency. Five distinct test scenarios push models through endurance tests (100+ consecutive interactions), rapid pattern recognition (5-minute workflows), comprehensive policy compliance analysis, extreme context pressure (100,000+ token loads), and systematic investigation loops requiring intelligent troubleshooting.

The rankings reveal clear performance tiers, with Claude Haiku emerging as the overall winner for its exceptional efficiency and price-performance ratio, while Claude Sonnet takes the reliability crown with 98% completion rates. The video provides specific recommendations on which models to use, which to avoid, and why cost doesn't always correlate with capability in production environments.

#LLMComparison #DevOps #AIforEngineers

Consider joining the channel: https://www.youtube.com/c/devopstoolkit/join

▬▬▬▬▬▬ 🔗 Additional Info 🔗 ▬▬▬▬▬▬
➡ Transcript and commands: https://devopstoolkit.live/ai/best-ai-models-for-devops--sre-real-world-agent-testing
🔗 DevOps AI Toolkit: https://github.com/vfarcic/dot-ai
🎬 Analysis report: https://github.com/vfarcic/dot-ai/blob/main/eval/analysis/platform/synthesis-report.md

▬▬▬▬▬▬ 💰 Sponsorships 💰 ▬▬▬▬▬▬
If you are interested in sponsoring this channel, please visit https://devopstoolkit.live/sponsor for more information. Alternatively, feel free to contact me over Twitter or LinkedIn (see below).

▬▬▬▬▬▬ 👋 Contact me 👋 ▬▬▬▬▬▬
➡ BlueSky: https://vfarcic.bsky.social
➡ LinkedIn: https://www.linkedin.com/in/viktorfarcic/

▬▬▬▬▬▬ 🚀 Other Channels 🚀 ▬▬▬▬▬▬
🎤 Podcast: https://www.devopsparadox.com/
💬 Live streams: https://www.youtube.com/c/DevOpsParadox

▬▬▬▬▬▬ ⏱ Timecodes ⏱ ▬▬▬▬▬▬
00:00 Large Language Models (LLMs) Compared
01:54 How I Compare Large Language Models
05:01 LLM Evaluation Criteria and Test Scenarios
13:23 AI Model Benchmark Results
27:34 AI Model Rankings and Recommendations