OpenAI says GPT-5.2 Codex is the best coding model yet: reliable tool calling, native compaction, and even more efficient use of tokens while reasoning.
So I did what I always do. I put it up
against six other models, ran it
through the exact same challenge, the
same analysis, the same implementation,
and the exact same evaluators to see how
it performed. The short answer, and I'll give it to you right up front, is no. Actually, I couldn't find any major differences at all. Now,
admittedly, they call out cybersecurity, and I will admit I did not challenge that aspect at all. However, on everything else that I did challenge, not only did 5.2 Codex not perform any better, there were some places where it actually performed a little bit worse. It's a very, very
good model, don't get me wrong. And in
fact, if you're on Windows, it may
really turn the dial for you, because they did some real tool-calling work on Windows, so it really is the place to go there. But it was a little bit slower, it used slightly more tokens, and it came out with worse
results. And the most important one to
me is that it actually communicates a little bit worse than the 5.2 non-Codex model.
But I want to show you what I did. I
took seven models from OpenAI, Claude,
and Gemini and measured two things: how
well they analyzed and identified the
problems in an old codebase and how well
they fixed them. And I used evaluators
to judge everything. I'm going to walk
you through the results of these systems
and take a look at their analysis. And
in fact, I'm going to show you a couple
surprising findings, including one model
that really completely fell over in a
way that I'm still surprised by and
can't really explain. And I'll be honest
with you, the most important finding from all of these multiple days of work was not really the Codex thing, because as you can tell, the two are at best very comparable with one another. It was actually how these models communicate how they got to their answers.
Not just what their analysis was, that's
something we've seen in the past, but
kind of their thought process throughout
the whole thing. And I'm calling this context mapping: the idea of being able to hand off how a model got to its conclusion, or where it currently is, the things it has considered, and a whole bunch of other things. I think this is going to be fundamentally important to us when working with agents in the future, and it's something I'm shooting a whole video on. If something
like that's interesting, please
subscribe. That I hope will be my very
next video, and I think it's really
important. But first, let's talk about
this Codex problem. Okay, so what did I
actually throw at these models? This is
YouTube TV's NFL page. What you're
seeing is a list of all of the shows
that it's aware of. And you'll see it's
a partial list. That's important in a
second. I wrote a Chrome extension, probably seven or eight years ago at this point, to help me identify what I'm interested in and what I'm not. What it
does is pretty simple. It puts these
little dots here on each one of these
episode rows. And you'll see some rows
have different levels already applied to
them. So, if we took this game here, I
might say, "I'm mildly interested or not
interested or highly interested." And
that's what the dots do for us: they allow a quick selection of excitement level around a specific game or episode. Now,
this sounds really simple, but of
course, when you reload the browser, we
need this to come back. It needs to be
sticky. It needs to keep drawing this
yellow ring around this game forever.
Otherwise, it's pretty much meaningless.
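To make that concrete, here's a minimal sketch of what the sticky part could look like in a content script. This is my illustration, not the extension's actual code, and every name in it (the helper functions, the dataset attribute) is an assumption:

```typescript
// Hypothetical sketch: persist an interest level per game so the dots and
// the yellow ring come back after a reload. Names are illustrative only.
type InterestLevel = 0 | 1 | 2 | 3 | 4 | 5;

// Save the chosen level, keyed by whatever ID we managed to infer for the row.
async function saveInterest(gameId: string, level: InterestLevel): Promise<void> {
  await chrome.storage.local.set({ [gameId]: level });
}

// On load (and whenever the page re-renders), look up and re-apply the saved level.
async function restoreInterest(row: HTMLElement, gameId: string): Promise<void> {
  const stored = await chrome.storage.local.get(gameId);
  const level = stored[gameId] as InterestLevel | undefined;
  if (level !== undefined) {
    row.dataset.interestLevel = String(level); // styling turns this into the dots / ring
  }
}
```

The storage part is easy; the open question is what to use as that gameId.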
And really right within that is the
whole crux of the system. That's the
most difficult part because a system
like this doesn't give you as an
extension developer much to work with.
This row or this game is not actually
identified in any meaningful way. And
all the extension has to work with is
basically the HTML, if you will, that's been rendered here to draw this row. That HTML can be as dynamic as
the team wants it to be. It doesn't have
to say anything unique about this game.
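Because that HTML can re-render at any time, an extension like this typically has to watch the page and re-apply its markers. Here's a hedged sketch of that general technique, reusing the hypothetical helpers from the sketch above; the row selector is a placeholder, not YouTube TV's real markup:

```typescript
// Hypothetical sketch: re-apply saved interest markers whenever YouTube TV
// re-renders its listing. "[data-game-row]" stands in for the real row selector.
const observer = new MutationObserver(() => {
  document.querySelectorAll<HTMLElement>("[data-game-row]").forEach((row) => {
    const gameId = inferGameId(row);    // sketched further down; the ID has to be inferred
    void restoreInterest(row, gameId);  // re-draw dots / ring from storage
  });
});

observer.observe(document.body, { childList: true, subtree: true });
```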
The main problem you'll keep hearing about is being able to identify each one of these rows. Getting an ID to use as a source of truth is almost impossible. So the extension is really doing a lot of work to try to infer what game this is from the names of the teams that are playing, the date it's playing, the thumbnail URL, and the URL it will launch if you kick off the game. It's got a lot of different mitigation strategies to try to get down to an ID so that this yellow ring doesn't accidentally start showing up against a different game. That's really the crux.
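Here's a rough sketch of what that kind of ID inference can look like, using the signals he just listed. The selectors and the fallback order are my assumptions for illustration; the real extension's strategies are more involved:

```typescript
// Hypothetical sketch: derive a stable-ish ID for a game row by combining weak
// signals, since YouTube TV exposes no real identifier for the row.
function inferGameId(row: HTMLElement): string {
  const teams = row.querySelector("h3")?.textContent?.trim() ?? "";    // e.g. team names in the title
  const date  = row.querySelector("time")?.textContent?.trim() ?? "";  // air date, if present
  const thumb = row.querySelector("img")?.getAttribute("src") ?? "";   // thumbnail URL
  const watch = row.querySelector("a")?.getAttribute("href") ?? "";    // URL the row launches

  // Prefer the strongest signal available, then fall back to weaker combinations,
  // so the yellow ring never latches onto the wrong game.
  if (watch) return `url:${watch}`;
  if (teams && date) return `teams:${teams}|${date}`;
  return `thumb:${thumb}`;
}
```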
And by the way, like I said, this is an extension I wrote for my own use. I've been using it for many years, and no one else has ever seen it, so I really didn't care how it was written at the time. If I needed changes, I'd make a quick change. It is not architected all that well. So what I decided to do is give this to these models, because it really represents a good example of working code that cannot be disrupted but absolutely has some room for growth. Can each model identify what it could be better at, or where the risks are? That's the real challenge here.
Okay, let me show you just a little bit of the methodology, and then we'll get into some of the results. What I'm sharing here is a folder called eval, and you'll see there are five steps in it. The first step is the agent's job, and then I use the evaluator. The agent, whether it's Gemini or GPT-5.2 or whatever it might be, is told to go execute this file, 01 agent analysis. It reads that, and one of the things it knows it needs to look into is this folder of instructions. We'll look at that in a second. Once that step is finished, I then go over to Claude Code in planning mode with Opus 4.5, so all of these eval steps are done by the exact same model, intentionally, so that the comparison between all the parts is the same. For each step you can see the agent does its work, and then we use the eval agent to analyze its output and give us results.
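Laid out on disk, the setup is something like the following. This is my reconstruction from what's shown on screen, so treat the exact file names as placeholders:

```
eval/
  01-agent-analysis.md   # step 1: the model under test analyzes the codebase
  ...                    # four more steps, alternating agent work and evaluation
  instructions/
    START_HERE.md        # per-step orientation: what you're doing, your responsibilities
    PRD.md               # the big description of the extension and its objectives
```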
The agent analysis, which is that very first step, says: go take a look, see what you can find. It has a "start here" document, and each one of these steps has a "start here" that acts as a kind of orientation: here's what you're doing in this step, here are the responsibilities you have, and so on. What you're really going to want to read is this PRD, and as you can see, it's a pretty big PRD where I describe everything: the objective of the system and what it's trying to do. I was trying to be as fair as possible, as if I had given this as a real task, not just "go learn everything cold," but "here's what the system is trying to do; go take a look at it and figure these things out," like separation of concerns, fragile string manipulation, and so on. So I was being pretty predictive about what they might be looking for and what they might be able to find. But I'm also very clear about going and finding things that are not described. Okay,
enough of kind of the what are we doing?
What are we asking these systems to do?
Let's take a look at some of their
findings. Okay, so let's take a look at
the results that these put out. Now
recall what I'm doing is I'm asking the
system to go do the analysis, determine
everything that I've kind of described
in that big PRD, but also go find as
much as it can that I might not have
described. And so we're looking for some
novel findings in these outputs. I am
also very clearly saying: I want you to take those documents and create what you're seeing here, a single-page application. You might hear me say SPA.
This is kind of that context map. I want
to understand not just what your
findings are, how you arrived at them,
what you considered, what really the
risks are, but I need you to talk to me
in two ways. I need you to talk to me a
little bit technically so that I know
what you're doing. And I need you to
talk to me as if I'm just kind of a
decision leader that might not have full
awareness of what's going on in this
system. All right. So, I want to look at
these two outputs briefly. We can't go
through them. They're super dense. It's
not worth actually going through them,
but I will show you the differences and
kind of their approaches to some degree
and the scores that the evaluator gave
to each. It's pretty simple. So, the
first one we're looking at, this is the
non-Codex model; this is the old one from a couple weeks ago, which was top of the heap for OpenAI. And what
it's trying to do is tell us
specifically, this is what it's asked to
do: tell us about the application we're working on. That's this
extension. It adds five dots for
interest that you can use on YouTube
TV's browse blah blah blah things we've
seen. You'll also see that it has an
overview of the architecture, how it lays out, what's important, which files matter in that respect, and which technical aspects are needed. And then at the bottom down
here, you'll see all of the proposed
changes. This is basically the stuff
that it found. I will say, just for simplicity's sake, they roughly found the
same things. As I've mentioned, these
things really performed the same. And
so, we won't necessarily go into these
except to say they're intended to tell us what is wrong. Take this one: restrict the content script to YouTube TV hosts. Okay, good. It gives us a little bit of technical detail here: the match pattern is kind of wide open, and you should scope it to just the tv.youtube.com domain. Okay, that makes sense. It gives us some details, but it's a little bit light, and you have to be pretty technical to understand what it's saying. So I wouldn't say this meets the bar for someone who's just a thought leader trying to decide whether or not something is important. It does describe the risks, why you would do it, and what needs to change, but it takes a lot of ingesting to figure out what it's talking about.
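For anyone who hasn't lived in extension manifests, the change being recommended is roughly this kind of scoping in manifest.json. The "before" being a broad pattern like "<all_urls>" is my assumption of what "very open" means here, and content.js is a placeholder name:

```json
{
  "content_scripts": [
    {
      "matches": ["https://tv.youtube.com/*"],
      "js": ["content.js"]
    }
  ]
}
```

The scoping itself is a one-line change; the critique here is about how well each model explains why it matters.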
Up at the top here, it talks about what this is and then the risks themselves. In other words, if we don't do some of these things, what might go wrong, or what do we need to protect against? And then, what is the plan we're taking on? Okay, so this GPT-5.2 non-Codex version, I think, does a pretty darn good job of communicating what it's trying to do, what it found, and the advice it's giving. So let's take a quick look at the Codex version. All right. And so here we are in the same system, this time from the 5.2 Codex standpoint. And
it's trying to do exactly the same
things because it was given the same
requirement to build an SPA of the
findings that it has. Again, you can see
that it tells us a little bit about the feature itself, what it's trying to
do, the major aspects of that feature.
The system flow is that architectural flow, to some degree: how the information, or the different aspects of the program, come into play and when they're important. It tells us a little bit about why it's
fragile. And then there are some
constraints that are non-negotiable. It
goes through the key risks that it found
and then each one of the changes. This
again is that "scope it to YouTube TV and don't be as open" change. If we look in, it has a little bit of a description, but it's not exactly clear what's going on here. It did a lot of work to try to
come up with these bars and a
representation, but I really have to be
technical to understand what it's
talking about here. It definitely is not
telling me any code information, code
lines. It's not giving me hints of what
it might change or why or what the risks
are if we don't change them. So, it's
missing quite a bit. Now let's look at the scores these two systems got. This is the evaluator scoring. On the left is the non-Codex version; that's the old version. And on the right is the Codex version. They look different because there's no template or prescription for how this stuff comes
out. But what you can see is, okay, we
get a 27 on one side and a 26 on the
other. So it's within plus or minus one, which I would say is basically the same. But it's worth saying that the
system comprehension for the old model
was a little bit off. It says it spent
more time on the what than deeply
explaining the why. And on the right, it
doesn't tell us the trade-offs. As I was
mentioning, it doesn't tell us what's
the risk if we don't do something or is
what you're advising more technical than
we need. Does it end up bloating the
system? It's just a simple extension.
Does it need it? It also gets dinged. Unlike this one over here, which gets a four on that communication step, the document we were just looking at only gets a three. And that's a pretty low score across all of the models that have delivered. And I believe these numbers actually hold up a little bit. The
intention isn't necessarily the numbers.
Don't get too hung up on that. This is
pretty subjective stuff on a pretty
small surface area. But at the same
time, looking at the two documents we
were just looking at, I do agree one
does a much better job of communicating, even though they largely found very nearly the same list. Okay, so here I'm
going to share this one with you just
briefly. I won't go through everything
it's doing, but this is our baseline.
This is Opus 4.5 in planning mode. This is the SPA, or the context map, that it came back with to tell me: what system are we looking at, what is it supposed to do, what does it do well, what did I think about, what risks does it have, what problems does it have that I think we should address, all of those kinds of things. And you can see it's telling a
story. It's even a thematic story in
this case, kind of a going-to-the-hospital theme, to some degree. So, our patient is a Chrome extension that solves real problems for NFL fans on YouTube TV. Imagine browsing YouTube TV's NFL selection: you see a wall of games, some you're excited about, others you couldn't care less about, but they all look the same.
There's no way to mark your preference.
Okay, so it is obviously getting
that concept of telling a story. Someone
that doesn't understand the surface area
would fully understand this, and in fact it goes so far as to create a dynamic interface here showing exactly how the thing works, to kind of describe it to somebody. So I think this is a great example of what I was looking for, and very explicitly what I was asking for. And it goes through all of the different
parts. Here is how the application lays
itself out. It even mentions that there's a fundamental challenge: YouTube TV provides no official API and no stable identifiers for games. It's like trying to recognize people by their outfits when they keep changing clothes every day. And it's really, honestly trying to tell us what's difficult, all the way down to the bottom, where you can see the different strategies: here's the class it's in and the lines you would care about, and here are the different mitigation or identification strategies that are used.
So, a really good job of trying to
slowly walk us into what the challenge
is, the different names of them and what
they're useful for. And then it goes in and finds those different kinds of risks that we were seeing in the others. Each one of these is a technical risk. It gives us references to the files and the lines it was found on. It even highlights the areas that it
thinks are actual problems and how to
fix them. This is fantastic. This is the
way that we need to see what models are
doing on the inside of things. Again, that later video is very exciting and I definitely want to talk about this, but I just wanted to show you: this is best in
breed. I want to be communicated to this
way, not just a table of bars that say
this one's important and this one's not
and trust us. Okay? And I just have to
very briefly share with you the model
that fell on its face. It earned the award for the most surprising finding of just about this whole effort, and that's Gemini. What
I'm sharing with you is the Gemini
output. I am not going to go through it.
You're going to just have to take my
word for it. This was a train wreck. I
ran it many times, four or five times, and it always did this. Remember the big PRD that we put in, which said, "Go find things similar to this and look for other stuff." It found only some of the things that were mentioned in the PRD, and it literally did not find a single thing that was not mentioned in the PRD, unlike every single other model. All it shares here is these three problems, all of which were named inside the PRD itself. It also did not find many of the other challenges, like the way we're dealing with the three different mitigations for IDs, things that were major cruxes for the rest of the models, and they all found those kinds of things. It is so lightweight. It's not
worth going through. But I will say
this: I find the Gemini 3 Pro model a good model to code with, so this is just shocking. It is worth saying that its implementation, when it went and did the work, was only against this surface area. Not surprisingly, what it finds is what it's going to fix, and what we asked it to do is go find what it can find, and this is what it found. So I just want to put this warning out there that
you need to keep an eye on the Gemini 3
model for a minute to make sure that
it's really finding everything it
should. It works very well against the
area that it's actually finding. That's
not the problem. But this was shocking
the difference between what it found and
what all of the other models found. All
right. I I did say it up front. The
there's no real big difference here. I
will call out cyber security again. It's
one of the things they they called out
very clearly in this model release. if
you have any interest in security, those
kinds of things, the codeex model is one
you want to kind of take a look at and
run some tests against. And also,
Windows tool calling was very
specifically called out and that can't
be understated. Anything that can do a
better job of tool calling is going to
be helpful. And I think, had I used these models on something with a much longer time horizon (this is a pretty small change), I would have seen a bigger delta between them, and the Codex model probably would have shown itself to be both more efficient and maybe a little bit faster. That better tool calling really does matter. If it's a coin flip for you and it really doesn't matter either way, definitely use the Codex model. That's what it's intended for, so it's going to be better at it. But there you have my real findings: they're basically the same thing at this point, still, largely. All
right, with that I will say keep an eye
out for that next one. I'm super excited
about that next video, which is really
about that context mapping and where I
think that we're going to be really
needing these models to be able to tell
us what they're thinking in a much more
informed way than just simply here's my
answer and we need to move forward with
it and we're just going to have to trust
it. I hope you're interested in that.
I'm definitely interested in that.
Thanks for coming along for the ride on
this one.
I ran GPT-5.2 Codex through the same challenge as six other models. Identical codebase. Identical prompts. Identical evaluators. The results surprised me, but not in the way OpenAI probably hoped.

OpenAI says GPT-5.2 Codex is their best coding model yet: reliable tool calling, native compaction, efficient reasoning. So I built a repeatable evaluation framework and put it head-to-head against GPT-5.2 (non-Codex), Claude Opus 4.5, and Gemini 3 Pro. The test? A real Chrome extension I wrote years ago: working code with real technical debt that needed analysis and refactoring recommendations.

Each model had to do two things: analyze the codebase and identify problems, then communicate those findings in both technical docs and stakeholder-friendly reports. I used Claude Opus 4.5 as the evaluator across all models to keep scoring consistent. The findings reveal something important about where these models actually differ, and it's not where you'd expect.

If you're evaluating AI coding assistants for real development work, this gives you actual comparative data. Developers working with legacy codebases, engineers curious about how different models handle complexity analysis, and anyone deciding between OpenAI, Anthropic, or Google models will find this useful. Whether you're deep into AI-assisted development or just getting started, there's signal here on what matters and what doesn't.

#GPT52Codex #ClaudeCode #AICoding #OpenAI #Anthropic

00:00 - Intro
02:35 - The Challenge
05:04 - Methodology
06:45 - Analysis Overview
07:51 - 5.2 Analysis
09:50 - 5.2 Codex Analysis
11:01 - Analysis Scores
12:17 - Best in class
14:43 - The Collapse
16:22 - Conclusion