OpenAI says GPT-5.2 Codex is the best coding model yet: reliable tool calling, native compaction, and even more efficient use of tokens while reasoning.
So I did what I always do. I put it up
against six other models, ran it
through the exact same challenge, the
same analysis, the same implementation,
and the exact same evaluators to see how
it performed. The short answer, and I'll give it to you right up front, is no. Actually, I couldn't find any major differences at all. Now,
admittedly, they call out cybersecurity, and I will admit I did not challenge that aspect at all. However, on everything else that I did challenge, not only did 5.2 Codex not perform any better, there were some places where it actually performed a little bit worse. It's a very, very
good model, don't get me wrong. And in
fact, if you're on Windows, it may
really turn the dial for you, because they did some real tool-calling work on Windows, so it really is the place to go there. But it was a little bit slower, it used slightly more tokens, and it came out with worse
results. And the most important one to
me is that it actually communicates a little bit worse than the 5.2 non-Codex model.
But I want to show you what I did. I
took seven models from OpenAI, Claude,
and Gemini and measured two things: how
well they analyzed and identified the
problems in an old codebase and how well
they fixed them. And I used evaluators
to judge everything. I'm going to walk
you through the results of these systems
and take a look at their analysis. And
in fact, I'm going to show you a couple
surprising findings, including one model
that really completely fell over in a
way that I'm still surprised by and
can't really explain. And I'll be honest
with you, the most important finding from all of these multiple days of work was not really the Codex thing, because as you can tell, the two are at best very comparable with one another. It was actually how these models communicate how they got to their answers.
Not just what their analysis was, that's
something we've seen in the past, but
kind of their thought process throughout
the whole thing. And I'm calling this context mapping: the idea of being able to hand off how a model got to its conclusion, or where it currently is, the things it has considered, and a whole bunch of other things. I think this is going to be fundamentally important to us when working with agents in the future, and it's something I'm shooting a whole video on. If something
like that's interesting, please
subscribe. That I hope will be my very
next video, and I think it's really
important. But first, let's talk about
this Codex problem. Okay, so what did I
actually throw at these models? This is
YouTube TV's NFL page. What you're
seeing is a list of all of the shows
that it's aware of. And you'll see it's
a partial list. That's important in a
second. I wrote a Chrome extension, probably seven or eight years ago at this point, to help me identify what I'm interested in and what I'm not. What it
does is pretty simple. It puts these
little dots here on each one of these
episode rows. And you'll see some rows
have different levels already applied to
them. So, if we took this game here, I
might say, "I'm mildly interested or not
interested or highly interested." And
that's what the dots do for us: they allow a quick selection of excitement level around a specific game or episode. Now,
this sounds really simple, but of
course, when you reload the browser, we
need this to come back. It needs to be
sticky. It needs to keep drawing this
yellow ring around this game forever.
Otherwise, it's pretty much meaningless.
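To make that concrete, here's a minimal sketch of what the sticky part could look like in a content script. This is my illustration, not the extension's actual code, and every name in it (the helper functions, the dataset attribute) is an assumption:

```typescript
// Hypothetical sketch: persist an interest level per game so the dots and
// the yellow ring come back after a reload. Names are illustrative only.
type InterestLevel = 0 | 1 | 2 | 3 | 4 | 5;

// Save the chosen level, keyed by whatever ID we managed to infer for the row.
async function saveInterest(gameId: string, level: InterestLevel): Promise<void> {
  await chrome.storage.local.set({ [gameId]: level });
}

// On load (and whenever the page re-renders), look up and re-apply the saved level.
async function restoreInterest(row: HTMLElement, gameId: string): Promise<void> {
  const stored = await chrome.storage.local.get(gameId);
  const level = stored[gameId] as InterestLevel | undefined;
  if (level !== undefined) {
    row.dataset.interestLevel = String(level); // styling turns this into the dots / ring
  }
}
```

The storage part is easy; the open question is what to use as that gameId.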
And really right within that is the
whole crux of the system. That's the
most difficult part because a system
like this doesn't give you as an
extension developer much to work with.
This row or this game is not actually
identified in any meaningful way. And
all the extension has to work with is
basically the HTML, if you will, that's been rendered here to draw this row. That HTML can be as dynamic as
the team wants it to be. It doesn't have
to say anything unique about this game.
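Because that HTML can re-render at any time, an extension like this typically has to watch the page and re-apply its markers. Here's a hedged sketch of that general technique, reusing the hypothetical helpers from the sketch above; the row selector is a placeholder, not YouTube TV's real markup:

```typescript
// Hypothetical sketch: re-apply saved interest markers whenever YouTube TV
// re-renders its listing. "[data-game-row]" stands in for the real row selector.
const observer = new MutationObserver(() => {
  document.querySelectorAll<HTMLElement>("[data-game-row]").forEach((row) => {
    const gameId = inferGameId(row);    // sketched further down; the ID has to be inferred
    void restoreInterest(row, gameId);  // re-draw dots / ring from storage
  });
});

observer.observe(document.body, { childList: true, subtree: true });
```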
The main problem you'll keep hearing about is being able to identify each one of these rows. Getting an ID to use as a source of truth is almost impossible. So the extension is really doing a lot of work to try to infer what game this is from the names of the teams that are playing, the date it's playing, the thumbnail URL, and the URL it will launch if you kick off the game. It's got a lot of different mitigation strategies to try to get down to an ID so that this yellow ring doesn't accidentally start showing up against a different game. That's really the crux.
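Here's a rough sketch of what that kind of ID inference can look like, using the signals he just listed. The selectors and the fallback order are my assumptions for illustration; the real extension's strategies are more involved:

```typescript
// Hypothetical sketch: derive a stable-ish ID for a game row by combining weak
// signals, since YouTube TV exposes no real identifier for the row.
function inferGameId(row: HTMLElement): string {
  const teams = row.querySelector("h3")?.textContent?.trim() ?? "";    // e.g. team names in the title
  const date  = row.querySelector("time")?.textContent?.trim() ?? "";  // air date, if present
  const thumb = row.querySelector("img")?.getAttribute("src") ?? "";   // thumbnail URL
  const watch = row.querySelector("a")?.getAttribute("href") ?? "";    // URL the row launches

  // Prefer the strongest signal available, then fall back to weaker combinations,
  // so the yellow ring never latches onto the wrong game.
  if (watch) return `url:${watch}`;
  if (teams && date) return `teams:${teams}|${date}`;
  return `thumb:${thumb}`;
}
```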
And by the way, like I said, this is an extension I wrote for my own use. I've been using it for many years, and no one else has ever seen it, so I really didn't care how it was written at the time. If I needed changes, I'd make a quick change. It is not architected all that well. So what I decided to do is give this to these models, because it really represents a good example of working code that cannot be disrupted but absolutely has some room for growth. Can each model identify what it could be better at, or where the risks are? That's the real challenge here.
Okay, let me show you just a little bit of the methodology, and then we'll get into some of the results. What I'm sharing here is a folder called eval, and you'll see there are five steps in it. The first step is the agent's job, and then I use the evaluator. The agent, whether it's Gemini or GPT-5.2 or whatever it might be, is told to go execute this file, 01 agent analysis. It reads that, and one of the things it knows it needs to look into is this folder of instructions. We'll look at that in a second. Once that step is finished, I then go over to Claude Code in planning mode with Opus 4.5, so all of these eval steps are done by the exact same model, intentionally, so that the comparison between all the parts is the same. For each step you can see the agent does its work, and then we use the eval agent to analyze its output and give us results.
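Laid out on disk, the setup is something like the following. This is my reconstruction from what's shown on screen, so treat the exact file names as placeholders:

```
eval/
  01-agent-analysis.md   # step 1: the model under test analyzes the codebase
  ...                    # four more steps, alternating agent work and evaluation
  instructions/
    START_HERE.md        # per-step orientation: what you're doing, your responsibilities
    PRD.md               # the big description of the extension and its objectives
```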
The agent analysis, which is that very first step, says: go take a look, see what you can find. It has a "start here" document, and each one of these steps has a "start here" that acts as a kind of orientation: here's what you're doing in this step, here are the responsibilities you have, and so on. What you're really going to want to read is this PRD, and as you can see, it's a pretty big PRD where I describe everything: the objective of the system and what it's trying to do. I was trying to be as fair as possible, as if I had given this as a real task, not just "go learn everything cold," but "here's what the system is trying to do; go take a look at it and figure these things out," like separation of concerns, fragile string manipulation, and so on. So I was being pretty predictive about what they might be looking for and what they might be able to find. But I'm also very clear about going and finding things that are not described. Okay,
enough of kind of the what are we doing?
What are we asking these systems to do?
Let's take a look at some of their
findings. Okay, so let's take a look at
the results that these put out. Now
recall what I'm doing is I'm asking the
system to go do the analysis, determine
everything that I've kind of described
in that big PRD, but also go find as
much as it can that I might not have
described. And so we're looking for some
novel findings in these outputs. I am
also very clearly saying: I want you to take those documents and create what you're seeing here, a single-page application. You might hear me say SPA.
This is kind of that context map. I want
to understand not just what your
findings are, how you arrived at them,
what you considered, what really the
risks are, but I need you to talk to me
in two ways. I need you to talk to me a
little bit technically so that I know
what you're doing. And I need you to
talk to me as if I'm just kind of a
decision leader that might not have full
awareness of what's going on in this
system. All right. So, I want to look at
these two outputs briefly. We can't go
through them. They're super dense. It's
not worth actually going through them,
but I will show you the differences and
kind of their approaches to some degree
and the scores that the evaluator gave
to each. It's pretty simple. So, the
first one we're looking at, this is the
non-Codex model; this is the old one from a couple weeks ago, which was top of the heap for OpenAI. And what
it's trying to do is tell us
specifically, this is what it's asked to
do: tell us about the application we're working on. That's this
extension. It adds five dots for
interest that you can use on YouTube
TV's browse blah blah blah things we've
seen. You'll also see that it has an
overview of the architecture, how it lays out, what's important, which files matter in that respect, and which technical aspects are needed. And then at the bottom down
here, you'll see all of the proposed
changes. This is basically the stuff
that it found. I will say, just for simplicity's sake, they roughly found the
same things. As I've mentioned, these
things really performed the same. And
so, we won't necessarily go into these
except to say they're intended to tell us what is wrong. Take this one: restrict the content script to YouTube TV hosts. Okay, good. It gives us a little bit of technical detail here: the match pattern is kind of wide open, and you should scope it to just the tv.youtube.com domain. Okay, that makes sense. It gives us some details, but it's a little bit light, and you have to be pretty technical to understand what it's saying. So I wouldn't say this meets the bar for someone who's just a thought leader trying to decide whether or not something is important. It does describe the risks, why you would do it, and what needs to change, but it takes a lot of ingesting to figure out what it's talking about.
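For anyone who hasn't lived in extension manifests, the change being recommended is roughly this kind of scoping in manifest.json. The "before" being a broad pattern like "<all_urls>" is my assumption of what "very open" means here, and content.js is a placeholder name:

```json
{
  "content_scripts": [
    {
      "matches": ["https://tv.youtube.com/*"],
      "js": ["content.js"]
    }
  ]
}
```

The scoping itself is a one-line change; the critique here is about how well each model explains why it matters.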
Up at the top here, it talks about what this is and then the risks themselves. In other words, if we don't do some of these things, what might go wrong, or what do we need to protect against? And then, what is the plan we're taking on? Okay, so this GPT-5.2 non-Codex version, I think, does a pretty darn good job of communicating what it's trying to do, what it found, and the advice it's giving. So let's take a quick look at the Codex version. All right. And so here we are in the same system, this time from the 5.2 Codex standpoint. And
it's trying to do exactly the same
things because it was given the same
requirement to build an SPA of the
findings that it has. Again, you can see
that it tells us a little bit about the feature itself, what it's trying to
do, the major aspects of that feature.
The system flow is that architectural flow, to some degree: how the information, or the different aspects of the program, come into play and when they're important. It tells us a little bit about why it's
fragile. And then there are some
constraints that are non-negotiable. It
goes through the key risks that it found
and then each one of the changes. This
again is that "scope it to YouTube TV and don't be as open" change. If we look in, it has a little bit of a description, but it's not exactly clear what's going on here. It did a lot of work to try to
come up with these bars and a
representation, but I really have to be
technical to understand what it's
talking about here. It definitely is not
telling me any code information, code
lines. It's not giving me hints of what
it might change or why or what the risks
are if we don't change them. So, it's
missing quite a bit. Now let's look at the scores these two systems got. This is the evaluator scoring. On the left is the non-Codex version; that's the old version. And on the right is the Codex version. They look different because there's no template or prescription for how this stuff comes
out. But what you can see is, okay, we
get a 27 on one side and a 26 on the
other. So it's within plus or minus one, which I would say is basically the same. But it's worth saying that the
system comprehension for the old model
was a little bit off. It says it spent
more time on the what than deeply
explaining the why. And on the right, it
doesn't tell us the trade-offs. As I was
mentioning, it doesn't tell us what's
the risk if we don't do something or is
what you're advising more technical than
we need. Does it end up bloating the
system? It's just a simple extension.
Does it need it? It also gets dinged. Unlike this one over here, which gets a four on that communication step, the document we were just looking at only gets a three. And that's a pretty low score across all of the models that have delivered. And I believe these numbers actually hold up a little bit. The
intention isn't necessarily the numbers.
Don't get too hung up on that. This is
pretty subjective stuff on a pretty
small surface area. But at the same
time, looking at the two documents we
were just looking at, I do agree one
does a much better job of communicating, even though they largely found very nearly the same list. Okay, so here I'm
going to share this one with you just
briefly. I won't go through everything
it's doing, but this is our baseline.
This is Opus 4.5 in planning mode. This is the SPA, or the context map, that it came back with to tell me: what system are we looking at, what is it supposed to do, what does it do well, what did I think about, what risks does it have, what problems does it have that I think we should address, all of those kinds of things. And you can see it's telling a
story. It's even a thematic story in
this case, kind of a going-to-the-hospital theme, to some degree. So, our patient is a Chrome extension that solves real problems for NFL fans on YouTube TV. Imagine browsing YouTube TV's NFL selection: you see a wall of games, some you're excited about, others you couldn't care less about, but they all look the same.
There's no way to mark your preference.
Okay, so it is obviously getting
that concept of telling a story. Someone
that doesn't understand the surface area
would fully understand this, and in fact it goes so far as to create a dynamic interface here showing exactly how the thing works, to kind of describe it to somebody. So I think this is a great example of what I was looking for, and very explicitly what I was asking for. And it goes through all of the different
parts. Here is how the application lays
itself out. It even mentions that there's a fundamental challenge: YouTube TV provides no official API and no stable identifiers for games. It's like trying to recognize people by their outfits when they keep changing clothes every day. And it's really, honestly trying to tell us what's difficult, all the way down to the bottom, where you can see the different strategies: here's the class it's in and the lines you would care about, and here are the different mitigation or identification strategies that are used.
So, a really good job of trying to
slowly walk us into what the challenge
is, the different names of them and what
they're useful for. And then it goes in and finds those different kinds of risks that we were seeing in the others. Each one of these is a technical risk. It gives us references to the files and the lines it was found on. It even highlights the areas that it
thinks are actual problems and how to
fix them. This is fantastic. This is the
way that we need to see what models are
doing on the inside of things. Again, that later video is very exciting and I definitely want to talk about this, but I just wanted to show you: this is best in
breed. I want to be communicated to this
way, not just a table of bars that say
this one's important and this one's not
and trust us. Okay? And I just have to
very briefly share with you the model
that fell on its face. It earned the award for the most surprising finding of just about this whole effort, and that's Gemini. What
I'm sharing with you is the Gemini
output. I am not going to go through it.
You're going to just have to take my
word for it. This was a train wreck. I
ran it many times, four or five times, and it always did this. Remember the big PRD that we put in, which said, "Go find things similar to this and look for other stuff." It found only some of the things that were mentioned in the PRD, and it literally did not find a single thing that was not mentioned in the PRD, unlike every single other model. All it shares here is these three problems, all of which were named inside the PRD itself. It also did not find many of the other challenges, like the way we're dealing with the three different mitigations for IDs, things that were major cruxes for the rest of the models, and they all found those kinds of things. It is so lightweight. It's not
worth going through. But I will say
this: I find the Gemini 3 Pro model a good model to code with, so this is just shocking. It is worth saying that its implementation, when it went and did the work, was only against this surface area. Not surprisingly, what it finds is what it's going to fix, and what we asked it to do is go find what it can find, and this is what it found. So I just want to put this warning out there that
you need to keep an eye on the Gemini 3
model for a minute to make sure that
it's really finding everything it
should. It works very well against the
area that it's actually finding. That's
not the problem. But this was shocking
the difference between what it found and
what all of the other models found. All
right. I I did say it up front. The
there's no real big difference here. I
will call out cyber security again. It's
one of the things they they called out
very clearly in this model release. if
you have any interest in security, those
kinds of things, the codeex model is one
you want to kind of take a look at and
run some tests against. And also,
Windows tool calling was very
specifically called out and that can't
be understated. Anything that can do a
better job of tool calling is going to
be helpful. And I think, had I used these models on something with a much longer time horizon (this is a pretty small change), I would have seen a bigger delta between them, and the Codex model probably would have shown itself to be both more efficient and maybe a little bit faster. That better tool calling really does matter. If it's a coin flip for you and it really doesn't matter either way, definitely use the Codex model. That's what it's intended for, so it's going to be better at it. But there you have my real findings: they're basically the same thing at this point, still, largely. All
right, with that I will say keep an eye
out for that next one. I'm super excited
about that next video, which is really
about that context mapping and where I
think that we're going to be really
needing these models to be able to tell
us what they're thinking in a much more
informed way than just simply here's my
answer and we need to move forward with
it and we're just going to have to trust
it. I hope you're interested in that.
I'm definitely interested in that.
Thanks for coming along for the ride on
this one.
I ran GPT-5.2 Codex through the same challenge as six other models. Identical codebase. Identical prompts. Identical evaluators. The results surprised me, but not in the way OpenAI probably hoped.

OpenAI says GPT-5.2 Codex is their best coding model yet: reliable tool calling, native compaction, efficient reasoning. So I built a repeatable evaluation framework and put it head-to-head against GPT-5.2 (non-Codex), Claude Opus 4.5, and Gemini 3 Pro. The test? A real Chrome extension I wrote years ago: working code with real technical debt that needed analysis and refactoring recommendations.

Each model had to do two things: analyze the codebase and identify problems, then communicate those findings in both technical docs and stakeholder-friendly reports. I used Claude Opus 4.5 as the evaluator across all models to keep scoring consistent. The findings reveal something important about where these models actually differ, and it's not where you'd expect.

If you're evaluating AI coding assistants for real development work, this gives you actual comparative data. Developers working with legacy codebases, engineers curious about how different models handle complexity analysis, and anyone deciding between OpenAI, Anthropic, or Google models will find this useful. Whether you're deep into AI-assisted development or just getting started, there's signal here on what matters and what doesn't.

#GPT52Codex #ClaudeCode #AICoding #OpenAI #Anthropic

00:00 - Intro
02:35 - The Challenge
05:04 - Methodology
06:45 - Analysis Overview
07:51 - 5.2 Analysis
09:50 - 5.2 Codex Analysis
11:01 - Analysis Scores
12:17 - Best in class
14:43 - The Collapse
16:22 - Conclusion