Remember back when o4-mini dropped and I
made the bold claim that OpenAI suddenly
seemed to care about developers and was
starting to wage war against all the
other model companies? Well, it's gone a
heck of a lot further than I ever would
have expected. Today, OpenAI dropped yet
another new product for us as devs. And
it's not just a product. It's a model.
It's a specific model for us to use for
the work that we do every day. And since
OpenAI cares a lot about developers
they took the time to reach out to me
and a handful of other developers to
give it a quick shot a few days ahead of
launch so we could see what it's good
at, bad at, and more. And man, have I
had a lot of fun playing with it. Can't
wait to show you guys all the things
that it does well, but more importantly
all the things it does wrong. But first
we need to know what this is. As I said
before, it is a new model, and obviously
that means it needs a new name, which is
why I'm super excited to reveal... Codex. Guys. Really? Again? We're on, what, ten Codexes now? We have the Copilot model from 2023. We have the CLI. We have the web interface. We have the extension. And now we have this. God, guys, you can come up with names. It's not that hard. Just ask GPT-5. Look at that.
You got some free names. I'm using your
service. Just name it anything else next
time, please. That said, the numbers are
looking really, really good. On their
code refactoring tests, it's getting way
better numbers. And on SWE-bench, it's
performing meaningfully better, too. Not
quite as big a gap, but the numbers are
really good. But that's not where this
gets interesting. The model behaves
fundamentally differently. And while
it's not available on API yet, it should
be soon. The thing that's really cool
here is how deeply tied this model is to
the tools that we use it with
specifically the Codex CLI and the
Codex web interface. Oh, and also the
Codex extension. Thanks for making this
so easy for me, OpenAI. I cannot wait to
show you guys all the cool things and
all the broken things about this new
model. But first, since OpenAI did not
pay me, we do have bills to cover. A
quick word from today's sponsor and then
we'll dive right in. There's one
technology that is more inevitable than
anything else in all of software. And
no, it's not AI. Let's be real, it's
JavaScript. There is no escaping it. No
matter how good our HTML tooling gets
you will always need the ability to run
some real JavaScript on real web pages.
Especially if you're building AI agents
that are browsing the web for you to get
information. And that's why today's
sponsor is so helpful. Browserbase is
the best way to set up a browser in the
cloud. If you need an agent to access a
website or you just need to go get a
screenshot of some inventory somewhere.
If you need to control a browser with
code, your options are suffer or
Browserbase. And you should probably
not pick suffering anymore. Tons of
other companies have made the move
already, like Perplexity and Vercel. Yes, really. You would expect a company like Vercel to have all of these things
handled. And to an extent they do, but
when they wanted to introduce the
ability for tools like v0 to go hunt
across the web to find specific things
their existing tooling just wasn't there
for it. So they made the move to
browserbase and they have been very very
happy. In particular, the tools that
existed were not reliable enough. The
CDN challenges were blocking them from
accessing various things. The quality of
the data was absolute garbage, and the limited parallelization was a huge problem because each of these instances needs a real processor on a real computer doing real browsing. If you're curious how simple it is, here you go. They already set up
Playwright on this browser window, so this is all happening in the browser. We have window.playwright, and we call chromium.connectOverCDP with the connection string, which is something you just get from their dashboard and copy-paste. And now we have an actual context for the browser. We have a page that we can do things to. page.goto a URL, and now you're browsing the web. AI already knows how to use Puppeteer, but does your infra have a good, reliable way for it to do that?
It probably doesn't because you're not
using browserbase yet. Thankfully
that's an easy thing to fix. Check them
out today at soydev.link/browserbase.
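For reference, here's roughly what that flow looks like from Node with Playwright. This is a minimal sketch, assuming the playwright-core package, with a placeholder connection string standing in for the one you'd copy from the Browserbase dashboard:

```ts
// Minimal sketch: drive a remote browser over CDP with Playwright.
// The connection string below is a placeholder; Browserbase gives you the real one.
import { chromium } from "playwright-core";

const connectionString = "wss://YOUR-CONNECTION-STRING"; // placeholder from the dashboard
const browser = await chromium.connectOverCDP(connectionString);

// Reuse the remote browser's existing context and page if it already has one.
const context = browser.contexts()[0] ?? (await browser.newContext());
const page = context.pages()[0] ?? (await context.newPage());

await page.goto("https://example.com");
console.log(await page.title());

await browser.close();
```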
So, as I mentioned before, these two
benchmarks showed some pretty cool
numbers, but that's not what I'm most
excited about. If you know what I've
been complaining about with GPT-5, other
than the fact that it was broken in half
the surface areas that we used to
interface with it, the thing I've been
most annoyed about by far is how slow
and token hungry the model tends to be
for developer tasks. And that seems to be one of the squarely pointed focuses of this change. So, when I think
about the tasks that I give to different
LLMs, I think a lot about how hard is
this task to solve? Like how many tokens
does it need? How complex is the
problem? Does it need internet access?
Does it need all these different things?
And what I found is that for different
tasks, there's quite a bit of variety.
If I'm asking a model a question and we
have a spectrum where on the left here
it's the simplest and on the right here
it is the most complex. So the most
tokens, I'll just say most tokens. I'll
change simplest to least tokens. If I
ask a model something like count to 10
most models are going to not use too
many tokens for a task like this. But if
you ask it for something more complex
like to write code in 15 different
languages or to count the number of Rs
in the word strawberry, it's going to
take a good bit more tokens. You'll end
up somewhere over here instead. But if
we think about the range of how many
output tokens are necessary for these
tasks, something like this is going to
be frankly 10 tokens because it's
counting to 10 and each number will
probably be one token. But something
more complex, like counting letters or actually writing code in a chat interface, can get as high as
something like 100,000 tokens. And while
this range seems pretty big, the range
gets a hell of a lot bigger for code
tasks where some code tasks might only
need like 100 tokens and some other
code tasks much bigger ones might need a
million tokens. And while this might
seem like an exaggeration, I really wish
it was. Just my early playing around was
able to get up to 628k tokens used and
I've managed to break a million many a
time doing one-off, playing-around code
tasks in the past since I currently
can't really code much. Uh yeah, I've
been doing a bit more vibe coding lately
and this has been very fun for me to
play with because of these new
characteristics. One of the things I've
been most frustrated with with most of
the AI coding tools is that they are
slow. I am a very fast typer. I go like
160 to 170 words per minute when I have
both of my hands functioning properly. I
have not had that since my surgery and I
miss it dearly. I can't even press the
space bar with my left hand right now. I
can barely press Command, Option, and Control. I can't even copy-paste my way
through stuff right now. It's been
rough. So I decided to give this a spin
as a true vibe coder would, trying my best to avoid reading code, and I did really well for a bit. We'll go over the
project in a second, but I do want to
show the thing that brought me here
which is the range of how many tokens
are being used. When I was using other
models like GPT-5 in standard or high configuration, or if I was using models like Claude or Gemini 2.5 Pro, I found
that the minimum number of tokens for a
task was still pretty high. And even if
the models were fast, they still felt
slow because they were generating so
many tokens to complete basic work. I
have a bunch of videos where I talk
about this. In particular, the video
about the pricing changes in Cursor went
really in-depth on this and all the
things that made it tough. But that's
what makes these changes fun: they're trying specifically to handle the small tasks with small amounts of tokens and the big tasks with big amounts of tokens. On OpenAI employee traffic, we
see that for the bottom 10% of user turns, sorted by model-generated tokens (so they're sorting the tasks that employees gave the model by how many tokens each task used), GPT-5-Codex uses 93.7% fewer tokens than GPT-5 did. So on these simple tasks, it uses roughly a sixteenth as many tokens. That's an insane drop. But for the top 10%, it can actually use significantly more, spending twice as long reasoning, editing, and testing code, as well as iterating in general. That's really cool
to see. The gap in between these numbers
is significant, and the results speak
for themselves. So I'm going to run the
same prompt twice: once with Codex using standard GPT-5 and once with Codex using the new GPT-5-Codex. Of course, we're doing
the classic image studio. So, I will
send this here and separately I'm going
to spin this up with the new GPT-5-Codex version. The thing I'm particularly
curious about is how many tokens does it
use for this task. This task should be, in quotes, "relatively simple," because it's just styling the page to look good and making a mock application. I was always kind of concerned about how many tokens would be used when I did this with GPT-5. But right now, even though I started the GPT-5 one earlier, it's used fewer tokens than the GPT-5-Codex version. Interesting; this might be one of the complex tasks. While those are
running, I'll show you guys the much
deeper testing I've personally been
doing. So, when I first tried building
this project, I got a decent looking UI
out of it. It's fine at that. It was
different from usual. I can go back and
show it in a bit. It's not that
important. But then I asked it to
actually implement the service because I
tried that before and had varying luck.
This time I told it to use Convex and fal, and it got decently far. It did run into some problems, though. It tried too hard to use Next.js, and more importantly, it imported everything from convex/schema, which was weird. I don't know if this is a thing Convex used to do, but it's definitely not a thing they do now; it has to be convex/server. So I had to go make this change myself. After I made that change, it built and could deploy on Convex, but the code wouldn't actually run because of errors in how it was wiring things up between the client and server actions, the web interface, and Convex, building a complex relationship between them that wasn't necessary.
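For reference, that schema import fix was tiny. Here's a minimal sketch of what a Convex schema file imports; the table and fields are made up for illustration, and the point is just that defineSchema and defineTable come from convex/server, not convex/schema:

```ts
// convex/schema.ts — illustrative sketch, not the code the model generated.
// The helpers live in "convex/server"; the model kept importing from "convex/schema".
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
  // Hypothetical table for an image studio app.
  images: defineTable({
    prompt: v.string(),
    url: v.string(),
    createdBy: v.optional(v.string()),
  }),
});
```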
I told it up front: you did this wrong, try again. But this time, I paid
attention to what it was doing. I did
give it search access, which I didn't
realize until recently you have to do with a command-line argument. You have to pass --search for it to have the ability to search the web when you use Codex, the CLI. So when I did that, it
was able to search and what I found is
that it kind of sucks really hard at
search. Let me show you guys some of the
queries that it made. Here we go. It searched for "fal client import fal from fal client subscribe example."
This was because it had an error with
how it was importing and configuring
fal initially. It just did it entirely
wrong even though you don't need to
because I already had the environment
variable set up properly. So that was
quite annoying. Here it searched for
"Convex Next.js setup guide 2025 official documentation." That was a good search,
but that only happened because I said
this is not the correct way to use
Convex. Try again. Follow the official
Next.js setup guide. I'm more and more
seeing the value in templates. I even
went full vibe code here and just pasted
in errors and told it to try and fix
them. And it didn't. Okay, here's what I
was looking for. This pile of just
absolute junk searches.
"fal-ai/flux-pro/v1.1-ultra API example fal.subscribe prompt aspect ratio guidance scale." I did not ask for any of this. I do not know why it was going this hard here. Also: "convex react useQuery context provider example 2025." It sucks at search. I am amazed at how bad it is at search. It's kind of annoying.
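For context, the thing it was searching so badly for is genuinely small. Here's a minimal sketch of calling a fal model with the @fal-ai/client package; the model ID, prompt, and parameters are illustrative placeholders, not what my project actually uses:

```ts
// Minimal sketch of calling a fal-hosted model; IDs and inputs are illustrative.
import { fal } from "@fal-ai/client";

// Uses the FAL_KEY environment variable; configured explicitly here for clarity.
fal.config({ credentials: process.env.FAL_KEY });

const result = await fal.subscribe("fal-ai/flux-pro/v1.1-ultra", {
  input: {
    prompt: "a cozy reading nook, soft morning light",
    aspect_ratio: "16:9",
  },
});

console.log(result.data);
```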
Cool. These both finished. This version
used 23.6K tokens, and this was GPT-5 standard. I didn't even have it on high; I just had it on standard. And then this was GPT-5-Codex, which was 27.8K. I just realized I should have done a test on high. I'll test that in a minute. Let me just bun run dev for these and see how they look. And here we go. This is the version that GPT-5 medium made. And here's the version that GPT-5-Codex
made. Normally I wouldn't read too much
into the UI differences here, but a lot
of the things that are different have
been consistent through various runs.
Now it does definitely behave
differently with UI. It still looks
good, but I've noticed more of these
types of errors where like things are
clipping into each other where you have
these weird layers in the UI. Just bugs
that I didn't see as much of when I was
using standard GPT-5. Not sure what that
is. Hopefully, it'll be fixed, but that
does kind of break my heart a little bit
because one of the things I loved about GPT-5 was how good it was at UI. It'd be kind of annoying if we have to switch back to standard GPT-5 for UI and then go to GPT-5-Codex for other things. To be determined. Here's the version I made
when I was actually building a working
demo using all of this. And it took a few renditions to get the UI in a
state where I didn't hate it. I'll show
you what it looked like when I first
started. Quick, here we are. This is
what it looked like when I first spun it
up. Way too big of an area up top here.
This looks okay. Too much text. I don't
know what it's doing there. Where things
fall apart is near the bottom here. This
is a mess. I don't know what happened. I
really don't.
This is not the GPT-5 I know and love. As
I said, when I told it to design in a
different direction, it was able to
handle it fine. Even the worst things I
generated from Codex look better than the best things I could generate from Claude when it comes to from-scratch UI stuff. All of that said, I still
recommend you just go grab a screenshot
of something that looks like what you
want and use that as the starting point.
This is more meant to demonstrate the
native behavior for UI that these models
have. As I was mentioning before though
was not super impressed initially when I
told it to just kind of go off and work
on things. I know that none of these
tools can just invent engineering for
you. Like you need to guide them.
They're kind of like a co-orker. In
fact, that's how this was pitched to me
by OpenAI is they really wanted this new
model to feel like a good co-orker that
might not know everything about the
codebase yet or exactly how to work, but
could be instructed to go do a thing and
work alongside you on that thing. I definitely felt that a lot more than with the other models I've played with, but it still can commit really hard to things that are incorrect. I am also just really unimpressed with the search in Codex. Not because Codex's search itself is bad, but because Codex the model is bad at doing search with the
CLI. Also, there's so many little UX
things that are still screwed up. Like
when I was working on setting this up
earlier, the way agent internet access
works was interesting. I am thankful
they put this callout here: "Enabling internet access exposes your environment to security risks." Yeah, real thing. But
the fact that I have to manually turn on internet access and switch this to "all," and there isn't a process where it requests access when it needs it instead of doing it this way? Not great. And even then,
like the whole process for creating an
environment, it's a bit much. It's kind
of slow and tedious. But now that I've
done this, I should be able to go in
here and change where this goes. Oh, it
doesn't auto-update here. I have to
refresh. That's not even enough. I have
to reload the window. And hopefully now
when I go to the Codex tab in my
editor, I will see the environment that
I just made. Interesting that this is under... oh, no. I see what they're doing here. This is awful UI. ping.gg/T3 Chat. Hover over it, and now it gives me other
options here. It's not clear that this
is to select a different environment.
This looks like they are within
ping.gg/T3 chat, which they're not. Look
here. Use local changes. No, I want to
get back on main. It also used npm
which I'm annoyed with. It should be
able to figure out that I don't use
that, but it didn't even ask. It just
kind of went with the thing that I would
consider wrong. Also, this use local
changes thing breaks so much of the UI.
Make sure you switch over to main. Even though my main branch is the same as main, it defaults there to local changes. There's just a lot of little UX things they have to figure out, and they're admittedly annoying. It means I just use the CLI
most of the time or the web interface. I
don't use the extension much for these
reasons. So I'm going to tell it to add
the Gemini image models. Add the Gemini
image models for editing and generating
images through fal. And now I've spun
up a cloud instance that is taking
advantage of the same new model, same
CLI. They're trying really hard to
standardize, like, the Codex system and interface across the different platforms. And they're also supposedly planning on putting out an SDK, which could be really cool. It would mean that anyone could spin up their own Codex-like tool in the cloud. It doesn't
really seem like they want to win this
by making something no one else can use.
It seems quite like the opposite. They
want their models and their protocols to
power how we do agentic coding, which is
why they're open sourcing pretty much
everything around it. It is kind of crazy, if you think about it, the amount of money and time being spent on Codex and how it is powering so many other things, and they just give it out for free, MIT licensed, on their GitHub.
Actually, I might be wrong about the
license. I think they were more generous
than that. I'm correct. It was Apache
2.0. That's kind of nuts for them to do
something like that. And they're
literally merging things as we speak, 2 minutes ago. And I'm recording this at
9:00 p.m. on a Sunday. So that says a
lot about how hard these guys are
shipping. It feels a lot more like a
small startup than it feels like this
giant evil mega corporation. I know
that's not the vibe a lot of you guys
have. I understand. But I insist these
guys have been awesome to work with and
they're totally okay with the fact that
I'm sitting here half roasting them as I
go through all of this. Almost forgot to mention: a big part of why they're probably going the open-source angle here is that weird agreement they have with Microsoft. The one where, you know, they get all of the IP until AGI is reached. I've always found that to be a weird agreement, and in particular here it hurts, similar to how it hurt in the Windsurf acquisition: if they don't open source this, Microsoft still gets access to all of it and can do whatever they want with it. But by open sourcing it, everyone else gets access to it too. So this could hypothetically be a workaround for that deal. Can't say for sure; this is pure speculation, just a possibility I think is worth considering, is all. Let's see how this does in the
cloud. If I click this, will it bring me
to this task? It will. It will. We'll
see how that handles things. In
particular, it got really confused with
the setup for Convex, specifically the, like, environment variable management stuff, because you don't have to manage environment variables with Convex. Just run the dev command and it will tell you to sign in and you're good. I have no
idea how it's going to handle that in
the cloud. We will see momentarily.
Apparently something opened up the ChatGPT app when I did these things. That's
kind of silly and annoying. Anyways
more on the tokenization stuff.
Apparently, they were comparing medium reasoning between GPT-5 and GPT-5-Codex. You might think that's suspicious. I don't, because I personally almost never use high. So GPT-5-Codex at medium, on the 10th percentile, uses 93.7% fewer tokens, but on the 90th percentile it uses over double the number of tokens. I really like this, the fact that it's so flexible based on the different types of tasks we do. Very
good sign. It's been trained
specifically for conducting code reviews
and finding critical flaws when
reviewing. It navigates your codebase
reasons through dependencies, and it
runs your code and tests in order to
validate correctness. They did talk a
lot about the code review side of things
in the call that I did with them. They
were really excited about the fact that
it isn't just taking your code and
looking at the diffs. It's actually
running the code in a container in the
cloud to test it and find bugs.
Potentially really, really powerful. I'm
still using CodeRabbit if I'm being
honest with you guys, but this is
something that I would actually consider
using. It's a good pitch. Seems cool.
Have not evaluated it at all yet. They
test it on actual open source repos. For
each commit, experienced software
engineers evaluated review comments for
correctness and importance. We find the
comments by GPT-5-Codex are less likely to be incorrect or unimportant, reserving more user attention for
critical issues. Good. I will say that
through most of the AI code review tools
I've used, they like to spit out
nonsense and things that aren't that
valuable. I am thankful that both CodeRabbit and now hopefully GPT-5-Codex will make that better. And also, the tools that aren't as reliable can use GPT-5-Codex and, fingers crossed, they'll also have better reviews: fewer comments per PR, more high-impact comments, fewer incorrect comments, about a third as many incorrect comments. Very good sign.
It also is much better at mobile sites.
Very fun. Can look at images or
screenshots you provide as input, visually inspect progress, and display
screenshots of its work to you. That's
really, really cool. I have not played
with that just yet, but the fact that
the new Codex web interface is capable
of giving you screenshots of the work as
it's going. Very good sign. I like this
a lot.
They rebuilt the Codex CLI to be more
agentic. Should have always been. You
can now attach and share images. I do
not like sharing images in my terminal.
I don't know why people like this, but
you do you. Do it right in the CLI.
Super cool. Now has to-do lists. I've
noticed it using the to-do list a lot
more. Search being something you have to enable via command-line arguments when you launch is still annoying. I'm sure
they'll change that in the near future.
Oh, fun quick tangent on the pricing
side. When they set me up for using
this, they used my company account
which doesn't have a subscription. Still use T3 Chat, by the way. But as such, my $200-a-month subscription wasn't working.
So, I went and signed up for the $20
tier on my company account. Right as I
started prompting, they went and fixed
it and put it on the right account. But
I was going pretty hard using the $20
tier and wasn't able to hit any limits.
I'm sure you will with heavy enough
usage over long amounts of time. But it
does seem like the Codex limits on the $20 and $200 tiers of OpenAI's ChatGPT plans are actually quite generous, which
hurts because this is yet another reason
to not use T3 Chat. That said, if you
want a better chat interface, use code CODEX to get your first month for $1 on T3 Chat, and every other month will be
eight bucks. Anyways, let's see how that
cloud interface is doing. Oh, nothing.
Yeah, this has been my experience with
the cloud interface. It's just kind of
half broken. Oh, looks like it made changes. fal-ai/gemini-flash, gemini-flash-edit? I don't think those are the names of those models. Yeah, it's actually fal-ai/gemini-25-flash-edit.
So, it didn't search. It didn't check
the web. It just hallucinated the names
of those models.
I still think the cloud side is a bit
bunk if I'm going to be honest with you
guys. I haven't had a good experience
with any of the cloud background agent things just yet. But the amount of problems this seems to be running into just doing basic checks for things: ripgrepping, looking for files and names of things. I told it to add the models; that means they're not there. It should just be going to the web and finding them, not ripgrepping through node_modules. That's a choice. Yeah, this is where my
skepticism comes in. I'm recording this
part after I finished filming earlier
because I want to mention a specific
thing. I have had the "Starting Codex" notification on my phone for the past hour and a half, even though, according to this, it's done and is ready to go open a PR.
Their live notification system is
entirely broken. It has been since it
started. It seems really cool. It
doesn't work at all. And if I can find
an easy way to turn it off, I'm
absolutely going to because at this
point in time, it does not function.
That's kind of my problem with the ecosystem. It just feels like the pieces
are getting there, but the puzzle isn't
yet. And all the parts just kind of
break once you start using them more
together. And that cohesion is so
important to do something like this. And
it's just not there yet. I find that the GPT-5 models are still the best experience I've had doing agentic code, but I still find that the Codex tool set, in particular the web interface and the extension in my editor, are among the more clunky options. I personally still take things like opencode, Kilo Code, and all these other agentic tools with GPT over the Codex ecosystem. Even though the CLI is improving meaningfully, I'm not seeing those improvements on the web version, and I'm definitely not seeing them in the VS Code editor extension. This
is the problem when you use one name for
everything, though. Now, if people hear the name Codex and they go to try it, whichever version they're trying, be it the editor version, the web version, the CLI version, or just the model directly, is going to be how they judge it. And if
I'm using the CLI and the model, and
you're using the web interface and the
extension, we're going to have really
different experiences and that's going
to result in us having a different vibe.
I've already been through this with
OpenAI, as you guys probably remember, with the GPT-5 launch, where they didn't label things correctly. They made it too hard to know what you are and aren't getting, and then they screwed up the auto router, and now everyone thinks I'm insane. Thankfully, we've all now seen the light and know that GPT-5 is really good at code. But Codex version 12 is
not going to help a whole lot. So that's
my feedback. I'm sorry that you had to
learn this way. OpenAI, I'm not sending
this video to them ahead of time to
approve and hopefully I won't get in
trouble for that. You need to fix your
web interface. You need to fix the
extension. You need to name the model
something else or name these other
surfaces something else because people
are going to be confused and frustrated.
And while I appreciate the goal of unifying everything, because it should hopefully reduce confusion long-term, right now it's just adding more. And the fact that I've had as janky an experience as I have with the web version and the background agents they have? It's enough of a
reason to rethink how these are branded
going forward. That all said, a new model that is better at code, that will reason more when it should and reason less when it shouldn't, all sounds really good. And from my brief playing,
I've had a pretty good experience
overall. I'm curious what you guys
think, though. Have you had a chance to
play with the new Codex model yet? What
do you think? How has it been? Let me
know in the comments. And until next
time, peace nerds.
OpenAI just dropped a new model for agentic coding: GPT-5-Codex. Yes, they actually named another thing Codex 🙃
Thank you Browserbase for sponsoring! Check them out at: https://soydev.link/browserbase
Use CODEX for 1 month of T3 Chat for just $1: https://soydev.link/chat (only valid for new customers)
Want to sponsor a video? Learn more here: https://soydev.link/sponsor-me
Check out my Twitch, Twitter, Discord more at https://t3.gg
S/O Ph4se0n3 for the awesome edit 🙏