Remember back when o4-mini dropped and I
made the bold claim that OpenAI suddenly
seemed to care about developers and was
starting to wage war against all the
other model companies? Well, it's gone a
heck of a lot further than I ever would
have expected. Today, OpenAI dropped yet
another new product for us as devs. And
it's not just a product. It's a model.
It's a specific model for us to use for
the work that we do every day. And since
OpenAI cares a lot about developers
they took the time to reach out to me
and a handful of other developers to
give it a quick shot a few days ahead of
launch so we could see what it's good
at, bad at, and more. And man, have I
had a lot of fun playing with it. Can't
wait to show you guys all the things
that it does well, but more importantly
all the things it does wrong. But first
we need to know what this is. As I said
before, it is a new model, and obviously
that means it needs a new name, which is
why I'm super excited to reveal... Codex. Guys. Really? Again? We're on, what, ten Codexes now? We have the Copilot model from 2023. We have the CLI. We have the web interface. We have the extension. And now we have this. God, guys, you can come up with names. It's not that hard. Just ask GPT-5. Look at that.
You got some free names. I'm using your
service. Just name it anything else next
time, please. That said, the numbers are
looking really, really good. On their
code refactoring tests, it's getting way
better numbers. And on SWE-bench, it's
performing meaningfully better, too. Not
quite as big a gap, but the numbers are
really good. But that's not where this
gets interesting. The model behaves
fundamentally differently. And while
it's not available on API yet, it should
be soon. The thing that's really cool
here is how deeply tied this model is to
the tools that we use it with
specifically the Codex CLI and the
Codex web interface. Oh, and also the
Codex extension. Thanks for making this
so easy for me, OpenAI. I cannot wait to
show you guys all the cool things and
all the broken things about this new
model. But first, since OpenAI did not
pay me, we do have bills to cover. A
quick word from today's sponsor and then
we'll dive right in. There's one
technology that is more inevitable than
anything else in all of software. And
no, it's not AI. Let's be real, it's
JavaScript. There is no escaping it. No
matter how good our HTML tooling gets
you will always need the ability to run
some real JavaScript on real web pages.
Especially if you're building AI agents
that are browsing the web for you to get
information. And that's why today's
sponsor is so helpful. Browserbase is
the best way to set up a browser in the
cloud. If you need an agent to access a
website or you just need to go get a
screenshot of some inventory somewhere.
If you need to control a browser with
code, your options are suffer or
Browserbase. And you should probably
not pick suffering anymore. Tons of
other companies have made the move
already, like Perplexity and Vercel. Yes, really. You would expect a company like Vercel to have all of these things
handled. And to an extent they do, but
when they wanted to introduce the
ability for tools like v0 to go hunt
across the web to find specific things
their existing tooling just wasn't there
for it. So they made the move to
browserbase and they have been very very
happy. In particular, the tools that
existed were not reliable enough. The
CDN challenges were blocking them from
accessing various things. The quality of
the data was absolute garbage, and the limited parallelization was a huge problem because each of these instances needs a real processor on a real computer doing real browsing. If you're curious how simple it is, here you go. They already set up
Playwright on this browser window, so this is all happening in the browser. We have window.playwright, and we call chromium.connectOverCDP with the connection string, which is something you just get from their dashboard and copy-paste. And now we have an actual context for the browser. We have a page that we can do things to. page.goto a URL, and now you're browsing the web. AI already knows how to use Puppeteer, but does your infra have a good, reliable way for it to do that?
It probably doesn't because you're not
using browserbase yet. Thankfully
that's an easy thing to fix. Check them
out today at soydev.link/browserbase.
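For reference, here's roughly what that flow looks like from Node with Playwright. This is a minimal sketch, assuming the playwright-core package, with a placeholder connection string standing in for the one you'd copy from the Browserbase dashboard:

```ts
// Minimal sketch: drive a remote browser over CDP with Playwright.
// The connection string below is a placeholder; Browserbase gives you the real one.
import { chromium } from "playwright-core";

const connectionString = "wss://YOUR-CONNECTION-STRING"; // placeholder from the dashboard
const browser = await chromium.connectOverCDP(connectionString);

// Reuse the remote browser's existing context and page if it already has one.
const context = browser.contexts()[0] ?? (await browser.newContext());
const page = context.pages()[0] ?? (await context.newPage());

await page.goto("https://example.com");
console.log(await page.title());

await browser.close();
```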
So, as I mentioned before, these two
benchmarks showed some pretty cool
numbers, but that's not what I'm most
excited about. If you know what I've
been complaining about with GPT-5, other
than the fact that it was broken in half
the surface areas that we used to
interface with it, the thing I've been
most annoyed about by far is how slow
and token hungry the model tends to be
for developer tasks. And that seems to be one of the squarely pointed focuses of this change. So, when I think
about the tasks that I give to different
LLMs, I think a lot about how hard is
this task to solve? Like how many tokens
does it need? How complex is the
problem? Does it need internet access?
Does it need all these different things?
And what I found is that for different
tasks, there's quite a bit of variety.
If I'm asking a model a question and we
have a spectrum where on the left here
it's the simplest and on the right here
it is the most complex. So the most
tokens, I'll just say most tokens. I'll
change simplest to least tokens. If I
ask a model something like count to 10
most models are going to not use too
many tokens for a task like this. But if
you ask it for something more complex
like to write code in 15 different
languages or to count the number of Rs
in the word strawberry, it's going to
take a good bit more tokens. You'll end
up somewhere over here instead. But if
we think about the range of how many
output tokens are necessary for these
tasks, something like this is going to
be frankly 10 tokens because it's
counting to 10 and each number will
probably be one token. But something
more complex, like counting letters or actually writing code in a chat interface, can get as high as
something like 100,000 tokens. And while
this range seems pretty big, the range
gets a hell of a lot bigger for code
tasks where some code tasks might only
need like 100 tokens and some other
code tasks much bigger ones might need a
million tokens. And while this might
seem like an exaggeration, I really wish
it was. Just my early playing around was
able to get up to 628k tokens used and
I've managed to break a million many a
time doing one-off, playing-around code
tasks in the past since I currently
can't really code much. Uh yeah, I've
been doing a bit more vibe coding lately
and this has been very fun for me to
play with because of these new
characteristics. One of the things I've
been most frustrated with with most of
the AI coding tools is that they are
slow. I am a very fast typer. I go like
160 to 170 words per minute when I have
both of my hands functioning properly. I
have not had that since my surgery and I
miss it dearly. I can't even press the
space bar with my left hand right now. I
can barely press Command, Option, and Control. I can't even copy-paste my way
through stuff right now. It's been
rough. So I decided to give this a spin
as a true vibe coder would, trying my best to avoid reading code, and I did really well for a bit. We'll go over the
project in a second, but I do want to
show the thing that brought me here
which is the range of how many tokens
are being used. When I was using other
models like GPT-5 in standard or high configuration, or if I was using models like Claude or Gemini 2.5 Pro, I found
that the minimum number of tokens for a
task was still pretty high. And even if
the models were fast, they still felt
slow because they were generating so
many tokens to complete basic work. I
have a bunch of videos where I talk
about this. In particular, the video
about the pricing changes in Cursor went
really in-depth on this and all the
things that made it tough. But that's
what makes these changes fun: they're trying specifically to handle the small tasks with small amounts of tokens and the big tasks with big amounts of tokens. On OpenAI employee traffic, we
see that for the bottom 10% of user turns, sorted by model-generated tokens (so they're sorting the tasks that employees gave the model by how many tokens each task used), GPT-5-Codex uses 93.7% fewer tokens than GPT-5 did. So on these simple tasks, it uses roughly a sixteenth as many tokens. That's an insane drop. But for the top 10%, it can actually use significantly more, spending twice as long reasoning, editing, and testing code, as well as iterating in general. That's really cool
to see. The gap in between these numbers
is significant, and the results speak
for themselves. So I'm going to run the
same prompt twice: once with Codex using standard GPT-5 and once with Codex using the new GPT-5-Codex. Of course, we're doing
the classic image studio. So, I will
send this here and separately I'm going
to spin this up with the new GPT-5-Codex version. The thing I'm particularly
curious about is how many tokens does it
use for this task. This task should be, in quotes, "relatively simple," because it's just styling the page to look good and making a mock application. I was always kind of concerned about how many tokens would be used when I did this with GPT-5. But right now, even though I started the GPT-5 one earlier, it's used fewer tokens than the GPT-5-Codex version. Interesting; this might be one of the complex tasks. While those are
running, I'll show you guys the much
deeper testing I've personally been
doing. So, when I first tried building
this project, I got a decent looking UI
out of it. It's fine at that. It was
different from usual. I can go back and
show it in a bit. It's not that
important. But then I asked it to
actually implement the service because I
tried that before and had varying luck.
This time I told it to use Convex and fal, and it got decently far. It did run into some problems, though. It tried too hard to use Next.js, and more importantly, it imported everything from convex/schema, which was weird. I don't know if this is a thing Convex used to do, but it's definitely not a thing they do now; it has to be convex/server. So I had to go make this change myself. After I made that change, it built and could deploy on Convex, but the code wouldn't actually run because of errors in how it was wiring things up between the client and server actions, the web interface, and Convex, building a complex relationship between them that wasn't necessary.
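For reference, that schema import fix was tiny. Here's a minimal sketch of what a Convex schema file imports; the table and fields are made up for illustration, and the point is just that defineSchema and defineTable come from convex/server, not convex/schema:

```ts
// convex/schema.ts — illustrative sketch, not the code the model generated.
// The helpers live in "convex/server"; the model kept importing from "convex/schema".
import { defineSchema, defineTable } from "convex/server";
import { v } from "convex/values";

export default defineSchema({
  // Hypothetical table for an image studio app.
  images: defineTable({
    prompt: v.string(),
    url: v.string(),
    createdBy: v.optional(v.string()),
  }),
});
```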
I told it up front: you did this wrong, try again. But this time, I paid
attention to what it was doing. I did
give it search access, which I didn't
realize until recently you have to do with a command-line argument. You have to pass --search for it to have the ability to search the web when you use Codex, the CLI. So when I did that, it
was able to search and what I found is
that it kind of sucks really hard at
search. Let me show you guys some of the
queries that it made. Here we go. It searched for "fal client import fal from fal client subscribe example."
This was because it had an error with
how it was importing and configuring
fal initially. It just did it entirely
wrong even though you don't need to
because I already had the environment
variable set up properly. So that was
quite annoying. Here it searched for
"Convex Next.js setup guide 2025 official documentation." That was a good search,
but that only happened because I said
this is not the correct way to use
Convex. Try again. Follow the official
Next.js setup guide. I'm more and more
seeing the value in templates. I even
went full vibe code here and just pasted
in errors and told it to try and fix
them. And it didn't. Okay, here's what I
was looking for. This pile of just
absolute junk searches.
"fal-ai/flux-pro/v1.1-ultra API example fal.subscribe prompt aspect ratio guidance scale." I did not ask for any of this. I do not know why it was going this hard here. Also: "convex react useQuery context provider example 2025." It sucks at search. I am amazed at how bad it is at search. It's kind of annoying.
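For context, the thing it was searching so badly for is genuinely small. Here's a minimal sketch of calling a fal model with the @fal-ai/client package; the model ID, prompt, and parameters are illustrative placeholders, not what my project actually uses:

```ts
// Minimal sketch of calling a fal-hosted model; IDs and inputs are illustrative.
import { fal } from "@fal-ai/client";

// Uses the FAL_KEY environment variable; configured explicitly here for clarity.
fal.config({ credentials: process.env.FAL_KEY });

const result = await fal.subscribe("fal-ai/flux-pro/v1.1-ultra", {
  input: {
    prompt: "a cozy reading nook, soft morning light",
    aspect_ratio: "16:9",
  },
});

console.log(result.data);
```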
Cool. These both finished. This version
used 23.6K tokens, and this was GPT-5 standard. I didn't even have it on high; I just had it on standard. And then this was GPT-5-Codex, which was 27.8K. I just realized I should have done a test on high. I'll test that in a minute. Let me just bun run dev for these and see how they look. And here we go. This is the version that GPT-5 medium made. And here's the version that GPT-5-Codex
made. Normally I wouldn't read too much
into the UI differences here, but a lot
of the things that are different have
been consistent through various runs.
Now it does definitely behave
differently with UI. It still looks
good, but I've noticed more of these
types of errors where like things are
clipping into each other where you have
these weird layers in the UI. Just bugs
that I didn't see as much of when I was
using standard GPT-5. Not sure what that
is. Hopefully, it'll be fixed, but that
does kind of break my heart a little bit
because one of the things I loved about GPT-5 was how good it was at UI. It'd be kind of annoying if we have to switch back to standard GPT-5 for UI and then go to GPT-5-Codex for other things. To be determined. Here's the version I made
when I was actually building a working
demo using all of this. And it took a few renditions to get the UI in a
state where I didn't hate it. I'll show
you what it looked like when I first
started. Quick, here we are. This is
what it looked like when I first spun it
up. Way too big of an area up top here.
This looks okay. Too much text. I don't
know what it's doing there. Where things
fall apart is near the bottom here. This
is a mess. I don't know what happened. I
really don't.
This is not the GPT-5 I know and love. As
I said, when I told it to design in a
different direction, it was able to
handle it fine. Even the worst things I
generated from Codex look better than the best things I could generate from Claude when it comes to from-scratch UI stuff. All of that said, I still
recommend you just go grab a screenshot
of something that looks like what you
want and use that as the starting point.
This is more meant to demonstrate the
native behavior for UI that these models
have. As I was mentioning before though
was not super impressed initially when I
told it to just kind of go off and work
on things. I know that none of these
tools can just invent engineering for
you. Like you need to guide them.
They're kind of like a co-orker. In
fact, that's how this was pitched to me
by OpenAI is they really wanted this new
model to feel like a good co-orker that
might not know everything about the
codebase yet or exactly how to work, but
could be instructed to go do a thing and
work alongside you on that thing. I definitely felt that a lot more than with the other models I've played with, but it still can commit really hard to things that are incorrect. I am also just really unimpressed with the search in Codex. Not because Codex's search itself is bad, but because Codex the model is bad at doing search with the
CLI. Also, there's so many little UX
things that are still screwed up. Like
when I was working on setting this up
earlier, the way agent internet access
works was interesting. I am thankful
they put this callout here: "Enabling internet access exposes your environment to security risks." Yeah, real thing. But
the fact that I have to manually turn on internet access and switch this to "all," and there isn't a process where it requests access when it needs it instead of doing it this way? Not great. And even then,
like the whole process for creating an
environment, it's a bit much. It's kind
of slow and tedious. But now that I've
done this, I should be able to go in
here and change where this goes. Oh, it
doesn't auto-update here. I have to
refresh. That's not even enough. I have
to reload the window. And hopefully now
when I go to the Codex tab in my
editor, I will see the environment that
I just made. Interesting that this is under... oh, no. I see what they're doing here. This is awful UI. ping.gg/T3 Chat. Hover over it, and now it gives me other
options here. It's not clear that this
is to select a different environment.
This looks like they are within
ping.gg/T3 chat, which they're not. Look
here. Use local changes. No, I want to
get back on main. It also used npm
which I'm annoyed with. It should be
able to figure out that I don't use
that, but it didn't even ask. It just
kind of went with the thing that I would
consider wrong. Also, this use local
changes thing breaks so much of the UI.
Make sure you switch over to main. Even though my main branch is the same as main, it defaults there to local changes. There's just a lot of little UX things they have to figure out, and they're admittedly annoying. It means I just use the CLI
most of the time or the web interface. I
don't use the extension much for these
reasons. So I'm going to tell it to add
the Gemini image models. Add the Gemini
image models for editing and generating
images through fal. And now I've spun
up a cloud instance that is taking
advantage of the same new model, same
CLI. They're trying really hard to
standardize, like, the Codex system and interface across the different platforms. And they're also supposedly planning on putting out an SDK, which could be really cool. It would mean that anyone could spin up their own Codex-like tool in the cloud. It doesn't
really seem like they want to win this
by making something no one else can use.
It seems quite like the opposite. They
want their models and their protocols to
power how we do agentic coding, which is
why they're open sourcing pretty much
everything around it. It is kind of crazy, if you think about it, the amount of money and time being spent on Codex and how it is powering so many other things, and they just give it out for free, MIT licensed, on their GitHub.
Actually, I might be wrong about the
license. I think they were more generous
than that. I'm correct. It was Apache
2.0. That's kind of nuts for them to do
something like that. And they're
literally merging things as we speak, 2 minutes ago. And I'm recording this at
9:00 p.m. on a Sunday. So that says a
lot about how hard these guys are
shipping. It feels a lot more like a
small startup than it feels like this
giant evil mega corporation. I know
that's not the vibe a lot of you guys
have. I understand. But I insist these
guys have been awesome to work with and
they're totally okay with the fact that
I'm sitting here half roasting them as I
go through all of this. Almost forgot to mention: a big part of why they're probably going the open-source angle here is that weird agreement they have with Microsoft. The one where, you know, they get all of the IP until AGI is reached. I've always found that to be a weird agreement, and in particular here it hurts, similar to how it hurt in the Windsurf acquisition: if they don't open source this, Microsoft still gets access to all of it and can do whatever they want with it. But by open sourcing it, everyone else gets access to it too. So this could hypothetically be a workaround for that deal. Can't say for sure; this is pure speculation, just a possibility I think is worth considering, is all. Let's see how this does in the
cloud. If I click this, will it bring me
to this task? It will. It will. We'll
see how that handles things. In
particular, it got really confused with
the setup for Convex, specifically the, like, environment variable management stuff, because you don't have to manage environment variables with Convex. Just run the dev command and it will tell you to sign in and you're good. I have no
idea how it's going to handle that in
the cloud. We will see momentarily.
Apparently something opened up the ChatGPT app when I did these things. That's
kind of silly and annoying. Anyways
more on the tokenization stuff.
Apparently, they were comparing medium reasoning between GPT-5 and GPT-5-Codex. You might think that's suspicious. I don't, because I personally almost never use high. So GPT-5-Codex at medium, on the 10th percentile, uses 93.7% fewer tokens, but on the 90th percentile it uses over double the number of tokens. I really like this, the fact that it's so flexible based on the different types of tasks we do. Very
good sign. It's been trained
specifically for conducting code reviews
and finding critical flaws when
reviewing. It navigates your codebase
reasons through dependencies, and it
runs your code and tests in order to
validate correctness. They did talk a
lot about the code review side of things
in the call that I did with them. They
were really excited about the fact that
it isn't just taking your code and
looking at the diffs. It's actually
running the code in a container in the
cloud to test it and find bugs.
Potentially really, really powerful. I'm
still using CodeRabbit if I'm being
honest with you guys, but this is
something that I would actually consider
using. It's a good pitch. Seems cool.
Have not evaluated it at all yet. They
test it on actual open source repos. For
each commit, experienced software
engineers evaluated review comments for
correctness and importance. We find the
comments by GPT-5-Codex are less likely to be incorrect or unimportant, reserving more user attention for
critical issues. Good. I will say that
through most of the AI code review tools
I've used, they like to spit out
nonsense and things that aren't that
valuable. I am thankful that both CodeRabbit and now hopefully GPT-5-Codex will make that better. And also, the tools that aren't as reliable can use GPT-5-Codex and, fingers crossed, they'll also have better reviews: fewer comments per PR, more high-impact comments, fewer incorrect comments, about a third as many incorrect comments. Very good sign.
It also is much better at mobile sites.
Very fun. Can look at images or
screenshots you provide as input, visually inspect progress, and display
screenshots of its work to you. That's
really, really cool. I have not played
with that just yet, but the fact that
the new Codex web interface is capable
of giving you screenshots of the work as
it's going. Very good sign. I like this
a lot.
They rebuilt the Codex CLI to be more
agentic. Should have always been. You
can now attach and share images. I do
not like sharing images in my terminal.
I don't know why people like this, but
you do you. Do it right in the CLI.
Super cool. Now has to-do lists. I've
noticed it using the to-do list a lot
more. Search being something you have to enable via command-line arguments when you launch is still annoying. I'm sure
they'll change that in the near future.
Oh, fun quick tangent on the pricing
side. When they set me up for using
this, they used my company account
which doesn't have a subscription. Still use T3 Chat, by the way. But as such, my $200-a-month subscription wasn't working.
So, I went and signed up for the $20
tier on my company account. Right as I
started prompting, they went and fixed
it and put it on the right account. But
I was going pretty hard using the $20
tier and wasn't able to hit any limits.
I'm sure you will with heavy enough
usage over long amounts of time. But it
does seem like the Codex limits on the $20 and $200 tiers of OpenAI's ChatGPT plans are actually quite generous, which
hurts because this is yet another reason
to not use T3 Chat. That said, if you
want a better chat interface, use code CODEX to get your first month for $1 on T3 Chat, and every other month will be
eight bucks. Anyways, let's see how that
cloud interface is doing. Oh, nothing.
Yeah, this has been my experience with
the cloud interface. It's just kind of
half broken. Oh, looks like it made changes. fal-ai/gemini-flash, gemini-flash-edit? I don't think those are the names of those models. Yeah, it's actually fal-ai/gemini-25-flash-edit.
So, it didn't search. It didn't check
the web. It just hallucinated the names
of those models.
I still think the cloud side is a bit
bunk if I'm going to be honest with you
guys. I haven't had a good experience
with any of the cloud background agent things just yet. But the amount of problems this seems to be running into just doing basic checks for things: ripgrepping, looking for files and names of things. I told it to add the models; that means they're not there. It should just be going to the web and finding them, not ripgrepping through node_modules. That's a choice. Yeah, this is where my
skepticism comes in. I'm recording this
part after I finished filming earlier
because I want to mention a specific
thing. I have had the "Starting Codex" notification on my phone for the past hour and a half, even though, according to this, it's done and is ready to go open a PR.
Their live notification system is
entirely broken. It has been since it
started. It seems really cool. It
doesn't work at all. And if I can find
an easy way to turn it off, I'm
absolutely going to because at this
point in time, it does not function.
That's kind of my problem with the ecosystem. It just feels like the pieces
are getting there, but the puzzle isn't
yet. And all the parts just kind of
break once you start using them more
together. And that cohesion is so
important to do something like this. And
it's just not there yet. I find that the GPT-5 models are still the best experience I've had doing agentic code, but I still find that the Codex tool set, in particular the web interface and the extension in my editor, are among the more clunky options. I personally still take things like opencode, Kilo Code, and all these other agentic tools with GPT over the Codex ecosystem. Even though the CLI is improving meaningfully, I'm not seeing those improvements on the web version, and I'm definitely not seeing them in the VS Code editor extension. This
is the problem when you use one name for
everything, though. Now, if people hear the name Codex and they go to try it, whichever version they're trying, be it the editor version, the web version, the CLI version, or just the model directly, is going to be how they judge it. And if
I'm using the CLI and the model, and
you're using the web interface and the
extension, we're going to have really
different experiences and that's going
to result in us having a different vibe.
I've already been through this with
OpenAI, as you guys probably remember, with the GPT-5 launch, where they didn't label things correctly. They made it too hard to know what you are and aren't getting, and then they screwed up the auto router, and now everyone thinks I'm insane. Thankfully, we've all now seen the light and know that GPT-5 is really good at code. But Codex version 12 is
not going to help a whole lot. So that's
my feedback. I'm sorry that you had to
learn this way. OpenAI, I'm not sending
this video to them ahead of time to
approve and hopefully I won't get in
trouble for that. You need to fix your
web interface. You need to fix the
extension. You need to name the model
something else or name these other
surfaces something else because people
are going to be confused and frustrated.
And while I appreciate the goal of unifying everything, because it should hopefully reduce confusion long-term, right now it's just adding more. And the fact that I've had as janky an experience as I have with the web version and the background agents they have? It's enough of a
reason to rethink how these are branded
going forward. That all said, a new model that is better at code, that will reason more when it should and reason less when it shouldn't, all sounds really good. And from my brief playing,
I've had a pretty good experience
overall. I'm curious what you guys
think, though. Have you had a chance to
play with the new Codex model yet? What
do you think? How has it been? Let me
know in the comments. And until next
time, peace nerds.
OpenAI just dropped a new model for agentic coding: GPT-5-Codex. Yes, they actually named another thing Codex 🙃
Thank you Browserbase for sponsoring! Check them out at: https://soydev.link/browserbase
Use CODEX for 1 month of T3 Chat for just $1: https://soydev.link/chat (only valid for new customers)
Want to sponsor a video? Learn more here: https://soydev.link/sponsor-me
Check out my Twitch, Twitter, Discord more at https://t3.gg
S/O Ph4se0n3 for the awesome edit 🙏