So I had this hypothesis and I was pretty confident about it. A better AI model writes a better PRD, right? A better PRD leads to a better build. So if you use GPT-5.2 Pro to generate your product requirements instead of 5.2 Instant, you should get a meaningfully better application on the other side. That made sense to me. So I tested it. Same app, eight different models generating PRDs, everything from GPT-5.2 Instant all the way up to 5.2 Pro. Then I fed every PRD into the same build system: Claude Code with Opus 4.5 in planning mode. Same builder, same process every single time. And when I looked at the results, I could not tell them apart. That should not have happened. And the process of figuring out why led me somewhere I didn't expect. Okay, here's what I was
building. Think Spotlight on Mac, but
for words. You hit a hotkey, a panel
pops up. You type a word and instantly
you get definitions, etymology, synonyms,
related concepts, everything. If you
misspell something, it catches that and
shows you what you probably meant, even
with little hints to describe the
problem so that maybe you'll learn how
to spell it next time. But here's the
thing, it's not just dictionary words.
Type apple and you might mean the fruit
or maybe the company. The system needed
to understand that context and show you
both interpretations and kind of let you
explore deeper. That was really the
whole point. So I recorded a 15-minute voice memo describing what I wanted: all of the features, how it should feel, the experience I was going for. That became what I call my intent document. Then I gave that same document to eight different models with identical instructions: basically, turn this into a PRD, a product requirements document, the structured spec you might hand to a development team. And then I took every single one of those PRDs and fed them into the exact same build system, which as I mentioned was Claude Code with Opus 4.5 in planning mode. Same build environment, same execution model. The only variable was which AI wrote the PRD. This was all intentional, to control just for PRD quality. My hypothesis
was so obvious I wasn't sure it was even
going to be worth sharing: GPT-5.2 Pro would crush it. The PRD from the smartest model would produce the most complete product. GPT-5.2 Instant would probably miss tons of features. There'd be a clear correlation between model intelligence and output quality. So let
me show you what actually happened. I'm starting with (you'll see the names up here just in case) GPT-5.2 Fast. This is that Instant model; if you go to ChatGPT and choose Instant, this is what you'll be running. It comes back super fast and gives you a reasonable response, but you wouldn't expect it to be a great PRD. And if we come in here and search for something like horse, you'll see what it's trying to do: it's giving us horse with a common misspelling here, and it will tell us that this is showing us the results for horse even though we had typed something else in. It is not showing us what changed. That's one of the things we're really looking for: if you type in a misspelling, this one does correct for it, as you can see, but it should also show you what you misspelled to help you next time. All right, the next one is from Sonnet. You can see once again horse is corrected; it just shows us the results for horse itself. Excellent.
And how about GPT-5.2 Thinking? This is the thinking model you would get if you went to ChatGPT. You can see here we're starting to see a little bit of the callout that I had in the actual original request: I wanted to see what was typed in and what the difference was. And in fact, this one goes so far as to give us an inline character diff, which is nice. All right, all the way down to Opus 4.5. This one gives you the corrected result, but it doesn't necessarily call out what was corrected. And then something like GPT-5.2 Pro finds "hours," does not find "horse." This, by the way, is the Pro model. This one took about 15 minutes, and the first one that we showed took about 15 seconds. So GPT-5.2 Instant took less than 15 seconds; this one took more than 15 minutes. And disregard the errors. I give a little bit of an allowance if it didn't build immediately on first delivery. So I did work with some of these, and maybe I missed the mistake here for misspelling. No big deal. All right. I know that
seems like enough. You can kind of see the progression from Instant all the way up to Pro, but I went a little bit further. So let's imagine that you just took your request, that 15-minute conversation I had with myself, kind of meandering around all the different features I wanted, and put it directly into Claude Code in planning mode. What would happen? This is that version. This one keyed in on something a little bit different that was in the system at one point, which was trying to immediately find information with a local dictionary. So it can't find the misspelled input, a misspelling of horse. I haven't said that yet, but that's really what we're trying to do here. But it can find these others. It's a little tricky, but it absolutely, of course, works. And okay, how about even crazier? You thought that one was crazy? I just took my request, no PRD, directly to planning mode, and we got what we just saw. And now I said to heck with a plan; if these things are so good, let's just go direct to execution. So I took that conversation, about 15 minutes, put it directly into execution, no planning mode, with Opus 4.5 and said, boom, build it. This is what happened.
Now admittedly, this one I want to call out. I think there's a change to Opus, or to Claude Code, that is really very valuable: Claude Code is now kind of looking around for things that it doesn't understand, and then it goes into planning mode. So I think what's actually happening underneath the covers is it does go into planning mode, asks questions, and does the normal things that planning would normally do. So there might not be a real direct execution if you put something this complicated in. That's the reason I think this happens. By the way, this is one of the best looking ones. Most important: I'm looking at these nine builds and I can't tell you clearly which one came from Pro and which one came from Instant. That was the real hypothesis. There are differences. Some missed things that others hit, and some missed more things than others altogether. But they basically have the same features. They handle definitions in roughly the same ways. You can see some of them didn't quite work in exactly the same way. They deal with concepts similarly, but some didn't really deal with concepts well at all. But really, the UI variations are kind of just noise, probabilistic differences you'd get by running the same PRD twice in general. We all know about that problem. So I decided to build an
evaluation that I could run on every model, and let each one of those systems build out its own assessment of how well it thinks it met the PRD that was put into it. And so we end up with these kinds of scores here. Everything, by the way, scores basically in the 80s. You can see GPT-5.2 Fast, the Instant model, scores basically the highest on this chart, and then Opus 4.5 right behind it, and then GPT-5.2 Thinking, and then direct execution scored very high. And if you scroll down into the 77 area, you get the Pro version, and the direct planning one was weirdly actually one of the worst. But really, they're very, very close. I wouldn't read too much into these numbers. This is just an indicator of whether there are major differences. If we look at the way it represents itself, you can see how much overlap there is for all of the different features that are being built.
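In the video, the evaluation is each system's own self-assessment against the PRD. Purely to illustrate the idea, here is a deterministic toy version of a PRD-coverage score; the function name, requirement list, and keyword-matching rule are all my own invention, not what was actually run:

```python
def coverage_score(prd_requirements, built_features):
    """Fraction of PRD requirements that some built feature appears to cover.

    A requirement counts as covered if any feature description contains
    every word of the requirement (a very rough keyword match).
    """
    def covered(req):
        words = set(req.lower().split())
        return any(words <= set(feat.lower().split()) for feat in built_features)

    hits = [req for req in prd_requirements if covered(req)]
    return len(hits) / len(prd_requirements), hits

# Toy data standing in for the real PRD and build inspection.
prd = [
    "misspelling hints",         # show what was corrected and why
    "etymology panel",           # word origins
    "multiple interpretations",  # apple the fruit vs. Apple the company
]
features = [
    "shows etymology panel with word origins",
    "displays multiple interpretations for ambiguous words",
]
score, hits = coverage_score(prd, features)
print(f"coverage: {score:.0%}")  # the misspelling hints were dropped
```

A real version of this would have a model judge semantic equivalence rather than matching keywords, but the output shape is the same: a percentage plus a list of what was hit.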
And if we go into, let's say, the Pro model and see what it missed: multiple information panels, it missed that. Copying to the clipboard, it didn't quite get that right. Multiple interpretations, it did not get at all. So there were some things that it missed completely. And if we look at the Fast build, it got most of these things, except it didn't get derivations, so it didn't do etymology nearly as well, and it did not do the misspelling visualization (as we saw in the beginning) quite as well. All right. So
great, we have all these numbers. We
have a mishmash of different
applications that are being built.
What does this all tell us? This is now very confusing, and this is where I found myself. Some of the confusing points here: the PRDs themselves are actually dramatically different sizes. Some are up to 400 lines, and the smallest one, the Instant one, was only about 50 lines. That's massively different levels of detail, and yet these similar results. And so that became very confusing to me. This completely broke my mental model. I mean, I was actually ready to shoot a video. This seemed so obvious I wasn't sure I was going to share it, like I said, because I thought everybody would know. I was going to shoot this video and say, "Oh, guess what? To get the best PRD, which will give you the best product, you have to use the best model that you have access to." That's what I thought I was going to have to do. And in reality, obviously, that is absolutely not the case. So what
are we left with here? Okay, so what's going on? There's something happening inside the Claude Code planning step that I didn't fully appreciate. What's really happening is that every PRD, regardless of how smart the model is that wrote it, goes through this powerful planning filter, and that filter smooths out everything. So that's really the trick. What do we do about that? Okay, so the practical implication of this: if you're using a sophisticated planning system, and Claude Code in planning mode qualifies, the model you use for your PRD matters way less than you'd think. This is actually great news for everyone. If you don't have access to GPT-5.2 Pro or something like that, you're perfectly fine. Use a pretty good model and you should be able to get a pretty good PRD out of it, and something like Opus in planning mode will take care of the rest. If you want to use the free-tier GPT or Gemini, any frontier model that's doing some kind of thinking will probably do exactly what you need. But
here's where I wasn't exactly satisfied
because roughly the same obviously isn't
exactly the same. And I kept wondering,
what if the problem isn't model
intelligence? What if it's something
else entirely? I had a theory. The issue
isn't how smart the model is. The issue
is how much intent survives the
translation. Look, when you describe
what you want to an AI and ask for a
PRD, something's bound to get lost. And
really, it's almost always the why
behind your features. You know, that
feeling you were going for, the stuff
that's kind of hard to specify, but
really easy to lose. So, I created a new
version of my request with the same
intent document, but I added several
paragraphs at the end. Explicit
instructions saying, "Carry the intent
through. Don't just list features.
Explain why each one matters." So,
basically, trying to carry my voice on.
Preserve the nuance. The PRD should feel
like a conversation, not just a
checklist. I gave this enhanced request to GPT-5.2 Thinking, so just the thinking model, and to Opus 4.5. And this time something different definitely happened. When I ran the evaluation, GPT-5.2 Thinking scored about the same as before, mid-80s. It tried to carry intent, but it still formatted everything like a typical GPT spec: clean, structured, but sterile. Opus 4.5? 99%. Not 89%, not 91%. 99%. Every single requirement I'd mentioned was in there, and not just listed. It was explained. The why was preserved. It read like I'd written it myself, just organized. That's a 12-point gap from the next closest model. That's not noise. That's signal.
So I decided to build it, that 99% PRD we were talking about. And here it is. I assumed it would absolutely be perfect, of course, because it had everything in it. And it does have a lot. As you can tell, it is actually showing me synonyms for things. It's showing me the results, the etymology for things. But this one, you'd be surprised to find, actually can't handle the misspellings like it needs to; it isn't able to deal with that response, which is just kind of surprising. And if we look over at the GPT-5.2 version, which scored lower, 89% or something like that, you can see it gets through all of the different misspellings, changes, and hints, has a thesaurus, has etymology, has a lot of the things. But the problem is it really is still missing quite a bit. And especially when we talk about something like this one here, it definitely doesn't hit all of the marks, all of the 99% items that are inside of that PRD. We know for a fact that PRD has everything it needs. And it still turned out like this. So,
what was going on here? 18 items were
missing. Not small things. Word of the
day integration missing. Sound effects
gone. Auto dismiss behavior nowhere.
These weren't edge cases. These were
features I explicitly asked for that
made it into the PRD that just
disappeared during planning. The
planning step was dropping 20 to 30% of
my requirements.
Not because they weren't clear, not
because they were unreasonable. They
were just lost. So I ran it again. I
gave Opus the plan it had written, the
original PRD, and said, "Find everything
in the PRD that's not in the plan, and
give me a fallout list. Then update the
plan to include those items." After the
second pass, eight items were missing
instead of 18. After the third, the
remaining items were things that I hadn't really specified all that well anyway; they were pretty ambiguous, so it was legitimate that they were missing, and that's what I asked it to add. So this is the finding: not "use a smarter model for your PRD." The finding is that your planning step is probably dropping a quarter of your requirements and you don't even know it. Okay. And so here we
are. This is the version of the application as we've triple-planned it, come out with a full-blown version, and iterated on it a little bit. So this is what the application looks like. It has a lot of the features we would anticipate at this point; you can see all of the different aspects of the original request inside of it. And I wanted to very briefly, once again, describe to you the pattern that got me here versus any of the previous planning builds. Essentially, I make sure that I carry intent over into the PRD. I think that's a critical aspect that I will be talking about more on this channel very soon. I'm very interested in intent, but we need to make sure that intent isn't scrubbed away when we're building our plans. Once we get that PRD into the system, I allow it to plan against it. And then I say, "Okay, you think you're done planning? Compare yourself against the PRD. Make sure you're covering everything." I do that once or twice. Then you're going to have a completed version of a plan. Now
what I believe is that this should be added to planning mode in tools like Claude Code and Cursor when they talk about planning. They should be able to evaluate against the original request. That should be a straightforward thing to do, to make sure that the plan they've created considers everything they were asked to do.
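In my workflow, that plan-versus-PRD comparison is done by prompting the model itself. Purely to show the shape of the loop, here is a toy mechanical version; the naive substring matching and the append-style plan update are placeholders for what a model would actually do:

```python
def fallout(prd_items, plan_text):
    """Return PRD items that never made it into the plan (naive substring match)."""
    plan = plan_text.lower()
    return [item for item in prd_items if item.lower() not in plan]

def reconcile(prd_items, plan_text, update_plan, max_passes=3):
    """Repeat: find missing items, fold them into the plan, until nothing is missing.

    update_plan stands in for the step a human or model performs;
    in this sketch it just appends the missing items.
    """
    for _ in range(max_passes):
        missing = fallout(prd_items, plan_text)
        if not missing:
            break
        plan_text = update_plan(plan_text, missing)
    return plan_text, fallout(prd_items, plan_text)

# Toy requirements echoing the ones the planning step dropped in the video.
prd_items = ["word of the day", "sound effects", "auto-dismiss"]
plan = "Build the lookup panel with sound effects."
plan, still_missing = reconcile(
    prd_items, plan,
    update_plan=lambda p, missing: p + " Also: " + "; ".join(missing) + ".",
)
print(still_missing)  # → []
```

The point isn't the matching logic; it's that the check runs after planning and feeds its fallout list back into the plan, exactly the "compare yourself against the PRD" pass described above.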
It's kind of surprising that they're not doing that. And that's where I found myself. And that's exactly where we are, which is: oh wow, the planning mode was actually the problem. It was scraping off all of these edges and I didn't know it. What actually matters here is getting your intent to survive the journey and then verifying that nothing got dropped along the way. Your intent is the actual you in all of this. The gap between your best idea and what gets built isn't about using GPT-5.2 Pro versus 5.2 Instant. It's about the silent losses at each handoff: the intent that doesn't make it into the PRD, the requirements that don't make it into the plan, the features that don't make it into the build. So basically, check your work at each step. That's it. That's the finding that three days of building the same app nine times actually taught me.
Look, if this changed how you think
about AI assisted development, let me
know in the comments. I'm curious
whether you've seen similar patterns or
this is new to you. And if you're
building with AI tools regularly,
subscribe. I'm doing more experiments
like this all of the time. I can't seem
to stop.
Thanks for coming along for the ride on this one, and I'll see you in the next one.
I thought GPT-5 Pro would crush the cheaper models. 15 minutes vs 15 seconds to generate a PRD—there had to be a difference. Then I built the same app 9 times and discovered something I wasn't expecting at all.

This video documents an experiment: take one intent document, feed it to 8 different AI models (GPT-5.2 Instant through GPT-5.2 Pro, Claude Opus 4.5, Sonnet, and more), generate a PRD from each, then build every PRD using the exact same system—Claude Code with Opus 4.5 in planning mode. The hypothesis was obvious: smarter model = better PRD = better app. The results broke that assumption completely—and led to a much more useful discovery about where requirements actually get lost in AI-assisted development.

If you're building applications with AI tools and wondering whether you should pay for premium models to write your specs, this will save you time and money. Engineers evaluating AI coding workflows, developers curious about Claude Code's planning mode, and anyone who's ever wondered "does the PRD actually matter?" will find something useful here. Whether you're just starting with AI-assisted development or you've shipped dozens of AI-built projects, the finding about intent preservation applies to your workflow.

#ClaudeCode #GPT5 #AICoding #PRD #AIAgents

00:00 - Intro
00:54 - The first experiment
02:45 - First build results
06:25 - Initial scores
08:46 - But, now what?
09:44 - So, wait, what?
12:37 - Build the perfect PRD
13:56 - Even a perfect PRD is not good enough
15:06 - After forcing a better plan
16:48 - Conclusion