So I had this hypothesis and I was pretty confident about it. A better AI model writes a better PRD, right? A better PRD leads to a better build. So if you use GPT-5.2 Pro to generate your product requirements instead of 5.2 Instant, you should get a meaningfully better application on the other side. That made sense to me. So I tested it. Same app, eight different models generating PRDs, everything from GPT-5.2 Instant all the way up to 5.2 Pro. Then I fed every PRD into the same build system: Claude Code with Opus 4.5 in planning mode. Same builder, same process every single time. And when I looked at the results, I could not tell them apart. That should not have happened. And the process of figuring out why led me somewhere I didn't expect. Okay, here's what I was
building. Think Spotlight on Mac, but
for words. You hit a hotkey, a panel
pops up. You type a word and instantly
you get definitions, etymology, synonyms,
related concepts, everything. If you
misspell something, it catches that and
shows you what you probably meant, even
with little hints to describe the
problem so that maybe you'll learn how
to spell it next time. But here's the
thing, it's not just dictionary words.
Type apple and you might mean the fruit
or maybe the company. The system needed
to understand that context and show you
both interpretations and kind of let you
explore deeper. That was really the
whole point. So I recorded a 15-minute voice memo describing what I wanted: all of the features, how it should feel, the experience I was going for. That became what I call my intent document. Then I gave that same document to eight different models with identical instructions: basically, turn this into a PRD, a product requirements document, the structured spec you might hand to a development team. And then I took every single one of those PRDs and fed them into the exact same build system, which as I mentioned was Claude Code with Opus 4.5 in planning mode. Same build environment, same execution model. The only variable was which AI wrote the PRD. This was all intentional, to control just for PRD quality. My hypothesis
was so obvious I wasn't sure it was even
going to be worth sharing: GPT-5.2 Pro would crush it. The PRD from the smartest model would produce the most complete product. GPT-5.2 Instant would probably miss tons of features. There'd be a clear correlation between model intelligence and output quality. So let
me show you what actually happened. I'm starting with (you'll see the names up here just in case) GPT-5.2 Fast. This is that Instant model; if you go to ChatGPT and choose Instant, this is what you'll be running. It comes back super fast and gives you a reasonable response, but you wouldn't expect it to be a great PRD. And if we come in here and search for something like horse, you'll see what it's trying to do: it's giving us horse with a common misspelling here, and it will tell us that this is showing us the results for horse even though we had typed something else in. It is not showing us what changed. That's one of the things we're really looking for: if you type in a misspelling, this one does correct for it, as you can see, but it should also show you what you misspelled to help you next time. All right, the next one is from Sonnet. You can see once again horse is corrected; it just shows us the results for horse itself. Excellent.
And how about GPT-5.2 Thinking? This is the thinking model you would get if you went to ChatGPT. You can see here we're starting to see a little bit of the callout that I had in the actual original request: I wanted to see what was typed in and what the difference was. And in fact, this one goes so far as to give us an inline character diff, which is nice. All right, all the way down to Opus 4.5. This one gives you the corrected result, but it doesn't necessarily call out what was corrected. And then something like GPT-5.2 Pro finds "hours," does not find "horse." This, by the way, is the Pro model. This one took about 15 minutes, and the first one that we showed took about 15 seconds. So GPT-5.2 Instant took less than 15 seconds; this one took more than 15 minutes. And disregard the errors. I give a little bit of an allowance if it didn't build immediately on first delivery. So I did work with some of these, and maybe I missed the mistake here for misspelling. No big deal. All right. I know that
seems like enough. You can kind of see the progression from Instant all the way up to Pro, but I went a little bit further. So let's imagine that you just took your request, that 15-minute conversation I had with myself, kind of meandering around all the different features I wanted, and put it directly into Claude Code in planning mode. What would happen? This is that version. This one keyed in on something a little bit different that was in the system at one point, which was trying to immediately find information with a local dictionary. So it can't find the misspelled input, a misspelling of horse. I haven't said that yet, but that's really what we're trying to do here. But it can find these others. It's a little tricky, but it absolutely, of course, works. And okay, how about even crazier? You thought that one was crazy? I just took my request, no PRD, directly to planning mode, and we got what we just saw. And now I said to heck with a plan; if these things are so good, let's just go direct to execution. So I took that conversation, about 15 minutes, put it directly into execution, no planning mode, with Opus 4.5 and said, boom, build it. This is what happened.
Now admittedly, this one I want to call out. I think there's a change to Opus, or to Claude Code, that is really very valuable: Claude Code is now kind of looking around for things that it doesn't understand, and then it goes into planning mode. So I think what's actually happening underneath the covers is it does go into planning mode, asks questions, and does the normal things that planning would normally do. So there might not be a real direct execution if you put something this complicated in. That's the reason I think this happens. By the way, this is one of the best looking ones. Most important: I'm looking at these nine builds and I can't tell you clearly which one came from Pro and which one came from Instant. That was the real hypothesis. There are differences. Some missed things that others hit, and some missed more things than others altogether. But they basically have the same features. They handle definitions in roughly the same ways. You can see some of them didn't quite work in exactly the same way. They deal with concepts similarly, but some didn't really deal with concepts well at all. But really, the UI variations are kind of just noise, probabilistic differences you'd get by running the same PRD twice in general. We all know about that problem. So I decided to build an
evaluation that I could run on every model, and let each one of those systems build out its own assessment of how well it thinks it met the PRD that was put into it. And so we end up with these kinds of scores here. Everything, by the way, scores basically in the 80s. You can see GPT-5.2 Fast, the Instant model, scores basically the highest on this chart, and then Opus 4.5 right behind it, and then GPT-5.2 Thinking, and then direct execution scored very high. And if you scroll down into the 77 area, you get the Pro version, and the direct planning one was weirdly actually one of the worst. But really, they're very, very close. I wouldn't read too much into these numbers. This is just an indicator of whether there are major differences. If we look at the way it represents itself, you can see how much overlap there is for all of the different features that are being built.
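In the video, the evaluation is each system's own self-assessment against the PRD. Purely to illustrate the idea, here is a deterministic toy version of a PRD-coverage score; the function name, requirement list, and keyword-matching rule are all my own invention, not what was actually run:

```python
def coverage_score(prd_requirements, built_features):
    """Fraction of PRD requirements that some built feature appears to cover.

    A requirement counts as covered if any feature description contains
    every word of the requirement (a very rough keyword match).
    """
    def covered(req):
        words = set(req.lower().split())
        return any(words <= set(feat.lower().split()) for feat in built_features)

    hits = [req for req in prd_requirements if covered(req)]
    return len(hits) / len(prd_requirements), hits

# Toy data standing in for the real PRD and build inspection.
prd = [
    "misspelling hints",         # show what was corrected and why
    "etymology panel",           # word origins
    "multiple interpretations",  # apple the fruit vs. Apple the company
]
features = [
    "shows etymology panel with word origins",
    "displays multiple interpretations for ambiguous words",
]
score, hits = coverage_score(prd, features)
print(f"coverage: {score:.0%}")  # the misspelling hints were dropped
```

A real version of this would have a model judge semantic equivalence rather than matching keywords, but the output shape is the same: a percentage plus a list of what was hit.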
And if we go into, let's say, the Pro model and see what it missed: multiple information panels, it missed that. Copying to the clipboard, it didn't quite get that right. Multiple interpretations, it did not get at all. So there were some things that it missed completely. And if we look at the Fast build, it got most of these things, except it didn't get derivations, so it didn't do etymology nearly as well, and it did not do the misspelling visualization (as we saw in the beginning) quite as well. All right. So
great, we have all these numbers. We
have a mishmash of different
applications that are being built.
What does this all tell us? This is now very confusing, and this is where I found myself. Some of the confusing points here: the PRDs themselves are actually dramatically different sizes. Some are up to 400 lines, and the smallest one, the Instant one, was only about 50 lines. That's massively different levels of detail, and yet these similar results. And so that became very confusing to me. This completely broke my mental model. I mean, I was actually ready to shoot a video. This seemed so obvious I wasn't sure I was going to share it, like I said, because I thought everybody would know. I was going to shoot this video and say, "Oh, guess what? To get the best PRD, which will give you the best product, you have to use the best model that you have access to." That's what I thought I was going to have to do. And in reality, obviously, that is absolutely not the case. So what
are we left with here? Okay, so what's going on? There's something happening inside the Claude Code planning step that I didn't fully appreciate. What's really happening is that every PRD, regardless of how smart the model is that wrote it, goes through this powerful planning filter, and that filter smooths out everything. So that's really the trick. What do we do about that? Okay, so the practical implication of this: if you're using a sophisticated planning system, and Claude Code in planning mode qualifies, the model you use for your PRD matters way less than you'd think. This is actually great news for everyone. If you don't have access to GPT-5.2 Pro or something like that, you're perfectly fine. Use a pretty good model and you should be able to get a pretty good PRD out of it, and something like Opus in planning mode will take care of the rest. If you want to use the free-tier GPT or Gemini, any frontier model that's doing some kind of thinking will probably do exactly what you need. But
here's where I wasn't exactly satisfied
because roughly the same obviously isn't
exactly the same. And I kept wondering,
what if the problem isn't model
intelligence? What if it's something
else entirely? I had a theory. The issue
isn't how smart the model is. The issue
is how much intent survives the
translation. Look, when you describe
what you want to an AI and ask for a
PRD, something's bound to get lost. And
really, it's almost always the why
behind your features. You know, that
feeling you were going for, the stuff
that's kind of hard to specify, but
really easy to lose. So, I created a new
version of my request with the same
intent document, but I added several
paragraphs at the end. Explicit
instructions saying, "Carry the intent
through. Don't just list features.
Explain why each one matters." So,
basically, trying to carry my voice on.
Preserve the nuance. The PRD should feel
like a conversation, not just a
checklist. I gave this enhanced request to GPT-5.2 Thinking, so just the thinking model, and to Opus 4.5. And this time something different definitely happened. When I ran the evaluation, GPT-5.2 Thinking scored about the same as before, mid-80s. It tried to carry intent, but it still formatted everything like a typical GPT spec: clean, structured, but sterile. Opus 4.5? 99%. Not 89%, not 91%. 99%. Every single requirement I'd mentioned was in there, and not just listed. It was explained. The why was preserved. It read like I'd written it myself, just organized. That's a 12-point gap from the next closest model. That's not noise. That's signal.
So I decided to build it, that 99% PRD we were talking about. And here it is. I assumed it would absolutely be perfect, of course, because it had everything in it. And it does have a lot. As you can tell, it is actually showing me synonyms for things. It's showing me the results, the etymology for things. But this one, you'd be surprised to find, actually can't handle the misspellings like it needs to; it isn't able to deal with that response, which is just kind of surprising. And if we look over at the GPT-5.2 version, which scored lower, 89% or something like that, you can see it gets through all of the different misspellings, changes, and hints, has a thesaurus, has etymology, has a lot of the things. But the problem is it really is still missing quite a bit. And especially when we talk about something like this one here, it definitely doesn't hit all of the marks, all of the 99% items that are inside of that PRD. We know for a fact that PRD has everything it needs. And it still turned out like this. So,
what was going on here? 18 items were
missing. Not small things. Word of the
day integration missing. Sound effects
gone. Auto dismiss behavior nowhere.
These weren't edge cases. These were
features I explicitly asked for that
made it into the PRD that just
disappeared during planning. The
planning step was dropping 20 to 30% of
my requirements.
Not because they weren't clear, not
because they were unreasonable. They
were just lost. So I ran it again. I
gave Opus the plan it had written, the
original PRD, and said, "Find everything
in the PRD that's not in the plan, and
give me a fallout list. Then update the
plan to include those items." After the
second pass, eight items were missing
instead of 18. After the third, the
remaining items were things that I hadn't really specified all that well anyway; they were pretty ambiguous, so it was legitimate that they were missing, and that's what I asked it to add. So this is the finding: not "use a smarter model for your PRD." The finding is that your planning step is probably dropping a quarter of your requirements and you don't even know it. Okay. And so here we
are. This is the version of the application as we've triple-planned it, come out with a full-blown version, and iterated on it a little bit. So this is what the application looks like. It has a lot of the features we would anticipate at this point; you can see all of the different aspects of the original request inside of it. And I wanted to very briefly, once again, describe to you the pattern that got me here versus any of the previous planning builds. Essentially, I make sure that I carry intent over into the PRD. I think that's a critical aspect that I will be talking about more on this channel very soon. I'm very interested in intent, but we need to make sure that intent isn't scrubbed away when we're building our plans. Once we get that PRD into the system, I allow it to plan against it. And then I say, "Okay, you think you're done planning? Compare yourself against the PRD. Make sure you're covering everything." I do that once or twice. Then you're going to have a completed version of a plan. Now
what I believe is that this should be added to planning mode in tools like Claude Code and Cursor when they talk about planning. They should be able to evaluate against the original request. That should be a straightforward thing to do, to make sure that the plan they've created considers everything they were asked to do.
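In my workflow, that plan-versus-PRD comparison is done by prompting the model itself. Purely to show the shape of the loop, here is a toy mechanical version; the naive substring matching and the append-style plan update are placeholders for what a model would actually do:

```python
def fallout(prd_items, plan_text):
    """Return PRD items that never made it into the plan (naive substring match)."""
    plan = plan_text.lower()
    return [item for item in prd_items if item.lower() not in plan]

def reconcile(prd_items, plan_text, update_plan, max_passes=3):
    """Repeat: find missing items, fold them into the plan, until nothing is missing.

    update_plan stands in for the step a human or model performs;
    in this sketch it just appends the missing items.
    """
    for _ in range(max_passes):
        missing = fallout(prd_items, plan_text)
        if not missing:
            break
        plan_text = update_plan(plan_text, missing)
    return plan_text, fallout(prd_items, plan_text)

# Toy requirements echoing the ones the planning step dropped in the video.
prd_items = ["word of the day", "sound effects", "auto-dismiss"]
plan = "Build the lookup panel with sound effects."
plan, still_missing = reconcile(
    prd_items, plan,
    update_plan=lambda p, missing: p + " Also: " + "; ".join(missing) + ".",
)
print(still_missing)  # → []
```

The point isn't the matching logic; it's that the check runs after planning and feeds its fallout list back into the plan, exactly the "compare yourself against the PRD" pass described above.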
It's kind of surprising that they're not doing that. And that's where I found myself. And that's exactly where we are, which is: oh wow, the planning mode was actually the problem. It was scraping off all of these edges and I didn't know it. What actually matters here is getting your intent to survive the journey and then verifying that nothing got dropped along the way. Your intent is the actual you in all of this. The gap between your best idea and what gets built isn't about using GPT-5.2 Pro versus 5.2 Instant. It's about the silent losses at each handoff: the intent that doesn't make it into the PRD, the requirements that don't make it into the plan, the features that don't make it into the build. So basically, check your work at each step. That's it. That's the finding that three days of building the same app nine times actually taught me.
Look, if this changed how you think
about AI assisted development, let me
know in the comments. I'm curious
whether you've seen similar patterns or
this is new to you. And if you're
building with AI tools regularly,
subscribe. I'm doing more experiments
like this all of the time. I can't seem
to stop.
Thanks for coming along for the ride on this one, and I'll see you in the next one.
I thought GPT-5 Pro would crush the cheaper models. 15 minutes vs 15 seconds to generate a PRD—there had to be a difference. Then I built the same app 9 times and discovered something I wasn't expecting at all.

This video documents an experiment: take one intent document, feed it to 8 different AI models (GPT-5.2 Instant through GPT-5.2 Pro, Claude Opus 4.5, Sonnet, and more), generate a PRD from each, then build every PRD using the exact same system—Claude Code with Opus 4.5 in planning mode. The hypothesis was obvious: smarter model = better PRD = better app. The results broke that assumption completely—and led to a much more useful discovery about where requirements actually get lost in AI-assisted development.

If you're building applications with AI tools and wondering whether you should pay for premium models to write your specs, this will save you time and money. Engineers evaluating AI coding workflows, developers curious about Claude Code's planning mode, and anyone who's ever wondered "does the PRD actually matter?" will find something useful here. Whether you're just starting with AI-assisted development or you've shipped dozens of AI-built projects, the finding about intent preservation applies to your workflow.

#ClaudeCode #GPT5 #AICoding #PRD #AIAgents

00:00 - Intro
00:54 - The first experiment
02:45 - First build results
06:25 - Initial scores
08:46 - But, now what?
09:44 - So, wait, what?
12:37 - Build the perfect PRD
13:56 - Even a perfect PRD is not good enough
15:06 - After forcing a better plan
16:48 - Conclusion