There's a constant debate among devs at
the moment about how good coding agents
actually are, and the debate usually has
two sides. One side says no, coding
agents suck, I hate AI coding. The other
side of the debate says no, you're just
using them wrong; this is a skill issue.
Now, I can see both
sides of this debate, but if there is a
sides of this debate, but if there is a
skill issue that I see most often with
devs, it is not thinking enough about
the context window. The context window
is the main constraint that most AI
coding agents face these days. And
honestly, most devs don't even really
know what it is or how it impacts how
you use these coding agents. If that's
you, you have come to the right place.
We are going to explain everything you
need to know as a user of coding agents
about what the context window is and how
it impacts coding agent performance. So,
let's get started right away by talking
about what actually makes up the context
window. The context window is the entire
set of input and output tokens that the
LLM sees. The input tokens are the
things that you pass to the LLM. You
might pass it a system prompt, which is
a set of instructions telling the LLM
what it should do, and maybe a user
message to initiate the conversation.
Once you've sent that, the LLM
starts streaming back assistant
messages, which are the output tokens,
and the input plus the output tokens make up
the entire context window. As the
conversation grows longer, let's say
you're chatting with Claude or ChatGPT,
the more input and output tokens are
going to be in that context window. So
we usually talk about the context window
growing, or the number of tokens
that you're spending inside that context
window growing, and eventually it's going
to grow so long that you will hit a
limit. Each model has a hardcoded limit
which is set by the model provider. And
let's say you pass too many input
tokens: you've got a system message, a
user message and 100 more messages.
Well, you will get an error
from the LLM provider saying you have
hit the limit of the context window. You
might even hit this with a single super
long message. Let's say you're uploading
some documents or you're asking the LLM
to transcribe a video or an enormous
image. And this limit of the context
window is usually described in tokens.
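To make that concrete, here's a minimal sketch using Vercel's AI SDK, assuming you're calling a model through it (the model id and the prompt are placeholders I've picked for illustration): the system prompt and user message go in as input tokens, the assistant reply comes back as output tokens, and together they make up the context window, measured in tokens against the model's limit.

```ts
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

// Input tokens: the system prompt plus the conversation so far.
// Output tokens: the assistant message the model sends back.
// Input + output together are the context window for this call.
const result = await generateText({
  model: anthropic("claude-haiku-4-5"), // illustrative model id
  system: "You are a helpful coding assistant.",
  messages: [
    { role: "user", content: "Explain what a context window is in one paragraph." },
  ],
});

console.log(result.text);

// usage reports the input and output token counts for this call, i.e.
// how much of the model's context window limit this exchange consumed
// (the exact field names depend on your AI SDK version).
console.log(result.usage);
```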
If you don't know what a token is, then
you can check out my YouTube video on
tokens, which I'll link here. Now, you
can actually hit the limit while
generating tokens, too. For instance,
you can just be chatting with the system
like this and it will maybe tell you an
extremely long output which overruns its
context window and it will just stop
because the context window has been hit.
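If you're calling models programmatically, you can at least detect this. Here's a hedged sketch, again with the AI SDK: the finish reason tells you when generation stopped because a token limit was hit rather than because the model was done, and a crude characters-per-token estimate (roughly four characters per token for English text, which is only an approximation) can warn you before you even send a request that's close to the limit.

```ts
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

// Limit for the model you're using, e.g. 200k (see models.dev).
const CONTEXT_LIMIT = 200_000;

// Very rough heuristic: ~4 characters per token for English text.
// Only good enough to warn you that you're getting close to the limit.
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

const prompt = "Summarise the following document:\n..."; // placeholder input

if (estimateTokens(prompt) > CONTEXT_LIMIT * 0.8) {
  console.warn("This request is close to the context window limit.");
}

const result = await generateText({
  model: anthropic("claude-haiku-4-5"), // illustrative model id
  prompt,
});

// A finish reason of "length" means the output stopped because a token
// limit was hit, not because the model decided it was finished.
if (result.finishReason === "length") {
  console.warn("The output was cut off mid-generation.");
}
```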
You can take a look at models.dev to
check out the different context window
limits and lots of other information
about different models. For instance,
Claude Haiku 4.5. If we zoom over here,
we can see a context window limit of
200,000 tokens. Down here, we can see a
limit of 2 million tokens. Let's take a
look at that one: it's Gemini 2.5 Pro. Gemini
kind of has really large context windows
as a selling point. But as we'll see
in a second, bigger is not always
better. We'll see too that there are
some models here like Qwen Math Plus
which only have a context window of
about 4,000 tokens. So smaller models and older
models too often have much smaller
context window limits. So why do these
models impose context window limits at
all? Why not just allow an infinite amount
of text to be passed through the model?
Well, some of it is down to the
constraints of the model's architecture:
LLM processing is expensive, and
a larger context
window means you're using more memory
per request. But also, the larger the
context window, the more performance
degrades. In other words, the more
information that you give a model, the
worse it's going to perform. This is
true across tiny models all the way up
to very, very large models. And the
reason for that is that all models
suffer from a problem of retrieving
information from their own context. This
is the classic needle in a haystack
problem. If you have one piece of
information in a huge, bloated
context and you're trying to get the LLM
to retrieve it and do
something with it, then it's going to
really struggle. This is especially true
for information that's in the middle of
a conversation. For instance, I've put
this very unscientific graph here, with
the impact on the output on one axis
and the position in the conversation on the other. What
happens is that for really long chats
here where we have each individual
message represented by this little
circle, the information in the middle of
the chat is going to be less prioritized
by the LLM. So the stuff at the start
and the stuff at the end is deemed most
important by the attention mechanism
that the LLM uses. This is not really
intended behavior. It's just an emergent
property of how these systems are
designed. So this is really really
important when you're doing AI coding.
The stuff at the start of the
conversation and the stuff at the end
are going to have the most impact. But all the big
bloated stuff in the middle is not
necessarily going to impact the result
that strongly. It still has an impact of
course, but much less than the stuff at
the start and the end. And this mimics
human behavior, too, if you've ever
heard of primacy bias and recency bias.
You will probably remember the start of
this video and the end of the video
better than the guff in the middle.
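One practical consequence, if you're building on top of an LLM yourself rather than using a coding agent's built-in commands: the simplest mitigation is a sliding window over the history, keeping the system prompt and only the most recent messages. A hypothetical sketch (the cut-off of ten messages is arbitrary):

```ts
type Message = { role: "system" | "user" | "assistant"; content: string };

// Keep the system prompt (primacy) and the most recent messages (recency),
// dropping the easily-lost middle of the conversation.
function trimHistory(messages: Message[], keepLast = 10): Message[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  return [...system, ...rest.slice(-keepLast)];
}
```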
The shorter the context window,
the fewer lost-in-the-middle problems
you're going to come across. Models just
do better with less, more focused
information, just like humans do. This
means that regularly clearing your
coding agent chats will refresh the
agent's memory and clear its context
window, making for much better
performance when you actually go to use
it. Let's actually dive into a coding
agent that I use, Claude Code. I've run
a command called /context here, and we
can see the context usage so far:
we've used 95k tokens out of
a 200k limit. This is on Sonnet 4.5, which has
a 200k context window limit. Nearly 8%
of it is just the system prompt,
and about 40% is these messages, so 77k
tokens is the content of the
conversation that I've run through so
far. Now if I had some work to do that
was related to the chat thread that I've
just done, which I think is just
reworking some documentation, then
105k tokens of free space feels
pretty good to me. But I would
definitely start getting scared once I
only had about, let's say, 50k tokens left,
at which point I would run /clear, which
clears the conversation history and
frees up the context window. You do have
an alternative in Claude Code here, which
is to compact the conversation with /compact.
If I run this, it clears the conversation history
and creates a summary of what happened.
In other words, it takes all of these
messages and just creates a smaller
message out of them. In theory, that
will pull us further away from the
context window limit and we'll get fewer
lost in the middle problems. However,
this does take some time and of course
you're using an LLM to generate a
summary. So, you are spending tokens
here. It's already taken about a minute
doing this and it's finally done. And we
can press Ctrl+O to see the full
summary. We can see it's created a
pretty lengthy summary of the
conversation we just had, without any of
the files that it pulled in or
anything like that, but it preserves
some of the intention, some of the
vibes, and it's like a sort of mini
rules file just for this conversation.
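If you're rolling your own agent rather than using Claude Code, compaction is roughly this. A naive sketch: the summary prompt, the "keep the last few messages verbatim" choice and the model id are my assumptions, not Claude Code's actual implementation, and note that it costs tokens and time, just like we saw above.

```ts
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

type Message = { role: "system" | "user" | "assistant"; content: string };

// Naive compaction: summarise the older messages with the LLM itself, then
// replace them with a single summary message. Clearing, by contrast, is
// just resetting the history to an empty array.
async function compact(history: Message[], keepLast = 4): Promise<Message[]> {
  if (history.length <= keepLast) return history;

  const older = history.slice(0, -keepLast);
  const recent = history.slice(-keepLast);

  const { text: summary } = await generateText({
    model: anthropic("claude-haiku-4-5"), // illustrative model id
    prompt:
      "Summarise this conversation so an assistant can pick it up later. " +
      "Preserve goals, decisions and open tasks:\n\n" +
      older.map((m) => `${m.role}: ${m.content}`).join("\n"),
  });

  return [
    { role: "user", content: `Summary of the earlier conversation:\n${summary}` },
    ...recent,
  ];
}
```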
If we run /context again, we can see that
we now have 90% free space, and the
messages, instead of the roughly 70k tokens
they were before, are now only
4k. So compacting is useful when you
want to preserve the vibes of a
conversation. But /clear should be
your default when you just want to
go back to a blank slate and keep
going from there. Whenever you're
working with a coding agent, you really
do need full transparency, full
understanding of what's happening in
your context window at any time. I want
to give you a word of warning here too
about MCP servers. MCP servers are super
attractive because they allow you to
plug and play with different pre-made
tool sets out there in the ecosystem,
but they can bloat your context
incredibly rapidly. You might have a
conversation here where like a third of
it is system prompt, a big
chunk of it is MCP tool definitions from just a
couple of MCP servers, and then only a
small slice is left for your actual messages. So, I tend
to be extremely extremely cautious about
adding MCP servers to my setup because I
know how important having a lean context
window is. I also don't tend to write
very large Cursor rules or Claude
rules files because, again, I'm just so scared
of these lost-in-the-middle problems. And as a
result, I really enjoy working with AI
coding agents and I get really decent
performance out of them, I think. And
hopefully, if you take on this paranoia
that I've developed, you'll get great
results, too. So, that's what a context
window is. It is the input and output
tokens that make up the entire thing
that the LLM can see at any one time.
Every LLM comes with a context window
limit, a hard-coded limit set by the model
provider, which is basically how many
tokens they think the LLM can reasonably
handle. All LLMs are prone to
lost-in-the-middle problems, where stuff in the
middle of the context window ends up
being deprioritized. And so when
you're assessing an LLM, you shouldn't
just look at how big the context window
is. You should look at how well it
retrieves information from its context
window. For instance, in April, Meta
announced Llama 4 Scout, which if I
hide myself is just down here, and it
has a 10 million token context window limit,
but it turned out, after people actually
played with it, that it suffered from really
bad lost-in-the-middle problems. And
even though you could feed it that
information, it wouldn't really do
anything with it. So I hope you have a
better understanding of the limits of
these models and how you can work around
those limits and understand them better
to get better results. If you want to go
deeper into LLMs, then I have just put out
an AI SDK crash course. This is a crash
course for Vercel's AI SDK, which I think
is the perfect way to get started with
LLMs if your primary language is
TypeScript. For just a couple more days,
you can get this for 99 bucks. So head
to aihero.dev if you want to learn more.
Thanks so much for following along. I
love talking about this stuff and I
really think this is valuable
information. If there's anything
LLM-based that you want me to cover,
especially talking about it in the
context of TypeScript, let me know in
the comments. So thanks for watching and
I will see you very soon.
A deep dive into the context window - the most important constraint when using AI coding agents. Learn what makes up a context window (input and output tokens), why models have limits, and the critical "lost-in-the-middle" problem that causes models to deprioritize information buried in long conversations. Discover practical strategies for managing context effectively in Claude Code, including when to clear vs. compact conversations, why bigger context windows aren't always better, and how MCP servers can bloat your context. Understanding context windows is the key skill that separates developers who get great results from coding agents versus those who struggle. Includes real examples and best practices for maintaining lean, focused contexts that maximize AI coding performance.

Become an AI Hero with my AI SDK v5 Crash Course: https://www.aihero.dev/workshops/ai-sdk-v5-crash-course
Sign up to my mailing list: https://aihero.dev/newsletter
Join the Discord: https://aihero.dev/discord