There's a constant debate among devs at
the moment about how good coding agents
actually are, and the debate usually has
two sides. One side says no, coding
agents suck, I hate AI coding. The other
side of the debate says no, you're just
using them wrong; this is a skill issue.
Now, I can see both
sides of this debate, but if there is a
sides of this debate, but if there is a
skill issue that I see most often with
devs, it is not thinking enough about
the context window. The context window
is the main constraint that most AI
coding agents face these days. And
honestly, most devs don't even really
know what it is or how it impacts how
you use these coding agents. If that's
you, you have come to the right place.
We are going to explain everything you
need to know as a user of coding agents
about what the context window is and how
it impacts coding agent performance. So,
let's get started right away by talking
about what actually makes up the context
window. The context window is the entire
set of input and output tokens that the
LLM sees. The input tokens are the
things that you pass to the LLM. You
might pass it a system prompt, which is
a set of instructions telling the LLM
what it should do, and maybe a user
message to initiate the conversation.
Once you've sent that, the LLM
starts streaming back assistant
messages, which are the output tokens,
and the input plus the output tokens make up
the entire context window. As the
conversation grows longer, let's say
you're chatting with Claude or ChatGPT,
the more input and output tokens are
going to be in that context window. So
we usually talk about the context window
growing, or the number of tokens
that you're spending inside that context
window growing, and eventually it's going
to grow so long that you will hit a
limit. Each model has a hardcoded limit
which is set by the model provider. And
let's say you pass too many input
tokens: you've got a system message, a
user message and 100 more messages.
Well, you will get an error
from the LLM provider saying you have
hit the limit of the context window. You
might even hit this with a single super
long message. Let's say you're uploading
some documents or you're asking the LLM
to transcribe a video or an enormous
image. And this limit of the context
window is usually described in tokens.
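To make that concrete, here's a minimal sketch using Vercel's AI SDK, assuming you're calling a model through it (the model id and the prompt are placeholders I've picked for illustration): the system prompt and user message go in as input tokens, the assistant reply comes back as output tokens, and together they make up the context window, measured in tokens against the model's limit.

```ts
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

// Input tokens: the system prompt plus the conversation so far.
// Output tokens: the assistant message the model sends back.
// Input + output together are the context window for this call.
const result = await generateText({
  model: anthropic("claude-haiku-4-5"), // illustrative model id
  system: "You are a helpful coding assistant.",
  messages: [
    { role: "user", content: "Explain what a context window is in one paragraph." },
  ],
});

console.log(result.text);

// usage reports the input and output token counts for this call, i.e.
// how much of the model's context window limit this exchange consumed
// (the exact field names depend on your AI SDK version).
console.log(result.usage);
```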
If you don't know what a token is, then
you can check out my YouTube video on
tokens, which I'll link here. Now, you
can actually hit the limit while
generating tokens, too. For instance,
you can just be chatting with the system
like this and it will maybe tell you an
extremely long output which overruns its
context window and it will just stop
because the context window has been hit.
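If you're calling models programmatically, you can at least detect this. Here's a hedged sketch, again with the AI SDK: the finish reason tells you when generation stopped because a token limit was hit rather than because the model was done, and a crude characters-per-token estimate (roughly four characters per token for English text, which is only an approximation) can warn you before you even send a request that's close to the limit.

```ts
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

// Limit for the model you're using, e.g. 200k (see models.dev).
const CONTEXT_LIMIT = 200_000;

// Very rough heuristic: ~4 characters per token for English text.
// Only good enough to warn you that you're getting close to the limit.
const estimateTokens = (text: string) => Math.ceil(text.length / 4);

const prompt = "Summarise the following document:\n..."; // placeholder input

if (estimateTokens(prompt) > CONTEXT_LIMIT * 0.8) {
  console.warn("This request is close to the context window limit.");
}

const result = await generateText({
  model: anthropic("claude-haiku-4-5"), // illustrative model id
  prompt,
});

// A finish reason of "length" means the output stopped because a token
// limit was hit, not because the model decided it was finished.
if (result.finishReason === "length") {
  console.warn("The output was cut off mid-generation.");
}
```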
You can take a look at models.dev to
check out the different context window
limits and lots of other information
about different models. For instance,
Claude Haiku 4.5. If we zoom over here,
we can see a context window limit of
200,000 tokens. Down here, we can see a
limit of 2 million tokens. Let's take a
look at that one: it's Gemini 2.5 Pro. Gemini
kind of has really large context windows
as a selling point. But as we'll see
in a second, bigger is not always
better. We'll see too that there are
some models here like Qwen Math Plus
which only have a context window of
about 4,000 tokens. So smaller models and older
models too often have much smaller
context window limits. So why do these
models impose context window limits at
all? Why not just allow an infinite amount
of text to be passed through the model?
Well, some of it is down to the
constraints of the model's architecture:
LLM processing is expensive, and
a larger context
window means you're using more memory
per request. But also, the larger the
context window, the more performance
degrades. In other words, the more
information that you give a model, the
worse it's going to perform. This is
true across tiny models all the way up
to very, very large models. And the
reason for that is that all models
suffer from a problem of retrieving
information from their own context. This
is the classic needle in a haystack
problem. If you have one piece of
information in a huge, bloated
context and you're trying to get the LLM
to retrieve it and do
something with it, then it's going to
really struggle. This is especially true
for information that's in the middle of
a conversation. For instance, I've put
this very unscientific graph here, with
the impact on the output on one axis
and the position in the conversation on the other. What
happens is that for really long chats
here where we have each individual
message represented by this little
circle, the information in the middle of
the chat is going to be less prioritized
by the LLM. So the stuff at the start
and the stuff at the end is deemed most
important by the attention mechanism
that the LLM uses. This is not really
intended behavior. It's just an emergent
property of how these systems are
designed. So this is really really
important when you're doing AI coding.
The stuff at the start of the
conversation and the stuff at the end
are going to have the most impact. But all the big
bloated stuff in the middle is not
necessarily going to impact the result
that strongly. It still has an impact of
course, but much less than the stuff at
the start and the end. And this mimics
human behavior, too, if you've ever
heard of primacy bias and recency bias.
You will probably remember the start of
this video and the end of the video
better than the guff in the middle.
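One practical consequence, if you're building on top of an LLM yourself rather than using a coding agent's built-in commands: the simplest mitigation is a sliding window over the history, keeping the system prompt and only the most recent messages. A hypothetical sketch (the cut-off of ten messages is arbitrary):

```ts
type Message = { role: "system" | "user" | "assistant"; content: string };

// Keep the system prompt (primacy) and the most recent messages (recency),
// dropping the easily-lost middle of the conversation.
function trimHistory(messages: Message[], keepLast = 10): Message[] {
  const system = messages.filter((m) => m.role === "system");
  const rest = messages.filter((m) => m.role !== "system");
  return [...system, ...rest.slice(-keepLast)];
}
```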
The shorter the context window,
the fewer lost-in-the-middle problems
you're going to come across. Models just
do better with less, more focused
information, just like humans do. This
means that regularly clearing your
coding agent chats will refresh the
agent's memory and clear its context
window, making for much better
performance when you actually go to use
it. Let's actually dive into a coding
agent that I use, Claude Code. I've run
a command called /context here, and we
can see the context usage so far:
we've used 95k tokens out of
a 200k limit. This is on Sonnet 4.5, which has
a 200k context window limit. Nearly 8%
of it is just the system prompt,
and about 40% is these messages, so 77k
tokens is the content of the
conversation that I've run through so
far. Now if I had some work to do that
was related to the chat thread that I've
just done, which I think is just
reworking some documentation, then
105k tokens of free space feels
pretty good to me. But I would
definitely start getting scared once I
only had about, let's say, 50k tokens left,
at which point I would run /clear, which
clears the conversation history and
frees up the context window. You do have
an alternative in Claude Code here, which
is to compact the conversation with /compact.
If I run this, it clears the conversation history
and creates a summary of what happened.
In other words, it takes all of these
messages and just creates a smaller
message out of them. In theory, that
will pull us further away from the
context window limit and we'll get fewer
lost in the middle problems. However,
this does take some time and of course
you're using an LLM to generate a
summary. So, you are spending tokens
here. It's already taken about a minute
doing this and it's finally done. And we
can press Ctrl+O to see the full
summary. We can see it's created a
pretty lengthy summary of the
conversation we just had, without any of
the files that it pulled in or
anything like that, but it preserves
some of the intention, some of the
vibes, and it's like a sort of mini
rules file just for this conversation.
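If you're rolling your own agent rather than using Claude Code, compaction is roughly this. A naive sketch: the summary prompt, the "keep the last few messages verbatim" choice and the model id are my assumptions, not Claude Code's actual implementation, and note that it costs tokens and time, just like we saw above.

```ts
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

type Message = { role: "system" | "user" | "assistant"; content: string };

// Naive compaction: summarise the older messages with the LLM itself, then
// replace them with a single summary message. Clearing, by contrast, is
// just resetting the history to an empty array.
async function compact(history: Message[], keepLast = 4): Promise<Message[]> {
  if (history.length <= keepLast) return history;

  const older = history.slice(0, -keepLast);
  const recent = history.slice(-keepLast);

  const { text: summary } = await generateText({
    model: anthropic("claude-haiku-4-5"), // illustrative model id
    prompt:
      "Summarise this conversation so an assistant can pick it up later. " +
      "Preserve goals, decisions and open tasks:\n\n" +
      older.map((m) => `${m.role}: ${m.content}`).join("\n"),
  });

  return [
    { role: "user", content: `Summary of the earlier conversation:\n${summary}` },
    ...recent,
  ];
}
```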
If we run /context again, we can see that
we now have 90% free space, and the
messages, instead of the roughly 70k tokens
they were before, are now only
4k. So compacting is useful when you
want to preserve the vibes of a
conversation. But /clear should be
your default when you just want to
go back to a blank slate and keep
going from there. Whenever you're
working with a coding agent, you really
do need full transparency, full
understanding of what's happening in
your context window at any time. I want
to give you a word of warning here too
about MCP servers. MCP servers are super
attractive because they allow you to
plug and play with different pre-made
tool sets out there in the ecosystem,
but they can bloat your context
incredibly rapidly. You might have a
conversation here where like a third of
it is system prompt, a big
chunk of it is MCP tool definitions from just a
couple of MCP servers, and then only a
small slice is left for your actual messages. So, I tend
to be extremely extremely cautious about
adding MCP servers to my setup because I
know how important having a lean context
window is. I also don't tend to write
very large Cursor rules or Claude
rules files because, again, I'm just so scared
of these lost-in-the-middle problems. And as a
result, I really enjoy working with AI
coding agents and I get really decent
performance out of them, I think. And
hopefully, if you take on this paranoia
that I've developed, you'll get great
results, too. So, that's what a context
window is. It is the input and output
tokens that make up the entire thing
that the LLM can see at any one time.
Every LLM comes with a context window
limit, a hard-coded limit set by the model
provider, which is basically how many
tokens they think the LLM can reasonably
handle. All LLMs are prone to
lost-in-the-middle problems, where stuff in the
middle of the context window ends up
being deprioritized. And so when
you're assessing an LLM, you shouldn't
just look at how big the context window
is. You should look at how well it
retrieves information from its context
window. For instance, in April, Meta
announced Llama 4 Scout, which if I
hide myself is just down here, and it
has a 10 million token context window limit,
but it turned out, after people actually
played with it, that it suffered from really
bad lost-in-the-middle problems. And
even though you could feed it that
information, it wouldn't really do
anything with it. So I hope you have a
better understanding of the limits of
these models and how you can work around
those limits and understand them better
to get better results. If you want to go
deeper into LLMs, then I have just put out
an AI SDK crash course. This is a crash
course for Vercel's AI SDK, which I think
is the perfect way to get started with
LLMs if your primary language is
TypeScript. For just a couple more days,
you can get this for 99 bucks. So head
to aihero.dev if you want to learn more.
Thanks so much for following along. I
love talking about this stuff and I
really think this is valuable
information. If there's anything
LLM-based that you want me to cover,
especially talking about it in the
context of TypeScript, let me know in
the comments. So thanks for watching and
I will see you very soon.
A deep dive into the context window - the most important constraint when using AI coding agents. Learn what makes up a context window (input and output tokens), why models have limits, and the critical "lost-in-the-middle" problem that causes models to deprioritize information buried in long conversations. Discover practical strategies for managing context effectively in Claude Code, including when to clear vs. compact conversations, why bigger context windows aren't always better, and how MCP servers can bloat your context. Understanding context windows is the key skill that separates developers who get great results from coding agents versus those who struggle. Includes real examples and best practices for maintaining lean, focused contexts that maximize AI coding performance.

Become an AI Hero with my AI SDK v5 Crash Course: https://www.aihero.dev/workshops/ai-sdk-v5-crash-course
Sign up to my mailing list: https://aihero.dev/newsletter
Join the Discord: https://aihero.dev/discord