Hi there, this is Christian from LangChain.
If you build with coding agents like Cursor, you probably recognize this: the first few turns with the agent are great, but as you keep talking to the agent in the same thread, the quality slides, decisions get fuzzier, the overall code quality drops, and then Cursor drops in this system line: "Context summarized." That's the moment you know you've crossed the context boundary. So why is
summarization such a big deal for
context engineering? Every agent you
build lives inside a fixed memory
window, 100,000, 200,000 or 1 million
tokens, whatever the model supports. And
that window is both your superpower and
your bottleneck at the same time. As
conversations grow, every turn you add
competes for space within that context: earlier reasoning, earlier tool outputs, earlier code snippets. So without a
good strategy, two bad things happen.
One, the model forgets about important
steps or repeats work, the classic
context drift. And two, you start paying
for tokens that never even influence the
next prediction. The summarization
middleware lets you take control of
exactly that trade-off. You shrink the
history, but you do it intelligently.
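To make that idea concrete, here is a toy TypeScript sketch of the trade-off, not LangChain's implementation: old messages get collapsed into one compact recap message while the most recent ones survive verbatim. The `summarize` callback is a hypothetical stand-in for a real LLM call.

```typescript
interface Message {
  role: "user" | "assistant" | "system";
  content: string;
}

// Instead of dropping old history outright, collapse it into a single
// recap message and keep the most recent messages verbatim.
function compressHistory(
  messages: Message[],
  keepRecent: number,
  summarize: (old: Message[]) => string
): Message[] {
  if (messages.length <= keepRecent) return messages;
  const old = messages.slice(0, messages.length - keepRecent);
  const recent = messages.slice(messages.length - keepRecent);
  const recap: Message = {
    role: "system",
    content: `Previous conversation summary: ${summarize(old)}`,
  };
  return [recap, ...recent];
}
```

The point of the sketch is only the shape of the operation: history shrinks, but the important bits survive as a recap instead of vanishing.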
Let's check it out. So before we talk
about how summarization helps, it's
worth understanding why context
management is so tricky. When agents run
for a while, they start to suffer from
what some people call context failures.
For instance, you get something called
context poisoning, which is when a small
mistake slips into the context and keeps
being reused by the LLM. Then there's
something called context distraction,
where the model gets overwhelmed and
loses focus on what's important. Next is
context confusion, where too many
unimportant details lead to poor
answers. And finally, context clash
where new information conflicts with
information that's already within the
context. There's a really great article by Drew Breunig on how long contexts can fail your agent. So, how can we make
sure this doesn't happen to our agent?
There are a few well-established tactics we
can use to keep our agent context clean
and efficient. First, there's RAG, retrieval-augmented generation, which
only pulls in the information that's
actually relevant for the agent to work.
Next is something called tool loadout,
which means you don't throw every tool definition into your agent but load only the ones that are needed for your
current task. Next is context
quarantine, which means that you try to
isolate the work into smaller threads so that one conversation doesn't pollute another. And then there's pruning,
which simply deletes noise: irrelevant messages and outdated tool outputs.
And last, there's offloading, which
lets you store data outside of the
context and load it back in when you
need it. And lastly, the technique that
we want to focus on in this video is
summarization: instead of deleting context, you compress old history into a compact recap, keeping important context around while freeing up space. Now, in our
Next.js sandbox application, we have one
agent scenario that focuses on the
summarization middleware. And our
summarization agent is a coding agent
that helps us to refactor our project.
The code is very simple. We have a Next.js POST endpoint that takes a request payload and pretty much just passes it to the summarization agent. The payload
contains the message, the API key, and
the thread ID.
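The handling of that payload can be sketched roughly like this. The field names follow the description above; the validation function itself is an illustrative assumption, not the repository's exact code.

```typescript
// Shape of the request payload described above: the user message,
// the API key, and the thread ID.
interface ChatPayload {
  message: string;
  apiKey: string;
  threadId: string;
}

// Minimal parser a POST route handler could call before forwarding
// the payload to the summarization agent.
function parseChatPayload(body: unknown): ChatPayload {
  const b = body as Partial<ChatPayload> | null;
  if (typeof b?.message !== "string" || b.message.length === 0) {
    throw new Error("missing message");
  }
  if (typeof b.apiKey !== "string") throw new Error("missing apiKey");
  if (typeof b.threadId !== "string") throw new Error("missing threadId");
  return { message: b.message, apiKey: b.apiKey, threadId: b.threadId };
}
```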
The summarization agent is fairly simple.
It has a mocked file system, and it has
two tools to read and list files. Then
we basically start midway through a conversation. I want to make sure that we trigger the context window limit at a certain point. So I created some initial
messages to help fill up the context
window from the beginning. Then we
define two models. One is our agent model that we use for the agent's refactoring work, and the other is the model for creating the summary within our summarization middleware.
In our createAgent call, we then plug everything together. We define the model for our agent work, the two tools we're going to use for the refactoring, and our summarization middleware. That summarization middleware takes a model as well, and this can now be a cheaper model; in this case, we're using Claude Haiku 4.5. Then we define the points where to trigger the summarization middleware and what type of information we keep. So you can see
here we can define multiple trigger
conditions. For instance, we can say we want to trigger the summarization after the context window has filled up by 80%. The summarization middleware looks into the model profile and knows how many tokens each model's context window provides. For demo purposes, we trigger the summarization middleware after about 2,000 tokens. You can also trigger it after a certain number of messages.
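Those trigger conditions boil down to a simple check. The following is a hedged sketch of what such a condition could look like, not the middleware's actual internals; the `TriggerCondition` type and function names are made up for illustration.

```typescript
// Sketch of the trigger conditions described above: fire on a fraction
// of the model's context window, an absolute token count, or a message
// count. Names are hypothetical, not LangChain's internal types.
type TriggerCondition =
  | { fraction: number }   // e.g. 0.8 = 80% of the context window
  | { tokens: number }     // e.g. 2,000 tokens
  | { messages: number };  // e.g. 30 messages

function shouldSummarize(
  usedTokens: number,
  messageCount: number,
  contextWindow: number,
  triggers: TriggerCondition[]
): boolean {
  // Fire if any configured condition is met.
  return triggers.some((t) => {
    if ("fraction" in t) return usedTokens >= t.fraction * contextWindow;
    if ("tokens" in t) return usedTokens >= t.tokens;
    return messageCount >= t.messages;
  });
}
```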
Now the keep property allows you to
define what type of information to keep.
Here we say we want to keep roughly the last thousand tokens. And then you can define a custom summary prefix; in our case, it's "Previous conversation summary."
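Putting those pieces together, the wiring could look roughly like this. This is a sketch based on the options described in the video; the option names (`trigger`, `keep`, `summaryPrefix`), the model IDs, and the `readFileTool`/`listFilesTool` variables are assumptions, so check the linked LangChain middleware docs for the exact API.

```typescript
import { createAgent, summarizationMiddleware } from "langchain";
import { ChatAnthropic } from "@langchain/anthropic";

// `readFileTool` and `listFilesTool` stand in for the two mocked
// file-system tools from the sandbox; model IDs are placeholders.
const agent = createAgent({
  model: new ChatAnthropic({ model: "claude-sonnet-4-5" }),
  tools: [readFileTool, listFilesTool],
  middleware: [
    summarizationMiddleware({
      // A cheaper model used only for writing the recap.
      model: new ChatAnthropic({ model: "claude-haiku-4-5" }),
      // Fire once roughly 2,000 tokens have accumulated (demo value).
      trigger: { tokens: 2000 },
      // Keep about the last thousand tokens of recent history verbatim.
      keep: { tokens: 1000 },
      summaryPrefix: "Previous conversation summary:",
    }),
  ],
});
```

The key design choice is the second, cheaper model: the summary does not need the agent model's full capability, so offloading it keeps costs down.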
Lastly, we plug everything together. We invoke the agent the first time with our initial messages, while all consecutive invocations contain just the new messages, and then we return the agent stream and display the information in our front end. So let's
try it out. We have one example
prompt here that says let's continue
with the refactoring. Can you help me
create a date utils ts file? You can see
here that our context window immediately
fills up to 1,400 tokens. And then our
agent starts helping us with coding. You
see that the agent now asks us, "Would you like me to suggest any additional improvements?" Sure, go ahead, suggest things.
The agent now suggests more improvements
to our application. And now we see that
we filled up the context window to 2487
tokens. Now that means the next
interaction with the agent will trigger
the summarization before we send the
message to the agent. So let's say we
want the agent to help us format the code. Now I want you to pay attention to two things: we're going to see the summarization happening via the summarization middleware, but you will also see that our context window will shrink down to a thousand tokens. Let's see what that looks like.
So once the trigger kicks in, we see that the summarization middleware is now active, summarizing our context, and our context window now goes down to 1,100 tokens again. And now we can continue
calling tools and reading from files and
filling up our context window again. And
if you look at the summarization, we see
that
the agent has summarized our intent, our
project structure, the current code
files, and some goals and issues that
have been identified. So, we've been
able to basically compress our previous history into one single message and free up a lot of space in our context
window. So, to wrap things up, context
management isn't just about fitting
information into a token window. It's
about engineering what your agent
remembers and how it reasons over time.
With the LangChain summarization middleware, you can automatically
compress long histories once your
context fills up, keeping your agent
sharp, efficient, and affordable. You
control when it triggers, how much
context to keep, and what to preserve,
all within just a few lines of
configuration.
If you want to try this out, clone the
example repository in the description
below and watch the summary bubble
appear in your own chat application.
That's it for this episode. See you in the
next one.
Long-running agents eventually hit context overload, leading to context poisoning, distraction, confusion, and degraded performance. In this video, Christian from LangChain breaks down how Summarization Middleware helps you automatically manage and compress conversation history to keep your agents sharp, efficient, and reliable.

You'll learn:
• Why long contexts silently fail over time
• Six strategies for fixing context overload (RAG, pruning, offloading, and more)
• How summarization fits into the ReAct agent loop
• How to configure triggers, keep conditions, and custom prompts
• A full live demo in Next.js, showing summaries appear as chat bubbles in real time

What Summarization Middleware gives you:
• Automatic summarization when token limits are approached
• Flexible triggers based on tokens, fractions, or message counts
• Control over how much recent context is preserved
• A separate, cheaper model for summarization to reduce cost

Perfect for:
• Coding agents
• Customer support assistants
• Multi-step workflows
• Any long-running conversational agent

📚 Docs: https://docs.langchain.com/oss/javascript/langchain/middleware/built-in#summarization
🧑💻 Example Code: https://github.com/christian-bromann/langchat