Hi there, this is Christian from LangChain.
If you build with coding agents like Cursor, you probably recognize this: the first few turns with the agent are great, but as you keep talking to the agent in the same thread, the quality slides, decisions get fuzzier, the overall code quality drops, and then Cursor drops in this system line: "Context summarized." That's the moment you know you've crossed the context boundary. So why is
summarization such a big deal for
context engineering? Every agent you
build lives inside a fixed memory
window, 100,000, 200,000 or 1 million
tokens, whatever the model supports. And
that window is both your superpower and
your bottleneck at the same time. As
conversations grow, every turn you add
competes for space within that context: earlier reasoning, earlier tool outputs, earlier code snippets. So without a
good strategy, two bad things happen.
One, the model forgets about important
steps or repeats work, the classic
context drift. And two, you start paying
for tokens that never even influence the
next prediction. The summarization
middleware lets you take control of
exactly that trade-off. You shrink the
history, but you do it intelligently.
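To make that idea concrete, here is a toy TypeScript sketch of the trade-off, not LangChain's implementation: old messages get collapsed into one compact recap message while the most recent ones survive verbatim. The `summarize` callback is a hypothetical stand-in for a real LLM call.

```typescript
interface Message {
  role: "user" | "assistant" | "system";
  content: string;
}

// Instead of dropping old history outright, collapse it into a single
// recap message and keep the most recent messages verbatim.
function compressHistory(
  messages: Message[],
  keepRecent: number,
  summarize: (old: Message[]) => string
): Message[] {
  if (messages.length <= keepRecent) return messages;
  const old = messages.slice(0, messages.length - keepRecent);
  const recent = messages.slice(messages.length - keepRecent);
  const recap: Message = {
    role: "system",
    content: `Previous conversation summary: ${summarize(old)}`,
  };
  return [recap, ...recent];
}
```

The point of the sketch is only the shape of the operation: history shrinks, but the important bits survive as a recap instead of vanishing.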
Let's check it out. So before we talk
about how summarization helps, it's
worth understanding why context
management is so tricky. When agents run
for a while, they start to suffer from
what some people call context failures.
For instance, you get something called
context poisoning, which is when a small
mistake slips into the context and keeps
being reused by the LLM. Then there's
something called context distraction,
where the model gets overwhelmed and
loses focus on what's important. Next is
context confusion, where too many
unimportant details lead to poor
answers. And finally, context clash
where new information conflicts with
information that's already within the
context. There's a really great article by Drew Breunig on how long contexts can fail your agent. So, how can we make
sure this doesn't happen to our agent?
There are a few well-established tactics we
can use to keep our agent context clean
and efficient. First, there's RAG, retrieval-augmented generation, which
only pulls in the information that's
actually relevant for the agent to work.
Next is something called tool loadout,
which means you don't throw every tool definition into your agent but load only the ones that are needed for your
current task. Next is context
quarantine, which means that you try to
isolate the work into smaller threads so that one conversation doesn't pollute another. And then there's pruning,
which simply deletes noise: irrelevant messages and outdated tool outputs.
And last, there's offloading, which
lets you store data outside of the
context and load it back in when you
need it. And lastly, the technique that
we want to focus on in this video is
summarization: instead of deleting context, you compress old history into a compact recap, keeping important context around while freeing up space. Now, in our
Next.js sandbox application, we have one
agent scenario that focuses on the
summarization middleware. And our
summarization agent is a coding agent
that helps us to refactor our project.
The code is very simple. We have a Next.js POST endpoint that takes a request payload and pretty much just passes it to the summarization agent. The payload
contains the message, the API key, and
the thread ID.
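The handling of that payload can be sketched roughly like this. The field names follow the description above; the validation function itself is an illustrative assumption, not the repository's exact code.

```typescript
// Shape of the request payload described above: the user message,
// the API key, and the thread ID.
interface ChatPayload {
  message: string;
  apiKey: string;
  threadId: string;
}

// Minimal parser a POST route handler could call before forwarding
// the payload to the summarization agent.
function parseChatPayload(body: unknown): ChatPayload {
  const b = body as Partial<ChatPayload> | null;
  if (typeof b?.message !== "string" || b.message.length === 0) {
    throw new Error("missing message");
  }
  if (typeof b.apiKey !== "string") throw new Error("missing apiKey");
  if (typeof b.threadId !== "string") throw new Error("missing threadId");
  return { message: b.message, apiKey: b.apiKey, threadId: b.threadId };
}
```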
The summarization agent is fairly simple.
It has a mocked file system, and it has
two tools to read and list files. Then
we basically start midway through a conversation. I want to make sure that we trigger the context window limit at a certain point. So I created some initial
messages to help fill up the context
window from the beginning. Then we
define two models. One is our agent model that we use for the agent's refactoring work, and the other is the model for creating the summary within our summarization middleware.
In our createAgent call, we then plug everything together. We define the model for our agent work, the two tools we're going to use for the refactoring, and our summarization middleware. That summarization middleware takes a model as well, and this can now be a cheaper model; in this case, we're using Claude Haiku 4.5. Then we define the points where to trigger the summarization middleware and what type of information we keep. So you can see
here we can define multiple trigger
conditions. For instance, we can say we want to trigger the summarization after the context window has filled up by 80%. The summarization middleware looks into the model profile and knows how many tokens each model's context window provides. For demo purposes, we trigger the summarization middleware after about 2,000 tokens. You can also trigger it after a certain number of messages.
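Those trigger conditions boil down to a simple check. The following is a hedged sketch of what such a condition could look like, not the middleware's actual internals; the `TriggerCondition` type and function names are made up for illustration.

```typescript
// Sketch of the trigger conditions described above: fire on a fraction
// of the model's context window, an absolute token count, or a message
// count. Names are hypothetical, not LangChain's internal types.
type TriggerCondition =
  | { fraction: number }   // e.g. 0.8 = 80% of the context window
  | { tokens: number }     // e.g. 2,000 tokens
  | { messages: number };  // e.g. 30 messages

function shouldSummarize(
  usedTokens: number,
  messageCount: number,
  contextWindow: number,
  triggers: TriggerCondition[]
): boolean {
  // Fire if any configured condition is met.
  return triggers.some((t) => {
    if ("fraction" in t) return usedTokens >= t.fraction * contextWindow;
    if ("tokens" in t) return usedTokens >= t.tokens;
    return messageCount >= t.messages;
  });
}
```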
Now the keep property allows you to
define what type of information to keep.
Here we say we want to keep roughly the last thousand tokens. And then you can define a custom summary prefix; in our case, it's "Previous conversation summary."
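Putting those pieces together, the wiring could look roughly like this. This is a sketch based on the options described in the video; the option names (`trigger`, `keep`, `summaryPrefix`), the model IDs, and the `readFileTool`/`listFilesTool` variables are assumptions, so check the linked LangChain middleware docs for the exact API.

```typescript
import { createAgent, summarizationMiddleware } from "langchain";
import { ChatAnthropic } from "@langchain/anthropic";

// `readFileTool` and `listFilesTool` stand in for the two mocked
// file-system tools from the sandbox; model IDs are placeholders.
const agent = createAgent({
  model: new ChatAnthropic({ model: "claude-sonnet-4-5" }),
  tools: [readFileTool, listFilesTool],
  middleware: [
    summarizationMiddleware({
      // A cheaper model used only for writing the recap.
      model: new ChatAnthropic({ model: "claude-haiku-4-5" }),
      // Fire once roughly 2,000 tokens have accumulated (demo value).
      trigger: { tokens: 2000 },
      // Keep about the last thousand tokens of recent history verbatim.
      keep: { tokens: 1000 },
      summaryPrefix: "Previous conversation summary:",
    }),
  ],
});
```

The key design choice is the second, cheaper model: the summary does not need the agent model's full capability, so offloading it keeps costs down.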
Lastly, we plug everything together. We invoke the agent the first time with our initial messages, while all consecutive invocations contain just the new messages, and then we return the agent stream and display the information in our front end. So let's
try it out. We have one example
prompt here that says let's continue
with the refactoring. Can you help me
create a date utils ts file? You can see
here that our context window immediately
fills up to 1,400 tokens. And then our
agent starts helping us with coding. You
see that the agent now asks us, "Would you like me to suggest any additional improvements?" Sure, go ahead, suggest things.
The agent now suggests more improvements
to our application. And now we see that
we filled up the context window to 2487
tokens. Now that means the next
interaction with the agent will trigger
the summarization before we send the
message to the agent. So let's say we
want the agent to help us format the code. Now I want you to pay attention to two things: we're going to see the summarization happening via the summarization middleware, but you will also see that our context window will shrink down to a thousand tokens. Let's see what that looks like.
So once the trigger kicks in, we see that the summarization middleware is now active, summarizing our context, and our context window now goes down to 1,100 tokens again. And now we can continue
calling tools and reading from files and
filling up our context window again. And
if you look at the summarization, we see
that
the agent has summarized our intent, our
project structure, the current code
files, and some goals and issues that
have been identified. So, we've been
able to basically compress our previous history into one single message and free up a lot of space in our context
window. So, to wrap things up, context
management isn't just about fitting
information into a token window. It's
about engineering what your agent
remembers and how it reasons over time.
With the LangChain summarization middleware, you can automatically
compress long histories once your
context fills up, keeping your agent
sharp, efficient, and affordable. You
control when it triggers, how much
context to keep, and what to preserve,
all within just a few lines of
configuration.
If you want to try this out, clone the
example repository in the description
below and watch the summary bubble
appear in your own chat application.
That's it for this episode. See you in the
next one.
Long-running agents eventually hit context overload, leading to context poisoning, distraction, confusion, and degraded performance. In this video, Christian from LangChain breaks down how Summarization Middleware helps you automatically manage and compress conversation history to keep your agents sharp, efficient, and reliable.

You'll learn:
• Why long contexts silently fail over time
• Six strategies for fixing context overload (RAG, pruning, offloading, and more)
• How summarization fits into the ReAct agent loop
• How to configure triggers, keep conditions, and custom prompts
• A full live demo in Next.js, showing summaries appear as chat bubbles in real time

What Summarization Middleware gives you:
• Automatic summarization when token limits are approached
• Flexible triggers based on tokens, fractions, or message counts
• Control over how much recent context is preserved
• A separate, cheaper model for summarization to reduce cost

Perfect for:
• Coding agents
• Customer support assistants
• Multi-step workflows
• Any long-running conversational agent

📚 Docs: https://docs.langchain.com/oss/javascript/langchain/middleware/built-in#summarization
🧑💻 Example Code: https://github.com/christian-bromann/langchat