All right. Well, thank you all for
coming. We'll go ahead and kick off the
webinar now and I'm sure people will
continue to stream in. Um, I'm Lance, one of the founding engineers at LangChain, and I'm joined by Peak from Manus.
Um, Pete, do you want to introduce
yourself quickly?
Yeah. Hey guys, I'm the co-founder and chief scientist of Manus. So basically I designed the agent framework and a lot of things in Manus, and I'm super excited to be here today. Thanks, Lance, for having me.
Yeah, we're really excited to do this because, first, Manus is a really cool product. I've been using it for a long time, but they also put out a really nice blog post on context engineering a few months ago that influenced me a lot. So I want to give a quick overview of context engineering as I see it, um, and I'll reference their piece, and then Pete's actually going to give a presentation talking about some new ideas not covered in the piece. So if you've already read it, he'll cover some things that are new, which hopefully will be quite interesting for you. But I'll kind of set the stage, I'll hand it over to Pete, and then we'll do some Q&A.
So you might have heard this term context engineering, and it kind of emerged earlier this year. If you look through time with Google search trends, prompt engineering kind of took off following ChatGPT. So that's showing December 2022. And when we got this new thing, a chat model, there became a great deal of interest in how do we prompt these things? Prompt engineering kind of emerged as a discipline for working with chat models and prompting them.
Now, context engineering emerged this year around May. We saw it really rising in, um, Google Trends, and it corresponds a bit with this idea of the year of agents. And so why is that? One of the things people have observed, if you've been building agents, is that context grows, and it grows in a very particular way when you build an agent. What I mean is we have an LLM bound to some number of tools. That LLM can call tools autonomously in a loop. The challenge is that for every tool call you get a tool observation back, and that's appended to this message list. These messages grow over time, and so you can kind of get this unbounded explosion of messages as agents run.
As an example, Manus mentioned in their piece that typical tasks require around 50 tool calls. Anthropic mentioned similarly that production agents can engage in conversations spanning hundreds of turns. So the challenge is that because agents are increasingly long-running and autonomous and utilize tools freely, you can accumulate a large amount of context through this accumulation of tool calls. And Chroma put out a really nice report on context rot; the observation simply is that performance drops as context grows. So there's this paradox, this challenging situation: agents utilize lots of context because of tool calling, but we know that performance drops as context grows.
So this is a challenge that many of us have faced, and it kind of spearheaded, or I think seeded, this term of context engineering. Karpathy of course kind of coined it on Twitter earlier this year, and you can think about context engineering as the delicate art and science of filling the context window with just the right information needed for the next step. So we're trying to combat this context explosion that happens when you build agents and they call tools freely and all those tool messages accumulate in your message list. How do we curate the context such that the right information is presented to the agent to make the correct next decision at all points in time?
So to address this, there are a few common themes I want to highlight that we've seen across a number of different pieces of work, including Manus, which I'll mention here.
Idea one is context offloading. We've seen this trend over and over. The central idea is that you don't need all context to live in the message history of your agent. You can take information and offload it, send it somewhere else so it's outside the context window, but it can still be retrieved, which we'll talk about later. So one of the most popular ideas here is just using a file system.
Take the output of a tool message as an example: dump it to the file system and send back to your agent just the minimal piece of information necessary, so it can reference the full context if it needs to. That way the full payload, for example a web search result that's very token-heavy, isn't stuffed into your context window in perpetuity.
So you've seen this across a number of different projects. Manus uses this. We have a project called Deep Agents that utilizes the file system. In Open Deep Research, agent state actually plays a similar role to an external file system. Claude Code of course uses this very extensively, and long-running agents utilize it very extensively. So this idea of offloading context to a file system is very common and popular across many of the production agents that we're seeing today.
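As a rough sketch of the offloading pattern just described (this is an illustration, not the actual Deep Agents or Manus code; the helper name, scratch directory, and message format are assumptions):

```python
import json
import uuid
from pathlib import Path

OFFLOAD_DIR = Path("/tmp/agent_offload")  # hypothetical scratch location
OFFLOAD_DIR.mkdir(parents=True, exist_ok=True)

def offload_tool_result(tool_name: str, payload: str, preview_chars: int = 300) -> dict:
    """Write a token-heavy tool result to disk and return a small stub message.

    The stub (tool name, file path, short preview) is what goes back into the
    agent's message list; the full payload stays on the file system and can be
    re-read later with a file tool if the agent needs it.
    """
    path = OFFLOAD_DIR / f"{tool_name}_{uuid.uuid4().hex[:8]}.txt"
    path.write_text(payload, encoding="utf-8")
    return {
        "role": "tool",
        "name": tool_name,
        "content": json.dumps({
            "status": "ok",
            "offloaded_to": str(path),
            "preview": payload[:preview_chars],
            "note": "Full result saved to file; read it back only if needed.",
        }),
    }

# Example: a large web-search result never enters the message history in full.
big_result = "lots of scraped text " * 5000  # stand-in for a token-heavy payload
stub = offload_tool_result("web_search", big_result)
print(stub["content"][:200])
```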
The second idea is reducing context. Offloading is very simply taking some piece of information, like a tool message that's token-heavy, and not sending it all back to your message list, instead dumping it to a file system where it can be retrieved only as needed. That's offloading. Reducing context is similar, but instead you're just summarizing or compressing information.
Summarizing tool call outputs is one intuitive way to do this; we do this in Open Deep Research as an example. Another is pruning tool calls or tool messages. One thing that's very interesting is that Claude 4.5 has actually added this: if you look at some of their most recent releases, they now support this out of the box. So this idea of pruning old tool calls, tool outputs, or tool messages is something Anthropic has now built into their SDK. There's also summarizing or compacting the full message history; you see this with Claude Code and its compaction feature once you hit a certain percentage of your overall context window. Cognition also talks about the idea of summarizing or pruning at agent-to-agent handoffs. So reducing context is a very popular theme we see across a lot of different examples, from Claude Code to our Open Deep Research to Cognition, and Claude 4.5 has incorporated it as well.
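To make the pruning idea concrete, here is a minimal, framework-agnostic sketch (the message format, threshold, and placeholder text are assumptions, not any particular SDK's API):

```python
def prune_old_tool_messages(messages: list[dict], keep_last: int = 5,
                            max_chars: int = 200) -> list[dict]:
    """Reduce context by truncating the content of old tool messages.

    Tool results among the most recent `keep_last` messages are kept in full;
    older ones are clipped to a short placeholder so the history stays small.
    This is lossy, so pair it with offloading if the data may be needed again.
    """
    pruned = []
    cutoff = len(messages) - keep_last
    for i, msg in enumerate(messages):
        content = msg.get("content", "")
        if msg.get("role") == "tool" and i < cutoff and len(content) > max_chars:
            msg = {**msg, "content": content[:max_chars] + " ...[pruned]"}
        pruned.append(msg)
    return pruned

history = [
    {"role": "user", "content": "Research topic X"},
    {"role": "tool", "content": "very long web page text " * 500},
    {"role": "assistant", "content": "Found some sources."},
]
print(len(str(history)), "->", len(str(prune_old_tool_messages(history, keep_last=1))))
```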
Retrieving context. Now this is one of the classic debates today that you might see raging on X or Twitter: the right approach for retrieving context. Lee Robinson from Cursor just gave a very nice talk, and I'll make sure these slides are all shared so you can see these links. He had a very nice talk at an OpenAI demo day explaining that Cursor, for example, uses indexing and semantic search as well as simpler file-based search tools like glob and grep. Claude Code, in contrast, only uses the file system and simple search tools, notably glob and grep. So there are different ways to retrieve context on demand for your agent: indexing with something like semantic search, or the file system with simple file search tools. Both can be highly effective. There are pros and cons we could talk about in the Q&A, but of course context retrieval is central for building effective agents.
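For the file-system flavor of retrieval, here is a minimal sketch of glob- and grep-style tools an agent could call on demand (tool names and defaults are illustrative, not Cursor's or Claude Code's actual tools):

```python
import re
from pathlib import Path

def glob_files(root: str, pattern: str = "**/*.py") -> list[str]:
    """Return file paths under `root` matching a glob pattern."""
    return [str(p) for p in Path(root).glob(pattern) if p.is_file()]

def grep_files(root: str, regex: str, pattern: str = "**/*.py",
               max_hits: int = 20) -> list[str]:
    """Return 'path:line_no:line' strings for lines matching `regex`."""
    hits = []
    compiled = re.compile(regex)
    for path in glob_files(root, pattern):
        try:
            text = Path(path).read_text(errors="ignore")
        except OSError:
            continue
        for no, line in enumerate(text.splitlines(), 1):
            if compiled.search(line):
                hits.append(f"{path}:{no}:{line.strip()}")
                if len(hits) >= max_hits:
                    return hits
    return hits

# An agent can call these on demand instead of holding a whole codebase in context.
print(grep_files(".", r"def \w+_tool", "**/*.py"))
```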
Context isolation is the other major theme we've seen quite a bit of, in particular splitting context across multiple agents. So what's the point here? Each sub-agent has its own context window, and sub-agents allow for separation of concerns. Manus's wide research talks about this. Our Deep Agents work uses this. Open Deep Research uses it. Sub-agents are utilized in Claude's multi-agent researcher, and Claude Code also supports sub-agents. So sub-agents are a very common way to perform context isolation that we've seen across many different projects.
Now, one thing I thought was very interesting is caching context, and Manus talks about this quite a bit. I'll let Pete speak to this a bit later, but I think it's a very interesting trick as well.
So I'll just show a brief example from Open Deep Research. This is a very popular repo that we have. Um, it's basically an open-source deep research implementation, and it performs on par with some of the best implementations out there. You can check our repo. Um, and we have results from Deep Research Bench showing that we're top 10. It has three phases: scoping of the research, the research phase itself using a multi-agent architecture, and then a final one-shot writing phase.
We use offloading. So we basically create a brief to scope our research plan, and we offload that. We don't just save it in the context window, because that context window is going to get peppered with other things. We offload it, so it's saved independently. It can be accessed, in our case, from the LangGraph state, but it could also be from the file system; it's the same idea. So you create a research plan, you offload it, and it's always accessible. You go do a bunch of work, and you can pull it back in on demand, putting it kind of at the end of your message list so it's accessible and readily available to your agent to perform, for example, the writing phase. We use offloading, as you can see, to help steer the research and writing phases. We use reduction to summarize observations from token-heavy search tool calls; that's done inside research itself. And we use context isolation across sub-agents within research itself.
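A minimal sketch of the brief-offloading idea (plain Python state standing in for the real thing; this is not the actual Open Deep Research or LangGraph code, and the names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass
class ResearchState:
    """Agent state that lives outside the message list."""
    brief: str = ""                       # offloaded research plan
    notes: list[str] = field(default_factory=list)
    messages: list[dict] = field(default_factory=list)

def scope(state: ResearchState, user_request: str) -> None:
    # The brief is stored in state, not appended to the (soon noisy) messages.
    state.brief = f"Research plan for: {user_request}"

def start_writing_phase(state: ResearchState) -> list[dict]:
    # Pull the brief back in at the *end* of the message list, so it is fresh
    # and readily available when the model writes the final report.
    return state.messages + [
        {"role": "user", "content": f"Write the report. Follow this brief:\n{state.brief}"}
    ]

state = ResearchState()
scope(state, "impact of context length on agent performance")
state.messages.append({"role": "assistant", "content": "...many research steps..."})
print(start_writing_phase(state)[-1]["content"][:80])
```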
And this is kind of a summary of these various ideas across a bunch of different projects. And actually, Peak is going to speak to Manus in particular and some of the lessons they've learned. This just kind of sets up the stage. Um, and this just kind of summarizes what I talked about: these different themes of offloading, reducing context, retrieving context, isolating, caching, and a number of popular projects and kind of where they're used. Um, and a few different links. I will share these slides in the notes. And I do want to let Peak go ahead and present now, because I want to make sure we have plenty of time for him and for questions. But this just sets the stage. And Pete, I'll let you take it from here. I'll stop sharing.
Okay. Can you see my slides?
Yeah. Okay. Perfect. Okay. Thank you, Lance. I'm super excited to be here today to share some fresh lessons on context engineering that we learned from building Manus. Um, you know, I say fresh lessons because I realized that the last blog post I wrote about context engineering, the one you mentioned, was back in July, and yeah, it's the year of the agent, so July is basically ancient history. Of course, before this session I went back and read it again, and luckily I think most of what I wrote in that blog still holds up today. But I just don't want to waste everybody's time by repeating what's already inside that blog. So today I instead want to dig into some areas that I either didn't go deep enough on before or didn't touch at all. So actually we'll be focusing on the areas in Lance's earlier slides where there's less consensus, because, uh, personally I think exploring those non-consensus ideas often leads to the biggest inspirations.
Yeah. So here are the topics for today's talk. First we'll cover a bit about the bigger question of why we need context engineering, and then we'll have more on context reduction, more on context isolation, and finally some new stuff about context offloading which we're testing internally here at Manus. Yeah, so everything I'm sharing today is in production in Manus. It's battle-tested, but I don't know how long it will last, because, you know, things are changing super fast.
Okay, let's start with the first big question: why do we even need context engineering, especially, you know, when fine-tuning or post-training models has become much more accessible today? For example, the folks on the Thinking Machines team just released the Tinker API, whose design I like a lot. But for me, the answer to why context engineering actually came through several painful stages of realization.
Um, before starting Manus I'd already spent over 10 years in natural language processing, or NLP, which is basically what we called building language models before ChatGPT. Manus is actually my second or third company, and at my previous startup we trained our own language models from scratch to do open-domain information extraction and built knowledge graphs and semantic search engines on top of them. And it was painful. Our product's innovation speed was completely capped by the model's iteration speed. You know, even back then the models were much smaller compared to today, but still, a single training plus evaluation cycle could take maybe one or two weeks. And the worst part is that at that time we hadn't reached PMF yet, and we were spending all that time improving benchmarks that might not even matter for the product. So I think, instead of building specialized models too early, startups really should lean on general models and context engineering for as long as possible. Well, of course, I guess now that's some kind of common wisdom. But as your product matures and open-source base models get stronger, I know it's very tempting to think, hey, maybe I should just pick a strong base model, fine-tune it with my data, and make it really good at my use case. You know, we've tried that too. And guess what? It's another trap. You know, to make RL work really well, you usually fix an action space, design a reward around your current product behavior, and generate tons of on-policy rollouts and feedback. But, you know, this is also dangerous, because we're still in the early days of AI and agents. Everything can shift under your feet overnight. For us, the classic example was the launch of MCP. It completely changed the design of Manus, from a compact static action space to something that's infinitely extensible. And if you have ever trained your own model, you know that this kind of open-domain problem is super hard to optimize. Well, of course, you could pour massive effort into post-training that ensures generalization, but then aren't you basically trying to become an LLM company yourself? Because you're basically rebuilding the same layer that they have already built, and that's a duplication of effort. So maybe, after all that buildup, here's my point: be firm about where you draw the line. Right now, context engineering is the clearest and most practical boundary between application and model. So trust your choice.
All right, enough philosophy, and let's talk about some real tech. The first topic: context reduction. Here I want to clarify two different kinds of compaction operations, because we think context reduction is fascinating but it's also a new concept. There are a lot of ways to do this, and here at Manus we divide them into compaction and summarization. For compaction: in Manus, every tool call and tool result actually has two different formats, a full format and a compact one. The compact version strips out any information that can be reconstructed from the file system or external state. For example, let's say you have a tool that writes to a file, and it probably has two fields, a path and a content field. But once the tool returns, you can be sure that the file already exists in the environment. So in the compact format we can safely drop the super long content field and just keep the path. And if your agent is smart enough, whenever it needs to read that file again it can simply retrieve it via the path. So no information is truly lost; it's just externalized. We think this kind of reversibility is crucial, because agents do chained predictions based on previous actions and observations, and you never know which past action will suddenly become super important 10 steps later. You cannot predict it. So this is reversible reduction via compaction.
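A minimal sketch of the full-versus-compact idea Pete describes (field names here are hypothetical; the real Manus formats are not public):

```python
def compact_tool_call(call: dict) -> dict:
    """Produce the compact form of a file-write tool call.

    The full form carries both `path` and the long `content`; once the tool has
    run, the content can be reconstructed from the file system, so the compact
    form keeps only the path. Nothing is lost, just externalized.
    """
    if call.get("tool") == "write_file":
        args = {k: v for k, v in call["args"].items() if k != "content"}
        return {**call, "args": args, "compacted": True}
    return call

full = {
    "tool": "write_file",
    "args": {"path": "/workspace/report.md", "content": "## Findings\n" + "x" * 5000},
}
print(compact_tool_call(full))
# -> {'tool': 'write_file', 'args': {'path': '/workspace/report.md'}, 'compacted': True}
```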
Of course, compaction only takes you so far. Eventually your context will still grow and hit the ceiling, and that's when we combine compaction with the more traditional summarization. But we do it very carefully. For example, before summarizing we might offload key parts of the context into files. And sometimes we go even more aggressive: we dump the entire pre-summary context as a text file, or simply a log file, into the file system, so that we can always recover it later. And like Lance just mentioned, some people just use glob and grep; you know, grep also works for log files. So if the model is smart enough, it even knows how to retrieve that pre-summarization context. Yeah. So I think the difference here is that compaction is reversible but summarization isn't. Both reduce context length, but they behave very differently, and to make both methods coexist we have to track some context length thresholds. At the top you'll have your model's hard context limit, say 1 million tokens, pretty common today. But you know, in reality most models start degrading much earlier, typically maybe around 200K, and you'll begin to see what we call context rot: repetition, slower inference, degraded quality. So by doing a lot of evaluation, it's very important for you to identify that pre-rot threshold, typically 128K to 200K, and use it as the trigger for context reduction.
And whenever your context size approaches it, you have to trigger context reduction, but starting from compaction, not summarization. And compaction doesn't mean compressing the entire history; you know, we might compact the oldest 50% of tool calls while keeping the newer ones in full detail, so the model still has fresh few-shot examples of how to use tools properly. Otherwise, in the worst case, the model will imitate the behavior and output the compact format with missing fields, and that's totally wrong. And after compaction, we have to check how much free context we actually gained from the compaction operation. Sometimes, like in this graph, after multiple rounds of compaction the gain is tiny, because even the compact format still uses context, and that's when we go for summarization. But also keep in mind that when summarizing, we always use the full version of the data, not the compact one. And we still keep the last few tool calls and tool results in full detail, not summarized, because that allows the model to know where it left off and continue more smoothly. Otherwise you'll see that after summarization the model sometimes changes its style or its tone, and we found that keeping a few tool call / tool result examples really helps.
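Putting that reduction policy together, a rough sketch of the trigger logic (the thresholds, the token counter, and the summarize step are placeholders; in practice the summary would come from a model call):

```python
PRE_ROT_THRESHOLD = 128_000   # trigger well below the model's hard context limit

def count_tokens(messages: list[dict]) -> int:
    # Placeholder heuristic: roughly 4 characters per token.
    return sum(len(m.get("content", "")) for m in messages) // 4

def compact(messages: list[dict]) -> list[dict]:
    """Compact the oldest 50% of tool messages; keep newer ones in full so the
    model still has fresh few-shot examples of proper tool use."""
    cutoff = len(messages) // 2
    return [
        {**m, "content": m["content"][:200] + " ...[see file system]"}
        if i < cutoff and m.get("role") == "tool" else m
        for i, m in enumerate(messages)
    ]

def summarize(messages: list[dict], keep_tail: int = 4) -> list[dict]:
    """Irreversible fallback: summarize the FULL-format history, but keep the
    last few tool calls/results verbatim so the model knows where it left off."""
    tail = messages[-keep_tail:]
    summary = {"role": "user", "content": "[Summary of earlier steps: ...]"}  # model call in practice
    return [summary] + tail

def reduce_context(messages: list[dict]) -> list[dict]:
    if count_tokens(messages) < PRE_ROT_THRESHOLD:
        return messages
    compacted = compact(messages)
    # If compaction freed too little context, fall back to summarization.
    if count_tokens(compacted) > 0.9 * count_tokens(messages):
        return summarize(messages)
    return compacted
```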
Okay, now we've covered reduction, so let's talk about isolation. I really agree with Cognition's blog where they warn against using multi-agent setups, because when you have multiple agents, syncing information between them becomes a nightmare. But you know, this isn't a new problem. Multi-process and multi-thread coordination has been a classic challenge since the early days of computer programming, and I think we can borrow some wisdom here. I don't know how many Golang coders are here today, but in the Go programming language community there's a famous quote from the gopher: do not communicate by sharing memory; instead, share memory by communicating. Of course, this isn't directly about agents, and it's sometimes even wrong for agents, but I think the important thing is it highlights two distinct patterns: by communicating, or by sharing memory. If we translate the term memory here into context, we can see the parallel pretty clearly. By communicating is the easier one to understand, because it's the classic sub-agent setup. For example, the main agent writes a prompt, the prompt is sent to a sub-agent, and the sub-agent's entire context consists only of that instruction. We think if a task has a short, clear instruction and only the final output matters, say searching a codebase for a specific snippet, then just use the communication pattern and keep it simple, because the main agent doesn't care how the sub-agent finds the code; it only needs the result. And this is what Claude Code typically does, using its task tool to delegate a separate, clear task to a sub-agent. But for more complex scenarios, in contrast, by sharing memory means the sub-agent can see the entire previous context, meaning all the tool usage history, but the sub-agent has its own system prompt and its own action space. For example, imagine a deep research scenario: the final report depends on a lot of intermediate searches and notes, and in that case you should consider using the shared memory pattern, or in our language, sharing context. Because even though you could save all those notes and searches into files and make the sub-agent read everything again, you're just wasting latency and context, and if you count the tokens, you may be using even more tokens to do it that way. So for scenarios that require the full history, just use the shared memory pattern. But be aware that sharing context is kind of expensive, because each sub-agent has a larger input to prefill, which means you'll spend more on input tokens, and since the system prompt and the action space differ, you cannot reuse the KV cache, so you have to pay the full price.
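A minimal sketch contrasting the two patterns (call_model is a stand-in for whatever LLM client you use; none of this is Manus's actual API):

```python
def call_model(system: str, messages: list[dict]) -> str:
    """Stand-in for an LLM call."""
    return f"<response based on {len(messages)} messages>"

def run_subagent_by_communicating(instruction: str) -> str:
    """'By communicating': the sub-agent's whole context is one clear instruction.
    Cheap to prefill; the main agent only cares about the final answer."""
    return call_model(
        system="You are a focused worker. Complete the task and return only the result.",
        messages=[{"role": "user", "content": instruction}],
    )

def run_subagent_by_sharing_context(full_history: list[dict], system: str) -> str:
    """'By sharing memory/context': the sub-agent sees the entire prior history
    but gets its own system prompt and action space. More expensive (bigger
    prefill, no KV-cache reuse), but useful when intermediate results matter."""
    return call_model(system=system, messages=full_history)

history = [{"role": "user", "content": "Deep research on topic X"},
           {"role": "tool", "content": "search results ..."}]
print(run_subagent_by_communicating("Find the function that parses config files."))
print(run_subagent_by_sharing_context(history, system="You write the final report."))
```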
And finally, let's talk a little bit about context offloading. Um, when people say offload, they usually mean moving parts of the working context into external files. But as your system grows, especially if you decide to integrate MCP, one day you realize that the tools themselves can also take up a lot of context, and having too many tools in context leads to confusion. We call it context confusion, and the model might call the wrong ones or even non-existent ones. So we have to find a way to also offload the tools. A common approach right now is doing dynamic RAG on tool descriptions, for example loading tools on demand based on the current task or current state. But that also causes two issues. First of all, since tool definitions sit at the front of the context, your KV cache resets every time. And most importantly, the model's past calls to removed tools are still in the context, so it might fool the model into calling invalid tools or using invalid parameters. So to address this, we're experimenting with a new layered action space in Manus. Essentially, we let Manus choose from three different levels of abstraction: number one, function calling; number two, sandbox utilities; and number three, packages and APIs. Let's go deeper into these three layers of the layered action space. Start from level one, function calling. This is a classic; everyone knows it. It is schema-safe thanks to constrained decoding, but we all know the downsides. For example, we mentioned breaking the cache, and too many tools may cause confusion. So in Manus right now we only use a fixed number of atomic functions, for example reading and writing files, executing shell commands, searching files and the internet, and maybe some browser operations. We think these atomic functions have super clear boundaries, and they can work together to compose much more complex workflows. Then we offload everything else to the next layer, which is the sandbox utilities.
As you know, each Manus session runs inside a full virtual machine sandbox. It's running on our own customized Linux system, and that means Manus can use shell commands to run pre-installed utilities that we developed for Manus. For example, we have some format converters, we have speech recognition utilities, and even a very special one we call the MCP CLI, which is how we call MCP. We do not inject MCP tools into the function-calling space; instead, we do everything inside the sandbox through a command-line interface. And utilities are great because you can add new capabilities without touching the model's function-calling space. You know, they're just commands pre-installed on your computer, and if you're familiar with Linux you always know how to find those new commands, and you can even run --help to figure out how to use a new tool. Another good thing is that for larger outputs they can just write to files or return the result in pages, and you can use all those Linux tools like grep, cat, less, and more to process the results on the fly. So the trade-off here: it's super good for large outputs, but it's not that good for low-latency back-and-forth interactions with the front end, because you always have to visualize the interactions of your agent and show them to the user. So this is pretty tricky, but we think it already offloads a lot of things.
And then we have another layer, the final layer, which we call packages and APIs. You know, here Manus can write Python scripts to call pre-authorized APIs or custom packages. For example, Manus might use a 3D design library for modeling, or call a financial API to fetch market data. And here, actually, we've purchased all these APIs on behalf of the user and paid for them; it's included in the subscription. So we basically have a lot of API keys pre-installed in Manus, and Manus can access these APIs using the keys. I think these are perfect for tasks that require lots of computation in memory but do not need to push all that data into the model context. For example, imagine you're analyzing a stock's entire year of price data. You don't feed the model all the numbers. Instead, you should let the script compute it and only put the summary back into the context. And you know, since code and APIs are super composable, you can actually chain a lot of things in one step. For example, with a typical API you can get city names, get city IDs, and get the weather, all in one Python script.
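A rough sketch of that level-three idea: the script crunches a year of prices inside the sandbox's runtime, and only a small summary ever reaches the model's context (the data and file names below are made up so the sketch runs end to end):

```python
import csv
import statistics
from pathlib import Path

def summarize_prices(csv_path: str) -> dict:
    """Compute summary stats over a year of daily closes; return only the summary."""
    closes = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            closes.append(float(row["close"]))
    return {
        "days": len(closes),
        "min": min(closes),
        "max": max(closes),
        "mean": round(statistics.mean(closes), 2),
        "stdev": round(statistics.stdev(closes), 2),
    }

# Fabricate a year of daily data just for the example.
path = Path("/tmp/prices.csv")
with path.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["day", "close"])
    writer.writeheader()
    for day in range(252):
        writer.writerow({"day": day, "close": 100 + (day % 7) - 3})

# Only this tiny dict goes back into the agent's context, not 252 rows of numbers.
print(summarize_prices(str(path)))
```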
There's also a paper from one of my friends called CodeAct. A lot of people have been discussing it, and I think it's the same idea, because code is composable and it can do a lot of things in one step. But it's also not schema-safe; it's very hard to do constrained decoding on code. So we think you should find the right scenario for each of these. For us, as we mentioned, everything that can be handled inside a compiler or interpreter runtime, we do using code; otherwise we use sandbox utilities or function calls.
And the good thing is, if you have these three layers, from the model's point of view all three levels still go through standard function calls. So the interface stays simple, cache-friendly, and orthogonal across functions, because, as we mentioned, for sandbox utilities you're still accessing these tools through the shell function, and for APIs and third-party packages you're just using the file function to write a script and then executing it with the shell function. So we think it does not add overhead to the model; it's still all things that models are trained on and already familiar with.
So let's zoom out and connect the five dimensions: offload, reduce, retrieve, isolate, and cache. You'll find that they are not independent. We can see that offloading and retrieval enable more efficient reduction, and stable retrieval makes isolation safe. Isolation also slows down context growth and reduces the frequency of reduction. However, more isolation and reduction also affect cache efficiency and the quality of output. So at the end of the day, I think context engineering is a science and an art that requires a perfect balance between multiple, potentially conflicting objectives. It's really hard.
Um, all right. Before we wrap up, I want to leave you with maybe one final thought, and it's kind of the opposite of everything I just said, which is: please avoid context over-engineering. Looking back at the past six or seven months since Manus launched, the biggest leaps we've ever seen didn't come from adding more fancy context management layers or clever retrieval hacks. They all came from simplifying, from removing unnecessary tricks and trusting the model a little more. Every time we simplified the architecture, the system got faster, more stable, and smarter, because we think the goal of context engineering is to make the model's job simpler, not harder. So if you take one thing from today, I think it should be: build less and understand more. Well, thank you so much everyone, and thanks again to Lance and the LangChain team for having me. Can't wait to see what you guys all build next. Now back to Lance.
Yeah, amazing. Thank you for that. Um,
so we have a nice set of questions here.
Maybe we can just start hitting them and
we can kind of reference back to the
slides if needed. And Peak, uh, are your
slides available to everyone? Um,
oh yeah. Yeah, I can share the PDF
version afterwards.
Yes, sounds good.
Um,
yeah. Well, why don't I start looking
through some of the questions and maybe
we can start with the more recent ones
first. Um,
So how does the LLM, uh, call the various shell tools? How does it know which tools exist and how to invoke them? Maybe you can explain a little bit about the multi-tier sandboxing setup that you use with Manus.
Yeah. Uh, I think, imagine you're a person using a new computer. For example, if you know Linux, you can imagine all the tools are located in /usr/bin. So we actually do two things. First of all, we have a hint in the system prompt telling Manus that, hey, there are a lot of pre-installed command-line utilities located in some specific folder. And also, for the most frequently used ones, we already list them in the system prompt, but it's super compact. We do not tell the agent how to use the tools; we only list them, and we tell the agent that it can use the --help flag safely, because all the utilities are developed by our team and they have the same format.
Got it. How about, um, I know you talked a lot about using the file system. What's your take on using indexing? Um, and do you spin up vector stores on the fly if the context you're working with gets sufficiently large? How do you approach that?
Yeah, I think, uh, there's no right and wrong in this space, like you mentioned. But at Manus we do not use indexed databases, because right now, you know, every sandbox in a Manus session is a new one, and users want to interact with things fast. So actually we don't have the time to build an index on the fly. So we're more like Claude Code: we rely on grep and glob. But I think if you're considering building something more like long-term memory, or if you want to integrate some enterprise knowledge base, you still have to rely on an external vector index, because it's really about the amount of information you need to access. Manus operates in a sandbox, and a coding agent operates in a codebase, so it depends on the scale.
Yeah. So that's a good follow-up then. So let's say I'm a user. I have my Manus account. I interact with Manus across many sessions. Do you have the notion of memory? So Claude has CLAUDE.md files; they persist across all the different sessions of Claude Code. How about you guys? How do you handle long-term memory?
Yeah. Uh, actually in Manus we have a concept called knowledge, which is kind of like explicit memory. For example, you can tell Manus, hey, remember, every time I ask for something, deliver it in Excel. It's not automatically inserted into some memory; it will pop up a dialogue and say, here's what I learned from our previous conversation, would you like to accept it or reject it? So this is the explicit one; it requires user confirmation. But we are also exploring ways to do it more automatically. For example, a pretty interesting thing about agents is that, compared to chatbots, users correct the agent much more often. For example, a common mistake Manus makes is when doing data visualization: if you're using Chinese, Japanese, or Korean, a lot of the time there will be some font issues and there will be errors in the rendered visualizations. So the user will often say, hey, you should use a proper CJK font. And for these kinds of things, different users will give the same correction, and we need to find a way to leverage that kind of collective feedback and use it. That's what we call a self-improving agent with online learning, but in a parameter-free way. Yeah.
How about a different question that was raised here, and one I also think about quite a bit. You mentioned towards the end of your talk that you gained a lot from removing things, and a lot of that is probably because the models are getting better. So model capabilities are increasing, and you can kind of remove scaffolding over time. How do you think about this? Because this is one of the biggest challenges that I've faced: over time the model gets better and I can remove things, like certain parts of my scaffolding. So you're building on top of a foundation where the water's rising. Do you revisit your architecture every some number of months with new releases and just delete as the models get better? How do you approach that problem?
Yeah, this is a super good question, because, you know, actually we have already refactored Manus five times. We launched Manus in March, and now it's October: already five times. So we think you cannot stop, because models are not only improving, their behaviors are also changing over time. One way is to work closely with the model providers, but we also have another internal theory for how we evaluate or design our agent architecture. I covered it a little on Twitter before. It's basically that we do not care about the performance on a static benchmark. Instead, we fix the agent architecture and switch between models. If your architecture can gain a lot from switching from a weaker model to a stronger model, then somehow your architecture is more future-proof, because the weaker model of tomorrow might be as good as the stronger model of today. Yeah. So we think switching between weaker and stronger models can give you some early signals of what will happen next year and give you some time to prepare your architecture. Yeah. So for Manus, um, we often do this kind of review every one or two months, and we often do some research internally using open-source models, and maybe early access to proprietary models, to prepare the next release even before the launch of the next model. Yeah.
Yeah. It's a good observation. You can
actually do testing of your architecture
by toggling different models that exist
today. Yeah. Yeah, that makes a lot of
sense.
What about, um, best practices or considerations for the format for storing data? So, like markdown files, plain text, logs. Uh, anything you prefer in particular? How do you think about file formats?
Yeah. I think it's not about plain text versus markdown, but we always prioritize line-based formats, because that allows the models to use grep or read from a range of lines. And also, markdown can sometimes cause trouble. You know, models are trained to use markdown really well, and sometimes, for some models, I don't want to say the name, they often output too many bullet points if you use markdown too often. Yeah. So actually we want to use more plain text.
Yeah, makes sense. How about on the topic of compaction versus summarization? Let's hit on summarization. This is an interesting one that I've been asked about a lot. Uh, how do you prompt to produce good summaries? So, for example, summarization, like you said, is irreversible. So if you don't prompt it properly, you can actually lose information. The best answer I came up with is just tuning your prompt for high recall. But how do you approach this? So for summarization, how do you think about prompting?
Yeah, actually we tried a lot of prompt optimization for summarization. But it turns out a simple approach works really well: you do not use a free-form prompt and let the AI generate everything. Instead, you define a kind of schema. It's just a form; there are a lot of fields, and you let the AI fill them in. For example: here are the files that I've modified, here's the goal of the user, here's where I left off. And if you use this kind of more structured schema, at least the output is kind of stable and you can iterate on it. So just do not use free-form summarization.
Got it. Yeah, that's a great observation. So you use structured outputs rather than free-form summarization to enforce that certain things are always summarized. Yeah, that makes a lot of sense.
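A minimal sketch of that schema-driven summarization idea (the field names and prompt are illustrative placeholders, not Manus's actual schema):

```python
import json
from typing import TypedDict

class ContextSummary(TypedDict):
    user_goal: str
    modified_files: list[str]
    key_findings: list[str]
    where_i_left_off: str
    next_steps: list[str]

SUMMARY_PROMPT = """Summarize the conversation so far by filling in this JSON form.
Respond with JSON only, using exactly these keys:
{schema}
"""

def build_summary_request(history_text: str) -> list[dict]:
    # The model fills a form instead of writing a free-form summary, so required
    # fields (files touched, where it left off) are never silently dropped.
    schema = json.dumps({k: "..." for k in ContextSummary.__annotations__}, indent=2)
    return [
        {"role": "system", "content": SUMMARY_PROMPT.format(schema=schema)},
        {"role": "user", "content": history_text},
    ]

print(build_summary_request("user asked for X; agent edited a.py and b.py ...")[0]["content"])
```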
How about with compaction then? And actually, I want to make sure I understood that. So with compaction, let's say it's a search tool. You have the raw search tool output, and that would be your raw message, and then the compaction would just be, like, a file name or something. Is that right?
Yeah, it is. And it's not only about the tool call; it also applies to the result of the tool. You know, interestingly, we found that almost every action in Manus is kind of reversible if you can offload it to the file system or an external state, and for most of these tasks you already have a unique identifier for it. For example, for file operations you have the file path, for browser operations you have the URL, and even for search actions you have the query. So it's naturally already there.
Yeah. Okay. That's a great one, and I just want to hit that again, because I've had this problem a lot. So, for example, I'm an agent that uses search. I perform a search, and it returns a token-heavy tool call. I don't want to return that whole tool message to the agent. I've done things like some kind of summarization or compaction and send the summary back. But how do you approach that? Because you might want all that information to be accessible to the agent for its next decision, but you don't want that huge context block to live inside your message history. So how do you approach that? You could send the whole message back but then remove it later; that's what Claude does now. You could do a summarization first and send the summary over. Or you could send everything and then do compaction so that later on you don't have the whole context in your message history; you only have, like, a link to the file. How do you think about that specifically, if you see what I'm saying?
Yeah. Actually, it depends on the scenario. For complex search, and by complex search I mean it's not just one query, you have multiple queries and you want to gather some important things and drop everything else. In that case, I think you should use sub-agents, or, internally, we call it agent-as-tool. So from the model's perspective it's still a function, maybe called advanced search, but what it triggers is actually another sub-agent. That sub-agent is more like a workflow, or an agentic workflow, that has a fixed output schema, and that is the result returned to the agent. But for other, simpler kinds of search, for example just searching Google, we just use the full detail format, append it into the context, and rely on the compaction mechanism. But also, we always instruct the model to write down the intermediate insights or key findings into files, in case the compaction happens earlier than the model expected. And if you do this really well, you actually don't lose a lot of information through compaction, because sometimes those old tool calls are irrelevant after a while.
Yeah, that makes sense. Um, and I like the idea of agent-as-tool. We do that quite a bit, and it is highly effective. But that brings up another interesting point, and you referenced this a little bit: agent-to-agent communication. How do you address that? So Walden Yan from Cognition had a very nice blog post talking about this being a major problem they have with Devin, uh, so, like, communication between agents. How do you think about that problem? Ensuring sufficient information is transferred, but not overloading, like you said, the prefill of the sub-agent with too much context. So how do you think about that?
Yeah. Uh, you know, at Manus we launched a feature called wide research a month ago. Internally we call it agentic MapReduce, because we got inspired by the design of MapReduce, and it's kind of special for Manus because, you know, there's a full virtual machine behind the session. So one way we pass information or context from the main agent to a sub-agent is by sharing the same sandbox: the file system is there, and you only need to pass the relevant paths. And I think sending information to sub-agents is not that hard. The more complex thing is how to get the correct output from the different agents. What we do here is a trick: every time the main agent wants to spawn a new sub-agent, or maybe 10 sub-agents, you have to let the main agent define the output schema. And from the sub-agent's perspective, it has a special tool called submit result, and we use constrained decoding to ensure that what the sub-agent submits back to the main agent matches the schema defined by the main agent. Yeah. So you can imagine this kind of MapReduce operation: it will generate a kind of spreadsheet, and the spreadsheet is constrained by the schema.
That's an interesting theme that seems to come up a lot in how you design Manus. You use schemas and structured outputs both for summarization and for this agent-to-agent communication. So it's kind of like using schemas as contracts, um, yeah, between agent and sub-agent, or between a tool and your agent, to ensure that sufficient information is passed in a structured, complete way. And when you're doing summarization, you use a schema as well.
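A minimal sketch of that schema-as-contract idea for agent-as-tool (the validation here is plain Python; in Manus it is enforced with constrained decoding, and all names below are illustrative):

```python
import json

def make_submit_result(schema: dict):
    """Return a submit_result 'tool' that only accepts output matching the
    schema the main agent defined when it spawned the sub-agent."""
    def submit_result(payload: dict) -> dict:
        missing = [k for k in schema if k not in payload]
        wrong_type = [k for k, t in schema.items()
                      if k in payload and not isinstance(payload[k], t)]
        if missing or wrong_type:
            raise ValueError(f"missing={missing}, wrong_type={wrong_type}")
        return payload
    return submit_result

# Main agent defines the contract for a wide-research style fan-out.
row_schema = {"company": str, "revenue_usd_b": float, "source_url": str}
submit_result = make_submit_result(row_schema)

# Each sub-agent must return exactly this shape; the rows aggregate into a table.
rows = [
    submit_result({"company": "Acme", "revenue_usd_b": 1.2,
                   "source_url": "https://example.com/a"}),
    submit_result({"company": "Globex", "revenue_usd_b": 3.4,
                   "source_url": "https://example.com/b"}),
]
print(json.dumps(rows, indent=2))
```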
Okay, fantastic. This is very, very helpful. Um, I'm poking around some other interesting questions here. Uh, any thoughts on models? Like, uh, I think you guys use Anthropic, but do you work with open models? Um, do you do fine-tuning? You talked a lot about working with the KV cache, so for that maybe using open models. How do you think about model choice?
Yeah, actually right now we don't use any open-source models, and interestingly, it's not about quality, it's about cost. You know, we often think that open-source models can lower the cost, but if you're at the scale of Manus, and if you're building a real agent where the input is way longer than the output, then KV cache is super important, and distributed KV cache is very hard to implement if you use open-source solutions. If you use the frontier LLM providers, they have more solid infrastructure for distributed caching globally. So sometimes, if you do the math, at least for Manus we find that using these flagship models can sometimes be even cheaper than using open-source models. And right now we're not only using Anthropic. Anthropic's models are the best choice for agentic tasks, but we're also watching the progress of Gemini and OpenAI's new models. I think right now these frontier labs are not converging in their directions. For example, if you're doing coding, of course you should use Claude; if you want to do more multimodal things, you should use Gemini; and OpenAI's models are super good at complex math and reasoning. So I think for application companies like us, one of our advantages is that we don't have to build on top of only one model. You can do task-level routing, or maybe even subtask- or step-level routing, if you can account for that kind of KV cache invalidation. So I think it's an advantage for us, and we do a lot of evaluations internally to know which models to use for which subtasks.
Yeah. Yeah, that makes a lot of sense. I want to clarify one little thing. So with KV cache, what specific features from the providers are you using for cache management? So, okay, I know Anthropic has prompt caching as an example. Yeah, that's what you mean. Okay, got it. Yeah,
cool. Okay, perfect. Um, cool. I'm just looking through some of the other questions. Uh, yeah, tool selection is a good one. Um, right. So, you were talking about this. You don't use, like, indexing of tool descriptions and fetching tools on the fly based on semantic similarity. How do you handle that? Like, what's the threshold for too many tools? Yeah, tool choice is a classic. How do you think about that?
Yeah. Uh, first of all, it depends on the model. Different models have different capacities for tools. But I think a rule of thumb is to try not to include more than, like, 30 tools. That's just a random number in my mind. But actually, I think if you're building what we call a general AI agent like Manus, you want to make sure those native functions are super atomic. So actually there are not that many atomic functions that we need to put inside the action space. For Manus, right now we only have, like, 10 or 20 atomic functions, and everything else is in the sandbox. Yeah. So we don't have to pull things in dynamically.
Yeah, good point actually. Let's explain that a little bit more. So you have, let's say, 10 tools that can be called directly by the agent. But then, I guess, like you said, the agent can also choose to, for example, write a script and then execute that script. So that expands its action space hugely without giving it an independent tool for each possible script; of course that's insane. So a very general tool to write a script and then run it does a lot. Is that what you mean?
Yeah. Yeah. Exactly. Because, you know, why are we super confident calling Manus a general agent? Because it runs on a computer, and computers are Turing complete. The computer is the best invention of humans. Theoretically, an agent can do anything that maybe a junior intern can do using a computer. So with the shell tool and the text editor, we think it's already complete. So you can offload a lot of things to the sandbox.
Yeah. Okay, that makes a lot of sense, right? Um, and then, okay, maybe I'll back up. You mentioned code, with code agents. My understanding is the model will actually always produce a script, and that will then be run inside a code sandbox, so every tool call is effectively a script that is generated and run. It sounds like you do some hybrid, where sometimes Manus can just call tools directly, but other times it can actually choose to do something in the sandbox. Is that right? So it's kind of a hybrid approach. Okay.
Yeah. I think this is super important, because actually we tried to use CodeAct entirely for Manus, but the problem is, if you're using code, you cannot leverage constrained decoding and things can go wrong. Yeah, but you know, CodeAct has some special use cases, as I mentioned earlier in the slides, for example processing a large amount of data. You don't have to put everything in the tool result; you put it inside, maybe, the runtime memory of Python, and you only get the result back to the model. So we think you should do it in a hybrid way.
Got it. So allow for tool calling, with some number of tools, maybe 10 or so, that are called directly, and some number of tools that actually run in the sandbox itself. Perfect. That makes a ton of sense. Very interesting.
Um
and then, maybe, how do you keep a reference of all the previously gen... I guess you basically will generate a bunch of files. Oh, actually, sorry, maybe I'll talk about something else. How about planning? Tell me about planning. I know Manus has this to-do tool, or it generates a to-do list at the start of tasks. Yeah, tell me about that.
Yeah, I think this is very interesting, because at the beginning Manus used that todo.md paradigm. I don't want to use the word stupid, but it actually wastes a lot of turns. You know, back in maybe March or April, if you checked the logs of some Manus tasks, maybe one-third of the actions were about updating the to-do list. It wastes a lot of tokens. Yeah. So right now we use more structured planning. For example, if you use Manus, there's a planner at the bottom of the system; internally it's also a kind of tool that we implemented using the agent-as-tool paradigm, so there's a separate agent that is managing the plan. So actually, in the latest version of Manus we are no longer using that todo.md thing. Of course, todo.md still works and can generate good results, but if you want to save tokens, you can find another way.
Got it. Yeah. So you have, like, a planner agent, and for a subtask it'll be more like agent-as-tool type things.
Yeah. And you know, it's very important to have a separate agent with a different perspective, so it can do some external reviews, and you can use different models for planning. For example, sometimes Grok can generate some very interesting insights.
Yeah. Well, that's a great one actually. So thinking about multi-agent then, how do you think about that? So you might have, like, a planning agent with its own context window that makes a plan and produces some kind of plan object, maybe it's a file, or maybe it just calls sub-agents directly. How do you think about that? And how many different sub-agents do you typically recommend using?
Yeah, I think this also depends on your design, but here at Manus, Manus is actually not the typical multi-agent system. For example, we've seen a lot of designs that divide agents by role: you have a designer agent, or a programming agent, a manager agent. We don't do that, because we think the reason those roles exist is that it's how a human company works, and that's due to the limitations of human context. So Manus is a multi-agent system, but we do not divide by role. We only have very few agents. For example, we have a huge general executor agent, a planner agent, a knowledge management agent, and maybe some data API registration agents. Yeah. So we are very, very cautious about adding more sub-agents, for the reason we mentioned before: communication is very hard. And we implement most other kinds of sub-agents as agent-as-tool, as we mentioned before.
Yeah, that's a great point. I see this mistake a lot, or I don't know if it's a mistake, but you see people anthropomorphizing agents a lot, like, it's my designer agent, and I think it's kind of a forced analogy to map a human org chart onto your sub-agents. So, got it. So for you it's like a planner and a knowledge manager. A knowledge manager might do what? Like, um, what would be the task of the knowledge manager?
Yeah, it's even simpler than that. As we mentioned, we have a knowledge system in Manus. What the knowledge agent does is review the conversation between the user and the agent and figure out what should be saved in long-term memory. So it's that simple.
Got it. Yeah. Okay. Got it. It's like a memory manager, a planner, and then sub-agents like a general executor sub-agent that can just call all the tools or actions in the sandbox. That makes sense. Keep it simple. I like that a lot, right? That makes a lot of sense.
Um
yeah, let me see, there's a bunch of questions here. Um, but we did hit a lot. Um, how about guardrailing? Someone asked a question about safety and guardrailing. How do you think about this? I guess that's the nice thing about a sandbox, but tell me a little bit about how you think about it.
Yeah, I think, um, this is a very sensitive question, because, you know, if you have a sandbox that's connected to the internet, everything is dangerous. Yeah. So we have put a lot of effort into guardrailing. At the very least, we do not let information get out of the sandbox. For example, if you get prompt-injected, we have some checks on outgoing traffic; for example, we'll ensure that no tokens or credentials go out of the sandbox. And if the user wants to get something out of the sandbox, we have those kinds of, what do we call it, removal mechanisms to ensure that no sensitive information goes out of the sandbox. But, you know, another kind of thing is that we have a browser inside of Manus, and the browser is very complicated. For example, if you log into your websites, you can choose to let Manus persist your login state, and this turns out to be very tricky, because sometimes the content of the web page can also be malicious; maybe they're doing prompt injection. And this, I think, is somewhat out of scope for an application company, so we're working very closely with the computer-use model providers, for example Anthropic and Google; they're adding a lot of guardrails here. So right now in Manus, every time you do some sensitive operation, whether inside the browser or in the sandbox, Manus will require a manual confirmation, and you must accept it, or otherwise you have to take over and finish it yourself. So I think it's pretty hard for us to design a truly well-designed solution, but it's a progressive approach. Right now we're letting the user take over more frequently, but if the guardrails in the model get better, we can do less. Yeah.
Yeah. How about the topic of evals? This has been discussed quite a bit online. You've probably seen, you know, Claude Code: they've talked a lot about doing less formal evals, at least for code, because code evals are more or less saturated, and lots of internal dogfooding. How do you think about evals? Are they useful? Which evals are actually useful? What's your approach?
Yeah. You know, at the beginning, at the launch of Manus, we were using public academic benchmarks like GAIA, but after launching to the public we found they were super misaligned: models that get high scores on GAIA, users don't like. So right now we have three different kinds of evaluation. First, and most importantly, for every completed session in Manus we ask the user to give feedback, one to five stars. That's the gold standard; we always care about the average user rating. That's number one. Number two, we still use internal automated tests with verifiable results. For example, we've created our own datasets with clear answers. We still use a lot of public academic benchmarks, but we've also created some datasets that are more focused on execution, because most benchmarks out there are about read-only tasks. We designed executional or transactional tasks, and because we have the sandbox, we can frequently reset the test environment. Those are the automated parts. And number three, we have a lot of interns. You have to use a lot of real human evaluators for things like website generation or data visualization, because it's very hard to design a good reward model that knows whether the output is visually appealing; it's about taste. So we still rely a lot on human evaluation.
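As a rough sketch of how the first two signals described above might be combined (average user star rating, and pass rate on a verifiable task set run against a resettable sandbox), here is a hedged example. The dataset format, function names, and stand-in agent are all hypothetical.

```python
# Hypothetical eval sketch: average 1-5 star user ratings, plus pass
# rate on verifiable tasks that each reset the sandbox before running.

from statistics import mean


def average_user_rating(ratings: list[int]) -> float:
    """Gold-standard signal: mean of 1-5 star ratings from completed sessions."""
    return mean(ratings) if ratings else float("nan")


def run_verifiable_suite(tasks, run_agent, reset_sandbox) -> float:
    """Each task has a prompt and a checker with a clear, verifiable answer."""
    passed = 0
    for task in tasks:
        reset_sandbox()                      # transactional tasks need a clean slate
        output = run_agent(task["prompt"])
        if task["check"](output):            # e.g. exact answer or state assertion
            passed += 1
    return passed / len(tasks) if tasks else float("nan")


# Example usage with trivial stand-ins:
if __name__ == "__main__":
    print(average_user_rating([5, 4, 4, 3, 5]))
    tasks = [{"prompt": "2+2?", "check": lambda out: out.strip() == "4"}]
    print(run_verifiable_suite(tasks, run_agent=lambda p: "4", reset_sandbox=lambda: None))
```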
Perfect. Yeah. Let me ask you, I know we're coming up on time, but I do want to ask about this emerging trend of reinforcement learning with verifiable rewards versus just building tool-calling agents. Claude Code is extremely good, and they have the benefit that they built the harness, so they can perform RL on their harness and get really, really good with the tools they provide in it. Do you guys do RL, or how do you think about that? Of course, in that case you'd have to use open models. I've been playing with this quite a bit lately. How do you think about it: just using tool calling out of the box with model providers, versus doing RL yourself inside your environment, with your harness?
Yeah, as I mentioned, before starting Manus I was a model-training guy; I'd been doing pre-training, post-training, and RL for a lot of years. But I have to say that right now, if you have sufficient resources you can try, but as I mentioned earlier, MCP is a big game changer here. If you want to support MCP, you're not using a fixed action space, and if it's not a fixed action space, it's very hard to design a good reward; you can't generate enough rollouts, and the feedback will be unbalanced. So if you want to build a model that supports MCP, you're literally building a foundation model by yourself. And everyone in the community, the model companies, is already doing that; they're doing it for you. So right now I don't think we should spend that much time on RL. But as I mentioned earlier, we are exploring new ways to do what you might call personalization, or some sort of online learning, but in a parameter-free way, for example with collective feedback.
Yeah. One little one along those lines: take the case where, for example, Anthropic has done reinforcement learning with verifiable rewards on some set of tools for Claude Code. Have you found that you can kind of mock your harness to use similar tool names to unlock the same capability, if that makes sense? For example, they've obviously trained on tools like Glob, Grep, and some other set of tools for manipulating the file system. Can you effectively reproduce that same functionality by having the exact same tools, with the same tool names and the same descriptions, in your harness? Or how do you think about unlocking that? You see what I'm saying?
Yeah. I don't know the clear answer here, but for us, we actually try not to use the same names, because if you design your own function, you may have different requirements for it, and the parameters, the input arguments, might be different. So you don't want to confuse the model: if the model is trained on a lot of post-training data that includes some internal tools, you don't want the model to get confused between those and yours.
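To illustrate the naming point, here is a hedged example of a tool definition in a JSON-schema, function-calling style. The tool name and parameters are made up for illustration; the idea is simply that if your search tool takes different arguments than the Grep tool a model saw in post-training, it gets its own distinct name so the model doesn't fall back on the other tool's calling conventions.

```python
# Hypothetical tool schema (not one of Manus's tools): a repo search tool
# with a distinct name instead of reusing "Grep", because its arguments
# differ from what the model may have seen in post-training data.

search_repo_text = {
    "name": "search_repo_text",          # distinct name, not "Grep"
    "description": "Search tracked files in the workspace for a regex pattern.",
    "parameters": {
        "type": "object",
        "properties": {
            "pattern": {"type": "string", "description": "Regex to search for."},
            "globs": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Optional file globs to restrict the search.",
            },
            "max_results": {"type": "integer", "default": 50},
        },
        "required": ["pattern"],
    },
}
```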
Okay. Okay. Got it. Perfect. Well, I think we're actually at time, and I want to respect your time, because I know it's early; you're in Singapore, and it's very early for you. So, well, this was really good. Thank you. We'll definitely make sure this recording is available, and we'll make sure the slides are available. Any parting things you want to mention, things you want to call out, calls to action? Yeah, people should go use Manus, but the floor is yours.
Yeah. I just want to say: everybody, try it. We have a free tier.
Yeah. Yeah. Absolutely. Hey, thanks a lot, Pete. I'd love to do this again sometime.
Yeah. Thanks for having me.
Yep. Okay. Bye. Bye.
Join us for a deep dive into context engineering – the critical practice that determines how well your AI agents perform in production. Lance Martin from LangChain and Manus co-founder Yichao "Peak" Ji share battle-tested strategies for managing context windows, optimizing performance, and building agents that scale. Peak was recently named one of MIT's Innovators Under 35 for his work on AI agents.

Here, we cover Manus's context engineering approach. Strategies include:
(1) **Context reduction** via dual-form tool results (full/compact) with policy-based compaction and schema-driven summarization;
(2) **Context offloading** through layered action spaces (function calling → sandbox utils → packages/APIs) with filesystem-based state management and shell utilities instead of vectorstore indexing;
(3) **Context isolation** using minimal sub-agents (planner, knowledge manager, executor) with agent-as-tool paradigm and constrained decoding for schema-based inter-agent communication.

📊 Access the Presentations:
Lance Martin's slides (LangChain): https://docs.google.com/presentation/d/16aaXLu40GugY-kOpqDU4e-S0hD1FmHcNyF0rRRnb1OU/edit?slide=id.p#slide=id.p
Yichao "Peak" Ji's slides (Manus): https://drive.google.com/file/d/1QGJ-BrdiTGslS71sYH4OJoidsry3Ps9g/view?usp=sharing

Ready to start building reliable agents? Sign up for LangSmith, our agent observability & evals platform: https://www.langchain.com/langsmith/?utm_medium=social&utm_source=youtube&utm_campaign=q4-2025_meetup-manus_co

Chapters
0:01:00 Introduction to context engineering
0:12:00 Why context engineering in Manus
0:15:00 Context reduction in Manus
0:19:20 Context isolation in Manus
0:22:17 Context offloading in Manus
0:29:00 Avoid context over-engineering
0:31:00 Q&A: Explain sandbox utils in Manus
0:31:55 Q&A: Indexing (vectorstore) vs just using files
0:32:50 Q&A: Memory in Manus
0:34:30 Q&A: Manus and The Bitter Lesson
0:36:44 Q&A: Data format
0:37:45 Q&A: Summarization tips
0:40:00 Q&A: Sub-agents as tools
0:43:57 Q&A: Model choice
0:46:20 Q&A: Tool selection
0:49:48 Q&A: Planning
0:53:35 Q&A: Guardrails
0:55:39 Q&A: Evals
0:57:15 Q&A: Using RL