Welcome, everyone, as we get started. It's been a while since we've done one of these webinars, so we're excited to do them again. My name is Harrison. I'm the co-founder and CEO of LangChain, joined by Nick, who's been here for a number of years and is currently on the Applied AI team. We're excited to talk about evaluating and debugging these new types of long-running, or deep, agents.

Some agenda setting and context first. We have about an hour here. I'll kick it off with about 10 minutes of thoughts on evaluating these deeper types of agents and introducing some of the new things we launched yesterday, which I think are really cool. Then I'll hand it off to Nick, who has about 20 minutes of a more practical, hands-on walkthrough showing how we use these tools to actually test, debug, and trace deep agents. After that, we'll open it up for questions. There is a Q&A box in the panel below, so leave your questions there and we will get to them at the end. Okay, I think we can get started.
We'll be talking about debugging and observing deep agents. We use "deep agents" in the title; by deep agents, we really mean agents that run for an extended period of time and do more autonomous tasks. Deep agent is a term we came up with to describe things like Claude Code, deep research, and Manus: general-purpose agents that operate for extended periods of time. We saw that they had a number of common characteristics, like they all use a file system and they all use sub-agents and things like that. It's also just reasonable to call these things agents; these are what we think agents are. The issue is that we've been using the term agents for a number of years and agents haven't really worked. Then, about six months ago, they started to work, and by work, I mean things like Claude Code, deep research, and Manus. So we use deep agents as a way to differentiate these newer types of agents that we see taking off. As these things have become more and more common, we've started to think about how the tools we build to work with them have evolved. So now I'm going to start sharing some slides that walk through a lot of our thoughts here.
First, what's different about evaluating? We'll talk about evaluating, and then we'll talk about observing and debugging. What's different about evaluating deep agents compared to simpler LLM apps, whether that's a single LLM call or a chain?

When you evaluate a simple LLM app, the typical approach is: you build a dataset, which is a collection of inputs and outputs; you define an evaluator, which could be correctness, could be LLM-as-a-judge, could be something else; and then you basically run it in a loop. You take your app, run it over the dataset to get your outputs, then run the evaluator over the outputs, and repeat.

Deep agents are different from these types of applications in a few ways. One, they often have state. You're not just judging the output of the LLM; you're judging the changes it made to the state. If you think about coding agents, because that's the most typical example, those changes would be files in the code base: how are those changing? Two, the outputs are often complex. Even when the agent responds with natural language, those are now really long natural language responses; and if it makes changes to code, it can make a ton of changes across a bunch of different files. So the output is generally way more complex than a single response from an LLM. Three, each run needs bespoke logic. When you're testing changes to code, for example, each test case might be a different task, and how you evaluate that task is very different depending on the task. In Terminal-Bench, which is a dataset for testing coding agents, they run a bunch of tests that are specific to each data point. So you now have bespoke logic for each data point.
The patterns we see for evaluating these types of deep agents are the following. First, bespoke testing logic for each data point: each test case has its own success criteria. That's different from, say, calculating accuracy of a classifier, where you use the same criteria for all data points. Second, there are multiple different ways to actually run the agents themselves. You might want to run them for just a single step to test individual decisions. You might want to do a full turn and test the end state of what happens. But a lot of these agents are also conversational, so you might want to test the back and forth. So there are multiple different ways to invoke these agents. And third, related to state: the environment matters, so you need clean, reproducible test environments.

The main thing we've been pushing recently to help evaluate these agents is a pytest and Jest integration. It automatically logs inputs, outputs, and full traces for each test. That's great because you can track results over time and link failed tests to the agent execution, so if a test fails, you know exactly what happened. It stores these as experiments in LangSmith so you can track regressions. The reason this pytest and Jest integration is nice is that when you write a test case, you just write code: code to set up the agent, code to call the agent, code to test the agent's output. As we mentioned, there are multiple ways to run the agent, so you might want different ways to run it for different test cases, and there's bespoke logic for each test case, so you might want bespoke code for that as well. pytest for Python and Jest for JavaScript provide really nice ways, in a pretty familiar software engineering paradigm, to write evals for these things.
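As a rough illustration of that pattern, here's a minimal sketch of what such a test might look like, assuming LangSmith's pytest plugin (the @pytest.mark.langsmith marker and the langsmith.testing logging helpers) and a hypothetical run_agent helper; the names and exact logging calls are illustrative, so check them against your SDK version.

```python
# A minimal sketch of a bespoke eval as a pytest test, logged to LangSmith.
# Assumes the langsmith pytest plugin and a hypothetical `run_agent` helper;
# the exact marker and logging APIs may differ across SDK versions.
import pytest
from langsmith import testing as t

from my_agent import run_agent  # hypothetical: builds and invokes your deep agent


@pytest.mark.langsmith  # logs inputs/outputs and the full trace as an experiment
def test_refund_email_gets_a_reply():
    email = "Hi, I was charged twice for order #1234. Can you refund one charge?"
    t.log_inputs({"email": email})

    result = run_agent(email)  # run a full turn and capture the end state
    t.log_outputs({"messages": result["messages"]})

    # Bespoke success criteria for this data point only: the agent must have
    # drafted a reply (tool name is hypothetical).
    tool_calls = [
        tc["name"]
        for msg in result["messages"]
        for tc in getattr(msg, "tool_calls", []) or []
    ]
    assert "write_email" in tool_calls, "agent should have drafted a reply"
```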
Moving on to debugging deep agents: what's different about debugging deep agents versus simpler LLM apps? Simpler LLM apps probably have a shorter prompt, maybe a paragraph at most, and a shorter trajectory: maybe it's a single call to an LLM, maybe it's a chain that does one retrieval call and then one answer, but it's generally simpler. Deep agents have longer prompts. If you look at Claude Code, its prompt is about 2,000 lines long. They've got longer trajectories; they can call 10, 50, 100 tools in a row. And they're often conversational, with a back and forth, whether that's asking clarifying questions or the human correcting the agent.

This causes issues. You can't quickly tell whether a trajectory was efficient or whether it worked; you can't look at a hundred tool calls and know whether the agent was being efficient or not. Same for prompts: if the agent messes up, you don't immediately know which part of the prompt to change. If the prompt is just three sentences, you can see it all in one go and immediately have an idea of which sentence to change. If it's a massive prompt, you don't.

So we've evolved LangSmith in two ways to help with debugging these agents. We launched both of these yesterday; they're both new, we'd love feedback on them, and Nick will walk through them in more detail. One is Polly, an in-app assistant for agent engineering. The other is LangSmith Fetch, which is a CLI.
Walking through them: Polly is an AI assistant for agent engineering in LangSmith. It's chat-based, so you can interact with it. There are three places in the app where you can use it; we only wanted to add it where we thought it could be useful, and we tried to be pretty intentional about where we put it. One is the trace view, where you can ask questions like: did this make the right number of tool calls? Did any of the tool calls look inefficient? The next is the thread view, where you can ask things like: was the user disappointed in the conversation? Did they have to repeat their question multiple times? And the third is the playground, where you can say things like: the output is actually X, I want it to be Y, go change the prompt in a way that makes it Y. And Polly will go and figure out which parts of the prompt to change.
LangSmith Fetch is the other thing. Polly lives in the app; LangSmith Fetch brings LangSmith to you, specifically in the form of a CLI. You can pip install LangSmith Fetch, and then there are two core workflows. One is an instant pull: grab the latest trace in a project. So if you're developing an agent locally, you're running it, and it just messed up, you can pull down the latest trace and work with it immediately. The second is bulk export: grab a bunch of traces and dump them locally. Maybe you've run the agent a bunch of times, maybe you want to look at production data, maybe you want to look at some tests you're running; you pull down the latest 10, 20, 100, however many traces, and put them in your file system. That lets your coding agent (presumably everyone's using coding agents these days) go over them, inspect them, see what's going on, and suggest changes to your agent, which also lives in your code base. So it's bringing these things into your code base so the coding agent can start to edit code based on them.
That's the high level of what I wanted to share. Again, we think evaluating and debugging are different for these longer-running deep agents, and we've introduced a number of things (the pytest integration, Polly, and LangSmith Fetch) to help with this. To make it all more concrete, I'm now going to hand it off to Nick, who's going to show how we actually use these things.
Cool, let's get into it. I'm going to go ahead and share my screen here so you can see the code. Today I'm just going to walk through a pretty simple deep agent. There's actually a quickstart for this agent in our deep agents quickstarts repository, if you're curious to follow up later. This agent is just a personal assistant, and we're going to walk through its full implementation: the prompts we give it, the different tools it has access to, and one specific sub-agent for meeting scheduling. Once we have a good idea of how our assistant works, we'll run it over an example, then pivot over into LangSmith, take a look at the trace, and then use Polly, use LangSmith Fetch, and use another pretty interesting tool to try and make our agent better. Cool.
For those of you who know deep agents or have some familiarity: it's a pretty simple agent harness. A deep agent is just a ReAct agent, but it comes opinionated with a set of tools that we've seen to be pretty helpful for making agents performant in a lot of different situations. One of the most important parts for a developer to get right for a deep agent is the system prompt. Like Harrison mentioned, the system prompt for a single chat application might be quite simple, but we see system prompts for deep agents being quite long and quite detailed. You can give heuristics on how the agent is supposed to work, and you can give concrete examples of what it's supposed to do in different scenarios. All of that is pretty helpful to bake in.
So, taking a quick look at the system prompt for our assistant here: I'm giving it a bit of a persona, telling it it's a top-notch assistant for me. I'm giving it a little background about myself (I live in New York, I'm pretty busy), and I'm also giving it some pretty concrete instructions on how to handle incoming emails. That's how our personal assistant is going to work: it triggers when I get a new email and determines what to do with it. You can see here that I've given it some triage instructions. There's a class of emails that I don't think are typically worth responding to and that I don't need to look at, and there are emails that are definitely worth responding to, where I want this agent to go ahead and write a response for me. The agent also has the ability to do other things. In addition to writing emails and starting new threads, it can call a sub-agent that's focused on scheduling meetings (the sub-agent has access to my Google Calendar and can schedule meetings for me), or, if it doesn't think I should look at an email, it can mark it as read.
The second prompt here is just for that sub-agent, and we'll see this all come together in a second. The sub-agent has a different purpose than the main, overall agent: we really want it to just focus on scheduling meetings, and that's intentional. We don't want to give the main agent too much to worry about at once. So whenever someone wants to schedule a meeting, the main agent will kick it over to this meeting-scheduler sub-agent. The sub-agent has two specific tools: one to look at my calendar and one to schedule meetings on my calendar. I also gave it some guidelines to follow about different times to schedule meetings and how long to schedule them for.

Line 79 to line 93 here is pretty much the bulk of creating the actual deep agent, and intentionally it's pretty short. We have create_deep_agent, which is a factory function from our deepagents package. We have a few Gmail tools, which I've implemented in a separate file; they just let us talk to the Gmail API and actually take actions. We create the deep agent, give it that overarching system prompt with instructions about myself and about how and when to call its tools, and give it the one sub-agent for scheduling meetings, which is really good at scheduling meetings and interacting with my calendar. The agent also has a few other tools for writing emails, kicking off new email threads, and marking emails as read.
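For readers following along, here is a minimal sketch of what that wiring can look like, assuming the deepagents package's create_deep_agent factory. The Gmail tools, prompt contents, and exact keyword names (for example, system_prompt versus instructions, and whether sub-agent tools are passed as objects or names) vary by version and are illustrative, not the demo's exact code.

```python
# A minimal sketch of wiring up an assistant like the one in the demo.
# Assumes the `deepagents` package; tools, prompts, and keyword names are
# illustrative rather than the demo's exact code.
from deepagents import create_deep_agent

# Hypothetical Gmail tool implementations, kept in a separate file in the demo.
from gmail_tools import (
    write_email, start_thread, mark_as_read, check_calendar, schedule_meeting,
)

ASSISTANT_PROMPT = """You are a top-notch personal assistant for Nick...
(triage instructions: when to reply, when to mark as read, when to schedule)"""

SCHEDULER_PROMPT = """You focus only on scheduling meetings on Nick's calendar...
(guidelines about acceptable meeting times and durations)"""

# The sub-agent is described declaratively: a name, a description the main agent
# sees, its own prompt, and the tools it is allowed to use.
meeting_scheduler = {
    "name": "meeting-scheduler",
    "description": "Checks calendar availability and schedules meetings.",
    "prompt": SCHEDULER_PROMPT,
    "tools": [check_calendar, schedule_meeting],
}

agent = create_deep_agent(
    tools=[write_email, start_thread, mark_as_read],
    system_prompt=ASSISTANT_PROMPT,   # some versions call this `instructions`
    subagents=[meeting_scheduler],
)

# Package an incoming email as a single user message and run a full turn.
email = "From: Oliver Queen\nSubject: Deep agents\nCan we chat at 8am next Monday?"
result = agent.invoke(
    {"messages": [{"role": "user", "content": f"A new email came in, handle it:\n{email}"}]}
)
print(result["messages"][-1].content)
```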
Just to zoom out here, the point I really want to hammer home is that deep agents are really quick to set up in this fundamental, basic form. But that's really just the start of the battle. The question now becomes: how do we make sure the agent actually works the way we want it to? How do we know it's calling the right tools in the right scenarios? And how can we improve it over time against specific inputs?

Like most developers would, I think the natural next thing is just to run our deep agent over an example and see if it does what we want it to do. So here I'm going to kick off our deep agent against a sample email thread. It's from my good friend Oliver Queen, who is super interested in deep agents, wants to chat about them, and wants to schedule some time at 8am next Monday. So we're just going to go ahead and run this agent.
What I've done here is package this email into a single message for the agent to take in: a simple message that says a new email came in, use your tools, handle it to the best of your ability. I've also captured the response. This is an agent running under the hood, so we'll get a full response back about how it handled the email, and I'm going to print that out for comparison, because the default for a lot of developers, myself included, is print debugging. What I didn't mention earlier is that I've set the LangSmith API key in my environment and turned on tracing to LangSmith.
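That setup is just environment variables; as a rough sketch (variable names as documented for LangSmith tracing, shown from Python for convenience; double-check against your SDK version), it looks something like this:

```python
# Tracing to LangSmith is configured via environment variables; no changes are
# needed in the agent code itself.
import os

os.environ["LANGSMITH_TRACING"] = "true"            # turn on tracing
os.environ["LANGSMITH_API_KEY"] = "<your-api-key>"  # from your LangSmith settings
# Optional: route traces to a specific project instead of "default".
os.environ["LANGSMITH_PROJECT"] = "personal-assistant"
```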
If I swap screens over to LangSmith, I can see this trace running right now. It corresponds to the request I just made: an email thread came in, handle it to the best of your ability, and we can see the content of the email here. Like Harrison mentioned, the trace for this deep agent is pretty complex and pretty long; it would take me a decent amount of time to click through each of the individual parts and manually parse what's going on. That's why I'll open Polly over here on the side to see if it can do the job for me. Polly has a few default prompts, so I'm just going to ask it to summarize what the agent did in the trace. At the same time, we can click through and try to get an idea of what went on. The model we're using here is Claude Sonnet. We called this task tool, and the task tool kicked off our sub-agent to go look at my availability and see if I could take the meeting with Oliver. Even in the time it took me to look through that one step, Polly was able to pull all the information from these traces and give me a quick summary of what the agent actually did. We can see that the agent checked calendar availability, delegated to the sub-agent like we just saw, drafted and sent an email response, and then marked the email as read, so I don't have to handle it myself. It also gave me some nice metrics, like that the agent completed successfully in 46 seconds. So this is pretty cool: in the UI, instead of clicking through each of these pieces myself, I get a nice summary from Polly. And Polly is also just a generic chat, so if I had follow-up questions, like which specific days it checked, I could ask Polly and get answers back there.
Harrison brought up this point earlier, but what we've done so far is constrained to this UI experience. We're using LangSmith to get a really good sense of exactly what our agent did, and we can use Polly to give us a summary or ask about specific parts. But a lot of the development for our agent actually happens in our code. You can see here, for comparison, that the printed output is pretty hairy, and I probably wouldn't want to debug from it myself. So that begs the question: how can I get the information I see in LangSmith into my terminal, so it can be used by coding agents like Claude Code or the deep agent CLI, and I can improve my agent directly while working with a coding assistant?
So I'm going to use LangSmith Fetch here. It's just a package we can install; if I try to install it, you can see I already have it. LangSmith Fetch works by pulling the latest trace from a particular project. So I'm going to come back into LangSmith, take a look at the tracing project I'm currently logging to for this personal assistant, copy its ID, and set that as the config LangSmith Fetch uses. I'll run uv run langsmith-fetch config set project <project ID> to set this project as the one I want to get traces from. Now if I run uv run langsmith-fetch traces, it calls the LangSmith API under the hood, gets the most recent trace for me, and pretty-prints it in this nice format.
As I mentioned, we can already see this in a much nicer view by going into LangSmith, but this format is really helpful for a coding agent to digest. So I'm going to kick up our coding agent, the deep agent CLI. For those of you who haven't worked with it before, it's very similar to Claude Code: it's a coding agent, it has access to all of my files, and it can suggest and make updates to those files. I've given the deep agent CLI a few instructions already, but I can just ask it here: can you fetch the most recent trace from LangSmith and summarize it for me? Just like other coding agents, there are different modes you can run the deep agent CLI in. Right now I have it in approval mode, where I want to approve every action, but I could also run it in YOLO mode and let it cook. So I just approved the action, it's running langsmith-fetch traces in the background, and it's given me the summary: the different tools the agent used, the overall workflow, and what it just handled.
Now let's see if we can actually improve the agent's behavior. Maybe, like some of you, I don't really love waking up too early. So I'm going to tell the agent: hey, I prefer to sleep in, I don't want to take any meetings before 9am. Can you update my agent to adhere to this? And can you also write a test to make sure my agent follows these instructions? Let's see what it does. We can see it's taking a few actions: it's reading my files, it just read agent.py and also test_assistant.py, where I've defined my tests. And now, cool, it's suggesting an update to my prompt. It's basically just writing a new line that says: very important, Nick wants to sleep in, Nick doesn't want to wake up early. Cool. So that has now been edited into the actual system prompt for my agent. Now it's working some more, and in a moment we'll see what it comes up with.
Cool. To expand this a little, we can see it's now adding an actual test, a pytest test. I'll go ahead and accept it, and then we'll take a closer look at exactly what code is in there. If I open up my test_assistant.py and scroll down to the new code, I can see it's a test specifically designed to check that we decline early-morning meetings. The example looks very similar to the one I just inputted, and there's a specific success criterion for this test case: the response should politely decline any early meetings (in this case, the 8am meeting time) and respect my preference. Then there are a few different ways I'm actually testing the agent; I'll kick off the test here so it can run while we're walking through this. I run the agent to its end state: I take the email, format it into that same input message, and get my response. From that response, I extract the tool calls, and there are a few different things I assert. One, I want to make sure we actually wrote an email response to the user. That's very important; if we didn't do that, there's no point in using an LLM-as-a-judge, because we fundamentally made a mistake. So that's a very deterministic check on the end state of the agent. Then I do something softer: I use this evaluate_tool_call method that I wrote, which passes in the success criteria and uses an LLM-as-a-judge with structured output to check whether the final output from the agent fulfills those criteria. This is really powerful, because for my different test cases I can write different, specific success criteria and have the LLM-as-a-judge evaluate them per test case. It's also neat because I can assert different tool calls, with potentially different tool arguments, for my different test cases too.
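As a rough sketch of a test in that shape (the tool names, input format, judge model, and the evaluate_tool_call helper below are modeled on Nick's description rather than copied from the demo):

```python
# Sketch of a test combining a deterministic tool-call check with an
# LLM-as-a-judge on a per-test success criterion. Tool names, the input
# format, and the judge prompt are illustrative, not the demo's exact code.
from pydantic import BaseModel
from langchain.chat_models import init_chat_model

from my_agent import agent  # hypothetical: the deep agent built earlier


class Judgment(BaseModel):
    passed: bool
    reasoning: str


def evaluate_tool_call(tool_call: dict, success_criteria: str) -> Judgment:
    """Ask a judge model whether this tool call satisfies the criteria."""
    judge = init_chat_model("openai:gpt-4o-mini").with_structured_output(Judgment)
    return judge.invoke(
        f"Success criteria: {success_criteria}\n\nTool call: {tool_call}\n"
        "Does the tool call satisfy the criteria?"
    )


def test_declines_early_morning_meetings():
    email = "Hey, want to chat about deep agents at 8am next Monday?"
    result = agent.invoke(
        {"messages": [{"role": "user", "content": f"A new email came in, handle it:\n{email}"}]}
    )

    # Deterministic check on the end state: an email reply must have been written.
    tool_calls = [tc for m in result["messages"] for tc in getattr(m, "tool_calls", []) or []]
    write_calls = [tc for tc in tool_calls if tc["name"] == "write_email"]
    assert write_calls, "agent never wrote an email response"

    # Softer check: LLM-as-a-judge against this test's bespoke success criteria.
    criteria = "Politely declines the 8am meeting and respects the no-meetings-before-9am preference."
    judgment = evaluate_tool_call(write_calls[-1], criteria)
    assert judgment.passed, judgment.reasoning
```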
Awesome. We can see the agent says the test passed, and now it wants to fetch the trace to see exactly how it was handled. We don't need to do that, but if we go over to LangSmith, we can see that this test case ran and got logged in my tracing project as well, with this input. In this case, the agent wrote an email response that actually says: I'd love to chat about it, however, I don't take meetings before 9am. So I've updated the behavior of my agent, really using the AI tools to help me out.
To recap what we just went through: we started with a pretty simple personal assistant agent. We wrote it with deep agents in just a few lines of code and tested it on a simple example, and its behavior was pretty reasonable. But there was a specific customization I wanted to add on top of it. To do that, I used the deep agent CLI locally as a coding agent, gave it access to LangSmith Fetch so it could find the latest trace and understand the information in it, and asked for a specific change. It made that change, wrote a test for it, ran that test, and logged it back to LangSmith. And if the test hadn't passed, the agent could keep iterating: it could fetch the most recent trace to figure out what went wrong according to the test, and then update the prompts or edit the tools to make the agent better.

Cool, that was it for the practical example. I can stop sharing my screen, and we can take a few questions.
I can help read some of them out. All right, let's see if we can do this. I clicked "answer live" for this one; do you see anything pop up when I click answer live? I don't. Okay, I'll click answer live and then read them out loud.

Can Polly eventually be used in the terminal, so I don't have to tab back and forth between the browser and Cursor? That would be a much smoother flow with coding agents. I think this is where LangSmith Fetch comes into play. We don't currently have plans to expose Polly itself via an API, but we have LangSmith Fetch for that.
How will one evaluate a multi-agent system without a curated QA pair? Nick, do you want to talk about evaluation strategies in general and how you might think about evaluating without a ground truth? Yeah, I think there are a lot of different types of ground truth, and it depends on the type of agent you're building. If it's a more open-ended agent, say research or coding, there might be specific criteria you want to assert against the final answer, and that doesn't have to be a single ground-truth answer. For a coding agent, you might want to actually run the file the agent produced to see if it works. For a research agent, you might want to make sure sources are cited properly, or check some other success criterion, like that one or two specific facts were mentioned. So it really depends; you don't need a full ground-truth answer. In a lot of cases, you just want to make sure one particular thing happened for your agent. Cool.
Let's skip around. Okay, this one from Cameron: are those deep agent trace spans automatic? If you set two environment variables, do they start tracing to LangSmith automatically? Yeah, you don't have to do anything else in your code. And we do have docs on tracing; if you look at LangSmith tracing and the LangChain integrations, you can see it there.
Does Polly run with its own API key, or do we need to provide one for it to work? It runs with your own API key. We did that deliberately so you don't have to worry about any data-sensitive things coming to our models. So yes, you provide your own API key, and then it runs.
Do you suggest using the deep agent CLI for LangChain-related things rather than Claude Code? I don't know if you have a take on this, Nick; I'll give my quick take, which is that right now there's nothing about the deep agent CLI that's specifically for LangChain-related things. But we are thinking about how to make it easy to customize things like the deep agent CLI so that you can have a CLI that is really good at LangChain things. So I actually view the deep agent CLI as more of a scaffolding for other, maybe more specialized, coding agents in the future. Nick, anything to add? Yeah, I think that's spot on. One thing I've struggled with a lot with Claude Code is that it gets import paths wrong a lot of the time, especially for less-adopted libraries. So one thing we could do that's really interesting is have specific profiles you can pull with the deep agent CLI to get a coding agent that's really good at one thing.
Are there any specific features that are only supported for Python versus TypeScript? Off the top of my head, LangSmith Fetch is currently a Python-only client, but it works for traces whether they come from Python or TypeScript. Deep agents is both Python and TypeScript, though there are some differences between the two. Nick, you've worked more on the JS side; do you know the differences between deep agents in Python and TypeScript? I think it should be pretty up to date in terms of core functionality, so if you see anything, please open an issue on the repo and it'll get addressed.
Cool. I'm thinking I can just abandon Claude Code and use deep agents to develop an agent instead; I presume I need to set it up with API keys. Yeah. And one of the cool things is that we support, I think, OpenAI, Anthropic, and Gemini right now, so you can use it with other models as well.
How does LangSmith differ from other agent observability and eval tools, e.g. Langfuse? There are a few things we've seen. Generally we've pushed the boundary a lot in what you can do with agents. We're very scalable, because agents generate a lot of data, and we have a whole team dedicated to working on scalability. There's also a difference between tracing single-LLM-call apps and these really complex agents. These complex agents didn't really exist until six or nine months ago, so everyone, including us, was focused on simpler LLM apps before that. Only now, I think, are we starting to tackle these deep-agent-style things, and I think we're doing a lot of work there that other folks aren't.
Can we use our own models to run Polly or the deep agent CLI, or do we have to use one of the frontier models? I'll answer for Polly, and then maybe you can answer for the deep agent CLI, Nick. For Polly, one of the things we're adding, which should be in prod today or tomorrow, is the ability to switch between different models. Right now we only support OpenAI and Anthropic, but it's an OpenAI-compliant endpoint, so it can be any other endpoint running whatever model, as long as it's OpenAI-compliant. We've seen a lot of open-source models hosted on platforms that expose them as OpenAI endpoints, so you could plug those in. Right now the deep agent CLI runs with the frontier model providers, so I believe you can use OpenAI, Anthropic, or Gemini models, but configurability is something we're looking to add in the CLI in general.
Do you plan to support the skills middleware as part of the core deep agents library itself, and not just the deep agents CLI? Yes, not sure on the timeframe, but yes; I think Skills by Anthropic are very interesting and we want to support them.
or is there any cost associated with it? Can I use it for my local
development? Uh, Langsmith is not open source. It is our commercial offering. Um, there is
a free tier, so you can absolutely sign up. And if you're just using it
for debugging, you should be able to use just the free tier. Um, uh, pass
a certain number of traces, uh, that's where there is some cost associated with it.
And there is an enterprise option to deploy it on prem, uh, but that's, that's
generally only an enterprise option. Um, how about difference with
How about the difference with Arize or other enterprise tools? Similar to the answer for Langfuse: it's easy to monitor a single trace of a simple app, but it's harder to monitor lots of traces, whether that's millions or billions coming in, to do it snappily, and to do it for these longer and more complex deep agents.
There's an interesting question here about the example: in the case of a deep agent, let's say the agent did a couple of tool calls that were useless or had wrong parameters, but it eventually passed. How will deep-agent-based evals catch that, basically the length of the trajectory? This is really interesting. Sometimes you, as the developer of the app, have a strong opinion that given certain inputs, the agent should do exactly one thing next. This goes back to what Harrison showed about running an agent for a single step: you can run it for just that one chat turn, see what tool calls it makes, and assert that it calls the right tool with the correct arguments. There are other cases, like this email example, where I don't really know how many dates the agent is going to check. I mostly care that at the end of the day it schedules a meeting for me; it doesn't matter which particular day, as long as I'm free. That's a different thing: there I have to run the agent end to end, and what I assert against the trajectory is just that at some point it called the schedule-meeting tool. I don't care about the arguments, I just care that it called the tool.
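As a rough sketch of that single-step pattern (the agent, tool name, and expected arguments below are hypothetical; here the "single step" is approximated by inspecting the first tool-calling message from a run, but you could also interrupt the graph before tool execution):

```python
# Sketch of a single-step assertion: check the agent's first decision rather
# than its end state. The agent, tool name, and arguments are hypothetical.
from my_agent import agent  # hypothetical deep agent


def test_first_step_checks_calendar_before_replying():
    result = agent.invoke(
        {"messages": [{"role": "user", "content": "Oliver wants to meet at 8am Monday."}]}
    )
    # Look at the first AI message that made tool calls, not the final answer.
    first_ai = next(m for m in result["messages"] if getattr(m, "tool_calls", None))
    call = first_ai.tool_calls[0]

    assert call["name"] == "check_calendar"        # right tool for this input
    assert "monday" in str(call["args"]).lower()   # and roughly the right arguments
```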
Awesome. Maybe we can take turns choosing; there are too many questions, so let's pick ones we think are interesting, because otherwise I'm just going to read down the list. One that I like here: are you planning to add Polly at the project level to summarize the most common problems the agent has, or to check multiple traces? We're thinking a lot about where to add Polly. We want to be really intentional with it; we don't want to add AI just for the sake of adding AI. Even in the three places we added it (threads, traces, and the playground), we started with limited functionality. For example, you can't add an evaluator from the playground, because we really wanted to focus on what we thought was the most critical stuff there. The other thing I'd say is that we're thinking about agents in other parts of the app that may be longer-running, more background things. For example, we have an insights agent: the insights functionality is an agent, and it's not part of Polly. Why not? Because it's just a pretty different UX. Insights will run for 5, 10, 20 minutes in the background and analyze all the traces, which is pretty similar to what you're asking about summarizing the most common problems the agent has. I'd use the insights agent for that, and it isn't really something you'd want in a chat; we have a dedicated UI for it and list the jobs that have run previously. So the answer is: we are thinking of putting Polly in other places in the app, we just don't want to add it everywhere willy-nilly, and we'll also have these other types of agents that make more sense for longer-running work. Another example is optimization. We're thinking a bunch about how you hill-climb on examples, and whether that should be part of Polly in the playground or part of these async background jobs. We don't have a super strong opinion on that yet; we're thinking through it live. But we're planning to add a lot more AI, and LangSmith Fetch will get more capabilities as well, like pulling down experiments and maybe even pulling down annotations to use, and things like that.
Cool, I see a few questions here that are pretty quick. Does LangSmith support human annotation? Yes, LangSmith has an annotation queues feature that's specifically dedicated to that. Is the deep agent CLI free to use? Yes, it's free to install and use; you just need to supply your own API key. Another interesting question: what sort of memory is served with the deep agent CLI, or is there a way to customize the memory when prompting? The deep agent that runs as part of the deep agent CLI pulls part of its system prompt from a file called agent.md, which it writes locally on your machine. So you can open that file yourself and add instructions, which is what I did earlier today by giving it some instructions on how to use LangSmith Fetch. You can also just prompt the agent to remember something. Maybe I tell it: hey, I have soccer every Tuesday at six. That's something the deep agent can write to its own memory files, because it has access to the file system. So instead of cracking open that file and writing it yourself, you can talk to the agent, and the agent can write it as well.
One here, which is maybe more open-ended as well: can you use tests to check things like token or context use? Maybe I'll even generalize that: what types of things can you test in tests? You can test a bunch of stuff. You can test latency, you can test token use and context use. We've seen people test functions that create context, and we've seen people test hallucination and groundedness against context. I don't know if we've seen people test that an agent never passes, say, 20% context fullness or something like that. Anything else you've seen people test, Nick? Sorry, I missed that part of the question; I was looking through the question list. What do people test? I've seen people test latency, obviously accuracy, groundedness; you can test for things like prompt injection and other security-related concerns. Any other types of things you've seen people test for? Yeah, I think content is a big one. If you have an open-ended chat application, you can have guardrails that run as some sort of online evaluation to filter out questions you deem don't fit your content policy. That's another one I've seen.
Cool, maybe one other quick one that I'll answer, and then I'll let Nick go. If we use Fetch to trace, do we still need a LangSmith subscription? LangSmith Fetch fetches data from LangSmith, so the idea is that you still send all the traces to LangSmith and need an account for that, although, as mentioned, we do have a free tier. Then you use Fetch to bring the traces to your coding agent.
A really good question I see from Manuel and Bruno: what are your thoughts on building evals for complex queries carried out by a deep agent, where there are multi-step tasks with many points of failure? Any advice on good practices for building the test sets? For me, this goes back to running agents a single step at a time. You can always model the inputs to an agent as a list of messages going in. So instead of letting an agent run for multiple steps at once and then evaluating the end result, you run the agent on the first input, it produces the first output, and you have a set of assertions on that first output that have to pass before you give it the next input. You break this multi-step problem down into a bunch of single steps, with assertions in between. That way, if the agent does go off the rails, you're not wasting a bunch of tokens or money on the agent doing things that are totally wrong; you catch it at the first point of failure, work on optimizing the agent until it passes that, and then see if it passes the second step, the third, and so on.
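As a rough sketch of that decomposition under assumed names (the agent, the scripted turns, and the expected tools below are all hypothetical), each turn's assertion gates the next input:

```python
# Sketch of breaking a multi-step task into gated single steps: each turn's
# assertion must pass before the next input is appended. All names hypothetical.
from my_agent import agent  # hypothetical deep agent


def first_tool_call(messages):
    """Return the first tool call made in this batch of messages, if any."""
    for m in messages:
        for tc in getattr(m, "tool_calls", []) or []:
            return tc
    return None


def test_scheduling_flow_step_by_step():
    # Scripted turns paired with the tool we expect the agent to reach for next.
    script = [
        ("A new email: Oliver wants to meet next week to talk deep agents.", "check_calendar"),
        ("Any afternoon works for him.", "schedule_meeting"),
    ]
    state = {"messages": []}
    for user_input, expected_tool in script:
        prev_len = len(state["messages"])
        state["messages"].append({"role": "user", "content": user_input})
        state = agent.invoke(state)                      # run one turn on the accumulated history
        new_messages = state["messages"][prev_len + 1:]  # only what this turn produced
        call = first_tool_call(new_messages)
        assert call is not None, f"expected a tool call after: {user_input!r}"
        assert call["name"] == expected_tool             # gate the next input on this assertion
```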
Another question with a pretty easy answer: is there a way to host a deep agent so that my team can access it and ask questions through a shared interface? Yes, there are actually two ways. One, we have LangSmith Deployments, where you can deploy LangGraph, LangChain, or deep agents, or really any agent you build in code, and then we have what we call LangGraph Studio, where you can interact with these things after you've deployed them. We also have a no-code agent builder in LangSmith that's actually built on top of deep agents. We didn't do anything no-code for a while, but we recently did: we basically took deep agents and put them in a UI, because deep agents are just a prompt plus tools, so now you can build these things in a no-code way, interact with them, and share them. We have workspace agents, so you can share them with other people in your workspace. This is still in beta, and we'd love any feedback here. But yes, that's how you can share and access these deep agents.
One more open-ended question that I think is pretty interesting: what feature within LangSmith have you seen resonate the most with enterprises? I can speak more for myself, about what resonates most with me, but I think it's tracing. Tracing is still super undervalued. Evals is a buzzword everyone throws around, but at the end of the day, if something doesn't pass your evals, you need to understand why it didn't pass. So you need to go look at the trace, or have Polly look at the trace, or have the deep agent CLI with LangSmith Fetch look at the trace, and understand very granularly where your agent went wrong. When I think evals, at least, I don't think of large benchmarks or a hundred tests where you pass 50 of them. I think of a small, focused set where you want one hundred percent passing, making sure your agent accomplishes its core competencies.
And before Harrison jumps in, I see another one from Sriharsha: any ideas for a Swift- or Android-based LangChain framework on top of deep agents, so we can run agents on-device with an on-device LLM? I don't know about on-device LLMs, but I talked to Harrison about this a few weeks ago, and I think mobile is a super interesting UX. I've kicked off Claude Code from my phone and had it do stuff for me while I was off hiking. So I think this is something to explore, and it's super interesting to me.
Yeah, and maybe I can add one thing to Nick's answer about what enterprises have found resonates most. One surprising thing to me is the number of people using LangSmith for compliance, actually. We didn't build LangSmith to be a compliance tool, but a lot of the functionality it provides, like showing everything that happens inside, testing, and benchmarking, is exactly what compliance teams are looking for. They want to know exactly what happens inside the app so they can ensure nothing bad is happening, and they want to know how it performs on different things. So we've heard multiple times that LangSmith is a really useful tool for ensuring agents are compliant with whatever regulations apply.
And maybe one quick one I'll answer: are there any plans for LangChain Academy courses related to this, to show off best practices in LangChain's opinion? We're still doing a bunch of work on evaluating deep agents, so no plans as of now, but it's a good idea. We do have LangChain Academy courses on LangSmith in general: LangSmith Quick Start, which is about a one-hour course, and LangSmith Essentials, which is more like a three-or-four-hour course; I think Nick actually taught the longer one back in the day. There's so much in LangSmith that we didn't show during this webinar, so for a full deep dive, I definitely encourage you to check out the LangSmith courses on LangChain Academy. I'll drop a link to the Academy in the chat; if people didn't know about it, we have a bunch of educational resources on LangChain, LangGraph, and LangSmith, and we spend a ton of time there.
Two related questions: is there an online evaluator option, and if so, how do you deal with streaming and latency? And another one: in multi-agent orchestration, is there an eval or test to prune agents, removing them when they're not used or not performing? I think both of these really lend themselves to online evals. For those less familiar with the term, online evals run when your app is already in production: real users are using your app, and those traces are showing up in your tracing project. Every time a trace shows up, you can run an online eval to figure out, say, what the intent of the user was, whether a certain thing occurred, whether it occurred under a certain time, and so on. Online evals are super interesting, especially for deep agents. You can track how long they take to respond, which is super important for UX; you can track how many tokens are being used; and you can make sure your research agents, or your more open-ended agents, aren't churning forever before converging. So there's a lot to be done with online evals, and LangSmith supports that quite well.
Any additional features for Fetch or Polly being developed? The two that are top of mind: one, we want to let Fetch pull down results from experiments. Right now it can pull down traces and groups of traces; we want it to pull down groups from experiments as well. And for Polly, the big thing we're iterating on is having it do some of the prompt optimization. Right now it can edit the prompt, but we want to let it edit the prompt, rerun the whole playground to evaluate over the dataset, get the results, see if they changed, and then change the prompt again and iterate on that. Those two are the things on the immediate-term roadmap there.
I see a question as well: can Fetch be used for monitoring deep agent runs, say with a custom tool? That is one avenue; you could give the deep agent a custom tool just to fetch traces from LangSmith. The reason Fetch was designed the way it is, is that the deep agent CLI has access to a shell, so it can just run the langsmith-fetch command instead of going through a tool. If for some reason you wanted to build an agent that wasn't running in a shell environment and didn't have access to that: under the hood, LangSmith Fetch uses LangSmith's APIs to fetch runs and traces, so that's something you could add as tool functionality as well.
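A rough sketch of that kind of custom tool using the langsmith Python SDK (the tool wrapper and project name are illustrative, and LangSmith Fetch itself may pull richer data than this):

```python
# Sketch of a custom tool that fetches recent root traces via the LangSmith SDK,
# for agents that can't shell out to the langsmith-fetch CLI. The @tool wrapper
# and default project name are illustrative.
from langchain_core.tools import tool
from langsmith import Client


@tool
def fetch_recent_traces(project_name: str = "personal-assistant", limit: int = 5) -> str:
    """Return a plain-text summary of the most recent root runs in a tracing project."""
    client = Client()  # reads LANGSMITH_API_KEY from the environment
    runs = client.list_runs(project_name=project_name, is_root=True, limit=limit)
    lines = [f"{run.start_time} {run.name} status={run.status} id={run.id}" for run in runs]
    return "\n".join(lines) or "No runs found."
```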
Can Fetch be used for non-deep agents, like a simpler LangGraph graph? Right now, a lot of this functionality is optimized for agents that have a messages list, because we've seen that's a very canonical way to represent the trajectory of these agents. If it's a more custom LangGraph thing without messages, then even though it might be shorter or run faster, it's actually more complex to work with than these simpler structures. Deep agents are actually pretty simple agents under the hood: they basically just run in a loop, call tools, and accumulate a list of messages. I don't know if it's a dirty little secret, but the most impressive agents actually use some of the simplest cognitive architectures out there. So a lot of these tools are really optimized for that tool-looping architecture, which is simple and easy to deal with, and at least right now that's the focus. I do think Polly works on anything; we've given Polly more latitude, but we've kept LangSmith Fetch a little more focused.
I see a few questions about multi-agent setups. In the past we've had packages with more specified handoffs, where one agent hands work off to another agent and waits for it to return. Now this is really modeled as tool calls, and I think that's inspired a lot by what we saw in Claude Code, in Anthropic's deep research (that great blog post a while ago), and in Manus, and how those tools work. The agent retains the right to spawn sub-agents and hand work off to them, but the work comes back in the form of a tool message result. One of the biggest benefits of this is context isolation. You could take a task the main agent could accomplish on its own, but that's fairly isolated or siloed, for example researching a particular topic. Doing all that research in the main context window is really going to bloat it, so that by the time you have your final response there's a bunch of ugly web search results sprinkled throughout. By handing that off to a sub-agent, which is functionally an agent with the exact same capabilities but a siloed context, the sub-agent can do all of the work and return one final, nice report to the main agent without returning all of those additional tokens in context.
One question around prompt optimization: any DSPy integration or co-functionality, or GEPA? A mini rant here: GEPA is a really fancy term for a really simple concept of letting the LLM suggest prompt improvements, then using that prompt and testing how well it does. So yes, we're doing variations of that. I think the ideas are relatively simple, so we probably won't be integrating with DSPy just for that, but it's essentially GEPA under the hood.
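A toy sketch of that loop (propose a new prompt with an LLM, evaluate it over your dataset, keep it only if it scores better); the model name is a placeholder and the evaluate callable is whatever eval harness you already have, so this is a simplification of what GEPA-style optimizers actually do:

```python
# Toy sketch of the "let the LLM suggest improvements, then test them" loop.
# The proposer model is a placeholder; `evaluate(prompt)` should run your agent
# over a dataset and return a score. Real GEPA-style optimizers add reflection
# on failures, Pareto fronts, and more.
from typing import Callable

from langchain.chat_models import init_chat_model

proposer = init_chat_model("openai:gpt-4o-mini")


def optimize(prompt: str, evaluate: Callable[[str], float], rounds: int = 5) -> str:
    best_prompt, best_score = prompt, evaluate(prompt)
    for _ in range(rounds):
        candidate = proposer.invoke(
            "Here is an agent system prompt and its eval score "
            f"({best_score:.2f}). Suggest an improved prompt; reply with only "
            f"the new prompt.\n\n{best_prompt}"
        ).content
        score = evaluate(candidate)
        if score > best_score:  # keep the candidate only if it scores better on the dataset
            best_prompt, best_score = candidate, score
    return best_prompt
```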
I see one comment here about deep agents: how do deep agents run commands and fire things off? One chief benefit of deep agents is that they can run things in parallel, and that includes tools, but it can also include sub-agents. The canonical example I always bring up: if you've got a complex research task and you're asking who the GOAT is, LeBron or Jordan, then instead of doing those sequentially, the deep agent can launch two sub-agents in parallel, one researching LeBron and one researching Jordan, and wait for both results to come back before coming up with its final answer. Parallelization is really powerful for agents, and it can speed up the user experience a lot.
There are a lot of questions around deep agents, so rather than try to answer all of them, because we're short on time, we should do a deep agents webinar in the future. At a high level: we're very, very bullish on deep agents. If you want agents that are more autonomous, or you're working on workflows where you need more autonomous agents, then you should probably switch over to deep agents from whatever you're doing now. There are definitely cases where you still need workflows, but I'd argue that for most of the things where you might think you need a workflow, if you're thinking about it in the context of an agent, it's probably better to use something more autonomous, like deep agents. We're investing a lot in both Python and TypeScript. We think it provides way more value than just abstractions: a big value-add of LangChain and other agent frameworks out there is the abstractions, whereas deep agents has more of a batteries-included approach, providing things like a planning tool and sub-agents. There are a bunch of questions about deep agents here, so I'd encourage people to check out the Python and TypeScript repos and leave issues or questions there; we probably won't get to them all here.

I think we're good on time, unless there are any final things you wanted to answer or talk about, Nick. No, I think that's it for me. Thanks, everyone, for coming; really appreciate it. Cool, awesome. Thank you all for joining. There will be a recording of this that we'll post later, and we'll do more of these, so hopefully see you in the future. Goodbye.
Explore the unique challenges of observing and evaluating Deep Agents in production. Deep Agents represent a shift in how AI systems operate – unlike simple chatbots or basic RAG applications, these agents run for extended periods, execute multiple sub-tasks, and make complex decisions autonomously. In this session, we'll dive into practical approaches for gaining visibility into Deep Agent behavior and measuring their effectiveness using LangSmith. Learn more about Deep Agents here: https://blog.langchain.com/deep-agents/ Try LangSmith: https://smith.langchain.com/