Welcome, everyone, as we get started. It's been a while since we've done one of these webinars, so we're excited to do them again. My name is Harrison. I'm the co-founder and CEO of LangChain, joined by Nick, who's been here for a number of years and is currently on the Applied AI team. We're excited to talk about evaluating and debugging these new types of long-running, or deep, agents.

Some agenda setting and context first. We have about an hour here. I'll kick it off with about 10 minutes of thoughts on evaluating these deeper types of agents and introducing some of the new things we launched yesterday, which I think are really cool. Then I'll hand it off to Nick, who has about 20 minutes of a more practical, hands-on walkthrough showing how we use these tools to actually test, debug, and trace deep agents. After that, we'll open it up for questions. There is a Q&A box in the panel below, so leave your questions there and we will get to them at the end. Okay, I think we can get started.
We'll be talking about debugging and observing deep agents. We use "deep agents" in the title; by deep agents, we really mean agents that run for an extended period of time and do more autonomous tasks. Deep agent is a term we came up with to describe things like Claude Code, deep research, and Manus: general-purpose agents that operate for extended periods of time. We saw that they had a number of common characteristics, like they all use a file system and they all use sub-agents and things like that. It's also just reasonable to call these things agents; these are what we think agents are. The issue is that we've been using the term agents for a number of years and agents haven't really worked. Then, about six months ago, they started to work, and by work, I mean things like Claude Code, deep research, and Manus. So we use deep agents as a way to differentiate these newer types of agents that we see taking off. As these things have become more and more common, we've started to think about how the tools we build to work with them have evolved. So now I'm going to start sharing some slides that walk through a lot of our thoughts here.
First, what's different about evaluating? We'll talk about evaluating, and then we'll talk about observing and debugging. What's different about evaluating deep agents compared to simpler LLM apps, whether that's a single LLM call or a chain?

When you evaluate a simple LLM app, the typical approach is: you build a dataset, which is a collection of inputs and outputs; you define an evaluator, which could be correctness, could be LLM-as-a-judge, could be something else; and then you basically run it in a loop. You take your app, run it over the dataset to get your outputs, then run the evaluator over the outputs, and repeat.

Deep agents are different from these types of applications in a few ways. One, they often have state. You're not just judging the output of the LLM; you're judging the changes it made to the state. If you think about coding agents, because that's the most typical example, those changes would be files in the code base: how are those changing? Two, the outputs are often complex. Even when the agent responds with natural language, those are now really long natural language responses; and if it makes changes to code, it can make a ton of changes across a bunch of different files. So the output is generally way more complex than a single response from an LLM. Three, each run needs bespoke logic. When you're testing changes to code, for example, each test case might be a different task, and how you evaluate that task is very different depending on the task. In Terminal-Bench, which is a dataset for testing coding agents, they run a bunch of tests that are specific to each data point. So you now have bespoke logic for each data point.
The patterns we see for evaluating these types of deep agents are the following. First, bespoke testing logic for each data point: each test case has its own success criteria. That's different from, say, calculating accuracy of a classifier, where you use the same criteria for all data points. Second, there are multiple different ways to actually run the agents themselves. You might want to run them for just a single step to test individual decisions. You might want to do a full turn and test the end state of what happens. But a lot of these agents are also conversational, so you might want to test the back and forth. So there are multiple different ways to invoke these agents. And third, related to state: the environment matters, so you need clean, reproducible test environments.

The main thing we've been pushing recently to help evaluate these agents is a pytest and Jest integration. It automatically logs inputs, outputs, and full traces for each test. That's great because you can track results over time and link failed tests to the agent execution, so if a test fails, you know exactly what happened. It stores these as experiments in LangSmith so you can track regressions. The reason this pytest and Jest integration is nice is that when you write a test case, you just write code: code to set up the agent, code to call the agent, code to test the agent's output. As we mentioned, there are multiple ways to run the agent, so you might want different ways to run it for different test cases, and there's bespoke logic for each test case, so you might want bespoke code for that as well. pytest for Python and Jest for JavaScript provide really nice ways, in a pretty familiar software engineering paradigm, to write evals for these things.
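As a rough illustration of that pattern, here's a minimal sketch of what such a test might look like, assuming LangSmith's pytest plugin (the @pytest.mark.langsmith marker and the langsmith.testing logging helpers) and a hypothetical run_agent helper; the names and exact logging calls are illustrative, so check them against your SDK version.

```python
# A minimal sketch of a bespoke eval as a pytest test, logged to LangSmith.
# Assumes the langsmith pytest plugin and a hypothetical `run_agent` helper;
# the exact marker and logging APIs may differ across SDK versions.
import pytest
from langsmith import testing as t

from my_agent import run_agent  # hypothetical: builds and invokes your deep agent


@pytest.mark.langsmith  # logs inputs/outputs and the full trace as an experiment
def test_refund_email_gets_a_reply():
    email = "Hi, I was charged twice for order #1234. Can you refund one charge?"
    t.log_inputs({"email": email})

    result = run_agent(email)  # run a full turn and capture the end state
    t.log_outputs({"messages": result["messages"]})

    # Bespoke success criteria for this data point only: the agent must have
    # drafted a reply (tool name is hypothetical).
    tool_calls = [
        tc["name"]
        for msg in result["messages"]
        for tc in getattr(msg, "tool_calls", []) or []
    ]
    assert "write_email" in tool_calls, "agent should have drafted a reply"
```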
Moving on to debugging deep agents: what's different about debugging deep agents versus simpler LLM apps? Simpler LLM apps probably have a shorter prompt, maybe a paragraph at most, and a shorter trajectory: maybe it's a single call to an LLM, maybe it's a chain that does one retrieval call and then one answer, but it's generally simpler. Deep agents have longer prompts. If you look at Claude Code, its prompt is about 2,000 lines long. They've got longer trajectories; they can call 10, 50, 100 tools in a row. And they're often conversational, with a back and forth, whether that's asking clarifying questions or the human correcting the agent.

This causes issues. You can't quickly tell whether a trajectory was efficient or whether it worked; you can't look at a hundred tool calls and know whether the agent was being efficient or not. Same for prompts: if the agent messes up, you don't immediately know which part of the prompt to change. If the prompt is just three sentences, you can see it all in one go and immediately have an idea of which sentence to change. If it's a massive prompt, you don't.

So we've evolved LangSmith in two ways to help with debugging these agents. We launched both of these yesterday; they're both new, we'd love feedback on them, and Nick will walk through them in more detail. One is Polly, an in-app assistant for agent engineering. The other is LangSmith Fetch, which is a CLI.
Walking through them: Polly is an AI assistant for agent engineering in LangSmith. It's chat-based, so you can interact with it. There are three places in the app where you can use it; we only wanted to add it where we thought it could be useful, and we tried to be pretty intentional about where we put it. One is the trace view, where you can ask questions like: did this make the right number of tool calls? Did any of the tool calls look inefficient? The next is the thread view, where you can ask things like: was the user disappointed in the conversation? Did they have to repeat their question multiple times? And the third is the playground, where you can say things like: the output is actually X, I want it to be Y, go change the prompt in a way that makes it Y. And Polly will go and figure out which parts of the prompt to change.
LangSmith Fetch is the other thing. Polly lives in the app; LangSmith Fetch brings LangSmith to you, specifically in the form of a CLI. You can pip install LangSmith Fetch, and then there are two core workflows. One is an instant pull: grab the latest trace in a project. So if you're developing an agent locally, you're running it, and it just messed up, you can pull down the latest trace and work with it immediately. The second is bulk export: grab a bunch of traces and dump them locally. Maybe you've run the agent a bunch of times, maybe you want to look at production data, maybe you want to look at some tests you're running; you pull down the latest 10, 20, 100, however many traces, and put them in your file system. That lets your coding agent (presumably everyone's using coding agents these days) go over them, inspect them, see what's going on, and suggest changes to your agent, which also lives in your code base. So it's bringing these things into your code base so the coding agent can start to edit code based on them.
That's the high level of what I wanted to share. Again, we think evaluating and debugging are different for these longer-running deep agents, and we've introduced a number of things (the pytest integration, Polly, and LangSmith Fetch) to help with this. To make it all more concrete, I'm now going to hand it off to Nick, who's going to show how we actually use these things.
Cool, let's get into it. I'm going to go ahead and share my screen here so you can see the code. Today I'm just going to walk through a pretty simple deep agent. There's actually a quickstart for this agent in our deep agents quickstarts repository, if you're curious to follow up later. This agent is just a personal assistant, and we're going to walk through its full implementation: the prompts we give it, the different tools it has access to, and one specific sub-agent for meeting scheduling. Once we have a good idea of how our assistant works, we'll run it over an example, then pivot over into LangSmith, take a look at the trace, and then use Polly, use LangSmith Fetch, and use another pretty interesting tool to try and make our agent better. Cool.
For those of you who know deep agents or have some familiarity: it's a pretty simple agent harness. A deep agent is just a ReAct agent, but it comes opinionated with a set of tools that we've seen to be pretty helpful for making agents performant in a lot of different situations. One of the most important parts for a developer to get right for a deep agent is the system prompt. Like Harrison mentioned, the system prompt for a single chat application might be quite simple, but we see system prompts for deep agents being quite long and quite detailed. You can give heuristics on how the agent is supposed to work, and you can give concrete examples of what it's supposed to do in different scenarios. All of that is pretty helpful to bake in.
So, taking a quick look at the system prompt for our assistant here: I'm giving it a bit of a persona, telling it it's a top-notch assistant for me. I'm giving it a little background about myself (I live in New York, I'm pretty busy), and I'm also giving it some pretty concrete instructions on how to handle incoming emails. That's how our personal assistant is going to work: it triggers when I get a new email and determines what to do with it. You can see here that I've given it some triage instructions. There's a class of emails that I don't think are typically worth responding to and that I don't need to look at, and there are emails that are definitely worth responding to, where I want this agent to go ahead and write a response for me. The agent also has the ability to do other things. In addition to writing emails and starting new threads, it can call a sub-agent that's focused on scheduling meetings (the sub-agent has access to my Google Calendar and can schedule meetings for me), or, if it doesn't think I should look at an email, it can mark it as read.
The second prompt here is just for that sub-agent, and we'll see this all come together in a second. The sub-agent has a different purpose than the main, overall agent: we really want it to just focus on scheduling meetings, and that's intentional. We don't want to give the main agent too much to worry about at once. So whenever someone wants to schedule a meeting, the main agent will kick it over to this meeting-scheduler sub-agent. The sub-agent has two specific tools: one to look at my calendar and one to schedule meetings on my calendar. I also gave it some guidelines to follow about different times to schedule meetings and how long to schedule them for.

Line 79 to line 93 here is pretty much the bulk of creating the actual deep agent, and intentionally it's pretty short. We have create_deep_agent, which is a factory function from our deepagents package. We have a few Gmail tools, which I've implemented in a separate file; they just let us talk to the Gmail API and actually take actions. We create the deep agent, give it that overarching system prompt with instructions about myself and about how and when to call its tools, and give it the one sub-agent for scheduling meetings, which is really good at scheduling meetings and interacting with my calendar. The agent also has a few other tools for writing emails, kicking off new email threads, and marking emails as read.
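For readers following along, here is a minimal sketch of what that wiring can look like, assuming the deepagents package's create_deep_agent factory. The Gmail tools, prompt contents, and exact keyword names (for example, system_prompt versus instructions, and whether sub-agent tools are passed as objects or names) vary by version and are illustrative, not the demo's exact code.

```python
# A minimal sketch of wiring up an assistant like the one in the demo.
# Assumes the `deepagents` package; tools, prompts, and keyword names are
# illustrative rather than the demo's exact code.
from deepagents import create_deep_agent

# Hypothetical Gmail tool implementations, kept in a separate file in the demo.
from gmail_tools import (
    write_email, start_thread, mark_as_read, check_calendar, schedule_meeting,
)

ASSISTANT_PROMPT = """You are a top-notch personal assistant for Nick...
(triage instructions: when to reply, when to mark as read, when to schedule)"""

SCHEDULER_PROMPT = """You focus only on scheduling meetings on Nick's calendar...
(guidelines about acceptable meeting times and durations)"""

# The sub-agent is described declaratively: a name, a description the main agent
# sees, its own prompt, and the tools it is allowed to use.
meeting_scheduler = {
    "name": "meeting-scheduler",
    "description": "Checks calendar availability and schedules meetings.",
    "prompt": SCHEDULER_PROMPT,
    "tools": [check_calendar, schedule_meeting],
}

agent = create_deep_agent(
    tools=[write_email, start_thread, mark_as_read],
    system_prompt=ASSISTANT_PROMPT,   # some versions call this `instructions`
    subagents=[meeting_scheduler],
)

# Package an incoming email as a single user message and run a full turn.
email = "From: Oliver Queen\nSubject: Deep agents\nCan we chat at 8am next Monday?"
result = agent.invoke(
    {"messages": [{"role": "user", "content": f"A new email came in, handle it:\n{email}"}]}
)
print(result["messages"][-1].content)
```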
Just to zoom out here, the point I really want to hammer home is that deep agents are really quick to set up in this fundamental, basic form. But that's really just the start of the battle. The question now becomes: how do we make sure the agent actually works the way we want it to? How do we know it's calling the right tools in the right scenarios? And how can we improve it over time against specific inputs?

Like most developers would, I think the natural next thing is just to run our deep agent over an example and see if it does what we want it to do. So here I'm going to kick off our deep agent against a sample email thread. It's from my good friend Oliver Queen, who is super interested in deep agents, wants to chat about them, and wants to schedule some time at 8am next Monday. So we're just going to go ahead and run this agent.
What I've done here is package this email into a single message for the agent to take in: a simple message that says a new email came in, use your tools, handle it to the best of your ability. I've also captured the response. This is an agent running under the hood, so we'll get a full response back about how it handled the email, and I'm going to print that out for comparison, because the default for a lot of developers, myself included, is print debugging. What I didn't mention earlier is that I've set the LangSmith API key in my environment and turned on tracing to LangSmith.
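That setup is just environment variables; as a rough sketch (variable names as documented for LangSmith tracing, shown from Python for convenience; double-check against your SDK version), it looks something like this:

```python
# Tracing to LangSmith is configured via environment variables; no changes are
# needed in the agent code itself.
import os

os.environ["LANGSMITH_TRACING"] = "true"            # turn on tracing
os.environ["LANGSMITH_API_KEY"] = "<your-api-key>"  # from your LangSmith settings
# Optional: route traces to a specific project instead of "default".
os.environ["LANGSMITH_PROJECT"] = "personal-assistant"
```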
If I swap screens over to LangSmith, I can see this trace running right now. It corresponds to the request I just made: an email thread came in, handle it to the best of your ability, and we can see the content of the email here. Like Harrison mentioned, the trace for this deep agent is pretty complex and pretty long; it would take me a decent amount of time to click through each of the individual parts and manually parse what's going on. That's why I'll open Polly over here on the side to see if it can do the job for me. Polly has a few default prompts, so I'm just going to ask it to summarize what the agent did in the trace. At the same time, we can click through and try to get an idea of what went on. The model we're using here is Claude Sonnet. We called this task tool, and the task tool kicked off our sub-agent to go look at my availability and see if I could take the meeting with Oliver. Even in the time it took me to look through that one step, Polly was able to pull all the information from these traces and give me a quick summary of what the agent actually did. We can see that the agent checked calendar availability, delegated to the sub-agent like we just saw, drafted and sent an email response, and then marked the email as read, so I don't have to handle it myself. It also gave me some nice metrics, like that the agent completed successfully in 46 seconds. So this is pretty cool: in the UI, instead of clicking through each of these pieces myself, I get a nice summary from Polly. And Polly is also just a generic chat, so if I had follow-up questions, like which specific days it checked, I could ask Polly and get answers back there.
Harrison brought up this point earlier, but what we've done so far is constrained to this UI experience. We're using LangSmith to get a really good sense of exactly what our agent did, and we can use Polly to give us a summary or ask about specific parts. But a lot of the development for our agent actually happens in our code. You can see here, for comparison, that the printed output is pretty hairy, and I probably wouldn't want to debug from it myself. So that begs the question: how can I get the information I see in LangSmith into my terminal, so it can be used by coding agents like Claude Code or the deep agent CLI, and I can improve my agent directly while working with a coding assistant?
So I'm going to use LangSmith Fetch here. It's just a package we can install; if I try to install it, you can see I already have it. LangSmith Fetch works by pulling the latest trace from a particular project. So I'm going to come back into LangSmith, take a look at the tracing project I'm currently logging to for this personal assistant, copy its ID, and set that as the config LangSmith Fetch uses. I'll run uv run langsmith-fetch config set project <project ID> to set this project as the one I want to get traces from. Now if I run uv run langsmith-fetch traces, it calls the LangSmith API under the hood, gets the most recent trace for me, and pretty-prints it in this nice format.
As I mentioned, we can already see this in a much nicer view by going into LangSmith, but this format is really helpful for a coding agent to digest. So I'm going to kick up our coding agent, the deep agent CLI. For those of you who haven't worked with it before, it's very similar to Claude Code: it's a coding agent, it has access to all of my files, and it can suggest and make updates to those files. I've given the deep agent CLI a few instructions already, but I can just ask it here: can you fetch the most recent trace from LangSmith and summarize it for me? Just like other coding agents, there are different modes you can run the deep agent CLI in. Right now I have it in approval mode, where I want to approve every action, but I could also run it in YOLO mode and let it cook. So I just approved the action, it's running langsmith-fetch traces in the background, and it's given me the summary: the different tools the agent used, the overall workflow, and what it just handled.
Now let's see if we can actually improve the agent's behavior. Maybe, like some of you, I don't really love waking up too early. So I'm going to tell the agent: hey, I prefer to sleep in, I don't want to take any meetings before 9am. Can you update my agent to adhere to this? And can you also write a test to make sure my agent follows these instructions? Let's see what it does. We can see it's taking a few actions: it's reading my files, it just read agent.py and also test_assistant.py, where I've defined my tests. And now, cool, it's suggesting an update to my prompt. It's basically just writing a new line that says: very important, Nick wants to sleep in, Nick doesn't want to wake up early. Cool. So that has now been edited into the actual system prompt for my agent. Now it's working some more, and in a moment we'll see what it comes up with.
Cool. To expand this a little, we can see it's now adding an actual test, a pytest test. I'll go ahead and accept it, and then we'll take a closer look at exactly what code is in there. If I open up my test_assistant.py and scroll down to the new code, I can see it's a test specifically designed to check that we decline early-morning meetings. The example looks very similar to the one I just inputted, and there's a specific success criterion for this test case: the response should politely decline any early meetings (in this case, the 8am meeting time) and respect my preference. Then there are a few different ways I'm actually testing the agent; I'll kick off the test here so it can run while we're walking through this. I run the agent to its end state: I take the email, format it into that same input message, and get my response. From that response, I extract the tool calls, and there are a few different things I assert. One, I want to make sure we actually wrote an email response to the user. That's very important; if we didn't do that, there's no point in using an LLM-as-a-judge, because we fundamentally made a mistake. So that's a very deterministic check on the end state of the agent. Then I do something softer: I use this evaluate_tool_call method that I wrote, which passes in the success criteria and uses an LLM-as-a-judge with structured output to check whether the final output from the agent fulfills those criteria. This is really powerful, because for my different test cases I can write different, specific success criteria and have the LLM-as-a-judge evaluate them per test case. It's also neat because I can assert different tool calls, with potentially different tool arguments, for my different test cases too.
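As a rough sketch of a test in that shape (the tool names, input format, judge model, and the evaluate_tool_call helper below are modeled on Nick's description rather than copied from the demo):

```python
# Sketch of a test combining a deterministic tool-call check with an
# LLM-as-a-judge on a per-test success criterion. Tool names, the input
# format, and the judge prompt are illustrative, not the demo's exact code.
from pydantic import BaseModel
from langchain.chat_models import init_chat_model

from my_agent import agent  # hypothetical: the deep agent built earlier


class Judgment(BaseModel):
    passed: bool
    reasoning: str


def evaluate_tool_call(tool_call: dict, success_criteria: str) -> Judgment:
    """Ask a judge model whether this tool call satisfies the criteria."""
    judge = init_chat_model("openai:gpt-4o-mini").with_structured_output(Judgment)
    return judge.invoke(
        f"Success criteria: {success_criteria}\n\nTool call: {tool_call}\n"
        "Does the tool call satisfy the criteria?"
    )


def test_declines_early_morning_meetings():
    email = "Hey, want to chat about deep agents at 8am next Monday?"
    result = agent.invoke(
        {"messages": [{"role": "user", "content": f"A new email came in, handle it:\n{email}"}]}
    )

    # Deterministic check on the end state: an email reply must have been written.
    tool_calls = [tc for m in result["messages"] for tc in getattr(m, "tool_calls", []) or []]
    write_calls = [tc for tc in tool_calls if tc["name"] == "write_email"]
    assert write_calls, "agent never wrote an email response"

    # Softer check: LLM-as-a-judge against this test's bespoke success criteria.
    criteria = "Politely declines the 8am meeting and respects the no-meetings-before-9am preference."
    judgment = evaluate_tool_call(write_calls[-1], criteria)
    assert judgment.passed, judgment.reasoning
```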
Awesome. We can see the agent says the test passed, and now it wants to fetch the trace to see exactly how it was handled. We don't need to do that, but if we go over to LangSmith, we can see that this test case ran and got logged in my tracing project as well, with this input. In this case, the agent wrote an email response that actually says: I'd love to chat about it, however, I don't take meetings before 9am. So I've updated the behavior of my agent, really using the AI tools to help me out.
To recap what we just went through: we started with a pretty simple personal assistant agent. We wrote it with deep agents in just a few lines of code and tested it on a simple example, and its behavior was pretty reasonable. But there was a specific customization I wanted to add on top of it. To do that, I used the deep agent CLI locally as a coding agent, gave it access to LangSmith Fetch so it could find the latest trace and understand the information in it, and asked for a specific change. It made that change, wrote a test for it, ran that test, and logged it back to LangSmith. And if the test hadn't passed, the agent could keep iterating: it could fetch the most recent trace to figure out what went wrong according to the test, and then update the prompts or edit the tools to make the agent better.

Cool, that was it for the practical example. I can stop sharing my screen, and we can take a few questions.
I can help read some of them out. All right, let's see if we can do this. I clicked "answer live" for this one; do you see anything pop up when I click answer live? I don't. Okay, I'll click answer live and then read them out loud.

Can Polly eventually be used in the terminal, so I don't have to tab back and forth between the browser and Cursor? That would be a much smoother flow with coding agents. I think this is where LangSmith Fetch comes into play. We don't currently have plans to expose Polly itself via an API, but we have LangSmith Fetch for that.
How will one evaluate a multi-agent system without a curated QA pair? Nick, do you want to talk about evaluation strategies in general and how you might think about evaluating without a ground truth? Yeah, I think there are a lot of different types of ground truth, and it depends on the type of agent you're building. If it's a more open-ended agent, say research or coding, there might be specific criteria you want to assert against the final answer, and that doesn't have to be a single ground-truth answer. For a coding agent, you might want to actually run the file the agent produced to see if it works. For a research agent, you might want to make sure sources are cited properly, or check some other success criterion, like that one or two specific facts were mentioned. So it really depends; you don't need a full ground-truth answer. In a lot of cases, you just want to make sure one particular thing happened for your agent. Cool.
Let's skip around. Okay, this one from Cameron: are those deep agent trace spans automatic? If you set two environment variables, do they start tracing to LangSmith automatically? Yeah, you don't have to do anything else in your code. And we do have docs on tracing; if you look at LangSmith tracing and the LangChain integrations, you can see it there.
Does Polly run with its own API key, or do we need to provide one for it to work? It runs with your own API key. We did that deliberately so you don't have to worry about any data-sensitive things coming to our models. So yes, you provide your own API key, and then it runs.
Do you suggest using the deep agent CLI for LangChain-related things rather than Claude Code? I don't know if you have a take on this, Nick; I'll give my quick take, which is that right now there's nothing about the deep agent CLI that's specifically for LangChain-related things. But we are thinking about how to make it easy to customize things like the deep agent CLI so that you can have a CLI that is really good at LangChain things. So I actually view the deep agent CLI as more of a scaffolding for other, maybe more specialized, coding agents in the future. Nick, anything to add? Yeah, I think that's spot on. One thing I've struggled with a lot with Claude Code is that it gets import paths wrong a lot of the time, especially for less-adopted libraries. So one thing we could do that's really interesting is have specific profiles you can pull with the deep agent CLI to get a coding agent that's really good at one thing.
Are there any specific features that are only supported for Python versus TypeScript? Off the top of my head, LangSmith Fetch is currently a Python-only client, but it works for traces whether they come from Python or TypeScript. Deep agents is both Python and TypeScript, though there are some differences between the two. Nick, you've worked more on the JS side; do you know the differences between deep agents in Python and TypeScript? I think it should be pretty up to date in terms of core functionality, so if you see anything, please open an issue on the repo and it'll get addressed.
Cool. I'm thinking I can just abandon Claude Code and use deep agents to develop an agent instead; I presume I need to set it up with API keys. Yeah. And one of the cool things is that we support, I think, OpenAI, Anthropic, and Gemini right now, so you can use it with other models as well.
How does LangSmith differ from other agent observability and eval tools, e.g. Langfuse? There are a few things we've seen. Generally we've pushed the boundary a lot in what you can do with agents. We're very scalable, because agents generate a lot of data, and we have a whole team dedicated to working on scalability. There's also a difference between tracing single-LLM-call apps and these really complex agents. These complex agents didn't really exist until six or nine months ago, so everyone, including us, was focused on simpler LLM apps before that. Only now, I think, are we starting to tackle these deep-agent-style things, and I think we're doing a lot of work there that other folks aren't.
Can we use our own models to run Polly or the deep agent CLI, or do we have to use one of the frontier models? I'll answer for Polly, and then maybe you can answer for the deep agent CLI, Nick. For Polly, one of the things we're adding, which should be in prod today or tomorrow, is the ability to switch between different models. Right now we only support OpenAI and Anthropic, but it's an OpenAI-compliant endpoint, so it can be any other endpoint running whatever model, as long as it's OpenAI-compliant. We've seen a lot of open-source models hosted on platforms that expose them as OpenAI endpoints, so you could plug those in. Right now the deep agent CLI runs with the frontier model providers, so I believe you can use OpenAI, Anthropic, or Gemini models, but configurability is something we're looking to add in the CLI in general.
Do you plan to support the skills middleware as part of the core deep agents library itself, and not just the deep agents CLI? Yes, not sure on the timeframe, but yes; I think Skills by Anthropic are very interesting and we want to support them.
or is there any cost associated with it? Can I use it for my local
development? Uh, Langsmith is not open source. It is our commercial offering. Um, there is
a free tier, so you can absolutely sign up. And if you're just using it
for debugging, you should be able to use just the free tier. Um, uh, pass
a certain number of traces, uh, that's where there is some cost associated with it.
And there is an enterprise option to deploy it on prem, uh, but that's, that's
generally only an enterprise option. Um, how about difference with
How about the difference with Arize or other enterprise tools? Similar to the answer for Langfuse: it's easy to monitor a single trace of a simple app, but it's harder to monitor lots of traces, whether that's millions or billions coming in, to do it snappily, and to do it for these longer and more complex deep agents.
There's an interesting question here about the example: in the case of a deep agent, let's say the agent did a couple of tool calls that were useless or had wrong parameters, but it eventually passed. How will deep-agent-based evals catch that, basically the length of the trajectory? This is really interesting. Sometimes you, as the developer of the app, have a strong opinion that given certain inputs, the agent should do exactly one thing next. This goes back to what Harrison showed about running an agent for a single step: you can run it for just that one chat turn, see what tool calls it makes, and assert that it calls the right tool with the correct arguments. There are other cases, like this email example, where I don't really know how many dates the agent is going to check. I mostly care that at the end of the day it schedules a meeting for me; it doesn't matter which particular day, as long as I'm free. That's a different thing: there I have to run the agent end to end, and what I assert against the trajectory is just that at some point it called the schedule-meeting tool. I don't care about the arguments, I just care that it called the tool.
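As a rough sketch of that single-step pattern (the agent, tool name, and expected arguments below are hypothetical; here the "single step" is approximated by inspecting the first tool-calling message from a run, but you could also interrupt the graph before tool execution):

```python
# Sketch of a single-step assertion: check the agent's first decision rather
# than its end state. The agent, tool name, and arguments are hypothetical.
from my_agent import agent  # hypothetical deep agent


def test_first_step_checks_calendar_before_replying():
    result = agent.invoke(
        {"messages": [{"role": "user", "content": "Oliver wants to meet at 8am Monday."}]}
    )
    # Look at the first AI message that made tool calls, not the final answer.
    first_ai = next(m for m in result["messages"] if getattr(m, "tool_calls", None))
    call = first_ai.tool_calls[0]

    assert call["name"] == "check_calendar"        # right tool for this input
    assert "monday" in str(call["args"]).lower()   # and roughly the right arguments
```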
Awesome. Maybe we can take turns choosing; there are too many questions, so let's pick ones we think are interesting, because otherwise I'm just going to read down the list. One that I like here: are you planning to add Polly at the project level to summarize the most common problems the agent has, or to check multiple traces? We're thinking a lot about where to add Polly. We want to be really intentional with it; we don't want to add AI just for the sake of adding AI. Even in the three places we added it (threads, traces, and the playground), we started with limited functionality. For example, you can't add an evaluator from the playground, because we really wanted to focus on what we thought was the most critical stuff there. The other thing I'd say is that we're thinking about agents in other parts of the app that may be longer-running, more background things. For example, we have an insights agent: the insights functionality is an agent, and it's not part of Polly. Why not? Because it's just a pretty different UX. Insights will run for 5, 10, 20 minutes in the background and analyze all the traces, which is pretty similar to what you're asking about summarizing the most common problems the agent has. I'd use the insights agent for that, and it isn't really something you'd want in a chat; we have a dedicated UI for it and list the jobs that have run previously. So the answer is: we are thinking of putting Polly in other places in the app, we just don't want to add it everywhere willy-nilly, and we'll also have these other types of agents that make more sense for longer-running work. Another example is optimization. We're thinking a bunch about how you hill-climb on examples, and whether that should be part of Polly in the playground or part of these async background jobs. We don't have a super strong opinion on that yet; we're thinking through it live. But we're planning to add a lot more AI, and LangSmith Fetch will get more capabilities as well, like pulling down experiments and maybe even pulling down annotations to use, and things like that.
Cool, I see a few questions here that are pretty quick. Does LangSmith support human annotation? Yes, LangSmith has an annotation queues feature that's specifically dedicated to that. Is the deep agent CLI free to use? Yes, it's free to install and use; you just need to supply your own API key. Another interesting question: what sort of memory is served with the deep agent CLI, or is there a way to customize the memory when prompting? The deep agent that runs as part of the deep agent CLI pulls part of its system prompt from a file called agent.md, which it writes locally on your machine. So you can open that file yourself and add instructions, which is what I did earlier today by giving it some instructions on how to use LangSmith Fetch. You can also just prompt the agent to remember something. Maybe I tell it: hey, I have soccer every Tuesday at six. That's something the deep agent can write to its own memory files, because it has access to the file system. So instead of cracking open that file and writing it yourself, you can talk to the agent, and the agent can write it as well.
One here, which is maybe more open-ended as well: can you use tests to check things like token or context use? Maybe I'll even generalize that: what types of things can you test in tests? You can test a bunch of stuff. You can test latency, you can test token use and context use. We've seen people test functions that create context, and we've seen people test hallucination and groundedness against context. I don't know if we've seen people test that an agent never passes, say, 20% context fullness or something like that. Anything else you've seen people test, Nick? Sorry, I missed that part of the question; I was looking through the question list. What do people test? I've seen people test latency, obviously accuracy, groundedness; you can test for things like prompt injection and other security-related concerns. Any other types of things you've seen people test for? Yeah, I think content is a big one. If you have an open-ended chat application, you can have guardrails that run as some sort of online evaluation to filter out questions you deem don't fit your content policy. That's another one I've seen.
Cool, maybe one other quick one that I'll answer, and then I'll let Nick go. If we use Fetch to trace, do we still need a LangSmith subscription? LangSmith Fetch fetches data from LangSmith, so the idea is that you still send all the traces to LangSmith and need an account for that, although, as mentioned, we do have a free tier. Then you use Fetch to bring the traces to your coding agent.
A really good question I see from Manuel and Bruno: what are your thoughts on building evals for complex queries carried out by a deep agent, where there are multi-step tasks with many points of failure? Any advice on good practices for building the test sets? For me, this goes back to running agents a single step at a time. You can always model the inputs to an agent as a list of messages going in. So instead of letting an agent run for multiple steps at once and then evaluating the end result, you run the agent on the first input, it produces the first output, and you have a set of assertions on that first output that have to pass before you give it the next input. You break this multi-step problem down into a bunch of single steps, with assertions in between. That way, if the agent does go off the rails, you're not wasting a bunch of tokens or money on the agent doing things that are totally wrong; you catch it at the first point of failure, work on optimizing the agent until it passes that, and then see if it passes the second step, the third, and so on.
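As a rough sketch of that decomposition under assumed names (the agent, the scripted turns, and the expected tools below are all hypothetical), each turn's assertion gates the next input:

```python
# Sketch of breaking a multi-step task into gated single steps: each turn's
# assertion must pass before the next input is appended. All names hypothetical.
from my_agent import agent  # hypothetical deep agent


def first_tool_call(messages):
    """Return the first tool call made in this batch of messages, if any."""
    for m in messages:
        for tc in getattr(m, "tool_calls", []) or []:
            return tc
    return None


def test_scheduling_flow_step_by_step():
    # Scripted turns paired with the tool we expect the agent to reach for next.
    script = [
        ("A new email: Oliver wants to meet next week to talk deep agents.", "check_calendar"),
        ("Any afternoon works for him.", "schedule_meeting"),
    ]
    state = {"messages": []}
    for user_input, expected_tool in script:
        prev_len = len(state["messages"])
        state["messages"].append({"role": "user", "content": user_input})
        state = agent.invoke(state)                      # run one turn on the accumulated history
        new_messages = state["messages"][prev_len + 1:]  # only what this turn produced
        call = first_tool_call(new_messages)
        assert call is not None, f"expected a tool call after: {user_input!r}"
        assert call["name"] == expected_tool             # gate the next input on this assertion
```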
Another question with a pretty easy answer: is there a way to host a deep agent so that my team can access it and ask questions through a shared interface? Yes, there are actually two ways. One, we have LangSmith Deployments, where you can deploy LangGraph, LangChain, or deep agents, or really any agent you build in code, and then we have what we call LangGraph Studio, where you can interact with these things after you've deployed them. We also have a no-code agent builder in LangSmith that's actually built on top of deep agents. We didn't do anything no-code for a while, but we recently did: we basically took deep agents and put them in a UI, because deep agents are just a prompt plus tools, so now you can build these things in a no-code way, interact with them, and share them. We have workspace agents, so you can share them with other people in your workspace. This is still in beta, and we'd love any feedback here. But yes, that's how you can share and access these deep agents.
One more open-ended question that I think is pretty interesting: what feature within LangSmith have you seen resonate the most with enterprises? I can speak more for myself, about what resonates most with me, but I think it's tracing. Tracing is still super undervalued. Evals is a buzzword everyone throws around, but at the end of the day, if something doesn't pass your evals, you need to understand why it didn't pass. So you need to go look at the trace, or have Polly look at the trace, or have the deep agent CLI with LangSmith Fetch look at the trace, and understand very granularly where your agent went wrong. When I think evals, at least, I don't think of large benchmarks or a hundred tests where you pass 50 of them. I think of a small, focused set where you want one hundred percent passing, making sure your agent accomplishes its core competencies.
And before Harrison jumps in, I see another one from Sriharsha: any ideas for a Swift- or Android-based LangChain framework on top of deep agents, so we can run agents on-device with an on-device LLM? I don't know about on-device LLMs, but I talked to Harrison about this a few weeks ago, and I think mobile is a super interesting UX. I've kicked off Claude Code from my phone and had it do stuff for me while I was off hiking. So I think this is something to explore, and it's super interesting to me.
Yeah, and maybe I can add one thing to Nick's answer about what enterprises have found resonates most. One surprising thing to me is the number of people using LangSmith for compliance, actually. We didn't build LangSmith to be a compliance tool, but a lot of the functionality it provides, like showing everything that happens inside, testing, and benchmarking, is exactly what compliance teams are looking for. They want to know exactly what happens inside the app so they can ensure nothing bad is happening, and they want to know how it performs on different things. So we've heard multiple times that LangSmith is a really useful tool for ensuring agents are compliant with whatever regulations apply.
And maybe one quick one I'll answer: are there any plans for LangChain Academy courses related to this, to show off best practices in LangChain's opinion? We're still doing a bunch of work on evaluating deep agents, so no plans as of now, but it's a good idea. We do have LangChain Academy courses on LangSmith in general: LangSmith Quick Start, which is about a one-hour course, and LangSmith Essentials, which is more like a three-or-four-hour course; I think Nick actually taught the longer one back in the day. There's so much in LangSmith that we didn't show during this webinar, so for a full deep dive, I definitely encourage you to check out the LangSmith courses on LangChain Academy. I'll drop a link to the Academy in the chat; if people didn't know about it, we have a bunch of educational resources on LangChain, LangGraph, and LangSmith, and we spend a ton of time there.
Two related questions: is there an online evaluator option, and if so, how do you deal with streaming and latency? And another one: in multi-agent orchestration, is there an eval or test to prune agents, removing them when they're not used or not performing? I think both of these really lend themselves to online evals. For those less familiar with the term, online evals run when your app is already in production: real users are using your app, and those traces are showing up in your tracing project. Every time a trace shows up, you can run an online eval to figure out, say, what the intent of the user was, whether a certain thing occurred, whether it occurred under a certain time, and so on. Online evals are super interesting, especially for deep agents. You can track how long they take to respond, which is super important for UX; you can track how many tokens are being used; and you can make sure your research agents, or your more open-ended agents, aren't churning forever before converging. So there's a lot to be done with online evals, and LangSmith supports that quite well.
Any additional features for Fetch or Polly being developed? The two that are top of mind: one, we want to let Fetch pull down results from experiments. Right now it can pull down traces and groups of traces; we want it to pull down groups from experiments as well. And for Polly, the big thing we're iterating on is having it do some of the prompt optimization. Right now it can edit the prompt, but we want to let it edit the prompt, rerun the whole playground to evaluate over the dataset, get the results, see if they changed, and then change the prompt again and iterate on that. Those two are the things on the immediate-term roadmap there.
I see a question as well: can Fetch be used for monitoring deep agent runs, say with a custom tool? That is one avenue; you could give the deep agent a custom tool just to fetch traces from LangSmith. The reason Fetch was designed the way it is, is that the deep agent CLI has access to a shell, so it can just run the langsmith-fetch command instead of going through a tool. If for some reason you wanted to build an agent that wasn't running in a shell environment and didn't have access to that: under the hood, LangSmith Fetch uses LangSmith's APIs to fetch runs and traces, so that's something you could add as tool functionality as well.
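A rough sketch of that kind of custom tool using the langsmith Python SDK (the tool wrapper and project name are illustrative, and LangSmith Fetch itself may pull richer data than this):

```python
# Sketch of a custom tool that fetches recent root traces via the LangSmith SDK,
# for agents that can't shell out to the langsmith-fetch CLI. The @tool wrapper
# and default project name are illustrative.
from langchain_core.tools import tool
from langsmith import Client


@tool
def fetch_recent_traces(project_name: str = "personal-assistant", limit: int = 5) -> str:
    """Return a plain-text summary of the most recent root runs in a tracing project."""
    client = Client()  # reads LANGSMITH_API_KEY from the environment
    runs = client.list_runs(project_name=project_name, is_root=True, limit=limit)
    lines = [f"{run.start_time} {run.name} status={run.status} id={run.id}" for run in runs]
    return "\n".join(lines) or "No runs found."
```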
Can Fetch be used for non-deep agents, like a simpler LangGraph graph? Right now, a lot of this functionality is optimized for agents that have a messages list, because we've seen that's a very canonical way to represent the trajectory of these agents. If it's a more custom LangGraph thing without messages, then even though it might be shorter or run faster, it's actually more complex to work with than these simpler structures. Deep agents are actually pretty simple agents under the hood: they basically just run in a loop, call tools, and accumulate a list of messages. I don't know if it's a dirty little secret, but the most impressive agents actually use some of the simplest cognitive architectures out there. So a lot of these tools are really optimized for that tool-looping architecture, which is simple and easy to deal with, and at least right now that's the focus. I do think Polly works on anything; we've given Polly more latitude, but we've kept LangSmith Fetch a little more focused.
I see a few questions about multi-agent setups. In the past we've had packages with more specified handoffs, where one agent hands work off to another agent and waits for it to return. Now this is really modeled as tool calls, and I think that's inspired a lot by what we saw in Claude Code, in Anthropic's deep research (that great blog post a while ago), and in Manus, and how those tools work. The agent retains the right to spawn sub-agents and hand work off to them, but the work comes back in the form of a tool message result. One of the biggest benefits of this is context isolation. You could take a task the main agent could accomplish on its own, but that's fairly isolated or siloed, for example researching a particular topic. Doing all that research in the main context window is really going to bloat it, so that by the time you have your final response there's a bunch of ugly web search results sprinkled throughout. By handing that off to a sub-agent, which is functionally an agent with the exact same capabilities but a siloed context, the sub-agent can do all of the work and return one final, nice report to the main agent without returning all of those additional tokens in context.
One question around prompt optimization: any DSPy integration or co-functionality, or GEPA? A mini rant here: GEPA is a really fancy term for a really simple concept of letting the LLM suggest prompt improvements, then using that prompt and testing how well it does. So yes, we're doing variations of that. I think the ideas are relatively simple, so we probably won't be integrating with DSPy just for that, but it's essentially GEPA under the hood.
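A toy sketch of that loop (propose a new prompt with an LLM, evaluate it over your dataset, keep it only if it scores better); the model name is a placeholder and the evaluate callable is whatever eval harness you already have, so this is a simplification of what GEPA-style optimizers actually do:

```python
# Toy sketch of the "let the LLM suggest improvements, then test them" loop.
# The proposer model is a placeholder; `evaluate(prompt)` should run your agent
# over a dataset and return a score. Real GEPA-style optimizers add reflection
# on failures, Pareto fronts, and more.
from typing import Callable

from langchain.chat_models import init_chat_model

proposer = init_chat_model("openai:gpt-4o-mini")


def optimize(prompt: str, evaluate: Callable[[str], float], rounds: int = 5) -> str:
    best_prompt, best_score = prompt, evaluate(prompt)
    for _ in range(rounds):
        candidate = proposer.invoke(
            "Here is an agent system prompt and its eval score "
            f"({best_score:.2f}). Suggest an improved prompt; reply with only "
            f"the new prompt.\n\n{best_prompt}"
        ).content
        score = evaluate(candidate)
        if score > best_score:  # keep the candidate only if it scores better on the dataset
            best_prompt, best_score = candidate, score
    return best_prompt
```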
I see one comment here about deep agents: how do deep agents run commands and fire things off? One chief benefit of deep agents is that they can run things in parallel, and that includes tools, but it can also include sub-agents. The canonical example I always bring up: if you've got a complex research task and you're asking who the GOAT is, LeBron or Jordan, then instead of doing those sequentially, the deep agent can launch two sub-agents in parallel, one researching LeBron and one researching Jordan, and wait for both results to come back before coming up with its final answer. Parallelization is really powerful for agents, and it can speed up the user experience a lot.
There are a lot of questions around deep agents, so rather than try to answer all of them, because we're short on time, we should do a deep agents webinar in the future. At a high level: we're very, very bullish on deep agents. If you want agents that are more autonomous, or you're working on workflows where you need more autonomous agents, then you should probably switch over to deep agents from whatever you're doing now. There are definitely cases where you still need workflows, but I'd argue that for most of the things where you might think you need a workflow, if you're thinking about it in the context of an agent, it's probably better to use something more autonomous, like deep agents. We're investing a lot in both Python and TypeScript. We think it provides way more value than just abstractions: a big value-add of LangChain and other agent frameworks out there is the abstractions, whereas deep agents has more of a batteries-included approach, providing things like a planning tool and sub-agents. There are a bunch of questions about deep agents here, so I'd encourage people to check out the Python and TypeScript repos and leave issues or questions there; we probably won't get to them all here.

I think we're good on time, unless there are any final things you wanted to answer or talk about, Nick. No, I think that's it for me. Thanks, everyone, for coming; really appreciate it. Cool, awesome. Thank you all for joining. There will be a recording of this that we'll post later, and we'll do more of these, so hopefully see you in the future. Goodbye.
Explore the unique challenges of observing and evaluating Deep Agents in production. Deep Agents represent a shift in how AI systems operate – unlike simple chatbots or basic RAG applications, these agents run for extended periods, execute multiple sub-tasks, and make complex decisions autonomously. In this session, we'll dive into practical approaches for gaining visibility into Deep Agent behavior and measuring their effectiveness using LangSmith. Learn more about Deep Agents here: https://blog.langchain.com/deep-agents/ Try LangSmith: https://smith.langchain.com/