Good afternoon everyone.
If there's one message I would like you to take away from this talk, it's that evals can become a source of joy. Yes, if you consolidate your agents' tools into an MCP server. Okay, I'll explain more.
I'm Scott Yak from Datadog. What does Datadog do? Datadog is an observability platform. You send us
metrics, traces, logs, and this nice
telemetry data. We provide the tools to
help you figure out whether your service
is running as expected, whether there's
an outage, and how to fix it. So, we have a nice website. You can look at the metrics and check on your service; there's a nice dashboard to get an overall sense of how your website is doing, some traffic information. We have some really fancy tools to visualize how your service is running. So this is all very nice, but it's a bit intimidating
for newcomers, and we don't just want you to see what's going on with your service. We want you to be able to act on your service. So Datadog builds a bunch of agents, and we have learned a couple of lessons along the way. What we
learned is that in the process of going from demo-quality to production-ready agents, when you want to improve the scores so that you can actually accomplish your tasks, you have to deal with a whole bunch of failure modes: things like hallucination, output formatting, tool call failures, and so on. And the answer everyone is going to give you is that you need evals. And I agree with that. You definitely need evals,
but they're also a real pain, right? So I would like to help with that. Evals are still going to be painful, but at least we can take away some part of it: tool call failures. And that's where MCP servers can help.
Now, what are MCP servers? Without MCP servers, for every single agent that you build, you typically have to build the tools to connect to the back end. And what usually happens is that because each agent is doing different things, they need different tools, they call the tools differently, and they expect different things from the tools. The tool calls fail in different ways. And that means that each agent team has its own tool call failure evals, and that leads to a lot of duplicated effort.
What MCP servers allow you to do is consolidate all your tools into one server, so these tools can serve a whole bunch of different agent teams. But you could always have done that, right? So what's so special about MCP servers? Well, the nice thing is that MCP servers can also be remote. So not only are they a building block or stepping stone towards better agents, they also talk directly to customers. Your IDEs like Cursor can use your tools directly. Third-party agents like Claude Code can use your tools directly to build an agentic experience for the customer. What this means is that your tools are no longer just a stepping stone. They're themselves a product. And when you have a product, you want to come up with a good product experience.
And what we want to get out of this product is that for our users, when they connect to our MCP server, the agents become Datadog power users.
To give you a quick sense of what it's like to connect to our MCP server, here's the agent tab from Cursor. You might ask a question like, are there any HTTP errors in service MCP, blah blah blah. Cursor is going to talk to the LLM, give you some feedback, and then it makes a tool call to our search logs tool. It chooses search logs, the search logs tool gives back the results, and then it summarizes what it found: yes, I found some HTTP errors. Great. It is acting like a power user, and as a user you didn't really need to know Datadog log search syntax. That's good. So let's see what actually is going on in there, so we have a better sense of what our MCP server actually needs to do.
So here's what just happened. You have the user, you have the agent, here a Claude Code agent panel, you have the MCP server, our Datadog MCP server, and the back end that serves the search logs tool behind it. The agent has a really important job, which is to manage the context window. It starts with the system prompt, and it makes a call to the MCP server to list tools so that it can fill its context window with the tool descriptions. That way it knows what tools are available and how to call them. Then it receives a request from the user and adds that to the context window. It then calls the LLM to decide what to do next. It may decide to call a tool, based on the information from the tool descriptions, and then it makes a call to the MCP server.
The MCP server takes that request and decides what to do with it: maybe call one particular back end, maybe call another back end. The back end returns the results, and the MCP server does some other business logic, maybe some filtering, maybe some pagination, maybe it postprocesses some JSON into CSV, or error handling, before it passes the response back to the agent.
So now the MCP server gives the response back to the agent. The agent loads it into the context window, so it's part of the context window when it decides what to do next. It might decide that it needs to go back and ask more questions, so it keeps looping until it's complete, and when it's complete it gives you back a result, yes, I found something, and returns it to the user. That's the flow of the agentic system, the most basic agent loop.
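To make that flow concrete, here is a minimal sketch of the agent loop just described. The mcp_client and llm objects and their methods are hypothetical placeholders, not a specific SDK.

```python
# Minimal sketch of the basic agent loop described above.
# mcp_client and llm are hypothetical placeholders, not a real SDK.

def agent_loop(mcp_client, llm, system_prompt, user_request):
    # Fill the context window with the system prompt and the tool
    # descriptions returned by the MCP server's list-tools call.
    context = [system_prompt, mcp_client.list_tools()]

    # Add the user's request to the context window.
    context.append(user_request)

    while True:
        # Ask the LLM what to do next, given everything in the context window.
        step = llm.next_step(context)

        if step.is_tool_call:
            # The MCP server handles the tool call (talks to a back end,
            # filters, paginates, handles errors) and returns a response...
            result = mcp_client.call_tool(step.tool_name, step.arguments)
            # ...which is loaded back into the context window before the
            # next decision.
            context.append(result)
        else:
            # The loop is complete; return the final answer to the user.
            return step.final_answer
```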
So I often get this question: what exactly is there to do in an MCP server? Don't you just wrap the API and prompt-tune the tool descriptions and all that stuff? Yes, the tool descriptions are an important part; they tell the agent what tools are available and how to call them. But besides that, it's also important to note that the tool call response also gets fed into the agent's context window, and that affects what the agent will do next. We control the MCP server, and because we are in the same company, we can also talk to the people who work on the back end to influence what comes back in the tool call response. So that means we also want to optimize the tool implementations.
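As a rough illustration of what optimizing a tool implementation can look like, here is a sketch of a tool handler that post-processes the back-end response before it reaches the agent's context window. The backend_api client and the response fields are hypothetical.

```python
# Sketch of a search-logs tool handler that does more than wrap the API.
# backend_api and the response field names are hypothetical.

def search_logs(query: str, limit: int = 20) -> str:
    raw = backend_api.search_logs(query=query)
    if raw.get("error"):
        # Return a short, actionable error instead of a raw stack trace.
        return f"Log search failed: {raw['error']}. Check the query syntax."

    # Cap how much lands in the agent's context window, and post-process
    # verbose JSON into a compact CSV-like summary.
    lines = ["timestamp,service,status,message"]
    for log in raw["logs"][:limit]:
        lines.append(
            f'{log["timestamp"]},{log["service"]},'
            f'{log["status"]},{log["message"][:120]}'
        )
    return "\n".join(lines)
```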
And this is a massive search space. It's not just the text in the tool descriptions; it is the code base of your entire back end as well, including your MCP server. How do you optimize that? How do you know what to prioritize? For that, you need evals. And the thing is that evals for MCP servers are different, because there's a lot we don't control. When you're an agent, you control almost everything. When you are an MCP server, you control almost nothing. And that feeds into our eval philosophy. We don't optimize for any particular agent. We are agent-agnostic, and we only check the final result, so we don't care which tool it calls. I'll go into more detail about why we chose this as our philosophy for evals.
So why agent-agnostic? The reason for that is really helplessness. We don't know enough about which agent is going to use us, so we cannot actually optimize for any particular agent. A different way to look at it is that we want the MCP server to be ergonomic enough that any agent can use it well, not just the smart ones. The nice side effect is that you can actually prioritize simple agents, and what that means is that in our evals we don't need to use the expensive models. We use the cheaper ones, the faster ones. And that means our evals are fast and cheap to run. Nice bonus.
So why tool-agnostic? What often happens when people make tool call evals for agents is that when the agent reaches a particular step, they will say, okay, make sure you call this tool, and make sure you call it this way. I think there's a better approach, and the reason is that agents are creative: there are often a few ways to solve the same problem. For example, if you want to check which service went out of memory, there are probably three different tools in Datadog that can give you that answer. But if you force it, saying, okay, you have to call search logs here, and the agent goes ahead and tries something else, then the evals are going to fail needlessly, and you have too much stuff to add into your evals. It becomes brittle.
And the other thing is that agents are surprisingly resourceful. One thing I accidentally noticed when I was using Cursor with our MCP server is that once, when I misspelled a service name, it actually zoomed out. It added some wildcard tags and removed certain filters, so it was effectively zooming out, and then it looked at which service names are in your logs in the first place, found one that is actually similar enough to the service name you had, zoomed into that, and said, okay, we didn't find the exact thing you asked for, but this is probably what you want. If you are in some very tight agentic loop with very rigid evals, you're probably going to mark that as a failure. But if you're a human trying to do something and you mess things up once in a while, that's actually a nice experience. We don't want to erase that with overly rigid evals.
And the other thing is that because we are the team that develops the MCP server, we want to keep upgrading our tools. If we make it too hard to change our tools, then when we want to make big changes, big improvements like consolidating five tools into three, switching to a different query syntax, or adding one tool that just makes all tools better, overly rigid tool evals are going to prevent us from making those changes in a way where you can see the eval scores before and after. Because once you change your tool surface, you have to rewrite your evals, and you don't get an apples-to-apples comparison anymore. Your tool-specific evals become too brittle; they become change detectors, and you don't want that. You want to keep improving your MCP server so that when you connect to it the whole experience is just better; the user shouldn't have to think about the individual tools.
And the effect of that is that eval runs are fast. From the time you change your code to the time you run the evals and see the results in the dashboard takes only two minutes. We run it locally, which is nice. When you see it on the dashboard, it looks like this. We have it instrumented with Datadog LLM Observability, so you can see at a glance which tool calls failed and which succeeded, and then you can zoom into one of those scenarios and say, okay, show me the spans of the tool calls, the LLM calls, and everything else that happened. Each of these has a trace ID, so you can take the URL, copy it, put it into your group's Slack and say, hey, this thing failed, why did it fail, can you help me take a look?
It's a very nice debugging experience.
So it makes running evals actually fun. You change something, you rerun the evals, two minutes later you see the results, and you can share them with your team. You try something new that makes something work that didn't work before, and you can take that trace ID and share it with your team. Our evals have become integrated into the dev cycle of our MCP server. And now devs want to write evals: if they are trying to do something with the MCP server that challenges its existing capabilities, and the evals don't actually capture that new possibility, they want to write an eval so that they can see, hey, before this failed, now it passes, I did something great.
Now, how does this help agent builders? The nice side effect for other people is that tool call failures are now handled by a central team. The agent builders don't have to handle tool call failures with their big end-to-end eval runs; tool call failures are handled by this lightweight process run by a small central team. This means less work for them. Another good side effect is that, say you have a few agent teams, and one of the agent teams gives you some feedback that, hey, this tool doesn't really work very well, maybe you didn't describe it properly. The feedback from that agent team leads to MCP server improvements from your developers, and that leads to improvements to all your agents, because now they are all using the same MCP server with the improved tooling. That's very motivating for the people who are building MCP servers.
So it was nice to run the evals, but how do you make the eval scenarios to run in the first place? Running eval scenarios was never really the hard part; the hard part is making the eval scenarios. I just told you how to do the easy stuff, but not how to do the hard stuff. So let's go through, step by step, how you actually make a single eval scenario for an MCP server.
First we think, okay, if you are a Datadog power user, you should know how to search for logs; you should understand log search syntax. So you look at the documentation page and see, okay, we need to understand how to use tags, service tags, basic things. Okay: show me recent logs in a service named foo. Simple question, reasonable request, right? Can you make that an eval scenario? The unfortunate thing here is that if you call this, your actual answer is a whole blob of JSON, and if you put that into your evals, it's going to be highly variable. You don't want that. Actually, you can just ask for counts, because you're trying to test whether it understands the log search syntax. If it knows how to make a count query and gets back a specific number, chances are it actually understands that syntax, and changing from counts to lists is pretty trivial.
So can we make this an eval scenario? Not yet, because the word "recent" is ambiguous, so we need to make this more specific. "Recent" could mean the last five minutes, or it could mean something else; a different LLM call might decide it means different things, and you'd get a different answer. So this needs to be made unambiguous. Is this okay now? Still not, because "last hour" depends on what time you're calling it. So now you need to fix the time. There's this little thing called time travel, where you can pass a timestamp to the server and the server pretends that's the current time. And now, finally, this is actually unambiguous. You can, as a human, or with the help of a script, generate the answer, put it in the eval scenario, and no matter when you run it, you're going to get back the same answer as long as the tool calls are correct. And it also doesn't really matter if your LLM makes a call to a different tool, or if you change your tool surface; as long as you get back the same answer, it's fine, it's still going to pass.
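Here is a sketch of what one such answer-only scenario might look like, assuming a hypothetical run_agent() helper that drives a simple agent against the MCP server; the question, timestamp, and expected count are illustrative.

```python
# One answer-only eval scenario with a frozen "time travel" timestamp,
# so the expected answer stays stable no matter when the eval runs.
# run_agent() is a hypothetical helper; the values are illustrative.

scenario = {
    "question": "How many logs did service foo emit in the hour before "
                "2024-05-01T12:00:00Z?",
    "now": "2024-05-01T12:00:00Z",  # timestamp the server pretends is "now"
    "expected_answer": 1234,        # generated once by a script, not by hand
}

def run_scenario(scenario) -> bool:
    # Only the final answer is checked; which tools the agent calls,
    # and how, is entirely up to the agent.
    answer = run_agent(scenario["question"], now=scenario["now"])
    return answer == scenario["expected_answer"]
```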
So how many more evals do we need to make? Let's go back to our documentation. We have wildcards, we have full-text search, we have free text, we have all this special syntax, all these edge cases that we need to deal with. So we need maybe hundreds of eval scenarios just to cover log search. And then we have other products too: not only logs, but also metrics and traces. And then you don't just want to query one thing, you want to make queries across different products. How many evals do you need now? Maybe thousands. And that's going to be several days of non-stop grunt work just to create eval scenarios.
And what's worse is that if you work for a company that has data retention limits on your logs, then every time the data expires you have to repeat this whole process all over again. Because once the old data disappears and new data comes in, the "last hour" and your "now" time need to change, and the answers are all going to change. So you have to do this all over again. Is there a better way?
The good news is that you can actually generate those evals. You don't have to do this by hand. It's a bit unintuitive that this is actually possible, so I'll walk you through it. If you start with a natural language question and you try to get the answer, that's not easy; if you're just using a simple agent that tries to make a tool call, it's not easy. That's why we're trying to build agents in the first place. But if you start from a seed query, you kind of know what the answer should look like, and turning the seed query into the answer is easy: you just pass that query exactly into your API call, and you can make a script that just gets the answer.
And then, to go from a seed query to the question, it's not so straightforward, but what you can do is use a parser: parse the query, convert it into an AST, and then traverse the tree, mapping each node to a text template. That leads you to a natural language string that preserves the semantic meaning of that seed query. This does not involve an LLM; it's straightforward, and you can write Python to do it for you.
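A toy version of that idea, with a made-up two-tag grammar and text templates rather than the real log search parser:

```python
# Toy sketch: parse a seed query into nodes and map each node to a text
# template. The grammar and templates are simplified, made-up examples.

TEMPLATES = {
    "service": "in service {value}",
    "status": "with status {value}",
}

def query_to_question(query: str) -> str:
    # "service:foo status:error" -> [("service", "foo"), ("status", "error")]
    nodes = [tuple(part.split(":", 1)) for part in query.split()]
    clauses = " ".join(TEMPLATES[tag].format(value=value) for tag, value in nodes)
    return f"How many logs are there {clauses}?"

print(query_to_question("service:foo status:error"))
# -> How many logs are there in service foo with status error?
```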
How do you get a seed query in the first place? We just moved the problem to another place. So here is where the LLM, or the coding agent, comes in: with the help of your coding agent, your product documentation, and some elbow grease. The elbow grease is because your coding agent is going to look at the placeholder service names in the documentation and just fill in something like "api", which doesn't exist. You work for this company, you know what services actually exist, so you have to change, say, service:api to service:foo. You still have to do that, but you can generate a whole bunch of seed queries at once.
And with those seed queries, you repeat that process for each seed query to get a question-answer pair, and then zip them up to get labeled eval scenarios. These labeled eval scenarios you can put into a Python list and run them all locally.
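Putting the pieces together might look something like this, assuming the query_to_question() sketch above and a hypothetical query_to_answer() script that runs the seed query against the API at the frozen timestamp:

```python
# Zip seed queries into labeled eval scenarios. query_to_answer() is a
# hypothetical script that runs the query against the API at FROZEN_NOW.

FROZEN_NOW = "2024-05-01T12:00:00Z"

seed_queries = [
    "service:foo status:error",
    "service:foo status:warn",
    # ...generated by a coding agent from the docs page, then hand-corrected
]

eval_scenarios = [
    {
        "question": query_to_question(q),
        "now": FROZEN_NOW,
        "expected_answer": query_to_answer(q, now=FROZEN_NOW),
    }
    for q in seed_queries
]
```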
So we just went from one docs page to
200 something eval scenarios. That's
pretty nice.
But we now have a new problem.
Too many evals. Too many eval scenarios,
right? The first thing you would probably do is print them to your console log, because we're running it locally. But then, when you have 200 eval scenarios, and before your change you got 0.7 as your average score and after the change you got 0.71, was this a no-op, or did something improve while something else got worse? At a glance, it's not easy to see. But what we can do, going
back to the previous example, is use your LLM to also help you add some tags. When you used that process, you were referring to the product documentation, and the product documentation probably has section headers that tell you what capability each query refers to. So you can add a capability tag. Then you can pass these tags along to your instrumentation, into the spans that you send to LLM Observability or your favorite observability tool, and group them by capability. At a glance, you can see which capabilities you're doing well at and which you're not doing so well at. Here we can see that counting by CIDR has the lowest average metric score, so we should probably do something about that.
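A minimal sketch of that roll-up, with illustrative tags and scores; in practice the grouping happens on the LLM Observability dashboard, but the same aggregation is easy to do locally:

```python
# Group per-scenario scores by capability tag to spot weak capabilities.
# The tags and scores here are illustrative.

from collections import defaultdict

results = [
    {"capability": "wildcards", "score": 1.0},
    {"capability": "wildcards", "score": 0.5},
    {"capability": "cidr", "score": 0.0},
    # ...one entry per eval scenario from the local run
]

by_capability = defaultdict(list)
for r in results:
    by_capability[r["capability"]].append(r["score"])

for capability, scores in sorted(by_capability.items()):
    avg = sum(scores) / len(scores)
    print(f"{capability}: {avg:.2f} over {len(scores)} scenarios")
```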
So as a human, you look at this and think, okay, I probably need to tune the prompts for my log search syntax so that it knows how to deal with CIDR notation. Maybe I didn't think about that when I was writing the tool. Or you can ask Claude Code to look at it and help you optimize it.
Because we instrumented this and put it into LLM Observability, and LLM Observability, being a Datadog product, can also be exposed through our MCP server, you can point a coding agent like Claude Code at it and ask it for information. Let's start by just running the evaluation. We got a score of about 0.04, a low score. We want to improve that. So we can ask the coding agent to analyze the evaluation results with the help of the LLM Observability tool in our MCP server.
It analyzes the results, and now that this analysis is in the coding agent's context window, you can point the coding agent at the piece of code you want to optimize. In the next slide you're looking at a diff of the tool description for our MCP server's search logs tool: the red part is what's going to be deleted, and the green part is what's going to be added. You can see that it's now trying to deal with IP addresses. So you can optimize the code, restart the server, rerun the evaluation, and get a higher score. You can repeat this process, and that's the self-optimization loop.
To recap what I just showed you: we generated 200-something labeled eval scenarios. You can run the evals locally in two minutes. You can visualize the results on the dashboard. You can analyze the failure patterns on the dashboard, or you can do it with the help of a coding agent. And you can point the coding agent at the place you want to optimize, whether that's the business logic, the prompt, the tool description, or somewhere deeper in the back end, to optimize the code based on what we just put into the context window.
So as a recap: there's a huge search space that we, as MCP server developers, can optimize. And with all these automatically generated evals, evals can become a source of joy, not just for MCP server developers, but also for all the agent builders that build on top of our MCP server. But the first thing you have to do is consolidate your agents' tools into an MCP server.
Thank you.
[applause]
>> Thank you, Scott. Uh, we have time for two questions.
>> Very nice presentation, thank you very much. I'm wondering if you are also experimenting with fixing the issues that Datadog finds in a similar manner: using the MCP server, feeding that to Claude Code, and then fixing the issue automatically.
>> Wait, can you repeat? What are you feeding into the thing?
>> Uh, basically, how you are fixing the evals.
>> You can also use the Datadog MCP server to see the issues that Datadog found, and then automatically fix them by using the MCP server and Claude Code. Are you experimenting with this, or...
>> I definitely plan to. Yeah, some of this stuff is vibe-coded; it happened in a pretty short amount of time.
>> Okay, yeah, we are looking forward to it. We are users and it's just an amazing tool. Thanks.
>> Thank you.
>> Thank you.
>> We've got time for one more if anyone has a question.
>> Yeah, okay.
>> You mentioned at one point that you need the back end to return different types of responses so that agents can handle them better.
>> Okay.
>> Can you be more specific?
>> Ah, yes. So the thing is that our tool surface is not the same as our API surface. When you have an API, people expect it to be stable; people build code on it, so you don't change it very much. Every time you change the API, you break anyone who hasn't updated their binaries. But with MCP, you can change your tool descriptions all the time, because agents call list tools every time they initialize and restart. So it's more forgiving; you can change your tool surface more often. What we can do, then, is have a different interface compared to our API. For example, we can do some spelling correction, we can do some checking before the request reaches the API call, and if we think this is a clustering query instead of a list query, we can actually call the clustering API instead of the list API.
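A rough sketch of the kind of divergence between tool surface and API surface described in this answer; all names here (backend, known_service_names, looks_like_clustering_query) are hypothetical.

```python
# Sketch: the tool handler corrects a misspelled service name and routes to a
# clustering API when appropriate, instead of being a literal API wrapper.
# backend and looks_like_clustering_query are hypothetical.

import difflib

def search_logs_tool(query: str, service: str) -> dict:
    # Spelling correction before the request ever reaches the API.
    known = backend.known_service_names()
    if service not in known:
        close = difflib.get_close_matches(service, known, n=1)
        if close:
            service = close[0]

    # Route clustering-style questions to the clustering API instead of
    # always calling the list API.
    if looks_like_clustering_query(query):
        return backend.cluster_logs(query=query, service=service)
    return backend.list_logs(query=query, service=service)
```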
Agents often rely on external tools to help them accomplish their tasks, and external MCP servers are convenient for getting those tools with minimal code. Since the agent builder does not control the MCP server's interface, the responsibility falls on the MCP server developers to help the agents that connect to it make progress on their tasks. In this talk, Scott Yak, Senior Software Engineer at Datadog, shared lessons learned from building and iterating on Datadog's public MCP server, which is currently used by both internal and external agents. Attendees learned how to implement their MCP server in a way that improves the effectiveness of the agents that connect to it.