Good afternoon everyone.
If there's one message I would like you to take away from this talk, it's that evals can become a source of joy. Yes, if you consolidate your agents' tools into an MCP server. Okay, I'll explain more.
I'm Scott Yak from Datadog. What does Datadog do? Datadog is an observability platform. You send us
metrics, traces, logs, and this nice
telemetry data. We provide the tools to
help you figure out whether your service
is running as expected, whether there's
an outage, and how to fix it. So, we have a nice website. You can look at the metrics and check on your service; there's a nice dashboard to get an overall sense of how your website is doing, some traffic information. We have some really fancy tools to visualize how your service is running. So this is all very nice, but it's a bit intimidating
for newcomers, and we don't just want you to see what's going on with your service. We want you to be able to act on your service. So Datadog builds a bunch of agents, and we have learned a couple of lessons along the way. What we
learned is that in the process of going from demo-quality to production-ready agents, when you want to improve the scores so that you can actually accomplish your tasks, you have to deal with a whole bunch of failure modes: things like hallucination, output formatting, tool call failures, and so on. And the answer everyone is going to give you is that you need evals. And I agree with that. You definitely need evals,
but they're also a real pain, right? So I would like to help with that. Evals are still going to be painful, but at least we can take away some part of it: tool call failures. And that's where MCP servers can help.
Now, what are MCP servers? Without MCP servers, for every single agent that you build, you typically have to build the tools to connect to the back end. And what usually happens is that because each agent is doing different things, they need different tools, they call the tools differently, and they expect different things from the tools. The tool calls fail in different ways. And that means that each agent team has its own tool call failure evals, and that leads to a lot of duplicated effort.
What MCP servers allow you to do is consolidate all your tools into one server, so these tools can serve a whole bunch of different agent teams. But you could always have done that, right? So what's so special about MCP servers? Well, the nice thing is that MCP servers can also be remote. So not only are they a building block or stepping stone towards better agents, they also talk directly to customers. Your IDEs like Cursor can use your tools directly. Third-party agents like Claude Code can use your tools directly to build an agentic experience for the customer. What this means is that your tools are no longer just a stepping stone. They're themselves a product. And when you have a product, you want to come up with a good product experience.
And what we want to get out of this product is that for our users, when they connect to our MCP server, the agents become Datadog power users.
To give you a quick sense of what it's like to connect to our MCP server, here's the agent tab from Cursor. You might ask a question like, are there any HTTP errors in service MCP, blah blah blah. Cursor is going to talk to the LLM, give you some feedback, and then it makes a tool call to our search logs tool. It chooses search logs, the search logs tool gives back the results, and then it summarizes what it found: yes, I found some HTTP errors. Great. It is acting like a power user, and as a user you didn't really need to know Datadog log search syntax. That's good. So let's see what actually is going on in there, so we have a better sense of what our MCP server actually needs to do.
So here's what just happened. You have the user, you have the agent, here a Claude Code agent panel, you have the MCP server, our Datadog MCP server, and the back end that serves the search logs tool behind it. The agent has a really important job, which is to manage the context window. It starts with the system prompt, and it makes a call to the MCP server to list tools so that it can fill its context window with the tool descriptions. That way it knows what tools are available and how to call them. Then it receives a request from the user and adds that to the context window. It then calls the LLM to decide what to do next. It may decide to call a tool, based on the information from the tool descriptions, and then it makes a call to the MCP server.
The MCP server takes that request and decides what to do with it: maybe call one particular back end, maybe call another back end. The back end returns the results, and the MCP server does some other business logic, maybe some filtering, maybe some pagination, maybe it postprocesses some JSON into CSV, or error handling, before it passes the response back to the agent.
So now the MCP server gives the response back to the agent. The agent loads it into the context window, so it's part of the context window when it decides what to do next. It might decide that it needs to go back and ask more questions, so it keeps looping until it's complete, and when it's complete it gives you back a result, yes, I found something, and returns it to the user. That's the flow of the agentic system, the most basic agent loop.
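To make that flow concrete, here is a minimal sketch of the agent loop just described. The mcp_client and llm objects and their methods are hypothetical placeholders, not a specific SDK.

```python
# Minimal sketch of the basic agent loop described above.
# mcp_client and llm are hypothetical placeholders, not a real SDK.

def agent_loop(mcp_client, llm, system_prompt, user_request):
    # Fill the context window with the system prompt and the tool
    # descriptions returned by the MCP server's list-tools call.
    context = [system_prompt, mcp_client.list_tools()]

    # Add the user's request to the context window.
    context.append(user_request)

    while True:
        # Ask the LLM what to do next, given everything in the context window.
        step = llm.next_step(context)

        if step.is_tool_call:
            # The MCP server handles the tool call (talks to a back end,
            # filters, paginates, handles errors) and returns a response...
            result = mcp_client.call_tool(step.tool_name, step.arguments)
            # ...which is loaded back into the context window before the
            # next decision.
            context.append(result)
        else:
            # The loop is complete; return the final answer to the user.
            return step.final_answer
```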
So I often get this question: what exactly is there to do in an MCP server? Don't you just wrap the API and prompt-tune the tool descriptions and all that stuff? Yes, the tool descriptions are an important part; they tell the agent what tools are available and how to call them. But besides that, it's also important to note that the tool call response also gets fed into the agent's context window, and that affects what the agent will do next. We control the MCP server, and because we are in the same company, we can also talk to the people who work on the back end to influence what comes back in the tool call response. So that means we also want to optimize the tool implementations.
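As a rough illustration of what optimizing a tool implementation can look like, here is a sketch of a tool handler that post-processes the back-end response before it reaches the agent's context window. The backend_api client and the response fields are hypothetical.

```python
# Sketch of a search-logs tool handler that does more than wrap the API.
# backend_api and the response field names are hypothetical.

def search_logs(query: str, limit: int = 20) -> str:
    raw = backend_api.search_logs(query=query)
    if raw.get("error"):
        # Return a short, actionable error instead of a raw stack trace.
        return f"Log search failed: {raw['error']}. Check the query syntax."

    # Cap how much lands in the agent's context window, and post-process
    # verbose JSON into a compact CSV-like summary.
    lines = ["timestamp,service,status,message"]
    for log in raw["logs"][:limit]:
        lines.append(
            f'{log["timestamp"]},{log["service"]},'
            f'{log["status"]},{log["message"][:120]}'
        )
    return "\n".join(lines)
```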
And this is a massive search space. It's not just the text in the tool descriptions; it is the code base of your entire back end as well, including your MCP server. How do you optimize that? How do you know what to prioritize? For that, you need evals. And the thing is that evals for MCP servers are different, because there's a lot we don't control. When you're an agent, you control almost everything. When you are an MCP server, you control almost nothing. And that feeds into our eval philosophy. We don't optimize for any particular agent. We are agent-agnostic, and we only check the final result, so we don't care which tool it calls. I'll go into more detail about why we chose this as our philosophy for evals.
So why agent-agnostic? The reason for that is really helplessness. We don't know enough about which agent is going to use us, so we cannot actually optimize for any particular agent. A different way to look at it is that we want the MCP server to be ergonomic enough that any agent can use it well, not just the smart ones. The nice side effect is that you can actually prioritize simple agents, and what that means is that in our evals we don't need to use the expensive models. We use the cheaper ones, the faster ones. And that means our evals are fast and cheap to run. Nice bonus.
So why tool-agnostic? What often happens when people make tool call evals for agents is that when the agent reaches a particular step, they will say, okay, make sure you call this tool, and make sure you call it this way. I think there's a better approach, and the reason is that agents are creative: there are often a few ways to solve the same problem. For example, if you want to check which service went out of memory, there are probably three different tools in Datadog that can give you that answer. But if you force it, saying, okay, you have to call search logs here, and the agent goes ahead and tries something else, then the evals are going to fail needlessly, and you have too much stuff to add into your evals. It becomes brittle.
And the other thing is that agents are surprisingly resourceful. One thing I accidentally noticed when I was using Cursor with our MCP server is that once, when I misspelled a service name, it actually zoomed out. It added some wildcard tags and removed certain filters, so it was effectively zooming out, and then it looked at which service names are in your logs in the first place, found one that is actually similar enough to the service name you had, zoomed into that, and said, okay, we didn't find the exact thing you asked for, but this is probably what you want. If you are in some very tight agentic loop with very rigid evals, you're probably going to mark that as a failure. But if you're a human trying to do something and you mess things up once in a while, that's actually a nice experience. We don't want to erase that with overly rigid evals.
And the other thing is that because we are the team that develops the MCP server, we want to keep upgrading our tools. If we make it too hard to change our tools, then when we want to make big changes, big improvements like consolidating five tools into three, switching to a different query syntax, or adding one tool that just makes all tools better, overly rigid tool evals are going to prevent us from making those changes in a way where you can see the eval scores before and after. Because once you change your tool surface, you have to rewrite your evals, and you don't get an apples-to-apples comparison anymore. Your tool-specific evals become too brittle; they become change detectors, and you don't want that. You want to keep improving your MCP server so that when you connect to it the whole experience is just better; the user shouldn't have to think about the individual tools.
And the effect of that is that eval runs are fast. From the time you change your code to the time you run the evals and see the results in the dashboard takes only two minutes. We run it locally, which is nice. When you see it on the dashboard, it looks like this. We have it instrumented with Datadog LLM Observability, so you can see at a glance which tool calls failed and which succeeded, and then you can zoom into one of those scenarios and say, okay, show me the spans of the tool calls, the LLM calls, and everything else that happened. Each of these has a trace ID, so you can take the URL, copy it, put it into your group's Slack and say, hey, this thing failed, why did it fail, can you help me take a look?
It's a very nice debugging experience.
So it makes running evals actually fun. You change something, you rerun the evals, two minutes later you see the results, and you can share them with your team. You try something new that makes something work that didn't work before, and you can take that trace ID and share it with your team. Our evals have become integrated into the dev cycle of our MCP server. And now devs want to write evals: if they are trying to do something with the MCP server that challenges its existing capabilities, and the evals don't actually capture that new possibility, they want to write an eval so that they can see, hey, before this failed, now it passes, I did something great.
Now, how does this help agent builders? The nice side effect for other people is that tool call failures are now handled by a central team. The agent builders don't have to handle tool call failures with their big end-to-end eval runs; tool call failures are handled by this lightweight process run by a small central team. This means less work for them. Another good side effect is that, say you have a few agent teams, and one of the agent teams gives you some feedback that, hey, this tool doesn't really work very well, maybe you didn't describe it properly. The feedback from that agent team leads to MCP server improvements from your developers, and that leads to improvements to all your agents, because now they are all using the same MCP server with the improved tooling. That's very motivating for the people who are building MCP servers.
So it was nice to run the evals, but how do you make the eval scenarios to run in the first place? Running eval scenarios was never really the hard part; the hard part is making the eval scenarios. I just told you how to do the easy stuff, but not how to do the hard stuff. So let's go through, step by step, how you actually make a single eval scenario for an MCP server.
First we think, okay, if you are a Datadog power user, you should know how to search for logs; you should understand log search syntax. So you look at the documentation page and see, okay, we need to understand how to use tags, service tags, basic things. Okay: show me recent logs in a service named foo. Simple question, reasonable request, right? Can you make that an eval scenario? The unfortunate thing here is that if you call this, your actual answer is a whole blob of JSON, and if you put that into your evals, it's going to be highly variable. You don't want that. Actually, you can just ask for counts, because you're trying to test whether it understands the log search syntax. If it knows how to make a count query and gets back a specific number, chances are it actually understands that syntax, and changing from counts to lists is pretty trivial.
So can we make this an eval scenario? Not yet, because the word "recent" is ambiguous, so we need to make this more specific. "Recent" could mean the last five minutes, or it could mean something else; a different LLM call might decide it means different things, and you'd get a different answer. So this needs to be made unambiguous. Is this okay now? Still not, because "last hour" depends on what time you're calling it. So now you need to fix the time. There's this little thing called time travel, where you can pass a timestamp to the server and the server pretends that's the current time. And now, finally, this is actually unambiguous. You can, as a human, or with the help of a script, generate the answer, put it in the eval scenario, and no matter when you run it, you're going to get back the same answer as long as the tool calls are correct. And it also doesn't really matter if your LLM makes a call to a different tool, or if you change your tool surface; as long as you get back the same answer, it's fine, it's still going to pass.
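Here is a sketch of what one such answer-only scenario might look like, assuming a hypothetical run_agent() helper that drives a simple agent against the MCP server; the question, timestamp, and expected count are illustrative.

```python
# One answer-only eval scenario with a frozen "time travel" timestamp,
# so the expected answer stays stable no matter when the eval runs.
# run_agent() is a hypothetical helper; the values are illustrative.

scenario = {
    "question": "How many logs did service foo emit in the hour before "
                "2024-05-01T12:00:00Z?",
    "now": "2024-05-01T12:00:00Z",  # timestamp the server pretends is "now"
    "expected_answer": 1234,        # generated once by a script, not by hand
}

def run_scenario(scenario) -> bool:
    # Only the final answer is checked; which tools the agent calls,
    # and how, is entirely up to the agent.
    answer = run_agent(scenario["question"], now=scenario["now"])
    return answer == scenario["expected_answer"]
```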
So how many more evals do we need to make? Let's go back to our documentation. We have wildcards, we have full-text search, we have free text, we have all this special syntax, all these edge cases that we need to deal with. So we need maybe hundreds of eval scenarios just to cover log search. And then we have other products too: not only logs, but also metrics and traces. And then you don't just want to query one thing, you want to make queries across different products. How many evals do you need now? Maybe thousands. And that's going to be several days of non-stop grunt work just to create eval scenarios.
And what's worse is that if you work for a company that has data retention limits on your logs, then every time the data expires you have to repeat this whole process all over again. Because once the old data disappears and new data comes in, the "last hour" and your "now" time need to change, and the answers are all going to change. So you have to do this all over again. Is there a better way?
The good news is that you can actually generate those evals. You don't have to do this by hand. It's a bit unintuitive that this is actually possible, so I'll walk you through it. If you start with a natural language question and you try to get the answer, that's not easy; if you're just using a simple agent that tries to make a tool call, it's not easy. That's why we're trying to build agents in the first place. But if you start from a seed query, you kind of know what the answer should look like, and turning the seed query into the answer is easy: you just pass that query exactly into your API call, and you can make a script that just gets the answer.
And then, to go from a seed query to the question, it's not so straightforward, but what you can do is use a parser: parse the query, convert it into an AST, and then traverse the tree, mapping each node to a text template. That leads you to a natural language string that preserves the semantic meaning of that seed query. This does not involve an LLM; it's straightforward, and you can write Python to do it for you.
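A toy version of that idea, with a made-up two-tag grammar and text templates rather than the real log search parser:

```python
# Toy sketch: parse a seed query into nodes and map each node to a text
# template. The grammar and templates are simplified, made-up examples.

TEMPLATES = {
    "service": "in service {value}",
    "status": "with status {value}",
}

def query_to_question(query: str) -> str:
    # "service:foo status:error" -> [("service", "foo"), ("status", "error")]
    nodes = [tuple(part.split(":", 1)) for part in query.split()]
    clauses = " ".join(TEMPLATES[tag].format(value=value) for tag, value in nodes)
    return f"How many logs are there {clauses}?"

print(query_to_question("service:foo status:error"))
# -> How many logs are there in service foo with status error?
```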
How do you get a seed query in the first place? We just moved the problem to another place. So here is where the LLM, or the coding agent, comes in: with the help of your coding agent, your product documentation, and some elbow grease. The elbow grease is because your coding agent is going to look at the placeholder service names in the documentation and just fill in something like "api", which doesn't exist. You work for this company, you know what services actually exist, so you have to change, say, service:api to service:foo. You still have to do that, but you can generate a whole bunch of seed queries at once.
And with those seed queries, you repeat that process for each seed query to get a question-answer pair, and then zip them up to get labeled eval scenarios. These labeled eval scenarios you can put into a Python list and run them all locally.
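Putting the pieces together might look something like this, assuming the query_to_question() sketch above and a hypothetical query_to_answer() script that runs the seed query against the API at the frozen timestamp:

```python
# Zip seed queries into labeled eval scenarios. query_to_answer() is a
# hypothetical script that runs the query against the API at FROZEN_NOW.

FROZEN_NOW = "2024-05-01T12:00:00Z"

seed_queries = [
    "service:foo status:error",
    "service:foo status:warn",
    # ...generated by a coding agent from the docs page, then hand-corrected
]

eval_scenarios = [
    {
        "question": query_to_question(q),
        "now": FROZEN_NOW,
        "expected_answer": query_to_answer(q, now=FROZEN_NOW),
    }
    for q in seed_queries
]
```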
So we just went from one docs page to
200 something eval scenarios. That's
pretty nice.
But we now have a new problem.
Too many evals. Too many eval scenarios,
right? The first thing you would probably do is print them to your console log, because we're running it locally. But then, when you have 200 eval scenarios, and before your change you got 0.7 as your average score and after the change you got 0.71, was this a no-op, or did something improve while something else got worse? At a glance, it's not easy to see. But what we can do, going
back to the previous example, is use your LLM to also help you add some tags. When you used that process, you were referring to the product documentation, and the product documentation probably has section headers that tell you what capability each query refers to. So you can add a capability tag. Then you can pass these tags along to your instrumentation, into the spans that you send to LLM Observability or your favorite observability tool, and group them by capability. At a glance, you can see which capabilities you're doing well at and which you're not doing so well at. Here we can see that counting by CIDR has the lowest average metric score, so we should probably do something about that.
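A minimal sketch of that roll-up, with illustrative tags and scores; in practice the grouping happens on the LLM Observability dashboard, but the same aggregation is easy to do locally:

```python
# Group per-scenario scores by capability tag to spot weak capabilities.
# The tags and scores here are illustrative.

from collections import defaultdict

results = [
    {"capability": "wildcards", "score": 1.0},
    {"capability": "wildcards", "score": 0.5},
    {"capability": "cidr", "score": 0.0},
    # ...one entry per eval scenario from the local run
]

by_capability = defaultdict(list)
for r in results:
    by_capability[r["capability"]].append(r["score"])

for capability, scores in sorted(by_capability.items()):
    avg = sum(scores) / len(scores)
    print(f"{capability}: {avg:.2f} over {len(scores)} scenarios")
```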
So as a human, you look at this and think, okay, I probably need to tune the prompts for my log search syntax so that it knows how to deal with CIDR notation. Maybe I didn't think about that when I was writing the tool. Or you can ask Claude Code to look at it and help you optimize it.
Because we instrumented this and put it into LLM Observability, and LLM Observability, being a Datadog product, can also be exposed through our MCP server, you can point a coding agent like Claude Code at it and ask it for information. Let's start by just running the evaluation. We got a score of about 0.04, a low score. We want to improve that. So we can ask the coding agent to analyze the evaluation results with the help of the LLM Observability tool in our MCP server.
It analyzes the results, and now that this analysis is in the coding agent's context window, you can point the coding agent at the piece of code you want to optimize. In the next slide you're looking at a diff of the tool description for our MCP server's search logs tool: the red part is what's going to be deleted, and the green part is what's going to be added. You can see that it's now trying to deal with IP addresses. So you can optimize the code, restart the server, rerun the evaluation, and get a higher score. You can repeat this process, and that's the self-optimization loop.
To recap what I just showed you: we generated 200-something labeled eval scenarios. You can run the evals locally in two minutes. You can visualize the results on the dashboard. You can analyze the failure patterns on the dashboard, or you can do it with the help of a coding agent. And you can point the coding agent at the place you want to optimize, whether that's the business logic, the prompt, the tool description, or somewhere deeper in the back end, to optimize the code based on what we just put into the context window.
So as a recap: there's a huge search space that we, as MCP server developers, can optimize. And with all these automatically generated evals, evals can become a source of joy, not just for MCP server developers, but also for all the agent builders that build on top of our MCP server. But the first thing you have to do is consolidate your agents' tools into an MCP server.
Thank you.
[applause]
>> Thank you, Scott. Uh, we have time for two questions.
>> Very nice presentation, thank you very much. I'm wondering if you are also experimenting with fixing the issues that Datadog finds in a similar manner: using the MCP server, feeding that to Claude Code, and then fixing the issue automatically.
>> Wait, can you repeat? What are you feeding into the thing?
>> Uh, basically, how you are fixing the evals.
>> You can also use the Datadog MCP server to see the issues that Datadog found, and then automatically fix them by using the MCP server and Claude Code. Are you experimenting with this, or...
>> I definitely plan to. Yeah, some of this stuff is vibe-coded; it happened in a pretty short amount of time.
>> Okay, yeah, we are looking forward to it. We are users and it's just an amazing tool. Thanks.
>> Thank you.
>> Thank you.
>> We've got time for one more if anyone has a question.
>> Yeah, okay.
>> You mentioned at one point that you need the back end to return different types of responses so that agents can handle them better.
>> Okay.
>> Can you be more specific?
>> Ah, yes. So the thing is that our tool surface is not the same as our API surface. When you have an API, people expect it to be stable; people build code on it, so you don't change it very much. Every time you change the API, you break anyone who hasn't updated their binaries. But with MCP, you can change your tool descriptions all the time, because agents call list tools every time they initialize and restart. So it's more forgiving; you can change your tool surface more often. What we can do, then, is have a different interface compared to our API. For example, we can do some spelling correction, we can do some checking before the request reaches the API call, and if we think this is a clustering query instead of a list query, we can actually call the clustering API instead of the list API.
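A rough sketch of the kind of divergence between tool surface and API surface described in this answer; all names here (backend, known_service_names, looks_like_clustering_query) are hypothetical.

```python
# Sketch: the tool handler corrects a misspelled service name and routes to a
# clustering API when appropriate, instead of being a literal API wrapper.
# backend and looks_like_clustering_query are hypothetical.

import difflib

def search_logs_tool(query: str, service: str) -> dict:
    # Spelling correction before the request ever reaches the API.
    known = backend.known_service_names()
    if service not in known:
        close = difflib.get_close_matches(service, known, n=1)
        if close:
            service = close[0]

    # Route clustering-style questions to the clustering API instead of
    # always calling the list API.
    if looks_like_clustering_query(query):
        return backend.cluster_logs(query=query, service=service)
    return backend.list_logs(query=query, service=service)
```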
Agents often rely on external tools to help them accomplish their tasks, and external MCP servers are convenient for getting those tools with minimal code. Since the agent builder does not control the MCP server's interface, the responsibility falls on the MCP server developers to help the agents that connect to it make progress on their tasks. In this talk, Scott Yak, Senior Software Engineer at Datadog, shared lessons learned from building and iterating on Datadog's public MCP server, which is currently used by both internal and external agents. Attendees learned how to implement their MCP server in a way that improves the effectiveness of the agents that connect to it.