This week on The Agent Factory.
>> So LLM evaluation is like a school exam. You're testing knowledge with static Q&A. But agent evaluation is more like a job performance review.
>> And that's it. This is how you can cover the full loop with ADK web. As we saw, it's very fast.
>> So this end-to-end evaluation is one of the most important things in multi-agent systems.
Hi everyone, welcome to The Agent Factory. This is a podcast where we talk about agents and how to put them into production. I'm Ivan.
>> Hi, I'm Annie. It's so great to be here, and today's topic is one of the most complex but also most important ones: agent evaluation.
>> That's right. We've had a lot of questions about this one, so today we're going to try to answer some of them.
>> Exactly. We'll cover everything about what agent evaluation really means and what you should be measuring, how to do it using ADK and Vertex AI, and we'll even explain advanced topics like evaluation in multi-agent systems. So if you've ever wondered, "How do I know if my agent is actually working well?", this is the episode for you.
>> Exactly. And as always, if you find this useful, make sure to subscribe to the Google Cloud Tech channel and activate notifications so you hear about new episodes as soon as they're released. But now, let's get started.
>> Sure.
>> So, Annie, let's start with a very simple question: what is agent evaluation?
>> All right, that's a really good question. You can think of an agent as a really complex system. When you're evaluating it, it isn't just about checking whether the final answer looks right. With traditional software, tests are pretty straightforward, right? You check if A equals B. But with LLM agents it's very different. You have to look at the whole system.
>> Exactly. So we need to look at things like: can the agent actually complete the task? Does it make good decisions along the way? How well does it use tools, memory, and even reasoning?
>> Yeah. So it's not just "did it give the right answer or not." It's about system-level behavior: things like autonomy, multi-step reasoning, tool usage, and how the agent handles unpredictable situations.
>> Okay. So how is this different from traditional testing, or even from standard LLM evaluation?
>> That is a really great question. Traditional testing is deterministic: same input, same output, every single time. It's perfect for unit tests that check pass or fail. But agents don't work that way. Agents are variable. You can give the same prompt twice and end up with two completely different outcomes, right?
>> So you can't just write a pytest that asserts the agent's response equals an expected string, because you'll end up with a very flaky test suite. Instead of focusing on single outputs, we have to look at behavior over time. That explains the difference between traditional software testing and agent evaluation. But how is that different from LLM benchmarks like MMLU?
>> Yeah. So LLM evaluation and agent evaluation are very different. LLM evaluation is like a school exam: you're testing knowledge with static Q&A. But agent evaluation is more like a job performance review. We care about whether the agent can use tools correctly, recover from errors, or stay consistent across multiple turns. So even if you have a really great model, your agent can still perform badly because it might not call APIs properly, right?
>> I see. So the big takeaway here is that traditional testing is for deterministic logic, LLM evaluation is for general model capabilities, and agent evaluation is for system-level task effectiveness.
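To make that contrast concrete, here's a minimal sketch of what "behavior over time" can look like as a test. The run_agent and meets_goal helpers are hypothetical stand-ins for your own agent call and task-level check:

```python
def run_agent(prompt: str) -> str:
    # Placeholder: in real code this would invoke your agent (runner, API, etc.).
    return "Cheapest option found: flight AF1105, 89 EUR."

def meets_goal(answer: str) -> bool:
    # Hypothetical task-level check instead of an exact string comparison.
    return "flight" in answer.lower() and "eur" in answer.lower()

def test_agent_behavior_over_time():
    prompt = "Find the cheapest flight from Rome to Paris next Friday."
    answers = [run_agent(prompt) for _ in range(10)]
    success_rate = sum(meets_goal(a) for a in answers) / len(answers)
    # Agents are non-deterministic, so assert a success-rate threshold
    # rather than `answer == expected`, which would make the suite flaky.
    assert success_rate >= 0.8
```

The exact threshold and goal check are up to you; the point is that the assertion targets task success across runs, not one exact output.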
>> So when we talk about evaluating an agent, the short answer is you need to measure everything. You really need a full-stack approach. Let's work through what that really means.
>> Yeah, let's begin with the final outcome. Here you need metrics that tell you whether the agent actually achieved its goal.
>> That's where we want to look at the task success rate, for example. But that's not the whole story. We also care about the quality of the output. For example, was it coherent, accurate, and safe? Did it avoid hallucinations and biased responses? So it's not just "did it finish the job," it's "did it finish it well."
>> Exactly. And next is the agent's chain of thought: its planning and reasoning. So we ask: did it break the task into logical steps? Was the reasoning actually consistent, or did it just get lucky and land on the right answer by chance? Because if the reasoning isn't solid, it won't hold up across different runs.
>> That's right. And we also want to look at tool utilization, right? Did the agent pick the right tool, and did it pass the correct parameters?
>> And it's not just about the right tool and parameters: did it do it efficiently? Or did it, for example, waste time and money making redundant API calls? I've seen many agents get stuck in a loop, going through the same API calls and driving up costs.
>> Yeah, that's right. That's why we need to track it. And finally we have memory and context retention. Can the agent recall the right information when it actually needs it? And when new information conflicts with what it already knows, can the agent resolve that conflict correctly?
So again, as we said at the beginning, we really need to evaluate everything and find the right measures for all these aspects of the agent: outcome, reasoning, tools, and memory.
>> Right.
>> So now we know what to cover, but the problem is how to measure it. How do we actually evaluate an agent in practice?
>> Yeah, that's a great question. To evaluate agents we need to understand two big concepts: the first is offline evaluation and the second is online evaluation. Offline is what you do before production: you test against static golden datasets to catch regressions. Online is what happens after deployment: you're monitoring live user data, looking for drift, or even running A/B tests.
Got it. So this is the classic distinction between pre-production and post-production evaluation. Within both, there are some popular methods developers use to evaluate agents. The first one is ground truth checks. They're fast, cheap, and reliable. You answer questions like: is the generated JSON valid, or does the format match the schema? The limitation of these checks is that they don't capture the nuances of the agent's outputs; aspects like coherence and factuality are very hard to measure this way.
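As a rough illustration (the field names and schema here are made up), a ground truth check is often just a few lines of structural validation:

```python
import json

def ground_truth_check(agent_output: str) -> bool:
    """Fast, cheap structural check: valid JSON with the required fields.

    Catches format regressions, but says nothing about coherence or factuality.
    """
    try:
        payload = json.loads(agent_output)
    except json.JSONDecodeError:
        return False
    required = {"product_name", "description", "price"}  # hypothetical schema
    return isinstance(payload, dict) and required.issubset(payload)

print(ground_truth_check('{"product_name": "Headphones", "description": "Wireless", "price": 79.9}'))  # True
print(ground_truth_check("Sure! Here are the headphones you asked about."))  # False
```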
>> That makes so much sense, and that's why we have LLM as a judge. This is when you use a strong model to score subjective qualities, like how coherent the plan is. The great thing about this is that it scales really well, but how well it evaluates really depends on how the judge was trained.
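Here's a minimal LLM-as-a-judge sketch, assuming the google-genai SDK and a 1-to-5 rubric of our own invention; any LLM client would work, and real judge prompts are usually more detailed and parsed more robustly:

```python
from google import genai

client = genai.Client()  # assumes credentials are configured in your environment

JUDGE_PROMPT = """You are an evaluator. Rate the coherence of the agent's plan
from 1 (incoherent) to 5 (fully coherent). Reply with only the number.

Task: {task}
Agent plan: {plan}
"""

def judge_plan_coherence(task: str, plan: str) -> int:
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=JUDGE_PROMPT.format(task=task, plan=plan),
    )
    return int(response.text.strip())  # simplified parsing for the sketch

print(judge_plan_coherence(
    task="Refund a defective smart widget",
    plan="1. Verify purchase. 2. Check refund policy. 3. Issue refund. 4. Confirm with customer.",
))
```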
>> Yeah, exactly. And to complement these two methods, we have a third and last method that's pretty popular, and as you can imagine, it's human in the loop.
>> That's when domain experts review the agents' outputs. It's the most specialized method in some sense, but it's also the slowest and the most expensive.
>> I can imagine that. But do you just pick one of those, or do they work better together?
>> Well, as we just saw, all of them have pros and cons. So the best strategy here is combining them. In particular, what you want to create is what's called a calibration loop. You start with humans creating a small, high-quality golden dataset, and then you tune your LLM as a judge until its scores line up with human expectations.
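A sketch of that calibration loop, reusing the hypothetical judge_plan_coherence judge from the previous sketch and a tiny illustrative golden set:

```python
golden_set = [
    # Small, human-scored examples (contents illustrative).
    {"task": "Refund a defective widget", "plan": "Verify purchase, then refund.", "human_score": 5},
    {"task": "Track a lost package", "plan": "Apologize and end the chat.", "human_score": 1},
]

def judge_agreement(cases, judge_fn, tolerance: int = 1) -> float:
    """Fraction of cases where the judge lands within `tolerance` of the human score."""
    hits = 0
    for case in cases:
        hits += abs(judge_fn(case["task"], case["plan"]) - case["human_score"]) <= tolerance
    return hits / len(cases)

# If agreement is low, refine the judge prompt or rubric (or tune the judge) and re-run:
# print(judge_agreement(golden_set, judge_plan_coherence))
```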
>> Oh, so you get the best of both: the accuracy of humans and the scale of LLMs. That's the key. All right, so now I get the concept, but how do we actually put this into practice?
>> Yeah, so let's try to run an agent evaluation here. What we can do is use the Agent Development Kit, and in the Agent Development Kit there's ADK web, which will help us.
>> Right. ADK web is so handy for this offline inner-loop development. It's built for fast, interactive testing, exactly the kind of offline evaluation we just talked about. Let's walk through this five-step loop together.
>> Yeah, perfect. Let's use a very simple agent, a product research agent, as an example.
>> Right. You can see this agent has two tools. The first tool, get product details, is for customer-facing info, and the second tool, lookup product information, is for internal SKUs. But the issue with this agent is that its instruction wasn't clear enough.
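For context, a product research agent like this might be defined in ADK roughly as follows; the tool bodies, data, and the deliberately vague instruction are placeholders, not the demo's actual code:

```python
from google.adk.agents import Agent

def get_product_details(product_name: str) -> dict:
    """Returns the customer-facing description of a product."""
    return {"description": "Wireless over-ear headphones with noise cancellation."}

def lookup_product_information(product_name: str) -> dict:
    """Returns internal inventory data, including the SKU."""
    return {"sku": "HDPH-00123", "warehouse": "EU-3"}

root_agent = Agent(
    name="product_research_agent",
    model="gemini-2.0-flash",
    # Deliberately vague instruction: this is what lets the agent pick the wrong tool.
    instruction="Answer questions about our products using the available tools.",
    tools=[get_product_details, lookup_product_information],
)
```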
>> Yeah, so let's go through the first step. First of all, we want to test the agent and define the golden path.
>> So in the ADK web UI I can type, "Hey, tell me about the headphones," for example, and the agent comes back with an SKU. But that's internal data, and we don't want it.
>> So in the eval tab, I can create a test case and correct the expected response to the customer-facing product description I expect.
>> So this is how you create a golden dataset with ADK web.
>> Right. Once you have this dataset, let's go to the next step, which is evaluating the agent. As a developer, I select the test case and click the run evaluation button. And you can see from this demo that it fails right away. So we go to the next step, which is finding the root cause. I jump into the trace tab, and here's where the magic happens: it shows the agent's step-by-step reasoning process. From here I realize, oh okay, the agent chose the wrong tool, lookup product information.
>> Yeah. Once you have this kind of information, you can fix the agent. In this very simple case, the problem, as we said, was that the instruction was too ambiguous.
>> So I can open my agent code and rewrite the instruction to be more clear. For example, I can write something like: for customer-facing descriptions use get product details, and for internal data like SKUs use lookup product information.
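In code, that fix is roughly a one-line change to the instruction in the agent definition sketched above; the wording here is approximate, not the demo's exact prompt:

```python
instruction = (
    "For customer-facing product descriptions, use the get_product_details tool. "
    "For internal data such as SKUs, use the lookup_product_information tool."
)
```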
>> Okay. So now we've fixed it, and we need to go to the next step, which is validating the agent. The ADK server hot-reloads, I run the same test again, "Tell me about the headphones," and this time it gives me the correct customer description. So by rerunning the evaluation, the test passes.
>> And that's it. This is how you can cover the full loop with ADK web. As we saw, it's very fast and gives you this interactive debugging and testing.
>> Yeah, that is a solid workflow. But just to note, ADK doesn't stop there: it also supports running unit tests and integration tests. Here's the catch, though. The web UI loop is still manual and interactive, and it only covers a limited set of metrics. That's okay for development, but it doesn't scale. So this brings us to the next step: evaluating with Vertex AI.
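On that programmatic side, ADK ships an AgentEvaluator you can call from pytest so the same golden dataset runs in CI; the module name and file path below are placeholders, and the exact signature varies a bit across ADK versions:

```python
import pytest
from google.adk.evaluation.agent_evaluator import AgentEvaluator

@pytest.mark.asyncio  # requires the pytest-asyncio plugin
async def test_product_research_agent_golden_set():
    # Runs the eval cases created in ADK web against the agent module.
    await AgentEvaluator.evaluate(
        agent_module="product_research_agent",
        eval_dataset_file_path_or_dir="tests/fixtures/product_research.test.json",
    )
```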
>> Exactly. Anytime you need to test at scale, or evaluate at scale, and you want richer metrics, like LLM as a judge as we were saying, you need a production-grade platform, and that's where Vertex AI comes in. With Vertex AI you can take your eval sets and run them through the GenAI Evaluation Service. It's designed to handle this complex, qualitative evaluation for your agents at scale, and it produces evaluation results that you can also use to build dashboards for your agents in production.
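A sketch of what that can look like with the Vertex AI GenAI Evaluation Service in Python; the project ID, experiment name, and one-row dataset are placeholders:

```python
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

vertexai.init(project="your-project-id", location="us-central1")

eval_dataset = pd.DataFrame({
    "prompt": ["Tell me about the headphones."],
    "response": ["Wireless over-ear headphones with noise cancellation."],
    "reference": ["Wireless over-ear headphones with active noise cancellation."],
})

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.COHERENCE,  # LLM-as-a-judge metric
        "exact_match",                                      # simple ground-truth metric
    ],
    experiment="product-research-agent-eval",
)
print(eval_task.evaluate().summary_metrics)
```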
>> Yeah. So basically ADK is for the fast inner loop during development, and Vertex AI is the production-scale outer loop.
>> Yeah, you got it. And of course, as you can imagine, Vertex AI is just one option. There are many platforms out there that you can use to evaluate your agents.
>> Yeah. So we've worked through how to evaluate agents with ADK and Vertex AI. But here's the problem: in both cases you need a dataset. Honestly, that's not always available, and even when a dataset is available, it can be very expensive or very hard to create. That's what we call the cold start problem.
>> Exactly. And the way we handle this is with something called synthetic data generation. Basically, we use an LLM to create the dataset for us. We built a data generation pipeline together with Weights & Biases some time ago, and we can explain the generic process here.
>> Yeah, sure. You can think of this as a four-step recipe. The first step is you ask an LLM to generate realistic user tasks. Next, you have it act like an expert agent and produce the perfect step-by-step solution. Third, if you want, you can bring in a weaker model to try the same tasks, which gives you a bunch of imperfect attempts. And finally, you use an LLM as a judge to compare those imperfect attempts against the perfect solutions and score them automatically.
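Here's a rough sketch of that four-step recipe, again assuming the google-genai SDK; the model names and prompts are illustrative, and this is not the exact pipeline we built with Weights & Biases:

```python
from google import genai

client = genai.Client()

def ask(model: str, prompt: str) -> str:
    return client.models.generate_content(model=model, contents=prompt).text

# 1) Generate realistic user tasks.
tasks = ask("gemini-2.0-flash",
            "Generate 5 realistic customer-support tasks, one per line.").splitlines()

dataset = []
for task in tasks:
    # 2) A strong model acts as the expert agent and produces the ideal solution.
    golden = ask("gemini-2.0-flash", f"As an expert support agent, solve step by step: {task}")
    # 3) A weaker model attempts the same task, giving imperfect attempts.
    attempt = ask("gemini-2.0-flash-lite", f"Solve step by step: {task}")
    # 4) An LLM judge scores the attempt against the golden solution.
    score = ask("gemini-2.0-flash",
                "Score the attempt against the reference from 1 to 5. Reply with only the number.\n"
                f"Reference:\n{golden}\n\nAttempt:\n{attempt}")
    dataset.append({"task": task, "golden": golden, "attempt": attempt, "score": score.strip()})
```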
>> Yeah, that's nice. So we've got a way to build our evaluation data. But I know developers may ask: now that we have this data, how do we actually use it to design tests at scale?
>> That is a really great question. That's where the three-tier testing strategy comes in.
>> Yeah, tell me more about this.
>> Sure. The first tier is unit tests. Just like the product research agent example we had before, you're testing one small piece in isolation. Tier two is integration tests. It's like taking the whole car for a test drive: you're not just checking a single component, you're checking whether the whole multi-step journey works the way it should. And tier three is end-to-end human review and multi-agent testing. This is the final sanity check. It's also where you bring in multiple agents and feed results back into your human-in-the-loop calibration loop.
>> That's a really interesting framework, and I'd really like to see it in practice. Maybe we can have a future episode on this.
>> Sure.
>> But for now, we've pretty much covered the A to Z of evaluating a single agent, which leaves one big question: what happens when you've got multiple agents working together?
>> Yeah. So here's the thing. When we move into multi-agent systems for really complex tasks, our whole approach to evaluation has to evolve.
>> Yeah. The issue is that judging agents in isolation doesn't tell you much about the system's overall performance. That's especially true if you start using new frameworks to build multi-agent systems, like the Agent2Agent protocol, which is all about agents discovering and talking to each other more easily.
>> Right, exactly. So in a multi-agent setup you don't just care about reasoning or tool use. You care about whether the whole system gets the job done.
>> Sure. To explain why this is relevant, let's have a quick example. Imagine we have an agent A, which handles customer support, and an agent B, which handles the refund and replacement process.
>> Yeah.
>> So a customer comes to agent A and says, "I bought a smart widget last week and it doesn't turn on, so I'd really like a refund or a replacement."
>> Right. So what happens next is agent A kicks things off. It runs its greeting tool, checks the customer info, looks up the purchase history, and confirms the order. But here's the key: it can't actually process a refund. Instead, its job is to hand everything over to agent B, including all the customer details and product information.
>> Yeah. And then agent B takes it from there: it reviews the case and uses specialized tools for refunds or replacements to finish the request.
>> But here's the problem: if you look at agent A in isolation, depending on the metrics you use, a task completion score, for example, would be zero, because it didn't solve the problem itself. But in reality it succeeded 100%, because its role was to hand off to agent B.
>> Yeah, that's right. And if agent B successfully processes the refund, that's great. But what if agent A passed along the wrong information? The system as a whole will still fail, right? So that's why single-agent metrics can be totally misleading.
>> Yeah. So what really matters here is: can the agents hand off smoothly, share context, and keep latency and cost reasonable across the whole journey?
>> Exactly. So this end-to-end evaluation is one of the most important things in multi-agent systems, which means evaluation itself becomes part of the design, and we may even need agents that can emit structured data specifically for evaluation purposes.
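One way to picture "evaluation as part of the design": the agents emit structured trace events, and the evaluator scores the end-to-end journey rather than each agent alone. The event schema below is an assumption for illustration, not an A2A or ADK format:

```python
from dataclasses import dataclass

@dataclass
class TraceEvent:
    agent: str   # e.g. "agent_a" (support) or "agent_b" (refunds)
    kind: str    # "tool_call", "handoff", "final_response", ...
    data: dict

def end_to_end_success(trace: list[TraceEvent], expected_order_id: str) -> bool:
    """System-level check: the handoff must carry the right context AND
    the downstream agent must actually complete the refund."""
    handoff_ok = any(
        e.kind == "handoff" and e.data.get("order_id") == expected_order_id
        for e in trace
    )
    refund_ok = any(
        e.agent == "agent_b" and e.kind == "tool_call" and e.data.get("tool") == "process_refund"
        for e in trace
    )
    return handoff_ok and refund_ok
```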
>> Yeah, I like that idea, and I kind of see multi-agent evaluation like network analysis, where you look at the interactions and not just the outcomes. I'd really like to spend some time looking more into that.
>> Sure, totally. So it's not just about handoffs, it's also about collaboration, communication efficiency, and conflict resolution. And this is just one of the open questions in this whole agent evaluation world.
>> Yeah. Besides that, there are actually many others. For example, there's the scalability-versus-cost trade-off.
>> As we were saying, human evaluation is still a valid option, but at the same time it's slow and expensive. You can use LLM as a judge, which is faster and more scalable, but to align it with human expectations you need to tune it. So there we go: the trade-off between cost and performance.
>> I've also heard about benchmark integrity being a problem. If test questions leak into model training data, the scores don't mean much anymore. And once models start acing a benchmark, we have to build even harder ones.
>> Yeah, exactly. And another one, the last one I want to mention, is evaluating subjective attributes: things like creativity, productivity, even humor. How do you measure those in your agent? Imagine an agent that produces images: how do you measure how good it is at producing them? These kinds of questions still don't have a clear answer yet.
>> Yeah, those are really tough challenges, but I think they're also very exciting ones, and I can't wait to see new evaluation frameworks address them. But for now, I think that's a perfect place to wrap things up.
>> Yeah, I think so. We really covered a lot today on agent evaluation. We kicked things off with the basics, like what agent evaluation actually is and how it's different from traditional testing as well as LLM evaluation, and we broke down the full-stack approach: looking at outcome, reasoning, tools, and memory.
>> Yeah, and after that we also walked through the concepts of ground truth checks, LLM as a judge, and human in the loop, all the way down to what's next with multi-agent handoffs and the challenges of benchmarking.
>> Yeah, that's a lot of content about evaluation. I hope you now have a better idea of how to evaluate your agent, and we'll also share links to the resources we just discussed in the show notes.
>> Yeah, that's right. Thank you so much for hanging out with us on The Agent Factory. And once more, if you enjoyed today's content and want to support us, make sure to subscribe to the Google Cloud Tech channel, activate notifications, like, comment, and share with everyone.
>> Yeah. And until next time, I'm Ivan.
>> I'm Annie. Powering down.
[Music]
Learn how to effectively evaluate your AI agent and ensure it performs reliably in production. This episode of The Agent Factory is your definitive guide on agent evaluation, showing you how to go from local testing with the Agent Development Kit (ADK) to large-scale, enterprise-grade evaluation using Vertex AI. We break down how to implement a full-stack agent evaluation strategy, including how to use ADK for fast debugging and golden dataset creation, and how Vertex AI's GenAI Evaluation service scales your testing with the LLM-as-a-judge approach. Don't launch an agent you can't trust: watch to learn how to measure outcome, reasoning, tool use, and memory.

Want to build production-ready agents? Don't miss an episode!

In this episode you'll learn:
1️⃣ How to evaluate the agent's system-level behavior, not just its output.
2️⃣ The 5-step inner-loop workflow for testing agents with ADK (Agent Development Kit).
3️⃣ How to use Vertex AI for production-scale, qualitative agent evaluation.
4️⃣ The unique challenges of testing and evaluating multi-agent systems (A2A).
5️⃣ Techniques for generating synthetic data to solve the evaluation cold start problem.

About The Agent Factory:
"The Agent Factory" is a video-first technical podcast for developers, by developers, focused on building production-ready AI agents. We explore how to design, build, deploy, and manage agents that bring real value.

🔗 Resources & links mentioned:
➖ Google's Agent Development Kit (ADK) evaluation guide → https://goo.gle/3KshHIu
➖ Google's Agent Development Kit (ADK) → https://goo.gle/3Kq6Lex
➖ Vertex AI GenAI Evaluation Service → https://goo.gle/3ICTMpe
➖ How to evaluate generated answers from RAG at scale on Vertex AI → https://goo.gle/4o1oh7p
➖ How to evaluate LLMs with custom criteria using Vertex AI AutoSxS → https://goo.gle/46GfMYg, https://goo.gle/3IOMjDt

Subscribe to The Agent Factory → https://www.youtube.com/playlist?list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs
🔔 Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech

#AgentEvaluation #EvaluateTheAgent #ADK #VertexAI #AIAgents #AI #Payments

Speakers: Annie Wang, Ivan Nardini
Products Mentioned: ADK, Vertex AI, A2A