This week on The Agent Factory.
>> So LLM evaluation is like a school exam. You're testing knowledge with static Q&A. But agent evaluation is more like a job performance review.
>> And that's it. This is how you can cover the full loop with ADK web. As we saw, it's very fast.
>> So this end-to-end evaluation is one of the most important things in multi-agent systems.
Hi everyone, welcome to The Agent Factory. This is a podcast where we talk about agents and how to put them into production. I'm Ivan.
>> Hi, I'm Annie. It's so great to be here, and today's topic is one of the most complex but also most important ones: agent evaluation.
>> That's right. We've had a lot of questions about this one, so today we're going to try to answer some of them.
>> Exactly. We'll cover everything about what agent evaluation really means and what you should be measuring, how to do it using ADK and Vertex AI, and we'll even explain advanced topics like evaluation in multi-agent systems. So if you've ever wondered, "How do I know if my agent is actually working well?", this is the episode for you.
>> Exactly. And as always, if you find this useful, make sure to subscribe to the Google Cloud Tech channel and activate notifications so you hear about new episodes as soon as they're released. But now, let's get started.
>> Sure.
>> So, Annie, let's start with a very simple question: what is agent evaluation?
>> All right, that's a really good question. You can think of an agent as a really complex system. When you're evaluating it, it isn't just about checking whether the final answer looks right. With traditional software, tests are pretty straightforward, right? You check if A equals B. But with LLM agents it's very different. You have to look at the whole system.
>> Exactly. So we need to look at things like: can the agent actually complete the task? Does it make good decisions along the way? How well does it use tools, memory, and even reasoning?
>> Yeah. So it's not just "did it give the right answer or not." It's about system-level behavior: things like autonomy, multi-step reasoning, tool usage, and how the agent handles unpredictable situations.
>> Okay. So how is this different from traditional testing, or even from standard LLM evaluation?
>> That is a really great question. Traditional testing is deterministic: same input, same output, every single time. It's perfect for unit tests that check pass or fail. But agents don't work that way. Agents are variable. You can give the same prompt twice and end up with two completely different outcomes, right?
>> So you can't just write a pytest that asserts the agent's response equals an expected string, because you'll end up with a very flaky test suite. Instead of focusing on single outputs, we have to look at behavior over time. That explains the difference between traditional software testing and agent evaluation. But how is that different from LLM benchmarks like MMLU?
>> Yeah. So LLM evaluation and agent evaluation are very different. LLM evaluation is like a school exam: you're testing knowledge with static Q&A. But agent evaluation is more like a job performance review. We care about whether the agent can use tools correctly, recover from errors, or stay consistent across multiple turns. So even if you have a really great model, your agent can still perform badly because it might not call APIs properly, right?
>> I see. So the big takeaway here is that traditional testing is for deterministic logic, LLM evaluation is for general model capabilities, and agent evaluation is for system-level task effectiveness.
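To make that contrast concrete, here's a minimal sketch of what "behavior over time" can look like as a test. The run_agent and meets_goal helpers are hypothetical stand-ins for your own agent call and task-level check:

```python
def run_agent(prompt: str) -> str:
    # Placeholder: in real code this would invoke your agent (runner, API, etc.).
    return "Cheapest option found: flight AF1105, 89 EUR."

def meets_goal(answer: str) -> bool:
    # Hypothetical task-level check instead of an exact string comparison.
    return "flight" in answer.lower() and "eur" in answer.lower()

def test_agent_behavior_over_time():
    prompt = "Find the cheapest flight from Rome to Paris next Friday."
    answers = [run_agent(prompt) for _ in range(10)]
    success_rate = sum(meets_goal(a) for a in answers) / len(answers)
    # Agents are non-deterministic, so assert a success-rate threshold
    # rather than `answer == expected`, which would make the suite flaky.
    assert success_rate >= 0.8
```

The exact threshold and goal check are up to you; the point is that the assertion targets task success across runs, not one exact output.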
>> So when we talk about evaluating an agent, the short answer is you need to measure everything. You really need a full-stack approach. Let's work through what that really means.
>> Yeah, let's begin with the final outcome. Here you need metrics that tell you whether the agent actually achieved its goal.
>> That's where we want to look at the task success rate, for example. But that's not the whole story. We also care about the quality of the output. For example, was it coherent, accurate, and safe? Did it avoid hallucinations and biased responses? So it's not just "did it finish the job," it's "did it finish it well."
>> Exactly. And next is the agent's chain of thought: its planning and reasoning. So we ask: did it break the task into logical steps? Was the reasoning actually consistent, or did it just get lucky and land on the right answer by chance? Because if the reasoning isn't solid, it won't hold up across different runs.
>> That's right. And we also want to look at tool utilization, right? Did the agent pick the right tool, and did it pass the correct parameters?
>> And it's not just about the right tool and parameters: did it do it efficiently? Or did it, for example, waste time and money making redundant API calls? I've seen many agents get stuck in a loop, going through the same API calls and driving up costs.
>> Yeah, that's right. That's why we need to track it. And finally we have memory and context retention. Can the agent recall the right information when it actually needs it? And when new information conflicts with what it already knows, can the agent resolve that conflict correctly?
So again, as we said at the beginning, we really need to evaluate everything and find the right measures for all these aspects of the agent: outcome, reasoning, tools, and memory.
>> Right.
>> So now we know what to cover, but the problem is how to measure it. How do we actually evaluate an agent in practice?
>> Yeah, that's a great question. To evaluate agents we need to understand two big concepts: the first is offline evaluation and the second is online evaluation. Offline is what you do before production: you test against static golden datasets to catch regressions. Online is what happens after deployment: you're monitoring live user data, looking for drift, or even running A/B tests.
Got it. So this is the classic distinction between pre-production and post-production evaluation. Within both, there are some popular methods developers use to evaluate agents. The first one is ground truth checks. They're fast, cheap, and reliable. You answer questions like: is the generated JSON valid, or does the format match the schema? The limitation of these checks is that they don't capture the nuances of the agent's outputs; aspects like coherence and factuality are very hard to measure this way.
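As a rough illustration (the field names and schema here are made up), a ground truth check is often just a few lines of structural validation:

```python
import json

def ground_truth_check(agent_output: str) -> bool:
    """Fast, cheap structural check: valid JSON with the required fields.

    Catches format regressions, but says nothing about coherence or factuality.
    """
    try:
        payload = json.loads(agent_output)
    except json.JSONDecodeError:
        return False
    required = {"product_name", "description", "price"}  # hypothetical schema
    return isinstance(payload, dict) and required.issubset(payload)

print(ground_truth_check('{"product_name": "Headphones", "description": "Wireless", "price": 79.9}'))  # True
print(ground_truth_check("Sure! Here are the headphones you asked about."))  # False
```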
>> That makes so much sense, and that's why we have LLM as a judge. This is when you use a strong model to score subjective qualities, like how coherent the plan is. The great thing about this is that it scales really well, but how well it evaluates really depends on how the judge was trained.
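Here's a minimal LLM-as-a-judge sketch, assuming the google-genai SDK and a 1-to-5 rubric of our own invention; any LLM client would work, and real judge prompts are usually more detailed and parsed more robustly:

```python
from google import genai

client = genai.Client()  # assumes credentials are configured in your environment

JUDGE_PROMPT = """You are an evaluator. Rate the coherence of the agent's plan
from 1 (incoherent) to 5 (fully coherent). Reply with only the number.

Task: {task}
Agent plan: {plan}
"""

def judge_plan_coherence(task: str, plan: str) -> int:
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=JUDGE_PROMPT.format(task=task, plan=plan),
    )
    return int(response.text.strip())  # simplified parsing for the sketch

print(judge_plan_coherence(
    task="Refund a defective smart widget",
    plan="1. Verify purchase. 2. Check refund policy. 3. Issue refund. 4. Confirm with customer.",
))
```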
>> Yeah, exactly. And to complement these two methods, we have a third and last method that's pretty popular, and as you can imagine, it's human in the loop.
>> That's when domain experts review the agents' outputs. It's the most specialized method in some sense, but it's also the slowest and the most expensive.
>> I can imagine that. But do you just pick one of those, or do they work better together?
>> Well, as we just saw, all of them have pros and cons. So the best strategy here is combining them. In particular, what you want to create is what's called a calibration loop. You start with humans creating a small, high-quality golden dataset, and then you tune your LLM as a judge until its scores line up with human expectations.
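A sketch of that calibration loop, reusing the hypothetical judge_plan_coherence judge from the previous sketch and a tiny illustrative golden set:

```python
golden_set = [
    # Small, human-scored examples (contents illustrative).
    {"task": "Refund a defective widget", "plan": "Verify purchase, then refund.", "human_score": 5},
    {"task": "Track a lost package", "plan": "Apologize and end the chat.", "human_score": 1},
]

def judge_agreement(cases, judge_fn, tolerance: int = 1) -> float:
    """Fraction of cases where the judge lands within `tolerance` of the human score."""
    hits = 0
    for case in cases:
        hits += abs(judge_fn(case["task"], case["plan"]) - case["human_score"]) <= tolerance
    return hits / len(cases)

# If agreement is low, refine the judge prompt or rubric (or tune the judge) and re-run:
# print(judge_agreement(golden_set, judge_plan_coherence))
```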
>> Oh, so you get the best of both: the accuracy of humans and the scale of LLMs. That's the key. All right, so now I get the concept, but how do we actually put this into practice?
>> Yeah, so let's try to run an agent evaluation here. What we can do is use the Agent Development Kit, and in the Agent Development Kit there's ADK web, which will help us.
>> Right. ADK web is so handy for this offline inner-loop development. It's built for fast, interactive testing, exactly the kind of offline evaluation we just talked about. Let's walk through this five-step loop together.
>> Yeah, perfect. Let's use a very simple agent, a product research agent, as an example.
>> Right. You can see this agent has two tools. The first tool, get product details, is for customer-facing info, and the second tool, lookup product information, is for internal SKUs. But the issue with this agent is that its instruction wasn't clear enough.
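For context, a product research agent like this might be defined in ADK roughly as follows; the tool bodies, data, and the deliberately vague instruction are placeholders, not the demo's actual code:

```python
from google.adk.agents import Agent

def get_product_details(product_name: str) -> dict:
    """Returns the customer-facing description of a product."""
    return {"description": "Wireless over-ear headphones with noise cancellation."}

def lookup_product_information(product_name: str) -> dict:
    """Returns internal inventory data, including the SKU."""
    return {"sku": "HDPH-00123", "warehouse": "EU-3"}

root_agent = Agent(
    name="product_research_agent",
    model="gemini-2.0-flash",
    # Deliberately vague instruction: this is what lets the agent pick the wrong tool.
    instruction="Answer questions about our products using the available tools.",
    tools=[get_product_details, lookup_product_information],
)
```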
>> Yeah, so let's go through the first step. First of all, we want to test the agent and define the golden path.
>> So in the ADK web UI I can type, "Hey, tell me about the headphones," for example, and the agent comes back with an SKU. But that's internal data, and we don't want it.
>> So in the eval tab, I can create a test case and correct the expected response to the customer-facing product description I expect.
>> So this is how you create a golden dataset with ADK web.
>> Right. Once you have this dataset, let's go to the next step, which is evaluating the agent. As a developer, I select the test case and click the run evaluation button. And you can see from this demo that it fails right away. So we go to the next step, which is finding the root cause. I jump into the trace tab, and here's where the magic happens: it shows the agent's step-by-step reasoning process. From here I realize, oh okay, the agent chose the wrong tool, lookup product information.
>> Yeah. Once you have this kind of information, you can fix the agent. In this very simple case, the problem, as we said, was that the instruction was too ambiguous.
>> So I can open my agent code and rewrite the instruction to be more clear. For example, I can write something like: for customer-facing descriptions use get product details, and for internal data like SKUs use lookup product information.
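In code, that fix is roughly a one-line change to the instruction in the agent definition sketched above; the wording here is approximate, not the demo's exact prompt:

```python
instruction = (
    "For customer-facing product descriptions, use the get_product_details tool. "
    "For internal data such as SKUs, use the lookup_product_information tool."
)
```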
>> Okay. So now we've fixed it, and we need to go to the next step, which is validating the agent. The ADK server hot-reloads, I run the same test again, "Tell me about the headphones," and this time it gives me the correct customer description. So by rerunning the evaluation, the test passes.
>> And that's it. This is how you can cover the full loop with ADK web. As we saw, it's very fast and gives you this interactive debugging and testing.
>> Yeah, that is a solid workflow. But just to note, ADK doesn't stop there: it also supports running unit tests and integration tests. Here's the catch, though. The web UI loop is still manual and interactive, and it only covers a limited set of metrics. That's okay for development, but it doesn't scale. So this brings us to the next step: evaluating with Vertex AI.
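On that programmatic side, ADK ships an AgentEvaluator you can call from pytest so the same golden dataset runs in CI; the module name and file path below are placeholders, and the exact signature varies a bit across ADK versions:

```python
import pytest
from google.adk.evaluation.agent_evaluator import AgentEvaluator

@pytest.mark.asyncio  # requires the pytest-asyncio plugin
async def test_product_research_agent_golden_set():
    # Runs the eval cases created in ADK web against the agent module.
    await AgentEvaluator.evaluate(
        agent_module="product_research_agent",
        eval_dataset_file_path_or_dir="tests/fixtures/product_research.test.json",
    )
```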
>> Exactly. Anytime you need to test at scale, or evaluate at scale, and you want richer metrics, like LLM as a judge as we were saying, you need a production-grade platform, and that's where Vertex AI comes in. With Vertex AI you can take your eval sets and run them through the GenAI Evaluation Service. It's designed to handle this complex, qualitative evaluation for your agents at scale, and it produces evaluation results that you can also use to build dashboards for your agents in production.
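A sketch of what that can look like with the Vertex AI GenAI Evaluation Service in Python; the project ID, experiment name, and one-row dataset are placeholders:

```python
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples

vertexai.init(project="your-project-id", location="us-central1")

eval_dataset = pd.DataFrame({
    "prompt": ["Tell me about the headphones."],
    "response": ["Wireless over-ear headphones with noise cancellation."],
    "reference": ["Wireless over-ear headphones with active noise cancellation."],
})

eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=[
        MetricPromptTemplateExamples.Pointwise.COHERENCE,  # LLM-as-a-judge metric
        "exact_match",                                      # simple ground-truth metric
    ],
    experiment="product-research-agent-eval",
)
print(eval_task.evaluate().summary_metrics)
```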
>> Yeah. So basically ADK is for the fast inner loop during development, and Vertex AI is the production-scale outer loop.
>> Yeah, you got it. And of course, as you can imagine, Vertex AI is just one option. There are many platforms out there that you can use to evaluate your agents.
>> Yeah. So we've worked through how to evaluate agents with ADK and Vertex AI. But here's the problem: in both cases you need a dataset. Honestly, that's not always available, and even when a dataset is available, it can be very expensive or very hard to create. That's what we call the cold start problem.
>> Exactly. And the way we handle this is with something called synthetic data generation. Basically, we use an LLM to create the dataset for us. We built a data generation pipeline together with Weights & Biases some time ago, and we can explain the generic process here.
>> Yeah, sure. You can think of this as a four-step recipe. The first step is you ask an LLM to generate realistic user tasks. Next, you have it act like an expert agent and produce the perfect step-by-step solution. Third, if you want, you can bring in a weaker model to try the same tasks, which gives you a bunch of imperfect attempts. And finally, you use an LLM as a judge to compare those imperfect attempts against the perfect solutions and score them automatically.
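Here's a rough sketch of that four-step recipe, again assuming the google-genai SDK; the model names and prompts are illustrative, and this is not the exact pipeline we built with Weights & Biases:

```python
from google import genai

client = genai.Client()

def ask(model: str, prompt: str) -> str:
    return client.models.generate_content(model=model, contents=prompt).text

# 1) Generate realistic user tasks.
tasks = ask("gemini-2.0-flash",
            "Generate 5 realistic customer-support tasks, one per line.").splitlines()

dataset = []
for task in tasks:
    # 2) A strong model acts as the expert agent and produces the ideal solution.
    golden = ask("gemini-2.0-flash", f"As an expert support agent, solve step by step: {task}")
    # 3) A weaker model attempts the same task, giving imperfect attempts.
    attempt = ask("gemini-2.0-flash-lite", f"Solve step by step: {task}")
    # 4) An LLM judge scores the attempt against the golden solution.
    score = ask("gemini-2.0-flash",
                "Score the attempt against the reference from 1 to 5. Reply with only the number.\n"
                f"Reference:\n{golden}\n\nAttempt:\n{attempt}")
    dataset.append({"task": task, "golden": golden, "attempt": attempt, "score": score.strip()})
```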
>> Yeah, that's nice. So we've got a way to build our evaluation data. But I know developers may ask: now that we have this data, how do we actually use it to design tests at scale?
>> That is a really great question. That's where the three-tier testing strategy comes in.
>> Yeah, tell me more about this.
>> Sure. The first tier is unit tests. Just like the product research agent example we had before, you're testing one small piece in isolation. Tier two is integration tests. It's like taking the whole car for a test drive: you're not just checking a single component, you're checking whether the whole multi-step journey works the way it should. And tier three is end-to-end human review and multi-agent testing. This is the final sanity check. It's also where you bring in multiple agents and feed results back into your human-in-the-loop calibration loop.
>> That's a really interesting framework, and I'd really like to see it in practice. Maybe we can have a future episode on this.
>> Sure.
>> But for now, we've pretty much covered the A to Z of evaluating a single agent, which leaves one big question: what happens when you've got multiple agents working together?
>> Yeah. So here's the thing. When we move into multi-agent systems for really complex tasks, our whole approach to evaluation has to evolve.
>> Yeah. The issue is that judging agents in isolation doesn't tell you much about the system's overall performance. That's especially true if you start using new frameworks to build multi-agent systems, like the Agent2Agent protocol, which is all about agents discovering and talking to each other more easily.
>> Right, exactly. So in a multi-agent setup you don't just care about reasoning or tool use. You care about whether the whole system gets the job done.
>> Sure. To explain why this is relevant, let's have a quick example. Imagine we have an agent A, which handles customer support, and an agent B, which handles the refund and replacement process.
>> Yeah.
>> So a customer comes to agent A and says, "I bought a smart widget last week and it doesn't turn on, so I'd really like a refund or a replacement."
>> Right. So what happens next is agent A kicks things off. It runs its greeting tool, checks the customer info, looks up the purchase history, and confirms the order. But here's the key: it can't actually process a refund. Instead, its job is to hand everything over to agent B, including all the customer details and product information.
>> Yeah. And then agent B takes it from there: it reviews the case and uses specialized tools for refunds or replacements to finish the request.
>> But here's the problem: if you look at agent A in isolation, depending on the metrics you use, a task completion score, for example, would be zero, because it didn't solve the problem itself. But in reality it succeeded 100%, because its role was to hand off to agent B.
>> Yeah, that's right. And if agent B successfully processes the refund, that's great. But what if agent A passed along the wrong information? The system as a whole will still fail, right? So that's why single-agent metrics can be totally misleading.
>> Yeah. So what really matters here is: can the agents hand off smoothly, share context, and keep latency and cost reasonable across the whole journey?
>> Exactly. So this end-to-end evaluation is one of the most important things in multi-agent systems, which means evaluation itself becomes part of the design, and we may even need agents that can emit structured data specifically for evaluation purposes.
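One way to picture "evaluation as part of the design": the agents emit structured trace events, and the evaluator scores the end-to-end journey rather than each agent alone. The event schema below is an assumption for illustration, not an A2A or ADK format:

```python
from dataclasses import dataclass

@dataclass
class TraceEvent:
    agent: str   # e.g. "agent_a" (support) or "agent_b" (refunds)
    kind: str    # "tool_call", "handoff", "final_response", ...
    data: dict

def end_to_end_success(trace: list[TraceEvent], expected_order_id: str) -> bool:
    """System-level check: the handoff must carry the right context AND
    the downstream agent must actually complete the refund."""
    handoff_ok = any(
        e.kind == "handoff" and e.data.get("order_id") == expected_order_id
        for e in trace
    )
    refund_ok = any(
        e.agent == "agent_b" and e.kind == "tool_call" and e.data.get("tool") == "process_refund"
        for e in trace
    )
    return handoff_ok and refund_ok
```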
>> Yeah, I like that idea, and I kind of see multi-agent evaluation like network analysis, where you look at the interactions and not just the outcomes. I'd really like to spend some time looking more into that.
>> Sure, totally. So it's not just about handoffs, it's also about collaboration, communication efficiency, and conflict resolution. And this is just one of the open questions in this whole agent evaluation world.
>> Yeah. Besides that, there are actually many others. For example, there's the scalability-versus-cost trade-off.
>> As we were saying, human evaluation is still a valid option, but at the same time it's slow and expensive. You can use LLM as a judge, which is faster and more scalable, but to align it with human expectations you need to tune it. So there we go: the trade-off between cost and performance.
>> I've also heard about benchmark integrity being a problem. If test questions leak into model training data, the scores don't mean much anymore. And once models start acing a benchmark, we have to build even harder ones.
>> Yeah, exactly. And another one, the last one I want to mention, is evaluating subjective attributes: things like creativity, productivity, even humor. How do you measure those in your agent? Imagine an agent that produces images: how do you measure how good it is at producing them? These kinds of questions still don't have a clear answer yet.
>> Yeah, those are really tough challenges, but I think they're also very exciting ones, and I can't wait to see new evaluation frameworks address them. But for now, I think that's a perfect place to wrap things up.
>> Yeah, I think so. We really covered a lot today on agent evaluation. We kicked things off with the basics, like what agent evaluation actually is and how it's different from traditional testing as well as LLM evaluation, and we broke down the full-stack approach: looking at outcome, reasoning, tools, and memory.
>> Yeah, and after that we also walked through the concepts of ground truth checks, LLM as a judge, and human in the loop, all the way down to what's next with multi-agent handoffs and the challenges of benchmarking.
>> Yeah, that's a lot of content about evaluation. I hope you now have a better idea of how to evaluate your agent, and we'll also share links to the resources we just discussed in the show notes.
>> Yeah, that's right. Thank you so much for hanging out with us on The Agent Factory. And once more, if you enjoyed today's content and want to support us, make sure to subscribe to the Google Cloud Tech channel, activate notifications, like, comment, and share with everyone.
>> Yeah. And until next time, I'm Ivan.
>> I'm Annie. Powering down.
[Music]
Learn how to effectively evaluate your AI agent and ensure it performs reliably in production. This episode of The Agent Factory is your definitive guide on agent evaluation, showing you how to go from local testing with the Agent Development Kit (ADK) to large-scale, enterprise-grade evaluation using Vertex AI. We break down how to implement a full-stack agent evaluation strategy, including how to use ADK for fast debugging and golden dataset creation, and how Vertex AI's GenAI Evaluation service scales your testing with the LLM-as-a-judge approach. Don't launch an agent you can't trust: watch to learn how to measure outcome, reasoning, tool use, and memory.

Want to build production-ready agents? Don't miss an episode!

In this episode you'll learn:
1️⃣ How to evaluate the agent's system-level behavior, not just its output.
2️⃣ The 5-step inner-loop workflow for testing agents with ADK (Agent Development Kit).
3️⃣ How to use Vertex AI for production-scale, qualitative agent evaluation.
4️⃣ The unique challenges of testing and evaluating multi-agent systems (A2A).
5️⃣ Techniques for generating synthetic data to solve the evaluation cold start problem.

About The Agent Factory:
"The Agent Factory" is a video-first technical podcast for developers, by developers, focused on building production-ready AI agents. We explore how to design, build, deploy, and manage agents that bring real value.

🔗 Resources & links mentioned:
➖ Google's Agent Development Kit (ADK) evaluation guide → https://goo.gle/3KshHIu
➖ Google's Agent Development Kit (ADK) → https://goo.gle/3Kq6Lex
➖ Vertex AI GenAI Evaluation Service → https://goo.gle/3ICTMpe
➖ How to evaluate generated answers from RAG at scale on Vertex AI → https://goo.gle/4o1oh7p
➖ How to evaluate LLMs with custom criteria using Vertex AI AutoSxS → https://goo.gle/46GfMYg, https://goo.gle/3IOMjDt

Subscribe to The Agent Factory → https://www.youtube.com/playlist?list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs
🔔 Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech

#AgentEvaluation #EvaluateTheAgent #ADK #VertexAI #AIAgents #AI #Payments

Speakers: Annie Wang, Ivan Nardini
Products Mentioned: ADK, Vertex AI, A2A