[music]
Hello everyone, welcome back. In the
last episode on agent evaluation, we
talked about why normal software testing
doesn't work for AI agents and which
parts of an agent system you should
measure. In today's video, we will
turn that theory into practice. We will
walk through the three-tier testing pyramid
so that you have a framework to tackle
the agent evaluation challenge, and then
we will use the Google Agent Development Kit
to actually run tests following this
three-tier testing pyramid. We will also
explain what a trajectory is and how the
evaluation metrics work. By the end of
today's video, you will know how to test
your agents step by step using a
clear testing pyramid, and how to use ADK to
build reliable automated checks. All
right, let's get started. So, part one:
the three-tier testing pyramid. As you can
see from the diagram on the screen, we
have a three-tiered pyramid. Tier one
encompasses component-level unit tests,
which are also the foundation of agent
testing. At this level, we test the
smallest building blocks in isolation.
Often that is tool use: did the agent
pick the right tool from a simple
prompt? Did it produce a valid, correctly
structured request, like the correct
JSON fields? Tests at this tier
are usually fast, cheap, and automated,
so they are perfect for continuous
integration to catch regressions
early.
Now let's take a look at tier two.
This level is about trajectory-level
integration tests, which means
testing a full multi-step task end to
end. Here we cover: did the agent
plan logically? Did it use tools in the
right order and adapt if anything
failed? Did the agent remember earlier
information and reach the goal?
And tier three is
about end-to-end human review. We need
a human in the loop to check
the whole experience: helpfulness,
safety, common sense. So
this tier is usually slower and more
expensive.
But it is also your final quality gate.
So in summary, here is the big picture.
Tier one, component-level unit tests,
checks correctness, like tools and
parameters.
Tier two, trajectory-level
integration tests, checks capability
and reasoning. And lastly, tier three, end-to-end
human review, ensures user
experience and trust. Now that you know
the three-tier testing pyramid, let's
look at an example of how to use the
Agent Development Kit (ADK) to write tests
following this three-tier pyramid
framework.
So the
second part of this video is ADK in
action. We will cover what to measure
and how to evaluate with ADK, using a
BookFinder agent as our example. As you can see on
the screen, this agent uses the
Gemini 2.5 Pro model as its brain and
has three different tools: a search-local-library
tool, a find-local-bookstore tool,
and an order-online tool, which it uses
to make decisions and answer the user's
question. So if a user inputs the query
"Order the book Harry Potter for me,"
the agent will first check the
local library and then check the
bookstore. If it cannot find the book
in the library or a bookstore, it will
then order it online for the user. When we
evaluate this agent, we need to evaluate
the whole book-finding journey, which
means we need to set an expected trajectory
and compare it with the actual process.
A trajectory is the entire journey
your agent takes to solve a task.
This includes, first of all, the sequence
of tool calls with their arguments;
secondly, the intermediate steps
and reasoning;
and also the final response. So for this
BookFinder example, our desired
trajectory is: search local library, then
find local bookstore, then order
online.
Other than the trajectory, we also need to
compare the final answer. We need to see
whether it recommends an online
retailer if local options fail. So
now we need to measure the trajectory
and the final answer, but how do we set the
criteria? Here we introduce two key
ADK built-in metrics, which we will use
to define the passing criteria.
The first metric is the tool trajectory
average score. This measures how
closely the actual tool calls match your
expected sequence.
Here is how it is scored: we
compare each expected call to what
actually happened. If you set the threshold to
one, the trajectories have to match
100% to pass the test. Similarly, if you
set the threshold to zero, they can pass
the test even if they are completely
different.
So when we use it, we need to set a
threshold between zero and one.
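The idea behind this score can be sketched in plain Python. Note this is a simplified illustration for intuition, not ADK's actual implementation, and the tool names and dictionary shape are made up for this BookFinder example:

```python
def tool_trajectory_avg_score(expected, actual):
    """Fraction of expected tool calls matched in order (name + args).

    Simplified sketch of the idea behind ADK's tool trajectory average
    score; the real metric lives in ADK's evaluation module.
    """
    if not expected:
        return 1.0
    matches = sum(
        1
        for exp, act in zip(expected, actual)
        if exp == act  # both tool name and arguments must match exactly
    )
    return matches / len(expected)

expected = [
    {"tool": "search_local_library", "args": {"title": "Harry Potter"}},
    {"tool": "find_local_bookstore", "args": {"title": "Harry Potter"}},
    {"tool": "order_online", "args": {"title": "Harry Potter"}},
]
# Suppose the agent skipped the bookstore step:
actual = [
    {"tool": "search_local_library", "args": {"title": "Harry Potter"}},
    {"tool": "order_online", "args": {"title": "Harry Potter"}},
]
score = tool_trajectory_avg_score(expected, actual)
print(round(score, 2))  # only position 0 lines up, so 1/3 ≈ 0.33
```

With a threshold of 0.8 this run would fail; with 0.3 it would pass, which is why choosing the threshold deliberately matters.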
The second metric is the response match
score. This measures how similar the
final answer is to your expected answer.
Under the hood, it uses ROUGE-1 to check
word overlap, which combines
precision and recall into a zero-to-one
score. So when we use it, we need to
pick a threshold between zero and one
that allows natural language variation.
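The ROUGE-1 word overlap can be sketched like this. This is a minimal unigram version for intuition only; real ROUGE implementations handle tokenization, stemming, and variants more carefully, and the example strings are invented:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: unigram overlap combining precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped per-word matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

expected = "You can order Harry Potter online from a retailer"
actual = "Harry Potter is not available locally, so you can order it online"
print(round(rouge1_f1(actual, expected), 2))  # ≈ 0.57
```

Here the answers share six words out of twelve and nine respectively, giving roughly 0.57, so this response would pass a 0.5 threshold but fail a 0.8 one.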
For example, 0.5 means "close enough,"
while 0.8 means you are being very strict
about the response. Note that agents are
nondeterministic, so avoid
demanding 1.0 unless you use a temperature
of approximately zero. So now we know we
need to test the trajectory and the final
response. Let's design tests following
the pyramid framework. For tier one,
component tests: test simple prompts
against expected tool calls, checking
components like tool selection accuracy
and parameters, so we can verify
correctness.
Next, for tier two, trajectory tests: we can
use ADK eval with multi-step tests and
an expected tool sequence, using metrics
like the tool trajectory average score and
the response match score, so we can evaluate
capability and reasoning. And lastly,
tier three, human review: because ADK
gives you traces, we can have a human in
the loop judge the overall quality. Now
let's get into the last part of this
video, which is a demo of how to use
ADK to test your agent. For the tier one
component test, we will look at an example
of tool correctness.
The goal is: does a single tool call get
selected and filled correctly? Here
is example code with a unit test. When
we read this tier one result,
we need to check the tool selection
accuracy, like did it pick the search-local-library
tool, and also the parameters,
like did it pass title equal to
"Harry Potter" correctly?
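A tier-one unit test along these lines might look like the sketch below. The `run_agent_turn` helper is a hypothetical stand-in for however your ADK agent is invoked and its tool calls recorded; the assertions are the point:

```python
# Hypothetical tier-1 component test: one prompt, one expected tool call.

def run_agent_turn(prompt: str) -> dict:
    # Stubbed here so the example is self-contained; in a real test this
    # would call the agent and return its recorded first tool call.
    return {"tool": "search_local_library", "args": {"title": "Harry Potter"}}

def test_tool_selection_and_parameters():
    call = run_agent_turn("Order the book Harry Potter for me")
    # Tool selection accuracy: the agent should start with the library.
    assert call["tool"] == "search_local_library"
    # Parameter correctness: the title argument must be filled in exactly.
    assert call["args"] == {"title": "Harry Potter"}

test_tool_selection_and_parameters()
print("tier-1 component test passed")
```

Tests shaped like this run in milliseconds, which is what makes tier one suitable for continuous integration.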
Since tier one maps to the foundation,
the correctness of tools and parameters,
if these tests fail, we need to fix them
before moving on. Next, the tier two
trajectory test, which includes the
multi-step plan and the final answer. The
goal is: does the agent follow the
expected journey and produce a good
final response? Here is example code for
this test, and let's set the pass criteria.
We can set the tool trajectory average
score and the response match score to define
how strict the tests are.
As you can see, 0.8 for the trajectory
demands a very close match to your planned
steps, while 0.5 allows natural language
variation in the final response.
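In ADK, pass criteria like these typically live in a small JSON config alongside your eval set. Here is a sketch of writing one; the file name `test_config.json` and the exact schema and criterion keys should be checked against your ADK version's documentation:

```python
import json

# Hypothetical tier-2 pass criteria, matching the thresholds discussed
# above (verify key names and schema against the ADK docs).
criteria = {
    "criteria": {
        "tool_trajectory_avg_score": 0.8,  # strict: plan must closely match
        "response_match_score": 0.5,       # lenient: allow wording variation
    }
}

with open("test_config.json", "w") as f:
    json.dump(criteria, f, indent=2)

print(json.dumps(criteria["criteria"], indent=2))
```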
All right, let's run it, and you can see
the result on the screen.
In summary, you've just learned how to
evaluate AI agents in practice. We started
with tier one tests for fast, cheap
component checks,
then moved to tier two, ADK eval,
for full trajectory and final answer
quality.
You can then finish with tier three,
human review, to guarantee a great user
experience. With this three-tier
approach and ADK's built-in tools, you can
go from "I think this agent works" to "I
know this agent is reliable." And
that wraps up our two-part
series on agent evaluation.
In the last episode, we explored why the
traditional way of testing doesn't work
for AI agents and what to measure in
agents. In today's episode, we made
it practical with a clear testing
pyramid and hands-on ADK demos. From
here, you're ready to start writing your
own evaluation sets and test files with
ADK.
Thank you so much for watching, and I
will see you in future videos. Bye.
Evaluating Agents with ADK → https://goo.gle/testagent

This video applies the theory of AI agent evaluation from our previous episode, guiding you through practical testing with Google's ADK. Learn about the 3-Tier Testing Pyramid, covering component-level unit tests, trajectory-level integration tests, and human review. Discover how to use ADK to design and run reliable, automated checks for your agents, ensuring they perform as expected in real-world scenarios.

Chapters:
0:00 - Introduction to practical agent evaluation
1:05 - The 3-tier testing pyramid explained
1:15 - Tier 1: Component level unit tests
1:55 - Tier 2: Trajectory level integration tests
2:22 - Tier 3: End to end human review
3:14 - Agent Development Kit (ADK) in action
4:48 - [Demo] The 3-tier testing pyramid
9:23 - Summary and next steps

Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech

#GoogleCloud #AIAgents #ADK

Speakers: Annie Wang
Products Mentioned: Agent Development Kit