All right. Hi everyone. Welcome to OpenAI Build Hours. I'm Tasha, product marketing manager on the platform team. Really excited to introduce our speakers for today: myself, kicking things off; Samarth, from our Applied AI team on the startup side; and Henry, who runs product for the platform team.
Awesome. So, as a reminder, our goal with Build Hours is to empower you builders with the best practices, tools, and AI expertise to scale your company, your products, and your vision with OpenAI's APIs and models. You can see the schedule at the link below: openai.com/buildhours.
Awesome. Our agenda for today: I'll quickly go over AgentKit, which we launched just a couple of weeks ago at DevDay, then hand it off to Samarth for an AgentKit demo. Henry will then run us through evals, which really help bring those agents to life and let us trust them at scale. If we have time, we'll go over a couple of real-world examples, and we're definitely leaving time for Q&A at the end, so feel free to add your questions as we go.
Awesome. So let's do a quick snapshot of what building agents has been like for the last several months, or even the last year. It used to be super complex. Orchestration was hard: you had to write it all in code, and updating a version could introduce breaking changes. If you wanted to connect tools securely, you had to write custom code to do so. Running evals required you to manually extract data from one system into a separate eval platform, daisy-chaining all of these separate systems together to make sure you could actually trust those agents at scale. Prompt optimization was slow and manual. And on top of all of that, you had to build UI to bring those agents to life, which takes another several weeks or months. So basically, it was in massive need of an upgrade, which is what we're doing here. With AgentKit, we hope we've made some significant improvements to how you can build agents. Workflows can now be built visually with a visual workflow builder. It's versioned, so no breaking changes are introduced. There's an admin center called the Connector Registry where you can safely connect data and tools, and evals are built into the platform, including third-party model support. As Samarth will show us in a bit, there's an automated prompt optimization tool as well, which makes it easy to perfect those prompts automatically rather than through manual trial and error. And finally, there's ChatKit, a customizable chat UI.
Cool. So, bringing it all together, this is the AgentKit tech stack. At the bottom we have Agent Builder, where you can choose which models to deploy the agents with, connect tools, write and automatically optimize those prompts, and add guardrails so that the agents perform as you'd expect even when they get unexpected queries. You can then deploy that to ChatKit, which you can host yourself or with OpenAI, and then optimize those agents at scale in the real world, with real data from real humans, by observing and improving how they perform through our evals platform.
Cool. So we're already seeing a bunch of startups, Fortune 500s, and everything in between using agents to build a breadth of use cases. Some of the more popular ones are things like customer support agents that triage and answer chat-based support tickets, sales assistants similar to the one we'll demo today, internal productivity tools like the ones we use at OpenAI to help teams across the board work smarter and faster and reduce duplicate work, knowledge assistants, and even research agents for document research or general research. The screenshot on the right shows a few of the templates in Agent Builder that cover some of the major use cases we're already powering.
Okay, so let's make this all real with a real-world example. A common challenge that businesses face is driving and increasing revenue. Let's say your sales team is busy outbounding to prospects, building relationships, and meeting with customers. We want to build a go-to-market assistant to help save the sales team time and increase revenue. And with that, I'll kick it over to Samarth to show us how to do it.
>> Great. One of the biggest questions we get at OpenAI is: how do we use OpenAI within OpenAI? Hopefully this pulls the curtain back a little so you can take a peek at how we actually build some of our go-to-market assistants. We'll cover a few different topics today, like agents that are capable of data analysis, lead qualification, and outbound email generation. So what I'll do here is move over and share my screen.
Great. So we're actually in our Atlas browser; feel free to download it. I've had a fantastic time using it these past few weeks, and I think it's saved me hours, if not days, of time. I'm a big fan. Okay, so we'll get started. When we get into the Agent Builder platform, the first things we see are a start node and an agent node. You can think of the agent as the atomic particle within the workflow that you go in and construct, and behind it is the Agents SDK, which powers the entirety of Agent Builder. Whenever we build these Agent Builder workflows, they don't have to live within the OpenAI platform: you can copy the code and host it on your own, and you might even want to take this beyond traditional chat applications and do things like triggering these workflows via webhooks. So for this example, we have three agents in mind that we're looking to build out: a data analysis agent, where we'll pull from Databricks; a lead qualification agent, where we'll scour the internet for additional details; and an outbound email generation agent, where we might want to enrich an email with details about a product or marketing campaign that we're launching. Sound good?
>> That sounds great. I'm on board.
>> Okay, great. So, we'll get started by building our first agent here. Since we have three different types of use cases in mind for what we're trying to build, we want to use a very traditional architectural pattern: a triage agent. The way we think about this is that agents are really good at doing specialized tasks, so if we route each question to the proper sub-agent, we can get better responses. For this first agent, let's call it a question classifier.
Typing is hard.
I'll copy over the prompt that we've put together and take a quick peek at what it looks like. Really, what we're doing here is asking the model to classify a question as either a qualification, a data, or an email type of question. The idea is that we can then route the query depending on what the model selects as its output. And rather than having a traditional text output, we want to force the model to output in a schema that we recognize and can use for the rest of the workflow. So let's call the variable the model will output "category" and select the type as enum. This means the model will only output a selection from the list we provide here; from my prompt, that's the email agent, the data agent, and the qualification agent.
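For anyone following along in code rather than on the visual canvas, here is a minimal sketch of the same classifier pattern using the open-source Agents SDK in Python with an enum-style structured output. The instruction text and category names are illustrative stand-ins, not the exact prompt used in the demo.

```python
from enum import Enum
from pydantic import BaseModel
from agents import Agent, Runner  # openai-agents package

class Category(str, Enum):
    email_agent = "email_agent"
    data_agent = "data_agent"
    qualification_agent = "qualification_agent"

class Classification(BaseModel):
    category: Category  # the model can only pick one of the enum values

classifier = Agent(
    name="Question classifier",
    instructions=(
        "Classify the user's question as an email, data, or "
        "qualification question and return only the category."
    ),
    output_type=Classification,  # forces a structured output instead of free text
)

result = Runner.run_sync(classifier, "Show me the top 10 accounts")
print(result.final_output.category)  # e.g. Category.data_agent
```

The routing step that follows in the canvas then branches on that `category` value.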
>> Great.
>> And real quick, how did you write the prompt? Did you write that all yourself? I know the importance of prompting in steering the agent. How did you come up with it?
>> I think writing prompts is one of the most cumbersome things we do; there's a lot of time spent spinning wheels on what actually matters when you're capturing that initial prompt. One of the key ways I write prompts myself is to use ChatGPT and GPT-5 to create my v0 of the prompt. Within Agent Builder itself, you can also edit a prompt or generate one from scratch to use as the bare bones for your agent workflows. For now, we'll leave the one we pasted in, but later in this workflow we'll take a peek at what using that generator actually looks like.
Great. So now that we've got the output, Agent Builder allows us to make this workflow stateful. For example, I'll add a Set State node here. Sorry, dragging and dropping can also be difficult. What we want to do is take the output value from the previous stage and assign it to a new variable so that the rest of the workflow can reference it. We'll call it "category" again and assign no default value for now. Using that same value, I can now conditionally branch to either the data analysis agent or the rest of my workflow, which handles the additional steps we want to run before the email use case or the customer qualification use case. So we'll drag this agent in and set the conditional statement to say: if the state's category is equal to "data". Let's see. Oh, it looks like I spelled it wrong.
>> Debugging.
Great.
>> As you can see, there are helpful hints where we could see exactly what went wrong and quickly go back and debug it.
So here, if it's a data question, we'll route to that separate data agent; if it's not, we'll use additional logic to go and scour the internet for the inbound leads we want to qualify, or the email we want to write. Let's stick with the data analysis agent for now and go over what it's like to connect to external sources within Agent Builder and, underneath it, the Agents SDK. What I want to do here is instruct the model on how to use Databricks and create queries that it can run through an MCP server. So what we've done is add a tool that lets the model access this MCP server and query Databricks however it sees fit. If my query is really hard and might require joins, Databricks and GPT-5 can work together to produce a concise query. Since I've built my own server for now, I'll add it here: I'll add my URL first and call it the Databricks MCP server.
Next, I'll choose the authentication pattern. You can also select no authentication, but for protected resources that live behind authenticated platforms, you might want to use something like a personal access token to do that last mile of federation. So in this case, I'll use a personal access token I created within my Databricks instance and hit create. Let's give it a second to pull up the tools, and we can see that a fetch tool has surfaced here. This lets us select a subset of the functions the MCP server exposes, so the model doesn't get overwhelmed by the number of potential actions it can take. So, I'll add that tool there.
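Outside the canvas, roughly the same connection can be sketched against the Responses API's hosted MCP tool. This is a hedged sketch: the server URL, label, environment variable, and allowed tool name are placeholders for whatever your own MCP server exposes, and option names should be checked against the current API reference.

```python
import os
from openai import OpenAI

client = OpenAI()

# Placeholder values: point these at your own MCP server and credentials.
DATABRICKS_MCP_URL = "https://example.com/databricks-mcp"   # hypothetical URL
PAT = os.environ["DATABRICKS_PAT"]                          # personal access token

response = client.responses.create(
    model="gpt-5",
    input="Show me the top 10 accounts",
    tools=[
        {
            "type": "mcp",
            "server_label": "databricks",
            "server_url": DATABRICKS_MCP_URL,
            # Last-mile auth: forward the personal access token to the server.
            "headers": {"Authorization": f"Bearer {PAT}"},
            # Expose only a subset of the server's tools to the model.
            "allowed_tools": ["fetch"],
            # Require an approval step before actions run.
            "require_approval": "always",
        }
    ],
)
print(response.output_text)
```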
Oops. And I'll also go back: one thing I might have missed is setting the model. If I wanted to make this really snappy, I could choose a non-reasoning model. But for this one, I really want the model to iterate on these queries and react to how the results come back to the agent. So what we'll do here is run a quick test query to make sure the piping works. Maybe I'll say: show me the top 10 accounts.
That should be good enough. And what we can see is the model stepping through the individual stages of the workflow. In the beginning, you can see that it classified this question as a data question, saved that state, and then routed it. We can also see that when it reached that agent and decided to use the tool, it asked us for consent before taking the action. You can configure that logic on the front end to control how you show the user that the model wants to take an action. With MCP you're able to do both read and write actions, and we have a few of these MCP servers out of the box, think Gmail, plus a ton more that you can connect to.
>> SharePoint.
>> Totally. And so here we can see that the model is thinking about how to construct the query, and we get a response. We didn't ask the model to format the result for us, but we can fix that really quickly on this agent itself by asking the model to output the results in natural language, and by clicking the generate button within Agent Builder you can make these inline changes based on the results you see in real time.
>> Super cool.
>> Cool. So the next thing I want to do is create another agent to do some of the research we mentioned, which might be useful for something like generating an email or qualifying a lead. We'll call this the information gathering agent. Looks like it's stuck here; I might have to give it a quick refresh in a moment. See, the platform's a bit buggy.
Great. Cool. So we're at this information gathering agent, and what we want to do is tell the model how to search the internet for the leads we want. In particular, we're looking for a subset of the information that might be publicly available for a company: think the company's legal name, the number of employees, the company description, maybe annual revenue, as well as geography. And what we want to do here, again, is use a structured output to define what the output should look like when the model searches the internet. This gives the model a good mapping of what to look for when it's writing these queries, and it lets us instruct the model on how it should search across the internet.
Great. We also want to change the output format to the schema we have in mind, putting the fields we just described into a structured output format. You can also add descriptions to the properties, but for now we'll leave those blank.
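As a rough illustration of that structured output, the schema for this information gathering agent might look something like the following Pydantic model attached to an Agents SDK agent with the hosted web search tool. The field names and instruction text are assumptions for illustration, not the exact schema used in the demo.

```python
from pydantic import BaseModel, Field
from agents import Agent, WebSearchTool

class CompanyProfile(BaseModel):
    legal_name: str = Field(description="Registered legal name of the company")
    employee_count: int | None       # nullable fields, so the model can say "unknown"
    description: str | None
    annual_revenue_usd: float | None
    geography: str | None

info_gathering_agent = Agent(
    name="Information gathering agent",
    instructions=(
        "Search the web for publicly available information about the company "
        "in the user's question and fill in every field you can verify."
    ),
    tools=[WebSearchTool()],      # hosted web search tool
    output_type=CompanyProfile,   # structured output instead of free text
)
```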
Great. So now, when a query routes to this information gathering agent, it will hit this agent, search the internet, and output in the format we're looking for. And since we saved the state of the query routing at the beginning, we can reference it again when deciding whether to route to the email agent or to the lead enhancement agent. So what we'll do here is set one branch to category equals email, and otherwise route to the other agent.
>> Awesome. Yeah. And the sub-agent architecture is great because it means you get better-quality results a bit faster than you would with one general-purpose agent, which is helpful for actually having impact and making the sales team more productive.
>> What we'll do here is paste in a prompt for the email agent. The highlight for this agent is that we're looking to generate emails based not just on information from the query or from the internet; we also want to upload files that reflect the way we actually think about writing emails for marketing campaigns. So you might have PDFs that describe what the campaign is, and other PDFs that describe how you should write emails. All of that is really useful context for the model when it specs out what the email should actually look like. So we'll add a tool to search these files. You can attach vector stores you already have to the workflow and use them out of the box, and you can also add them via the API. For now, we'll just drag in a couple of files: one is a standard operating procedure for how to write emails, and the other is a document describing a promotion this sample company is running. What we've done is allow the model to search the vector store for this information when it generates the email.
On the lead enhancement agent, instead of writing a prompt ourselves, let's pretend we have a general segmentation of the market that we want to assign various account executives to. In this case, we essentially want the agent to output a quick schematic of how that assignment should happen based on the information gathered from the internet. And without writing a prompt by hand, Agent Builder can generate an entire version of that prompt as a starting point.
>> Super cool.
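For the API route mentioned above (attaching files programmatically instead of dragging them in), a sketch could look like the following. The file names and vector store name are placeholders, and the exact client surface may differ slightly between SDK versions.

```python
from openai import OpenAI

client = OpenAI()

# Create a vector store and add the two reference documents (placeholder filenames).
store = client.vector_stores.create(name="email-campaign-docs")
for path in ["email_sop.pdf", "spring_promotion.pdf"]:
    with open(path, "rb") as f:
        client.vector_stores.files.upload_and_poll(vector_store_id=store.id, file=f)

# Give the email agent access to those files through the hosted file_search tool.
response = client.responses.create(
    model="gpt-5",
    input="Draft an outbound email about our current promotion.",
    tools=[{"type": "file_search", "vector_store_ids": [store.id]}],
)
print(response.output_text)
```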
>> Great. Before I move away from Agent Builder and show this working end to end, I wanted to show that Agent Builder doesn't just support text and structured output formats; we also support really rich widgets. What this looks like in practice is that, instead of outputting text or JSON, we can upload a widget. I'll show in a little bit what it looks like to actually create and use a widget, but here we can upload a widget file itself, so I'll drag this in. Great. So we can see a quick preview of what this widget looks like. Rather than just outputting text, which in ChatGPT would traditionally be a markdown-formatted result, we might want to render something richer, so that if you host this on your own website you get that multimodal component as well. So we'll create this component. And now if I type: draft an email to — should we use OpenAI? — OpenAI, about...
>> Great, so we can see that it went to the information gathering agent, since we've given it access to the web search tool. From the reasoning — wait, did we do that? Let me make sure I did.
>> Maybe skipped it.
>> Might have skipped that step. There we go. Great.
>> So again, sorry —
>> I was just going to say, I love that you can test the workflow live here and debug it, like we're doing, before going to production.
>> Totally. And the really nice thing is that as you run questions through this workflow, we save traces of exactly how the model executed the various queries and, more holistically, how the workflow orchestrated them. That's really rich information as you continue to iterate on your workflow. Henry will touch on this a ton, but the ability to peel back the curtain, see how the model is thinking, and then assign graders really lets you scale out the evaluation process as well.
>> Yeah.
>> So, great. Looks like here it's searching for Lumaf Fleet. We'll let this run for a bit and see what happens at the end.
Okay, looks like it might take a little while, so we'll get back to that one. End to end, what we've built here is essentially an agent that lets you do three different things. The first is to query Databricks, so you can pull information that might live behind some form of information wall into the agent workflow itself. Then you can write emails, and you can also qualify inbound leads that you might get from customers. All of this lives within a workflow that you can host within ChatKit, which we'll cover, or you can take it out and use it in your own codebase to handle what those chat workflows actually look like.
>> Super cool. One of the questions I was wondering about: what's the difference between pulling a tool from the left-hand sidebar and dragging it in as a node, as opposed to adding that tool to the agent node specifically?
>> Totally great question. When I added the search tool to the information gathering agent, I allowed the model to determine whether it should actually use that tool. Sometimes I always want the tool to run before an agent gets that information, so I can add one of these tool nodes to ensure the action runs before the agent receives its input.
>> Makes a ton of sense. Yeah. So AgentKit is a good combination of deterministic and, where you want it, non-deterministic behavior.
>> Yeah.
>> Cool.
>> Great. I want to pivot: so we've built this amazing workflow, and now we want to deploy it. I think one of the most fantastic things we released at our most recent DevDay is the ability to host these workflows you've built. Using the workflow ID of the workflow we just created, we're able to power chat interfaces that would otherwise require a ton of engineering to support things like reasoning models, complex agent architectures, and the handoffs you might want to show to users. In production, you can match the chat interface you're building to the entirety of your brand guidelines, and we'll take a peek at how some of our real customers are using this today. But I wanted to highlight that you can entirely customize the color scheme, the font families, and the starter prompts your users see. Say, for example, we have a workflow that looks at our utility bills, where we might want it to connect to an MCP server, pull up your billing history, analyze those past bills, and then show a really rich widget to the user. The entirety of that process, and the customization of what the user sees, is configurable through ChatKit. So here, for the question "How's my energy usage?", rather than showing a traditional text response, we see a really rich graph that lets you visualize the output.
>> This is super cool. Yeah. And for our use-case example, just to drive it home, one of the widgets available, which maybe you'll show us shortly, is an email widget. So if you wanted the agent to draft an email to OpenAI, which I think it's still researching because there's so much public information out there, sales can then just click a button to have that email sent to the customer.
>> Totally. Yeah. Let's take a look at what a few of those widgets could be. We've released a gallery where you can take a peek at some of the ones we think are really cool, and you can click into these and see the code used to build them. But what I think is really cool is being able to generate these through natural language. For example, if I wanted to mock up an email component or widget that contains specific brand guidelines, or is formatted in a way that really fits my brand, I'm totally able to do that via natural language. Using this, you can then export the widget into Agent Builder and show that UI when Agent Builder invokes the widget in ChatKit.
>> Amazing.
>> Great. Before moving it over to Henry, I wanted to show an example of what this looks like in real life. We have a website here that renders a globe, a picture of the Earth, and what we want to do is control this globe via natural language. So where should we go today, Tasha?
>> Well, I think our next DevDay Exchange is in Bangalore, so I'm going to say India.
>> Let's go to India.
So what we should see here is another Agent Builder powered workflow. We can see how not only did a widget populate on the right side, we were also able to control the JavaScript rendered on the actual website itself. Being able to have this customizability and portability into the websites and browsers you use every day is something we find really fascinating with ChatKit.
>> That was the fastest trip to India I've ever taken.
>> Totally. Awesome.
So we covered the build side as well as the deploy side with ChatKit. What's really the most important part, and honestly the hardest part of building a lot of these agents, is the evaluate part.
>> Yep. That's how we know we can trust the agents in real-world scenarios, in production, at scale, with all of the glorious and weird edge cases that come up. So with that, I'd love to hand it over to our friend in the UK, Henry, who can walk us through an evals demo.
Thank you so much, Tasha and Samarth. And hi everyone, I'm Henry, one of the product managers who worked on AgentKit. Today I want to talk a little bit about how, once you've built that agent and defined that workflow in the visual builder, you can test it. I want to talk first about how you can test an individual node and get confident that that specific agent or node is going to perform as you want it to, because ultimately your agent is only as good as its weakest link: you need every single component to be dialed in and performing how you want. Once you've got every one of those nodes in a place you're comfortable with, you then want to assess the end-to-end performance. For that, you can look at traces, but traces are hard to interpret, and so now we have a trace grading experience too that lets you take those traces and evaluate them at scale. So, let me pull up my screen and start talking you through a bit of a demo to show how we can do this.
So, here you can see an agent that I built. This is based on a real example from one of our financial services customers. It takes a company name as input, assesses whether it's a public or a private company, and completes a series of analyses on that company before ultimately writing a report for a professional investor to review. As I mentioned, you have a whole bunch of agents here, and every single one of these agents needs to perform well and perform as you want it to. So how do you get confident in that performance? How do you get visibility and transparency into how it's going to perform? When you're defining this agent and looking into one of these nodes, you can see there's an evaluate button in the bottom right. Clicking that evaluate button takes the agent node, which has a prompt, tools, and a model assigned, and opens it in a dataset. So here you can see this datasets UI, which allows you to visually build a simple eval. I'm now going to attach just a couple of rows of data to this eval: a company name, plus some ground-truth revenue and income figures. I've imported that into this dataset, and that's going to allow us to run the eval. Here you can see everything that was passed through from the visual builder: the model, the web search tool, the system prompt, and the user message we had assigned. Then you can additionally see the data I uploaded: just three rows, a couple of company names, and some ground-truth values for the revenue and income figures that our web search tool should return for those companies. So what I can do now is run the generation. That's obviously the first stage of any eval; once generation is complete, you run the evaluation stage. While that generation is running, I want to show how we can attach columns. Here we can add a new column for ratings, where we can attach a thumbs up or thumbs down, and an additional column for free-text feedback.
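Outside the datasets UI, that first stage, running generations over a few rows of ground-truth data, is conceptually just a loop over the dataset. Here is a hand-rolled sketch, with illustrative placeholder companies and figures rather than the demo's actual data (the web search tool type may differ by API version):

```python
from openai import OpenAI

client = OpenAI()

# Tiny dataset: company name plus ground-truth figures to compare against later.
rows = [
    {"company": "Amazon", "revenue_usd_bn": 575, "net_income_usd_bn": 30},
    {"company": "Apple",  "revenue_usd_bn": 383, "net_income_usd_bn": 97},
]

generations = []
for row in rows:
    resp = client.responses.create(
        model="gpt-5",
        tools=[{"type": "web_search"}],  # same hosted web search tool as the node
        input=(
            f"Write a short financial analysis of {row['company']}, "
            "including latest annual revenue and net income."
        ),
    )
    generations.append({"row": row, "output": resp.output_text})

# The evaluation stage (ratings, free-text notes, graders) then runs over `generations`.
```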
In that free-text column I can attach an annotation: maybe I'm happy with something, maybe I want to attach some longer-form feedback on that data as well. And you can see now that the output is coming through. If I click into it, I can tab through the generations that have been completed. You can see here it was asked to complete some analysis of Amazon and of Apple, and Meta is still running. I can scroll through and see the generation that was completed. What I can then do is attach the ratings and annotations I just created: I can say this one's good, maybe this one's bad, maybe this one's good, and then I can attach feedback, for example saying this one is too long.
Now, once I've done those annotations, I can also add graders. So let me add a grader here. I'm going to create a simple grader that evaluates a financial analysis: it requires that the analysis contains upside and downside arguments, that it considers competitors, and that it ends with a buy, sell, or hold rating. I'm going to save that and run it. This is now going to run through — in fact, let me just change that — okay, let's just leave it. So that's now going to run through and complete those grader ratings. That's going to take a little while because we've got a lot of data in there, so I'm going to tab over to a dataset I created earlier where these graders have already completed. If I click into these, I can see the rationale: why the grader has given the result it has. Here you can see, for example, that this grader has failed the output because there's no explicit recommendation and no competitor comparison.
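Conceptually, an LLM grader like this is just a rubric applied by a model to each generation and returned in a machine-readable shape. A minimal hand-rolled sketch of the idea, outside the built-in evals product, with an assumed rubric and output schema:

```python
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

RUBRIC = """You are grading a financial analysis. Pass it only if it:
1. Contains both upside and downside arguments.
2. Considers competitors.
3. Ends with an explicit buy, sell, or hold rating.
Explain your rationale."""

class GradeResult(BaseModel):
    passed: bool
    rationale: str

def grade(analysis: str) -> GradeResult:
    # Structured-output call so the grade is machine-readable, not free text.
    resp = client.responses.parse(
        model="gpt-5-mini",
        input=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": analysis},
        ],
        text_format=GradeResult,
    )
    return resp.output_parsed
```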
So what could we do at this point? Just to recap where we are: we've got the generations that have been completed, we've got all the annotations, and we've got all these grader outputs. What do you do now? How do you make your agent better? One thing you can do is manual prompt engineering: try to find patterns in that data and rewrite your prompt. That obviously takes a long time and requires you to find those patterns and spend a bunch of time trying to solve them. What we see as a better solution is automated prompt optimization. You can see there's this new optimize button; if I click it, it opens a new prompt tab in this dataset, and that's where we automate the rewriting of the prompt. This is how you save yourself from doing that manual prompt engineering every time. We take the annotations, the grader outputs, and the prompt itself, and use them to suggest a new prompt. Again, this will take a minute or two to run, so I'm going to tab over to one I made earlier. You can see here the rewritten prompt, which still completes a fundamental financial analysis but is much more thorough and complete than the initial, pretty scrappy and rough prompt I had written.
So that's an overview of how you can take a single node from Agent Builder and robustly evaluate that single agent. But we're not building a single agent here; this is a multi-agent system. We want to test every one of the nodes individually, but ultimately what we care about is the end-to-end performance. So how do we get confident in that? How do we test it? As Samarth mentioned, these agents emit traces, and here you can see some example traces from when I've previously run this agent. Clicking through, I can see every span, click into each one, and start to identify what happened when the agent ran. As I'm clicking through, I might start to notice problems. For example, here you can see a bunch of sources pulled by the web search tool, such as CNBC and Barron's. Maybe we don't want these third-party sources to be cited; maybe we want only first-party, authoritative sources. So we should add a requirement: web search sources should be first party only. Let's run that with GPT-5 nano so it's nice and fast. Then, as I click through more of these, I might find additional problems. Let's say we identify another pattern: the end result doesn't contain a buy, sell, or hold rating. So we add: the end result needs to contain a clear buy, sell, or hold rating. I'm building up these requirements that I can then run over specific traces, and you can think of this set of requirements as a grader rubric, built up from a series of criteria that define a good agent. Once I've got that set of criteria built up and tested on a couple of traces, I can click this grade-all button at the top. That's going to export the set of traces I've scoped this to, in this example just these five traces, take the set of graders I've defined on the right, and open them in a new eval. This lets you assess a very large number of traces at scale, because clicking through every trace and trying to find problems by hand doesn't work that well: it takes a lot of time, and it doesn't scale. Instead, you can run these trace graders over a very large number of traces, and that will help you identify just the spans that are problematic and just the traces you want to dive into. So that was an overview of the embedded evals experience that's tightly integrated with Agent Builder.
I also just wanted to flash a couple of best practices that we've seen from working with a large number of customers on this platform, and a couple of lessons we've learned. First, start simple: don't overcomplicate things, but do start early. Have a handful of inputs and a simple grader that you define right at the start of the project, instead of leaving evals to the last minute as an "I'm about to ship this thing, I'd better do some testing" exercise, which I know some people do. It's much better to start early, embed evals, and do eval-driven development, where you're rigorously testing your prototypes, finding problems in them, and then quantitatively measuring your improvement as you hill-climb against your eval. It's a much better way to build a product and likely to result in higher performance.
Secondly, use human data. It's really hard to just come up with hypothetical inputs or to use LLMs to generate synthetic inputs. You'll probably get much better performance if you get real user data, real inputs from real end users, because that captures a lot of the messiness of the real world. And finally, make sure you invest a bunch of time annotating generations and aligning your LLM graders, because that's how you make sure your subject-matter expertise is really encoded into the system and your graders actually represent what you want your product to do.
So that was a high-level overview of the product. This is all generally available, so we'd love for you to give it a spin, and please let us know any feedback at all. And with that, I'll pass back over to Tasha and Samarth.
>> Awesome. Thanks, Henry. I feel like we could do a whole hour session on evals; that was awesome. One quick question for you, from the chat, before you step out: how large of an eval dataset do you recommend? Is it 100, a thousand, 10? How do you know the right dataset size to get the results you want?
>> Yeah. So, the best thing to do is to get started early, and even 10 to 20 examples go a long way. Having that set of data in there just to test your application against is really helpful, so even 10 or 20, a couple of dozen rows, is helpful. Then, as you get closer to production, clearly more is better. But I wouldn't think of it purely as a question of how many rows, because there's a quality-times-quantity multiplier you have to consider here. Having 50 rows of really high-quality inputs that are very representative of a large set of user problems, and graders that are really aligned with the behavior you want to see, can perform phenomenally. Whereas if you use an LLM to generate a thousand rows of synthetic inputs, it's not going to be that helpful. So I'd say the quality is almost more important than the quantity.
>> Yeah,
>> that makes a lot of sense. Yeah.
>> Yeah. And just to add on top of that, one of the questions we get a ton is how to create a diverse dataset to run evals from, especially if you haven't put this tooling into production already. When we were building our go-to-market assistant, the engineering team that supports those workflows sat right next to our go-to-market team to understand what the subject-matter experts were actually asking or curious about. That allows us to build a good, diverse set of questions that we continue to optimize on every iteration, so we're capturing the nuances of the real queries people are actually asking.
>> Super cool. Awesome. Well, thanks, Henry. So, with that, I'd love to cover a couple of real-world examples, and then we'll leave some time for Q&A at the end. Our first one is a short video of a procurement agent that Ramp built. They used ChatKit to render this UI for the person requesting software, Agent Builder on the back end to orchestrate the agent flow, and evals to make sure it would work at scale in production. While this isn't live on their platform yet, we hope it will be in the near future; that was a quick run-through of what they actually built and the prototype. So, with the AgentKit stack, Ramp was able to build this prototype 70% faster, which I think is pretty amazing, equivalent to about two engineering sprints instead of two quarters. Rippling — I actually think you worked on this project a little bit. Do you want to share what they built and how it went?
>> Yeah, totally. We were initially thinking about how we could spec this out through the Agents SDK, and one of the hard challenges was getting alignment between subject-matter experts and the ability to build workflows that were logically sound. So we really sat with them to understand their real go-to-market use cases and worked backwards from there. Chatting with their team, I think it was a pleasure for them to use a tool like Agent Builder, and we got a ton of really good feedback for the next versions that we're looking to roll out.
>> That's awesome. Similarly, HubSpot, who has been doing a lot of amazing work in the AI space, used ChatKit to enhance their Breeze AI assistant. (If you want to advance the slide — awesome, thanks, all good.) So yeah, they saved weeks of front-end time. Like we mentioned at the start, building agents from start to finish is super time-consuming because of each of the complex steps involved, so if we can help with even one of those numerous steps, the UI aspect in this case, that's a useful lift. And then finally, Carlyle and Bain, two amazing evals customers of ours, were able to see a 25% efficiency gain in their eval datasets, which is fantastic.
Cool. Okay, so maybe to round it out before we go to Q&A: when we launched AgentKit, these were some of our early customers who built on the product, and you'll see that AgentKit is currently powering tech stacks at startups, Fortune 500s, and everything in between. These are the different types of agents, and there's a real breadth of use cases here, from work assistants to a procurement agent to policy agents. Albertsons, the large grocery retailer, has a merchandising intelligence agent; Bain is doing code modernization. So it's really cool to see the wide range of use cases here. Awesome. With that, we can go to Q&A from the chat.
Uh, maybe do you want to go to the next
slide?
>> Cool. Okay. So: how can I add for-loop blocks? Samarth, do you want to take that one?
>> Yeah, good question. So we don't have a for loop, but we do have a while loop available within Agent Builder: you're able to conditionally and continuously run different agent workflows depending on whether a completion criterion has been met. Obviously, with the Agents SDK you can also take the workflow out into a codebase and orchestrate it on your own, maybe using our version as the v0. But instead of a for loop, we support while loops, so you can iterate through the workflow until that end criterion has been met.
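As a rough sketch of the "take it into your own codebase and orchestrate it yourself" option, here is a while-style loop around an Agents SDK run. The completion check, instructions, and iteration cap are placeholders.

```python
import asyncio
from pydantic import BaseModel
from agents import Agent, Runner

class DraftReview(BaseModel):
    draft: str
    meets_criteria: bool  # the agent's own judgment of whether the brief is satisfied

drafter = Agent(
    name="Email drafter",
    instructions="Improve the email draft and report whether it now meets the brief.",
    output_type=DraftReview,
)

async def run_until_done(brief: str, max_iters: int = 5) -> str:
    draft, done, i = "", False, 0
    while not done and i < max_iters:  # completion criterion, with a safety cap
        result = await Runner.run(drafter, f"Brief: {brief}\nCurrent draft: {draft}")
        review = result.final_output
        draft, done = review.draft, review.meets_criteria
        i += 1
    return draft

# Example: asyncio.run(run_until_done("Invite the customer to our spring launch event"))
```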
>> Hopefully that helps. What else have we got? How does AgentKit compare to the Agents SDK?
>> I'll back up a bit. AgentKit is a suite of products where we've tried to be opinionated about the most useful tools that we at OpenAI have found in our day-to-day work building agents. The Agents SDK powers the entirety of AgentKit, and most of what you can do within AgentKit you can also do within the Agents SDK, or it's available via an API. We're continuing to roll out a ton of changes to bring that parity closer. But we imagine that in the future, AgentKit will also contain features that let you host these workflows in the cloud, so rather than using traditional ChatKit implementations, you could also trigger these workflows via an API. That would essentially let you host the Agents SDK in the cloud.
>> Very cool. Yeah. And I would say Agent Builder is functionally the equivalent of the Agents SDK, but it's the canvas-based, visual way to orchestrate those agents, whereas the Agents SDK is the jump-straight-into-the-code version. So yeah, very cool.
How do you decide between out-of-the-box MCP servers and building your own?
>> Yeah, totally. So we support remote MCP servers, which means the MCP servers have to be hosted in the cloud or otherwise reachable on the public internet to some degree. When we're building our own MCP servers, a lot of our considerations around authentication require us to build our own. That said, a lot of the providers you use every day, think Gmail, your calendar, and so on, likely have out-of-the-box connections where you can just paste in an API key and get started with all the tools we support. For some of these, I think we don't have full capabilities for things like writes; for example, if you want to write an email via the Gmail API, I don't believe that is currently supported, so you might want to spin up your own MCP server there. The thing I really like about MCP is that it allows for that authentication and black-boxes what the flow actually looks like. So whether you want to bring your own personal access token or go through something like OAuth and pass along the token you get to the MCP server, both are totally great options for authenticating to secured sources.
>> Cool. Do we have any more questions?
>> Yes.
>> When do you recommend a classifier agent
with branching logic to different
agents?
>> Yeah, I think this is a great question, and one we get a ton, because as you add more tools and instructions to a model, what we've seen is that performance generally deteriorates. Imagine a world where you had 100 tools: letting the model select among those 100 tools becomes increasingly difficult. More realistically, you might not have 100 tools, but you might have 20, and each agent, or each use case for an agent, might use those tools in entirely different ways. So one way I like to think about agents is to stratify the logic by what the core competency of each agent is: what is the set of tools I want this agent to use, and only in that specific way? The moment I start confusing the model about how to invoke those tools, or how to interpret the instructions in the context of those tools, I branch off to a different agent. So in the case we had, where we were looking at three different GTM use cases, the email agent that outputs a widget is probably not the best one to also do lead qualification. So for use cases where you're maybe using the same tools but you want to structure the outputs a little differently, or you want the model to interpret the outputs a little differently, it's good to branch out to different agents.
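In Agents SDK terms, that kind of branching can also be expressed as handoffs from a triage agent to specialist agents. A minimal sketch, with placeholder names and instructions:

```python
from agents import Agent, Runner

email_agent = Agent(
    name="Email agent",
    instructions="Draft outbound emails using the campaign guidelines.",
)
qualification_agent = Agent(
    name="Lead qualification agent",
    instructions="Qualify inbound leads against our segmentation criteria.",
)

# The triage agent owns only the routing decision; each specialist keeps
# its own small, focused set of instructions (and tools).
triage_agent = Agent(
    name="GTM triage agent",
    instructions=(
        "Decide whether the request is an email task or a lead qualification "
        "task and hand off to the right agent."
    ),
    handoffs=[email_agent, qualification_agent],
)

result = Runner.run_sync(triage_agent, "Can you qualify this inbound lead from Acme?")
print(result.final_output)
```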
>> Cool.
Alrighty. Can we use AgentKit for multimodal use cases, especially for analyzing images and files?
>> Totally. This is a great use case for AgentKit. We do support file inputs within the preview section that we covered, and you can even play around with uploading files in the playground. What I find really interesting is that this behavior propagates to ChatKit as well: if you upload files within ChatKit, they're also passed into the hosted Agent Builder backends.
>> Oh, super cool. Yeah.
>> Okay. So, we're at the end here. We'd love to leave you with a few resources if you're interested in exploring more. The AgentKit docs are a super helpful place to get started. We also released a cookbook the other week that walks you through a very similar use case to the one we showed today, in even more detail. There's ChatKit Studio if you want to play around with ChatKit and see how you can customize it. And then finally, to learn more about upcoming and past Build Hours, there's the Build Hours repo on GitHub.
Awesome. And with that, I think we're at a close. If you want to — right, okay. Upcoming Build Hours: we have two. Agent RFT, building on what we talked about today: how do you actually customize models for tool calling, with custom graders and things like that? That will be November 5th, so we're really excited to build on today's session with that one. And then on December 3rd, agent memory patterns. We hope to see you at both of those; you can get more information and register at this link.
>> Awesome. Well, that's it. Thank you so much for putting this awesome demo together; it was super fun. Thank you all for watching, and I hope you have fun building agents.
Introducing AgentKit—build, deploy, and optimize agentic workflows with a complete set of tools. This Build Hour demos how to design workflows visually and embed agentic UIs faster to create multi-step tool-calling agents. Samarth Madduru (Solutions Engineering), Tasia Potasinski (Product Marketing), and Henry Scott-Green (Product, Platform) cover:
• Build with Agent Builder – a visual, canvas-based orchestration tool
• Deploy with ChatKit – an embeddable, customizable chat UI
• Optimize with new Evals capabilities – datasets, trace grading, auto-prompt optimization
• Real-world examples from startups to Fortune 500 companies like Ramp, Rippling, HubSpot, Carlyle, and Bain
• Live Q&A
👉 AgentKit Docs: https://platform.openai.com/docs/guides/agents/agent-builder
👉 AgentKit Cookbook: https://cookbook.openai.com/examples/agentkit/agentkit_walkthrough
👉 ChatKit Studio: https://chatkit.studio/playground
👉 Sign up for upcoming live Build Hours: https://webinar.openai.com/buildhours/
00:00 Introduction
04:50 Agent Builder
21:27 ChatKit
24:53 Evals
35:17 Real World Examples