Hey everyone, welcome back to another build hour. I'm Christine, I'm on the startup marketing team, and today I'm here with Will and Theo.
>> Hey, I'm Will. I'm on the engineering team building the fine-tuning product.
>> And I'm Theo, a solutions architect working with startups, and with Will in particular quite a lot.
>> So today's topic is agent RFT, which is really exciting. If you've been tuning in to our past build hours, we did a series all about how to build agents, starting with the Responses API and then working our way up to AgentKit, and now we're talking about agent RFT. All of those past build hours can be found on our YouTube, and the purpose of build hours is really to help you build on our API and use our tools. With that, I'll give you a quick snapshot of what this next hour will be about. First, we're going to introduce you to agent RFT. Then we'll spend some time on the task setup and move on to some live demos. We have a really exciting customer spotlight today with Cognition, so we'll be dialing in, then we'll share some customer stories and end with Q&A. On the right side of your screen you'll have a Q&A box, so feel free to toggle over and submit questions throughout the hour. Our team is in the room and joining in virtually to help address those questions, and we'll save a few for the end to answer live. With that, I'll pass it off to Will and Theo.
>> Awesome. So let's kick things off by talking about agents. You're probably joining us today because you're building an agent for your application or your business and you'd like to improve its performance. What makes an agent different from a regular model is its ability to interact with the outside world to complete a task. It doesn't have to go through you all the time or even talk to you; it just gets things done on its own. Now, in order to get things done, the agent has to have access to tools. If you're building a coding agent, for example, it needs access to a terminal, a code interpreter, or maybe even an entire codebase. Or if you're building a customer service agent, it might need access to internal software to look up customer records, billing systems to issue refunds, or even the ability to escalate to a human being. So the agent needs a way to interact with your business context and the outside world to get things done through the use of tools. And the way we think about agents here is that all of their interactions with the outside world go back into the context window. That means that after looking at what it sent into a tool and what it got back, the agent will reason to itself, call another tool, and repeat the process.
>> Yeah, that's super cool. And how does that tie in with our first-party agent products?
>> Yeah, totally. So we care a lot about agents here, obviously, and we're building some of the best agents for specific use cases. Here's how OpenAI agents use tools. For example, Codex has access to a wide range of tools to complete coding tasks end to end, like running tests, reading your code files, or making code changes. So Codex might have access to, say, a code planning tool, a terminal tool, or even a tool to apply git patches. Another example of a first-party agent we've released is deep research, which is now embedded within our agent and GPT-5 products. Deep research has access to a browser, can look through your files, and can also run code. For deep research, this set of tools allows the agent to deliver you the most up-to-date, most accurate research articles.
>> Yeah, that's super cool. And when we work with customers who are using our models and are interested in optimizing them, they work a lot on prompt engineering. What would you recommend to optimize your agents?
>> Yeah, so prompt engineering is honestly a great way to start. We've seen many different ways to improve the performance of agents so far, so let's go through them. As you said, you can steer model behavior by optimizing the prompt; it's almost like instructing the model to do your task better. But let's say you've optimized your prompt and you're still not as satisfied as you could be. You can then optimize the task itself. For example, you can simplify the task, you can add better guardrails around the task to improve the agent's chances of getting things right, you can add or subtract tools, or you can make the tools better at accomplishing what the agent intended to do.
>> Yeah, it's interesting when you look at agents, because we've seen customers be successful at improving the performance of an agent just by changing the description of the tools.
>> Yeah.
>> And even their naming, just because it makes more sense; it's semantically easier for the model to understand.
>> Totally. Yeah, there's a lot you can do to improve the performance of the agent before you move to fine-tuning. But let's say you've tried all these approaches and you still want better performance. That's where fine-tuning comes in. Fine-tuning is a way to train the agent end to end on your task to achieve even better performance.
And what we're here to talk about today is agent reinforcement fine-tuning, or agent RFT. Agent RFT is the way to do this. Agent RFT changes the weights of the model according to a learning signal that you specify, to teach the model what good behavior and less-than-good behavior look like. During training, the agent will explore many different ways of calling your tools to learn how to do better and better as training progresses. And we wanted to remind everyone that base RFT is already a functionality in the current fine-tuning product, but you cannot use it to fine-tune agents. Agent RFT does allow you to do this: it lets the agent call tools while it's exploring during the rollout process, so it can learn from all possible ways of using your tools. You can also specify an arbitrary reward signal through an endpoint, which we call to train the model, so that it gets better and better in ways that matter to you.
So to summarize the benefits of agent RFT: first, it helps you improve the performance of reasoning models. It improves the agent's ability to use tools and reach the best final answer. It's also quite sample efficient, which can be really important in domains where training data is scarce; we'll talk more about specific examples when we go through the customer stories. And the process itself can result in a model that has lower latency and is better at agentic tasks.
>> Okay, that's really cool. And when you mention latency, I think that's a very key point. Is it the number of reasoning tokens that drops, or the number of tool calls, or what is it?
>> Yeah, totally. So let's dive in a bit into latency and ML performance and how we can improve both. One of the challenges of making agents work with your business context is that it might be very different from how we at OpenAI train our models. If your tools look and behave the same way as, say, Codex's tools or deep research's tools, which we've trained on, then you're in luck, because the domain of the tools is going to be similar between the base model and your task. But your business context is most likely specific to you. That means your agent might not be used to using your tools in the ideal way. It might call a tool too many times, or it might call five different tools when calling one tool would have been better for what it was trying to do in a given moment. Using agent RFT, you can align these domains. It's possible to train the model to use far fewer tool calls to achieve the same or sometimes even better performance on a given task. That means lower latencies for you and faster experiences for your end users. This process happens naturally because we do apply a light penalty to the number of tokens the model uses to reason.
But perhaps you want to impose a constraint instead, because rather than relying on this natural process of the model learning to use fewer tool calls and fewer tokens, sometimes you want to make sure the model stays within a given tool call budget and doesn't go over that limit. Given how important tool calls are in affecting latency, this can really reduce the latency of your rollouts. So agent RFT allows you to specify this cutoff and train the model to stay within the given budget while preserving or exceeding the original ML performance.
But ultimately you're probably here in the first place to improve the ML performance of your agent. Agent RFT can help you do this by, first, training the model to reason better across tool outputs, and second, training the model to use tools better in the first place. All of this is learned organically during the exploration and rollout process, as the model tries many different ways across the search space to call your tools and then reason about the outputs from your tools to arrive at a better answer. So hopefully it hill-climbs nicely on your task.
>> Yeah, that's awesome, and I really want to try it out. I want everyone to try it out, actually. And I know you worked so hard to make this work.
>> We worked hard.
>> Yeah, the whole team. So under the hood, how does it work? How does it communicate with the tools?
>> Yeah, so let's dive in. In order to make all this work, we've introduced several major new updates to the existing RFT product. First is the ability of the model to call tools during training via calls to your tool endpoints. And second is the ability for you to specify a grader in the form of an endpoint that we call to get your custom reward signal. These two additions mark the first time we've allowed our models, even our frontier models, to interact with the outside world during the training process: through your tools as the model is exploring and doing its rollouts, and through your reward signal when we're ready to update the model.
To dive even deeper into exactly what happens during the training process: for each agent rollout, we assign a unique identifier to all tool calls and final answers that come out of that rollout. When the agent calls your tools, we attach that unique ID to the tool call so that your system can recognize different tool calls as originating from the same rollout. This allows you to keep track of rollouts as they happen, which can be important for state management if you choose to do it, in your own database or your own backend. Then, when we emit the final answer and call your grader, you can attach all the context from the agent to the final answer through that unique identifier and pass it into your grader, so you can grade with very holistic context.
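(For reference, a minimal sketch of what a tool endpoint with per-rollout state tracking could look like; the request fields such as `rollout_id` are illustrative assumptions, not the exact wire format the platform uses.)

```python
# Minimal sketch of a tool endpoint that tracks per-rollout state.
# Field names like "rollout_id" are assumptions, not the exact payload schema.
from collections import defaultdict

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Per-rollout scratch space, keyed by the unique rollout identifier.
rollout_state: dict[str, list] = defaultdict(list)

class ToolCall(BaseModel):
    rollout_id: str   # unique ID attached to every call from the same rollout
    name: str         # which tool is being invoked
    arguments: dict   # tool arguments produced by the model

@app.post("/tool")
def call_tool(call: ToolCall) -> dict:
    # Record the call so the grader can later see the full trajectory.
    rollout_state[call.rollout_id].append({"name": call.name, "args": call.arguments})
    # ... dispatch to the actual tool implementation here ...
    return {"output": f"ran {call.name}"}
```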
>> Yeah, that's awesome. I think what I find most powerful here is that all the tool calls and the grading happen in your environment. So it can match your production environment exactly, and then your model will just not be surprised when it sees that specific tool and will know how to call it.
>> And it also gives you so much flexibility in the grading. Currently on our platform we have a couple of ways of grading, but here, because you receive every tool call, you can store them, you can grade them, and really shape the policy that you want for your model.
>> Yeah, absolutely. So there's a lot of flexibility here, and we do hope that agent RFT helps you teach agents to achieve frontier performance on your tasks. So enough theory; let's now illustrate how agent RFT works with a real-world example. We're going to fine-tune an agent to perform better on FinQA (financial QA), a benchmark that gives a model a financial report and asks it to answer questions about it that require numerical reasoning. The original benchmark, which is a published academic benchmark, includes in the prompt the relevant financial report the model needs to answer the question. But we've decided to make things harder, because we like doing things the most difficult way here at OpenAI. We've modified the benchmark and made it a lot harder by only giving the model the question itself, without the context, no report. And we require it to use tools, like an agent would, to search for the correct report in a pile of 2,800 financial reports to answer the question. And to make the task even more challenging, we require that the model arrive at its answer within 10 tool calls.
>> Yeah, so that's so much harder, because you have to know where to look. Then once you've found where to look, you have to reason over it, and all of this within a very constrained budget.
>> Totally. Yeah. So here are the tools we've given the agent access to. We have a search tool, which is a semantic search tool. We have a list tool, which goes through all the directories and document paths and tells you what's in the file system. And we have this funnily named cat tool, which is our engineer brains just naming things the way we understand them; cat returns a document given a path, so it's kind of like opening a document on your computer.
Let's go through an example. Here's a sample question from the benchmark. After seeing this question about Intel's return, the agent might call the search tool with some query to try to find the relevant documents and the information it needs. The search tool might return something like this, which has a table and a text section with all the relevant numbers needed to answer the question. And here's the grader for this task, which is how we generate the reward signal for the agent's final answers. Just to keep things as simple as possible, since we've already overcomplicated things by making the benchmark harder, we used a model grader for this task. We could have used an endpoint grader, which is something we'll cover soon. We also could have used a string grader, which rewards the model for exact string matches to the ground truth, but that's super brittle and can penalize the agent for minor formatting errors, like writing out "32 dollars" instead of using the dollar sign. It penalizes the model in ways we don't want. In our case, we also want to give the agent partial credit for answers that are really close to the ground truth, like rounding errors, say 0.999 when the ground truth is actually 1.
>> All right, I'm going to hand it over to Theo to talk about the demo and the training process itself.
>> Thank you, Will. That was a great setup. So let me dive into some code here; I'm just going to make sure the right screen is being shared.
>> Yep.
All right. So here you should be seeing the code. The first thing we're going to look at is the tool server. We're using Modal to do this, because it was very fast to just set up a FastAPI endpoint, and we have some instructions on how to set that up. The main idea is that we first set up a base image, a Debian image, with FastAPI, pandas, NumPy, and the OpenAI library; those are the libraries we need to run our code and the tools. We also add the corpus, all the data and the documents, so that the model can actually look inside them.
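(A rough sketch of the Modal setup being described, assuming Modal's Python SDK; the method names, app name, corpus path, and module layout are assumptions.)

```python
# Rough sketch of the Modal app: Debian image + libraries + corpus baked in.
# Names and paths here are assumptions.
import modal

image = (
    modal.Image.debian_slim()
    .pip_install("fastapi", "pandas", "numpy", "openai")
    .add_local_dir("./finqa_corpus", remote_path="/corpus")  # the 2,800 reports
)

app = modal.App("finqa-tools", image=image)

@app.function()
@modal.asgi_app()
def serve():
    # Return the FastAPI app that exposes the search / list / cat endpoints.
    from tools_server import web_app  # hypothetical module holding the FastAPI app
    return web_app
```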
Here we define the different tools. Let me just look at the search tool, because as Will mentioned, it's a semantic search. We create some embeddings with an OpenAI embedding model and then compute cosine similarity, so it's very similar to the RAG setups you're probably all familiar with, and that's how we build the search tool, which is defined here. I'm not going in depth on the other tools, because list and cat are quite straightforward.
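(A minimal sketch of such an embedding-based search tool, assuming an OpenAI embedding model and plain cosine similarity; the model name and corpus layout are assumptions.)

```python
# Minimal sketch of the semantic search tool: embed the query, rank documents
# by cosine similarity. Model name and document paths are assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Pre-compute document embeddings once at startup.
doc_paths = ["/corpus/intel_2015.txt", "/corpus/aapl_2014.txt"]  # ... up to 2,800 reports
doc_texts = [open(p).read() for p in doc_paths]
doc_embs = embed(doc_texts)

def search(query: str, top_k: int = 5) -> list[dict]:
    q = embed([query])[0]
    sims = doc_embs @ q / (np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(q))
    top = np.argsort(-sims)[:top_k]
    return [{"path": doc_paths[i], "score": float(sims[i])} for i in top]
```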
Then, the way we provide these tools to the platform and to the model is through this list, where we have the JSON definitions of the tools: a name, which is going to be search; a URL, which is the Modal URL I just set up; and a set of headers containing an auth token so that only I can access those endpoints.
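(A sketch of that tool list; it follows the fields just described, name, URL, and headers, but the exact schema expected by the platform is an assumption.)

```python
# Sketch of the tool registration list passed to the platform (schema is an assumption).
import os

AUTH_HEADERS = {"Authorization": f"Bearer {os.environ['TOOL_ENDPOINT_TOKEN']}"}
BASE_URL = "https://<your-modal-app>.modal.run"  # placeholder for the deployed Modal URL

tool_endpoints = [
    {"name": "search", "url": f"{BASE_URL}/search", "headers": AUTH_HEADERS},
    {"name": "list",   "url": f"{BASE_URL}/list",   "headers": AUTH_HEADERS},
    {"name": "cat",    "url": f"{BASE_URL}/cat",    "headers": AUTH_HEADERS},
]
```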
>> Oh, and if you put your name in the URL, that's great.
>> Yeah, it's mine.
>> Yeah, it's yours. No one else's.
>> Right. So that's how we set up the tools, and then we can have a look at the grader. As Will mentioned, we're using a model grader, because in the data set the answers don't always have the same consistency in the number of decimals, or whether the dollar sign goes before the number or "dollars" is written after it. To prevent this brittleness we just use a model grader, and as Will mentioned, we provide a partial reward of 0.5 if the answer is close but not exact. This also allows us to give a reward of 1 if you say 7% instead of 0.07 as an answer. This is very important, because we want to make sure we provide the right signal to the agent, or else it's not going to be able to learn what was a correct reasoning path versus what was not. We're using GPT-4.1 as the grading model, and then the response format, which you might be familiar with from previous build hours or RFT engagements.
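(A sketch of what a model-grader configuration along those lines could look like; the grader type, field names, and template variables are assumptions based on the existing RFT grader format, not the exact config used in the demo.)

```python
# Sketch of a model grader giving full and partial credit (schema is an assumption).
model_grader = {
    "type": "score_model",
    "name": "finqa_model_grader",
    "model": "gpt-4.1",
    "input": [
        {
            "role": "system",
            "content": (
                "Grade the answer against the ground truth. Return 1.0 if it is "
                "numerically equivalent (e.g. 7% vs 0.07, $32 vs 32 dollars), "
                "0.5 if it is close (e.g. rounding differences), else 0.0."
            ),
        },
        {
            "role": "user",
            "content": (
                "Question: {{item.question}}\n"
                "Ground truth: {{item.answer}}\n"
                "Model answer: {{sample.output_text}}"
            ),
        },
    ],
}
```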
>> Right. And I also want to remind everyone that we used a model grader here, but we also have the endpoint grader that you can use, where we basically call your endpoint via the public internet so that you can define your custom reward signal. In this case, for simplicity's sake, we just chose to go with a model grader.
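(For comparison, a minimal sketch of what a custom grader endpoint could look like; the payload fields are assumptions, not the actual request format.)

```python
# Minimal sketch of a custom grader endpoint (request fields are assumptions).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GradeRequest(BaseModel):
    rollout_id: str     # lets you join the final answer with stored tool calls
    final_answer: str
    ground_truth: str

@app.post("/grade")
def grade(req: GradeRequest) -> dict:
    # Pull back whatever context you stored for this rollout, apply your own
    # logic, and return a scalar reward.
    reward = 1.0 if req.final_answer.strip() == req.ground_truth.strip() else 0.0
    return {"reward": reward}
```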
>> Yeah, totally. Thank you, Will. All right. So now, what do we always do before running a training? You can imagine that we optimize a prompt, et cetera. What we're going to do is run a baseline to see how GPT-5 performs. And if you remember the reinforcement fine-tuning build hour we did with Prashant a couple of months ago, we were very interested in the variance of the model: given a specific sample, what is the variance of the scores it gets for that sample? So I'm going to show those plots. I actually ran the model three times on each sample from training and validation. I'm showing validation here because it's just 100 samples, whereas training is a thousand samples, which is a bit too large and hard to read. This is the kind of plot you might have seen last time, and I'm going to describe it again. But how does it look to you, Will?
>> Yeah, I'm going to need some help interpreting this graph, because there's a lot of stuff going on here.
>> Yeah, all right.
>> Take it away.
>> For sure. So on the x-axis you have each different sample, and for each sample we ran the agent three times; on the y-axis we have the score. If you look at each point, the red cross is the best score it got out of the three runs. So if you look at this sample here, it got zero every time. If you look at the sample at the top right, it got one every time. And in the middle, sometimes they get zero, sometimes they get one, and sometimes they get 0.5.
>> Right.
>> So the red cross is the best of the three runs, the thick light-blue bar is the mean over the three runs, and the thin blue bar is the variance. And when I see a plot like this, I don't think it's a great plot for reinforcement fine-tuning, because many of the samples have no variance.
>> Okay.
>> But we still have a fraction of them, probably 15% in the middle, that do have variance, and it's this variance that is going to enable the model to learn what a good reasoning path is versus what is not a good reasoning path.
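(A rough sketch of how a per-sample variance plot like this could be produced from the three baseline runs; the file and column names are assumptions.)

```python
# Per-sample variance plot: best, mean, and spread over three runs per sample.
# File and column names are assumptions.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("baseline_validation_runs.csv")  # columns: sample_id, run, score
stats = df.groupby("sample_id")["score"].agg(["mean", "max", "std"]).sort_values("mean")

x = range(len(stats))
plt.errorbar(x, stats["mean"], yerr=stats["std"].fillna(0), fmt="none", alpha=0.5)  # thin bar: spread
plt.bar(x, stats["mean"], alpha=0.4)                                                # thick bar: mean of 3 runs
plt.scatter(x, stats["max"], marker="x", color="red")                               # red cross: best of 3
plt.xlabel("validation sample")
plt.ylabel("score")
plt.show()
```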
>> Yeah, totally.
>> And so we expect that all those samples will actually provide some signal to improve the performance of the model.
>> Right.
>> And very importantly, this is on the validation set, but you can trust me that the distribution on the train set is quite similar.
>> Yeah, totally. And maybe this is a good point to talk about the compute multiplier, because the compute multiplier controls the amount of exploration the model does. Over three repeats, each data point is only being explored three times; there are just not enough samples to hike those zero scores up into the blue-bar region. But if we set the compute multiplier higher so that the model explores more, it has more chances to get some nonzero reward out of its exploration. That's where exploration really matters.
>> Yeah, totally. All right. So now that we've seen this, we also share a very simple notebook that you can run through, which will let you launch the training on our platform using our API. Now let me go and find one of the training runs we did.
All right. So now we're on the OpenAI platform, which you might remember in some ways, and you can see the job that we ran; it has an ID, and we're going to explore the hyperparameters we used and then see the curves for reward, output tokens, tool calls, and so on. At a very high level, I ran for three epochs, meaning we go through each sample three times. The batch size was set to 16, and as Will mentioned, there's a compute multiplier, which is a very important number for the amount of variation we'll observe during training; here I've set it to 1. If you want more variation, more compute, and more chances of stumbling on good reasoning paths, you would bump up this compute multiplier. But you also have to remember that you are hosting endpoints, and during training we're going to hit those endpoints. So if you increase the compute multiplier, you're going to have to increase the robustness of your endpoints as well.
>> Yeah.
>> Right.
>> Totally.
>> And I'm using reasoning effort medium, and eval samples is the number of times we evaluate each point from the validation data set, to have robust curves during training.
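(The hyperparameters just described, written out as a config sketch; the field names are illustrative, not the exact API parameter names.)

```python
# The run's hyperparameters as a config sketch (field names are illustrative).
rft_hyperparameters = {
    "n_epochs": 3,            # three passes over the 1,000 training samples
    "batch_size": 16,
    "compute_multiplier": 1,  # raise for more exploration -- and more endpoint traffic
    "reasoning_effort": "medium",
    "eval_samples": 2,        # times each validation sample is scored per eval
}
```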
All right. So let's have a look at this reward curve. You can see that at the very beginning we start at a baseline of around 0.6 validation reward. The purple curve is the score on the validation set, the full validation set run twice per sample. The green curve is the model's performance on the specific batch it's training on. Here we have a batch size of 16, so this value at step two, 0.461, means that across all the trajectories we ran over the 16 samples in that batch, the average reward was 0.461. It's less representative and less robust than the validation curve, because the validation curve is computed on the full validation data set.
>> Right.
>> And what we can see here is that very rapidly, in just 10 steps, the model improves its performance by about four percentage points, from 0.59 to 0.63.
>> Yeah.
>> That's quite a lot, and it probably directly learned how to use the tools much better.
>> Yeah.
>> And if you go on a bit longer, you can see that the reward goes down a little bit, and the assumption is that the model is exploring new solutions to try to push the performance even more.
>> And it manages, at the very end, to push a little higher.
>> Yeah. And I always love correlating the reward with the mean reasoning tokens, because here you can see that in the big exploration phase in the middle, where the reward was going down, the model was starting to think more and more.
>> Right. Yeah.
>> And sometimes it's just not necessary to think more and more; maybe you just have to learn how to use your tools better, or use different tools. This is what I love about the UI: we also show tool calls per rollout, which shows how the distribution of the number of tool calls evolves during training. You can see that at the very beginning we use a total of probably eight or nine tool calls, and then it drops quite significantly.
>> Yeah.
>> Down to much lower numbers. So you can see that the performance gain we saw after 10 steps is also correlated with a huge drop in tool calls.
>> Yeah. And you can assume the model is just learning to use those tools much more efficiently.
>> And I think that's really awesome, because it shows how we're closing the distribution shift in just those 10 steps. Of course, 10 steps is the number that worked here; in your case it might be more, it might be less, but it's very interesting. And then you can see that all this region of interest, where the reward was going down and the number of reasoning tokens was going up, is a region where the tool calls were definitely shifting. And if you go to the very end, you see that the model starts doing a lot of list calls. I'm not exactly sure why, but this allows it to reach higher rewards, and it kind of converges to a policy; you can see where list becomes flat and all those lines become more or less flat. I think that's very interesting. As a business, you might have stopped after 10 steps, because you don't want to plateau for too long.
>> Mhm.
>> But it's always interesting to see what happens beyond.
>> Sure.
>> All right. So those are the high-level curves I wanted to explore. There are many other curves, such as the number of tokens per tool call, which will also give you a sense of the speed of the training run: the more output tokens, the longer it will take to train the model. But let's go back to the code, because so far we've only done a very high-level analysis. Because we have access to these models, we can observe the traces in depth and try to understand what is happening under the hood.
what is happening under the hood. So
I've loaded all the results. I run the
evaluations three times on the
validation set for the baseline model
and step 10 model that we saw the big
increase and the decrease in reason in
the number of tool calls. And what I'm
going to look at is as will mentioned
like the performance initially but also
the latency and also the output tokens.
So let's do some quick plots here; you'll find the code for all of them. Here you can see a very simple plot with the average reward over the 3 × 100 samples and the average latency. Technically you want to be at the top left: higher reward, lower latency. And that's what we get going from the baseline to step 10. We have a five-second reduction in latency, which is approximately 10%, and an 11 percentage point increase in reward, which is quite significant.
>> Yeah. Wow.
>> Latency on its own can be a bit hard to interpret, so what we can also look at is the number of tokens, because it gives you some information on the time it takes to respond. And you can see the mean output tokens went from about 2,500 to 1,500.
>> Wow. Huge reduction.
>> Yeah, a huge reduction. That's probably from less reasoning and fewer tool calls.
>> Right.
>> All right. So now that we've seen the high level, let's look into the tool calls per trace. I also ran something here to compute the means, and you can see that for the baseline model we were around 6.9 tool calls per trace, while the fine-tuned model is only at 4.2. So that means a smarter model, a faster model, one that's closing the distribution shift.
>> Yeah.
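(A sketch of the comparison being described; the file and column names are assumptions.)

```python
# Baseline vs. step-10 comparison over the evaluation rollouts
# (file and column names are assumptions).
import pandas as pd

# One row per rollout: model, reward, latency_s, output_tokens, n_tool_calls.
runs = pd.read_csv("finqa_eval_rollouts.csv")
summary = runs.groupby("model")[["reward", "latency_s", "output_tokens", "n_tool_calls"]].mean()
print(summary)
# From the demo: tool calls per trace dropped from ~6.9 to ~4.2, mean output tokens
# from ~2,500 to ~1,500, latency by ~5 s, while reward rose by ~11 points.
```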
>> And if we look more in depth, we can plot a quadrant plot, which is a plot I really love to make after running RFT to understand what's really happening in the model's behavior. Let me walk you through it very simply.
>> You and your plots, man.
>> Yeah, I think it's a great way to get an understanding of the policy change. So on the y-axis you have the delta reward: we take the reward of step 10 minus that of the baseline for each of our data points. And on the x-axis we have the delta in tool calls, again step 10 minus baseline. The quadrant where you want to be is the top left, because you want higher reward and a lower number of tool calls. And you can see that we have 29 points in this region, which means a large fraction of those 100 samples are just faster and higher reward. Then there's another interesting group: the ones where there is no delta in reward but a decrease in the number of tool calls, and that's also quite a big fraction, probably more than 50 samples. And then there are some samples where the model loses some reward while making fewer tool calls.
>> And this kind of highlights that even if we learned a policy that's fairly general, we might not be able to capture all of the data points, because this policy might be a bit too strict for some of them.
>> Yeah.
>> So that's the trade-off. But we don't have any point in the bottom-right corner, which is the one we really don't want: more tool calls and lower reward. So I'm quite happy with how the model trained and how the policy changed.
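(A sketch of this per-sample quadrant analysis; the file, column, and model names are assumptions.)

```python
# Per-sample "quadrant" analysis: delta reward vs. delta tool calls
# between the fine-tuned and baseline models (names are assumptions).
import pandas as pd

runs = pd.read_csv("finqa_eval_rollouts.csv")  # sample_id, model, reward, n_tool_calls
per_sample = (
    runs.groupby(["sample_id", "model"])[["reward", "n_tool_calls"]].mean().unstack("model")
)

d_reward = per_sample[("reward", "step_10")] - per_sample[("reward", "baseline")]
d_tools  = per_sample[("n_tool_calls", "step_10")] - per_sample[("n_tool_calls", "baseline")]

quadrants = {
    "higher reward, fewer calls": int(((d_reward > 0) & (d_tools < 0)).sum()),
    "same reward, fewer calls":   int(((d_reward == 0) & (d_tools < 0)).sum()),
    "lower reward, fewer calls":  int(((d_reward < 0) & (d_tools < 0)).sum()),
    "lower reward, more calls":   int(((d_reward < 0) & (d_tools > 0)).sum()),  # ideally empty
}
print(quadrants)
```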
>> We can also skim through some of the traces in a little more depth, across all of those tool calls. What's interesting to see is that the model has learned to use each of the tools better. The first line here is the number of tool calls per specific tool, and you can see that it drops for search, for cat, and for list, so the improvement is really general, which is quite cool. And finally, something more about the model's policy, though this would require even more work: we can look at whether the model is being a bit smarter in the way it uses the tools and how often it repeats essentially the same tool call
with slightly different parameters. So here I was looking at bigrams: does it do search and then search again, or cat and then cat? And you can see that the number of repeats drops significantly, from about 1,000 to 500, so we divide it by two. The model is much better at making the right tool call the first time, so it doesn't have to repeat essentially the same call with slightly different parameters.
>> Right.
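(A sketch of counting those consecutive repeats of the same tool; the trace format, a list of tool names per rollout, is an assumption.)

```python
# Count consecutive calls to the same tool within each trace
# (the trace format is an assumption).
def count_consecutive_repeats(traces: list[list[str]]) -> int:
    repeats = 0
    for tool_names in traces:
        for prev, curr in zip(tool_names, tool_names[1:]):
            if prev == curr:  # e.g. search immediately followed by search again
                repeats += 1
    return repeats

baseline_traces  = [["search", "search", "search", "cat"], ["list", "cat"]]  # illustrative
finetuned_traces = [["search", "list", "cat"], ["search", "cat"]]            # illustrative
print(count_consecutive_repeats(baseline_traces), count_consecutive_repeats(finetuned_traces))
```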
>> And if we look in depth, this is a very cherry-picked example, but you can see the baseline model calling the search tool six times in a row before doing list, cat, another search, and cat, whereas the fine-tuned model just follows a very simple policy of search, list, and cat, and then probably just reasons over the output to provide a final answer.
>> Yeah, totally.
>> Yeah. And I just want to add that on this benchmark, there's no overlap between the documents required to answer the questions in the train set and those required in the validation set. However, the model is still operating on the same pile of documents, so it still has access to basically the entire file structure, but for the questions that are asked there's essentially no overlap in the documents required. So in some ways it learns how to use this file system and knows what documents are inside, but ultimately the documents are kept separate.
>> Yeah, that's pretty cool.
>> So that probably matches a typical business use case, where you have an existing corpus that you want the model to learn to use better.
>> Yeah, totally.
>> All right, cool. So that was a quick demo. You'll find the code if you want to go through it and run it. At a very high level, we also have some advice on how to be successful with agent RFT. The first one is that you need a well-specified and constrained task, mainly in the sense that you need consensus from people who have domain knowledge, or an aesthetic sense for visual tasks; there should be one real answer, or at least people should agree on what a good answer is. This is very important because you want to send the model a signal that is consistent, and not say in one example that answer A is good and in the next that answer A is not good, because then the model will get confused and will not learn how to reason better.
>> Yeah.
>> The second one is nonzero baseline performance. We saw this initially in our variance plot: you need the model to be sometimes right, or else, if it's just never right even after running a hundred times on the same sample, it will probably never learn, especially if that's the case across your whole data set.
>> Right.
>> Then there's improved accuracy @ k, which is very interesting. If you run the model multiple times for each sample, then instead of looking at the average performance, you can look at the average performance of the best trajectory for each sample. That gives you information on the variance and on how often the model gets it right. During training, we're going to nudge all the trajectories to match those best trajectories. And technically you can then bootstrap on this, because the model will generalize across other samples, bring in some new reasoning patterns, and probably keep on pushing, and you can do that multiple times.
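(A sketch of that accuracy-at-k idea: compare the plain average score to the average best-of-k score per sample; the file and column names are assumptions.)

```python
# "Accuracy @ k": plain average vs. average best-of-k score per sample
# (file and column names are assumptions).
import pandas as pd

runs = pd.read_csv("baseline_validation_runs.csv")  # columns: sample_id, run, score
per_sample = runs.groupby("sample_id")["score"]

avg_score = per_sample.mean().mean()  # average over all trajectories
best_of_k = per_sample.max().mean()   # average of the best trajectory per sample

# A large gap suggests the model can already solve many samples sometimes,
# so RFT has headroom to nudge all trajectories toward those best ones.
print(f"average: {avg_score:.3f}, best-of-k: {best_of_k:.3f}")
```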
>> And finally, quality over quantity. In this example we used 1,000 training samples, which is quite a large number, but I've done a lot of engagements with far fewer samples, probably 150, and we've been quite successful. Again, the idea is really about how good the data is; you don't want any mixed signals going to the model.
>> Right, yeah.
>> All right, so that was the performance side, what you do beforehand. Now, on the infrastructure side, really related to our product: what you want to do is mirror production behavior. You have the opportunity to host your tools, so just go for it and make them very similar to production, so that everything you improve during training actually translates to your product. The second part is investing in designing your grader. The grader will really shape the way the model behaves, and therefore the model's policy, so it's very important that it be aligned with your domain knowledge and hard to game and hack. That said, this is genuinely hard, so as soon as you have something that is even a little bit hard to game, you should go for it and try it. And then preferably have some gradient: as Will mentioned, if the reward is just binary, just a string check, that will be limiting, because by nature many problems don't have a simple yes-or-no answer.
>> Right, and you want to give the model partial credit.
>> Exactly, yeah.
>> And you want the model to know that it was going in the right direction. Exactly. Like here, in this case, we could have thought of adding some reward for reading the right file, so it knows it's reading the right file and maybe just the reasoning is wrong.
>> Right. It's like teaching a person, kind of.
>> Yeah, that's a little bit how I think about it. All right. And then, limit the tool call output length, because otherwise it's just going to make your training very slow and also confuse the model. If you can work on outputting only what is necessary, that will make everything more efficient across the board. And I think that's also very reasonable advice for any agent.
>> Totally. And it saves you money, too. You don't want to shove tons of useless tokens into the context window.
>> Yeah.
>> So, neat.
>> All right.
>> Awesome.
>> Cool.
>> Thanks so much for that. That was incredible.
>> No, thank you.
>> Great analysis and all the plots and charts.
>> Thank you, Will and Theo, for walking us through that. I'm really excited about this next segment, our customer spotlight. We're going to hear directly from a research engineer at Cognition, Sam Pretty. So please welcome Sam.
>> Hey everyone. I'll quickly take over the screen sharing. Is that good?
>> Yeah, as long as it's coming from your computer.
>> Yeah, it is.
>> Sweet.
>> Thank you, Theo, Will, and Christine. So hi everyone, I work as a research engineer at Cognition. At Cognition we build Devin and Windsurf. Devin is an autonomous AI engineer that works independently on solving tasks in your codebase, and as part of my work at Cognition I work on improving models to make parts of Devin smarter. So I'm really excited to share what we've been working on with Will and Theo using the agent RFT feature.
So, one of the tasks in Devin: when you give an initial query to Devin, the first thing it does is go into a planning mode to try to figure out what it needs to do to actually solve the task. From a UX perspective, we don't want the agent to spend too much time in planning mode, because we want Devin to start working and showing edits to the user as soon as possible. So one of the motivations was: can we fine-tune GPT-5 or other frontier models so that they get to this editing stage as quickly as possible while still maintaining or even improving accuracy? The way we designed this task is that, given the initial user prompt, we restrict the set of tools available to Devin. In this case it's just read file and shell, because we don't really need to make any edits at this stage, and then we let the agent explore and figure out which files to look at and which files to edit to solve the task. The motivation for the shell tool is so that the model can run commands like grep and find to search the codebase for certain strings the user might have put in the query, or just look for certain file names, things like that.
As mentioned earlier, we obviously need the tool calls, and then the data set and the reward. For the data set, we collected a bunch of real-world repositories and user queries from those repositories, and then we labeled which files the user actually edited to solve each task. Ideally, we want this sub-agent to return those exact files, so that in the following step the agent can continue and make the edits to those files.
For the reward, we use the F1 score. F1 balances both precision and recall: if we used just precision or just recall, the model would either be very conservative and return only a few files, or return too many things to try to get everything. We want to be in the balance, so that the agent that comes along afterwards doesn't have its context polluted with too much data.
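(A sketch of an F1-style reward over predicted versus ground-truth edited files; this is illustrative, not Cognition's actual grader code.)

```python
# Illustrative F1-style reward over file sets (not Cognition's actual grader).
def f1_reward(predicted_files: set[str], ground_truth_files: set[str]) -> float:
    if not predicted_files or not ground_truth_files:
        return 0.0
    true_positives = len(predicted_files & ground_truth_files)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_files)   # penalizes returning too many files
    recall = true_positives / len(ground_truth_files)   # penalizes missing needed files
    return 2 * precision * recall / (precision + recall)

print(f1_reward({"src/app.py", "src/db.py"}, {"src/app.py", "src/models.py"}))  # 0.5
```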
So let's get to the eval results. We started off with the GPT-5 base model being somewhat lower than the current frontier model, and we ran two experiments. One experiment was GPT-5 with a smaller data set of around 100 samples, so 100 tasks across varying repositories, and then a larger experiment with around a thousand samples. One thing we tried to maintain was that the sets of repositories were disjoint between the train and the eval case, because we wanted to make sure the model wasn't just learning things about the data set. Ideally, when we use this in real life, the trained model will never have seen certain repositories, because they'll be private repositories. As you can see, even with the smaller data set it already beats the base model by quite a lot, and with the larger data set we get an even further boost. The plan action score here is basically the F1 score: we look at all the files the model looked at, and at the end the model outputs what it thinks the right files to edit should be, and we compare that with the labeled ground truth.
During the experiment, some of the things we noticed are that the model starts learning how to do a lot of parallel tool calls. If you look at the traces, the first action the model takes will kick off something like eight different things: listing the repos, grepping for strings, and then, once it gets the results from those tool calls, it will independently explore all of those threads by again running more parallel tool calls. And because running a tool call such as read file is quite a bit faster than the actual model inference, it helps a lot that these back-and-forths are reduced so much. For example, when we put this in Devin directly, we noticed that on the baseline it would take around 8 to 10 back-and-forths with the model to get to the end of planning mode, but with the fine-tuned model we would be done in about four back-and-forths, so that cuts the time roughly in half. Obviously, sometimes the model could learn to run a tool call that takes a longer amount of time, so we do try to penalize things like doing too many tool calls, because that takes a lot of time.
Also, during training, we did have to penalize the model if it took too long, because we don't want the model to keep exploring and never be satisfied. So yeah, we noticed that with this agent RFT feature we can push an already-frontier model like GPT-5 even further on a specialized task when we have a clear reward for what we want to optimize.
For the infrastructure, as mentioned earlier, we run both the tool calls and the grader as remote endpoints. The way the training works is that at every step the platform sends us a bunch of rollout requests: given a certain sample, the model does the rollout, and there are something like 32 copies for each sample. For each rollout, we spin up a new VM, run the tool calls in that VM, and return the results to the platform. Then, at the end, when we get the final answer, the platform calls our grader endpoint, where we compare the trajectory: we look at the list of all the tool calls the model made in that particular rollout as well as the final answer, and we give it a score based on the labeled ground truth. In this case we decided to go with isolated VMs because, as you may remember, we used a shell tool, so the model could decide to take some destructive actions. We didn't want one rollout to affect the other rollouts in case the model goes crazy and runs rm -rf or something like that. We used VMs because we could reuse the production Devin VM infrastructure, where we give every Devin instance a VM, but I think containers work well for this purpose as well.
purpose as well. Um yeah and some of the
some of the interesting things we
noticed was that the RL is quite bursty.
So um at at the beginning of every roll
out they would send us like 500 new
rollout requests. So um you definitely
need to like handle that because that's
like 500 new VMs starting instant at the
same time. Um and then the other other
kind of like foot gun is that um
sometimes like let's say there's
infrastructure error um and the VMs fail
um the it it does like what ends up
happening is the model gets zero reward
because like the tool calls fail and
like the model can't figure out what's
going on. Um and while that's not the
model's fault, that does lead to the
training kind of collapsing or like the
like the model learning in a bad way
because even the model might have done
something good, it got a zero reward. So
um it is good to have a lot of
monitoring on like when there's tool
called failures or like you know there's
abnormal abnormal issues with the model.
Um because sometimes it could be that
the model just has formatting issues and
it's not calling the uh the tool
correctly. Um but uh other times it
could be just our infra issue. But
anyways, uh thank you like to Will and
like Kathy for like helping me debug all
of these issues
>> and all the other people. Yeah,
>> for sure.
>> Yeah. And anyways, it's been it's been
really exciting exciting to try this
agent RF feature and like being able to
tune the performance of GP5 even more.
>> Yeah, thank you.
>> Yeah, it's been incredible. Thank you so much for working with us so closely on this, and really glad to ship an agent to you that seems to be beating the state of the art.
>> Yeah.
>> Do we have time for questions for this part?
>> Let's go through some success stories that we can share first, and we do have quite a few questions, so I want to make sure we have enough time for those.
>> Sounds good.
>> Thanks, Sam. We'll catch you later.
>> Thanks so much. Thank you.
>> Bye-bye.
>> All right. So yeah, we've seen a great success story with Cognition, and we just wanted to showcase some others to show the versatility of agent RFT. Let's start with one I worked on really closely, with Ambience.
>> Closely with all of them.
>> Yeah, that's true. But Ambience is a healthcare company; they're embedded in some hospitals, and one of the tasks they look at is ICD-10 coding, and they have an agent for this. ICD-10 coding is the work you have to do when you want to do billing after a session with a patient: you have to map the topics, the illnesses, the actual diagnoses of what was discussed to specific codes, and those codes are very precise, and there are about 70k of them, so it's quite a hard task. What we're looking at here is: we have a transcript between the doctor and the patient, and automatically from that transcript we want to propose the right ICD-10 codes. This requires a very nuanced understanding of the discussion but also a lot of medical reasoning, and that's why Ambience looked at GPT-5 and was using GPT-5. One other aspect is that it has to be quite fast, so they use GPT-5 with low reasoning effort because of the way the doctors use it.
If you look at the plot on the right-hand side, we started with GPT-5 low hovering around a 0.5 F1 score, then we built the agent with a tool that searches over those ICD-10 codes, and then we ran agent RFT on that model, and you can see the jump from 0.52 to 0.57. It might look small, but the highest achievable performance is around 0.70 to 0.75, because we're looking at a task that is slightly subjective, and doctors don't always agree on what the right codes are. So this is really a significant jump for them. And not only are we seeing this increase in F1, but the fact that we're fine-tuning, as we've seen during the whole session, also allowed them to reduce latency: there's an 18% reduction in average response time, and that halves the number of samples that are above their latency threshold in the product. So that was a great use case, and it was great working with Brandon, Patrick, and the team.
Then another very different use case, no more healthcare: here we're looking at Genspark's slide-creation agent. Genspark has amazing agents, and one of them is an agent that builds slides. The user communicates with the agent, which makes different tool calls, and at the very end the slides are sometimes not aesthetically very pleasing: there's a bit too much text, or they're too long, and therefore they use a reasoning model to try to harmonize the output. That's what we fine-tuned. What was great working with Flame and the team was that they worked a lot on their model grader, with different types of graders to judge both the content and the visual aspect. And they were extremely happy with the output; I also find the slides on the right-hand side quite pretty. In terms of numbers, it provided an 88% improvement on bad cases over the existing models, which is a significant number that we're very happy with.
>> So yeah, Will, do you have other use cases?
>> Yeah, absolutely, man. We should use Genspark for the next build hour slides; these look great. So, moving on, just to show you how diverse the success stories on agent RFT can be: we worked closely with Mako to build a GPU kernel writing agent. Mako is building agents to write GPU kernels, which is really difficult for LLMs because there's a scarcity of training data out there compared to other domains, and that's especially true for new hardware. If Nvidia puts out a new accelerator, there just aren't enough examples of performant kernels. But using agent RFT, as few as 100 PyTorch prompts were enough on their own for GPT-5 to learn how to write fast kernels for a new hardware platform, as long as you have a good grader, which is what Mako worked really hard on. That allowed us to not need any code examples to start writing these really performant kernels. The fine-tuned model actually beats the state of the art by 72% in writing correct and performant GPU kernels, which is a huge boost.
>> And lastly, we worked closely with Rogo. Rogo is building a financial reasoning agent capable of reading financial filings, extracting investment insights, and then supporting human analysts through a question-answering interface. They wanted to fine-tune o4-mini to summarize and present key findings from earlier steps in their finance workflow. Rogo is really interesting: they used a custom LLM-as-a-judge grader that was accessible via an endpoint, so we called their custom grader, which measures the agent's factual accuracy, reasoning, completeness, financial soundness, and clarity of explanation. So you can see how you can fit a lot of your own criteria and your own rubrics into your own custom grader, which is part of the power of the agent RFT platform. The results are fantastic, with a 21% increase in core ML performance along with much lower hallucination rates and fewer missing citations. I also want to call out that Rogo did a ton of work in making their grader unhackable. Earlier runs showed that the model actually started reward hacking. What sometimes happens with the RFT process is that if you have an edge case in your grader, the model is super smart and sneaky and will find ways to exploit that grader. So it's really important to make sure your grader is pretty watertight, and that's what Rogo did: they made their grader watertight, they detected the hack, and as a result the true performance they were trying to optimize the model on just started shooting up.
>> Yeah, I remember there was one Rogo run where we came back on the platform we showed earlier and the average reward on validation was just one.
>> Yeah, 100%. A little too perfect.
>> That's too good to be true. Yeah.
>> All right. So that was it for the customer stories.
>> So let's wrap up. Just to summarize, let's talk about when to turn to agent RFT. The general process that we recommend is: first, you want to make sure you build a really high-quality data set where your training and eval sets closely match your production traffic. You want the agent to not be surprised when you go from fine-tuning it to actually exposing it at showtime. Second, you're probably on this journey of improving your agent's performance, so you want to figure out what the baseline performance looks like, so you know where to improve from. You'll probably want to run these baseline evals against GPT-5 or whatever models you like using. Then from there, you want to try to optimize the performance without fine-tuning, because that's often one of the easier ways to get better performance: you might want to adjust other parts of the task, like improving your prompts, your infra, or your task harness. And then, after you've squeezed all the juice out of the task and the base model, that's when you turn to agent RFT to further optimize and start changing the weights of the model to be fundamentally better on your task and your domain, in an end-to-end fashion.
>> Awesome.
>> Yeah, thank you, Will.
>> Okay, great. Let's move on to Q&A. We have a ton of questions, so I wanted to make sure we had enough time for them. Maybe let's go to the next slide and we can tackle them.
>> Cool.
>> Yeah.
>> What kinds of tasks are best suited for agent RFT?
>> Do you want to take it? Okay, all right, I have a take on this. Obviously, we've explained a lot of ways in which you should structure your data set, making sure your data points have enough variance in them so that during exploration the model actually knows what the difference between a good and a bad trajectory looks like. Fundamentally, you want a train set where you still haven't squeezed all the juice out of it, but where the model, given enough exploration, can figure out what good performance looks like, so that it can hill-climb. That's one thing I would say. Another thing is that there's the task itself, but then there's the way you're evaluating and grading it. If the way you're grading it is binary, then it's going to be really hard for the agent to hill-climb and gradually get better on that task. So you want to make sure that yes, you have the task itself, but you also have how you're evaluating it. Those two things together generally lead to quite a bit of success. Do you want to talk about domains or other considerations?
>> Yeah, in terms of domains, it was quite surprising. I think we showed it with the customer stories, but it's really widely applicable. So I see agent RFT, as you presented very clearly at the beginning, as something for any type of agent. As soon as you have an agent that uses tools that are out of distribution of what we trained our models on, which will naturally happen because you have your own tools, then this is really an opportunity for you to dive in on agent RFT. And if there's a lot of reasoning associated with it, that's an even better reason to use GPT-5 or a very strong reasoning model.
>> Totally.
>> Yeah.
>> Totally.
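To make the point about binary grading concrete, here is a hedged sketch contrasting an all-or-nothing grader with one that gives partial credit. The specific checks and weights are made up for illustration; the idea is simply that a shaped score gives the model a gradient to hill climb.

```python
# Sketch: binary vs. partial-credit grading for the same coding-agent task.
# The weights and checks are illustrative only, not a prescribed scheme.

def binary_grade(result: dict) -> float:
    # All-or-nothing: hard for the model to hill climb on.
    return 1.0 if result["tests_passed"] == result["tests_total"] else 0.0

def shaped_grade(result: dict) -> float:
    # Partial credit: intermediate improvements earn a higher score.
    score = 0.0
    score += 0.2 if result["compiles"] else 0.0
    score += 0.6 * (result["tests_passed"] / max(result["tests_total"], 1))
    score += 0.2 if result["lint_clean"] else 0.0
    return min(score, 1.0)

result = {"compiles": True, "tests_passed": 7, "tests_total": 10, "lint_clean": False}
print(binary_grade(result))  # 0.0  -- no signal to climb
print(shaped_grade(result))  # 0.62 -- partial progress is rewarded
```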
Um, all right. What's different about the RFT platform now since it debuted in May? Well, I'll take this one, even if Will is the one who built it,
>> but obviously with the whole team
>> and a big team.
>> But what I really like about agent RFT now is, well, there are multiple things. In May, we were only able to fine-tune o4-mini with a very specific set of graders. Now you're able to fine-tune o4-mini or GPT-5 with tools and with endpoint graders. So the flexibility is just incomparable, and you can tackle so many new tasks. And what is great here is that with this, you can actually create features in your product that just did not exist before, because the model was not good enough, and now you have a path, other than prompting, to help the model use those tools well and build the product that you actually want. So for me that's the most important thing. And then there's also been a lot of work on the observability of the platform, on those different curves, and on stability. All of this has improved a lot, and that's great.
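As a rough illustration of what launching such a job can look like, here is a hedged sketch using the fine-tuning API's reinforcement method. The agent-specific pieces, the tools list and the endpoint grader URL, are assumptions based on the alpha program rather than a documented schema, so treat those field names as placeholders.

```python
# Hedged sketch of launching an RFT job with the OpenAI SDK.
# The reinforcement method exists in the fine-tuning API, but the
# agent-specific fields shown here (tools, endpoint grader URL) are
# assumptions based on the alpha program, not the documented schema.
from openai import OpenAI

client = OpenAI()

job = client.fine_tuning.jobs.create(
    model="o4-mini-2025-04-16",      # or a GPT-5 reasoning snapshot
    training_file="file-abc123",     # placeholder: JSONL matching prod traffic
    method={
        "type": "reinforcement",
        "reinforcement": {
            # Assumed: an endpoint grader that scores each rollout 0-1.
            "grader": {"type": "endpoint", "url": "https://example.com/grade"},
            # Assumed: tool definitions the model can call during rollouts.
            "tools": [{"type": "function", "name": "run_tests"}],
        },
    },
)
print(job.id)
```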
>> Totally. Yeah. And I also just want to emphasize that with agent RFT, we're now in this multi-step RL paradigm. With the original RFT platform, you give the model the prompt, the model thinks for a while, and then it spits back an answer to you. But now you can actually do this multi-step thing. The model is now in a loop with your world and your environment, and all of that actually gets trained end to end with your grader.
>> Yeah, that's awesome.
>> Yeah.
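To picture that multi-step loop, here is a minimal sketch of a single agent rollout: the model reasons, maybe calls a tool in your environment, the result goes back into the context, and the whole trajectory receives one grade at the end. The helpers call_model, execute_tool, and grade_trajectory are hypothetical stand-ins for your model call, your environment, and your grader.

```python
# Minimal sketch of a multi-step agent rollout that ends in a single grade.
# The helpers below are hypothetical stand-ins for your model call, your
# environment, and your grader.

def call_model(context: list[dict]) -> dict:
    # Placeholder: in practice, a reasoning-model call with tool definitions.
    return {"role": "assistant", "content": "done", "tool_call": None}

def execute_tool(tool_call: dict) -> str:
    # Placeholder: run the tool in your environment and return its output.
    return "tool output"

def grade_trajectory(context: list[dict]) -> float:
    # Placeholder: your grader scores the whole trajectory, not just the answer.
    return 1.0 if context[-1]["content"] == "done" else 0.0

def run_rollout(task_prompt: str, max_steps: int = 10) -> float:
    context = [{"role": "user", "content": task_prompt}]
    for _ in range(max_steps):
        step = call_model(context)                 # reasoning + maybe a tool call
        context.append(step)
        if step.get("tool_call") is None:          # final answer, stop looping
            break
        result = execute_tool(step["tool_call"])   # your environment does the work
        context.append({"role": "tool", "content": result})
    return grade_trajectory(context)               # one grade for the whole rollout

print(run_rollout("fix the failing test"))
```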
>> Um, Christine, do you have any other questions?
>> Yeah, some more.
>> Yeah.
>> Yeah. Why don't we go to the next slide, if you can. I was refreshing during... Okay. So if you're able to refresh, otherwise I can just read them out loud.
>> Uh, let's see. Oh, so here's one: curious why RFT is sample efficient.
>> Go back. I think we
>> Yeah.
>> Yep.
>> No, that's cool. Yeah. Take it.
>> This one, right? Yep.
>> Oh, okay. Cool. Do we have one after this, or
>> Yeah.
>> Okay. There's more. Okay.
>> We can run a bit over.
>> All right. Sounds good.
Okay. Uh, yeah. Why is RFT sample efficient? Okay. So, this is sort of a question around why RL is sample efficient. I mean, we could talk at length about that, but fundamentally, first of all, the model is basically generating its own training data through the exploration process. So remember how we talked about this compute multiplier thing, or how the model is exploring the search space. Well, it's actually generating its own training data through that sampling process. And when you give it the grade or the reward at the end, that tells the model how well it's doing based on the trajectory rollout that it generated. And that actually ends up being the thing that we train the model on. So we obviously do a bunch of reinforcement learning stuff over that trajectory, and we take your reward or your grade and apply it over the rollout in certain ways. But ultimately, the model is generating its own data.
>> Yeah. I'd also add a note on the fact that we're working on a frontier model and fine-tuning a frontier model. So the prior that you're working with is very strong already and probably already has some success on your task.
>> So all of this variance only works because that model is able to sometimes get it right.
>> Exactly.
>> And so it's just the power of the prior. And this keeps accelerating: if there's a new model that is stronger and you have RFT on it as well, you can expect it to be sample efficient again, because the prior will keep on increasing.
>> Yeah, the model is generating really good data for itself, basically because the prior is so good, as Theo said.
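One way to see why the model's own rollouts become useful training data: sample several trajectories for the same prompt, grade each one, and use how much better or worse each rollout did than its siblings as the learning signal. The sketch below is a generic group-baseline illustration, not a description of the exact RL machinery behind the RFT service.

```python
# Generic sketch of turning graded rollouts into a learning signal by
# comparing each rollout to the group average. Illustrative only.
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    baseline = statistics.mean(rewards)
    return [r - baseline for r in rewards]

# Four rollouts of the same prompt, scored by your grader:
rewards = [0.0, 0.4, 0.9, 0.3]
print(group_advantages(rewards))
# [-0.4, 0.0, 0.5, -0.1] -- above-average rollouts get reinforced,
# below-average ones discouraged; all the data came from the model itself.
```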
>> Yeah. All right, let's see what the next one is. How does the RL training objective function differ between general RFT and agentic RFT?
>> I can take it, or you can go first. All right, sure, I'll start. So there's a difference between the actual RL loss function and the reward. The reward is what we allow you to specify, whether that's a reward function that's native to our platform, for example the string check grader that we kind of dissed earlier, or the model grader, or your endpoint grader. And then with that reward, we do stuff with it on our side; that would be the RL, or reinforcement learning, loss. That might be what you're actually asking about. And that part, so far, doesn't differ between general RFT and agentic RFT. We don't change the loss part. We may, though, try new things as we're doing our research and research engineering to deliver even bigger model gains. But for now, honestly, there's no difference there. The main difference is that you can now define much more flexible reward functions.
>> Yeah. And I think that's super important. When you look at base RFT, you don't have access to the chain of thought; you only really have access to the final output. Whereas when you look at agentic RFT, you do have a lot of information on the traces, and Sampriti even mentioned grading those tool calls. So you can have much more control over the policy that you want to see from your model after training. That's one big difference to me.
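Since the flexible reward is where agentic RFT differs most, here is a hedged sketch of what an endpoint grader might look like: a small web service that receives a rollout and returns a score, weighing the tool calls in the trace as well as the final answer. The payload fields and the response shape shown here are assumptions for illustration, not the documented contract.

```python
# Hedged sketch of an endpoint grader as a small FastAPI service.
# The payload fields ("trajectory", "reference") and the {"score": ...}
# response shape are assumptions for illustration, not the exact contract.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Rollout(BaseModel):
    trajectory: list[dict]   # messages, reasoning summaries, tool calls
    reference: str           # expected outcome for this task

@app.post("/grade")
def grade(rollout: Rollout) -> dict:
    tool_calls = [m for m in rollout.trajectory if m.get("type") == "tool_call"]
    final = rollout.trajectory[-1].get("content", "") if rollout.trajectory else ""
    score = 0.0
    score += 0.7 if rollout.reference.lower() in final.lower() else 0.0  # outcome
    score += 0.3 if 0 < len(tool_calls) <= 5 else 0.0  # sensible tool usage
    return {"score": score}
```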
>> Totally. We have a little bit of time left. I don't know, should we try to
>> Yeah, let's go through them. We can run over if people are free to stay.
>> Sounds good.
>> Cool. Using RFT, when there is a new response from the model, does the new model learn automatically from previous... Is this asking, if there is a new response from the model... is this talking about domain shift? For example, you train the model on a set of data points, but then you evaluate on
>> Yeah, I'm not exactly sure. The way I see it, it maybe ties in with this idea of the different trajectories that we run, and technically any new response generated by the model is a new trajectory, and we actually do leverage this when we compute the objective. So my answer would be yes.
>> Yes, during training.
>> If that's the right understanding,
>> during training, yeah. But when you're doing inference, we're not using whatever you get during inference to go back and make the model better continuously. Not yet, at least. Someday.
>> All right.
>> Yeah.
>> Cool.
>> Great question.
>> Yeah,
>> you can read it. Yeah.
>> Uh, I guess so. Okay. Are the alpha endpoints used in the code available to everybody?
>> Um, I guess it depends on which alpha endpoints we're talking about.
>> The alpha endpoints as in the tool integration?
>> Yeah.
>> Yeah. Okay. Oh, yeah. Okay. So, no, they aren't. And that's where the agent RFT interest form comes in. This is functionality that we're exposing and that is more or less in, I don't want to say private beta, but it is the type of functionality where we want to work with you to make sure that you get the most success possible. So this is something that we'd love for you to talk to your friendly neighborhood account executive or account director at OpenAI about, to see how we can work together on it.
>> Yeah. And that question is such a good segue, right?
>> Yeah, exactly. Um, so we can wrap up with some resources on the right here. Feel free to explore these, but if you're interested in learning more about agent RFT and specifically working with our team here, check out this link, this tiny URL, fill out the interest form, and we will be in touch. And with that, I will share one more slide on the upcoming build hours. So join us on December 3rd for agent memory patterns. This will be the last build hour of our agent series, but there are many more build hours to come. Check out our homepage there below and we'll keep adding them. So, thanks everyone, and we'll see you next time.
>> Thank you.
>> Yeah, thank you. Bye.
Agent RFT enables reasoning models to become even more powerful, tool-using agents by training directly on the workflows they will execute in production. By operating on agent rollouts, reasoning models can call tools, generate intermediate reasoning steps, and receive real-time feedback via customer-provided endpoints. This Build Hour walks through the preparation, infrastructure, and safety oversight needed to use Agentic RFT.

Theophile Sautory (Applied AI) and William Hang (API Engineering) cover:
• Improving agent performance with optimization and fine-tuning options
• Key differences between Base RFT and Agentic RFT
• New additions and how Agent RFT works
• Task setup and live demos training with tools
• Customer spotlight on Cognition with Sampriti Panda (Research Engineer)
• Success stories featuring Ambience, Genspark, Mako, and Rogo
• Live Q&A

👉 Agent RFT Interest Form: https://tinyurl.com/agentRFT
👉 Follow along with the code repo: https://github.com/openai/build-hours
👉 Sign up for upcoming live Build Hours: https://webinar.openai.com/buildhours/

00:00 Introduction
01:34 Intro to Agent RFT
11:12 Task Setup
14:15 Demos: Training with Tools
31:33 Best Practices
35:15 Customer Spotlight: Cognition
44:58 Success Stories
51:16 Summary
52:33 Q&A