Hey everyone, welcome back to another build hour. I'm Christine, I'm on the startup marketing team, and today I'm here with Will and Theo.
>> Hey, I'm Will. I'm on the engineering team building the fine-tuning product.
>> And I'm Theo, a solutions architect working with startups, and with Will in particular quite a lot.
>> So today's topic is agent RFT, which is really exciting. If you've been tuning in to our past build hours, we did a series all about how to build agents, starting with the Responses API and then working our way up to AgentKit, and now we're talking about agent RFT. All of those past build hours can be found on our YouTube, and the purpose of build hours is really to help you build on our API and use our tools. With that, I'll give you a quick snapshot of what this next hour will be about. First, we're going to introduce you to agent RFT. Then we'll spend some time on the task setup and move on to some live demos. We have a really exciting customer spotlight today with Cognition, so we'll be dialing in, then we'll share some customer stories and end with Q&A. On the right side of your screen you'll have a Q&A box, so feel free to toggle over and submit questions throughout the hour. Our team is in the room and joining in virtually to help address those questions, and we'll save a few for the end to answer live. With that, I'll pass it off to Will and Theo.
>> Awesome. So let's kick things off by talking about agents. You're probably joining us today because you're building an agent for your application or your business and you'd like to improve its performance. What makes an agent different from a regular model is its ability to interact with the outside world to complete a task. It doesn't have to go through you all the time or even talk to you; it just gets things done on its own. Now, in order to get things done, the agent has to have access to tools. If you're building a coding agent, for example, it needs access to a terminal, a code interpreter, or maybe even an entire codebase. Or if you're building a customer service agent, it might need access to internal software to look up customer records, billing systems to issue refunds, or even the ability to escalate to a human being. So the agent needs a way to interact with your business context and the outside world to get things done through the use of tools. And the way we think about agents here is that all of their interactions with the outside world go back into the context window. That means that after looking at what it sent into a tool and what it got back, the agent will reason to itself, call another tool, and repeat the process.
>> Yeah, that's super cool. And how does that tie in with our first-party agent products?
>> Yeah, totally. So we care a lot about agents here, obviously, and we're building some of the best agents for specific use cases. Here's how OpenAI agents use tools. For example, Codex has access to a wide range of tools to complete coding tasks end to end, like running tests, reading your code files, or making code changes. So Codex might have access to, say, a code planning tool, a terminal tool, or even a tool to apply git patches. Another example of a first-party agent we've released is deep research, which is now embedded within our agent and GPT-5 products. Deep research has access to a browser, can look through your files, and can also run code. For deep research, this set of tools allows the agent to deliver you the most up-to-date, most accurate research articles.
>> Yeah, that's super cool. And when we work with customers who are using our models and are interested in optimizing them, they work a lot on prompt engineering. What would you recommend to optimize your agents?
>> Yeah, so prompt engineering is honestly a great way to start. We've seen many different ways to improve the performance of agents so far, so let's go through them. As you said, you can steer model behavior by optimizing the prompt; it's almost like instructing the model to do your task better. But let's say you've optimized your prompt and you're still not as satisfied as you could be. You can then optimize the task itself. For example, you can simplify the task, you can add better guardrails around the task to improve the agent's chances of getting things right, you can add or subtract tools, or you can make the tools better at accomplishing what the agent intended to do.
>> Yeah, it's interesting when you look at agents, because we've seen customers be successful at improving the performance of an agent just by changing the description of the tools.
>> Yeah.
>> And even their naming, just because it makes more sense; it's semantically easier for the model to understand.
>> Totally. Yeah, there's a lot you can do to improve the performance of the agent before you move to fine-tuning. But let's say you've tried all these approaches and you still want better performance. That's where fine-tuning comes in. Fine-tuning is a way to train the agent end to end on your task to achieve even better performance.
And what we're here to talk about today is agent reinforcement fine-tuning, or agent RFT. Agent RFT is the way to do this. Agent RFT changes the weights of the model according to a learning signal that you specify, to teach the model what good behavior and less-than-good behavior look like. During training, the agent will explore many different ways of calling your tools to learn how to do better and better as training progresses. And we wanted to remind everyone that base RFT is already a functionality in the current fine-tuning product, but you cannot use it to fine-tune agents. Agent RFT does allow you to do this: it lets the agent call tools while it's exploring during the rollout process, so it can learn from all possible ways of using your tools. You can also specify an arbitrary reward signal through an endpoint, which we call to train the model, so that it gets better and better in ways that matter to you.
So to summarize the benefits of agent RFT: first, it helps you improve the performance of reasoning models. It improves the agent's ability to use tools and reach the best final answer. It's also quite sample efficient, which can be really important in domains where training data is scarce; we'll talk more about specific examples when we go through the customer stories. And the process itself can result in a model that has lower latency and is better at agentic tasks.
>> Okay, that's really cool. And when you mention latency, I think that's a very key point. Is it the number of reasoning tokens that drops, or the number of tool calls, or what is it?
>> Yeah, totally. So let's dive in a bit into latency and ML performance and how we can improve both. One of the challenges of making agents work with your business context is that it might be very different from how we at OpenAI train our models. If your tools look and behave the same way as, say, Codex's tools or deep research's tools, which we've trained on, then you're in luck, because the domain of the tools is going to be similar between the base model and your task. But your business context is most likely specific to you. That means your agent might not be used to using your tools in the ideal way. It might call a tool too many times, or it might call five different tools when calling one tool would have been better for what it was trying to do in a given moment. Using agent RFT, you can align these domains. It's possible to train the model to use far fewer tool calls to achieve the same or sometimes even better performance on a given task. That means lower latencies for you and faster experiences for your end users. This process happens naturally because we do apply a light penalty to the number of tokens the model uses to reason.
But perhaps you want to impose a constraint instead, because rather than relying on this natural process of the model learning to use fewer tool calls and fewer tokens, sometimes you want to make sure the model stays within a given tool call budget and doesn't go over that limit. Given how important tool calls are in affecting latency, this can really reduce the latency of your rollouts. So agent RFT allows you to specify this cutoff and train the model to stay within the given budget while preserving or exceeding the original ML performance.
But ultimately you're probably here in the first place to improve the ML performance of your agent. Agent RFT can help you do this by, first, training the model to reason better across tool outputs, and second, training the model to use tools better in the first place. All of this is learned organically during the exploration and rollout process, as the model tries many different ways across the search space to call your tools and then reason about the outputs from your tools to arrive at a better answer. So hopefully it hill-climbs nicely on your task.
>> Yeah, that's awesome, and I really want to try it out. I want everyone to try it out, actually. And I know you worked so hard to make this work.
>> We worked hard.
>> Yeah, the whole team. So under the hood, how does it work? How does it communicate with the tools?
>> Yeah, so let's dive in. In order to make all this work, we've introduced several major new updates to the existing RFT product. First is the ability of the model to call tools during training via calls to your tool endpoints. And second is the ability for you to specify a grader in the form of an endpoint that we call to get your custom reward signal. These two additions mark the first time we've allowed our models, even our frontier models, to interact with the outside world during the training process: through your tools as the model is exploring and doing its rollouts, and through your reward signal when we're ready to update the model.
To dive even deeper into exactly what happens during the training process: for each agent rollout, we assign a unique identifier to all tool calls and final answers that come out of that rollout. When the agent calls your tools, we attach that unique ID to the tool call so that your system can recognize different tool calls as originating from the same rollout. This allows you to keep track of rollouts as they happen, which can be important for state management if you choose to do it, in your own database or your own backend. Then, when we emit the final answer and call your grader, you can attach all the context from the agent to the final answer through that unique identifier and pass it into your grader, so you can grade with very holistic context.
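(For reference, a minimal sketch of what a tool endpoint with per-rollout state tracking could look like; the request fields such as `rollout_id` are illustrative assumptions, not the exact wire format the platform uses.)

```python
# Minimal sketch of a tool endpoint that tracks per-rollout state.
# Field names like "rollout_id" are assumptions, not the exact payload schema.
from collections import defaultdict

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Per-rollout scratch space, keyed by the unique rollout identifier.
rollout_state: dict[str, list] = defaultdict(list)

class ToolCall(BaseModel):
    rollout_id: str   # unique ID attached to every call from the same rollout
    name: str         # which tool is being invoked
    arguments: dict   # tool arguments produced by the model

@app.post("/tool")
def call_tool(call: ToolCall) -> dict:
    # Record the call so the grader can later see the full trajectory.
    rollout_state[call.rollout_id].append({"name": call.name, "args": call.arguments})
    # ... dispatch to the actual tool implementation here ...
    return {"output": f"ran {call.name}"}
```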
>> Yeah, that's awesome. I think what I find most powerful here is that all the tool calls and the grading happen in your environment. So it can match your production environment exactly, and then your model will just not be surprised when it sees that specific tool and will know how to call it.
>> And it also gives you so much flexibility in the grading. Currently on our platform we have a couple of ways of grading, but here, because you receive every tool call, you can store them, you can grade them, and really shape the policy that you want for your model.
>> Yeah, absolutely. So there's a lot of flexibility here, and we do hope that agent RFT helps you teach agents to achieve frontier performance on your tasks. So enough theory; let's now illustrate how agent RFT works with a real-world example. We're going to fine-tune an agent to perform better on FinQA (financial QA), a benchmark that gives a model a financial report and asks it to answer questions about it that require numerical reasoning. The original benchmark, which is a published academic benchmark, includes in the prompt the relevant financial report the model needs to answer the question. But we've decided to make things harder, because we like doing things the most difficult way here at OpenAI. We've modified the benchmark and made it a lot harder by only giving the model the question itself, without the context, no report. And we require it to use tools, like an agent would, to search for the correct report in a pile of 2,800 financial reports to answer the question. And to make the task even more challenging, we require that the model arrive at its answer within 10 tool calls.
>> Yeah, so that's so much harder, because you have to know where to look. Then once you've found where to look, you have to reason over it, and all of this within a very constrained budget.
>> Totally. Yeah. So here are the tools we've given the agent access to. We have a search tool, which is a semantic search tool. We have a list tool, which goes through all the directories and document paths and tells you what's in the file system. And we have this funnily named cat tool, which is our engineer brains just naming things the way we understand them; cat returns a document given a path, so it's kind of like opening a document on your computer.
Let's go through an example. Here's a sample question from the benchmark. After seeing this question about Intel's return, the agent might call the search tool with some query to try to find the relevant documents and the information it needs. The search tool might return something like this, which has a table and a text section with all the relevant numbers needed to answer the question. And here's the grader for this task, which is how we generate the reward signal for the agent's final answers. Just to keep things as simple as possible, since we've already overcomplicated things by making the benchmark harder, we used a model grader for this task. We could have used an endpoint grader, which is something we'll cover soon. We also could have used a string grader, which rewards the model for exact string matches to the ground truth, but that's super brittle and can penalize the agent for minor formatting errors, like writing out "32 dollars" instead of using the dollar sign. It penalizes the model in ways we don't want. In our case, we also want to give the agent partial credit for answers that are really close to the ground truth, like rounding errors, say 0.999 when the ground truth is actually 1.
>> All right, I'm going to hand it over to Theo to talk about the demo and the training process itself.
>> Thank you, Will. That was a great setup. So let me dive into some code here; I'm just going to make sure the right screen is being shared.
>> Yep.
All right. So here you should be seeing the code. The first thing we're going to look at is the tool server. We're using Modal to do this, because it was very fast to just set up a FastAPI endpoint, and we have some instructions on how to set that up. The main idea is that we first set up a base image, a Debian image, with FastAPI, pandas, NumPy, and the OpenAI library; those are the libraries we need to run our code and the tools. We also add the corpus, all the data and the documents, so that the model can actually look inside them.
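(A rough sketch of the Modal setup being described, assuming Modal's Python SDK; the method names, app name, corpus path, and module layout are assumptions.)

```python
# Rough sketch of the Modal app: Debian image + libraries + corpus baked in.
# Names and paths here are assumptions.
import modal

image = (
    modal.Image.debian_slim()
    .pip_install("fastapi", "pandas", "numpy", "openai")
    .add_local_dir("./finqa_corpus", remote_path="/corpus")  # the 2,800 reports
)

app = modal.App("finqa-tools", image=image)

@app.function()
@modal.asgi_app()
def serve():
    # Return the FastAPI app that exposes the search / list / cat endpoints.
    from tools_server import web_app  # hypothetical module holding the FastAPI app
    return web_app
```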
Here we define the different tools. Let me just look at the search tool, because as Will mentioned, it's a semantic search. We create some embeddings with an OpenAI embedding model and then compute cosine similarity, so it's very similar to the RAG setups you're probably all familiar with, and that's how we build the search tool, which is defined here. I'm not going in depth on the other tools, because list and cat are quite straightforward.
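(A minimal sketch of such an embedding-based search tool, assuming an OpenAI embedding model and plain cosine similarity; the model name and corpus layout are assumptions.)

```python
# Minimal sketch of the semantic search tool: embed the query, rank documents
# by cosine similarity. Model name and document paths are assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Pre-compute document embeddings once at startup.
doc_paths = ["/corpus/intel_2015.txt", "/corpus/aapl_2014.txt"]  # ... up to 2,800 reports
doc_texts = [open(p).read() for p in doc_paths]
doc_embs = embed(doc_texts)

def search(query: str, top_k: int = 5) -> list[dict]:
    q = embed([query])[0]
    sims = doc_embs @ q / (np.linalg.norm(doc_embs, axis=1) * np.linalg.norm(q))
    top = np.argsort(-sims)[:top_k]
    return [{"path": doc_paths[i], "score": float(sims[i])} for i in top]
```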
Then, the way we provide these tools to the platform and to the model is through this list, where we have the JSON definitions of the tools: a name, which is going to be search; a URL, which is the Modal URL I just set up; and a set of headers containing an auth token so that only I can access those endpoints.
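(A sketch of that tool list; it follows the fields just described, name, URL, and headers, but the exact schema expected by the platform is an assumption.)

```python
# Sketch of the tool registration list passed to the platform (schema is an assumption).
import os

AUTH_HEADERS = {"Authorization": f"Bearer {os.environ['TOOL_ENDPOINT_TOKEN']}"}
BASE_URL = "https://<your-modal-app>.modal.run"  # placeholder for the deployed Modal URL

tool_endpoints = [
    {"name": "search", "url": f"{BASE_URL}/search", "headers": AUTH_HEADERS},
    {"name": "list",   "url": f"{BASE_URL}/list",   "headers": AUTH_HEADERS},
    {"name": "cat",    "url": f"{BASE_URL}/cat",    "headers": AUTH_HEADERS},
]
```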
>> Oh, and if you put your name in the URL, that's great.
>> Yeah, it's mine.
>> Yeah, it's yours. No one else's.
>> Right. So that's how we set up the tools, and then we can have a look at the grader. As Will mentioned, we're using a model grader, because in the data set the answers don't always have the same consistency in the number of decimals, or whether the dollar sign goes before the number or "dollars" is written after it. To prevent this brittleness we just use a model grader, and as Will mentioned, we provide a partial reward of 0.5 if the answer is close but not exact. This also allows us to give a reward of 1 if you say 7% instead of 0.07 as an answer. This is very important, because we want to make sure we provide the right signal to the agent, or else it's not going to be able to learn what was a correct reasoning path versus what was not. We're using GPT-4.1 as the grading model, and then the response format, which you might be familiar with from previous build hours or RFT engagements.
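(A sketch of what a model-grader configuration along those lines could look like; the grader type, field names, and template variables are assumptions based on the existing RFT grader format, not the exact config used in the demo.)

```python
# Sketch of a model grader giving full and partial credit (schema is an assumption).
model_grader = {
    "type": "score_model",
    "name": "finqa_model_grader",
    "model": "gpt-4.1",
    "input": [
        {
            "role": "system",
            "content": (
                "Grade the answer against the ground truth. Return 1.0 if it is "
                "numerically equivalent (e.g. 7% vs 0.07, $32 vs 32 dollars), "
                "0.5 if it is close (e.g. rounding differences), else 0.0."
            ),
        },
        {
            "role": "user",
            "content": (
                "Question: {{item.question}}\n"
                "Ground truth: {{item.answer}}\n"
                "Model answer: {{sample.output_text}}"
            ),
        },
    ],
}
```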
>> Right. And I also want to remind everyone that we used a model grader here, but we also have the endpoint grader that you can use, where we basically call your endpoint via the public internet so that you can define your custom reward signal. In this case, for simplicity's sake, we just chose to go with a model grader.
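(For comparison, a minimal sketch of what a custom grader endpoint could look like; the payload fields are assumptions, not the actual request format.)

```python
# Minimal sketch of a custom grader endpoint (request fields are assumptions).
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GradeRequest(BaseModel):
    rollout_id: str     # lets you join the final answer with stored tool calls
    final_answer: str
    ground_truth: str

@app.post("/grade")
def grade(req: GradeRequest) -> dict:
    # Pull back whatever context you stored for this rollout, apply your own
    # logic, and return a scalar reward.
    reward = 1.0 if req.final_answer.strip() == req.ground_truth.strip() else 0.0
    return {"reward": reward}
```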
>> Yeah, totally. Thank you, Will. All right. So now, what do we always do before running a training? You can imagine that we optimize a prompt, et cetera. What we're going to do is run a baseline to see how GPT-5 performs. And if you remember the reinforcement fine-tuning build hour we did with Prashant a couple of months ago, we were very interested in the variance of the model: given a specific sample, what is the variance of the scores it gets for that sample? So I'm going to show those plots. I actually ran the model three times on each sample from training and validation. I'm showing validation here because it's just 100 samples, whereas training is a thousand samples, which is a bit too large and hard to read. This is the kind of plot you might have seen last time, and I'm going to describe it again. But how does it look to you, Will?
>> Yeah, I'm going to need some help interpreting this graph, because there's a lot of stuff going on here.
>> Yeah, all right.
>> Take it away.
>> For sure. So on the x-axis you have each different sample, and for each sample we ran the agent three times; on the y-axis we have the score. If you look at each point, the red cross is the best score it got out of the three runs. So if you look at this sample here, it got zero every time. If you look at the sample at the top right, it got one every time. And in the middle, sometimes they get zero, sometimes they get one, and sometimes they get 0.5.
>> Right.
>> So the red cross is the best of the three runs, the thick light-blue bar is the mean over the three runs, and the thin blue bar is the variance. And when I see a plot like this, I don't think it's a great plot for reinforcement fine-tuning, because many of the samples have no variance.
>> Okay.
>> But we still have a fraction of them, probably 15% in the middle, that do have variance, and it's this variance that is going to enable the model to learn what a good reasoning path is versus what is not a good reasoning path.
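(A rough sketch of how a per-sample variance plot like this could be produced from the three baseline runs; the file and column names are assumptions.)

```python
# Per-sample variance plot: best, mean, and spread over three runs per sample.
# File and column names are assumptions.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("baseline_validation_runs.csv")  # columns: sample_id, run, score
stats = df.groupby("sample_id")["score"].agg(["mean", "max", "std"]).sort_values("mean")

x = range(len(stats))
plt.errorbar(x, stats["mean"], yerr=stats["std"].fillna(0), fmt="none", alpha=0.5)  # thin bar: spread
plt.bar(x, stats["mean"], alpha=0.4)                                                # thick bar: mean of 3 runs
plt.scatter(x, stats["max"], marker="x", color="red")                               # red cross: best of 3
plt.xlabel("validation sample")
plt.ylabel("score")
plt.show()
```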
>> Yeah, totally.
>> And so we expect that all those samples will actually provide some signal to improve the performance of the model.
>> Right.
>> And very importantly, this is on the validation set, but you can trust me that the distribution on the train set is quite similar.
>> Yeah, totally. And maybe this is a good point to talk about the compute multiplier, because the compute multiplier controls the amount of exploration the model does. Over three repeats, each data point is only being explored three times; there are just not enough samples to hike those zero scores up into the blue-bar region. But if we set the compute multiplier higher so that the model explores more, it has more chances to get some nonzero reward out of its exploration. That's where exploration really matters.
>> Yeah, totally. All right. So now that we've seen this, we also share a very simple notebook that you can run through, which will let you launch the training on our platform using our API. Now let me go and find one of the training runs we did.
All right. So now we're on the OpenAI platform, which you might remember in some ways, and you can see the job that we ran; it has an ID, and we're going to explore the hyperparameters we used and then see the curves for reward, output tokens, tool calls, and so on. At a very high level, I ran for three epochs, meaning we go through each sample three times. The batch size was set to 16, and as Will mentioned, there's a compute multiplier, which is a very important number for the amount of variation we'll observe during training; here I've set it to 1. If you want more variation, more compute, and more chances of stumbling on good reasoning paths, you would bump up this compute multiplier. But you also have to remember that you are hosting endpoints, and during training we're going to hit those endpoints. So if you increase the compute multiplier, you're going to have to increase the robustness of your endpoints as well.
>> Yeah.
>> Right.
>> Totally.
>> And I'm using reasoning effort medium, and eval samples is the number of times we evaluate each point from the validation data set, to have robust curves during training.
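(The hyperparameters just described, written out as a config sketch; the field names are illustrative, not the exact API parameter names.)

```python
# The run's hyperparameters as a config sketch (field names are illustrative).
rft_hyperparameters = {
    "n_epochs": 3,            # three passes over the 1,000 training samples
    "batch_size": 16,
    "compute_multiplier": 1,  # raise for more exploration -- and more endpoint traffic
    "reasoning_effort": "medium",
    "eval_samples": 2,        # times each validation sample is scored per eval
}
```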
All right. So let's have a look at this reward curve. You can see that at the very beginning we start at a baseline of around 0.6 validation reward. The purple curve is the score on the validation set, the full validation set run twice per sample. The green curve is the model's performance on the specific batch it's training on. Here we have a batch size of 16, so this value at step two, 0.461, means that across all the trajectories we ran over the 16 samples in that batch, the average reward was 0.461. It's less representative and less robust than the validation curve, because the validation curve is computed on the full validation data set.
>> Right.
>> And what we can see here is that very rapidly, in just 10 steps, the model improves its performance by about four percentage points, from 0.59 to 0.63.
>> Yeah.
>> That's quite a lot, and it probably directly learned how to use the tools much better.
>> Yeah.
>> And if you go on a bit longer, you can see that the reward goes down a little bit, and the assumption is that the model is exploring new solutions to try to push the performance even more.
>> And it manages, at the very end, to push a little higher.
>> Yeah. And I always love correlating the reward with the mean reasoning tokens, because here you can see that in the big exploration phase in the middle, where the reward was going down, the model was starting to think more and more.
>> Right. Yeah.
>> And sometimes it's just not necessary to think more and more; maybe you just have to learn how to use your tools better, or use different tools. This is what I love about the UI: we also show tool calls per rollout, which shows how the distribution of the number of tool calls evolves during training. You can see that at the very beginning we use a total of probably eight or nine tool calls, and then it drops quite significantly.
>> Yeah.
>> Down to much lower numbers. So you can see that the performance gain we saw after 10 steps is also correlated with a huge drop in tool calls.
>> Yeah. And you can assume the model is just learning to use those tools much more efficiently.
>> And I think that's really awesome, because it shows how we're closing the distribution shift in just those 10 steps. Of course, 10 steps is the number that worked here; in your case it might be more, it might be less, but it's very interesting. And then you can see that all this region of interest, where the reward was going down and the number of reasoning tokens was going up, is a region where the tool calls were definitely shifting. And if you go to the very end, you see that the model starts doing a lot of list calls. I'm not exactly sure why, but this allows it to reach higher rewards, and it kind of converges to a policy; you can see where list becomes flat and all those lines become more or less flat. I think that's very interesting. As a business, you might have stopped after 10 steps, because you don't want to plateau for too long.
>> Mhm.
>> But it's always interesting to see what happens beyond.
>> Sure.
>> All right. So those are the high-level curves I wanted to explore. There are many other curves, such as the number of tokens per tool call, which will also give you a sense of the speed of the training run: the more output tokens, the longer it will take to train the model. But let's go back to the code, because so far we've only done a very high-level analysis. Because we have access to these models, we can observe the traces in depth and try to understand what is happening under the hood.
what is happening under the hood. So
I've loaded all the results. I run the
evaluations three times on the
validation set for the baseline model
and step 10 model that we saw the big
increase and the decrease in reason in
the number of tool calls. And what I'm
going to look at is as will mentioned
like the performance initially but also
the latency and also the output tokens.
So let's do some quick plots here; you'll find the code for all of them. Here you can see a very simple plot with the average reward over the 3 × 100 samples and the average latency. Technically you want to be at the top left: higher reward, lower latency. And that's what we get going from the baseline to step 10. We have a five-second reduction in latency, which is approximately 10%, and an 11 percentage point increase in reward, which is quite significant.
>> Yeah. Wow.
>> Latency on its own can be a bit hard to interpret, so what we can also look at is the number of tokens, because it gives you some information on the time it takes to respond. And you can see the mean output tokens went from about 2,500 to 1,500.
>> Wow. Huge reduction.
>> Yeah, a huge reduction. That's probably from less reasoning and fewer tool calls.
>> Right.
>> All right. So now that we've seen the high level, let's look into the tool calls per trace. I also ran something here to compute the means, and you can see that for the baseline model we were around 6.9 tool calls per trace, while the fine-tuned model is only at 4.2. So that means a smarter model, a faster model, one that's closing the distribution shift.
>> Yeah.
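(A sketch of the comparison being described; the file and column names are assumptions.)

```python
# Baseline vs. step-10 comparison over the evaluation rollouts
# (file and column names are assumptions).
import pandas as pd

# One row per rollout: model, reward, latency_s, output_tokens, n_tool_calls.
runs = pd.read_csv("finqa_eval_rollouts.csv")
summary = runs.groupby("model")[["reward", "latency_s", "output_tokens", "n_tool_calls"]].mean()
print(summary)
# From the demo: tool calls per trace dropped from ~6.9 to ~4.2, mean output tokens
# from ~2,500 to ~1,500, latency by ~5 s, while reward rose by ~11 points.
```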
>> And if we look more in depth, we can plot a quadrant plot, which is a plot I really love to make after running RFT to understand what's really happening in the model's behavior. Let me walk you through it very simply.
>> You and your plots, man.
>> Yeah, I think it's a great way to get an understanding of the policy change. So on the y-axis you have the delta reward: we take the reward of step 10 minus that of the baseline for each of our data points. And on the x-axis we have the delta in tool calls, again step 10 minus baseline. The quadrant where you want to be is the top left, because you want higher reward and a lower number of tool calls. And you can see that we have 29 points in this region, which means a large fraction of those 100 samples are just faster and higher reward. Then there's another interesting group: the ones where there is no delta in reward but a decrease in the number of tool calls, and that's also quite a big fraction, probably more than 50 samples. And then there are some samples where the model loses some reward while making fewer tool calls.
>> And this kind of highlights that even if we learned a policy that's fairly general, we might not be able to capture all of the data points, because this policy might be a bit too strict for some of them.
>> Yeah.
>> So that's the trade-off. But we don't have any point in the bottom-right corner, which is the one we really don't want: more tool calls and lower reward. So I'm quite happy with how the model trained and how the policy changed.
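(A sketch of this per-sample quadrant analysis; the file, column, and model names are assumptions.)

```python
# Per-sample "quadrant" analysis: delta reward vs. delta tool calls
# between the fine-tuned and baseline models (names are assumptions).
import pandas as pd

runs = pd.read_csv("finqa_eval_rollouts.csv")  # sample_id, model, reward, n_tool_calls
per_sample = (
    runs.groupby(["sample_id", "model"])[["reward", "n_tool_calls"]].mean().unstack("model")
)

d_reward = per_sample[("reward", "step_10")] - per_sample[("reward", "baseline")]
d_tools  = per_sample[("n_tool_calls", "step_10")] - per_sample[("n_tool_calls", "baseline")]

quadrants = {
    "higher reward, fewer calls": int(((d_reward > 0) & (d_tools < 0)).sum()),
    "same reward, fewer calls":   int(((d_reward == 0) & (d_tools < 0)).sum()),
    "lower reward, fewer calls":  int(((d_reward < 0) & (d_tools < 0)).sum()),
    "lower reward, more calls":   int(((d_reward < 0) & (d_tools > 0)).sum()),  # ideally empty
}
print(quadrants)
```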
>> We can also skim through some of the traces in a little more depth, across all of those tool calls. What's interesting to see is that the model has learned to use each of the tools better. The first line here is the number of tool calls per specific tool, and you can see that it drops for search, for cat, and for list, so the improvement is really general, which is quite cool. And finally, something more about the model's policy, though this would require even more work: we can look at whether the model is being a bit smarter in the way it uses the tools and how often it repeats essentially the same tool call
with slightly different parameters. So here I was looking at bigrams: does it do search and then search again, or cat and then cat? And you can see that the number of repeats drops significantly, from about 1,000 to 500, so we divide it by two. The model is much better at making the right tool call the first time, so it doesn't have to repeat essentially the same call with slightly different parameters.
>> Right.
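(A sketch of counting those consecutive repeats of the same tool; the trace format, a list of tool names per rollout, is an assumption.)

```python
# Count consecutive calls to the same tool within each trace
# (the trace format is an assumption).
def count_consecutive_repeats(traces: list[list[str]]) -> int:
    repeats = 0
    for tool_names in traces:
        for prev, curr in zip(tool_names, tool_names[1:]):
            if prev == curr:  # e.g. search immediately followed by search again
                repeats += 1
    return repeats

baseline_traces  = [["search", "search", "search", "cat"], ["list", "cat"]]  # illustrative
finetuned_traces = [["search", "list", "cat"], ["search", "cat"]]            # illustrative
print(count_consecutive_repeats(baseline_traces), count_consecutive_repeats(finetuned_traces))
```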
>> And if we look in depth, this is a very cherry-picked example, but you can see the baseline model calling the search tool six times in a row before doing list, cat, another search, and cat, whereas the fine-tuned model just follows a very simple policy of search, list, and cat, and then probably just reasons over the output to provide a final answer.
>> Yeah, totally.
>> Yeah. And I just want to add that on this benchmark, there's no overlap between the documents required to answer the questions in the train set and those required in the validation set. However, the model is still operating on the same pile of documents, so it still has access to basically the entire file structure, but for the questions that are asked there's essentially no overlap in the documents required. So in some ways it learns how to use this file system and knows what documents are inside, but ultimately the documents are kept separate.
>> Yeah, that's pretty cool.
>> So that probably matches a typical business use case, where you have an existing corpus that you want the model to learn to use better.
>> Yeah, totally.
>> All right, cool. So that was a quick demo. You'll find the code if you want to go through it and run it. At a very high level, we also have some advice on how to be successful with agent RFT. The first one is that you need a well-specified and constrained task, mainly in the sense that you need consensus from people who have domain knowledge, or an aesthetic sense for visual tasks; there should be one real answer, or at least people should agree on what a good answer is. This is very important because you want to send the model a signal that is consistent, and not say in one example that answer A is good and in the next that answer A is not good, because then the model will get confused and will not learn how to reason better.
>> Yeah.
>> The second one is nonzero baseline performance. We saw this initially in our variance plot: you need the model to be sometimes right, or else, if it's just never right even after running a hundred times on the same sample, it will probably never learn, especially if that's the case across your whole data set.
>> Right.
>> Then there's improved accuracy @ k, which is very interesting. If you run the model multiple times for each sample, then instead of looking at the average performance, you can look at the average performance of the best trajectory for each sample. That gives you information on the variance and on how often the model gets it right. During training, we're going to nudge all the trajectories to match those best trajectories. And technically you can then bootstrap on this, because the model will generalize across other samples, bring in some new reasoning patterns, and probably keep on pushing, and you can do that multiple times.
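(A sketch of that accuracy-at-k idea: compare the plain average score to the average best-of-k score per sample; the file and column names are assumptions.)

```python
# "Accuracy @ k": plain average vs. average best-of-k score per sample
# (file and column names are assumptions).
import pandas as pd

runs = pd.read_csv("baseline_validation_runs.csv")  # columns: sample_id, run, score
per_sample = runs.groupby("sample_id")["score"]

avg_score = per_sample.mean().mean()  # average over all trajectories
best_of_k = per_sample.max().mean()   # average of the best trajectory per sample

# A large gap suggests the model can already solve many samples sometimes,
# so RFT has headroom to nudge all trajectories toward those best ones.
print(f"average: {avg_score:.3f}, best-of-k: {best_of_k:.3f}")
```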
>> And finally, quality over quantity. In this example we used 1,000 training samples, which is quite a large number, but I've done a lot of engagements with far fewer samples, probably 150, and we've been quite successful. Again, the idea is really about how good the data is; you don't want any mixed signals going to the model.
>> Right, yeah.
>> All right, so that was the performance side, what you do beforehand. Now, on the infrastructure side, really related to our product: what you want to do is mirror production behavior. You have the opportunity to host your tools, so just go for it and make them very similar to production, so that everything you improve during training actually translates to your product. The second part is investing in designing your grader. The grader will really shape the way the model behaves, and therefore the model's policy, so it's very important that it be aligned with your domain knowledge and hard to game and hack. That said, this is genuinely hard, so as soon as you have something that is even a little bit hard to game, you should go for it and try it. And then preferably have some gradient: as Will mentioned, if the reward is just binary, just a string check, that will be limiting, because by nature many problems don't have a simple yes-or-no answer.
>> Right, and you want to give the model partial credit.
>> Exactly, yeah.
>> And you want the model to know that it was going in the right direction. Exactly. Like here, in this case, we could have thought of adding some reward for reading the right file, so it knows it's reading the right file and maybe just the reasoning is wrong.
>> Right. It's like teaching a person, kind of.
>> Yeah, that's a little bit how I think about it. All right. And then, limit the tool call output length, because otherwise it's just going to make your training very slow and also confuse the model. If you can work on outputting only what is necessary, that will make everything more efficient across the board. And I think that's also very reasonable advice for any agent.
>> Totally. And it saves you money, too. You don't want to shove tons of useless tokens into the context window.
>> Yeah.
>> So, neat.
>> All right.
>> Awesome.
>> Cool.
>> Thanks so much for that. That was incredible.
>> No, thank you.
>> Great analysis and all the plots and charts.
>> Thank you, Will and Theo, for walking us through that. I'm really excited about this next segment, our customer spotlight. We're going to hear directly from a research engineer at Cognition, Sam Pretty. So please welcome Sam.
>> Hey everyone. I'll quickly take over the screen sharing. Is that good?
>> Yeah, as long as it's coming from your computer.
>> Yeah, it is.
>> Sweet.
>> Thank you, Theo, Will, and Christine. So hi everyone, I work as a research engineer at Cognition. At Cognition we build Devin and Windsurf. Devin is an autonomous AI engineer that works independently on solving tasks in your codebase, and as part of my work at Cognition I work on improving models to make parts of Devin smarter. So I'm really excited to share what we've been working on with Will and Theo using the agent RFT feature.
So, one of the tasks in Devin: when you give an initial query to Devin, the first thing it does is go into a planning mode to try to figure out what it needs to do to actually solve the task. From a UX perspective, we don't want the agent to spend too much time in planning mode, because we want Devin to start working and showing edits to the user as soon as possible. So one of the motivations was: can we fine-tune GPT-5 or other frontier models so that they get to this editing stage as quickly as possible while still maintaining or even improving accuracy? The way we designed this task is that, given the initial user prompt, we restrict the set of tools available to Devin. In this case it's just read file and shell, because we don't really need to make any edits at this stage, and then we let the agent explore and figure out which files to look at and which files to edit to solve the task. The motivation for the shell tool is so that the model can run commands like grep and find to search the codebase for certain strings the user might have put in the query, or just look for certain file names, things like that.
As mentioned earlier, we obviously need the tool calls, and then the data set and the reward. For the data set, we collected a bunch of real-world repositories and user queries from those repositories, and then we labeled which files the user actually edited to solve each task. Ideally, we want this sub-agent to return those exact files, so that in the following step the agent can continue and make the edits to those files.
For the reward, we use the F1 score. F1 balances both precision and recall: if we used just precision or just recall, the model would either be very conservative and return only a few files, or return too many things to try to get everything. We want to be in the balance, so that the agent that comes along afterwards doesn't have its context polluted with too much data.
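(A sketch of an F1-style reward over predicted versus ground-truth edited files; this is illustrative, not Cognition's actual grader code.)

```python
# Illustrative F1-style reward over file sets (not Cognition's actual grader).
def f1_reward(predicted_files: set[str], ground_truth_files: set[str]) -> float:
    if not predicted_files or not ground_truth_files:
        return 0.0
    true_positives = len(predicted_files & ground_truth_files)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_files)   # penalizes returning too many files
    recall = true_positives / len(ground_truth_files)   # penalizes missing needed files
    return 2 * precision * recall / (precision + recall)

print(f1_reward({"src/app.py", "src/db.py"}, {"src/app.py", "src/models.py"}))  # 0.5
```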
So let's get to the eval results. We started off with the GPT-5 base model being somewhat lower than the current frontier model, and we ran two experiments. One experiment was GPT-5 with a smaller data set of around 100 samples, so 100 tasks across varying repositories, and then a larger experiment with around a thousand samples. One thing we tried to maintain was that the sets of repositories were disjoint between the train and the eval case, because we wanted to make sure the model wasn't just learning things about the data set. Ideally, when we use this in real life, the trained model will never have seen certain repositories, because they'll be private repositories. As you can see, even with the smaller data set it already beats the base model by quite a lot, and with the larger data set we get an even further boost. The plan action score here is basically the F1 score: we look at all the files the model looked at, and at the end the model outputs what it thinks the right files to edit should be, and we compare that with the labeled ground truth.
During the experiment, some of the things we noticed are that the model starts learning how to do a lot of parallel tool calls. If you look at the traces, the first action the model takes will kick off something like eight different things: listing the repos, grepping for strings, and then, once it gets the results from those tool calls, it will independently explore all of those threads by again running more parallel tool calls. And because running a tool call such as read file is quite a bit faster than the actual model inference, it helps a lot that these back-and-forths are reduced so much. For example, when we put this in Devin directly, we noticed that on the baseline it would take around 8 to 10 back-and-forths with the model to get to the end of planning mode, but with the fine-tuned model we would be done in about four back-and-forths, so that cuts the time roughly in half. Obviously, sometimes the model could learn to run a tool call that takes a longer amount of time, so we do try to penalize things like doing too many tool calls, because that takes a lot of time.
Also, during training, we did have to penalize the model if it took too long, because we don't want the model to keep exploring and never be satisfied. So yeah, we noticed that with this agent RFT feature we can push an already-frontier model like GPT-5 even further on a specialized task when we have a clear reward for what we want to optimize.
For the infrastructure, as mentioned earlier, we run both the tool calls and the grader as remote endpoints. The way the training works is that at every step the platform sends us a bunch of rollout requests: given a certain sample, the model does the rollout, and there are something like 32 copies for each sample. For each rollout, we spin up a new VM, run the tool calls in that VM, and return the results to the platform. Then, at the end, when we get the final answer, the platform calls our grader endpoint, where we compare the trajectory: we look at the list of all the tool calls the model made in that particular rollout as well as the final answer, and we give it a score based on the labeled ground truth. In this case we decided to go with isolated VMs because, as you may remember, we used a shell tool, so the model could decide to take some destructive actions. We didn't want one rollout to affect the other rollouts in case the model goes crazy and runs rm -rf or something like that. We used VMs because we could reuse the production Devin VM infrastructure, where we give every Devin instance a VM, but I think containers work well for this purpose as well.
purpose as well. Um yeah and some of the
some of the interesting things we
noticed was that the RL is quite bursty.
So um at at the beginning of every roll
out they would send us like 500 new
rollout requests. So um you definitely
need to like handle that because that's
like 500 new VMs starting instant at the
same time. Um and then the other other
kind of like foot gun is that um
sometimes like let's say there's
infrastructure error um and the VMs fail
um the it it does like what ends up
happening is the model gets zero reward
because like the tool calls fail and
like the model can't figure out what's
going on. Um and while that's not the
model's fault, that does lead to the
training kind of collapsing or like the
like the model learning in a bad way
because even the model might have done
something good, it got a zero reward. So
um it is good to have a lot of
monitoring on like when there's tool
called failures or like you know there's
abnormal abnormal issues with the model.
Um because sometimes it could be that
the model just has formatting issues and
it's not calling the uh the tool
correctly. Um but uh other times it
could be just our infra issue. But
anyways, uh thank you like to Will and
like Kathy for like helping me debug all
of these issues
>> and all the other people. Yeah,
>> for sure.
>> Yeah. And anyways, it's been it's been
really exciting exciting to try this
agent RF feature and like being able to
tune the performance of GP5 even more.
>> Yeah, thank you.
>> Yeah, it's been incredible. Thank you so much for working with us so closely on this, and really glad to ship an agent to you that seems to be beating the state of the art.
>> Yeah.
>> Do we have time for questions for this part?
>> Let's go through some success stories that we can share first, and we do have quite a few questions, so I want to make sure we have enough time for those.
>> Sounds good.
>> Thanks, Sam. We'll catch you later.
>> Thanks so much. Thank you.
>> Bye-bye.
>> All right. So yeah, we've seen a great success story with Cognition, and we just wanted to showcase some others to show the versatility of agent RFT. Let's start with one I worked on really closely, with Ambience.
>> Closely with all of them.
>> Yeah, that's true. But Ambience is a healthcare company; they're embedded in some hospitals, and one of the tasks they look at is ICD-10 coding, and they have an agent for this. ICD-10 coding is the work you have to do when you want to do billing after a session with a patient: you have to map the topics, the illnesses, the actual diagnoses of what was discussed to specific codes, and those codes are very precise, and there are about 70k of them, so it's quite a hard task. What we're looking at here is: we have a transcript between the doctor and the patient, and automatically from that transcript we want to propose the right ICD-10 codes. This requires a very nuanced understanding of the discussion but also a lot of medical reasoning, and that's why Ambience looked at GPT-5 and was using GPT-5. One other aspect is that it has to be quite fast, so they use GPT-5 with low reasoning effort because of the way the doctors use it.
If you look at the plot on the right-hand side, we started with GPT-5 low hovering around a 0.5 F1 score, then we built the agent with a tool that searches over those ICD-10 codes, and then we ran agent RFT on that model, and you can see the jump from 0.52 to 0.57. It might look small, but the highest achievable performance is around 0.70 to 0.75, because we're looking at a task that is slightly subjective, and doctors don't always agree on what the right codes are. So this is really a significant jump for them. And not only are we seeing this increase in F1, but the fact that we're fine-tuning, as we've seen during the whole session, also allowed them to reduce latency: there's an 18% reduction in average response time, and that halves the number of samples that are above their latency threshold in the product. So that was a great use case, and it was great working with Brandon, Patrick, and the team.
Then another very different use case, no more healthcare: here we're looking at Genspark's slide-creation agent. Genspark has amazing agents, and one of them is an agent that builds slides. The user communicates with the agent, which makes different tool calls, and at the very end the slides are sometimes not aesthetically very pleasing: there's a bit too much text, or they're too long, and therefore they use a reasoning model to try to harmonize the output. That's what we fine-tuned. What was great working with Flame and the team was that they worked a lot on their model grader, with different types of graders to judge both the content and the visual aspect. And they were extremely happy with the output; I also find the slides on the right-hand side quite pretty. In terms of numbers, it provided an 88% improvement on bad cases over the existing models, which is a significant number that we're very happy with.
>> So yeah, Will, do you have other use cases?
>> Yeah, absolutely, man. We should use Genspark for the next build hour slides; these look great. So, moving on, just to show you how diverse the success stories on agent RFT can be: we worked closely with Mako to build a GPU kernel writing agent. Mako is building agents to write GPU kernels, which is really difficult for LLMs because there's a scarcity of training data out there compared to other domains, and that's especially true for new hardware. If Nvidia puts out a new accelerator, there just aren't enough examples of performant kernels. But using agent RFT, as few as 100 PyTorch prompts were enough on their own for GPT-5 to learn how to write fast kernels for a new hardware platform, as long as you have a good grader, which is what Mako worked really hard on. That allowed us to not need any code examples to start writing these really performant kernels. The fine-tuned model actually beats the state of the art by 72% in writing correct and performant GPU kernels, which is a huge boost.
>> And lastly, we worked closely with Rogo. Rogo is building a financial reasoning agent capable of reading financial filings, extracting investment insights, and then supporting human analysts through a question-answering interface. They wanted to fine-tune o4-mini to summarize and present key findings from earlier steps in their finance workflow. Rogo is really interesting: they used a custom LLM-as-a-judge grader that was accessible via an endpoint, so we called their custom grader, which measures the agent's factual accuracy, reasoning, completeness, financial soundness, and clarity of explanation. So you can see how you can fit a lot of your own criteria and your own rubrics into your own custom grader, which is part of the power of the agent RFT platform. The results are fantastic, with a 21% increase in core ML performance along with much lower hallucination rates and fewer missing citations. I also want to call out that Rogo did a ton of work in making their grader unhackable. Earlier runs showed that the model actually started reward hacking. What sometimes happens with the RFT process is that if you have an edge case in your grader, the model is super smart and sneaky and will find ways to exploit that grader. So it's really important to make sure your grader is pretty watertight, and that's what Rogo did: they made their grader watertight, they detected the hack, and as a result the true performance they were trying to optimize the model on just started shooting up.
>> Yeah, I remember there was one Rogo run where we came back on the platform we showed earlier and the average reward on validation was just one.
>> Yeah, 100%. A little too perfect.
>> That's too good to be true. Yeah.
>> All right. So that was it for the customer stories.
>> So let's wrap up. Just to summarize, let's talk about when to turn to agent RFT. The general process that we recommend is: first, you want to make sure you build a really high-quality data set where your training and eval sets closely match your production traffic. You want the agent to not be surprised when you go from fine-tuning it to actually exposing it at showtime. Second, you're probably on this journey of improving your agent's performance, so you want to figure out what the baseline performance looks like, so you know where to improve from. You'll probably want to run these baseline evals against GPT-5 or whatever models you like using. Then from there, you want to try to optimize the performance without fine-tuning, because that's often one of the easier ways to get better performance: you might want to adjust other parts of the task, like improving your prompts, your infra, or your task harness. And then, after you've squeezed all the juice out of the task and the base model, that's when you turn to agent RFT to further optimize and start changing the weights of the model to be fundamentally better on your task and your domain, in an end-to-end fashion.
>> Awesome.
>> Yeah, thank you, Will.
>> Okay, great. Let's move on to Q&A. We have a ton of questions, so I wanted to make sure we had enough time for them. Maybe let's go to the next slide and we can tackle them.
>> Cool.
>> Yeah.
>> What kinds of tasks are best suited for agent RFT?
>> Do you want to take it? Okay, all right, I have a take on this. Obviously, we've explained a lot of ways in which you should structure your data set, making sure your data points have enough variance in them so that during exploration the model actually knows what the difference between a good and a bad trajectory looks like. Fundamentally, you want a train set where you still haven't squeezed all the juice out of it, but where the model, given enough exploration, can figure out what good performance looks like, so that it can hill-climb. That's one thing I would say. Another thing is that there's the task itself, but then there's the way you're evaluating and grading it. If the way you're grading it is binary, then it's going to be really hard for the agent to hill-climb and gradually get better on that task. So you want to make sure that yes, you have the task itself, but you also have how you're evaluating it. Those two things together generally lead to quite a bit of success. Do you want to talk about domains or other considerations?
>> Yeah, in terms of domains, it was quite surprising. I think we showed it with the customer stories, but it's really widely applicable. So I see agent RFT, as you presented very clearly at the beginning, as something for any type of agent. As soon as you have an agent that uses tools that are out of distribution of what we trained our models on, which will naturally happen because you have your own tools, then this is really an opportunity for you to dive in on agent RFT. And if there's a lot of reasoning associated with it, that's an even better reason to use GPT-5 or a very strong reasoning model.
>> Totally.
>> Yeah.
>> Totally.
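To make the point about binary grading concrete, here is a hedged sketch contrasting an all-or-nothing grader with one that gives partial credit. The specific checks and weights are made up for illustration; the idea is simply that a shaped score gives the model a gradient to hill climb.

```python
# Sketch: binary vs. partial-credit grading for the same coding-agent task.
# The weights and checks are illustrative only, not a prescribed scheme.

def binary_grade(result: dict) -> float:
    # All-or-nothing: hard for the model to hill climb on.
    return 1.0 if result["tests_passed"] == result["tests_total"] else 0.0

def shaped_grade(result: dict) -> float:
    # Partial credit: intermediate improvements earn a higher score.
    score = 0.0
    score += 0.2 if result["compiles"] else 0.0
    score += 0.6 * (result["tests_passed"] / max(result["tests_total"], 1))
    score += 0.2 if result["lint_clean"] else 0.0
    return min(score, 1.0)

result = {"compiles": True, "tests_passed": 7, "tests_total": 10, "lint_clean": False}
print(binary_grade(result))  # 0.0  -- no signal to climb
print(shaped_grade(result))  # 0.62 -- partial progress is rewarded
```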
Um, all right. What's different about the RFT platform now since it debuted in May? Well, I'll take this one, even if Will is the one who built it,
>> but obviously with the whole team
>> and a big team.
>> But what I really like about agent RFT now is, well, there are multiple things. In May, we were only able to fine-tune o4-mini with a very specific set of graders. Now you're able to fine-tune o4-mini or GPT-5 with tools and with endpoint graders. So the flexibility is just incomparable, and you can tackle so many new tasks. And what is great here is that with this, you can actually create features in your product that just did not exist before, because the model was not good enough, and now you have a path, other than prompting, to help the model use those tools well and build the product that you actually want. So for me that's the most important thing. And then there's also been a lot of work on the observability of the platform, on those different curves, and on stability. All of this has improved a lot, and that's great.
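As a rough illustration of what launching such a job can look like, here is a hedged sketch using the fine-tuning API's reinforcement method. The agent-specific pieces, the tools list and the endpoint grader URL, are assumptions based on the alpha program rather than a documented schema, so treat those field names as placeholders.

```python
# Hedged sketch of launching an RFT job with the OpenAI SDK.
# The reinforcement method exists in the fine-tuning API, but the
# agent-specific fields shown here (tools, endpoint grader URL) are
# assumptions based on the alpha program, not the documented schema.
from openai import OpenAI

client = OpenAI()

job = client.fine_tuning.jobs.create(
    model="o4-mini-2025-04-16",      # or a GPT-5 reasoning snapshot
    training_file="file-abc123",     # placeholder: JSONL matching prod traffic
    method={
        "type": "reinforcement",
        "reinforcement": {
            # Assumed: an endpoint grader that scores each rollout 0-1.
            "grader": {"type": "endpoint", "url": "https://example.com/grade"},
            # Assumed: tool definitions the model can call during rollouts.
            "tools": [{"type": "function", "name": "run_tests"}],
        },
    },
)
print(job.id)
```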
>> Totally. Yeah. And I also just want to emphasize that with agent RFT, we're now in this multi-step RL paradigm. With the original RFT platform, you give the model the prompt, the model thinks for a while, and then it spits back an answer to you. But now you can actually do this multi-step thing. The model is now in a loop with your world and your environment, and all of that actually gets trained end to end with your grader.
>> Yeah, that's awesome.
>> Yeah.
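To picture that multi-step loop, here is a minimal sketch of a single agent rollout: the model reasons, maybe calls a tool in your environment, the result goes back into the context, and the whole trajectory receives one grade at the end. The helpers call_model, execute_tool, and grade_trajectory are hypothetical stand-ins for your model call, your environment, and your grader.

```python
# Minimal sketch of a multi-step agent rollout that ends in a single grade.
# The helpers below are hypothetical stand-ins for your model call, your
# environment, and your grader.

def call_model(context: list[dict]) -> dict:
    # Placeholder: in practice, a reasoning-model call with tool definitions.
    return {"role": "assistant", "content": "done", "tool_call": None}

def execute_tool(tool_call: dict) -> str:
    # Placeholder: run the tool in your environment and return its output.
    return "tool output"

def grade_trajectory(context: list[dict]) -> float:
    # Placeholder: your grader scores the whole trajectory, not just the answer.
    return 1.0 if context[-1]["content"] == "done" else 0.0

def run_rollout(task_prompt: str, max_steps: int = 10) -> float:
    context = [{"role": "user", "content": task_prompt}]
    for _ in range(max_steps):
        step = call_model(context)                 # reasoning + maybe a tool call
        context.append(step)
        if step.get("tool_call") is None:          # final answer, stop looping
            break
        result = execute_tool(step["tool_call"])   # your environment does the work
        context.append({"role": "tool", "content": result})
    return grade_trajectory(context)               # one grade for the whole rollout

print(run_rollout("fix the failing test"))
```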
>> Um, Christine, do you have any other questions?
>> Yeah, some more.
>> Yeah.
>> Yeah. Why don't we go to the next slide, if you can. I was refreshing during... Okay. So if you're able to refresh, otherwise I can just read them out loud.
>> Uh, let's see. Oh, so here's one: curious why RFT is sample efficient.
>> Go back. I think we
>> Yeah.
>> Yep.
>> No, that's cool. Yeah. Take it.
>> This one, right? Yep.
>> Oh, okay. Cool. Do we have one after this, or
>> Yeah.
>> Okay. There's more. Okay.
>> We can run a bit over.
>> All right. Sounds good.
Okay. Uh, yeah. Why is RFT sample efficient? Okay. So, this is sort of a question around why RL is sample efficient. I mean, we could talk at length about that, but fundamentally, first of all, the model is basically generating its own training data through the exploration process. So remember how we talked about this compute multiplier thing, or how the model is exploring the search space. Well, it's actually generating its own training data through that sampling process. And when you give it the grade or the reward at the end, that tells the model how well it's doing based on the trajectory rollout that it generated. And that actually ends up being the thing that we train the model on. So we obviously do a bunch of reinforcement learning stuff over that trajectory, and we take your reward or your grade and apply it over the rollout in certain ways. But ultimately, the model is generating its own data.
>> Yeah. I'd also add a note on the fact that we're working on a frontier model and fine-tuning a frontier model. So the prior that you're working with is very strong already and probably already has some success on your task.
>> So all of this variance only works because that model is able to sometimes get it right.
>> Exactly.
>> And so it's just the power of the prior. And this keeps accelerating: if there's a new model that is stronger and you have RFT on it as well, you can expect it to be sample efficient again, because the prior will keep on increasing.
>> Yeah, the model is generating really good data for itself, basically because the prior is so good, as Theo said.
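One way to see why the model's own rollouts become useful training data: sample several trajectories for the same prompt, grade each one, and use how much better or worse each rollout did than its siblings as the learning signal. The sketch below is a generic group-baseline illustration, not a description of the exact RL machinery behind the RFT service.

```python
# Generic sketch of turning graded rollouts into a learning signal by
# comparing each rollout to the group average. Illustrative only.
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    baseline = statistics.mean(rewards)
    return [r - baseline for r in rewards]

# Four rollouts of the same prompt, scored by your grader:
rewards = [0.0, 0.4, 0.9, 0.3]
print(group_advantages(rewards))
# [-0.4, 0.0, 0.5, -0.1] -- above-average rollouts get reinforced,
# below-average ones discouraged; all the data came from the model itself.
```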
>> Yeah. All right, let's see what the next one is. How does the RL training objective function differ between general RFT and agentic RFT?
>> I can take it, or you can go first. All right, sure, I'll start. So there's a difference between the actual RL loss function and the reward. The reward is what we allow you to specify, whether that's a reward function that's native to our platform, for example the string check grader that we kind of dissed earlier, or the model grader, or your endpoint grader. And then with that reward, we do stuff with it on our side; that would be the RL, or reinforcement learning, loss. That might be what you're actually asking about. And that part, so far, doesn't differ between general RFT and agentic RFT. We don't change the loss part. We may, though, try new things as we're doing our research and research engineering to deliver even bigger model gains. But for now, honestly, there's no difference there. The main difference is that you can now define much more flexible reward functions.
>> Yeah. And I think that's super important. When you look at base RFT, you don't have access to the chain of thought; you only really have access to the final output. Whereas when you look at agentic RFT, you do have a lot of information on the traces, and Sampriti even mentioned grading those tool calls. So you can have much more control over the policy that you want to see from your model after training. That's one big difference to me.
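Since the flexible reward is where agentic RFT differs most, here is a hedged sketch of what an endpoint grader might look like: a small web service that receives a rollout and returns a score, weighing the tool calls in the trace as well as the final answer. The payload fields and the response shape shown here are assumptions for illustration, not the documented contract.

```python
# Hedged sketch of an endpoint grader as a small FastAPI service.
# The payload fields ("trajectory", "reference") and the {"score": ...}
# response shape are assumptions for illustration, not the exact contract.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Rollout(BaseModel):
    trajectory: list[dict]   # messages, reasoning summaries, tool calls
    reference: str           # expected outcome for this task

@app.post("/grade")
def grade(rollout: Rollout) -> dict:
    tool_calls = [m for m in rollout.trajectory if m.get("type") == "tool_call"]
    final = rollout.trajectory[-1].get("content", "") if rollout.trajectory else ""
    score = 0.0
    score += 0.7 if rollout.reference.lower() in final.lower() else 0.0  # outcome
    score += 0.3 if 0 < len(tool_calls) <= 5 else 0.0  # sensible tool usage
    return {"score": score}
```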
>> Totally. We have a little bit of time left. I don't know, should we try to
>> Yeah, let's go through them. We can run over if people are free to stay.
>> Sounds good.
>> Cool. Using RFT, when there is a new response from the model, does the new model learn automatically from previous... Is this asking, if there is a new response from the model... is this talking about domain shift? For example, you train the model on a set of data points, but then you evaluate on
>> Yeah, I'm not exactly sure. The way I see it, it maybe ties in with this idea of the different trajectories that we run, and technically any new response generated by the model is a new trajectory, and we actually do leverage this when we compute the objective. So my answer would be yes.
>> Yes, during training.
>> If that's the right understanding,
>> during training, yeah. But when you're doing inference, we're not using whatever you get during inference to go back and make the model better continuously. Not yet, at least. Someday.
>> All right.
>> Yeah.
>> Cool.
>> Great question.
>> Yeah,
>> you can read it. Yeah.
>> Uh, I guess so. Okay. Are the alpha endpoints used in the code available to everybody?
>> Um, I guess it depends on which alpha endpoints we're talking about.
>> The alpha endpoints as in the tool integration?
>> Yeah.
>> Yeah. Okay. Oh, yeah. Okay. So, no, they aren't. And that's where the agent RFT interest form comes in. This is functionality that we're exposing and that is more or less in, I don't want to say private beta, but it is the type of functionality where we want to work with you to make sure that you get the most success possible. So this is something that we'd love for you to talk to your friendly neighborhood account executive or account director at OpenAI about, to see how we can work together on it.
>> Yeah. And that question is such a good segue, right?
>> Yeah, exactly. Um, so we can wrap up with some resources on the right here. Feel free to explore these, but if you're interested in learning more about agent RFT and specifically working with our team here, check out this link, this tiny URL, fill out the interest form, and we will be in touch. And with that, I will share one more slide on the upcoming build hours. So join us on December 3rd for agent memory patterns. This will be the last build hour of our agent series, but there are many more build hours to come. Check out our homepage there below and we'll keep adding them. So, thanks everyone, and we'll see you next time.
>> Thank you.
>> Yeah, thank you. Bye.
Agent RFT enables reasoning models to become even more powerful, tool-using agents by training directly on the workflows they will execute in production. By operating on agent rollouts, reasoning models can call tools, generate intermediate reasoning steps, and receive real-time feedback via customer-provided endpoints. This Build Hour walks through the preparation, infrastructure, and safety oversight needed to use Agentic RFT.

Theophile Sautory (Applied AI) and William Hang (API Engineering) cover:
• Improving agent performance with optimization and fine-tuning options
• Key differences between Base RFT and Agentic RFT
• New additions and how Agent RFT works
• Task setup and live demos training with tools
• Customer spotlight on Cognition with Sampriti Panda (Research Engineer)
• Success stories featuring Ambience, Genspark, Mako, and Rogo
• Live Q&A

👉 Agent RFT Interest Form: https://tinyurl.com/agentRFT
👉 Follow along with the code repo: https://github.com/openai/build-hours
👉 Sign up for upcoming live Build Hours: https://webinar.openai.com/buildhours/

00:00 Introduction
01:34 Intro to Agent RFT
11:12 Task Setup
14:15 Demos: Training with Tools
31:33 Best Practices
35:15 Customer Spotlight: Cognition
44:58 Success Stories
51:16 Summary
52:33 Q&A