This week on the Agent Factory,
>> you can have over 9,000 chips all
working together over high bandwidth,
low latency communication without
[music] having to communicate over the
data center network.
>> The price performance that customers and
Google internally see with TPUs is the
best in the industry. [music] Launching
a fine-tuning job on Ironwood TPUs using
MaxText involves three important steps.
>> [music]
>> Hi everyone, welcome to the Agent
Factory holiday special. We are the
podcast that goes beyond the hype and
dives into building production-ready AI
agents. I'm Shir.
>> Hi. And I'm Don. It's great to be here
today.
>> Thanks so much for joining us today for
the first time, Don. It's great to have
you. You know, Gemini 3 was just
released last month and it's crushing
all the benchmarks. I think one of the
most interesting things about this
launch that not everyone is aware of is
that while other companies are chasing
after GPUs, Google is training,
fine-tuning, and serving its model on
TPUs alone, its own infrastructure.
>> Yeah, that's right. The fact that the
model is trained and served solely on
TPUs also allows Google to serve the
model on a crazy scale and at a very
competitive price.
Exactly. And we thought this was a great
opportunity to talk about fine-tuning
and specifically reinforcement learning
with TPUs. By the end of this episode,
we will cover the different steps in
training a model, pre-training and
post-training, which includes supervised
fine-tuning and reinforcement learning
or RL. And I don't know if you heard,
but the RL industry is really buzzing
right now. So, we'll talk about the
latest and greatest with reinforcement
learning. And then we will talk about
TPUs and what are they great at and how
do you actually fine-tune with them.
We're going to see a very cool demo of
that by Don.
>> Yeah. And to help us cover this topic,
we brought in Kyle Meggs, the product
manager on the TPU training team.
>> Thanks for having me. I'm excited to be
here.
>> Welcome to the show, Kyle. So happy to
have you here. And I think now we can
probably get rid of these. Right.
>> Sounds good. Sounds good. So before we
dive into fine-tuning, I want to touch
upon a very basic question. When should
someone even consider fine-tuning?
>> Yeah, that's a great question, Don, and
very timely. There was this uh
interesting paper recently that was
published by Nvidia claiming that
actually small language models are the
future of agentic AI. And why is that?
Because with the right specialization
they could be sufficiently powerful and
necessarily more economical for agentic
systems.
>> Yeah, I think the barrier for entry is
mainly the complexity of fine-tuning and
the additional work and AI expertise
required for fine-tuning and hosting
your own models.
>> Exactly. Foundational models like Gemini
are so powerful out of the box. They are
the easiest way to get started. You can
adapt the model to your use case just by
modifying the instructions, or few-shot
learning, where you provide examples to
the model through the prompt.
>> So when should you consider fine-tuning?
>> Yeah, that's a great question. You
should consider it in one of those
situations. One either you have a very
unique data set and a problem that
requires very high specialization that a
generalist model may not excel in. For
example, this could happen in a medical
domain. I just wrote a blog about this,
so you can catch it in the show
notes. Another situation is when you
have a very strong privacy restriction
and you would like to host your own
model and fine-tune it with your own
data in a very privacy-preserving
environment.
>> Oh, that makes sense. Yeah. Well, okay.
So, now after we broke down the
motivation, Kyle, do you want to walk us
through where fine-tuning comes in
during the different steps of the model
life cycle?
>> Yeah, sure. So, I think
people generally break uh the model life
cycle into three steps. So, the first is
pre-training and potentially continual
pre-training. Um, and if we use an
analogy of say learning chemistry, this
first step is about reading all the
background information in your textbook.
This is learning you know how the
different bonds connect to each other
and how all the molecules connect to
each other. The second step is the
actual post-training or fine-tuning, and
this comprises two steps. The first
is SFT or supervised fine-tuning. And
the second is RL. We can think about
SFT, the first step, as seeing an
example problem in your book as you're
reading the chapter. So you've learned
the subject and now you see how you
would solve a problem. It's all given to
you, right? You're just imitating and
learning from what's already in the
textbook. And then the second step of
post training is reinforcement learning.
And this is where you're actually
tested. So you're given a problem
without the answer. You have to solve it
yourself. And then once you have a
solution, you check the back of the
book. You see how the proper, best
way to solve the problem is given. And
then you compare how you solved it
versus the best way to solve it. And
then you adjust your approach. That's
reinforcement learning. And that's how
it works with models as well because we
ask the model a question. We give it a
prompt. We score the answer. If it does
well, we reward it. If it does poorly,
we penalize it. And then we adjust the
model behavior. And that's called
alignment. And then the last step of the
life cycle is actually doing the
inference or serving the model. And in
RL, you're actually doing that step
three during that training loop, which
is very complicated to set up.
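The loop described here, ask the model a question, score the answer, reward or penalize, adjust, can be sketched in a few lines of Python. Everything in this sketch (the ToyModel class and its scoring rule) is an illustrative stand-in, not a real MaxText or Tunix API:

```python
# A toy version of the RL loop: prompt, score, reward or penalize, adjust.
# ToyModel and the scoring rule are illustrative stand-ins only.

class ToyModel:
    def __init__(self):
        # Two candidate answers to "2 + 2 = ?", each with a preference weight.
        self.weights = {"4": 1.0, "5": 1.0}

    def generate(self, prompt):
        # Greedy: return the currently preferred answer.
        return max(self.weights, key=self.weights.get)

    def update(self, answer, reward):
        # Reinforce good answers, penalize bad ones.
        self.weights[answer] += reward

def score(response, reference):
    return 1.0 if response == reference else -1.0

model = ToyModel()
for _ in range(3):                            # the RL training loop
    answer = model.generate("2 + 2 = ?")      # inference inside the training loop
    model.update(answer, score(answer, "4"))  # check the "back of the book"

print(model.generate("2 + 2 = ?"))            # the model now prefers "4"
```

Note that the loop runs inference (`generate`) inside training, which is exactly the complexity Kyle mentions: in a real system those two halves run on different chips.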
>> Yeah,
>> I see. I really like the analogy of
learning to education, and how supervised
fine-tuning is like learning from a
previous example while reinforcement
learning is trying it yourself and seeing
how you do. Uh, this is a great analogy.
Thanks for bringing that, Kyle. Um, and
I understand that reinforcement learning
is really hot these days. Can you tell
us a little bit about what is RL and why
do we really need it?
>> Yeah, so in terms of what is RL, RL is
the actual step of asking the model
to do inference, asking it to perform a
task, judging the result, and then
updating the model's behavior based on
if it was a good result, you would
reward it, or a bad result, you would
penalize it. That's why it's called
reinforcement learning. And this is
different from SFT, supervised
fine-tuning, where you're just ingesting
data and learning um how your maybe
human instructors want you to behave.
And why do we need it? Is because RL is
really important for alignment. You're
grading the entire model response, not
just next-token prediction, which is
what SFT is really teaching the model.
And so this can do things that SFT
can't. But the problem is it's just
really complex. So you're managing that
training at the same time as the
inference and then you have to move the
model between training and inference and
how do you do that performantly and
avoid bottlenecks. So it introduces a
lot of complexity, but for certain use
cases like safety, reinforcement learning
is really important because you can
teach the model what not to do, which is
really hard with SFT.
>> I see. So I understand it's very
complicated. Um, and I
also understand that maybe not everyone
needs RL. So where do you see
specifically the added value of RL?
Where should I look for it? In which
situations?
>> Yeah, there's many use cases,
but a couple stand
above the rest where it's just very
obvious that RL is the right solution uh
for that problem. So one is safety where
you can penalize the model for doing
something unsafe. So imagine a poor
response. Or more recently, we see a lot
of people doing reinforcement learning
with tool use and so this could be
teaching a model how to do search. But
again, back to safety, you have to teach
the model what not to do. So, you know,
don't do that sort of search, or don't
delete that sort of data. We
want to avoid these catastrophic
results. Um, we also have verifiable
domains. So, that's kind of an area
where RL uh shines. This is things like
coding and solving math problems where
we know what the right solution is.
Reinforcement learning is great here
because we can give the model a prompt,
it solves it, we compare it to a
verifiable um answer, and then if it's
right, we reward it. If it's wrong, we
penalize it. And so those use cases, it
just makes a lot of sense to do RL.
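The "verifiable domains" idea boils down to a reward function the trainer can compute automatically. Here is a minimal sketch for math-style answers; the extraction rule (take the last number in the response) is a simplifying assumption, and real graders normalize formats much more carefully:

```python
# A minimal verifiable reward for math-style problems: compare the
# model's final number against a known correct answer.

import re

def math_reward(response: str, correct_answer: str) -> float:
    """Reward 1.0 if the last number in the response matches, else -1.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if numbers and numbers[-1] == correct_answer:
        return 1.0
    return -1.0

print(math_reward("The total is 7 apples, so the answer is 42.", "42"))  # 1.0
print(math_reward("I think the answer is 41.", "42"))                    # -1.0
```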
>> I see. So alignment, safety, tool use,
math, reasoning, coding, all of these
areas will require RL. Um, so and you
mentioned that there's a lot going on in
the industry. Can you tell us a little
bit you know what are the latest
advancement in this space?
>> Yeah, definitely. So it would be fair to
call 2025 the year of RL because of how
much happened in RL, how much the
industry is interested in RL, and how I
think the future is going to be shaped
by RL. So just looking at kind of a
timeline of 2025, you can see the year
started with DeepSeek R1, which was the
first really powerful open-source
thinking model that was open sourced in
January of this year along with the
algorithm they used for reinforcement
learning, called GRPO, which was much more
efficient than some of the other
algorithms that were popular
at the time. Then throughout the year,
you can see there are a lot of uh models
that were launched that excelled at
reasoning. Um, and even Grok 4, they
said that when they launched it, they
had trained it with reinforcement
learning at pre-training scale, on
200,000 GPUs.
>> Wow.
>> Which is really a massive investment.
So, not only is there interest, but
there's also a lot of investment into
RL.
>> Um, we saw that again in October when
there were a ton of launches from Google
with Tunix, from Meta, from Thinking
Machines. Um and and all of these people
are trying to build solutions in this
space because so many people are trying
to solve these hard RL problems. Uh we
saw Gemini 3 more recently last month
which is a really strong thinking model
that's doing really well in the
benchmarks as we mentioned. And then most
recently we just launched MaxText 2.0,
which focuses on post-training, um, and
we'll have a demo for that later.
>> Awesome. Thanks for sharing that Kyle.
It seems like there is a very increased
investment in RL that shows how much it
is actually foundational to the advanced
capabilities we see today with the LLMs
and agentic systems.
>> When we're talking to
users and customers, we're also seeing some
companies, entire companies, whose
entire purpose is to do post-training.
They just want to take an open-source
model off the shelf. Think like Gemma or
DeepSeek or Qwen. They want to
post-train it, and then their entire
business is built upon post-training
those open-source models, and their
special sauce is
specializing that model for their
customers or their own use case.
>> Yeah. But so what kind of challenges are
these uh companies having?
>> Yeah. So as we uh kind of alluded to
there's a lot of challenges with
combining this training and inference at
the same time. Um we could broadly break
this into maybe three categories. So the
first is infrastructure. How do you
provision the right amount of
infrastructure? So we're talking about
TPUs here, but how many TPUs? What
version of TPUs? How many TPUs for the
training side? How many for the sampling
or inference side? Is that dynamic? What
about if you see a bottleneck? How do
you uh manage this whole process
altogether? It's very complicated. Uh
the second is around the code, which
model to use, which algorithms to use.
So again, you could be using Qwen or
GPT-OSS. You could be using GRPO or DPO or
PPO or GSPO. There's so many options out
there and finding the right library that
makes it easy to use is really hard. So
the last step is bringing all of that
together. Can you build an integrated
solution that doesn't break when you
want to do something different from some
you know, golden path? So someday there's
going to be a new, maybe, Qwen model or
DeepSeek model. How do you quickly add
that, use the latest algorithm, and move
your use case from good to great?
>> Wow. So many decisions to make here in
this process. Do we have any good
guidance for developers around that?
>> Well, on the GPU side, we have a ton of
recipes. And of course, stay with us for
the demo in the next part.
>> Yeah. And on the TPU side, we have a
solution called MaxText. And this is a
vertically integrated stack. So, I just
presented this challenge of piecing
together all these solutions from across
the industry. One thing we're doing on
TPUs is trying to give a vertically
integrated stack. So you have everything
that was co-designed together from the
software to the hardware all happening
within a TPU pod for high performance
and efficiency and then you get your
models, your algorithms, all of those
things from the same place and we make
sure it works.
>> Oh, that sounds good. A vertically
integrated stack. I like that. Um, so
now that we have a better understanding
of what is fine-tuning and what's
happening with reinforcement learning in
the industry, let's get down to the
factory floor and start talking about
TPUs and what reinforcement learning in
action looks like. Kyle, how are
TPUs different? Where do they shine
specifically?
>> Yeah, great question. So, TPUs are
uniquely well suited for RL. Um, it's
almost as if they were designed,
purpose-built for AI applications. So
the first thing you'll notice about TPUs
is that they were designed as a system.
So if you talk to Norm, who some would
call the father of TPUs, he'll explain that
TPU pods themselves were designed first
and then the chips were designed. And so
as a result, you have this entire system
that works extremely well together. It
has scalability that other processors
don't have. So within a single pod, uh
you can scale up to 9,216 chips. And so
when you're doing large scale
reinforcement learning, this is well
above and beyond what other accelerators
can offer. And because it all happens
within the same pod, the
communications between chips are all
over a low latency network. And so with
RL, you're having a trainer, you're
having inference on the samplers, and
then you're doing synchronization
between the two. And with 9,000 chips,
you can have 4,000 on the training side,
4,000 on the inference side, and then
have ultra low communications between
the two.
>> Whoa. So, let me get this straight. You
can have over 9,000 chips all working
together over high bandwidth, low
latency communication without having to
communicate over the data center
network. That That's crazy.
>> Yes, exactly. And I think that's a good
point about not having to go over the
data center network because this is
where things slow down and if your
domain or your pod is much smaller as it
is with other accelerators, you do get
bottlenecked by the data center network.
But TPUs are architected in a 3D torus.
So the communication between the chips
is really fast and low latency. And as a
result, because these were purpose-built
for AI, the price performance that
customers and Google internally see
with TPUs is the best in the industry.
>> Yeah, they really were built for dense
computational problems and you see that
in how companies are using them.
>> Yeah,
TPUs can bring so much scale just by how
they are designed to work in a system
and collaborate well. So, how do we
actually fine-tune uh with TPUs?
>> Yes, good question. So, the solution for
TPU fine-tuning is called MaxText, and it
actually brings together several other
solutions in that vertically integrated
stack we talked about. So, the first is
MaxText itself, which provides high
performance models purposefully designed
and architected for training. The second
is algorithms from a post-training
library called Tunix. The third is
inference from vLLM, which was recently
launched on TPUs and now provides high
performance inference with that popular
open source engine on TPUs. And the last
is an integration with Pathways, which
provides scale and orchestration so that
you can do that weight synchronization
from trainer to sampler over ICI or at
larger scales over DCN.
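The trainer/sampler split that Pathways orchestrates can be pictured as a loop: samplers generate rollouts, the trainer updates weights, and the two are periodically synchronized. All class and method names below are illustrative stand-ins; the real flow is handled by Pathways, Tunix, and vLLM on separate sets of TPU chips:

```python
# Toy stand-ins for the trainer and sampler halves of an RL job. In the
# real stack these run on separate TPU chips and Pathways moves the
# weights; here they are plain objects in one process.

class Trainer:
    def __init__(self):
        self.weights = 0          # stand-in for the model parameters

    def train_step(self, prompt, rollout):
        self.weights += 1         # stand-in for a gradient update

class Sampler:
    def __init__(self):
        self.weights = 0          # possibly stale copy of the trainer's weights

    def generate(self, prompt):
        return f"rollout from weights v{self.weights}"

    def load_weights(self, weights):
        self.weights = weights    # the sync step (over ICI, or DCN at larger scale)

def rl_loop(trainer, sampler, prompts, sync_every=4):
    for step, prompt in enumerate(prompts):
        rollout = sampler.generate(prompt)   # inference on the sampler side
        trainer.train_step(prompt, rollout)  # training on the trainer side
        if (step + 1) % sync_every == 0:
            sampler.load_weights(trainer.weights)

trainer, sampler = Trainer(), Sampler()
rl_loop(trainer, sampler, [f"prompt {i}" for i in range(8)])
print(trainer.weights, sampler.weights)  # both in sync after the last step
```

The `sync_every` knob is where the bottleneck questions from earlier show up: sync too often and you stall on communication, too rarely and the samplers generate from stale weights.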
>> Okay, Don, I think this is it. We are
ready for the demo. Let's see a real
demo of reinforcement learning with GRPO
on TPUs from our Google experts.
>> Awesome. Let's get into it. Launching a
fine-tuning job on Ironwood TPUs using
MaxText involves three important steps.
First, preparation. We have to build a
MaxText image to run the job using
appropriate dependencies. And a lot of
these dependencies are kind of cutting
edge because TPUs, and Ironwood in
particular, are so new. The second task is
provisioning. This is where we use XPK
to build our Pathways-enabled cluster
with TPU nodes and the inter-chip
interconnects all up and running. Third
is actually launching the job. For this,
we're also using XPK to launch and
handle all the orchestration for us. And
finally, we're going to monitor the job
again with XPK and built-in TensorBoard
log files that will give us some nice
graphs to look at. Now, this demo is
already available for older models of
TPU. Just head to the MaxText documentation
link on this slide and clone the code
repo from GitHub. Quick shout out to my
talented colleagues on the MaxText
engineering team and Drew Brown, who prepared this
demo for us on very short notice. And
look forward to us launching the
Ironwood version of this tutorial soon.
We're going to skip over the preparation
and provisioning steps, but that will be
covered in a longer tutorial that Drew
is putting out. For now, let's just
launch the job.
Usually this is done interactively in a
terminal session, but Drew compiled it
here into a shell script that we can
walk through logically. The first step
here is configuration. We're going to
set the zone and cluster name that we're
going to be using. But the most
important thing here is the TPU type. In
this case, it refers to the version of
TPU 7X for Ironwood and the shape of the
cluster. 64 chips is what we're going to
be using. So we can very comfortably fit
the whole model in memory with plenty of
overhead for the tuning operations. And
this means that we're using a 4x4x4
configuration of the chips in a
three-dimensional topology right next to
each other. And they're all going to be
using our ICI or inter-chip interconnect
to pass data between them. But we're
going to let Pathways and XPK handle all
of that for us. All we have to say is
TPU type is tpu7x-64.
Other than that, we're just setting a
few variables about where we're going to
store the output and cloud storage
bucket and where we're getting the
starting checkpoint for the model we're
training.
Next step is constructing the command
that MaxText will actually run within the
container. And in that case, we're
setting some environment variables.
These can be different depending on what
kind of training you're doing. We're
overriding the batch size and the number
of runs to larger than default because
we want to do some actual learning in
this run. And then we're telling it to
store our output somewhere else. And
that's it. Drew didn't write any code
for this. It's all configuration. You
have all of these tools at your disposal
without having to actually write code
for MaxText. And then we launch the job
with XPK. XPK is what's going to
actually build the image for us and send
it to the cluster that exists already.
It's just that simple. And we'll go
ahead and launch that job. It'll take
just a minute or two to actually start
up. and we'll give it maybe 10 or 15
minutes and come back later when it's
actually doing some important work.
Okay, so our Ironwood training job has
been running for about half an hour and
we're going to take a look at what it's
doing. We set this one to run for 250
steps, so it's going to be a few hours
long and hopefully we'll see some good
results, but we're at a point where we
can take a peek and see what's
happening.
So we have this pretty simple monitor
job script that will show us what
commands are needed to monitor the jobs.
Step one is to filter all the jobs that
are running on the cluster for the one
we want. You can see there the command
has the cluster name and the filter by
job flag. XPK goes through a bunch of
validation steps and one of those steps
is printing all of the currently running
pods that are on the cluster. It's a lot
of information, a lot more than we
really wanted, but luckily the filter by
job ID will find just our job. And there
we see our job is running. Next, let's
check out the logs. Lucky for us, in
this pathways job, all of our logs get
sent back to the head pod. And you can
tail these right to your console. We can
see here some of the training
iterations that it's been going through,
some of the prompts and the responses.
And this is what GRPO is doing right
now. It is producing new candidates to
see if it can find better alternatives
to the existing model. But in order to
see what's happening across all of our
steps, we're going to use TensorBoard.
And TensorBoard is going to be pointed
to the log bucket that we've been
outputting this whole time. Let's launch
it.
Here we go. These are live metrics from
the job that is currently running.
training Llama 3.1, 70 billion parameters,
with reinforcement learning tuning on
Ironwood TPUs. We can see a few things
here. It's going to show us something
like loss. We see loss has
spiked right up at the top here for
now. This is somewhat expected on a GRPO
run. We'd expect to see high initial loss,
and then hopefully that will go down over
time. We're now on step 12 of 250.
But that's it. That's how you get
started fine-tuning models using
reinforcement learning with MaxText 2.0
and Ironwood.
>> Wow, what a great demo.
>> Thanks.
>> So, what data set are we using here,
Don?
>> So, this is the GSM8K data set. That's
grade school math, which is really good
for reinforcement learning because all
of the answers are verifiable.
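GSM8K reference solutions end with a line like "#### 7", which is what makes the dataset verifiable: a grader can extract the gold answer mechanically and compare it to the model's. A minimal extractor, simplified for illustration (real graders also normalize commas, units, and the model's own output format):

```python
# GSM8K solutions mark the final answer after "####"; pulling it out is
# all a verifiable-reward grader needs to compare against.

def extract_gsm8k_answer(solution: str) -> str:
    """Return the text after the final '####' marker, stripped."""
    return solution.split("####")[-1].strip()

print(extract_gsm8k_answer("Janet has 3 + 4 = 7 eggs.\n#### 7"))  # 7
```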
>> So, it's like Kyle said before that
reinforcement learning is specifically
relevant when you need math reasoning.
Um, and how long did this uh RL process
take?
>> So in the end 250 passes took about
three hours.
>> So you trained in that example with 64
TPUs. What if you wanted to do much
larger scale, or use a different model, or
even a different algorithm?
>> Okay. So all of these are just changes
to what you pass into the script. This
has all been thought of in Pathways
and in MaxText. All of this is
available. You don't have to do a whole
bunch of coding.
>> Awesome.
>> Awesome. So, thank you so much, Don, for
the cool demo. That's a wrap for today's
show. We talked about fine-tuning,
reinforcement learning, and TPUs. Thank
you to our audience for tuning in. We
hope we gave you some valuable
tools to fine-tune your specialized
agent. You can find all the resources
shared in this episode in the show
notes. We would love to get your
feedback. So, please add your comments
and questions. And don't forget to
follow the Google Cloud Tech channel for
future episodes. We will come back next
year with a new and revamped season of
the Agent Factory. Thank you so much,
Don and Kyle, for joining us today for
the last episode of this season.
>> It was my pleasure, Shir. Yeah, thanks
for having me.
>> Awesome. So, do you want to power it
down with me?
>> Sure. Thanks.
>> Yeah.
>> Powering down.
>> [music]
>> Heat. Heat.
>> [music]
With Gemini 3 crushing benchmarks by training and serving solely on TPUs, we're diving deep into the infrastructure that powers the next generation of AI agents. In this holiday special of The Agent Factory, we go beyond the hype to explore how developers can use TPUs and Reinforcement Learning (RL) to build specialized, production-ready agents at scale. Join hosts Shir Meir Lador and Don McCasland and special guest Kyle Meggs, Product Manager on the Google TPU Training Team. We break down the "why" and "how" of fine-tuning, the critical role of RL in model alignment and safety, and how Google's TPU architecture offers unmatched efficiency for these complex workloads. Plus, don't miss the hands-on demo of MaxText 2.0 running a GRPO job on TPU infrastructure.

In this episode, you will learn:
1️⃣ Fine-tuning fundamentals: When to choose fine-tuning over prompt engineering (focusing on specialization, privacy, and cost).
2️⃣ The model lifecycle: A clear breakdown of pre-training vs. post-training (SFT & RL), featuring Andrej Karpathy's "chemistry textbook" analogy.
3️⃣ Reinforcement learning deep dive: When should you use RL? What added value does it bring? What are the latest advancements in the field?
4️⃣ The TPU advantage: How TPU pods and Inter-Chip Interconnect (ICI) solve critical bottlenecks in large-scale fine-tuning.
5️⃣ RL on TPU demo: A technical look at the MaxText 2.0 stack running Reinforcement Learning (GRPO) on Google Cloud TPUs.

Chapters:
0:00 - Introduction: Gemini 3 and the rise of TPUs
3:13 - Why fine-tune? Specialization and privacy
3:52 - What is fine-tuning? (SFT and RL explained)
5:50 - What is RL and why do we need it?
7:10 - The added value in RL
8:33 - Industry pulse: Why 2025 is the year of RL (DeepSeek-R1, Grok 4, Gemini 3)
10:46 - The challenges of RL: Infrastructure, algorithms, and orchestration
12:52 - Factory floor: How TPUs are designed for scale
15:53 - [Demo] Reinforcement Learning (GRPO) with MaxText 2.0 on TPUs
21:46 - Scaling to 1000+ chips and season wrap-up

About The Agent Factory:
"The Agent Factory" is a video-first technical podcast for developers, by developers, focused on building production-ready AI agents. We explore how to design, build, deploy, and manage agents that bring real value.

🔗 Resources & links mentioned:
➖ Post-training docs → https://goo.gle/4sbBLAd
➖ Google Cloud TPU (Ironwood) documentation → https://goo.gle/3MMFOCY

🔗 Google Cloud open source code:
➖ MaxText → https://goo.gle/4pcDQt4
➖ GPU recipes → https://goo.gle/495tp4x
➖ TPU recipes → https://goo.gle/4qgMF5U
➖ Andrej Karpathy - Chemistry Analogy → https://goo.gle/4pQcMAO
➖ Paper: "Small Language Models are the Future of Agentic AI" (Nvidia) → https://goo.gle/4qmLQIH
➖ Fine-tuning blog → https://goo.gle/4pR211n

🔔 Follow Shir → https://goo.gle/49SAveB
🔔 Follow Don → https://goo.gle/3KKCrff
🔔 Follow Kyle → https://goo.gle/4j7Mg3k

Join the conversation on social media with the hashtag #TheAgentFactory. Connect with the community at the Google Developer Program forums. → https://goo.gle/4oP9bmb

Watch more Agent Factory → https://www.youtube.com/playlist?list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs
🔔 Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech

#TPU #ReinforcementLearning #FineTuning

Speakers: Shir Meir Lador, Kyle Meggs, Don McCasland
Products Mentioned: TPU, Gemini 3, MaxText