This week on the Agent Factory,
>> you can have over 9,000 chips all
working together over high bandwidth,
low latency communication without
[music] having to communicate over the
data center network.
>> The price performance that customers and
Google internally see with TPUs is the
best in the industry. [music] Launching
a fine-tuning job on Ironwood TPUs using
MaxText involves three important steps.
>> [music]
>> Hi everyone, welcome to the Agent
Factory holiday special. We are the
podcast that goes beyond the hype and
dives into building production-ready AI
agents. I'm Shir.
>> Hi. And I'm Don. It's great to be here
today.
>> Thanks so much for joining us today for
the first time, Don. It's great to have
you. You know, Gemini 3 was just
released last month and it's crushing
all the benchmarks. I think one of the
most interesting things about this
launch that not everyone is aware of is
that while other companies are chasing
after GPUs, Google is training,
fine-tuning, and serving its model on
TPUs alone, its own infrastructure.
>> Yeah, that's right. The fact that the
model is trained and served solely on
TPUs also allows Google to serve the
model on a crazy scale and at a very
competitive price.
Exactly. And we thought this was a great
opportunity to talk about fine-tuning
and specifically reinforcement learning
with TPUs. By the end of this episode,
we will cover the different steps in
training a model, pre-training and
post-training, which includes supervised
fine-tuning and reinforcement learning
or RL. And I don't know if you heard,
but the RL industry is really buzzing
right now. So, we'll talk about the
latest and greatest with reinforcement
learning. And then we will talk about
TPUs and what are they great at and how
do you actually fine-tune with them.
We're going to see a very cool demo of
that by Don.
>> Yeah. And to help us cover this topic,
we brought in Kyle Meggs, the product
manager on the TPU training team.
>> Thanks for having me. I'm excited to be
here.
>> Welcome to the show, Kyle. So happy to
have you here. And I think now we can
probably get rid of these. Right.
>> Sounds good. Sounds good. So before we
dive into fine-tuning, I want to touch
upon a very basic question. When should
someone even consider fine-tuning?
>> Yeah, that's a great question, Don, and
very timely. There was this uh
interesting paper recently that was
published by Nvidia claiming that
actually small language models are the
future of agentic AI. And why is that?
Because with the right specialization
they could be sufficiently powerful and
necessarily more economical for agentic
systems.
>> Yeah, I think the barrier for entry is
mainly the complexity of fine-tuning and
the additional work and AI expertise
required for fine-tuning and hosting
your own models.
>> Exactly. Foundational models like Gemini
are so powerful out of the box. They are
the easiest way to get started. You can
adapt the model to your use case just by
modifying the instructions, or few-shot
learning, where you provide examples to
the model through the prompt.
>> So when should you consider fine-tuning?
>> Yeah, that's a great question. You
should consider it in one of those
situations. One either you have a very
unique data set and a problem that
requires very high specialization that a
generalist model may not excel in. For
example, this could happen in a medical
domain. I just wrote a blog about this,
so you can catch it in the show
notes. Another situation is when you
have a very strong privacy restriction
and you would like to host your own
model and fine-tune it with your own
data in a very privacy-preserving
environment.
>> Oh, that makes sense. Yeah. Well, okay.
So, now after we broke down the
motivation, Kyle, do you want to walk us
through where fine-tuning comes in
during the different steps of the model
life cycle?
>> Yeah, sure. So, I think
people generally break uh the model life
cycle into three steps. So, the first is
pre-training and potentially continual
pre-training. Um, and if we use an
analogy of say learning chemistry, this
first step is about reading all the
background information in your textbook.
This is learning you know how the
different bonds connect to each other
and how all the molecules connect to
each other. The second step is the
actual post-training or fine-tuning, and
this comprises two steps. The first
is SFT or supervised fine-tuning. And
the second is RL. We can think about
SFT, the first step, as seeing an
example problem in your book as you're
reading the chapter. So you've learned
the subject and now you see how you
would solve a problem. It's all given to
you, right? You're just imitating and
learning from what's already in the
textbook. And then the second step of
post training is reinforcement learning.
And this is where you're actually
tested. So you're given a problem
without the answer. You have to solve it
yourself. And then once you have a
solution, you check the back of the
book. You see how the proper, best
way to solve the problem is given. And
then you compare how you solved it
versus the best way to solve it. And
then you adjust your approach. That's
reinforcement learning. And that's how
it works with models as well because we
ask the model a question. We give it a
prompt. We score the answer. If it does
well, we reward it. If it does poorly,
we penalize it. And then we adjust the
model behavior. And that's called
alignment. And then the last step of the
life cycle is actually doing the
inference or serving the model. And in
RL, you're actually doing that step
three during that training loop, which
is very complicated to set up.
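The loop described here, ask the model a question, score the answer, reward or penalize, adjust, can be sketched in a few lines of Python. Everything in this sketch (the ToyModel class and its scoring rule) is an illustrative stand-in, not a real MaxText or Tunix API:

```python
# A toy version of the RL loop: prompt, score, reward or penalize, adjust.
# ToyModel and the scoring rule are illustrative stand-ins only.

class ToyModel:
    def __init__(self):
        # Two candidate answers to "2 + 2 = ?", each with a preference weight.
        self.weights = {"4": 1.0, "5": 1.0}

    def generate(self, prompt):
        # Greedy: return the currently preferred answer.
        return max(self.weights, key=self.weights.get)

    def update(self, answer, reward):
        # Reinforce good answers, penalize bad ones.
        self.weights[answer] += reward

def score(response, reference):
    return 1.0 if response == reference else -1.0

model = ToyModel()
for _ in range(3):                            # the RL training loop
    answer = model.generate("2 + 2 = ?")      # inference inside the training loop
    model.update(answer, score(answer, "4"))  # check the "back of the book"

print(model.generate("2 + 2 = ?"))            # the model now prefers "4"
```

Note that the loop runs inference (`generate`) inside training, which is exactly the complexity Kyle mentions: in a real system those two halves run on different chips.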
>> Yeah,
>> I see. I really like the analogy of
learning to education, and how supervised
fine-tuning is like learning from a
previous example while reinforcement
learning is trying it yourself and seeing
how you do. Uh, this is a great analogy.
Thanks for bringing that, Kyle. Um, and
I understand that reinforcement learning
is really hot these days. Can you tell
us a little bit about what is RL and why
do we really need it?
>> Yeah, so in terms of what is RL, RL is
the actual step of asking the model
to do inference, asking it to perform a
task, judging the result, and then
updating the model's behavior based on
if it was a good result, you would
reward it, or a bad result, you would
penalize it. That's why it's called
reinforcement learning. And this is
different from SFT, supervised
fine-tuning, where you're just ingesting
data and learning um how your maybe
human instructors want you to behave.
And why do we need it? Is because RL is
really important for alignment. You're
grading the entire model response, not
just next-token prediction, which is
what SFT is really teaching the model.
And so this can do things that SFT
can't. But the problem is it's just
really complex. So you're managing that
training at the same time as the
inference and then you have to move the
model between training and inference and
how do you do that performantly and
avoid bottlenecks. So it introduces a
lot of complexity, but for certain use
cases like safety, reinforcement learning
is really important because you can
teach the model what not to do, which is
really hard with SFT.
>> I see. So I understand it's very
complicated. Um, and I
also understand that maybe not everyone
needs RL. So where do you see
specifically the added value of RL?
Where should I look for it? In which
situations?
>> Yeah, there's many use cases,
but a couple stand
above the rest where it's just very
obvious that RL is the right solution uh
for that problem. So one is safety where
you can penalize the model for doing
something unsafe. So imagine a poor
response. Or more recently, we see a lot
of people doing reinforcement learning
with tool use and so this could be
teaching a model how to do search. But
again, back to safety, you have to teach
the model what not to do. So, you know,
don't do that sort of search, or don't
delete that sort of data. We
want to avoid these catastrophic
results. Um, we also have verifiable
domains. So, that's kind of an area
where RL uh shines. This is things like
coding and solving math problems where
we know what the right solution is.
Reinforcement learning is great here
because we can give the model a prompt,
it solves it, we compare it to a
verifiable um answer, and then if it's
right, we reward it. If it's wrong, we
penalize it. And so those use cases, it
just makes a lot of sense to do RL.
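The "verifiable domains" idea boils down to a reward function the trainer can compute automatically. Here is a minimal sketch for math-style answers; the extraction rule (take the last number in the response) is a simplifying assumption, and real graders normalize formats much more carefully:

```python
# A minimal verifiable reward for math-style problems: compare the
# model's final number against a known correct answer.

import re

def math_reward(response: str, correct_answer: str) -> float:
    """Reward 1.0 if the last number in the response matches, else -1.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if numbers and numbers[-1] == correct_answer:
        return 1.0
    return -1.0

print(math_reward("The total is 7 apples, so the answer is 42.", "42"))  # 1.0
print(math_reward("I think the answer is 41.", "42"))                    # -1.0
```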
>> I see. So alignment, safety, tool use,
math, reasoning, coding, all of these
areas will require RL. Um, so and you
mentioned that there's a lot going on in
the industry. Can you tell us a little
bit you know what are the latest
advancement in this space?
>> Yeah, definitely. So it would be fair to
call 2025 the year of RL because of how
much happened in RL, how much the
industry is interested in RL, and how I
think the future is going to be shaped
by RL. So just looking at kind of a
timeline of 2025, you can see the year
started with DeepSeek R1, which was the
first really powerful open-source
thinking model that was open sourced in
January of this year along with the
algorithm they used for reinforcement
learning, called GRPO, which was much more
efficient than some of the other
algorithms that were popular
at the time. Then throughout the year,
you can see there are a lot of uh models
that were launched that excelled at
reasoning. Um, and even Grok 4, they
said that when they launched it, they
had trained it with reinforcement
learning at pre-training scale, on
200,000 GPUs.
>> Wow.
>> Which is really a massive investment.
So, not only is there interest, but
there's also a lot of investment into
RL.
>> Um, we saw that again in October when
there were a ton of launches from Google
with Tunix, from Meta, from Thinking
Machines. Um and and all of these people
are trying to build solutions in this
space because so many people are trying
to solve these hard RL problems. Uh we
saw Gemini 3 more recently last month
which is a really strong thinking model
that's doing really well in the
benchmarks as we mentioned. And then most
recently we just launched MaxText 2.0,
which focuses on post-training, um, and
we'll have a demo for that later.
>> Awesome. Thanks for sharing that Kyle.
It seems like there is a very increased
investment in RL that shows how much it
is actually foundational to the advanced
capabilities we see today with the LLMs
and agentic systems.
>> When we're talking to
users and customers, we're also seeing some
companies, entire companies, whose
entire purpose is to do post-training.
They just want to take an open-source
model off the shelf. Think like Gemma or
DeepSeek or Qwen. They want to
post-train it, and then their entire
business is built upon post-training
those open-source models, and their
special sauce is
specializing that model for their
customers or their own use case.
>> Yeah. But so what kind of challenges are
these uh companies having?
>> Yeah. So as we uh kind of alluded to
there's a lot of challenges with
combining this training and inference at
the same time. Um we could broadly break
this into maybe three categories. So the
first is infrastructure. How do you
provision the right amount of
infrastructure? So we're talking about
TPUs here, but how many TPUs? What
version of TPUs? How many TPUs for the
training side? How many for the sampling
or inference side? Is that dynamic? What
about if you see a bottleneck? How do
you uh manage this whole process
altogether? It's very complicated. Uh
the second is around the code, which
model to use, which algorithms to use.
So again, you could be using Qwen or
GPT-OSS. You could be using GRPO or DPO or
PPO or GSPO. There's so many options out
there and finding the right library that
makes it easy to use is really hard. So
the last step is bringing all of that
together. Can you build an integrated
solution that doesn't break when you
want to do something different from some
you know, golden path? So someday there's
going to be a new, maybe, Qwen model or
DeepSeek model. How do you quickly add
that, use the latest algorithm, and move
your use case from good to great?
>> Wow. So many decisions to make here in
this process. Do we have any good
guidance for developers around that?
>> Well, on the GPU side, we have a ton of
recipes. And of course, stay with us for
the demo in the next part.
>> Yeah. And on the TPU side, we have a
solution called MaxText. And this is a
vertically integrated stack. So, I just
presented this challenge of piecing
together all these solutions from across
the industry. One thing we're doing on
TPUs is trying to give a vertically
integrated stack. So you have everything
that was co-designed together from the
software to the hardware all happening
within a TPU pod for high performance
and efficiency and then you get your
models, your algorithms, all of those
things from the same place and we make
sure it works.
>> Oh, that sounds good. A vertically
integrated stack. I like that. Um, so
now that we have a better understanding
of what is fine-tuning and what's
happening with reinforcement learning in
the industry, let's get down to the
factory floor and start talking about
TPUs and what reinforcement learning in
action looks like. Kyle, how are
TPUs different? Where do they shine
specifically?
>> Yeah, great question. So, TPUs are
uniquely well suited for RL. Um, it's
almost as if they were designed,
purpose-built for AI applications. So
the first thing you'll notice about TPUs
is that they were designed as a system.
So if you talk to Norm, who some would
call the father of TPUs, he'll explain that
TPU pods themselves were designed first
and then the chips were designed. And so
as a result, you have this entire system
that works extremely well together. It
has scalability that other processors
don't have. So within a single pod, uh
you can scale up to 9,216 chips. And so
when you're doing large scale
reinforcement learning, this is well
above and beyond what other accelerators
can offer. And because it all happens
within the same pod, the
communications between chips are all
over a low latency network. And so with
RL, you're having a trainer, you're
having inference on the samplers, and
then you're doing synchronization
between the two. And with 9,000 chips,
you can have 4,000 on the training side,
4,000 on the inference side, and then
have ultra low communications between
the two.
>> Whoa. So, let me get this straight. You
can have over 9,000 chips all working
together over high bandwidth, low
latency communication without having to
communicate over the data center
network. That That's crazy.
>> Yes, exactly. And I think that's a good
point about not having to go over the
data center network because this is
where things slow down and if your
domain or your pod is much smaller as it
is with other accelerators, you do get
bottlenecked by the data center network.
But TPUs are architected in a 3D torus.
So the communication between the chips
is really fast and low latency. And as a
result, because these were purpose-built
for AI, the price performance that
customers and Google internally see
with TPUs is the best in the industry.
>> Yeah, they really were built for dense
computational problems and you see that
in how companies are using them.
>> Yeah,
TPUs can bring so much scale just by how
they are designed to work in a system
and collaborate well. So, how do we
actually fine-tune uh with TPUs?
>> Yes, good question. So, the solution for
TPU fine-tuning is called MaxText, and it
actually brings together several other
solutions in that vertically integrated
stack we talked about. So, the first is
MaxText itself, which provides high
performance models purposefully designed
and architected for training. The second
is algorithms from a post-training
library called Tunix. The third is
inference from vLLM, which was recently
launched on TPUs and now provides high
performance inference with that popular
open source engine on TPUs. And the last
is an integration with Pathways, which
provides scale and orchestration so that
you can do that weight synchronization
from trainer to sampler over ICI or at
larger scales over DCN.
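The trainer/sampler split that Pathways orchestrates can be pictured as a loop: samplers generate rollouts, the trainer updates weights, and the two are periodically synchronized. All class and method names below are illustrative stand-ins; the real flow is handled by Pathways, Tunix, and vLLM on separate sets of TPU chips:

```python
# Toy stand-ins for the trainer and sampler halves of an RL job. In the
# real stack these run on separate TPU chips and Pathways moves the
# weights; here they are plain objects in one process.

class Trainer:
    def __init__(self):
        self.weights = 0          # stand-in for the model parameters

    def train_step(self, prompt, rollout):
        self.weights += 1         # stand-in for a gradient update

class Sampler:
    def __init__(self):
        self.weights = 0          # possibly stale copy of the trainer's weights

    def generate(self, prompt):
        return f"rollout from weights v{self.weights}"

    def load_weights(self, weights):
        self.weights = weights    # the sync step (over ICI, or DCN at larger scale)

def rl_loop(trainer, sampler, prompts, sync_every=4):
    for step, prompt in enumerate(prompts):
        rollout = sampler.generate(prompt)   # inference on the sampler side
        trainer.train_step(prompt, rollout)  # training on the trainer side
        if (step + 1) % sync_every == 0:
            sampler.load_weights(trainer.weights)

trainer, sampler = Trainer(), Sampler()
rl_loop(trainer, sampler, [f"prompt {i}" for i in range(8)])
print(trainer.weights, sampler.weights)  # both in sync after the last step
```

The `sync_every` knob is where the bottleneck questions from earlier show up: sync too often and you stall on communication, too rarely and the samplers generate from stale weights.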
>> Okay, Don, I think this is it. We are
ready for the demo. Let's see a real
demo of reinforcement learning with GRPO
on TPUs from our Google experts.
>> Awesome. Let's get into it. Launching a
fine-tuning job on Ironwood TPUs using
MaxText involves three important steps.
First, preparation. We have to build a
MaxText image to run the job using
appropriate dependencies. And a lot of
these dependencies are kind of cutting
edge because TPUs, and Ironwood in
particular, are so new. The second task is
provisioning. This is where we use XPK
to build our Pathways-enabled cluster
with TPU nodes and the inter-chip
interconnects all up and running. Third
is actually launching the job. For this,
we're also using XPK to launch and
handle all the orchestration for us. And
finally, we're going to monitor the job
again with XPK and built-in TensorBoard
log files that will give us some nice
graphs to look at. Now, this demo is
already available for older models of
TPU. Just head to the MaxText documentation
link on this slide and clone the code
repo from GitHub. Quick shout out to my
talented colleagues on the MaxText
engineering team and Drew Brown, who prepared this
demo for us on very short notice. And
look forward to us launching the
Ironwood version of this tutorial soon.
We're going to skip over the preparation
and provisioning steps, but that will be
covered in a longer tutorial that Drew
is putting out. For now, let's just
launch the job.
Usually this is done interactively in a
terminal session, but Drew compiled it
here into a shell script that we can
walk through logically. The first step
here is configuration. We're going to
set the zone and cluster name that we're
going to be using. But the most
important thing here is the TPU type. In
this case, it refers to the version of
TPU 7X for Ironwood and the shape of the
cluster. 64 chips is what we're going to
be using. So we can very comfortably fit
the whole model in memory with plenty of
overhead for the tuning operations. And
this means that we're using a 4x4x4
configuration of the chips in a
three-dimensional topology right next to
each other. And they're all going to be
using our ICI or inter-chip interconnect
to pass data between them. But we're
going to let Pathways and XPK handle all
of that for us. All we have to say is
TPU type is tpu7x-64.
Other than that, we're just setting a
few variables about where we're going to
store the output and cloud storage
bucket and where we're getting the
starting checkpoint for the model we're
training.
Next step is constructing the command
that MaxText will actually run within the
container. And in that case, we're
setting some environment variables.
These can be different depending on what
kind of training you're doing. We're
overriding the batch size and the number
of runs to larger than default because
we want to do some actual learning in
this run. And then we're telling it to
store our output somewhere else. And
that's it. Drew didn't write any code
for this. It's all configuration. You
have all of these tools at your disposal
without having to actually write code
for MaxText. And then we launch the job
with XPK. XPK is what's going to
actually build the image for us and send
it to the cluster that exists already.
It's just that simple. And we'll go
ahead and launch that job. It'll take
just a minute or two to actually start
up. and we'll give it maybe 10 or 15
minutes and come back later when it's
actually doing some important work.
Okay, so our Ironwood training job has
been running for about half an hour and
we're going to take a look at what it's
doing. We set this one to run for 250
steps, so it's going to be a few hours
long and hopefully we'll see some good
results, but we're at a point where we
can take a peek and see what's
happening.
So we have this pretty simple monitor
job script that will show us what
commands are needed to monitor the jobs.
Step one is to filter all the jobs that
are running on the cluster for the one
we want. You can see there the command
has the cluster name and the filter by
job flag. XPK goes through a bunch of
validation steps and one of those steps
is printing all of the currently running
pods that are on the cluster. It's a lot
of information, a lot more than we
really wanted, but luckily the filter by
job ID will find just our job. And there
we see our job is running. Next, let's
check out the logs. Lucky for us, in
this pathways job, all of our logs get
sent back to the head pod. And you can
tail these right to your console. We can
see here some of the training
iterations that it's been going through,
some of the prompts and the responses.
And this is what GRPO is doing right
now. It is producing new candidates to
see if it can find better alternatives
to the existing model. But in order to
see what's happening across all of our
steps, we're going to use TensorBoard.
And TensorBoard is going to be pointed
to the log bucket that we've been
outputting this whole time. Let's launch
it.
Here we go. These are live metrics from
the job that is currently running.
training Llama 3.1, 70 billion parameters,
with reinforcement learning tuning on
Ironwood TPUs. We can see a few things
here. It's going to show us something
like loss. We see loss has
spiked right up at the top here for
now. This is somewhat expected on a GRPO
run. We'd expect to see high initial loss,
and then hopefully that will go down over
time. We're now on step 12 of 250.
But that's it. That's how you get
started fine-tuning models using
reinforcement learning with MaxText 2.0
and Ironwood.
>> Wow, what a great demo.
>> Thanks.
>> So, what data set are we using here,
Don?
>> So, this is the GSM8K data set. That's
grade school math, which is really good
for reinforcement learning because all
of the answers are verifiable.
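GSM8K reference solutions end with a line like "#### 7", which is what makes the dataset verifiable: a grader can extract the gold answer mechanically and compare it to the model's. A minimal extractor, simplified for illustration (real graders also normalize commas, units, and the model's own output format):

```python
# GSM8K solutions mark the final answer after "####"; pulling it out is
# all a verifiable-reward grader needs to compare against.

def extract_gsm8k_answer(solution: str) -> str:
    """Return the text after the final '####' marker, stripped."""
    return solution.split("####")[-1].strip()

print(extract_gsm8k_answer("Janet has 3 + 4 = 7 eggs.\n#### 7"))  # 7
```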
>> So, it's like Kyle said before that
reinforcement learning is specifically
relevant when you need math reasoning.
Um, and how long did this uh RL process
take?
>> So in the end 250 passes took about
three hours.
>> So you trained in that example with 64
TPUs. What if you wanted to do much
larger scale, or use a different model, or
even a different algorithm?
>> Okay. So all of these are just changes
to what you pass into the script. This
has all been thought of in Pathways
and in MaxText. All of this is
available. You don't have to do a whole
bunch of coding.
>> Awesome.
>> Awesome. So, thank you so much, Don, for
the cool demo. That's a wrap for today's
show. We talked about fine-tuning,
reinforcement learning, and TPUs. Thank
you to our audience for tuning in. We
hope we gave you some valuable
tools to fine-tune your specialized
agent. You can find all the resources
shared in this episode in the show
notes. We would love to get your
feedback. So, please add your comments
and questions. And don't forget to
follow the Google Cloud Tech channel for
future episodes. We will come back next
year with a new and revamped season of
the Agent Factory. Thank you so much,
Don and Kyle, for joining us today for
the last episode of this season.
>> It was my pleasure, Shir. Yeah, thanks
for having me.
>> Awesome. So, do you want to power it
down with me?
>> Sure. Thanks.
>> Yeah.
>> Powering down.
>> [music]
>> Heat. Heat.
>> [music]
With Gemini 3 crushing benchmarks by training and serving solely on TPUs, we're diving deep into the infrastructure that powers the next generation of AI agents. In this holiday special of The Agent Factory, we go beyond the hype to explore how developers can use TPUs and Reinforcement Learning (RL) to build specialized, production-ready agents at scale. Join hosts Shir Meir Lador and Don McCasland and special guest Kyle Meggs, Product Manager on the Google TPU Training Team. We break down the "why" and "how" of fine-tuning, the critical role of RL in model alignment and safety, and how Google's TPU architecture offers unmatched efficiency for these complex workloads. Plus, don't miss the hands-on demo of MaxText 2.0 running a GRPO job on TPU infrastructure.

In this episode, you will learn:
1️⃣ Fine-tuning fundamentals: When to choose fine-tuning over prompt engineering (focusing on specialization, privacy, and cost).
2️⃣ The model lifecycle: A clear breakdown of pre-training vs. post-training (SFT & RL), featuring Andrej Karpathy's "chemistry textbook" analogy.
3️⃣ Reinforcement learning deep dive: When should you use RL? What added value does it bring? What are the latest advancements in the field?
4️⃣ The TPU advantage: How TPU pods and Inter-Chip Interconnect (ICI) solve critical bottlenecks in large-scale fine-tuning.
5️⃣ RL on TPU demo: A technical look at the MaxText 2.0 stack running Reinforcement Learning (GRPO) on Google Cloud TPUs.

Chapters:
0:00 - Introduction: Gemini 3 and the rise of TPUs
3:13 - Why fine-tune? Specialization and privacy
3:52 - What is fine-tuning? (SFT and RL explained)
5:50 - What is RL and why do we need it?
7:10 - The added value in RL
8:33 - Industry pulse: Why 2025 is the year of RL (DeepSeek-R1, Grok 4, Gemini 3)
10:46 - The challenges of RL: Infrastructure, algorithms, and orchestration
12:52 - Factory floor: How TPUs are designed for scale
15:53 - [Demo] Reinforcement Learning (GRPO) with MaxText 2.0 on TPUs
21:46 - Scaling to 1000+ chips and season wrap-up

About The Agent Factory:
"The Agent Factory" is a video-first technical podcast for developers, by developers, focused on building production-ready AI agents. We explore how to design, build, deploy, and manage agents that bring real value.

🔗 Resources & links mentioned:
➖ Post-training docs → https://goo.gle/4sbBLAd
➖ Google Cloud TPU (Ironwood) documentation → https://goo.gle/3MMFOCY

🔗 Google Cloud open source code:
➖ MaxText → https://goo.gle/4pcDQt4
➖ GPU recipes → https://goo.gle/495tp4x
➖ TPU recipes → https://goo.gle/4qgMF5U
➖ Andrej Karpathy - Chemistry Analogy → https://goo.gle/4pQcMAO
➖ Paper: "Small Language Models are the Future of Agentic AI" (Nvidia) → https://goo.gle/4qmLQIH
➖ Fine-tuning blog → https://goo.gle/4pR211n

🔔 Follow Shir → https://goo.gle/49SAveB
🔔 Follow Don → https://goo.gle/3KKCrff
🔔 Follow Kyle → https://goo.gle/4j7Mg3k

Join the conversation on social media with the hashtag #TheAgentFactory. Connect with the community at the Google Developer Program forums. → https://goo.gle/4oP9bmb

Watch more Agent Factory → https://www.youtube.com/playlist?list=PLIivdWyY5sqLXR1eSkiM5bE6pFlXC-OSs
🔔 Subscribe to Google Cloud Tech → https://goo.gle/GoogleCloudTech

#TPU #ReinforcementLearning #FineTuning

Speakers: Shir Meir Lador, Kyle Meggs, Don McCasland
Products Mentioned: TPU, Gemini 3, MaxText