I'm quiet.
All right. Maybe we don't start yet.
Addie.
How about now? Better. All right. Now I can hear myself twice, too.
All right, I think we are actually live on stream. So, hi everyone. My name is Sasha Zelanovich and I work at Red Hat. I'm extremely happy that everybody's here. I've been organizing these meetups, obviously with a larger team, for the whole year. This is probably our 20th meetup this year, and it just makes me so happy when people actually show up. So thank you, thank you for being here. As I said, I work at Red Hat, in the AI team there. And Red Hat loves vLLM. We're the leading commercial contributor to the project; about 25% of all the contributions to vLLM come from the Red Hat team. We also have about 10 core committers on staff. So we're all in on vLLM, and you're really in for a treat here. We're really excited about that.
There are a few firsts here. This is our first official European meetup, so I'm really happy that it happened here in Zurich; thanks for that. This is also our first live stream outside of China. We've been live streaming the meetups in China; 41,000 people showed up on the WeChat stream last Saturday. I don't think we'll break that number, but we'll be happy if there are, you know, 41, hopefully. We'll know soon; I'll let you guys know at the end how many people are tuning in. So it's great to see that the vLLM community is strong here. All of the 20-ish meetups that we've organized so far have been completely packed, there's been a wait list, and it's great to see that Zurich is exactly the same. We're happy we chose Zurich for our first one in Europe. So, I
just want to say a few thank yous.
Obviously you guys and the vLLM community. Thank you to Mistral for sponsoring this event. Thank you to IBM. I'm not going to name individual people, because a lot of work went into this and that would just take forever, but thank you to everybody who was involved in putting this event together.
So let me just show you the agenda quickly. We have a packed agenda and we really did go all out here: we have vLLM maintainers, vLLM contributors, committers, award-winning professors. It's a really, really packed agenda. We'll give you an intro to vLLM, and then for the power users of vLLM we'll share a quick project update. We'll talk about quantization, which is a way to take these massive LLMs and make them more efficient and faster, using fewer GPUs while still delivering good performance. We'll talk about Mistral AI's work with vLLM (thanks, guys, for being here). We'll also talk about hybrid models, which are now first-class citizens in vLLM, and then go into distributed inference and talk about llm-d, which is the orchestration layer above vLLM. So it's about to get technical, for sure, and I hope you enjoy today's meetup. Just one quick reminder: we'll have a survey at the end, a three-question survey, super quick, so please keep that in mind; we want any feedback you can give us. We obviously just want to improve and make this better every time. So again, thank you so much for being here. I'm going to pass it over to Philip from IBM to say a few words, and then we'll go into our agenda.
Thank you Sasha.
So welcome from IBM. My name is Philip; I'm the CTO for IBM Switzerland. I'm really happy and glad to host this first European vLLM meetup here in Zurich at our IBM location. As Sasha said, there are many people who helped, but one person, Sara, who is in the back, deserves a great round of applause, because everything that's up and running is thanks to Sara, who did a great job. Thank you. And therefore, thank you for being here. Without further ado, I'll hand over.
>> Yeah, thanks a lot. Hey everyone, glad to be here. My name is Michael Goin and I'm a lead maintainer on the vLLM project. I'm on the Red Hat inference engineering team; I'm a principal engineer there. We've been working on vLLM since late 2023 on all sorts of things. So I'll be giving you a quick overview of vLLM, what's recent, what's important about it, and also a bit of a preamble on why we think this space is so interesting and important to work on. So
first, at Red Hat we're obviously very much focused on open source software, and we think working on it with the community, with other companies, and with other research organizations is the way to win. And we think that making AI inference software more accessible, more efficient, and cheaper for organizations hosting their own models is the best way to democratize the use of AI and make it useful to everyone. As you probably know, the key turning point for our investment, and your investment, in this space was ChatGPT in 2022. But we, and many others, made a bet: back then the open source models weren't really usable and ChatGPT was much more usable, but believing in the spirit of open source, we kept seeing better and better open source models getting closer and closer in quality to the leading closed source models, which is where we are today. Just today we had Kimi K2 Thinking, a one-trillion-parameter open-source model that is competitive with GPT-5 and Claude Sonnet 4.5 on Humanity's Last Exam and on agentic coding tasks. So we're there: open source won.
And so, yeah, that's the power of open source: the green dots over there versus the red dots. The open source models have gotten closer and closer to the closed source models in terms of performance, matching and commoditizing the intelligence that the closed source companies pioneered.
The advantages of open source models should be obvious to you. You can host them yourself. You can control who has access to them, what data goes in, and what data goes out. You can fine-tune them for your applications. You can control them. You can have all the security you need. Many people obviously find this compelling, and that's why the open source community has grown to be so large for AI, and particularly for open source models. And the reason why we're here today, and why this is all centered around vLLM, is that vLLM is the key software abstraction layer and project that lives between the hardware and the applications and models. That's why, in terms of open source software, we think of it as the key Linux in this AI race. It is the important software abstraction layer for all the hardware and all the models, and we'll talk about why that's important and why vLLM turned out the way it has. So first, let's state vLLM's original goal: to be the fastest and easiest-to-use open-source LLM inference and serving engine. Pretty straightforward.
And the reason why vLLM has achieved that, or rather how it has achieved that, is by being really easy to install. It's just a pip install away, or a Docker pull away. And it's going to run on any common GPU you want to run on. Like I mentioned already, it works on all the different model architectures, hundreds of them, and all the major hardware accelerators. And it integrates with the open source community: many, many open source projects, hundreds of them if not thousands, build on top of vLLM as that LLM inference abstraction layer and build their special applications, whether it's reinforcement learning or agentic coding or document understanding and detection, all sorts of things built on top of "how are we going to efficiently do inference with large language models?" And as I mentioned, vLLM is easy to install, but it's also very easy to use. It's a Python-based library with a very simple Python LLM class for easily getting started: you put in whatever model you have on Hugging Face, immediately spin up and compile an LLM engine, and get a generate interface, a chat interface, and also multimodal data, all the ways you would expect to run inference on a model, but of course under the hood doing it very efficiently. Okay.
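For concreteness, here is a minimal sketch of that offline flow with the Python LLM class; the model name is just an example of a Hugging Face checkpoint you might plug in:

```python
# pip install vllm
from vllm import LLM, SamplingParams

# Any Hugging Face model ID works here; Qwen2.5-7B-Instruct is just an example.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain what vLLM does in one sentence."], params)
print(outputs[0].outputs[0].text)
```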
Second, and maybe most important, vLLM offers a very easy-to-use inference server: hosting a model as a server across whatever GPUs you're interested in. And importantly, once you have your server running on localhost:8000, you get a wide variety of endpoints, like the Chat Completions API, Completions, Embeddings, Responses, Rerank, all sorts of things, so that you can use vLLM as your replacement for the OpenAI API you're using, or the Anthropic API you're using, whatever closed source model API you're building your applications on. Not just the model you're using, but your application interface: vLLM can emulate that with any open source model that's capable enough to be used in your applications.
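As a quick sketch of that server flow, using the standard OpenAI Python client pointed at a local vLLM server (the model name is just an example; port 8000 is the default):

```python
# In another terminal: vllm serve meta-llama/Llama-3.1-8B-Instruct
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello from vLLM."}],
)
print(resp.choices[0].message.content)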
Now I want to talk about why vLLM has gotten so popular. A really key investment that vLLM made is relying very heavily on PyTorch. PyTorch, as you know, is the most popular open-source ML framework. The reason vLLM is able to have so much flexibility, both in model definitions and hardware support, is through its integration with PyTorch as an abstraction layer: the hardware ultimately implements its most basic operations through PyTorch, and the open source model definitions are generally written in some form of PyTorch. Through very close collaboration with PyTorch as a project, vLLM regularly uses the latest features and projects out of the PyTorch ecosystem, is very much a power user of PyTorch, and enhances PyTorch through this collaboration.
This is part of the reason why vLLM was a founding member of the PyTorch Foundation, joining DeepSpeed, and most recently Ray, announced at PyTorch Conference two weeks ago. We see this as a really important movement to grow and cultivate the PyTorch ecosystem, and hopefully the love keeps growing more and more. There are even more examples of PyTorch and vLLM taking care of each other, finding issues, pull requests, and fixes in each other's repositories through our very close collaboration. So we love the PyTorch team.
Now, on to vLLM adoption. vLLM started in 2023 with the PagedAttention paper and has grown very steadily in GitHub stars, up to 60,000 most recently, with regular deployments of vLLM reaching 500,000 concurrent GPUs that we know of (it's probably more), and over 9,000 members in our developer Slack working on and asking questions about the project constantly. It's also turned into a very large-scale collaborative project as a result. It's not just popular, it's also well distributed, and many different contributors work on it. We're pretty steadily over 800 PRs a month, which is a lot to deal with, and over 1,700 unique contributors have landed code in the project, and we're rapidly increasing that number these days. As I already mentioned, these contributions are spread across not only Red Hat, IBM, and Mistral, but tons of other organizations, whether enterprise, research, or community. It's a really vibrant community as a result, and it's able to serve a lot of different use cases.
As I mentioned before, vLLM is able to offer the software layer that many of the different hardware platforms want to implement in order to support all the different models they want to run. These are all hardware accelerators that run with vLLM to some extent: various hardware directly in the vLLM repo, like AMD, Intel, Nvidia, CPUs, Google TPUs, and increasingly many more that work with vLLM as a hardware plug-in through a common interface we've defined. Any hardware accelerator can essentially make a plugin that loads into vLLM as needed and continue its own development through the fixed software abstractions that we laid out. Many more hardware accelerators are on the way, hopefully. It's great to see the excitement from the hardware community, and hopefully a lot more efficiency is gained for everyone interested in deploying efficiently.
So next, there's obviously broad model support. It's not only about performance and flexibility; it's about supporting the models that people want to run. We easily support 100-plus, maybe even 200-plus, different model architectures. Of course, all the common state-of-the-art models, but also small ones that get forgotten along the way, and there are power users using those. And of course, we have day-zero support for the key models on launch, often with the model definitions and features implemented and provided by the model vendors themselves, which is really great to see.
And of course, we're not only supporting text-only models; we also have very good support for multimodal models. Here we have the launches of Qwen3-VL and DeepSeek-OCR, which we were really proud to be launch partners for, offering day-zero support for the leading, most cutting-edge multimodal input models. And finally, we don't have to support every model directly in vLLM: through very close integration with Hugging Face Transformers, we were able to make the Transformers model definitions general enough that we can use them directly in vLLM, substituting in the key operations that vLLM needs to run efficiently, such as attention, fused MoE, and linear layers. We're increasingly expanding the flexibility of this, and we're already able to support dense text models, recently MoE models, and encoder-only models, so many of the sentence-transformers models run in vLLM through this. It's really cool to try to reduce the amount of duplicate model definitions and possible sources of truth, especially as many people do training through Hugging Face Transformers. And now with reinforcement learning, you want to make sure your training definition and your inference definition are lined up.
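As a sketch of how that looks in practice (the model name is just an example, and the `model_impl` switch for the Transformers backend is per recent vLLM docs, so treat it as an assumption if your version differs):

```python
from vllm import LLM

# Ask vLLM to run the model through the Hugging Face Transformers model definition
# rather than a native vLLM implementation, useful for architectures that vLLM
# doesn't implement natively yet.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", model_impl="transformers")
print(llm.generate(["Hello!"])[0].outputs[0].text)
```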
Another key thing that vLLM focuses on is running these models efficiently across a variety, or rather a number, of hardware accelerators. We support many different types of parallelism in order to efficiently serve your model across all the GPUs within a node, such as eight H100s, but also across multiple nodes of GPUs, where you have non-uniform interconnects: very fast memory and interconnects within a node, but less quick interconnects as you go across nodes. We support tensor parallelism, pipeline parallelism, mixing those types of parallelism together, expert parallelism and data parallelism, which are key for large mixture-of-experts models (covered a bit later), and disaggregating prefill and decode instances so you can more reliably serve your models with the latency SLOs you have, which will thankfully be covered later in this talk.
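For reference, a minimal sketch of how those parallelism knobs appear on the Python API (the model name and sizes are illustrative; expert/data parallelism and multi-node launches need additional configuration not shown here):

```python
from vllm import LLM

# Shard a large model across the 8 GPUs of a node with tensor parallelism,
# and across 2 pipeline stages with pipeline parallelism.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=8,
    pipeline_parallel_size=2,
)
```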
Of course, a really important and growing use case for vLLM is reinforcement learning. As I mentioned, this style of training actually relies very heavily on inference as well, in order to get rollouts and train the agentic models to be more agentic. So we're happy that vLLM is integrated and featured in all of these key reinforcement learning libraries, and in even more across the growing ecosystem there.
Next, it's not only about training models, of course; it's also about using the models once they're trained. vLLM is also trying to help the community define some sort of standard for the right agentic interface for running these increasingly complex models, which want to intersperse thinking, not thinking, and tool calling, and ultimately work as agents for a long time before they come back to you and need more context. So with the GPT-OSS launch, where we partnered with OpenAI, we support the Responses API, so you can run the same applications that use the Responses API on ChatGPT with vLLM and GPT-OSS, and we've had several coding-agent libraries build on top of that support in vLLM. Hopefully this becomes some sort of standard. But we also support the Messages API, which is what Anthropic uses and which Claude Code is built on, for instance. Most recently we landed support for this, so you can run vLLM serve with a Qwen3 model with tool calling enabled, point Claude Code at it, and run it completely locally, which is really cool. And as more agentic coding interfaces are built on top of the features we have in vLLM, and of course as more capable models come out, we're really interested to see how we can better optimize vLLM for the unique workload that agentic coding requires.
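As a sketch of what that looks like from the client side, here is an Anthropic-style request sent to a locally hosted vLLM server; treat the endpoint path, payload shape, and serve flags as assumptions to check against your vLLM version, and the model name is just an example:

```python
import requests

# Assumes a server started with something like:
#   vllm serve Qwen/Qwen3-8B --enable-auto-tool-choice --tool-call-parser hermes
resp = requests.post(
    "http://localhost:8000/v1/messages",   # Anthropic-compatible Messages endpoint
    json={
        "model": "Qwen/Qwen3-8B",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Write a haiku about inference."}],
    },
)
print(resp.json())
```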
Another key area where vLLM has to focus is accuracy. vLLM is often treated as the reference implementation, and it needs to be correct, especially when models launch on day zero; first impressions make a strong impression. So we take accuracy very seriously and work with the model vendors directly to make sure that vLLM is accurate on the simple evals, does accurate tool calling and accurate long context, and we try to push these things along. We've also recently devoted a lot of work to supporting batch invariance, a form of making inference more deterministic regardless of the load on the server. That's also really important for reinforcement learning, where we want to make sure the inference code and training code match up regardless of load, so that training happens on-policy.
Secondly, we really care about performance and reliability. We have a live vLLM benchmark dashboard on the PyTorch dashboard, and it runs on basically every commit of vLLM, so we can quickly get warned whenever performance regressions happen on key models. This is really useful; shout-out to the PyTorch team for providing it. And yeah, we put our money where our mouth is: most recently we've spent over $100,000 a month on CI, so that developers working on vLLM, pushing their PRs and getting testing, can have very full testing happening for them and can focus on the important features they want to see landed in the project.
Now I'm going to cover some of the key focus areas we've had in vLLM over the past couple of months. One really key one, obviously, is torch.compile from the PyTorch side. Hopefully you're a bit familiar with torch.compile. It offers vLLM what it says on the tin: automatic kernel generation and compilation of PyTorch primitives, and we do use this for automatic fusion of the various PyTorch operations we use directly in vLLM. However, one of the more interesting things we like to use torch.compile for, and have improved through close co-development with the torch.compile team, is using it to give us a graph representation of the model and then implementing custom graph-level transformations. That way we can have custom operations that we author in CUDA or other languages, for attention, for sequence parallelism, for all-reduce + RMSNorm + quant, very complicated sequences, and substitute them inside the graph without needing to change or modify the model definition. We think it's really important to keep the model definitions as simple and pure as possible, hopefully written in just native PyTorch, so that the optimizations and fusions we work on in vLLM can be applied regardless of the specific model definition, purely in terms of the operations of the model and the order they happen in. This takes the onus and the load off the model vendors, who just get to upstream their hopefully simple model definitions, and as we write more general fusions, we see those speedups applied across dozens of models in parallel.
Next, another really key thing, which the talk from Thomas Parnell is going to cover a bit: we've invested heavily in natively supporting hybrid attention models, which we increasingly see as the future of efficient LLM inference. You see this recently with Gemma, Llama 4, Qwen3-Next, DeepSeek, a variety of recent model architectures. This just means we need to efficiently deal with each layer potentially having a different style of KV cache or state management, and we directly support this in vLLM through the hybrid KV cache coordinator, which does a lot of the work to coordinate these complex patterns in the leading models.
Next, another abstraction we developed in vLLM is the KV connector interface. This is key for things like disaggregated prefill/decode, which you'll hear about later, and also for offloading KV cache and doing general operations on the KV cache in whatever way you want to define. We use this in a very key way with NIXL and LMCache for disaggregated prefill/decode, but it's ultimately an extensible interface where you can plug in custom KV connectors based on your research or business needs.
Another really cool feature, which is actually a new type of parallelism, is decode context parallel, which was contributed by the Moonshot (Kimi) team. This is a key optimization for models that have a small number of KV heads, such as MLA for DeepSeek, which only has one: as you do tensor parallelism, you need to replicate the KV heads if you can't evenly divide across them, which means you have to replicate the KV cache. You end up with duplicate KV cache across all eight of the GPUs you're deploying DeepSeek on, which you can see here with the GPU KV cache size only being 600,000 tokens. But with decode context parallel, where we interleave the KV cache between the GPU ranks, we're able to 8x the KV cache and thus get much higher throughput and support longer-context workloads on the same number of GPUs, just by not replicating the KV cache.
Another really key area of development that we're really happy has landed recently: if you're familiar with the journey from vLLM V0 to V1, one of the key developments we mentioned as important was piecewise CUDA graphs, to allow for more flexibility in the model definitions and to allow for potentially complex operations like attention. However, piecewise CUDA graphs still carry a clear latency degradation. So we wanted to expand this into a flexible form of CUDA graphs where we have much more control: can we do piecewise CUDA graphs, or full CUDA graphs, or no CUDA graphs, or do it for decode-only batches or mixed batches? We now have this flexible CUDA graphs design, which we have a great docs page on. The key thing here is that full and piecewise CUDA graphs are now on by default, so you have the best of both worlds: flexibility when you need it, but low latency whenever you need it as well.
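As a sketch of where that knob lives, the compilation config exposes a CUDA graph mode; the exact key name and accepted values here are assumptions to check against the docs page mentioned above:

```python
from vllm import LLM

# Explicitly request the combined full + piecewise CUDA graph behavior
# (described above as the default in recent versions; shown only to
# illustrate where the setting lives).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    compilation_config={"cudagraph_mode": "FULL_AND_PIECEWISE"},
)
```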
Another key place we've been pushing recently, and one that has actually been a great collaboration point with the community, is something we call a model bash. Here's an example with a transformer block of DeepSeek R1: we open up and post an annotated profile to everyone, saying hey, we're spending this much time in this matrix multiplication, this much time in this operation, and we identify places where it would be great to have custom kernels or opportunities for fusion. Of course, we've been working on squashing and improving the performance here ourselves, but we've opened it up to the community, and we've seen a lot of people jump in, help out, and become great new contributors as a result of joining in on the fun of performance optimization.
So, to recap: vLLM takes timely, accurate, optimized model support very seriously, and the community relies on us for this. We focus on having a wide hardware support ecosystem and defining the right interfaces so that hardware vendors can implement just enough software to run a lot of models flexibly. We integrate and play very well with the open source community that builds on top of inference as an abstraction. We work very closely with the PyTorch team and community and believe that's the future for ML frameworks. And we're at the frontier, readily innovating on what's happening in inference systems research and on what real users see and need in production. So hopefully you now know a bit more about some of those things and what we're up to, and will be joining the community soon. I will hand it over to Eldar.
So, hello everyone. My name is Eldar Kurtić. I'm a principal research scientist at Red Hat, and the topic for today is going to be quantization in vLLM. Just so we're all on the same page: quantization is a process in which we compress a model by reducing the number of bits we use to represent either the weights, the activations, or both. If we imagine a model whose weights are represented in FP32, distributed across this Gaussian-like curve, we can see they can take any possible value; we have very high granularity. Through the process of quantization, we take all of these weights and map them into a set of discrete buckets, where each bucket corresponds to a specific value that we can represent in INT4. As you can see, we have much lower granularity in INT4. What happens through this process is that for some of the weights which are close to each other in FP32, we will not be able to represent their difference in INT4. This means we have to put them in the same bucket, and during this process we introduce some error, or quantization noise. The entire game in quantization is how we deal with and manage this noise so that we do not destroy the model, so that we end up with a model which is still usable, but compressed.
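To make that concrete, here is a tiny NumPy sketch of the round trip (symmetric, per-tensor INT4 is just the simplest illustrative choice; real schemes use per-group or per-channel scales):

```python
import numpy as np

def fake_quantize_int4(w):
    """Quantize weights to the 16-level INT4 grid and back, returning the error."""
    scale = np.abs(w).max() / 7.0            # map the largest weight to +7 (INT4 range is [-8, 7])
    q = np.clip(np.round(w / scale), -8, 7)  # nearby FP32 values collapse into the same bucket
    w_hat = q * scale                        # dequantized weights actually used at inference
    return w_hat, w - w_hat                  # second output is the quantization noise

w = np.random.randn(4096).astype(np.float32)
w_hat, noise = fake_quantize_int4(w)
print(float(np.abs(noise).mean()))
```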
The first thing we usually talk about when we talk about quantization is the main question of whether there is such a thing as lossless quantization, because it's a lossy compression: we are reducing the precision of the model. With the whole advent of LLMs, the community became very interested in any technique that would let them run models more efficiently, on a lower number of GPUs and slightly cheaper. One unfortunate thing that happened during this period is that a lot of quantized models started appearing in the wild without proper validation, without properly calibrated quantization processes, and so on. We ended up in a situation where there was a lot of skepticism around the quality of quantized models in general. And this brings us to the first takeaway message about quantization: not all quantized models are created equal. During the quantization process there are a million different hyperparameters and knobs that should be tuned in order to get a good model in the end. If we have a quantized model which is not performing well, we should not blame quantization itself as the root cause; we should look at how the model was produced and whether a proper calibration process was done.
This prompted us to write a paper titled "Give Me BF16 or Give Me Death", which is relatively easy to read and should serve as a kind of data sheet where you can look up the accuracy/performance trade-offs you can expect from popular models under popular quantization schemes. Basically, we took models from 8 billion up to 400 billion parameters, we calibrated the quantization process for three popular quantization schemes supported in vLLM (FP8, INT8, and INT4), and then we ran a ton of evaluations. In total we ran more than a million evaluations; we basically touched every single open source eval that exists out there, starting from the Open LLM Leaderboard v1 and v2, reasoning evals, long context, Arena-Hard, coding, multimodal, and so on. What we found is that if the quantization process is tuned properly, for INT8 and FP8 we should always get models that are almost indistinguishable from the unquantized baselines, meaning in the range of 98-99% accuracy recovery. For INT4 we do see slightly higher drops, but the process should never completely destroy the accuracy of the model; it should still be a usable model in the end.
One snippet of the results is shown in this graph, where we look at reasoning performance, which represents an average pass@1 score across the popular reasoning benchmarks (AIME, MATH-500, and GPQA Diamond). We look at the DeepSeek R1 distill models, from both the Llama and Qwen families, across all of the available sizes, to see how quantization behaves at different scales. Then, following the recipes from the "Give Me BF16" paper, we quantize them to FP8, INT8, and INT4, where for FP8 and INT8 we do weight-and-activation quantization and for INT4 we do weight-only quantization. The gray bar here represents the unquantized BF16 baseline. As you can see, the blue bar, which represents FP8, and the green bar, which represents the INT8 model, are almost indistinguishable from BF16, meaning that if quantization is calibrated properly, you should usually expect a model with 98-99% accuracy recovery relative to the unquantized baseline. The last one, INT4, is a setup where we do see slightly higher drops, but the models are never destroyed in the sense of being unusable. Usually what we've seen, across all of the evals and all of the models, is that if we calibrate INT4 properly, we get accuracy recoveries in the range of 95% and above, which is still not terrible given that we're compressing the model weights by 4x.
A standard setup in academia and research papers is: take a baseline BF16 model, quantize it down, and then look at how well we recover the accuracy of the original unquantized checkpoint. That's a perfect and fair setup to compare different quantization algorithms, hyperparameters, and so on. However, in the real world there is usually a slightly different objective we face, which is more or less the question: what is the best accuracy I can get subject to the constraints of my deployment? That constraint is very often just the amount of GPU memory we have at our disposal to deploy a model. In a world without quantization, the best thing we can do is go to the Hugging Face Hub, see what's the largest model we can fit inside our GPU, take that model, and whatever accuracy that model has is the best accuracy we can get (putting aside additional fine-tuning and training, just taking models off the shelf). However, quantization offers an additional pathway: we can fit inside the same compute constraints but get much higher accuracy in the end. That's the process in which we go to a larger model, which originally does not fit inside our GPU, and quantize it down to a bit-width that does fit, and therefore leverage the accuracy of the larger model. Even if during this process we lose one, two, or 5% of accuracy depending on the quantization scheme, we should still end up in a much better place than with a smaller unquantized model. A perfect example here is Qwen 7B versus Qwen 14B: Qwen 7B in unquantized BF16 form takes about 14 GB of GPU memory for the weights, while Qwen 14B takes 28 GB. So in order to bring the 14B model down to the same GPU memory requirements as Qwen 7B, we can quantize it to FP8, INT8, or even lower like INT4, and benefit from its much higher baseline accuracy, basically going from 65 to 74, 73, or 72, whatever the target scheme gives. So the second takeaway message is: larger quantized is always better than smaller unquantized.
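The back-of-the-envelope arithmetic behind that example is just parameters times bytes per parameter; a quick sketch using the numbers from the talk:

```python
def weight_memory_gb(params_in_billions, bits_per_param):
    """Rough weight-only memory footprint, ignoring KV cache and activation memory."""
    return params_in_billions * bits_per_param / 8  # 1B params at 1 byte each ~= 1 GB

print(weight_memory_gb(7, 16))    # Qwen 7B  in BF16       -> ~14 GB
print(weight_memory_gb(14, 16))   # Qwen 14B in BF16       -> ~28 GB
print(weight_memory_gb(14, 8))    # Qwen 14B in FP8/INT8   -> ~14 GB, same budget as 7B BF16
print(weight_memory_gb(14, 4))    # Qwen 14B INT4 weights  -> ~7 GB
```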
These first two takeaway messages were mostly about accuracy, because that's the most obvious thing everyone asks: when we quantize a model it's a lossy process, we're reducing the bits, so the most obvious first question is what happens to accuracy. The second most important question is what happens to inference speed: given that we have a compressed model, is there anything we can do to accelerate inference, to improve either latency or throughput in our deployment? Here we are looking at a graph for a Llama 3.1 8B model deployed on a single A6000 GPU for a specific docstring-generation use case, and we're looking at how inter-token latency changes with respect to the number of queries per second we're sending to our server. We're looking at three different models: BF16, represented by the blue curve, which is the unquantized baseline; an INT8 weight-and-activation quantized model, represented by the yellow curve; and an INT4 weight-only quantized model, represented by the green curve. There are three interesting regions to observe in this plot.
The first is when we have fewer than four queries per second coming to our server. This is a setup in which we don't have large enough inputs to keep our GPUs busy doing matrix-matrix multiplications, so the main runtime cost we're paying is basically moving data around; computation is essentially free. To accelerate inference here we need to accelerate data movement, and to accelerate data movement we do weight-only quantization, because the weights are the data we need to move to the tensor cores to do the computation there. So we quantize the 16-bit weights down to four bits to accelerate this specific part of the pipeline.
However, if we start increasing the number of queries coming to the server and hit a point (for this specific example, four queries per second and above), we enter the second regime, where we have large enough inputs to keep our GPUs busy doing matrix-matrix multiplications. Inputs are large enough to keep the tensor cores busy and to make the time spent reading the weights negligible relative to the time spent in the matmuls. This is the compute-bound regime, and here the best way to get speedups is to use faster tensor cores. To get faster tensor cores we need both operands of the matrix multiplication quantized, either to INT8 or FP8, because modern GPUs come with INT8 and FP8 tensor cores, and we get two times more FLOPs just by quantizing both operands. That's the second phase, where weight-and-activation quantization becomes a better choice for our deployment than INT4 weight-only quantization.
Then, if we push even more queries per second, we get to the third regime, where weight-only quantization becomes a worse choice than just deploying the unquantized model. This is because we're now heavily compute-bound: the time spent loading the weights is negligible compared to the time spent in the matmuls, and the overhead of unpacking the weights that were compressed to INT4 now starts hurting us. So the third takeaway message is that there are no golden bullets. You should understand your deployment by doing real-world benchmarking of the use cases you think are representative, figure out which of these three regimes you are in, and based on that pick the proper quantization scheme.
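A rough way to see why the crossover happens is to compare bytes moved against FLOPs for a single linear layer; a small illustrative sketch (not a benchmark, just the memory-bound versus compute-bound intuition):

```python
def linear_layer_cost(batch_tokens, d_in, d_out, bytes_per_weight):
    """Very rough cost model for out = x @ W with a d_in x d_out weight matrix."""
    flops = 2 * batch_tokens * d_in * d_out          # multiply-accumulates
    weight_bytes = d_in * d_out * bytes_per_weight   # dominant traffic at small batch
    return flops, weight_bytes

for batch in (1, 4, 64, 512):
    flops, moved = linear_layer_cost(batch, 4096, 4096, bytes_per_weight=2)
    # At small batch, FLOPs per weight byte is tiny -> memory-bound -> shrink the weights.
    # At large batch, FLOPs dominate -> compute-bound -> faster (INT8/FP8) tensor cores win.
    print(batch, round(flops / moved, 1), "FLOPs per weight byte")
```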
And at the end: all of the algorithms and tricks we've used to produce these high-quality quantized models have been implemented in LLM Compressor, so you can just go there and leverage predefined quantization schemes which have been tuned by our team. If you'd like to run these performance benchmarks to see which regime you're in, memory-bandwidth-bound or compute-bound, you can go to GuideLLM, which will enable you to automatically do some benchmarking and get graphs relatively similar to what I've shown here; then you can see which regime you're in and, based on that, figure out which quantization scheme to use. If you don't want to do any of this and just want high-quality, already-validated quantized models, you can go to the Red Hat AI Hugging Face hub: our team has already released a couple hundred quantized models there, and whenever there's a new model we actively release, within a matter of a few days, FP8, INT8, INT4, and FP4 checkpoints that have already been validated by our team to have at least 95% accuracy recovery and decent speedups in vLLM.
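For reference, a minimal sketch of producing an FP8 checkpoint with LLM Compressor, loosely following its documented one-shot flow; the exact import paths and scheme names may differ across versions, so treat this as an outline rather than the canonical recipe:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"   # example model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 weights with dynamic activation scales for all Linear layers, keeping lm_head in BF16.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

save_dir = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(save_dir)      # the saved checkpoint can be served directly with vLLM
tokenizer.save_pretrained(save_dir)
```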
One thing we became very proud of, because LLM Compressor is a relatively new project, is the fact that the Llama 4 team from Meta also adopted it. If you recall, for the Llama 3.1 series they released an FP8 model but used their own internal tools to quantize it; for Llama 4 they switched over and decided to use LLM Compressor to do the FP8 quantization. As for next steps: if quantization doesn't give you enough speedup for your specific use case, we recently launched a new project called Speculators, a library for training speculative decoding models. You can train your own models, or you can use Speculators to convert existing speculative decoding models to a format that vLLM understands, so you can take EAGLE-3, HASS, or whatever is state of the art right now, convert it, and run it in vLLM. We already have a couple of models open-sourced on Red Hat AI's Hugging Face hub, and we'll be adding more. And now I'm going to hand it over to Dan to cover new FP4 formats, which is quantization for modern GPUs.
All right, so thanks a lot for this introduction. My name is Dan Alistarh. I'm a professor at IST Austria and also a researcher at Red Hat AI, and in some sense my job is to really push the limits of what's possible to do accurately with respect to compressing these models. So today I'm going to tell you a bit about our research. In short, what Eldar described was essentially a picture in which we can do close-to-lossless quantization in FP8, or sometimes even INT8, for weights and activations, and that's great because, the way he expressed it, it essentially doubles your FLOPs. What we are really pushing now is whether we can go lower, and the next barrier we are seeing is essentially four-bit precision. So I'm going to give you a short overview of the state of the art, because there has been work here already. The key barrier we're hitting is the fact that modern LLMs have very large outlier values in their weights, but more importantly in their activations. To understand what an outlier is, you can look at the graph on your right: these are extremely large values, hundreds of times larger than the average value, and they essentially mess up quantization because they will either zero out everything that's close to them or need to be clipped, in which case we induce very high error.
we induce very high error. So the the
basic idea to kind of understand how to
get around this or this is the idea
that's been kind of floated in the
literature in various guises is
essentially to replace this kind of
linear layer like the wxransposed
uh where weights the weights are w and
the activations are x is bas basically
going to be replaced by a variant where
we premultiply each one of these
operands the w and x with a matrix and
this matrix is invertible. So
essentially we multiply with a matrix
itself on one hand. This is the matrix R
and then with its inverse on the other
hand. But then we apply quantization to
these two kind of operants
independently. Okay. So if we didn't
have any quantization R would just go
away and we would have like a perfect
recovery of the original output. So what
we're doing here is essentially we're
saying kind of quantization sort of
commutes with with u the matrix we're
applying. Therefore we should be fine.
And then the matrix R should have some
kind of smoothening property that kind
of um just takes away some of these
outliers. So it kind of makes things
easier to quantize. And then once you
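In symbols, the trick described above is (with $Q(\cdot)$ the quantizer and $R$ orthogonal, so $R^{-1}=R^{\top}$):

$$
y \;=\; W x^{\top} \;=\; (W R)\,\big(R^{-1} x^{\top}\big) \;\approx\; Q(W R)\; Q(x R)^{\top},
$$

so if $R$ spreads the outliers out before quantization, both factors become easier to quantize, while in the lossless limit the product is unchanged.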
Once you have this image in your head, it's very easy to understand pretty much all of the methods that have been proposed for weight-and-activation quantization, such as SmoothQuant and so on, because they're essentially just variants that use increasingly complex or fancy matrices to smooth their input. For instance, SmoothQuant just uses a diagonal matrix, for those of you who know the method. QuaRot, whose lead author I challenge you to find in the room after the talks, essentially uses Hadamard matrices, and then people from Meta proposed essentially learning these matrices, and so on. But the problem is that if we try to go directly all the way down to 4-bit precision, existing techniques essentially break. They're okay for academic applications, but, as Eldar would say, the model is destroyed: we're dropping more than 5%, sometimes even more than 10%, relative accuracy. The model is no longer Pareto-competitive; you're just better off taking a smaller model in terms of accuracy.
Okay, so the challenge we have been trying to solve is whether we can go further. Nvidia, who also understands this challenge, tried to address it by proposing a new set of formats called microscaling FP4 formats. Interestingly, they proposed two formats: one is designed directly by Nvidia and is called NVFP4, and the other is in some sense democratically designed by a consortium of hardware vendors, and you'll see in a minute how that ended up. The initial idea is common between the two formats: we want a different quantization grid. Whereas INT4 essentially has a uniform grid, these FP4 formats use an FP quantization grid which is finer towards zero and more relaxed towards the endpoints of the interval. So the grids are different from INT4; they're just truncations of a floating-point format, and the two formats share the same grid. The second thing is that we define a group size, or microscaling size, which means that a set of consecutive values, 16 for NVFP4 and 32 for MXFP4, are quantized together: all of these values share a single scale. This finer-grained quantization allows us to reduce the quantization error we incur whenever we have to bucketize these values. The problem is that this leaves us with a lot of scales: notice that if you have one scale value per 32 or 16 values, that's already a lot of storage for a model that has billions of parameters. So what they do is quantize the scales themselves, and this leads us to the second key difference between the two formats: NVFP4 chose fairly reasonable FP8 (E4M3) quantization for the scales, whereas MXFP4, the openly designed format, chose a somewhat unusual scale format, to put it nicely: they only quantize the scales to powers of two. We'll see what this leads to, but it's probably not a great idea.
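To illustrate the structure just described, here is a rough NumPy sketch of group-wise FP4 fake-quantization; it captures the shared-scale idea but not the exact hardware formats (for instance, real NVFP4 also quantizes the per-group scales to FP8 E4M3, which is omitted here):

```python
import numpy as np

# The 4-bit E2M1 value grid: finer near zero, coarser towards the endpoints.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_GRID = np.unique(np.concatenate([-E2M1, E2M1]))

def fake_quantize_fp4(x, group_size=16, pow2_scales=False):
    """Quantize-dequantize x with one shared scale per group of `group_size` values."""
    g = x.reshape(-1, group_size)
    scales = np.abs(g).max(axis=1, keepdims=True) / 6.0   # map the group max to the FP4 max (6.0)
    scales = np.maximum(scales, 1e-12)
    if pow2_scales:                                        # MXFP4-style power-of-two scales
        scales = 2.0 ** np.ceil(np.log2(scales))
    idx = np.abs((g / scales)[..., None] - E2M1_GRID).argmin(axis=-1)
    return (E2M1_GRID[idx] * scales).reshape(x.shape)

x = np.random.randn(1024).astype(np.float32)
err_nv = np.abs(x - fake_quantize_fp4(x, 16, pow2_scales=False)).mean()
err_mx = np.abs(x - fake_quantize_fp4(x, 32, pow2_scales=True)).mean()
print(err_nv, err_mx)   # coarser power-of-two scales typically give higher error
```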
So we essentially have these two microscaling formats, with some small differences between them. Now, the reason we're actually doing this is that we're able to get much faster computational support for FP4 operations in this context. In particular, you can get a 4x speedup on a B200 and 6x on a B300 GPU according to the data sheets, and this is also something we've verified in practice: we have kernels and you can validate these numbers. You essentially maximize throughput at larger matrix multiplication sizes, but you are able to reach these numbers, and sometimes even slightly exceed them, in practice.
So now the question is: do these formats, the fact that we're quantizing weights and activations in smaller groups, actually solve everything? We'll see that, surprisingly, this is not quite the case. This is a snapshot of results we got by rerunning a subset of the million experiments that Eldar ran for the previous paper. We're looking at the numbers highlighted in red, which are the model accuracy recoveries when we use a version of INT4 that's microscaled to match MXFP4 (that's what you have in the first three rows), and then NVFP4 quantization for weights and activations, and MXFP4. We noticed a couple of interesting things. First, no format is truly lossless; this is not a silver bullet. Even with advanced techniques such as GPTQ or SpinQuant, we still drop about four to 5%, sometimes even more, even with this pretty advanced format. Now, if we rank the formats by their best performance, we see that NVFP4 is the best; INT4 is second by a very small margin, very close to the experimental noise; and MXFP4 is terrible. It turns out this format is terrible because it very aggressively quantizes the scales, so you get very high error from scale quantization. We also noticed that methods that include rotations, this idea of adding the R matrix that I described before, do not help NVFP4, but they help the INT4 and MXFP4 formats a lot. They do improve accuracy as predicted by their original papers, but there's something odd happening with NVFP4, which cannot leverage this rotation idea. And we noticed that some of the existing methods in the literature do improve results a bit, but not very significantly.
Okay, so this was the state of things where we started, and we were a bit puzzled, because it's really not so clear why these things happen. So we tried to do better, and we ended up with a new method that really improves the state of the art and is specialized for these formats. It's called Micro-Rotated GPTQ (MR-GPTQ); you can find the paper on arXiv and the code on GitHub. Essentially, the method combines two key ideas. The first is that instead of using very large rotation matrices, which is what was prescribed by previous work, we do micro-rotations: very small rotation matrices whose size matches the group size, so essentially we're just mixing values within the same group. This comes from an error analysis we did: the original weight and activation distributions are fairly spiky, and if you do very large rotations they become normally distributed, but it so happens that these new formats, NVFP4 and MXFP4, are actually not very good in terms of error for normally distributed data. However, if you have a mix between normal and spiky distributions, which is what you get if you do micro-rotations, it turns out you can actually get good results by quantizing to this format. The second idea is error correction. We have a not-so-great mean squared error on the quantization for both weights and activations, but then we apply a new variant of the GPTQ algorithm, which is an error-correcting weight quantization algorithm, with some specific modifications to adapt it to the FP format, which I'm not going to go into in the talk but am happy to discuss offline. The key observation here, and this is why we really chose this route, is that on modern GPUs such as the Blackwell line, even the lower-powered ones such as the 5090, you can actually fuse these very small micro-rotations into the quantization operation. Therefore the extra cost of the rotation is zero, so the method essentially has no overhead, or negligible overhead, over the original approach.
So now let's get back to what we really care about, which is accuracy. You can see the accuracy of our method and various other variants: on your left-hand side you have the NVFP4 format, where you can see that MR-GPTQ is really outside the margin of error across all of these benchmarks. These experiments are done on the Platinum Bench benchmark that was released by MIT earlier this year. We see that we're able to get about 97% accuracy recovery, so we're really outside of that unfortunate 95% ceiling that previous methods had, and for some of the larger models we're actually able to get to 99% recovery, so we're within the range of FP8 at that point. On the right-hand side, which presents the MXFP4 results, we're able to get roughly 95% recovery. So that format still makes things hard for us, but it's reasonable enough that I believe it could be used. So those are our findings; we currently believe this to be the state-of-the-art method for both NVFP4 and MXFP4. The same trends hold, and you get much better results, if you do weight-only quantization, which is supported in vLLM as well.
Okay, so the key question now is: can we make this fast and can we integrate it into vLLM? Here I really want to highlight the work of my postdoc, Roberto Castro, who is going to move to Red Hat AI next month. He managed to implement all of these kernels, for both NVFP4 and MXFP4, for both the forward pass and the backward pass, with support for the micro-rotations, in a new library called QuTLASS, which is able to get close to ideal performance in terms of speedup. You can follow the lines there, but we're essentially matching the 4x speedups predicted by the data sheets, and for larger batch sizes we get about a 2x speedup in vLLM using these formats. In case you actually want to try this out: the format is now supported in vLLM, so if you have models that are packed to the correct format, you can run them. Currently you would need to produce the models using my lab's compression algorithm, which is available together with QuTLASS, and we're actively working on pushing all of this into LLM Compressor so you can use these much easier flows. And I'll leave you with one interesting finding: one unexpected, or perhaps exceptional, benefit of the MXFP4 format is that because of this simpler scale structure, you actually get up to 20% faster matrix multiplication. So I think we're going to keep working on this; maybe we can actually get MXFP4 to match NVFP4 and essentially get this extra 20% for free. I think that's the end of my section.
Hey everyone. My name is Julian; I'm an open source engineer at Mistral and an occasional contributor to vLLM. And this is Patrick.
>> Sure. Hi, I'm Patrick. I work in the science team and also help on open source.
>> Okay. So we will introduce training reasoning models using vLLM at Mistral. We will introduce Magistral Medium; the paper is on arXiv, it was released in June, and kudos to all of our science team for making Magistral.
So, large reasoning models: essentially you have a model, which you can also call the policy for RL, and what you do is take prompts, feed them to the model, and the model makes completions; you want those completions to be good. So you compute a reward and train the model based on the reward.
So, to dive in: what kind of prompts can you have? For reasoning, it's usually math problems, like "what is two plus two?", or code problems, like "sort a list". You sample these prompts and give them to the model, which will generate several answers for each prompt. Sometimes the answer will be good (it will say 2 + 2 is four), sometimes it will be bad (it will say it's five), so you then verify each of the answers to make sure the model is trained on a good signal. To do that, you compute a loss based on the rewards and the logits computed by the model. We didn't give you the formula, which is a bit complicated, but you can find the formula of GRPO for Magistral in the paper.
and when you have the loss for like for
all deep learning models you just apply
grad updates so that you can train your
model
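For reference, the vanilla GRPO objective from the literature has roughly this shape (this is my sketch of the standard published formulation, not Magistral's exact variant; their paper describes the modifications they make):

```latex
% Vanilla GRPO (group-relative policy optimization), roughly as published:
% G completions o_1..o_G are sampled per prompt q, rewards R_i are normalized
% within the group, and a clipped importance-weighted objective is maximized.
\mathcal{J}(\theta)=\mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\min\!\Big(r_{i,t}(\theta)\,\hat{A}_{i},\;\operatorname{clip}\!\big(r_{i,t}(\theta),1-\varepsilon,1+\varepsilon\big)\,\hat{A}_{i}\Big)\right]
\;-\;\beta\,\mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta\,\Vert\,\pi_{\mathrm{ref}}\right]
% with r_{i,t}(\theta)=\pi_\theta(o_{i,t}\mid q,o_{i,<t})/\pi_{\mathrm{old}}(o_{i,t}\mid q,o_{i,<t})
% and \hat{A}_i=(R_i-\mathrm{mean}(R_{1..G}))/\mathrm{std}(R_{1..G}).
```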
To do that, we need several components in our infrastructure. First we have trainers: the trainer is essentially where you maintain the model weights and perform the gradient updates, and it's usually implemented in PyTorch or JAX. Then you have a set of generators: these are the ones that compute the completions based on the prompts. Generators use the latest policy, i.e. the latest weights computed at that time, to produce completions and output log probabilities for each of the prompts you give them. These are implemented with vLLM.
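As a rough illustration of the generator role only, here is a minimal sketch using vLLM's offline API (the model below is just a small open stand-in for the policy checkpoint; the real setup runs as a fleet of generator services):

```python
# Minimal sketch of a "generator": sample several completions per prompt with the
# current policy weights and keep per-token log-probabilities for the RL loss.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # stand-in for the policy checkpoint
params = SamplingParams(n=8, max_tokens=512, temperature=1.0, logprobs=1)

prompts = ["What is 2 + 2? Think step by step.", "Sort the list [3, 1, 2] in Python."]
for request_output in llm.generate(prompts, params):
    for completion in request_output.outputs:
        # completion.text goes to the verifiers; completion.logprobs feeds the loss.
        print(len(completion.token_ids), "tokens generated")
```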
Then you have verifiers. In reinforcement learning you can use models as verifiers, but for Magistral we didn't use them; we had several rewards that were computed, let's say, deterministically. For example, we wanted to enforce that the model used the same language for the reasoning as the prompt, that the answer was correct, and that the length of the reasoning was neither too short nor too long. These are implemented as simple checks rather than a reward model.
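A toy sketch of what such deterministic rewards might look like (the weights, thresholds, and the language detector below are hypothetical, purely to illustrate the idea, not Magistral's actual reward code):

```python
# Toy deterministic verifier: correctness, language consistency, and length bounds.
def detect_language(text: str) -> str:
    # Placeholder: in practice you'd use a proper language-identification library.
    return "en" if all(ord(c) < 128 for c in text) else "other"

def reward(prompt: str, reasoning: str, answer: str, reference: str) -> float:
    score = 0.0
    if answer.strip() == reference.strip():
        score += 1.0                                         # final answer is correct
    if detect_language(reasoning) == detect_language(prompt):
        score += 0.2                                         # reasoning in the prompt's language
    if 50 <= len(reasoning.split()) <= 8000:
        score += 0.1                                         # reasoning not too short, not too long
    return score
```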
So the vanilla approach for the infra is to have the generator load the trainer weights directly, and we send one batch of B prompts to the generator. We then wait for the generator weights to be loaded, and then we can start to compute the M generations per prompt based on the batch, so you have M times B requests. You compute the answers and send them to the verifiers, and once the answers are received by the verifiers you compute the M times B rewards, which you send back to the trainer. The trainer then waits until all the rewards and answers are received, computes the loss, and performs the gradient updates. But this has several problems and is very slow, and I will leave it to Patrick to explain why.

>> Cool, thanks. Okay, so maybe we'll take another look at the vanilla approach. I think it's quite obvious that certain components are going to be idle while others are running. Whenever you do a gradient update in the vanilla approach, the generator doesn't get fed anything, so the generator is idle and we don't generate anything. Similarly, while we generate, the trainer cannot train. This is a big problem, because you might train on thousands of GPUs and you might also need tens or even hundreds of GPUs for generators, and we definitely don't want all those GPUs sitting idle.

Another big problem, and that's the first point here: how do you actually transfer the weights from the trainer to the generator? When you think about models like DeepSeek, with roughly 500 billion parameters, that's easily a terabyte in BF16 in terms of file storage, and you need to dump that from the trainer and then load it into the generator. The reason this is the first approach everybody takes is that the topology you use for training is very different from the topology you use for generation. That's a big problem: dumping that much data onto distributed file systems can take a lot of time. The different topologies also make it even harder, because you somehow need to consolidate or remap the weights. Also, if we use vLLM for generation, vLLM doesn't necessarily have the same weight mapping that we use for training, so we might even have to change the values of the tensors, for example because vLLM merges the Q, K, and V matrices and we might not do this in training. Also a problem.

The next problem is that you usually want a deterministic number of requests in order to train. Say you want 10 rollouts and 16 prompts: that's 160 requests, and you need all 160 to do one gradient update. The problem is that once you start generating, some requests always finish much earlier than others, so you end up spending a lot of time generating maybe just one or two straggler requests. Again, a lot of lost time.
Then another very fun challenge in RL is that the generation length changes during training. In the beginning you might only generate up to around 100 tokens; at the end you can generate up to 10,000 tokens. That also poses a big problem. And to make this slide even messier, the last point, which I already touched on, is that the trainers are idle while we generate.

Okay, so how do we solve that at Mistral for training? The first obvious thing is that you want to train and generate in parallel. You cannot train and generate in parallel if you want to be perfectly on-policy; that's impossible. So what we do is be almost on-policy: you train, and you only update the generator weights every, say, 10 steps. That means for those 10 steps your generators are going to be a bit off-policy, but it's not a lot, 10 steps out of maybe 10,000. So you're not going to update the generator weights after every gradient update.

The next thing: some people might intuitively say, look, we're going to use the same GPU allocation for generation that we use for training. But that means you have to stop training while you use those GPUs for generation. So we want to use different allocations, and we try to allocate as many GPUs to generation as needed until training becomes the bottleneck again. If we can generate as many tokens as we can consume during training, then we have a balance, and we shouldn't have any bottleneck, since we can generate and train at the same time.
Cool. And now, the beauty of why this actually works so well for Magistral is that we just update the weights on the fly. I talked earlier about how you would have to dump a terabyte of data and then load it again; what we actually do is not dump it at all. We know that our model topology sits on, say, a thousand GPUs allocated for training, and those are connected via NCCL to our generator GPUs, so we just update the weights directly through NCCL, essentially using NVIDIA's fast interconnects and NVLink for this. It still means we have to change the topology, and it also still means we potentially need to merge weights. So it is quite finicky to take this FSDP-sharded state and map it to whatever vLLM expects: it's not just key renaming, it's a lot of splitting tensors and potentially merging them. And the fun part is that we don't even update the KV cache; we just leave the KV cache as is and update the weights. When we first tried this and it worked, we thought, all right, that's cool, we didn't expect that. But surprisingly, the model generates pretty good answers with a somewhat outdated KV cache.
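A rough sketch of the remapping idea only (my own illustration, not Mistral's actual code): a trainer-side state dict with separate q/k/v projections has to be fused into the qkv layout that vLLM's Llama-style modules expect before the weights are pushed to the generators. Names follow common HF/vLLM conventions but are illustrative.

```python
# Fuse separate q/k/v projection weights into a single qkv_proj tensor and rename.
import torch

def remap_for_generator(trainer_state: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    remapped = {}
    for name, tensor in trainer_state.items():
        if ".q_proj." in name:
            fused = torch.cat(
                [tensor,
                 trainer_state[name.replace(".q_proj.", ".k_proj.")],
                 trainer_state[name.replace(".q_proj.", ".v_proj.")]],
                dim=0,
            )
            remapped[name.replace(".q_proj.", ".qkv_proj.")] = fused
        elif ".k_proj." in name or ".v_proj." in name:
            continue  # already consumed by the q_proj branch above
        else:
            remapped[name] = tensor
    return remapped

# In the real setup, the fused tensors are then broadcast to the generator GPUs
# over NCCL and handed to vLLM's weight-loading path, rather than written to disk.
```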
Cool. Another beauty of vLLM that really helped us, and which solves the problem of growing lengths during training, is that thanks to PagedAttention you don't really think about length that much. You have a budget of memory pages and you can just allocate them to whatever sequence you want. So whether you generate four sequences with 10,000 tokens or 256 sequences with around 128 tokens doesn't really matter from a utilization point of view. That's quite nice: GPU utilization doesn't drop just because our lengths get longer and longer, because vLLM uses the pages up correctly. We still need more time if our lengths grow, because we won't get the same number of requests through in the same time, but then you can just spin up more generators during training if you need to. That's also fine; it's not great with Slurm, but it's fine.

Okay, cool. And now, to finish the talk, our beautiful slide that explains this a bit. What I want to show here is these nice colors and how they go from yellow to red: that shows our training steps. You see pi at step i minus 2, i minus 1, i, and then i plus 1; these are gradient update steps. You can see that some requests might actually span three gradient update steps, where the model weights are updated at every step but the KV cache just stays the same, and that works. So you might have certain sequences that are generated across, say, three model updates, and that's still fine: they have 10,000 tokens, and only the most recently generated tokens actually come from a KV cache computed with the current weights; the older KV cache entries still come from old weights, but it works. Cool, I think that's it from our side. If you have any questions about this, just catch us later.
>> Hi everyone. We are Matis and Miguel from Mistral. We work in the inference team, so our job is to make our models run as fast as possible in production and to make that reliable. One quick foreword: we're both occasional contributors to vLLM as well, and one great benefit of using vLLM is that our science team can tinker with it, doing RL, doing training and things like that, and we can use the same software for inference. That helps us a lot: a single source of truth, making sure our models run as expected. So today we're going to cover disaggregated serving with prefill/decode. I'm going to cover the basics, and then Matis will go into the nitty-gritty details of what we experienced in production.
All right.
Okay. So obviously, as Mistral, we are an LLM completion service provider, and when our users ask very important questions to our service, like what are France's top five best cheeses, what you see as a user is first the time it took to get the actual first token. We call that the time to first token (TTFT). It's very important for users to feel some kind of responsiveness from your service. And then you have the latency between each and every token, the inter-token latency (ITL). We're going to talk a lot about these two during this talk.

Of course, when you serve many different users, you want to aggregate those metrics and monitor them. Looking at the median inter-token latency, for example, is not enough; we also want to track the P99, the worst percentile of what a user can get from your service, and we don't want any degradation across the whole range of percentiles.
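A tiny illustration of those two metrics with made-up numbers (TTFT from the first token timestamp, ITL from the gaps between tokens, then the percentiles we monitor):

```python
import numpy as np

# (request start time, [token arrival timestamps]) for two hypothetical requests
requests = [
    (0.00, [0.21, 0.24, 0.27, 0.33]),
    (0.50, [0.95, 1.00, 1.04, 1.09]),
]

ttfts = [tokens[0] - start for start, tokens in requests]                      # time to first token
itls = [b - a for _, tokens in requests for a, b in zip(tokens, tokens[1:])]   # inter-token gaps

print("TTFT p50 / p99:", np.percentile(ttfts, [50, 99]))
print("ITL  p50 / p99:", np.percentile(itls, [50, 99]))
```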
So, very simply, how inference works. You have a request coming in, and it goes through the prefill phase, which is compute intensive because you can process all the prompt tokens in parallel, so you're actually able to leverage all the tensor cores on your GPU. That's nice, and that's how you populate your KV cache. Then the request goes through the decode phase, which is very memory bound: for each and every token you generate, you have to load all your model weights from global memory onto the chip.

To have a service that can accept multiple requests arriving at different times, vLLM does what's called in-flight batching, or continuous batching; it has several different names. The idea is quite simple: you process the batch step by step, and every time new requests come in, you can add them to it. So here you would have request A going through its generation phase, batched together with the context (prefill) phase of requests B and C. That's nice because you're able to leverage your GPU resources a bit better, since the prefill phases really use all of them.

Okay, but when you do this, there's something you don't want: high spikes in the inter-token latency, because every time you're doing decode and another prefill request comes in, it gets scheduled in the same step, which then takes longer to process. As a user that's very bad, because your token stream suddenly stutters a little bit, which is worse in terms of responsiveness. And if we want to actually make money generating tokens, we want to aggregate as many requests as possible to use our GPUs to their maximum performance, but if we do that, we increase those latencies a lot. So it's difficult to apply in practice for user-facing interactions like chat.
So one of the solutions is prefill/decode disaggregation. You run the prefill phase and the decode phase on two distinct, physically different sets of GPUs: one set only does prefill and the other only does generation, so you never have prefill and decode scheduled at the same time, and your inter-token latency is much more stable. Another side benefit of disaggregated serving is that you can optimize both deployments separately. You could, for example, have different hardware doing prefill and generation: for prefill you mostly care about the number of FLOPS, so you might want GPUs with a huge amount of FLOPS and care a little less about memory bandwidth, whereas for generation you could optimize for hardware with better bandwidth. We also talked a bit earlier about different quantization schemes: you could have, for example, weight-only quantization on the generation side, to benefit from needing less memory bandwidth to load your weights, while your prefiller uses a different quantization scheme. The same goes for sharding your model: you could have a different TP size or EP size for generation and for prefill. So it has a lot of benefits.

How does that work in vLLM? At a very high level, it's fairly simple. There is a KV connector API that you can use for this. What you do, for example, is send a request to a prefill instance and sort of trick it into only doing prefill by setting max_tokens equal to one, so it only generates one token. When the request comes back, it says: my KV cache is here, you can fetch it, through NIXL for example. You then send that to the decode instance, which fetches the KV cache and proceeds with further generation.
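A toy proxy sketch of that flow, loosely modeled on vLLM's disaggregated-prefill examples (the field name "kv_transfer_params" follows those examples and may differ across vLLM versions; the URLs are placeholders):

```python
# Toy prefill/decode proxy: prefill with max_tokens=1, then forward the KV-transfer
# metadata to the decode instance, which pulls the KV cache and keeps generating.
import copy
import requests

PREFILL_URL = "http://prefill-host:8000/v1/completions"
DECODE_URL = "http://decode-host:8001/v1/completions"

def disaggregated_completion(request_body: dict) -> dict:
    # 1) Run the prompt through the prefill instance, forcing a single output token.
    prefill_req = copy.deepcopy(request_body)
    prefill_req["max_tokens"] = 1
    prefill_resp = requests.post(PREFILL_URL, json=prefill_req).json()

    # 2) Hand the KV-transfer metadata to the decode instance.
    decode_req = copy.deepcopy(request_body)
    decode_req["kv_transfer_params"] = prefill_resp.get("kv_transfer_params")
    return requests.post(DECODE_URL, json=decode_req).json()
```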
And this is our goal: this is actual data we measured in production. The red line is the P99 of the inter-token latency, and as you can see it's huge compared to the P50, which you can see on the orange and yellow lines, for example. What we want to achieve with disaggregated serving is to get to that blue line, which sits a bit above the others but is completely manageable. So I will leave the floor to Matis to explain that.
>> Yeah, thank you Miguel. So I will now cover the challenges of prefill/decode disaggregation; there are several of them. The main one you will encounter is usually the infrastructure. There are different ways of doing prefill/decode disaggregation, and one of them is over Ethernet, but if you do that you will suffer a high TTFT, because the KV cache can be very large. For instance, on one of our older models the KV cache can be up to 16 GB, which is a lot, and that's already in INT8. So you can imagine: if you don't have InfiniBand, which is basically a way to communicate very fast between nodes, between GPUs, you will have a high TTFT. Just for context, the green links you see there are several hundred gigabytes per second. To do this within vLLM, we are using NIXL, an NVIDIA library that came out in March, I think, and which uses UCX as a backend; they have several backends, but the one we're using right now is UCX, and it's a way to leverage InfiniBand. InfiniBand is itself a challenge, because you need to set it up on your infrastructure, which is not easy, and you obviously need to configure it and have the whole stack running.
Then you will face another challenge: how do you correctly size your disaggregated instances? Usually when we talk about disaggregation we talk about XPYD, where X is the number of prefill instances and Y is the number of decode instances. So it's quite simple: you will have X prefills and Y decodes. This sizing depends on several things. One is the model, obviously, and the hardware, because models won't all behave the same way. But I think the most important factor is your input/output sequence-length distribution. The high-level overview is quite simple: if you only send long contexts to your model, you will be prefill heavy; if you generate only long outputs, which is what happens with reasoning, you will be decode heavy.

There are some caveats with that. The first one is that when you have a long-tail distribution of input sequence lengths, in a traditional setup within vLLM you will see latency spikes on the other requests. The reason is that you need to enable partial prefill, which you can do with the long prefill token threshold, and you can increase the maximum number of batched tokens on the prefill side. That's something you can do because on the prefill instances you are only compute bound, so you can increase the maximum batched tokens as much as you want.
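A hedged sketch of those prefill-side knobs, using vLLM's offline API (argument names match recent vLLM releases but may differ in yours; the model and values are purely illustrative, not a recommendation):

```python
# Illustrative prefill-side configuration: chunked/partial prefill plus a larger
# token budget, since the prefill instances are compute bound.
from vllm import LLM

prefill_llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",       # placeholder model
    enable_chunked_prefill=True,            # allow partial prefills
    long_prefill_token_threshold=2048,      # prompts longer than this count as "long"
    max_num_batched_tokens=32768,           # push up the per-step token budget
)
```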
Another caveat: you obviously need to be sure that your model won't put too much pressure on your decode side, so you must watch how your model behaves, especially when reasoning. And the last caveat is that you need to be extra sure that you actually need disaggregated serving: if you don't use your model much, for instance for older models or on a local setup, in-flight batching is much more efficient resource-wise.

Another thing that is quite important, and also a challenge, is that you need to be able to schedule your requests. The reason is that for a given set of resources, say 100 GPUs, 50% of them will be used for prefill and the other 50% for decode, so only half of them can handle each part of the workload. The issue is that you can get load spikes, which is not great, and one side can get a flood of work. The way you handle that is by scheduling, and that's something you can do with llm-d; we are working on this at the moment.

Last but not least, you need to be aware of how stable your deployment is. If your prefill crashes, because you are rolling out an update or simply because there is a vLLM bug or whatever, you need to know about it and be extra sure that your decode won't die as well. This stability has been greatly improved recently, and thank you for that, it's great. The other issue you must be aware of is that when you upgrade vLLM, the prefill and decode versions need to be compatible. It seems obvious, but in production it's actually quite challenging, and you also need to keep the prefill/decode ratio.

For everything I mentioned, there are a few metrics to look at. You need to watch some metrics on the prefill side, obviously the pending requests in the queue, and some on the decode side, which are also the pending requests in the queues. You also need to look at the total experienced TTFT, which corresponds to the sum of the first TTFT and the second TTFT, because that is the TTFT the user will see. So if you notice spikes, especially in the P99 TTFT, be aware that it's either a scheduling issue or a sign that you need to scale up some instances.
Another optimization you can do with disaggregation is to trade a slight increase in the P99 inter-token latency for an improvement in your median TTFT. This is actually quite important, because users need a good TTFT, especially for small requests, and you do that by routing small requests directly to the decode instances. Why does that work? Well, you only increase the P99 a little because you only route very small requests, but in practice, at least for us, most requests are very small at the beginning of a conversation, things like "hi", and you want to offload those from the prefill side. In general, the total response time is expected to be better than in traditional deployments. A minimal routing rule along these lines is sketched below.
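```python
# Purely illustrative routing rule (the threshold and pool handling are hypothetical):
# very short prompts skip the prefill pool and go straight to a decode instance,
# trading a small P99 ITL increase for a better median TTFT.
SMALL_PROMPT_TOKENS = 256  # assumed cutoff, tune to your workload

def pick_target(prompt_token_count: int, prefill_pool: list[str], decode_pool: list[str]) -> str:
    if prompt_token_count <= SMALL_PROMPT_TOKENS:
        return decode_pool[0]   # "hi"-style requests: decode handles the tiny prefill itself
    return prefill_pool[0]      # long prompts: disaggregated prefill, then decode
```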
So, in conclusion: disaggregated serving improves the quality of service, but at the cost of additional complexity. That's kind of obvious, but it needs to be said, because you have to set up InfiniBand, like I said. It's very much worth it for interactive applications and high workloads, but be aware that it's not always worth it. Correctly sizing XPYD is kind of hard: you need to look at the metrics I mentioned earlier, you need to know your input/output distribution, especially the long tail, you need to schedule the requests properly, and stability matters a lot. But in the end, I would say disaggregated serving is quite nice. This is a snapshot I took a few days ago. You see some spikes there; that's because on this setup we didn't have scheduling yet (we do have it today, but I didn't update the slide). The P99 ITL there is high, but overall it's much better, and the reason it's high is that on this setup we routed a few more of the small requests to the decode. If we want to reduce that, we basically need fewer small requests going directly to the decode.

And I would also like to thank the vLLM team and the llm-d team for their collaboration and dedication on this subject and others, and kudos to everyone who made this meetup happen. Special thanks to Robert Cho and Will Heaton for their significant contributions on this subject, and also because they helped me with a horrible bug. And by the way, we are hiring; you can check our careers page.
>> Hello everyone, good evening. My name is Thomas Parnell, I'm a principal research scientist at IBM Research and a committer on the vLLM project. Today I'm going to talk about the work we've done to enable hybrid models as first-class citizens in vLLM: the journey from supporting these models as, really, a hack in V0 to fully supported, well-integrated models in V1.

Okay, so before I get into what a hybrid model is, I want to motivate a bit why we need them. I think everyone in this room is familiar with attention and how successful it's been at language modeling; it has revolutionized the industry. And in vLLM we have tons of very smart ways of implementing attention on modern GPUs, including vLLM's dependencies like FlashAttention and FlashInfer, and techniques like PagedAttention, tiled softmax, tensor-core kernels, and quantization. So much amazing engineering has gone into optimizing attention on modern GPUs. However, despite all that, there are still some theoretical issues with attention which we can't avoid. In particular, when sequences get very, very long, we have two main issues. One is that the state we need to maintain between iterations grows linearly with the sequence length, so as we go to a million tokens the KV cache becomes huge. And secondly, for really long sequences the time to do the prefill, which affects your TTFT, blows up quadratically. These are theoretical problems that no amount of engineering can really overcome.
So why do we care? Maybe we don't care about long sequences? I just want to say why this is so important: in the kinds of applications and the ways people use these models, we see increasingly long sequences. Some examples. One is RAG, where you look up a bunch of documents in some vector database and insert them into your prompt; depending on how many documents there are, this can lead to a really long prompt being sent to your inference server. Another example is agentic patterns, where you have a multi-turn interaction with the model: you ask the model to generate something, you do a tool call, you get the output from that tool, you pass it back to the model, and this process iterates, which can lead to really long sequences being handled on the inference server. And finally, we have the emergence of thinking and reasoning models, where we tell the model to think things through step by step and insert thinking tokens; this again can lead to really long rollouts from the model that we need to be able to support. So long sequences are important, and we need to think about how we can support them in vLLM without ruining performance.
I'm not going to give you the whole history of state space models and linear attention; I'm just going to cover some developments over the last few years. I'll start from around 2021 with the S4 paper. This was one of the first attempts to apply state space models, which have a long history with connections to control theory and RNNs, to language modeling. It's not the first state space model, but it's one of the most successful attempts to apply SSMs to language modeling. The equations at the top show the core idea: a recursive mapping from an input sequence x to an output sequence y, with matrices A, B, and C. A and B take your input and map it into a latent state h, and the C matrix maps h to the output. What's really important to understand is that this recursion is linear in the sequence length: no matter how long your sequence gets, you always do T steps if your sequence length is T.
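In standard notation, the discrete SSM recurrence being described is:

```latex
% Discrete state-space recurrence (S4-style): a fixed-size latent state h_t,
% updated linearly once per token.
h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t
% T update steps for a length-T sequence (linear time), and \dim(h_t) is
% constant, unlike a KV cache that grows with t.
```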
Secondly, the dimensionality of this latent state h is constant: unlike the KV cache, which grows as your sequence gets longer, the state h stays the same size. I'll come back to that later. One downside of S4 is that, while it was pretty good at language modeling, certain things like selectively copying parts of the input and reasoning about things in context still were not very good.

That's where Mamba comes in (and my clicker is not working... there we go). Mamba solved this problem with selective copying and in-context reasoning by making these A, B, C matrices dependent on the time step, so they can vary at each iteration of the recursion. Unfortunately, despite being very good at those tasks, Mamba was still slower than attention for moderate sequence lengths, because the algorithm doesn't map nicely onto matrix multiplications, so it can't use the tensor cores that are so prevalent on modern GPUs like Hopper and Blackwell.

The Mamba-2 paper, which came about a year later, solved that problem. They found a way of introducing structure into the matrix A that allows you to rewrite the whole algorithm as one big matrix transformation from the input sequence to the output sequence, which lets you use all the tensor cores very efficiently. And what's super interesting in that paper is that they proved a connection between this form of SSM and something called linear attention, which brings in a whole other subfield of models.

Linear attention is described in a 2020 paper, and the idea there is somewhat different: they approximate the softmax in a way that lets you fold it into feature maps, so you end up with a linear attention mechanism.
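For reference, the usual linear-attention approximation from that line of work (my notation, not the slide's):

```latex
% Replace the softmax similarity with a feature map \phi so that the key/value
% products can be accumulated into a fixed-size running state.
\mathrm{Attn}(Q,K,V)_t \;=\; \frac{\sum_{s\le t}\mathrm{sim}(q_t,k_s)\,v_s}{\sum_{s\le t}\mathrm{sim}(q_t,k_s)}
\;\approx\; \frac{\phi(q_t)^{\top}\big(\sum_{s\le t}\phi(k_s)\,v_s^{\top}\big)}{\phi(q_t)^{\top}\sum_{s\le t}\phi(k_s)}
% The running sums over \phi(k_s)v_s^{\top} and \phi(k_s) form the constant-size
% recurrent state, so decoding costs O(1) per token.
```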
This idea spawned tons and tons of variants. I'm not going to list them all, but one relatively recent one which has been extremely impactful is Gated DeltaNet. It's again a linear attention variant, which the authors show can actually outperform Mamba-2 in terms of quality on downstream tasks, by better learning associations between the key and value tensors.

So that's the history. This is an atlas of the hybrid models we support in vLLM today. You can see there are quite a lot, and they're diverse, so they don't all do the same thing. One thing you will notice is that they all still use attention, so attention is still important; we don't see it going away right now. But in addition to attention they all use different Mamba or linear attention approaches. You see a cluster, including Granite 4 from IBM as well as Nemotron Nano from NVIDIA, that uses attention and Mamba-2; Mamba-2 is very widely used. And towards the bottom you see models like Qwen3-Next and Kimi Linear, relatively newer models that use linear attention, like Gated DeltaNet or Kimi Delta Attention, which is a variation on the gated delta net. Finally, a recent trend is that these models combine this hybrid idea of mixing attention and Mamba with MoE. So those are the three key components we see emerging in these hybrid models.
Cool. As I said, these models are diverse, so they don't all have the same architecture, but there are some commonalities between them, particularly in the way we need to manage the state. For attention models, vLLM manages the KV cache in blocks: a block is a contiguous region of GPU memory with enough space to store the KV cache, for all layers of the model, for a small number of tokens, say 16. For a representative example, a block corresponds to around 64 kilobytes of GPU memory. And as we generate new tokens with an attention-based model, we have to concatenate to the KV cache, so as we generate more and more tokens, we keep appending blocks and creating more and more KV cache.

Mamba and linear attention models, on the other hand, work differently. The state is much bigger: rather than appending to the KV cache each time you generate a token, you have a single Mamba state which you update in place. And what's really important to note is that this state is huge compared to a block of KV cache: 64 kilobytes for a KV cache block versus 2.57 megabytes for the Mamba state. So for a single block, the KV cache is about 40 times smaller, but when you go to long sequences like 128K, the KV cache becomes about 200 times larger. This is why people are excited about these models and where they can deliver massively higher throughput for large batch sizes.
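A quick back-of-the-envelope check of those numbers, using the representative sizes quoted in the talk:

```python
# KV-cache blocks grow with the sequence; the Mamba state stays constant.
KV_BLOCK_BYTES = 64 * 1024          # one KV-cache block (~16 tokens), ~64 KB
TOKENS_PER_BLOCK = 16
MAMBA_STATE_BYTES = 2.57 * 1024**2  # one Mamba state, ~2.57 MB, fixed size

# Short sequences: a single KV block is ~40x smaller than the Mamba state.
print(MAMBA_STATE_BYTES / KV_BLOCK_BYTES)                  # ~41

# Long sequences (128K tokens): the KV cache is now ~200x larger than the state.
kv_cache_128k = (128 * 1024 / TOKENS_PER_BLOCK) * KV_BLOCK_BYTES
print(kv_cache_128k / MAMBA_STATE_BYTES)                   # ~200
```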
So how did we first support these models in vLLM? We basically hacked the support into V0 in the following way. On the left-hand side you see how vLLM manages the attention blocks: we have one KV cache tensor for each attention layer in the model, and interleaved across those tensors are the attention blocks. The number of blocks N in this figure is something vLLM determines automatically: when we start the inference server, vLLM loads the model, does some forward passes, sees how much memory is left over, and calculates how many blocks it can fit. That's a super nice feature of vLLM. The Mamba state, on the other hand, as you can see in the figure on the right, was in V0 simply allocated for the maximum batch size the server can support.

What's the problem with that? The maximum batch size is something the user chooses, and we had a lot of problems with users setting it wrong. If it's too high, you get CUDA out-of-memory errors; if it's too small, you're not fully utilizing the GPU. So it's a bad user experience, and we wanted to automate that choice.

The goal of this effort was therefore to unify how we manage the KV cache with how we manage the Mamba state. This is partly for elegance, we want a nice, clean way of doing it, but it also allows us to integrate hybrid models properly with vLLM V1 and benefit from things like prefix caching, KV cache transfer, and the P/D disaggregation we just heard about, plus other cool stuff from V1 like the torch.compile integration, better scheduling, and so on. V1 is great; we want to support hybrid models there.
Before I tell you how we unified the Mamba state with the KV cache, I need to explain how vLLM supports a different kind of hybrid model. "Hybrid model" is a bit of an overloaded term; we can also use it to refer to models like GPT-OSS, which mix full attention with sliding-window attention layers. How we handle that in vLLM V1 is shown in this figure, for the example of Google's Gemma 3 1B (instruction-tuned), which has four full-attention layers and 22 sliding-window attention layers. What we do is form what we call KV cache groups, where each group contains layers of the same type. In this example we have one group with the four full-attention layers, and every other group is a sliding-window attention group, with a bit of padding if necessary, because we want all groups to be the same size. How we store the state is shown on the right: we have one KV cache tensor for each element in the group, so four in this example, and the different KV cache groups actually share those tensors. So we have attention blocks and sliding-window attention blocks sharing the same data structure, which is really great because it allows very simple memory management: we can mix attention blocks and sliding-window attention blocks interchangeably, provided the page size, i.e. the size of a block in GPU memory, is the same for all of the KV cache groups.

Maybe you already start to see the problem for Mamba: the Mamba state is huge compared to the attention state, 2.6 megabytes versus 64 kilobytes. So how do we solve that? It may seem like black magic, but I'll walk you through it; it's actually very simple. Firstly, we relax the constraint that the block size for attention and Mamba has to be the same. Secondly, we take the attention layers and increase the block size hugely, from 16 to hundreds or thousands of tokens, to ensure that the attention page size is at least as large as the Mamba page size, and we can do a bit of padding on the Mamba side; I'll show you an example of this. Finally, and this is super important, we have to ensure that the views into the KV cache tensor are compatible, and I'll show you an example of that shortly.
Okay, so this shows how we align the pages, for a real example: Nemotron Nano from NVIDIA. At the top you see the attention KV cache group, which has block size 16 and page size 64K. All the other groups are Mamba groups; their block size is set to the maximum model length, in this case 128K, but the page size is 2.6 megabytes. That's the problem we have to solve. So first we simply select a block size for attention that is a multiple of 16 such that the attention page size is bigger than the Mamba page size, and then we do a bit of padding. It sounds complicated, but it's really simple. You might already be saying: block size 672? That sounds really weird, that's not going to perform well. I'll come back to that in a moment.
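A small sketch of that block-size selection (my own illustration, not the exact vLLM code); with the representative numbers from the talk it lands exactly on the 672 shown on the slide:

```python
import math

def align_block_size(attn_bytes_per_token: int, mamba_page_bytes: int, base: int = 16):
    # Smallest attention block size that is a multiple of 16 and whose page
    # covers the Mamba page; the Mamba page is then padded up to match.
    tokens_needed = math.ceil(mamba_page_bytes / attn_bytes_per_token)
    block_size = math.ceil(tokens_needed / base) * base
    padded_mamba_page = block_size * attn_bytes_per_token
    return block_size, padded_mamba_page

# 64 KB per 16-token block -> 4 KB per token; Mamba page ~2.6 MB.
print(align_block_size(4 * 1024, int(2.6 * 1024**2)))  # (672, ...)
```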
Regarding how we handle the striding: this shows how it worked in our first attempt, and it was horrible. It gave really bad bugs, and I spent a long time debugging them. The two KV cache groups share the same view into a KV cache tensor. On the left you see the way the FlashInfer attention backend looks into that tensor: key 0, value 0, key 1, value 1, laid out block by block. The Mamba view, on the other hand (Mamba has two sub-states, the conv state and the SSM state), is laid out differently: all the conv states for all the blocks, followed by all the SSM states for all the blocks. What happens is that if you write to Mamba state zero, you write data into that tensor in a way that doesn't align with what attention expects, and you end up with completely corrupted data and really crazy eval results. The simple solution is to change the striding, and we do a lot of tricks like this to make all the different attention backends line up: we have to change the strides to ensure that the views are compatible and that writing to one block means the same thing across the different views.

Cool. Those are the main points I wanted to make to give you an idea of how vLLM unifies state management for hybrid models, but there was also a bunch of other work needed. We had to make a lot of changes in the modeling code. And, as I think Michael mentioned at the beginning, CUDA graphs are super important for hybrid models, because these Mamba and linear attention mechanisms are implemented using tons of Triton kernels, and those kernels have a high launch overhead. You see overhead on the CPU coming from launching lots of different kernels, and this is why piecewise CUDA graphs don't really give us enough performance and we have to use full CUDA graphs. Finally, prefix caching is something we landed recently; it's a bit experimental, so please try it and let us know if you have any problems.
Finally, I mentioned the weird block size, right? Block size 672. Some attention backends, like FlashAttention or FlashInfer, support this just fine; others do not. In particular, when you're running on Blackwell, we want to use the TRT-LLM kernels to get the best performance out of the GPUs, and those kernels only support block sizes 16, 32, and 64; you can't just use some arbitrary value. So we've implemented something recently where we can decouple the kernel block size from the block size used by the KV cache manager. This is super important if you want to use hybrid models on Blackwell.

Okay, cool. To close, I want to show you some benchmark results. For this we use a Granite 4 model, a hybrid model that combines attention, Mamba-2, and MoE. It has 7 billion parameters, and 1 billion of those are active at any given time. We use vLLM's built-in benchmarking tool, vllm bench serve, with random data, so prefix caching is not involved in this benchmark. We use quite long prompts, 32K tokens in and 128 tokens out, and we sweep the concurrency from one to four up to 16. We used the last version of vLLM that supported both V0 and V1; V0 has now been stripped out, so with the latest vLLM, V1 is what you have to use. But as you'll see, that's great. (Wait, how do I go back?)
Cool. So, thanks to the Mistral folks for explaining what TTFT and ITL are; I'm going to use the same metrics here. You see three panes: on the left, concurrency against TTFT; in the middle, concurrency against ITL; and on the right, concurrency against throughput on the y-axis. You see three colors: orange is V0, green is V1 with piecewise CUDA graphs, and blue is V1 with "full and piecewise", which means we use piecewise CUDA graphs for batches that include prefill requests and full CUDA graphs for decode-only batches. What you see is that for TTFT, V1 is consistently better; this comes from the torch.compile integration. For ITL, if we only use piecewise graphs, V1 is actually worse at low concurrency; this comes from what I said about the overhead of launching Triton kernels, which matters especially when the latency is very low. But when we use full-and-piecewise, we recover that gap and the ITL is consistently better in V1 than in V0. Similarly for throughput, we see 31 to 91% higher throughput when using full-and-piecewise CUDA graphs in V1 compared to V0.

Finally, I just want to say it's great to see so many local fans of vLLM and people interested in this work. I'm based locally, in our research lab in Rüschlikon just down by the lake. We work a lot on LLM inference and also on other topics like model architectures, and we're working on supporting IBM's own accelerator, Spyre, in vLLM. If you're interested in this stuff, please come and find me after the talk; I'm happy to chat. Thanks.
All right. I'm Tyler Michael Smith, and this is the last talk, which will hopefully be pleasant news if, like me, you are very hungry right now. I'm going to be talking about llm-d, a distributed inference framework that we have built around vLLM.

The way we think about the inference stack is that vLLM is the inference server that connects the models at the top to the accelerators at the bottom, and then we've got llm-d, which is the Kubernetes-based distributed inference framework we use to orchestrate it and add a lot of distributed optimizations. It's really about optimizing inference performance and improving SLO targets.

So the question is: why llm-d? What do we have to do that's special? Why can't we just use Kubernetes as is, with its inference gateways? The big difference between LLM requests and typical HTTP requests is that HTTP requests are cheap, uniform, fast, and often stateless, whereas LLM requests are slow, you can't predict beforehand how long they're going to take, they're non-uniform, very expensive, and stateful, since we have to manage the KV caches. They're orders of magnitude more expensive per request.

Okay. So, like I said, llm-d is really about bringing together vLLM and the Kubernetes community. This is an architecture diagram; just to trace a request: a request comes in and hits the inference gateway, which makes a gRPC call to the inference scheduler, the endpoint picker that decides which vLLM pods the request should be routed to. That decision comes back to the inference gateway, which then sends the request to the vLLM pods, and there may be prefill/decode disaggregation involved here, like you heard during one of the Mistral talks.
In the llm-d community we have these four "well-lit paths". These aren't productized solutions; they are really ways of showing people, companies, and users a path to a really good, opinionated deployment in a specific case, and they basically highlight the features we have in the project. The first is intelligent inference scheduling, and I'll walk through all of them.

For intelligent inference scheduling, as I said, we have this endpoint picker that decides which vLLM instances a request should get routed to, and we have a couple of different routing algorithms in there. One is load-aware routing: every 200 milliseconds the endpoint picker scrapes the metrics of all the vLLM pods, so it knows exactly how many requests are in flight, for instance, and can use that to pick the pods that have the least load. That's good in essentially all cases.
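A purely illustrative least-load scorer in the spirit of the endpoint picker (the real llm-d scheduler is a pluggable component; the names here are mine):

```python
# Pick the vLLM pod with the fewest in-flight plus queued requests, based on
# metrics scraped periodically from each pod.
from dataclasses import dataclass

@dataclass
class PodMetrics:
    name: str
    num_requests_running: int
    num_requests_waiting: int

def pick_least_loaded(pods: list[PodMetrics]) -> str:
    return min(pods, key=lambda p: p.num_requests_running + p.num_requests_waiting).name
```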
We also have prefix-aware routing, and there are a couple of flavors of this. First, the endpoint picker can keep track of where it has routed requests in the past. We take advantage of a feature in vLLM called automatic prefix caching: if you're processing the same prompt prefix multiple times, the KV cache has already been computed, so it doesn't have to be processed again on subsequent requests. The endpoint picker can exploit this and route to the same pods that it thinks already have those prompt prefixes processed. We also have precise KV scheduling, built on a KV events API where the vLLM pods report back to the endpoint picker when they add or evict blocks from their KV caches, so the endpoint picker knows really precisely what's in cache. This is really good for multi-turn conversations, like talking to a chatbot in Slack, and it can be really good for agentic workflows as well. And it's all very pluggable: we have a lot of different scorers, filters, and request control.

Here's a performance chart comparing round-robin scoring to KV-cache- and load-aware routing with a KV utilization score. You can see we're reporting, in some cases, up to a 50x improvement in TTFT, mostly from taking advantage of prefix-aware routing.

Okay, the second well-lit path is prefill/decode disaggregation. You heard a lot about this in the Mistral talk, so I won't go too deep. In llm-d we're using NIXL, the exact same code that the Mistral AI folks are using. We have a sidecar on the decode pod: the gateway routes first to the sidecar, which routes to the prefill pod, which does the prompt processing and sends the request back to the sidecar, which then gets routed to the decode pod. When that happens, the decode pulls the KV caches from the prefill pod using NIXL.
We use GPU Direct RDMA via UCX, via the NIXL integration, which uses the KV connector API in vLLM. This is asynchronous, zero-copy, and has zero memory overhead: it pulls directly from the prefiller's KV cache and inserts directly into the decoder's KV cache, without any additional buffer as workspace.

One of the primary advantages of prefill/decode disaggregation is that it gives you specialization in how you parallelize prefill versus decode. For example, you can do tensor-parallel size 4 decodes with tensor-parallel size 1 prefills, and as the nature of your workload changes, i.e. the ratio of input tokens to output tokens, you can vary the number of prefill and decode workers in the cluster.

As you can see here, this chart compares, on the y-axis, the aggregate throughput of the system and, on the x-axis, each individual user's output speed, so it's really a throughput/latency trade-off. We see really good performance, especially at middling request rates, for the green line, which is disaggregated inference using four vLLM instances for the prefiller at TP2 and only two instances of the decoder at TP4, i.e. specializing the parallelization of the prefiller versus the decoder. That's compared to the orange line, which is four aggregated instances of vLLM running at tensor-parallel size 4. One thing about this is that you do have to tune the prefill and decode sizes to your workload.
Okay, the third well-lit path is KV cache management. This is about letting your KV cache state grow larger and larger to really take advantage of prefix caching, and it also uses the KV connector API. We have "north-south" KV management, which is about offloading to CPU memory and then to storage, effectively increasing the size of your KV cache on a single node. And we have "east-west" KV management, which is really a distributed KV cache: one pod can pull a KV cache from another pod. We have a couple of different integrations for this: one is LMCache, and Dynamo KVBM works here as well.

Right, and the last well-lit path I'll talk about is wide expert parallelism. This is about large-scale multi-node serving for sparse models: a single instance of vLLM spanning multiple nodes, basically to maximize throughput for large models. So first of all, what is an MoE model? Thomas mentioned these in his talk as well. Here's a diagram of an MoE model.
All transformers are organized as attention, feed-forward, attention, feed-forward. In an MoE model, the output coming out of attention gets fed into a router. The router selects the top-K weights and the top-K expert IDs for every token; each token then gets routed to the experts corresponding to its top-K IDs; each expert is a matrix multiplication; and the results get combined together according to their top-K weights. This is a form of activation sparsity: each activation only takes into account the K experts it gets routed to, so each token only hits, say, 8 out of 256 experts (a minimal routing sketch follows after the list of models below). What we've seen over the past year or so is that essentially every large model introduced in open source has been an MoE model. This includes GPT-OSS from OpenAI, DeepSeek V3 and R1, Llama 4 from Meta, Qwen3 from Qwen, and Kimi K2, including the thinking model that just came out today. These have hundreds of experts, and each token only activates a handful per forward pass.
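A minimal, illustrative top-K MoE layer (the dense per-expert loop is for clarity only; real deployments use fused kernels and, at scale, the expert-parallel dispatch/combine described next):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                        # x: [tokens, d_model]
        logits = self.router(x)                  # [tokens, n_experts]
        weights, ids = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # top-K weights per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):              # send each token to its k-th expert
            for e, expert in enumerate(self.experts):
                mask = ids[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

y = TinyMoE()(torch.randn(16, 64))  # 16 tokens, each touching only 2 of 8 experts
```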
This is an ablation study by DeepSeek, basically showing that as they activated fewer experts and added more, smaller experts, their models did better and better. So we've gone from something like Mixtral, which had eight experts, to something like DeepSeek V3, which has 256 experts with each token activating eight. This has introduced a bunch of challenges for how to serve these models. For the less sparse MoEs you can just use tensor parallelism, which works pretty well; you can use a simple fused MoE kernel, and expert imbalance isn't too problematic. But for the sparse ones, TP sharding becomes pretty problematic, and scaling out to multi-node becomes essential because they're huge. However, we can use sparse communication ops thanks to the activation sparsity of the router. And because of the high sparsity, we really want high concurrency to get high arithmetic intensity in our expert operations. That's kind of good and kind of bad: good because the model can handle high concurrency, bad because you really need to drive it to high concurrency. And then you do have to handle load imbalance across the experts, because there are so many of them and because you may be running a very high-concurrency, multi-node setup.
In response to all this, we've introduced a bunch of optimizations into vLLM. We've gone from tensor-parallel layers to data-parallel attention, especially for DeepSeek, where tensor parallelism would replicate the KV cache, combined with expert parallelism. We're using the sparse all-to-all dispatch and combine operations from DeepEP from DeepSeek, and there are some kernels from Perplexity as well that we've integrated into vLLM. Coming out of the router, you take the top-K IDs for every token and pass them to a dispatch all-to-all operation, which looks at the top-K IDs and sends each token to the node that holds the experts that token needs to attend to. This is fundamentally what lets us scale to multi-node. We've introduced expert-parallel load balancing, where we replicate experts to take the heavy hitters and spread them across the cluster. And we've also introduced dual batch overlap: these sparse all-to-alls are very expensive, and we can execute them at the same time as the computation ops.

Okay, so this is EPLB: like I said, we replicate heavily used experts, and we periodically rebalance the system according to the token distribution. Depending on the distribution of the input tokens and the tokens generated, different experts may be attended to more than others, so we can dynamically rebalance the experts to handle this. You can see that in the graph here: the x-axis is time and the y-axis is balancedness, which is the mean number of tokens per expert divided by the maximum number of tokens for any expert. As you can see, it increases quite a bit after the rebalancing steps, marked with the red dashed lines.
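The balancedness metric as described, in one line:

```python
# Mean tokens per expert over max tokens per expert (1.0 = perfectly balanced load).
def balancedness(tokens_per_expert: list[int]) -> float:
    return sum(tokens_per_expert) / len(tokens_per_expert) / max(tokens_per_expert)
```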
Uh and then the another key optimization
we added is dual batch overlap. So we
take the batch that we're currently
processing. We're we split it into two
and then while one batch is executing
the uh sparse all to alls another batch
is executing the you know computation
ops like the experts and attention.
Okay. So putting it all together, here are some performance numbers we ran on a cluster, focusing on decode first. Expert parallel size is the number of GPUs we're using, and we go from 32 up to 96, which is 12 nodes of eight H200s used for decode. We have a fixed workload of 256 concurrent requests per expert-parallel rank. What we see is that as we increase the size of the deployment we lose a little bit of performance; it scales almost linearly, so performance per GPU is almost constant. But what we get out of it is a super-linear increase in KV cache size: as we add more GPUs to the system, the more concurrency per GPU we can handle.
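Roughly why the KV cache budget can grow super-linearly per GPU (this explanation and the numbers below are an illustrative assumption, not the actual memory breakdown of the deployment above): the model weights are sharded across all the expert-parallel ranks, so every extra GPU both adds its own HBM and shrinks each GPU's share of the weights, and whatever is left over goes to KV cache.

```python
# Illustrative only: per-GPU KV-cache budget as the deployment grows.
# Hypothetical numbers, not the real memory breakdown of the cluster above.
hbm_per_gpu_gb = 141          # an H200-class GPU
total_weights_gb = 1300       # hypothetical sharded model + expert weights
activation_overhead_gb = 20   # hypothetical per-GPU activations/workspace

for num_gpus in (32, 64, 96):
    weights_per_gpu = total_weights_gb / num_gpus
    kv_budget_per_gpu = hbm_per_gpu_gb - weights_per_gpu - activation_overhead_gb
    total_kv_gb = kv_budget_per_gpu * num_gpus
    print(f"{num_gpus:3d} GPUs: {kv_budget_per_gpu:6.1f} GB KV/GPU, "
          f"{total_kv_gb:7.0f} GB KV total")

# The per-GPU KV budget grows as each GPU's weight shard shrinks, so the total
# KV cache (and the sustainable concurrency) grows faster than linearly.
```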
And then on the left we have the prefill throughput. In this case we see that the expert parallel size that works best for the prefiller is EP16. So what we do is run one very large instance for the decoder and a few smaller instances for the prefiller. The phenomenon where we really want to maximize KV cache space for the decoder just isn't true for the prefiller, because we don't have that many requests concurrently in flight and they actually have shorter sequence lengths, since we haven't generated with them yet. And so we're getting about 2 to 2.2K output tokens per second per GPU, and all of the kernel optimizations we've added, plus fast EPLB and dual batch overlap in decode, are really delivering compound gains,
right Sasha?
>> All right.
>> Oh, loud. Thank you so much, everyone. We have another first at this vLLM meetup: this is the first time we've ever finished on time, and I promised you that. So, two minutes early. Thank you to our speakers, you guys crushed it; a lot of work obviously goes into this. Two very quick things. I promised you this would get very technical. If you enjoy this type of content, Michael, our first speaker, and myself host bi-weekly vLLM office hours. We call them office hours because the initial idea was to have you come ask questions, but we've turned them into much more than that. Yes, you can still come and ask questions, but every other Thursday we have amazing speakers from Mistral, Hugging Face, Nvidia, Meta, the list goes on, and we dig into some awesome topics. So check them out: you can scan the QR code and come join us. I know it's at 8:00 at night for you guys; that's the unfortunate part. As a fellow European, I know we eat dinner late, so if you want to join us over dinner, that's awesome. But honestly, seriously, if there's somebody in this community who wants to help us bring these office hours to a more appropriate time zone or a more appropriate time in Europe, talk to me, please. And then one last thing: I promised you a survey. Please, if you can scan this QR code, it's super quick. You can just give us five stars if you want and click submit; that would be appreciated. No, I'm just kidding. But thank you guys so much for coming. I think we're going to skip the Q&A here, but all of us are going to be outside, so if you have any questions, if you want to continue this discussion, find us outside. And now let's have some awesome food and awesome drinks. Thank you IBM for the venue and
Welcome to the first official vLLM Meetup in Europe — streamed live from Zürich and hosted by Red Hat, IBM, and Mistral AI! We're bringing the vLLM community together for an evening of technical deep dives, demos, and conversations with the engineers driving the future of open-source inference. Whether you're building on vLLM today or exploring high-performance inference for enterprise and research — join us virtually from anywhere in the world to learn, ask questions, and connect with the community.

🧠 What to Expect
✅ Hear directly from vLLM maintainers and contributors
✅ Deep dives into quantization, hybrid models, distributed inference
✅ Mistral × vLLM integration insights
✅ Real-world demos + roadmap updates
✅ Live Q&A with the vLLM community

📚 Agenda (Session Lengths)
- Welcome & Opening Remarks (10 mins)
- Intro to vLLM + Project Update (~20 mins)
- Beginner → Advanced Quantization in vLLM (~30 mins)
- Mistral & vLLM (~30 mins)
- Hybrid Models as First-Class Citizens in vLLM (~20 mins)
- Distributed Inference with vLLM & llm-d (~30 mins)
- Live Q&A + Community Hangout (~30 mins)

Want to Join Us in Zürich? In-person details & registration: https://luma.com/0gls27kb
Join Our Bi-weekly vLLM Office Hours: https://red.ht/office-hours
Contribute to vLLM on GitHub: https://github.com/vllm-project/vllm