I'm quiet.
All right. Maybe we don't start yet.
Addie.
How about now? Better. All right. Now I can hear myself twice, too.
All right, I think we are actually live on stream. So, hi everyone. My name is Sasha Zelanovich and I work at Red Hat. I'm extremely happy that everybody's here. I've been organizing these meetups, obviously with a larger team, for the whole year. This is probably our 20th meetup this year, and it just makes me so happy when people actually show up. So thank you, thank you for being here. As I said, I work at Red Hat, in the AI team there. And Red Hat loves vLLM. We're the leading commercial contributor to the project; about 25% of all the contributions to vLLM come from the Red Hat team. We also have about 10 core committers on staff. So we're all in on vLLM, and you're really in for a treat here. We're really excited about that.
There are a few firsts here. This is our first official European meetup, so I'm really happy that it happened here in Zurich; thanks for that. This is also our first live stream outside of China. We've been live streaming the meetups in China; 41,000 people showed up on the WeChat stream last Saturday. I don't think we'll break that number, but we'll be happy if there are, you know, 41, hopefully. We'll know soon; I'll let you guys know at the end how many people are tuning in. So it's great to see that the vLLM community is strong here. All of the 20-ish meetups that we've organized so far have been completely packed, there's been a wait list, and it's great to see that Zurich is exactly the same. We're happy we chose Zurich for our first one in Europe. So, I
just want to say a few thank yous.
Obviously you guys and the vLLM community. Thank you to Mistral for sponsoring this event. Thank you to IBM. I'm not going to name individual people, because a lot of work went into this and that would just take forever, but thank you to everybody who was involved in putting this event together.
So let me just show you the agenda quickly. We have a packed agenda and we really did go all out here: we have vLLM maintainers, vLLM contributors, committers, award-winning professors. It's a really, really packed agenda. We'll give you an intro to vLLM, and then for the power users of vLLM we'll share a quick project update. We'll talk about quantization, which is a way to take these massive LLMs and make them more efficient and faster, using fewer GPUs while still delivering good performance. We'll talk about Mistral AI's work with vLLM (thanks, guys, for being here). We'll also talk about hybrid models, which are now first-class citizens in vLLM, and then go into distributed inference and talk about llm-d, which is the orchestration layer above vLLM. So it's about to get technical, for sure, and I hope you enjoy today's meetup. Just one quick reminder: we'll have a survey at the end, a three-question survey, super quick, so please keep that in mind; we want any feedback you can give us. We obviously just want to improve and make this better every time. So again, thank you so much for being here. I'm going to pass it over to Philip from IBM to say a few words, and then we'll go into our agenda.
Thank you Sasha.
So welcome from IBM. My name is Philip; I'm the CTO for IBM Switzerland. I'm really happy and glad to host this first European vLLM meetup here in Zurich at our IBM location. As Sasha said, there are many people who helped, but one person, Sara, who is in the back, deserves a great round of applause, because everything that's up and running is thanks to Sara, who did a great job. Thank you. And therefore, thank you for being here. Without further ado, I'll hand over.
>> Yeah, thanks a lot. Hey everyone, glad to be here. My name is Michael Goin and I'm a lead maintainer on the vLLM project. I'm on the Red Hat inference engineering team; I'm a principal engineer there. We've been working on vLLM since late 2023 on all sorts of things. So I'll be giving you a quick overview of vLLM, what's recent, what's important about it, and also a bit of a preamble on why we think this space is so interesting and important to work on. So
first, at Red Hat we're obviously very much focused on open source software, and we think working on it with the community, with other companies, and with other research organizations is the way to win. And we think that making AI inference software more accessible, more efficient, and cheaper for organizations hosting their own models is the best way to democratize the use of AI and make it useful to everyone. As you probably know, the key turning point for our investment, and your investment, in this space was ChatGPT in 2022. But we, and many others, made a bet: back then the open source models weren't really usable and ChatGPT was much more usable, but believing in the spirit of open source, we kept seeing better and better open source models getting closer and closer in quality to the leading closed source models, which is where we are today. Just today we had Kimi K2 Thinking, a one-trillion-parameter open-source model that is competitive with GPT-5 and Claude Sonnet 4.5 on Humanity's Last Exam and on agentic coding tasks. So we're there: open source won.
And so, yeah, that's the power of open source: the green dots over there versus the red dots. The open source models have gotten closer and closer to the closed source models in terms of performance, matching and commoditizing the intelligence that the closed source companies pioneered.
The advantages of open source models should be obvious to you. You can host them yourself. You can control who has access to them, what data goes in, and what data goes out. You can fine-tune them for your applications. You can control them. You can have all the security you need. Many people obviously find this compelling, and that's why the open source community has grown to be so large for AI, and particularly for open source models. And the reason why we're here today, and why this is all centered around vLLM, is that vLLM is the key software abstraction layer and project that lives between the hardware and the applications and models. That's why, in terms of open source software, we think of it as the key Linux in this AI race. It is the important software abstraction layer for all the hardware and all the models, and we'll talk about why that's important and why vLLM turned out the way it has. So first, let's state vLLM's original goal: to be the fastest and easiest-to-use open-source LLM inference and serving engine. Pretty straightforward.
And the reason why vLLM has achieved that, or rather how it has achieved that, is by being really easy to install. It's just a pip install away, or a Docker pull away. And it's going to run on any common GPU you want to run on. Like I mentioned already, it works on all the different model architectures, hundreds of them, and all the major hardware accelerators. And it integrates with the open source community: many, many open source projects, hundreds of them if not thousands, build on top of vLLM as that LLM inference abstraction layer and build their special applications, whether it's reinforcement learning or agentic coding or document understanding and detection, all sorts of things built on top of "how are we going to efficiently do inference with large language models?" And as I mentioned, vLLM is easy to install, but it's also very easy to use. It's a Python-based library with a very simple Python LLM class for easily getting started: you put in whatever model you have on Hugging Face, immediately spin up and compile an LLM engine, and get a generate interface, a chat interface, and also multimodal data, all the ways you would expect to run inference on a model, but of course under the hood doing it very efficiently. Okay.
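For concreteness, here is a minimal sketch of that offline flow with the Python LLM class; the model name is just an example of a Hugging Face checkpoint you might plug in:

```python
# pip install vllm
from vllm import LLM, SamplingParams

# Any Hugging Face model ID works here; Qwen2.5-7B-Instruct is just an example.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain what vLLM does in one sentence."], params)
print(outputs[0].outputs[0].text)
```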
Second, and maybe most important, vLLM offers a very easy-to-use inference server: hosting a model as a server across whatever GPUs you're interested in. And importantly, once you have your server running on localhost:8000, you get a wide variety of endpoints, like the Chat Completions API, Completions, Embeddings, Responses, Rerank, all sorts of things, so that you can use vLLM as your replacement for the OpenAI API you're using, or the Anthropic API you're using, whatever closed source model API you're building your applications on. Not just the model you're using, but your application interface: vLLM can emulate that with any open source model that's capable enough to be used in your applications.
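As a quick sketch of that server flow, using the standard OpenAI Python client pointed at a local vLLM server (the model name is just an example; port 8000 is the default):

```python
# In another terminal: vllm serve meta-llama/Llama-3.1-8B-Instruct
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Say hello from vLLM."}],
)
print(resp.choices[0].message.content)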
Now I want to talk about why vLLM has gotten so popular. A really key investment that vLLM made is relying very heavily on PyTorch. PyTorch, as you know, is the most popular open-source ML framework. The reason vLLM is able to have so much flexibility, both in model definitions and hardware support, is through its integration with PyTorch as an abstraction layer: the hardware ultimately implements its most basic operations through PyTorch, and the open source model definitions are generally written in some form of PyTorch. Through very close collaboration with PyTorch as a project, vLLM regularly uses the latest features and projects out of the PyTorch ecosystem, is very much a power user of PyTorch, and enhances PyTorch through this collaboration.
This is part of the reason why vLLM was a founding member of the PyTorch Foundation, joining DeepSpeed, and most recently Ray, announced at PyTorch Conference two weeks ago. We see this as a really important movement to grow and cultivate the PyTorch ecosystem, and hopefully the love keeps growing more and more. There are even more examples of PyTorch and vLLM taking care of each other, finding issues, pull requests, and fixes in each other's repositories through our very close collaboration. So we love the PyTorch team.
Now, on to vLLM adoption. vLLM started in 2023 with the PagedAttention paper and has grown very steadily in GitHub stars, up to 60,000 most recently, with regular deployments of vLLM reaching 500,000 concurrent GPUs that we know of (it's probably more), and over 9,000 members in our developer Slack working on and asking questions about the project constantly. It's also turned into a very large-scale collaborative project as a result. It's not just popular, it's also well distributed, and many different contributors work on it. We're pretty steadily over 800 PRs a month, which is a lot to deal with, and over 1,700 unique contributors have landed code in the project, and we're rapidly increasing that number these days. As I already mentioned, these contributions are spread across not only Red Hat, IBM, and Mistral, but tons of other organizations, whether enterprise, research, or community. It's a really vibrant community as a result, and it's able to serve a lot of different use cases.
As I mentioned before, vLLM is able to offer the software layer that many of the different hardware platforms want to implement in order to support all the different models they want to run. These are all hardware accelerators that run with vLLM to some extent: various hardware directly in the vLLM repo, like AMD, Intel, Nvidia, CPUs, Google TPUs, and increasingly many more that work with vLLM as a hardware plug-in through a common interface we've defined. Any hardware accelerator can essentially make a plugin that loads into vLLM as needed and continue its own development through the fixed software abstractions that we laid out. Many more hardware accelerators are on the way, hopefully. It's great to see the excitement from the hardware community, and hopefully a lot more efficiency is gained for everyone interested in deploying efficiently.
So next, there's obviously broad model support. It's not only about performance and flexibility; it's about supporting the models that people want to run. We easily support 100-plus, maybe even 200-plus, different model architectures. Of course, all the common state-of-the-art models, but also small ones that get forgotten along the way, and there are power users using those. And of course, we have day-zero support for the key models on launch, often with the model definitions and features implemented and provided by the model vendors themselves, which is really great to see.
And of course, we're not only supporting text-only models; we also have very good support for multimodal models. Here we have the launches of Qwen3-VL and DeepSeek-OCR, which we were really proud to be launch partners for, offering day-zero support for the leading, most cutting-edge multimodal input models. And finally, we don't have to support every model directly in vLLM: through very close integration with Hugging Face Transformers, we were able to make the Transformers model definitions general enough that we can use them directly in vLLM, substituting in the key operations that vLLM needs to run efficiently, such as attention, fused MoE, and linear layers. We're increasingly expanding the flexibility of this, and we're already able to support dense text models, recently MoE models, and encoder-only models, so many of the sentence-transformers models run in vLLM through this. It's really cool to try to reduce the amount of duplicate model definitions and possible sources of truth, especially as many people do training through Hugging Face Transformers. And now with reinforcement learning, you want to make sure your training definition and your inference definition are lined up.
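As a sketch of how that looks in practice (the model name is just an example, and the `model_impl` switch for the Transformers backend is per recent vLLM docs, so treat it as an assumption if your version differs):

```python
from vllm import LLM

# Ask vLLM to run the model through the Hugging Face Transformers model definition
# rather than a native vLLM implementation, useful for architectures that vLLM
# doesn't implement natively yet.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", model_impl="transformers")
print(llm.generate(["Hello!"])[0].outputs[0].text)
```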
Another key thing that vLLM focuses on is running these models efficiently across a variety, or rather a number, of hardware accelerators. We support many different types of parallelism in order to efficiently serve your model across all the GPUs within a node, such as eight H100s, but also across multiple nodes of GPUs, where you have non-uniform interconnects: very fast memory and interconnects within a node, but less quick interconnects as you go across nodes. We support tensor parallelism, pipeline parallelism, mixing those types of parallelism together, expert parallelism and data parallelism, which are key for large mixture-of-experts models (covered a bit later), and disaggregating prefill and decode instances so you can more reliably serve your models with the latency SLOs you have, which will thankfully be covered later in this talk.
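For reference, a minimal sketch of how those parallelism knobs appear on the Python API (the model name and sizes are illustrative; expert/data parallelism and multi-node launches need additional configuration not shown here):

```python
from vllm import LLM

# Shard a large model across the 8 GPUs of a node with tensor parallelism,
# and across 2 pipeline stages with pipeline parallelism.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=8,
    pipeline_parallel_size=2,
)
```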
Of course, a really important and growing use case for vLLM is reinforcement learning. As I mentioned, this style of training actually relies very heavily on inference as well, in order to get rollouts and train the agentic models to be more agentic. So we're happy that vLLM is integrated and featured in all of these key reinforcement learning libraries, and in even more across the growing ecosystem there.
Next, it's not only about training models, of course; it's also about using the models once they're trained. vLLM is also trying to help the community define some sort of standard for the right agentic interface for running these increasingly complex models, which want to intersperse thinking, not thinking, and tool calling, and ultimately work as agents for a long time before they come back to you and need more context. So with the GPT-OSS launch, where we partnered with OpenAI, we support the Responses API, so you can run the same applications that use the Responses API on ChatGPT with vLLM and GPT-OSS, and we've had several coding-agent libraries build on top of that support in vLLM. Hopefully this becomes some sort of standard. But we also support the Messages API, which is what Anthropic uses and which Claude Code is built on, for instance. Most recently we landed support for this, so you can run vLLM serve with a Qwen3 model with tool calling enabled, point Claude Code at it, and run it completely locally, which is really cool. And as more agentic coding interfaces are built on top of the features we have in vLLM, and of course as more capable models come out, we're really interested to see how we can better optimize vLLM for the unique workload that agentic coding requires.
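As a sketch of what that looks like from the client side, here is an Anthropic-style request sent to a locally hosted vLLM server; treat the endpoint path, payload shape, and serve flags as assumptions to check against your vLLM version, and the model name is just an example:

```python
import requests

# Assumes a server started with something like:
#   vllm serve Qwen/Qwen3-8B --enable-auto-tool-choice --tool-call-parser hermes
resp = requests.post(
    "http://localhost:8000/v1/messages",   # Anthropic-compatible Messages endpoint
    json={
        "model": "Qwen/Qwen3-8B",
        "max_tokens": 256,
        "messages": [{"role": "user", "content": "Write a haiku about inference."}],
    },
)
print(resp.json())
```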
Another key area where vLLM has to focus is accuracy. vLLM is often treated as the reference implementation, and it needs to be correct, especially when models launch on day zero; first impressions make a strong impression. So we take accuracy very seriously and work with the model vendors directly to make sure that vLLM is accurate on the simple evals, does accurate tool calling and accurate long context, and we try to push these things along. We've also recently devoted a lot of work to supporting batch invariance, a form of making inference more deterministic regardless of the load on the server. That's also really important for reinforcement learning, where we want to make sure the inference code and training code match up regardless of load, so that training happens on-policy.
Secondly, we really care about performance and reliability. We have a live vLLM benchmark dashboard on the PyTorch dashboard, and it runs on basically every commit of vLLM, so we can quickly get warned whenever performance regressions happen on key models. This is really useful; shout-out to the PyTorch team for providing it. And yeah, we put our money where our mouth is: most recently we've spent over $100,000 a month on CI, so that developers working on vLLM, pushing their PRs and getting testing, can have very full testing happening for them and can focus on the important features they want to see landed in the project.
Now I'm going to cover some of the key focus areas we've had in vLLM over the past couple of months. One really key one, obviously, is torch.compile from the PyTorch side. Hopefully you're a bit familiar with torch.compile. It offers vLLM what it says on the tin: automatic kernel generation and compilation of PyTorch primitives, and we do use this for automatic fusion of the various PyTorch operations we use directly in vLLM. However, one of the more interesting things we like to use torch.compile for, and have improved through close co-development with the torch.compile team, is using it to give us a graph representation of the model and then implementing custom graph-level transformations. That way we can have custom operations that we author in CUDA or other languages, for attention, for sequence parallelism, for all-reduce + RMSNorm + quant, very complicated sequences, and substitute them inside the graph without needing to change or modify the model definition. We think it's really important to keep the model definitions as simple and pure as possible, hopefully written in just native PyTorch, so that the optimizations and fusions we work on in vLLM can be applied regardless of the specific model definition, purely in terms of the operations of the model and the order they happen in. This takes the onus and the load off the model vendors, who just get to upstream their hopefully simple model definitions, and as we write more general fusions, we see those speedups applied across dozens of models in parallel.
Next, another really key thing, which the talk from Thomas Parnell is going to cover a bit: we've invested heavily in natively supporting hybrid attention models, which we increasingly see as the future of efficient LLM inference. You see this recently with Gemma, Llama 4, Qwen3-Next, DeepSeek, a variety of recent model architectures. This just means we need to efficiently deal with each layer potentially having a different style of KV cache or state management, and we directly support this in vLLM through the hybrid KV cache coordinator, which does a lot of the work to coordinate these complex patterns in the leading models.
Next, another abstraction we developed in vLLM is the KV connector interface. This is key for things like disaggregated prefill/decode, which you'll hear about later, and also for offloading KV cache and doing general operations on the KV cache in whatever way you want to define. We use this in a very key way with NIXL and LMCache for disaggregated prefill/decode, but it's ultimately an extensible interface where you can plug in custom KV connectors based on your research or business needs.
Another really cool feature, which is actually a new type of parallelism, is decode context parallel, which was contributed by the Moonshot (Kimi) team. This is a key optimization for models that have a small number of KV heads, such as MLA for DeepSeek, which only has one: as you do tensor parallelism, you need to replicate the KV heads if you can't evenly divide across them, which means you have to replicate the KV cache. You end up with duplicate KV cache across all eight of the GPUs you're deploying DeepSeek on, which you can see here with the GPU KV cache size only being 600,000 tokens. But with decode context parallel, where we interleave the KV cache between the GPU ranks, we're able to 8x the KV cache and thus get much higher throughput and support longer-context workloads on the same number of GPUs, just by not replicating the KV cache.
Another really key area of development that we're really happy has landed recently: if you're familiar with the journey from vLLM V0 to V1, one of the key developments we mentioned as important was piecewise CUDA graphs, to allow for more flexibility in the model definitions and to allow for potentially complex operations like attention. However, piecewise CUDA graphs still carry a clear latency degradation. So we wanted to expand this into a flexible form of CUDA graphs where we have much more control: can we do piecewise CUDA graphs, or full CUDA graphs, or no CUDA graphs, or do it for decode-only batches or mixed batches? We now have this flexible CUDA graphs design, which we have a great docs page on. The key thing here is that full and piecewise CUDA graphs are now on by default, so you have the best of both worlds: flexibility when you need it, but low latency whenever you need it as well.
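As a sketch of where that knob lives, the compilation config exposes a CUDA graph mode; the exact key name and accepted values here are assumptions to check against the docs page mentioned above:

```python
from vllm import LLM

# Explicitly request the combined full + piecewise CUDA graph behavior
# (described above as the default in recent versions; shown only to
# illustrate where the setting lives).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    compilation_config={"cudagraph_mode": "FULL_AND_PIECEWISE"},
)
```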
Another key place we've been pushing recently, and one that has actually been a great collaboration point with the community, is something we call a model bash. Here's an example with a transformer block of DeepSeek R1: we open up and post an annotated profile to everyone, saying hey, we're spending this much time in this matrix multiplication, this much time in this operation, and we identify places where it would be great to have custom kernels or opportunities for fusion. Of course, we've been working on squashing and improving the performance here ourselves, but we've opened it up to the community, and we've seen a lot of people jump in, help out, and become great new contributors as a result of joining in on the fun of performance optimization.
So, to recap: vLLM takes timely, accurate, optimized model support very seriously, and the community relies on us for this. We focus on having a wide hardware support ecosystem and defining the right interfaces so that hardware vendors can implement just enough software to run a lot of models flexibly. We integrate and play very well with the open source community that builds on top of inference as an abstraction. We work very closely with the PyTorch team and community and believe that's the future for ML frameworks. And we're at the frontier, readily innovating on what's happening in inference systems research and on what real users see and need in production. So hopefully you now know a bit more about some of those things and what we're up to, and will be joining the community soon. I will hand it over to Eldar.
So, hello everyone. My name is Eldar Kurtić. I'm a principal research scientist at Red Hat, and the topic for today is going to be quantization in vLLM. Just so we're all on the same page: quantization is a process in which we compress a model by reducing the number of bits we use to represent either the weights, the activations, or both. If we imagine a model whose weights are represented in FP32, distributed across this Gaussian-like curve, we can see they can take any possible value; we have very high granularity. Through the process of quantization, we take all of these weights and map them into a set of discrete buckets, where each bucket corresponds to a specific value that we can represent in INT4. As you can see, we have much lower granularity in INT4. What happens through this process is that for some of the weights which are close to each other in FP32, we will not be able to represent their difference in INT4. This means we have to put them in the same bucket, and during this process we introduce some error, or quantization noise. The entire game in quantization is how we deal with and manage this noise so that we do not destroy the model, so that we end up with a model which is still usable, but compressed.
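To make that concrete, here is a tiny NumPy sketch of the round trip (symmetric, per-tensor INT4 is just the simplest illustrative choice; real schemes use per-group or per-channel scales):

```python
import numpy as np

def fake_quantize_int4(w):
    """Quantize weights to the 16-level INT4 grid and back, returning the error."""
    scale = np.abs(w).max() / 7.0            # map the largest weight to +7 (INT4 range is [-8, 7])
    q = np.clip(np.round(w / scale), -8, 7)  # nearby FP32 values collapse into the same bucket
    w_hat = q * scale                        # dequantized weights actually used at inference
    return w_hat, w - w_hat                  # second output is the quantization noise

w = np.random.randn(4096).astype(np.float32)
w_hat, noise = fake_quantize_int4(w)
print(float(np.abs(noise).mean()))
```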
The first thing we usually talk about when we talk about quantization is the main question of whether there is such a thing as lossless quantization, because it's a lossy compression: we are reducing the precision of the model. With the whole advent of LLMs, the community became very interested in any technique that would let them run models more efficiently, on a lower number of GPUs and slightly cheaper. One unfortunate thing that happened during this period is that a lot of quantized models started appearing in the wild without proper validation, without properly calibrated quantization processes, and so on. We ended up in a situation where there was a lot of skepticism around the quality of quantized models in general. And this brings us to the first takeaway message about quantization: not all quantized models are created equal. During the quantization process there are a million different hyperparameters and knobs that should be tuned in order to get a good model in the end. If we have a quantized model which is not performing well, we should not blame quantization itself as the root cause; we should look at how the model was produced and whether a proper calibration process was done.
This prompted us to write a paper titled "Give Me BF16 or Give Me Death", which is relatively easy to read and should serve as a kind of data sheet where you can look up the accuracy/performance trade-offs you can expect from popular models under popular quantization schemes. Basically, we took models from 8 billion up to 400 billion parameters, we calibrated the quantization process for three popular quantization schemes supported in vLLM (FP8, INT8, and INT4), and then we ran a ton of evaluations. In total we ran more than a million evaluations; we basically touched every single open source eval that exists out there, starting from the Open LLM Leaderboard v1 and v2, reasoning evals, long context, Arena-Hard, coding, multimodal, and so on. What we found is that if the quantization process is tuned properly, for INT8 and FP8 we should always get models that are almost indistinguishable from the unquantized baselines, meaning in the range of 98-99% accuracy recovery. For INT4 we do see slightly higher drops, but the process should never completely destroy the accuracy of the model; it should still be a usable model in the end.
One snippet of the results is shown in this graph, where we look at reasoning performance, which represents an average pass@1 score across the popular reasoning benchmarks (AIME, MATH-500, and GPQA Diamond). We look at the DeepSeek R1 distill models, from both the Llama and Qwen families, across all of the available sizes, to see how quantization behaves at different scales. Then, following the recipes from the "Give Me BF16" paper, we quantize them to FP8, INT8, and INT4, where for FP8 and INT8 we do weight-and-activation quantization and for INT4 we do weight-only quantization. The gray bar here represents the unquantized BF16 baseline. As you can see, the blue bar, which represents FP8, and the green bar, which represents the INT8 model, are almost indistinguishable from BF16, meaning that if quantization is calibrated properly, you should usually expect a model with 98-99% accuracy recovery relative to the unquantized baseline. The last one, INT4, is a setup where we do see slightly higher drops, but the models are never destroyed in the sense of being unusable. Usually what we've seen, across all of the evals and all of the models, is that if we calibrate INT4 properly, we get accuracy recoveries in the range of 95% and above, which is still not terrible given that we're compressing the model weights by 4x.
A standard setup in academia and research papers is: take a baseline BF16 model, quantize it down, and then look at how well we recover the accuracy of the original unquantized checkpoint. That's a perfect and fair setup to compare different quantization algorithms, hyperparameters, and so on. However, in the real world there is usually a slightly different objective we face, which is more or less the question: what is the best accuracy I can get subject to the constraints of my deployment? That constraint is very often just the amount of GPU memory we have at our disposal to deploy a model. In a world without quantization, the best thing we can do is go to the Hugging Face Hub, see what's the largest model we can fit inside our GPU, take that model, and whatever accuracy that model has is the best accuracy we can get (putting aside additional fine-tuning and training, just taking models off the shelf). However, quantization offers an additional pathway: we can fit inside the same compute constraints but get much higher accuracy in the end. That's the process in which we go to a larger model, which originally does not fit inside our GPU, and quantize it down to a bit-width that does fit, and therefore leverage the accuracy of the larger model. Even if during this process we lose one, two, or 5% of accuracy depending on the quantization scheme, we should still end up in a much better place than with a smaller unquantized model. A perfect example here is Qwen 7B versus Qwen 14B: Qwen 7B in unquantized BF16 form takes about 14 GB of GPU memory for the weights, while Qwen 14B takes 28 GB. So in order to bring the 14B model down to the same GPU memory requirements as Qwen 7B, we can quantize it to FP8, INT8, or even lower like INT4, and benefit from its much higher baseline accuracy, basically going from 65 to 74, 73, or 72, whatever the target scheme gives. So the second takeaway message is: larger quantized is always better than smaller unquantized.
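The back-of-the-envelope arithmetic behind that example is just parameters times bytes per parameter; a quick sketch using the numbers from the talk:

```python
def weight_memory_gb(params_in_billions, bits_per_param):
    """Rough weight-only memory footprint, ignoring KV cache and activation memory."""
    return params_in_billions * bits_per_param / 8  # 1B params at 1 byte each ~= 1 GB

print(weight_memory_gb(7, 16))    # Qwen 7B  in BF16       -> ~14 GB
print(weight_memory_gb(14, 16))   # Qwen 14B in BF16       -> ~28 GB
print(weight_memory_gb(14, 8))    # Qwen 14B in FP8/INT8   -> ~14 GB, same budget as 7B BF16
print(weight_memory_gb(14, 4))    # Qwen 14B INT4 weights  -> ~7 GB
```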
These first two takeaway messages were mostly about accuracy, because that's the most obvious thing everyone asks: when we quantize a model it's a lossy process, we're reducing the bits, so the most obvious first question is what happens to accuracy. The second most important question is what happens to inference speed: given that we have a compressed model, is there anything we can do to accelerate inference, to improve either latency or throughput in our deployment? Here we are looking at a graph for a Llama 3.1 8B model deployed on a single A6000 GPU for a specific docstring-generation use case, and we're looking at how inter-token latency changes with respect to the number of queries per second we're sending to our server. We're looking at three different models: BF16, represented by the blue curve, which is the unquantized baseline; an INT8 weight-and-activation quantized model, represented by the yellow curve; and an INT4 weight-only quantized model, represented by the green curve. There are three interesting regions to observe in this plot.
The first is when we have fewer than four queries per second coming to our server. This is a setup in which we don't have large enough inputs to keep our GPUs busy doing matrix-matrix multiplications, so the main runtime cost we're paying is basically moving data around; computation is essentially free. To accelerate inference here we need to accelerate data movement, and to accelerate data movement we do weight-only quantization, because the weights are the data we need to move to the tensor cores to do the computation there. So we quantize the 16-bit weights down to four bits to accelerate this specific part of the pipeline.
However, if we start increasing the number of queries coming to the server and hit a point (for this specific example, four queries per second and above), we enter the second regime, where we have large enough inputs to keep our GPUs busy doing matrix-matrix multiplications. Inputs are large enough to keep the tensor cores busy and to make the time spent reading the weights negligible relative to the time spent in the matmuls. This is the compute-bound regime, and here the best way to get speedups is to use faster tensor cores. To get faster tensor cores we need both operands of the matrix multiplication quantized, either to INT8 or FP8, because modern GPUs come with INT8 and FP8 tensor cores, and we get two times more FLOPs just by quantizing both operands. That's the second phase, where weight-and-activation quantization becomes a better choice for our deployment than INT4 weight-only quantization.
Then, if we push even more queries per second, we get to the third regime, where weight-only quantization becomes a worse choice than just deploying the unquantized model. This is because we're now heavily compute-bound: the time spent loading the weights is negligible compared to the time spent in the matmuls, and the overhead of unpacking the weights that were compressed to INT4 now starts hurting us. So the third takeaway message is that there are no golden bullets. You should understand your deployment by doing real-world benchmarking of the use cases you think are representative, figure out which of these three regimes you are in, and based on that pick the proper quantization scheme.
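A rough way to see why the crossover happens is to compare bytes moved against FLOPs for a single linear layer; a small illustrative sketch (not a benchmark, just the memory-bound versus compute-bound intuition):

```python
def linear_layer_cost(batch_tokens, d_in, d_out, bytes_per_weight):
    """Very rough cost model for out = x @ W with a d_in x d_out weight matrix."""
    flops = 2 * batch_tokens * d_in * d_out          # multiply-accumulates
    weight_bytes = d_in * d_out * bytes_per_weight   # dominant traffic at small batch
    return flops, weight_bytes

for batch in (1, 4, 64, 512):
    flops, moved = linear_layer_cost(batch, 4096, 4096, bytes_per_weight=2)
    # At small batch, FLOPs per weight byte is tiny -> memory-bound -> shrink the weights.
    # At large batch, FLOPs dominate -> compute-bound -> faster (INT8/FP8) tensor cores win.
    print(batch, round(flops / moved, 1), "FLOPs per weight byte")
```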
And at the end: all of the algorithms and tricks we've used to produce these high-quality quantized models have been implemented in LLM Compressor, so you can just go there and leverage predefined quantization schemes which have been tuned by our team. If you'd like to run these performance benchmarks to see which regime you're in, memory-bandwidth-bound or compute-bound, you can go to GuideLLM, which will enable you to automatically do some benchmarking and get graphs relatively similar to what I've shown here; then you can see which regime you're in and, based on that, figure out which quantization scheme to use. If you don't want to do any of this and just want high-quality, already-validated quantized models, you can go to the Red Hat AI Hugging Face hub: our team has already released a couple hundred quantized models there, and whenever there's a new model we actively release, within a matter of a few days, FP8, INT8, INT4, and FP4 checkpoints that have already been validated by our team to have at least 95% accuracy recovery and decent speedups in vLLM.
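For reference, a minimal sketch of producing an FP8 checkpoint with LLM Compressor, loosely following its documented one-shot flow; the exact import paths and scheme names may differ across versions, so treat this as an outline rather than the canonical recipe:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"   # example model
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 weights with dynamic activation scales for all Linear layers, keeping lm_head in BF16.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

save_dir = MODEL_ID.split("/")[-1] + "-FP8-Dynamic"
model.save_pretrained(save_dir)      # the saved checkpoint can be served directly with vLLM
tokenizer.save_pretrained(save_dir)
```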
One thing we became very proud of, because LLM Compressor is a relatively new project, is the fact that the Llama 4 team from Meta also adopted it. If you recall, for the Llama 3.1 series they released an FP8 model but used their own internal tools to quantize it; for Llama 4 they switched over and decided to use LLM Compressor to do the FP8 quantization. As for next steps: if quantization doesn't give you enough speedup for your specific use case, we recently launched a new project called Speculators, a library for training speculative decoding models. You can train your own models, or you can use Speculators to convert existing speculative decoding models to a format that vLLM understands, so you can take EAGLE-3, HASS, or whatever is state of the art right now, convert it, and run it in vLLM. We already have a couple of models open-sourced on Red Hat AI's Hugging Face hub, and we'll be adding more. And now I'm going to hand it over to Dan to cover new FP4 formats, which is quantization for modern GPUs.
All right, so thanks a lot for this introduction. My name is Dan Alistarh. I'm a professor at IST Austria and also a researcher at Red Hat AI, and in some sense my job is to really push the limits of what's possible to do accurately with respect to compressing these models. So today I'm going to tell you a bit about our research. In short, what Eldar described was essentially a picture in which we can do close-to-lossless quantization in FP8, or sometimes even INT8, for weights and activations, and that's great because, the way he expressed it, it essentially doubles your FLOPs. What we are really pushing now is whether we can go lower, and the next barrier we are seeing is essentially four-bit precision. So I'm going to give you a short overview of the state of the art, because there has been work here already. The key barrier we're hitting is the fact that modern LLMs have very large outlier values in their weights, but more importantly in their activations. To understand what an outlier is, you can look at the graph on your right: these are extremely large values, hundreds of times larger than the average value, and they essentially mess up quantization because they will either zero out everything that's close to them or need to be clipped, in which case we induce very high error.
we induce very high error. So the the
basic idea to kind of understand how to
get around this or this is the idea
that's been kind of floated in the
literature in various guises is
essentially to replace this kind of
linear layer like the wxransposed
uh where weights the weights are w and
the activations are x is bas basically
going to be replaced by a variant where
we premultiply each one of these
operands the w and x with a matrix and
this matrix is invertible. So
essentially we multiply with a matrix
itself on one hand. This is the matrix R
and then with its inverse on the other
hand. But then we apply quantization to
these two kind of operants
independently. Okay. So if we didn't
have any quantization R would just go
away and we would have like a perfect
recovery of the original output. So what
we're doing here is essentially we're
saying kind of quantization sort of
commutes with with u the matrix we're
applying. Therefore we should be fine.
And then the matrix R should have some
kind of smoothening property that kind
of um just takes away some of these
outliers. So it kind of makes things
easier to quantize. And then once you
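In symbols, the trick described above is (with $Q(\cdot)$ the quantizer and $R$ orthogonal, so $R^{-1}=R^{\top}$):

$$
y \;=\; W x^{\top} \;=\; (W R)\,\big(R^{-1} x^{\top}\big) \;\approx\; Q(W R)\; Q(x R)^{\top},
$$

so if $R$ spreads the outliers out before quantization, both factors become easier to quantize, while in the lossless limit the product is unchanged.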
Once you have this image in your head, it's very easy to understand pretty much all of the methods that have been proposed for weight-and-activation quantization, such as SmoothQuant and so on, because they're essentially just variants that use increasingly complex or fancy matrices to smooth their input. For instance, SmoothQuant just uses a diagonal matrix, for those of you who know the method. QuaRot, whose lead author I challenge you to find in the room after the talks, essentially uses Hadamard matrices, and then people from Meta proposed essentially learning these matrices, and so on. But the problem is that if we try to go directly all the way down to 4-bit precision, existing techniques essentially break. They're okay for academic applications, but, as Eldar would say, the model is destroyed: we're dropping more than 5%, sometimes even more than 10%, relative accuracy. The model is no longer Pareto-competitive; you're just better off taking a smaller model in terms of accuracy.
Okay, so the challenge we have been trying to solve is whether we can go further. Nvidia, who also understands this challenge, tried to address it by proposing a new set of formats called microscaling FP4 formats. Interestingly, they proposed two formats: one is designed directly by Nvidia and is called NVFP4, and the other is in some sense democratically designed by a consortium of hardware vendors, and you'll see in a minute how that ended up. The initial idea is common between the two formats: we want a different quantization grid. Whereas INT4 essentially has a uniform grid, these FP4 formats use an FP quantization grid which is finer towards zero and more relaxed towards the endpoints of the interval. So the grids are different from INT4; they're just truncations of a floating-point format, and the two formats share the same grid. The second thing is that we define a group size, or microscaling size, which means that a set of consecutive values, 16 for NVFP4 and 32 for MXFP4, are quantized together: all of these values share a single scale. This finer-grained quantization allows us to reduce the quantization error we incur whenever we have to bucketize these values. The problem is that this leaves us with a lot of scales: notice that if you have one scale value per 32 or 16 values, that's already a lot of storage for a model that has billions of parameters. So what they do is quantize the scales themselves, and this leads us to the second key difference between the two formats: NVFP4 chose fairly reasonable FP8 (E4M3) quantization for the scales, whereas MXFP4, the openly designed format, chose a somewhat unusual scale format, to put it nicely: they only quantize the scales to powers of two. We'll see what this leads to, but it's probably not a great idea.
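To illustrate the structure just described, here is a rough NumPy sketch of group-wise FP4 fake-quantization; it captures the shared-scale idea but not the exact hardware formats (for instance, real NVFP4 also quantizes the per-group scales to FP8 E4M3, which is omitted here):

```python
import numpy as np

# The 4-bit E2M1 value grid: finer near zero, coarser towards the endpoints.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1_GRID = np.unique(np.concatenate([-E2M1, E2M1]))

def fake_quantize_fp4(x, group_size=16, pow2_scales=False):
    """Quantize-dequantize x with one shared scale per group of `group_size` values."""
    g = x.reshape(-1, group_size)
    scales = np.abs(g).max(axis=1, keepdims=True) / 6.0   # map the group max to the FP4 max (6.0)
    scales = np.maximum(scales, 1e-12)
    if pow2_scales:                                        # MXFP4-style power-of-two scales
        scales = 2.0 ** np.ceil(np.log2(scales))
    idx = np.abs((g / scales)[..., None] - E2M1_GRID).argmin(axis=-1)
    return (E2M1_GRID[idx] * scales).reshape(x.shape)

x = np.random.randn(1024).astype(np.float32)
err_nv = np.abs(x - fake_quantize_fp4(x, 16, pow2_scales=False)).mean()
err_mx = np.abs(x - fake_quantize_fp4(x, 32, pow2_scales=True)).mean()
print(err_nv, err_mx)   # coarser power-of-two scales typically give higher error
```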
So we essentially have these two microscaling formats, with some small differences between them. Now, the reason we're actually doing this is that we're able to get much faster computational support for FP4 operations in this context. In particular, you can get a 4x speedup on a B200 and 6x on a B300 GPU according to the data sheets, and this is also something we've verified in practice: we have kernels and you can validate these numbers. You essentially maximize throughput at larger matrix multiplication sizes, but you are able to reach these numbers, and sometimes even slightly exceed them, in practice.
So now the question is: do these formats, the fact that we're quantizing weights and activations in smaller groups, actually solve everything? We'll see that, surprisingly, this is not quite the case. This is a snapshot of results we got by rerunning a subset of the million experiments that Eldar ran for the previous paper. We're looking at the numbers highlighted in red, which are the model accuracy recoveries when we use a version of INT4 that's microscaled to match MXFP4 (that's what you have in the first three rows), and then NVFP4 quantization for weights and activations, and MXFP4. We noticed a couple of interesting things. First, no format is truly lossless; this is not a silver bullet. Even with advanced techniques such as GPTQ or SpinQuant, we still drop about four to 5%, sometimes even more, even with this pretty advanced format. Now, if we rank the formats by their best performance, we see that NVFP4 is the best; INT4 is second by a very small margin, very close to the experimental noise; and MXFP4 is terrible. It turns out this format is terrible because it very aggressively quantizes the scales, so you get very high error from scale quantization. We also noticed that methods that include rotations, this idea of adding the R matrix that I described before, do not help NVFP4, but they help the INT4 and MXFP4 formats a lot. They do improve accuracy as predicted by their original papers, but there's something odd happening with NVFP4, which cannot leverage this rotation idea. And we noticed that some of the existing methods in the literature do improve results a bit, but not very significantly.
Okay, so this was the state of things where we started, and we were a bit puzzled, because it's really not so clear why these things happen. So we tried to do better, and we ended up with a new method that really improves the state of the art and is specialized for these formats. It's called Micro-Rotated GPTQ (MR-GPTQ); you can find the paper on arXiv and the code on GitHub. Essentially, the method combines two key ideas. The first is that instead of using very large rotation matrices, which is what was prescribed by previous work, we do micro-rotations: very small rotation matrices whose size matches the group size, so essentially we're just mixing values within the same group. This comes from an error analysis we did: the original weight and activation distributions are fairly spiky, and if you do very large rotations they become normally distributed, but it so happens that these new formats, NVFP4 and MXFP4, are actually not very good in terms of error for normally distributed data. However, if you have a mix between normal and spiky distributions, which is what you get if you do micro-rotations, it turns out you can actually get good results by quantizing to this format. The second idea is error correction. We have a not-so-great mean squared error on the quantization for both weights and activations, but then we apply a new variant of the GPTQ algorithm, which is an error-correcting weight quantization algorithm, with some specific modifications to adapt it to the FP format, which I'm not going to go into in the talk but am happy to discuss offline. The key observation here, and this is why we really chose this route, is that on modern GPUs such as the Blackwell line, even the lower-powered ones such as the 5090, you can actually fuse these very small micro-rotations into the quantization operation. Therefore the extra cost of the rotation is zero, so the method essentially has no overhead, or negligible overhead, over the original approach.
So now let's get back to what we really care about, which is accuracy. You can see the accuracy of our method and various other variants: on your left-hand side you have the NVFP4 format, where you can see that MR-GPTQ is really outside the margin of error across all of these benchmarks. These experiments are done on the Platinum Bench benchmark that was released by MIT earlier this year. We see that we're able to get about 97% accuracy recovery, so we're really outside of that unfortunate 95% ceiling that previous methods had, and for some of the larger models we're actually able to get to 99% recovery, so we're within the range of FP8 at that point. On the right-hand side, which presents the MXFP4 results, we're able to get roughly 95% recovery. So that format still makes things hard for us, but it's reasonable enough that I believe it could be used. So those are our findings; we currently believe this to be the state-of-the-art method for both NVFP4 and MXFP4. The same trends hold, and you get much better results, if you do weight-only quantization, which is supported in vLLM as well.
Okay, so the key question now is: can we make this fast and can we integrate it into vLLM? Here I really want to highlight the work of my postdoc, Roberto Castro, who is going to move to Red Hat AI next month. He managed to implement all of these kernels, for both NVFP4 and MXFP4, for both the forward pass and the backward pass, with support for the micro-rotations, in a new library called QuTLASS, which is able to get close to ideal performance in terms of speedup. You can follow the lines there, but we're essentially matching the 4x speedups predicted by the data sheets, and for larger batch sizes we get about a 2x speedup in vLLM using these formats. In case you actually want to try this out: the format is now supported in vLLM, so if you have models that are packed to the correct format, you can run them. Currently you would need to produce the models using my lab's compression algorithm, which is available together with QuTLASS, and we're actively working on pushing all of this into LLM Compressor so you can use these much easier flows. And I'll leave you with one interesting finding: one unexpected, or perhaps exceptional, benefit of the MXFP4 format is that because of this simpler scale structure, you actually get up to 20% faster matrix multiplication. So I think we're going to keep working on this; maybe we can actually get MXFP4 to match NVFP4 and essentially get this extra 20% for free. I think that's the end of my section.
Hey everyone. My name is Julian; I'm an open source engineer at Mistral and an occasional contributor to vLLM. And this is Patrick.
>> Sure. Hi, I'm Patrick. I work in the science team and also help on open source.
>> Okay. So we will introduce training reasoning models using vLLM at Mistral. We will introduce Magistral Medium; the paper is on arXiv, it was released in June, and kudos to all of our science team for making Magistral.
So, large reasoning models: essentially you have a model, which you can also call the policy for RL, and what you do is take prompts, feed them to the model, and the model makes completions; you want those completions to be good. So you compute a reward and train the model based on the reward.
So, to dive in: what kind of prompts can you have? For reasoning, it's usually math problems, like "what is two plus two?", or code problems, like "sort a list". You sample these prompts and give them to the model, which will generate several answers for each prompt. Sometimes the answer will be good (it will say 2 + 2 is four), sometimes it will be bad (it will say it's five), so you then verify each of the answers to make sure the model is trained on a good signal. To do that, you compute a loss based on the rewards and the logits computed by the model. We didn't give you the formula, which is a bit complicated, but you can find the formula of GRPO for Magistral in the paper.
and when you have the loss for like for
all deep learning models you just apply
grad updates so that you can train your
model
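For reference, the vanilla GRPO objective from the literature has roughly this shape (this is my sketch of the standard published formulation, not Magistral's exact variant; their paper describes the modifications they make):

```latex
% Vanilla GRPO (group-relative policy optimization), roughly as published:
% G completions o_1..o_G are sampled per prompt q, rewards R_i are normalized
% within the group, and a clipped importance-weighted objective is maximized.
\mathcal{J}(\theta)=\mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\min\!\Big(r_{i,t}(\theta)\,\hat{A}_{i},\;\operatorname{clip}\!\big(r_{i,t}(\theta),1-\varepsilon,1+\varepsilon\big)\,\hat{A}_{i}\Big)\right]
\;-\;\beta\,\mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta\,\Vert\,\pi_{\mathrm{ref}}\right]
% with r_{i,t}(\theta)=\pi_\theta(o_{i,t}\mid q,o_{i,<t})/\pi_{\mathrm{old}}(o_{i,t}\mid q,o_{i,<t})
% and \hat{A}_i=(R_i-\mathrm{mean}(R_{1..G}))/\mathrm{std}(R_{1..G}).
```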
To do that, we need several components in our infrastructure. First we have trainers: the trainer is essentially where you maintain the model weights and perform the gradient updates, and it's usually implemented in PyTorch or JAX. Then you have a set of generators: these are the ones that compute the completions based on the prompts. Generators use the latest policy, i.e. the latest weights computed at that time, to produce completions and output log probabilities for each of the prompts you give them. These are implemented with vLLM.
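As a rough illustration of the generator role only, here is a minimal sketch using vLLM's offline API (the model below is just a small open stand-in for the policy checkpoint; the real setup runs as a fleet of generator services):

```python
# Minimal sketch of a "generator": sample several completions per prompt with the
# current policy weights and keep per-token log-probabilities for the RL loss.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # stand-in for the policy checkpoint
params = SamplingParams(n=8, max_tokens=512, temperature=1.0, logprobs=1)

prompts = ["What is 2 + 2? Think step by step.", "Sort the list [3, 1, 2] in Python."]
for request_output in llm.generate(prompts, params):
    for completion in request_output.outputs:
        # completion.text goes to the verifiers; completion.logprobs feeds the loss.
        print(len(completion.token_ids), "tokens generated")
```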
Then you have verifiers. In reinforcement learning you can use models as verifiers, but for Magistral we didn't use them; we had several rewards that were computed, let's say, deterministically. For example, we wanted to enforce that the model used the same language for the reasoning as the prompt, that the answer was correct, and that the length of the reasoning was neither too short nor too long. These are implemented as simple checks rather than a reward model.
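A toy sketch of what such deterministic rewards might look like (the weights, thresholds, and the language detector below are hypothetical, purely to illustrate the idea, not Magistral's actual reward code):

```python
# Toy deterministic verifier: correctness, language consistency, and length bounds.
def detect_language(text: str) -> str:
    # Placeholder: in practice you'd use a proper language-identification library.
    return "en" if all(ord(c) < 128 for c in text) else "other"

def reward(prompt: str, reasoning: str, answer: str, reference: str) -> float:
    score = 0.0
    if answer.strip() == reference.strip():
        score += 1.0                                         # final answer is correct
    if detect_language(reasoning) == detect_language(prompt):
        score += 0.2                                         # reasoning in the prompt's language
    if 50 <= len(reasoning.split()) <= 8000:
        score += 0.1                                         # reasoning not too short, not too long
    return score
```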
So the vanilla approach for the infra is to have the generator load the trainer weights directly, and we send one batch of B prompts to the generator. We then wait for the generator weights to be loaded, and then we can start to compute the M generations per prompt based on the batch, so you have M times B requests. You compute the answers and send them to the verifiers, and once the answers are received by the verifiers you compute the M times B rewards, which you send back to the trainer. The trainer then waits until all the rewards and answers are received, computes the loss, and performs the gradient updates. But this has several problems and is very slow, and I will leave it to Patrick to explain why.

>> Cool, thanks. Okay, so maybe we'll take another look at the vanilla approach. I think it's quite obvious that certain components are going to be idle while others are running. Whenever you do a gradient update in the vanilla approach, the generator doesn't get fed anything, so the generator is idle and we don't generate anything. Similarly, while we generate, the trainer cannot train. This is a big problem, because you might train on thousands of GPUs and you might also need tens or even hundreds of GPUs for generators, and we definitely don't want all those GPUs sitting idle.

Another big problem, and that's the first point here: how do you actually transfer the weights from the trainer to the generator? When you think about models like DeepSeek, with roughly 500 billion parameters, that's easily a terabyte in BF16 in terms of file storage, and you need to dump that from the trainer and then load it into the generator. The reason this is the first approach everybody takes is that the topology you use for training is very different from the topology you use for generation. That's a big problem: dumping that much data onto distributed file systems can take a lot of time. The different topologies also make it even harder, because you somehow need to consolidate or remap the weights. Also, if we use vLLM for generation, vLLM doesn't necessarily have the same weight mapping that we use for training, so we might even have to change the values of the tensors, for example because vLLM merges the Q, K, and V matrices and we might not do this in training. Also a problem.

The next problem is that you usually want a deterministic number of requests in order to train. Say you want 10 rollouts and 16 prompts: that's 160 requests, and you need all 160 to do one gradient update. The problem is that once you start generating, some requests always finish much earlier than others, so you end up spending a lot of time generating maybe just one or two straggler requests. Again, a lot of lost time.
Then another very fun challenge in RL is that the generation length changes during training. In the beginning you might only generate up to around 100 tokens; at the end you can generate up to 10,000 tokens. That also poses a big problem. And to make this slide even messier, the last point, which I already touched on, is that the trainers are idle while we generate.

Okay, so how do we solve that at Mistral for training? The first obvious thing is that you want to train and generate in parallel. You cannot train and generate in parallel if you want to be perfectly on-policy; that's impossible. So what we do is be almost on-policy: you train, and you only update the generator weights every, say, 10 steps. That means for those 10 steps your generators are going to be a bit off-policy, but it's not a lot, 10 steps out of maybe 10,000. So you're not going to update the generator weights after every gradient update.

The next thing: some people might intuitively say, look, we're going to use the same GPU allocation for generation that we use for training. But that means you have to stop training while you use those GPUs for generation. So we want to use different allocations, and we try to allocate as many GPUs to generation as needed until training becomes the bottleneck again. If we can generate as many tokens as we can consume during training, then we have a balance, and we shouldn't have any bottleneck, since we can generate and train at the same time.
Cool. And now, the beauty of why this actually works so well for Magistral is that we just update the weights on the fly. I talked earlier about how you would have to dump a terabyte of data and then load it again; what we actually do is not dump it at all. We know that our model topology sits on, say, a thousand GPUs allocated for training, and those are connected via NCCL to our generator GPUs, so we just update the weights directly through NCCL, essentially using NVIDIA's fast interconnects and NVLink for this. It still means we have to change the topology, and it also still means we potentially need to merge weights. So it is quite finicky to take this FSDP-sharded state and map it to whatever vLLM expects: it's not just key renaming, it's a lot of splitting tensors and potentially merging them. And the fun part is that we don't even update the KV cache; we just leave the KV cache as is and update the weights. When we first tried this and it worked, we thought, all right, that's cool, we didn't expect that. But surprisingly, the model generates pretty good answers with a somewhat outdated KV cache.
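A rough sketch of the remapping idea only (my own illustration, not Mistral's actual code): a trainer-side state dict with separate q/k/v projections has to be fused into the qkv layout that vLLM's Llama-style modules expect before the weights are pushed to the generators. Names follow common HF/vLLM conventions but are illustrative.

```python
# Fuse separate q/k/v projection weights into a single qkv_proj tensor and rename.
import torch

def remap_for_generator(trainer_state: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    remapped = {}
    for name, tensor in trainer_state.items():
        if ".q_proj." in name:
            fused = torch.cat(
                [tensor,
                 trainer_state[name.replace(".q_proj.", ".k_proj.")],
                 trainer_state[name.replace(".q_proj.", ".v_proj.")]],
                dim=0,
            )
            remapped[name.replace(".q_proj.", ".qkv_proj.")] = fused
        elif ".k_proj." in name or ".v_proj." in name:
            continue  # already consumed by the q_proj branch above
        else:
            remapped[name] = tensor
    return remapped

# In the real setup, the fused tensors are then broadcast to the generator GPUs
# over NCCL and handed to vLLM's weight-loading path, rather than written to disk.
```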
Cool. Another beauty of vLLM that really helped us, and which solves the problem of growing lengths during training, is that thanks to PagedAttention you don't really think about length that much. You have a budget of memory pages and you can just allocate them to whatever sequence you want. So whether you generate four sequences with 10,000 tokens or 256 sequences with around 128 tokens doesn't really matter from a utilization point of view. That's quite nice: GPU utilization doesn't drop just because our lengths get longer and longer, because vLLM uses the pages up correctly. We still need more time if our lengths grow, because we won't get the same number of requests through in the same time, but then you can just spin up more generators during training if you need to. That's also fine; it's not great with Slurm, but it's fine.

Okay, cool. And now, to finish the talk, our beautiful slide that explains this a bit. What I want to show here is these nice colors and how they go from yellow to red: that shows our training steps. You see pi at step i minus 2, i minus 1, i, and then i plus 1; these are gradient update steps. You can see that some requests might actually span three gradient update steps, where the model weights are updated at every step but the KV cache just stays the same, and that works. So you might have certain sequences that are generated across, say, three model updates, and that's still fine: they have 10,000 tokens, and only the most recently generated tokens actually come from a KV cache computed with the current weights; the older KV cache entries still come from old weights, but it works. Cool, I think that's it from our side. If you have any questions about this, just catch us later.
>> Hi everyone. We are Matis and Miguel from Mistral. We work in the inference team, so our job is to make our models run as fast as possible in production and to make that reliable. One quick foreword: we're both occasional contributors to vLLM as well, and one great benefit of using vLLM is that our science team can tinker with it, doing RL, doing training and things like that, and we can use the same software for inference. That helps us a lot: a single source of truth, making sure our models run as expected. So today we're going to cover disaggregated serving with prefill/decode. I'm going to cover the basics, and then Matis will go into the nitty-gritty details of what we experienced in production.
All right.
Okay. So obviously, as Mistral, we are an LLM completion service provider, and when our users ask very important questions to our service, like what are France's top five best cheeses, what you see as a user is first the time it took to get the actual first token. We call that the time to first token (TTFT). It's very important for users to feel some kind of responsiveness from your service. And then you have the latency between each and every token, the inter-token latency (ITL). We're going to talk a lot about these two during this talk.

Of course, when you serve many different users, you want to aggregate those metrics and monitor them. Looking at the median inter-token latency, for example, is not enough; we also want to track the P99, the worst percentile of what a user can get from your service, and we don't want any degradation across the whole range of percentiles.
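A tiny illustration of those two metrics with made-up numbers (TTFT from the first token timestamp, ITL from the gaps between tokens, then the percentiles we monitor):

```python
import numpy as np

# (request start time, [token arrival timestamps]) for two hypothetical requests
requests = [
    (0.00, [0.21, 0.24, 0.27, 0.33]),
    (0.50, [0.95, 1.00, 1.04, 1.09]),
]

ttfts = [tokens[0] - start for start, tokens in requests]                      # time to first token
itls = [b - a for _, tokens in requests for a, b in zip(tokens, tokens[1:])]   # inter-token gaps

print("TTFT p50 / p99:", np.percentile(ttfts, [50, 99]))
print("ITL  p50 / p99:", np.percentile(itls, [50, 99]))
```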
So, very simply, how inference works. You have a request coming in, and it goes through the prefill phase, which is compute intensive because you can process all the prompt tokens in parallel, so you're actually able to leverage all the tensor cores on your GPU. That's nice, and that's how you populate your KV cache. Then the request goes through the decode phase, which is very memory bound: for each and every token you generate, you have to load all your model weights from global memory onto the chip.

To have a service that can accept multiple requests arriving at different times, vLLM does what's called in-flight batching, or continuous batching; it has several different names. The idea is quite simple: you process the batch step by step, and every time new requests come in, you can add them to it. So here you would have request A going through its generation phase, batched together with the context (prefill) phase of requests B and C. That's nice because you're able to leverage your GPU resources a bit better, since the prefill phases really use all of them.

Okay, but when you do this, there's something you don't want: high spikes in the inter-token latency, because every time you're doing decode and another prefill request comes in, it gets scheduled in the same step, which then takes longer to process. As a user that's very bad, because your token stream suddenly stutters a little bit, which is worse in terms of responsiveness. And if we want to actually make money generating tokens, we want to aggregate as many requests as possible to use our GPUs to their maximum performance, but if we do that, we increase those latencies a lot. So it's difficult to apply in practice for user-facing interactions like chat.
So one of the solutions is prefill/decode disaggregation. You run the prefill phase and the decode phase on two distinct, physically different sets of GPUs: one set only does prefill and the other only does generation, so you never have prefill and decode scheduled at the same time, and your inter-token latency is much more stable. Another side benefit of disaggregated serving is that you can optimize both deployments separately. You could, for example, have different hardware doing prefill and generation: for prefill you mostly care about the number of FLOPS, so you might want GPUs with a huge amount of FLOPS and care a little less about memory bandwidth, whereas for generation you could optimize for hardware with better bandwidth. We also talked a bit earlier about different quantization schemes: you could have, for example, weight-only quantization on the generation side, to benefit from needing less memory bandwidth to load your weights, while your prefiller uses a different quantization scheme. The same goes for sharding your model: you could have a different TP size or EP size for generation and for prefill. So it has a lot of benefits.

How does that work in vLLM? At a very high level, it's fairly simple. There is a KV connector API that you can use for this. What you do, for example, is send a request to a prefill instance and sort of trick it into only doing prefill by setting max_tokens equal to one, so it only generates one token. When the request comes back, it says: my KV cache is here, you can fetch it, through NIXL for example. You then send that to the decode instance, which fetches the KV cache and proceeds with further generation.
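A toy proxy sketch of that flow, loosely modeled on vLLM's disaggregated-prefill examples (the field name "kv_transfer_params" follows those examples and may differ across vLLM versions; the URLs are placeholders):

```python
# Toy prefill/decode proxy: prefill with max_tokens=1, then forward the KV-transfer
# metadata to the decode instance, which pulls the KV cache and keeps generating.
import copy
import requests

PREFILL_URL = "http://prefill-host:8000/v1/completions"
DECODE_URL = "http://decode-host:8001/v1/completions"

def disaggregated_completion(request_body: dict) -> dict:
    # 1) Run the prompt through the prefill instance, forcing a single output token.
    prefill_req = copy.deepcopy(request_body)
    prefill_req["max_tokens"] = 1
    prefill_resp = requests.post(PREFILL_URL, json=prefill_req).json()

    # 2) Hand the KV-transfer metadata to the decode instance.
    decode_req = copy.deepcopy(request_body)
    decode_req["kv_transfer_params"] = prefill_resp.get("kv_transfer_params")
    return requests.post(DECODE_URL, json=decode_req).json()
```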
And this is our goal: this is actual data we measured in production. The red line is the P99 of the inter-token latency, and as you can see it's huge compared to the P50, which you can see on the orange and yellow lines, for example. What we want to achieve with disaggregated serving is to get to that blue line, which sits a bit above the others but is completely manageable. So I will leave the floor to Matis to explain that.
>> Yeah, thank you Miguel. So I will now cover the challenges of prefill/decode disaggregation; there are several of them. The main one you will encounter is usually the infrastructure. There are different ways of doing prefill/decode disaggregation, and one of them is over Ethernet, but if you do that you will suffer a high TTFT, because the KV cache can be very large. For instance, on one of our older models the KV cache can be up to 16 GB, which is a lot, and that's already in INT8. So you can imagine: if you don't have InfiniBand, which is basically a way to communicate very fast between nodes, between GPUs, you will have a high TTFT. Just for context, the green links you see there are several hundred gigabytes per second. To do this within vLLM, we are using NIXL, an NVIDIA library that came out in March, I think, and which uses UCX as a backend; they have several backends, but the one we're using right now is UCX, and it's a way to leverage InfiniBand. InfiniBand is itself a challenge, because you need to set it up on your infrastructure, which is not easy, and you obviously need to configure it and have the whole stack running.
Then you will face another challenge: how do you correctly size your disaggregated instances? Usually when we talk about disaggregation we talk about XPYD, where X is the number of prefill instances and Y is the number of decode instances. So it's quite simple: you will have X prefills and Y decodes. This sizing depends on several things. One is the model, obviously, and the hardware, because models won't all behave the same way. But I think the most important factor is your input/output sequence-length distribution. The high-level overview is quite simple: if you only send long contexts to your model, you will be prefill heavy; if you generate only long outputs, which is what happens with reasoning, you will be decode heavy.

There are some caveats with that. The first one is that when you have a long-tail distribution of input sequence lengths, in a traditional setup within vLLM you will see latency spikes on the other requests. The reason is that you need to enable partial prefill, which you can do with the long prefill token threshold, and you can increase the maximum number of batched tokens on the prefill side. That's something you can do because on the prefill instances you are only compute bound, so you can increase the maximum batched tokens as much as you want.
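A hedged sketch of those prefill-side knobs, using vLLM's offline API (argument names match recent vLLM releases but may differ in yours; the model and values are purely illustrative, not a recommendation):

```python
# Illustrative prefill-side configuration: chunked/partial prefill plus a larger
# token budget, since the prefill instances are compute bound.
from vllm import LLM

prefill_llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",       # placeholder model
    enable_chunked_prefill=True,            # allow partial prefills
    long_prefill_token_threshold=2048,      # prompts longer than this count as "long"
    max_num_batched_tokens=32768,           # push up the per-step token budget
)
```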
Another caveat: you obviously need to be sure that your model won't put too much pressure on your decode side, so you must watch how your model behaves, especially when reasoning. And the last caveat is that you need to be extra sure that you actually need disaggregated serving: if you don't use your model much, for instance for older models or on a local setup, in-flight batching is much more efficient resource-wise.

Another thing that is quite important, and also a challenge, is that you need to be able to schedule your requests. The reason is that for a given set of resources, say 100 GPUs, 50% of them will be used for prefill and the other 50% for decode, so only half of them can handle each part of the workload. The issue is that you can get load spikes, which is not great, and one side can get a flood of work. The way you handle that is by scheduling, and that's something you can do with llm-d; we are working on this at the moment.

Last but not least, you need to be aware of how stable your deployment is. If your prefill crashes, because you are rolling out an update or simply because there is a vLLM bug or whatever, you need to know about it and be extra sure that your decode won't die as well. This stability has been greatly improved recently, and thank you for that, it's great. The other issue you must be aware of is that when you upgrade vLLM, the prefill and decode versions need to be compatible. It seems obvious, but in production it's actually quite challenging, and you also need to keep the prefill/decode ratio.

For everything I mentioned, there are a few metrics to look at. You need to watch some metrics on the prefill side, obviously the pending requests in the queue, and some on the decode side, which are also the pending requests in the queues. You also need to look at the total experienced TTFT, which corresponds to the sum of the first TTFT and the second TTFT, because that is the TTFT the user will see. So if you notice spikes, especially in the P99 TTFT, be aware that it's either a scheduling issue or a sign that you need to scale up some instances.
Another optimization you can do with disaggregation is to trade a slight increase in the P99 inter-token latency for an improvement in your median TTFT. This is actually quite important, because users need a good TTFT, especially for small requests, and you do that by routing small requests directly to the decode instances. Why does that work? Well, you only increase the P99 a little because you only route very small requests, but in practice, at least for us, most requests are very small at the beginning of a conversation, things like "hi", and you want to offload those from the prefill side. In general, the total response time is expected to be better than in traditional deployments. A minimal routing rule along these lines is sketched below.
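```python
# Purely illustrative routing rule (the threshold and pool handling are hypothetical):
# very short prompts skip the prefill pool and go straight to a decode instance,
# trading a small P99 ITL increase for a better median TTFT.
SMALL_PROMPT_TOKENS = 256  # assumed cutoff, tune to your workload

def pick_target(prompt_token_count: int, prefill_pool: list[str], decode_pool: list[str]) -> str:
    if prompt_token_count <= SMALL_PROMPT_TOKENS:
        return decode_pool[0]   # "hi"-style requests: decode handles the tiny prefill itself
    return prefill_pool[0]      # long prompts: disaggregated prefill, then decode
```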
So, in conclusion: disaggregated serving improves the quality of service, but at the cost of additional complexity. That's kind of obvious, but it needs to be said, because you have to set up InfiniBand, like I said. It's very much worth it for interactive applications and high workloads, but be aware that it's not always worth it. Correctly sizing XPYD is kind of hard: you need to look at the metrics I mentioned earlier, you need to know your input/output distribution, especially the long tail, you need to schedule the requests properly, and stability matters a lot. But in the end, I would say disaggregated serving is quite nice. This is a snapshot I took a few days ago. You see some spikes there; that's because on this setup we didn't have scheduling yet (we do have it today, but I didn't update the slide). The P99 ITL there is high, but overall it's much better, and the reason it's high is that on this setup we routed a few more of the small requests to the decode. If we want to reduce that, we basically need fewer small requests going directly to the decode.

And I would also like to thank the vLLM team and the llm-d team for their collaboration and dedication on this subject and others, and kudos to everyone who made this meetup happen. Special thanks to Robert Cho and Will Heaton for their significant contributions on this subject, and also because they helped me with a horrible bug. And by the way, we are hiring; you can check our careers page.
>> Hello everyone, good evening. My name is Thomas Parnell, I'm a principal research scientist at IBM Research and a committer on the vLLM project. Today I'm going to talk about the work we've done to enable hybrid models as first-class citizens in vLLM: the journey from supporting these models as, really, a hack in V0 to fully supported, well-integrated models in V1.

Okay, so before I get into what a hybrid model is, I want to motivate a bit why we need them. I think everyone in this room is familiar with attention and how successful it's been at language modeling; it has revolutionized the industry. And in vLLM we have tons of very smart ways of implementing attention on modern GPUs, including vLLM's dependencies like FlashAttention and FlashInfer, and techniques like PagedAttention, tiled softmax, tensor-core kernels, and quantization. So much amazing engineering has gone into optimizing attention on modern GPUs. However, despite all that, there are still some theoretical issues with attention which we can't avoid. In particular, when sequences get very, very long, we have two main issues. One is that the state we need to maintain between iterations grows linearly with the sequence length, so as we go to a million tokens the KV cache becomes huge. And secondly, for really long sequences the time to do the prefill, which affects your TTFT, blows up quadratically. These are theoretical problems that no amount of engineering can really overcome.
So why do we care? Maybe we don't care about long sequences? I just want to say why this is so important: in the kinds of applications and the ways people use these models, we see increasingly long sequences. Some examples. One is RAG, where you look up a bunch of documents in some vector database and insert them into your prompt; depending on how many documents there are, this can lead to a really long prompt being sent to your inference server. Another example is agentic patterns, where you have a multi-turn interaction with the model: you ask the model to generate something, you do a tool call, you get the output from that tool, you pass it back to the model, and this process iterates, which can lead to really long sequences being handled on the inference server. And finally, we have the emergence of thinking and reasoning models, where we tell the model to think things through step by step and insert thinking tokens; this again can lead to really long rollouts from the model that we need to be able to support. So long sequences are important, and we need to think about how we can support them in vLLM without ruining performance.
I'm not going to give you the whole history of state space models and linear attention; I'm just going to cover some developments over the last few years. I'll start from around 2021 with the S4 paper. This was one of the first attempts to apply state space models, which have a long history with connections to control theory and RNNs, to language modeling. It's not the first state space model, but it's one of the most successful attempts to apply SSMs to language modeling. The equations at the top show the core idea: a recursive mapping from an input sequence x to an output sequence y, with matrices A, B, and C. A and B take your input and map it into a latent state h, and the C matrix maps h to the output. What's really important to understand is that this recursion is linear in the sequence length: no matter how long your sequence gets, you always do T steps if your sequence length is T.
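In standard notation, the discrete SSM recurrence being described is:

```latex
% Discrete state-space recurrence (S4-style): a fixed-size latent state h_t,
% updated linearly once per token.
h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t
% T update steps for a length-T sequence (linear time), and \dim(h_t) is
% constant, unlike a KV cache that grows with t.
```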
Secondly, the dimensionality of this latent state h is constant: unlike the KV cache, which grows as your sequence gets longer, the state h stays the same size. I'll come back to that later. One downside of S4 is that, while it was pretty good at language modeling, certain things like selectively copying parts of the input and reasoning about things in context still were not very good.

That's where Mamba comes in (and my clicker is not working... there we go). Mamba solved this problem with selective copying and in-context reasoning by making these A, B, C matrices dependent on the time step, so they can vary at each iteration of the recursion. Unfortunately, despite being very good at those tasks, Mamba was still slower than attention for moderate sequence lengths, because the algorithm doesn't map nicely onto matrix multiplications, so it can't use the tensor cores that are so prevalent on modern GPUs like Hopper and Blackwell.

The Mamba-2 paper, which came about a year later, solved that problem. They found a way of introducing structure into the matrix A that allows you to rewrite the whole algorithm as one big matrix transformation from the input sequence to the output sequence, which lets you use all the tensor cores very efficiently. And what's super interesting in that paper is that they proved a connection between this form of SSM and something called linear attention, which brings in a whole other subfield of models.

Linear attention is described in a 2020 paper, and the idea there is somewhat different: they approximate the softmax in a way that lets you fold it into feature maps, so you end up with a linear attention mechanism.
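For reference, the usual linear-attention approximation from that line of work (my notation, not the slide's):

```latex
% Replace the softmax similarity with a feature map \phi so that the key/value
% products can be accumulated into a fixed-size running state.
\mathrm{Attn}(Q,K,V)_t \;=\; \frac{\sum_{s\le t}\mathrm{sim}(q_t,k_s)\,v_s}{\sum_{s\le t}\mathrm{sim}(q_t,k_s)}
\;\approx\; \frac{\phi(q_t)^{\top}\big(\sum_{s\le t}\phi(k_s)\,v_s^{\top}\big)}{\phi(q_t)^{\top}\sum_{s\le t}\phi(k_s)}
% The running sums over \phi(k_s)v_s^{\top} and \phi(k_s) form the constant-size
% recurrent state, so decoding costs O(1) per token.
```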
This idea spawned tons and tons of variants. I'm not going to list them all, but one relatively recent one which has been extremely impactful is Gated DeltaNet. It's again a linear attention variant, which the authors show can actually outperform Mamba-2 in terms of quality on downstream tasks, by better learning associations between the key and value tensors.

So that's the history. This is an atlas of the hybrid models we support in vLLM today. You can see there are quite a lot, and they're diverse, so they don't all do the same thing. One thing you will notice is that they all still use attention, so attention is still important; we don't see it going away right now. But in addition to attention they all use different Mamba or linear attention approaches. You see a cluster, including Granite 4 from IBM as well as Nemotron Nano from NVIDIA, that uses attention and Mamba-2; Mamba-2 is very widely used. And towards the bottom you see models like Qwen3-Next and Kimi Linear, relatively newer models that use linear attention, like Gated DeltaNet or Kimi Delta Attention, which is a variation on the gated delta net. Finally, a recent trend is that these models combine this hybrid idea of mixing attention and Mamba with MoE. So those are the three key components we see emerging in these hybrid models.
Cool. As I said, these models are diverse, so they don't all have the same architecture, but there are some commonalities between them, particularly in the way we need to manage the state. For attention models, vLLM manages the KV cache in blocks: a block is a contiguous region of GPU memory with enough space to store the KV cache, for all layers of the model, for a small number of tokens, say 16. For a representative example, a block corresponds to around 64 kilobytes of GPU memory. And as we generate new tokens with an attention-based model, we have to concatenate to the KV cache, so as we generate more and more tokens, we keep appending blocks and creating more and more KV cache.

Mamba and linear attention models, on the other hand, work differently. The state is much bigger: rather than appending to the KV cache each time you generate a token, you have a single Mamba state which you update in place. And what's really important to note is that this state is huge compared to a block of KV cache: 64 kilobytes for a KV cache block versus 2.57 megabytes for the Mamba state. So for a single block, the KV cache is about 40 times smaller, but when you go to long sequences like 128K, the KV cache becomes about 200 times larger. This is why people are excited about these models and where they can deliver massively higher throughput for large batch sizes.
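A quick back-of-the-envelope check of those numbers, using the representative sizes quoted in the talk:

```python
# KV-cache blocks grow with the sequence; the Mamba state stays constant.
KV_BLOCK_BYTES = 64 * 1024          # one KV-cache block (~16 tokens), ~64 KB
TOKENS_PER_BLOCK = 16
MAMBA_STATE_BYTES = 2.57 * 1024**2  # one Mamba state, ~2.57 MB, fixed size

# Short sequences: a single KV block is ~40x smaller than the Mamba state.
print(MAMBA_STATE_BYTES / KV_BLOCK_BYTES)                  # ~41

# Long sequences (128K tokens): the KV cache is now ~200x larger than the state.
kv_cache_128k = (128 * 1024 / TOKENS_PER_BLOCK) * KV_BLOCK_BYTES
print(kv_cache_128k / MAMBA_STATE_BYTES)                   # ~200
```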
So how did we first support these models in vLLM? We basically hacked the support into V0 in the following way. On the left-hand side you see how vLLM manages the attention blocks: we have one KV cache tensor for each attention layer in the model, and interleaved across those tensors are the attention blocks. The number of blocks N in this figure is something vLLM determines automatically: when we start the inference server, vLLM loads the model, does some forward passes, sees how much memory is left over, and calculates how many blocks it can fit. That's a super nice feature of vLLM. The Mamba state, on the other hand, as you can see in the figure on the right, was in V0 simply allocated for the maximum batch size the server can support.

What's the problem with that? The maximum batch size is something the user chooses, and we had a lot of problems with users setting it wrong. If it's too high, you get CUDA out-of-memory errors; if it's too small, you're not fully utilizing the GPU. So it's a bad user experience, and we wanted to automate that choice.

The goal of this effort was therefore to unify how we manage the KV cache with how we manage the Mamba state. This is partly for elegance, we want a nice, clean way of doing it, but it also allows us to integrate hybrid models properly with vLLM V1 and benefit from things like prefix caching, KV cache transfer, and the P/D disaggregation we just heard about, plus other cool stuff from V1 like the torch.compile integration, better scheduling, and so on. V1 is great; we want to support hybrid models there.
Before I tell you how we unified the Mamba state with the KV cache, I need to explain how vLLM supports a different kind of hybrid model. "Hybrid model" is a bit of an overloaded term; we can also use it to refer to models like GPT-OSS, which mix full attention with sliding-window attention layers. How we handle that in vLLM V1 is shown in this figure, for the example of Google's Gemma 3 1B (instruction-tuned), which has four full-attention layers and 22 sliding-window attention layers. What we do is form what we call KV cache groups, where each group contains layers of the same type. In this example we have one group with the four full-attention layers, and every other group is a sliding-window attention group, with a bit of padding if necessary, because we want all groups to be the same size. How we store the state is shown on the right: we have one KV cache tensor for each element in the group, so four in this example, and the different KV cache groups actually share those tensors. So we have attention blocks and sliding-window attention blocks sharing the same data structure, which is really great because it allows very simple memory management: we can mix attention blocks and sliding-window attention blocks interchangeably, provided the page size, i.e. the size of a block in GPU memory, is the same for all of the KV cache groups.

Maybe you already start to see the problem for Mamba: the Mamba state is huge compared to the attention state, 2.6 megabytes versus 64 kilobytes. So how do we solve that? It may seem like black magic, but I'll walk you through it; it's actually very simple. Firstly, we relax the constraint that the block size for attention and Mamba has to be the same. Secondly, we take the attention layers and increase the block size hugely, from 16 to hundreds or thousands of tokens, to ensure that the attention page size is at least as large as the Mamba page size, and we can do a bit of padding on the Mamba side; I'll show you an example of this. Finally, and this is super important, we have to ensure that the views into the KV cache tensor are compatible, and I'll show you an example of that shortly.
Okay, so this shows how we align the pages, for a real example: Nemotron Nano from NVIDIA. At the top you see the attention KV cache group, which has block size 16 and page size 64K. All the other groups are Mamba groups; their block size is set to the maximum model length, in this case 128K, but the page size is 2.6 megabytes. That's the problem we have to solve. So first we simply select a block size for attention that is a multiple of 16 such that the attention page size is bigger than the Mamba page size, and then we do a bit of padding. It sounds complicated, but it's really simple. You might already be saying: block size 672? That sounds really weird, that's not going to perform well. I'll come back to that in a moment.
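A small sketch of that block-size selection (my own illustration, not the exact vLLM code); with the representative numbers from the talk it lands exactly on the 672 shown on the slide:

```python
import math

def align_block_size(attn_bytes_per_token: int, mamba_page_bytes: int, base: int = 16):
    # Smallest attention block size that is a multiple of 16 and whose page
    # covers the Mamba page; the Mamba page is then padded up to match.
    tokens_needed = math.ceil(mamba_page_bytes / attn_bytes_per_token)
    block_size = math.ceil(tokens_needed / base) * base
    padded_mamba_page = block_size * attn_bytes_per_token
    return block_size, padded_mamba_page

# 64 KB per 16-token block -> 4 KB per token; Mamba page ~2.6 MB.
print(align_block_size(4 * 1024, int(2.6 * 1024**2)))  # (672, ...)
```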
Regarding how we handle the striding: this shows how it worked in our first attempt, and it was horrible. It gave really bad bugs, and I spent a long time debugging them. The two KV cache groups share the same view into a KV cache tensor. On the left you see the way the FlashInfer attention backend looks into that tensor: key 0, value 0, key 1, value 1, laid out block by block. The Mamba view, on the other hand (Mamba has two sub-states, the conv state and the SSM state), is laid out differently: all the conv states for all the blocks, followed by all the SSM states for all the blocks. What happens is that if you write to Mamba state zero, you write data into that tensor in a way that doesn't align with what attention expects, and you end up with completely corrupted data and really crazy eval results. The simple solution is to change the striding, and we do a lot of tricks like this to make all the different attention backends line up: we have to change the strides to ensure that the views are compatible and that writing to one block means the same thing across the different views.

Cool. Those are the main points I wanted to make to give you an idea of how vLLM unifies state management for hybrid models, but there was also a bunch of other work needed. We had to make a lot of changes in the modeling code. And, as I think Michael mentioned at the beginning, CUDA graphs are super important for hybrid models, because these Mamba and linear attention mechanisms are implemented using tons of Triton kernels, and those kernels have a high launch overhead. You see overhead on the CPU coming from launching lots of different kernels, and this is why piecewise CUDA graphs don't really give us enough performance and we have to use full CUDA graphs. Finally, prefix caching is something we landed recently; it's a bit experimental, so please try it and let us know if you have any problems.
Finally, I mentioned the weird block size, right? Block size 672. Some attention backends, like FlashAttention or FlashInfer, support this just fine; others do not. In particular, when you're running on Blackwell, we want to use the TRT-LLM kernels to get the best performance out of the GPUs, and those kernels only support block sizes 16, 32, and 64; you can't just use some arbitrary value. So we've implemented something recently where we can decouple the kernel block size from the block size used by the KV cache manager. This is super important if you want to use hybrid models on Blackwell.

Okay, cool. To close, I want to show you some benchmark results. For this we use a Granite 4 model, a hybrid model that combines attention, Mamba-2, and MoE. It has 7 billion parameters, and 1 billion of those are active at any given time. We use vLLM's built-in benchmarking tool, vllm bench serve, with random data, so prefix caching is not involved in this benchmark. We use quite long prompts, 32K tokens in and 128 tokens out, and we sweep the concurrency from one to four up to 16. We used the last version of vLLM that supported both V0 and V1; V0 has now been stripped out, so with the latest vLLM, V1 is what you have to use. But as you'll see, that's great. (Wait, how do I go back?)
Cool. So, thanks to the Mistral folks for explaining what TTFT and ITL are; I'm going to use the same metrics here. You see three panes: on the left, concurrency against TTFT; in the middle, concurrency against ITL; and on the right, concurrency against throughput on the y-axis. You see three colors: orange is V0, green is V1 with piecewise CUDA graphs, and blue is V1 with "full and piecewise", which means we use piecewise CUDA graphs for batches that include prefill requests and full CUDA graphs for decode-only batches. What you see is that for TTFT, V1 is consistently better; this comes from the torch.compile integration. For ITL, if we only use piecewise graphs, V1 is actually worse at low concurrency; this comes from what I said about the overhead of launching Triton kernels, which matters especially when the latency is very low. But when we use full-and-piecewise, we recover that gap and the ITL is consistently better in V1 than in V0. Similarly for throughput, we see 31 to 91% higher throughput when using full-and-piecewise CUDA graphs in V1 compared to V0.

Finally, I just want to say it's great to see so many local fans of vLLM and people interested in this work. I'm based locally, in our research lab in Rüschlikon just down by the lake. We work a lot on LLM inference and also on other topics like model architectures, and we're working on supporting IBM's own accelerator, Spyre, in vLLM. If you're interested in this stuff, please come and find me after the talk; I'm happy to chat. Thanks.
All right. I'm Tyler Michael Smith, and this is the last talk, which will hopefully be pleasant news if, like me, you are very hungry right now. I'm going to be talking about llm-d, a distributed inference framework that we have built around vLLM.

The way we think about the inference stack is that vLLM is the inference server that connects the models at the top to the accelerators at the bottom, and then we've got llm-d, which is the Kubernetes-based distributed inference framework we use to orchestrate it and add a lot of distributed optimizations. It's really about optimizing inference performance and improving SLO targets.

So the question is: why llm-d? What do we have to do that's special? Why can't we just use Kubernetes as is, with its inference gateways? The big difference between LLM requests and typical HTTP requests is that HTTP requests are cheap, uniform, fast, and often stateless, whereas LLM requests are slow, you can't predict beforehand how long they're going to take, they're non-uniform, very expensive, and stateful, since we have to manage the KV caches. They're orders of magnitude more expensive per request.

Okay. So, like I said, llm-d is really about bringing together vLLM and the Kubernetes community. This is an architecture diagram; just to trace a request: a request comes in and hits the inference gateway, which makes a gRPC call to the inference scheduler, the endpoint picker that decides which vLLM pods the request should be routed to. That decision comes back to the inference gateway, which then sends the request to the vLLM pods, and there may be prefill/decode disaggregation involved here, like you heard during one of the Mistral talks.
In the llm-d community we have these four "well-lit paths". These aren't productized solutions; they are really ways of showing people, companies, and users a path to a really good, opinionated deployment in a specific case, and they basically highlight the features we have in the project. The first is intelligent inference scheduling, and I'll walk through all of them.

For intelligent inference scheduling, as I said, we have this endpoint picker that decides which vLLM instances a request should get routed to, and we have a couple of different routing algorithms in there. One is load-aware routing: every 200 milliseconds the endpoint picker scrapes the metrics of all the vLLM pods, so it knows exactly how many requests are in flight, for instance, and can use that to pick the pods that have the least load. That's good in essentially all cases.
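A purely illustrative least-load scorer in the spirit of the endpoint picker (the real llm-d scheduler is a pluggable component; the names here are mine):

```python
# Pick the vLLM pod with the fewest in-flight plus queued requests, based on
# metrics scraped periodically from each pod.
from dataclasses import dataclass

@dataclass
class PodMetrics:
    name: str
    num_requests_running: int
    num_requests_waiting: int

def pick_least_loaded(pods: list[PodMetrics]) -> str:
    return min(pods, key=lambda p: p.num_requests_running + p.num_requests_waiting).name
```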
We also have prefix-aware routing, and there are a couple of flavors of this. First, the endpoint picker can keep track of where it has routed requests in the past. We take advantage of a feature in vLLM called automatic prefix caching: if you're processing the same prompt prefix multiple times, the KV cache has already been computed, so it doesn't have to be processed again on subsequent requests. The endpoint picker can exploit this and route to the same pods that it thinks already have those prompt prefixes processed. We also have precise KV scheduling, built on a KV events API where the vLLM pods report back to the endpoint picker when they add or evict blocks from their KV caches, so the endpoint picker knows really precisely what's in cache. This is really good for multi-turn conversations, like talking to a chatbot in Slack, and it can be really good for agentic workflows as well. And it's all very pluggable: we have a lot of different scorers, filters, and request control.

Here's a performance chart comparing round-robin scoring to KV-cache- and load-aware routing with a KV utilization score. You can see we're reporting, in some cases, up to a 50x improvement in TTFT, mostly from taking advantage of prefix-aware routing.

Okay, the second well-lit path is prefill/decode disaggregation. You heard a lot about this in the Mistral talk, so I won't go too deep. In llm-d we're using NIXL, the exact same code that the Mistral AI folks are using. We have a sidecar on the decode pod: the gateway routes first to the sidecar, which routes to the prefill pod, which does the prompt processing and sends the request back to the sidecar, which then gets routed to the decode pod. When that happens, the decode pulls the KV caches from the prefill pod using NIXL.
We use GPU Direct RDMA via UCX, via the NIXL integration, which uses the KV connector API in vLLM. This is asynchronous, zero-copy, and has zero memory overhead: it pulls directly from the prefiller's KV cache and inserts directly into the decoder's KV cache, without any additional buffer as workspace.

One of the primary advantages of prefill/decode disaggregation is that it gives you specialization in how you parallelize prefill versus decode. For example, you can do tensor-parallel size 4 decodes with tensor-parallel size 1 prefills, and as the nature of your workload changes, i.e. the ratio of input tokens to output tokens, you can vary the number of prefill and decode workers in the cluster.

As you can see here, this chart compares, on the y-axis, the aggregate throughput of the system and, on the x-axis, each individual user's output speed, so it's really a throughput/latency trade-off. We see really good performance, especially at middling request rates, for the green line, which is disaggregated inference using four vLLM instances for the prefiller at TP2 and only two instances of the decoder at TP4, i.e. specializing the parallelization of the prefiller versus the decoder. That's compared to the orange line, which is four aggregated instances of vLLM running at tensor-parallel size 4. One thing about this is that you do have to tune the prefill and decode sizes to your workload.
Okay, the third well-lit path is KV cache management. This is about letting your KV cache state grow larger and larger to really take advantage of prefix caching, and it also uses the KV connector API. We have "north-south" KV management, which is about offloading to CPU memory and then to storage, effectively increasing the size of your KV cache on a single node. And we have "east-west" KV management, which is really a distributed KV cache: one pod can pull a KV cache from another pod. We have a couple of different integrations for this: one is LMCache, and Dynamo KVBM works here as well.

Right, and the last well-lit path I'll talk about is wide expert parallelism. This is about large-scale multi-node serving for sparse models: a single instance of vLLM spanning multiple nodes, basically to maximize throughput for large models. So first of all, what is an MoE model? Thomas mentioned these in his talk as well. Here's a diagram of an MoE model.
All transformers are organized as attention, feed-forward, attention, feed-forward. In an MoE model, the output coming out of attention gets fed into a router. The router selects the top-K weights and the top-K expert IDs for every token; each token then gets routed to the experts corresponding to its top-K IDs; each expert is a matrix multiplication; and the results get combined together according to their top-K weights. This is a form of activation sparsity: each activation only takes into account the K experts it gets routed to, so each token only hits, say, 8 out of 256 experts (a minimal routing sketch follows after the list of models below). What we've seen over the past year or so is that essentially every large model introduced in open source has been an MoE model. This includes GPT-OSS from OpenAI, DeepSeek V3 and R1, Llama 4 from Meta, Qwen3 from Qwen, and Kimi K2, including the thinking model that just came out today. These have hundreds of experts, and each token only activates a handful per forward pass.
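A minimal, illustrative top-K MoE layer (the dense per-expert loop is for clarity only; real deployments use fused kernels and, at scale, the expert-parallel dispatch/combine described next):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                        # x: [tokens, d_model]
        logits = self.router(x)                  # [tokens, n_experts]
        weights, ids = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # top-K weights per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):              # send each token to its k-th expert
            for e, expert in enumerate(self.experts):
                mask = ids[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

y = TinyMoE()(torch.randn(16, 64))  # 16 tokens, each touching only 2 of 8 experts
```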
This is an ablation study by DeepSeek, basically showing that as they activated fewer experts and added more, smaller experts, their models did better and better. So we've gone from something like Mixtral, which had eight experts, to something like DeepSeek V3, which has 256 experts with each token activating eight. This has introduced a bunch of challenges for how to serve these models. For the less sparse MoEs you can just use tensor parallelism, which works pretty well; you can use a simple fused MoE kernel, and expert imbalance isn't too problematic. But for the sparse ones, TP sharding becomes pretty problematic, and scaling out to multi-node becomes essential because they're huge. However, we can use sparse communication ops thanks to the activation sparsity of the router. And because of the high sparsity, we really want high concurrency to get high arithmetic intensity in our expert operations. That's kind of good and kind of bad: good because the model can handle high concurrency, bad because you really need to drive it to high concurrency. And then you do have to handle load imbalance across the experts, because there are so many of them and because you may be running a very high-concurrency, multi-node setup.
In response to all this, we've introduced a bunch of optimizations into vLLM. We've gone from tensor-parallel layers to data-parallel attention, especially for DeepSeek, where tensor parallelism would replicate the KV cache, combined with expert parallelism. We're using the sparse all-to-all dispatch and combine operations from DeepEP from DeepSeek, and there are some kernels from Perplexity as well that we've integrated into vLLM. Coming out of the router, you take the top-K IDs for every token and pass them to a dispatch all-to-all operation, which looks at the top-K IDs and sends each token to the node that holds the experts that token needs to attend to. This is fundamentally what lets us scale to multi-node. We've introduced expert-parallel load balancing, where we replicate experts to take the heavy hitters and spread them across the cluster. And we've also introduced dual batch overlap: these sparse all-to-alls are very expensive, and we can execute them at the same time as the computation ops.

Okay, so this is EPLB: like I said, we replicate heavily used experts, and we periodically rebalance the system according to the token distribution. Depending on the distribution of the input tokens and the tokens generated, different experts may be attended to more than others, so we can dynamically rebalance the experts to handle this. You can see that in the graph here: the x-axis is time and the y-axis is balancedness, which is the mean number of tokens per expert divided by the maximum number of tokens for any expert. As you can see, it increases quite a bit after the rebalancing steps, marked with the red dashed lines.
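The balancedness metric as described, in one line:

```python
# Mean tokens per expert over max tokens per expert (1.0 = perfectly balanced load).
def balancedness(tokens_per_expert: list[int]) -> float:
    return sum(tokens_per_expert) / len(tokens_per_expert) / max(tokens_per_expert)
```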
Uh and then the another key optimization
we added is dual batch overlap. So we
take the batch that we're currently
processing. We're we split it into two
and then while one batch is executing
the uh sparse all to alls another batch
is executing the you know computation
ops like the experts and attention.
Okay. So putting it all together, here are some performance numbers we ran on a cluster, focusing on decode first. Expert parallel size is the number of GPUs we're using, and we go from 32 up to 96, which is 12 nodes of eight H200s used for decode. We have a fixed workload of 256 concurrent requests per expert-parallel rank. What we see is that as we increase the size of the deployment we lose a little bit of performance; it scales almost linearly, so performance per GPU is almost constant. But what we get out of it is a super-linear increase in KV cache size: as we add more GPUs to the system, the more concurrency per GPU we can handle.
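Roughly why the KV cache budget can grow super-linearly per GPU (this explanation and the numbers below are an illustrative assumption, not the actual memory breakdown of the deployment above): the model weights are sharded across all the expert-parallel ranks, so every extra GPU both adds its own HBM and shrinks each GPU's share of the weights, and whatever is left over goes to KV cache.

```python
# Illustrative only: per-GPU KV-cache budget as the deployment grows.
# Hypothetical numbers, not the real memory breakdown of the cluster above.
hbm_per_gpu_gb = 141          # an H200-class GPU
total_weights_gb = 1300       # hypothetical sharded model + expert weights
activation_overhead_gb = 20   # hypothetical per-GPU activations/workspace

for num_gpus in (32, 64, 96):
    weights_per_gpu = total_weights_gb / num_gpus
    kv_budget_per_gpu = hbm_per_gpu_gb - weights_per_gpu - activation_overhead_gb
    total_kv_gb = kv_budget_per_gpu * num_gpus
    print(f"{num_gpus:3d} GPUs: {kv_budget_per_gpu:6.1f} GB KV/GPU, "
          f"{total_kv_gb:7.0f} GB KV total")

# The per-GPU KV budget grows as each GPU's weight shard shrinks, so the total
# KV cache (and the sustainable concurrency) grows faster than linearly.
```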
And then on the left we have the prefill throughput. In this case we see that the expert parallel size that works best for the prefiller is EP16. So what we do is run one very large instance for the decoder and a few smaller instances for the prefiller. The phenomenon where we really want to maximize KV cache space for the decoder just isn't true for the prefiller, because we don't have that many requests concurrently in flight and they actually have shorter sequence lengths, since we haven't generated with them yet. And so we're getting about 2 to 2.2K output tokens per second per GPU, and all of the kernel optimizations we've added, plus fast EPLB and dual batch overlap in decode, are really delivering compound gains,
right Sasha?
>> All right.
>> Oh, loud. Thank you so much, everyone. We have another first at this vLLM meetup: this is the first time we've ever finished on time, and I promised you that. So, two minutes early. Thank you to our speakers, you guys crushed it; a lot of work obviously goes into this. Two very quick things. I promised you this would get very technical. If you enjoy this type of content, Michael, our first speaker, and myself host bi-weekly vLLM office hours. We call them office hours because the initial idea was to have you come ask questions, but we've turned them into much more than that. Yes, you can still come and ask questions, but every other Thursday we have amazing speakers from Mistral, Hugging Face, Nvidia, Meta, the list goes on, and we dig into some awesome topics. So check them out: you can scan the QR code and come join us. I know it's at 8:00 at night for you guys; that's the unfortunate part. As a fellow European, I know we eat dinner late, so if you want to join us over dinner, that's awesome. But honestly, seriously, if there's somebody in this community who wants to help us bring these office hours to a more appropriate time zone or a more appropriate time in Europe, talk to me, please. And then one last thing: I promised you a survey. Please, if you can scan this QR code, it's super quick. You can just give us five stars if you want and click submit; that would be appreciated. No, I'm just kidding. But thank you guys so much for coming. I think we're going to skip the Q&A here, but all of us are going to be outside, so if you have any questions, if you want to continue this discussion, find us outside. And now let's have some awesome food and awesome drinks. Thank you IBM for the venue and
Welcome to the first official vLLM Meetup in Europe — streamed live from Zürich and hosted by Red Hat, IBM, and Mistral AI! We're bringing the vLLM community together for an evening of technical deep dives, demos, and conversations with the engineers driving the future of open-source inference. Whether you're building on vLLM today or exploring high-performance inference for enterprise and research — join us virtually from anywhere in the world to learn, ask questions, and connect with the community.

🧠 What to Expect
✅ Hear directly from vLLM maintainers and contributors
✅ Deep dives into quantization, hybrid models, distributed inference
✅ Mistral × vLLM integration insights
✅ Real-world demos + roadmap updates
✅ Live Q&A with the vLLM community

📚 Agenda (Session Lengths)
- Welcome & Opening Remarks (10 mins)
- Intro to vLLM + Project Update (~20 mins)
- Beginner → Advanced Quantization in vLLM (~30 mins)
- Mistral & vLLM (~30 mins)
- Hybrid Models as First-Class Citizens in vLLM (~20 mins)
- Distributed Inference with vLLM & llm-d (~30 mins)
- Live Q&A + Community Hangout (~30 mins)

Want to Join Us in Zürich? In-person details & registration: https://luma.com/0gls27kb
Join Our Bi-weekly vLLM Office Hours: https://red.ht/office-hours
Contribute to vLLM on GitHub: https://github.com/vllm-project/vllm