Okay, so Google just made on-device retrieval augmented generation, or RAG, a lot easier with a new lightweight embedding model they're calling EmbeddingGemma. It's trained on top of their Gemma 3 architecture and it's extremely lightweight, which makes it very useful for running on device: you only need around 200 megabytes of memory. But it goes beyond simple search. You can use the same model for other natural language processing tasks such as classification or topic modeling. Later in the video, I'll show you how to set it up effectively for different tasks, because setting this up can be a bit tricky.
Now, the model itself is about 300 million parameters, which makes it very small and lightweight, but it's best in class for its size. Later in the video I'll show you how to use it in a simple RAG setup, as well as some classification examples. But before that, let's look at some quick technical details. The model is trained on top of the Gemma 3 architecture. It supports more than 100 languages, which makes it a strong choice for multilingual tasks. The output dimensions are customizable: the largest dimension you can get is 768 and the smallest is 128.
It uses Matryoshka Representation Learning, which is essentially a dimensionality-reduction technique for embedding models: you can truncate the output embeddings to different dimensions. Keep in mind that this also results in some loss of accuracy. As expected, more output dimensions preserve more accuracy; as you reduce the number of dimensions in the embedding output, the accuracy decreases.
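Here's a minimal sketch of what that truncation looks like in practice, assuming the Hugging Face model id google/embeddinggemma-300m and the Sentence Transformers package:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Load EmbeddingGemma; the full output is 768-dimensional.
model = SentenceTransformer("google/embeddinggemma-300m")

emb = model.encode(["How do I reset my password?"])  # shape: (1, 768)

# Matryoshka-style truncation: keep only the first k dimensions,
# then re-normalize so cosine similarity still behaves as expected.
k = 256
truncated = emb[:, :k]
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)
print(truncated.shape)  # (1, 256)
```

Recent Sentence Transformers versions also let you pass a truncate_dim argument when loading the model, which does the same thing for you.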
But you save both on speed and on compute cost. So if you're looking for a lightweight embedding model, this could be an excellent candidate. Here's a quick comparison on the MTEB benchmark. The best sub-1-billion-parameter embedding model right now is the Qwen embedding model at around 600 million parameters, but EmbeddingGemma is almost half that size and, I think, still holds up well against the other embedding models. So this is going to be an excellent choice for lightweight retrieval. Of all the frontier labs, Google is in a very interesting position right now. They have open-weight models like Gemma 3, or the more lightweight Gemma 3n, for developers who are interested in open-weight models. On the other hand, if you want frontier models, you have Gemini 2.5 Pro, and even for embeddings you have the state-of-the-art Gemini embeddings, which are multimodal in nature. So they really are trying to meet developers where they are.
I'm going to show you some quick code examples, but before that, I want to highlight a very interesting piece of research out of the DeepMind team. I'll probably create another detailed video on this, but there was an interesting paper on the theoretical limits of embedding-based retrieval, which shows that dense embedding-based retrieval in RAG systems has inherent flaws and runs into theoretical limits when it comes to retrieving the correct documents. So irrespective of the size and power of the embedding model you're using, there are theoretical limits to retrieval, which means we might be leaving a lot on the table if we rely only on dense embedding models. EmbeddingGemma is also a dense embedding model, so it will suffer from that theoretical limit as well.
In the rest of the video, I'll show you an example of how to use this in a retrieval augmented generation system. But first, I want to show you how to use the prompts effectively, since the model supports retrieval, classification, and topic-modeling-style tasks. Also, as mentioned, it supports multiple output dimensions, and retrieval accuracy depends on the length of the output vector. On average, you can expect about a 3% drop from the highest dimension to the lowest. For code-related tasks it's, I think, a lot more significant: we're seeing about a 6% drop. You can also run the model in different precisions because quantized versions are available, but the effect of quantization is not as pronounced as that of the output dimension size.
Next, based on the task you're performing, you'll have to set your prompt instructions differently. The format it follows is the nature of the task (whether that's retrieval, classification, question answering, and so on), followed by the query or the document provided by the user. So, for example, if you're doing retrieval and providing the user query, the task is going to be "search result" followed by the query content. If you're embedding documents, you provide the title of the document (it can be set to "none") and then the actual text. This metadata helps the embedding model do better retrieval. There's also support for question answering: in that case, the task is "question answering" and the query is the user query.
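To make that concrete, here's roughly what those formatted inputs look like; the prefixes below follow the published EmbeddingGemma prompt templates, so double-check the model card before relying on them:

```python
user_query = "How do I reset my password?"
document_text = "Passwords can be reset from the account settings page."

# Query side: the nature of the task, then the user's query.
query_input = f"task: search result | query: {user_query}"

# Document side: a title (or "none" if there isn't one), then the actual text.
document_input = f"title: none | text: {document_text}"

# Question answering follows the same pattern with a different task name.
qa_input = f"task: question answering | query: {user_query}"
```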
They also support fact checking, classification, topic modeling or clustering, sentence similarity, and code retrieval. So these are the distinct tasks for which you can use this embedding model.
Okay, so this is a quick notebook provided by the Gemma team for setting up RAG, or retrieval augmented generation. In this case they're using the transformers package, so we set up our pipeline. The model being used for text generation in the RAG pipeline is Gemma 3 4B, the instruction-tuned version. For embeddings, we're using the 300-million-parameter EmbeddingGemma. And if you look at the architecture, the maximum sequence length you can feed into the model is 2048 tokens, which means your chunk size can be up to that large. The output dimension is 768 by default, but you can truncate it to different sizes based on your needs, as I showed you before.
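Here's a rough sketch of that setup. I'm using google/gemma-3-1b-it as a smaller, text-only stand-in for the generation model (the notebook in the video uses the 4B instruct variant), so treat the exact checkpoint ids as assumptions:

```python
from transformers import pipeline
from sentence_transformers import SentenceTransformer

# Generation side: an instruction-tuned Gemma 3 checkpoint (stand-in id).
generator = pipeline(
    "text-generation",
    model="google/gemma-3-1b-it",
    device_map="auto",
)

# Embedding side: EmbeddingGemma loaded through Sentence Transformers.
embedder = SentenceTransformer("google/embeddinggemma-300m")

print(embedder.max_seq_length)                      # maximum input length in tokens
print(embedder.get_sentence_embedding_dimension())  # 768 by default
```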
Now, if you're using the Sentence Transformers package, the way you define the task is a little different. You provide either your document or your query, and then the prompt instruction is given as a prompt name describing the nature of the task. So in this case we're embedding the user query, and the nature of the task is going to be a retrieval query. If you're embedding documents, you just need to set them up like this.
So it's going to be the title of the
document and then the actual text. And
if you don't have a title of the
document, then you set the title to none
and just provide the actual text. So
this is how you do document embedding.
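As a sketch of that query/document split with Sentence Transformers — the prompt names below are assumed from the notebook's description, so print model.prompts to confirm what your checkpoint actually ships:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")

# The checkpoint ships a set of named prompts; inspect them to see the exact names.
print(model.prompts)

# Query side: the nature of the task is a retrieval query.
query_emb = model.encode(
    "How do I reset my password?",
    prompt_name="Retrieval-query",      # assumed name; verify against model.prompts
)

# Document side: the title is set to "none" when the document has no title.
doc_emb = model.encode(
    "Passwords can be reset from the account settings page.",
    prompt_name="Retrieval-document",   # assumed name; verify against model.prompts
)

print(query_emb.shape, doc_emb.shape)   # (768,) each by default
```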
The first part was related to query embedding. And here again is a list of the different available tasks. In fact, you can do a lot more: you can use the same model for reranking and summarization as well, plus multi-label classification, instruction retrieval, classification, clustering, and I think there are some other options too.
Okay, so for this simple RAG setup, our corpus consists of company policies such as HR and leave policies. There are different document categories, and each document has a title and its actual contents, so this is just a dictionary of different policy documents. The user query is "How do I reset my password?" We set a similarity threshold, and then there are a couple of helper functions to calculate the best match: basically, they compute the similarity against all the documents and pick the document with the highest similarity.
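A minimal sketch of that best-match step, with a made-up corpus and threshold (the notebook's actual documents and helper functions differ):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("google/embeddinggemma-300m")

# Hypothetical policy corpus: title -> contents.
corpus = {
    "Password Policy": "Passwords can be reset from the account settings page.",
    "Leave Policy": "Employees accrue 1.5 days of paid leave per month.",
}

query = "How do I reset my password?"
threshold = 0.4  # similarity threshold, assumed value

doc_titles = list(corpus.keys())
doc_embs = model.encode([f"title: {t} | text: {corpus[t]}" for t in doc_titles])
query_emb = model.encode(f"task: search result | query: {query}")

# Cosine similarity of the query against every document, then pick the best one.
scores = util.cos_sim(query_emb, doc_embs)[0]
best_idx = int(scores.argmax())

if float(scores[best_idx]) >= threshold:
    print("Best match:", doc_titles[best_idx], float(scores[best_idx]))
else:
    print("No document above the similarity threshold.")
```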
First we look at the category of documents: we have HR and leave policies, IT and security, finance and expenses, and office and facilities. Then, within the chosen category, we look for the best document. So that's basically the retrieval part. For generation, we need the Gemma 3 model. Here is a simple prompt template which tells the model to answer the question based on the provided context; the context is going to be the retrieved documents and the question comes from the user query. Again, we're using the transformers package here.
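A rough sketch of that generation step, reusing the hypothetical generator pipeline from the setup sketch earlier and assuming a recent transformers version that accepts chat-style messages:

```python
retrieved_context = "Passwords can be reset from the account settings page."
user_question = "How do I reset my password?"

# Simple prompt template: answer only from the provided context.
prompt = (
    "Answer the question using only the provided context.\n\n"
    f"Context:\n{retrieved_context}\n\n"
    f"Question: {user_question}\nAnswer:"
)

messages = [{"role": "user", "content": prompt}]
output = generator(messages, max_new_tokens=128)

# The pipeline returns the full chat, with the model's reply as the last message.
print(output[0]["generated_text"][-1]["content"])
```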
So when the user query is "How do I reset my password?", it looked at the document categories, found "account password management" to be the best category, and then did the retrieval for that specific question within it. You can also fine-tune this model for your specific task. You need to curate a dataset, and the dataset is supposed to contain triplets: each example contains three samples, two of which are relevant to each other and one that is completely irrelevant.
So the examples are going to be your anchor, a positive example, and a negative example. Then you select the loss you want to use for training. Here they're using the Sentence Transformers training package. You provide the output directory where you want to save your model, the number of epochs, and the batch size, along with the learning rate. The rest of the training is very similar to how you would train any neural network.
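A condensed sketch of that fine-tuning recipe using the Sentence Transformers trainer; the dataset contents, loss choice, and hyperparameters here are placeholders, so check the official fine-tuning guide for the exact setup:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("google/embeddinggemma-300m")

# Triplets: anchor, positive (relevant), negative (irrelevant). Made-up examples.
train_dataset = Dataset.from_dict({
    "anchor": ["How do I reset my password?"],
    "positive": ["Passwords can be reset from the account settings page."],
    "negative": ["Employees accrue 1.5 days of paid leave per month."],
})

loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="embeddinggemma-finetuned",  # where the fine-tuned model is saved
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```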
Now here are just a few quick examples of what the similarity scores look like if you search the same corpus before fine-tuning versus after you have fine-tuned the model. Let me know if you're interested in a more detailed tutorial on how to fine-tune embedding models. Overall, I think this is an excellent embedding model for lightweight retrieval when you have a relatively small number of documents and you want fast retrieval. Do check it out, let me know how your experience goes, and also let me know if you're interested in a video on fine-tuning embedding models. Anyway, I hope you found this video useful. Thanks for watching and, as always, see you in the next one.
In this video we learn how to use Google's Embedding Gemma (300M) to build fast, on-device RAG with ≈200MB memory and support for 100+ languages. We will look at a RAG example.

LINKS:
https://developers.googleblog.com/en/introducing-embeddinggemma/
https://huggingface.co/blog/embeddinggemma
https://arxiv.org/pdf/2205.13147
https://huggingface.co/blog/matryoshka
https://github.com/google-gemini/gemma-cookbook/blob/main/Gemma/%5BGemma_3%5DRAG_with_EmbeddingGemma.ipynb
https://ai.google.dev/gemma/docs/embeddinggemma/fine-tuning-embeddinggemma-with-sentence-transformers
https://ai.google.dev/gemma/docs/embeddinggemma/model_card

Website: https://engineerprompt.ai/
RAG Beyond Basics Course: https://prompt-s-site.thinkific.com/courses/rag

Let's Connect:
🦾 Discord: https://discord.com/invite/t4eYQRUcXB
☕ Buy me a Coffee: https://ko-fi.com/promptengineering
🔴 Patreon: https://www.patreon.com/PromptEngineering
💼 Consulting: https://calendly.com/engineerprompt/consulting-call
📧 Business Contact: engineerprompt@gmail.com
Become Member: http://tinyurl.com/y5h28s6h

💻 Pre-configured localGPT VM: https://bit.ly/localGPT (use code: PromptEngineering for 50% off)
Signup for Newsletter, localgpt: https://tally.so/r/3y9bb0

TIMESTAMPS:
00:00 EmbeddingGemma
02:15 Comparison with Other Embedding Models
02:41 Google's Interesting Position
03:21 Dense Embeddings is Killing Retrieval
06:13 RAG with EmbeddingGemma
09:37 Fine-Tuning and Training the Model