Okay, so Google just made on-device retrieval augmented generation, or RAG, a lot easier with a new lightweight embedding model they're calling EmbeddingGemma. It's trained on top of their Gemma 3 architecture and it's extremely lightweight, which makes it very useful for running on device: you only need around 200 megabytes of memory. But it goes beyond simple search. You can use the same model for other natural language processing tasks such as classification or topic modeling. Later in the video, I'll show you how to set it up effectively for different tasks, because setting this up can be a bit tricky.
Now, the model itself is about 300 million parameters, which makes it very small and lightweight, but it's best in class for its size. Later in the video I'll show you how to use it in a simple RAG setup, as well as some classification examples. But before that, let's look at some quick technical details. The model is trained on top of the Gemma 3 architecture. It supports more than 100 languages, which makes it a strong choice for multilingual tasks. The output dimensions are customizable: the largest dimension you can get is 768 and the smallest is 128.
It uses Matryoshka Representation Learning, which is essentially a dimensionality-reduction technique for embedding models: you can truncate the output embeddings to different dimensions. Keep in mind that this also results in some loss of accuracy. As expected, more output dimensions preserve more accuracy; as you reduce the number of dimensions in the embedding output, the accuracy decreases.
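Here's a minimal sketch of what that truncation looks like in practice, assuming the Hugging Face model id google/embeddinggemma-300m and the Sentence Transformers package:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Load EmbeddingGemma; the full output is 768-dimensional.
model = SentenceTransformer("google/embeddinggemma-300m")

emb = model.encode(["How do I reset my password?"])  # shape: (1, 768)

# Matryoshka-style truncation: keep only the first k dimensions,
# then re-normalize so cosine similarity still behaves as expected.
k = 256
truncated = emb[:, :k]
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)
print(truncated.shape)  # (1, 256)
```

Recent Sentence Transformers versions also let you pass a truncate_dim argument when loading the model, which does the same thing for you.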
But you save both on speed and on compute cost. So if you're looking for a lightweight embedding model, this could be an excellent candidate. Here's a quick comparison on the MTEB benchmark. The best sub-1-billion-parameter embedding model right now is the Qwen embedding model at around 600 million parameters, but EmbeddingGemma is almost half that size and, I think, still holds up well against the other embedding models. So this is going to be an excellent choice for lightweight retrieval. Of all the frontier labs, Google is in a very interesting position right now. They have open-weight models like Gemma 3, or the more lightweight Gemma 3n, for developers who are interested in open-weight models. On the other hand, if you want frontier models, you have Gemini 2.5 Pro, and even for embeddings you have the state-of-the-art Gemini embeddings, which are multimodal in nature. So they really are trying to meet developers where they are.
I'm going to show you some quick code examples, but before that, I want to highlight a very interesting piece of research out of the DeepMind team. I'll probably create another detailed video on this, but there was an interesting paper on the theoretical limits of embedding-based retrieval, which shows that dense embedding-based retrieval in RAG systems has inherent flaws and runs into theoretical limits when it comes to retrieving the correct documents. So irrespective of the size and power of the embedding model you're using, there are theoretical limits to retrieval, which means we might be leaving a lot on the table if we rely only on dense embedding models. EmbeddingGemma is also a dense embedding model, so it will suffer from that theoretical limit as well.
In the rest of the video, I'll show you an example of how to use this in a retrieval augmented generation system. But first, I want to show you how to use the prompts effectively, since the model supports retrieval, classification, and topic-modeling-style tasks. Also, as mentioned, it supports multiple output dimensions, and retrieval accuracy depends on the length of the output vector. On average, you can expect about a 3% drop from the highest dimension to the lowest. For code-related tasks it's, I think, a lot more significant: we're seeing about a 6% drop. You can also run the model in different precisions because quantized versions are available, but the effect of quantization is not as pronounced as that of the output dimension size.
Next, based on the task you're performing, you'll have to set your prompt instructions differently. The format it follows is the nature of the task (whether that's retrieval, classification, question answering, and so on), followed by the query or the document provided by the user. So, for example, if you're doing retrieval and providing the user query, the task is going to be "search result" followed by the query content. If you're embedding documents, you provide the title of the document (it can be set to "none") and then the actual text. This metadata helps the embedding model do better retrieval. There's also support for question answering: in that case, the task is "question answering" and the query is the user query.
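To make that concrete, here's roughly what those formatted inputs look like; the prefixes below follow the published EmbeddingGemma prompt templates, so double-check the model card before relying on them:

```python
user_query = "How do I reset my password?"
document_text = "Passwords can be reset from the account settings page."

# Query side: the nature of the task, then the user's query.
query_input = f"task: search result | query: {user_query}"

# Document side: a title (or "none" if there isn't one), then the actual text.
document_input = f"title: none | text: {document_text}"

# Question answering follows the same pattern with a different task name.
qa_input = f"task: question answering | query: {user_query}"
```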
They also support fact checking, classification, topic modeling or clustering, sentence similarity, and code retrieval. So these are the distinct tasks for which you can use this embedding model.
Okay, so this is a quick notebook provided by the Gemma team for setting up RAG, or retrieval augmented generation. In this case they're using the transformers package, so we set up our pipeline. The model being used for text generation in the RAG pipeline is Gemma 3 4B, the instruction-tuned version. For embeddings, we're using the 300-million-parameter EmbeddingGemma. And if you look at the architecture, the maximum sequence length you can feed into the model is 2048 tokens, which means your chunk size can be up to that large. The output dimension is 768 by default, but you can truncate it to different sizes based on your needs, as I showed you before.
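Here's a rough sketch of that setup. I'm using google/gemma-3-1b-it as a smaller, text-only stand-in for the generation model (the notebook in the video uses the 4B instruct variant), so treat the exact checkpoint ids as assumptions:

```python
from transformers import pipeline
from sentence_transformers import SentenceTransformer

# Generation side: an instruction-tuned Gemma 3 checkpoint (stand-in id).
generator = pipeline(
    "text-generation",
    model="google/gemma-3-1b-it",
    device_map="auto",
)

# Embedding side: EmbeddingGemma loaded through Sentence Transformers.
embedder = SentenceTransformer("google/embeddinggemma-300m")

print(embedder.max_seq_length)                      # maximum input length in tokens
print(embedder.get_sentence_embedding_dimension())  # 768 by default
```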
Now, if you're using the Sentence Transformers package, the way you define the task is a little different. You provide either your document or your query, and then the prompt instruction is given as a prompt name describing the nature of the task. So in this case we're embedding the user query, and the nature of the task is going to be a retrieval query. If you're embedding documents, you just need to set them up like this.
So it's going to be the title of the
document and then the actual text. And
if you don't have a title of the
document, then you set the title to none
and just provide the actual text. So
this is how you do document embedding.
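As a sketch of that query/document split with Sentence Transformers — the prompt names below are assumed from the notebook's description, so print model.prompts to confirm what your checkpoint actually ships:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")

# The checkpoint ships a set of named prompts; inspect them to see the exact names.
print(model.prompts)

# Query side: the nature of the task is a retrieval query.
query_emb = model.encode(
    "How do I reset my password?",
    prompt_name="Retrieval-query",      # assumed name; verify against model.prompts
)

# Document side: the title is set to "none" when the document has no title.
doc_emb = model.encode(
    "Passwords can be reset from the account settings page.",
    prompt_name="Retrieval-document",   # assumed name; verify against model.prompts
)

print(query_emb.shape, doc_emb.shape)   # (768,) each by default
```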
The first part was related to query embedding. And here again is a list of the different available tasks. In fact, you can do a lot more: you can use the same model for reranking and summarization as well, plus multi-label classification, instruction retrieval, classification, clustering, and I think there are some other options too.
Okay, so for this simple RAG setup, our corpus consists of company policies such as HR and leave policies. There are different document categories, and each document has a title and its actual contents, so this is just a dictionary of different policy documents. The user query is "How do I reset my password?" We set a similarity threshold, and then there are a couple of helper functions to calculate the best match: basically, they compute the similarity against all the documents and pick the document with the highest similarity.
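A minimal sketch of that best-match step, with a made-up corpus and threshold (the notebook's actual documents and helper functions differ):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("google/embeddinggemma-300m")

# Hypothetical policy corpus: title -> contents.
corpus = {
    "Password Policy": "Passwords can be reset from the account settings page.",
    "Leave Policy": "Employees accrue 1.5 days of paid leave per month.",
}

query = "How do I reset my password?"
threshold = 0.4  # similarity threshold, assumed value

doc_titles = list(corpus.keys())
doc_embs = model.encode([f"title: {t} | text: {corpus[t]}" for t in doc_titles])
query_emb = model.encode(f"task: search result | query: {query}")

# Cosine similarity of the query against every document, then pick the best one.
scores = util.cos_sim(query_emb, doc_embs)[0]
best_idx = int(scores.argmax())

if float(scores[best_idx]) >= threshold:
    print("Best match:", doc_titles[best_idx], float(scores[best_idx]))
else:
    print("No document above the similarity threshold.")
```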
First we look at the category of documents: we have HR and leave policies, IT and security, finance and expenses, and office and facilities. Then, within the chosen category, we look for the best document. So that's basically the retrieval part. For generation, we need the Gemma 3 model. Here is a simple prompt template which tells the model to answer the question based on the provided context; the context is going to be the retrieved documents and the question comes from the user query. Again, we're using the transformers package here.
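A rough sketch of that generation step, reusing the hypothetical generator pipeline from the setup sketch earlier and assuming a recent transformers version that accepts chat-style messages:

```python
retrieved_context = "Passwords can be reset from the account settings page."
user_question = "How do I reset my password?"

# Simple prompt template: answer only from the provided context.
prompt = (
    "Answer the question using only the provided context.\n\n"
    f"Context:\n{retrieved_context}\n\n"
    f"Question: {user_question}\nAnswer:"
)

messages = [{"role": "user", "content": prompt}]
output = generator(messages, max_new_tokens=128)

# The pipeline returns the full chat, with the model's reply as the last message.
print(output[0]["generated_text"][-1]["content"])
```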
So when the user query is "How do I reset my password?", it looked at the document categories, found "account password management" to be the best category, and then did the retrieval for that specific question within it. You can also fine-tune this model for your specific task. You need to curate a dataset, and the dataset is supposed to contain triplets: each example contains three samples, two of which are relevant to each other and one that is completely irrelevant.
So the examples are going to be your anchor, a positive example, and a negative example. Then you select the loss you want to use for training. Here they're using the Sentence Transformers training package. You provide the output directory where you want to save your model, the number of epochs, and the batch size, along with the learning rate. The rest of the training is very similar to how you would train any neural network.
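A condensed sketch of that fine-tuning recipe using the Sentence Transformers trainer; the dataset contents, loss choice, and hyperparameters here are placeholders, so check the official fine-tuning guide for the exact setup:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("google/embeddinggemma-300m")

# Triplets: anchor, positive (relevant), negative (irrelevant). Made-up examples.
train_dataset = Dataset.from_dict({
    "anchor": ["How do I reset my password?"],
    "positive": ["Passwords can be reset from the account settings page."],
    "negative": ["Employees accrue 1.5 days of paid leave per month."],
})

loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="embeddinggemma-finetuned",  # where the fine-tuned model is saved
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```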
Now here are just a few quick examples of what the similarity scores look like if you search the same corpus before fine-tuning versus after you have fine-tuned the model. Let me know if you're interested in a more detailed tutorial on how to fine-tune embedding models. Overall, I think this is an excellent embedding model for lightweight retrieval when you have a relatively small number of documents and you want fast retrieval. Do check it out, let me know how your experience goes, and also let me know if you're interested in a video on fine-tuning embedding models. Anyway, I hope you found this video useful. Thanks for watching and, as always, see you in the next one.
In this video we learn how to use Google's Embedding Gemma (300M) to build fast, on-device RAG with ≈200MB memory and support for 100+ languages. We will look at a RAG example.

LINKS:
https://developers.googleblog.com/en/introducing-embeddinggemma/
https://huggingface.co/blog/embeddinggemma
https://arxiv.org/pdf/2205.13147
https://huggingface.co/blog/matryoshka
https://github.com/google-gemini/gemma-cookbook/blob/main/Gemma/%5BGemma_3%5DRAG_with_EmbeddingGemma.ipynb
https://ai.google.dev/gemma/docs/embeddinggemma/fine-tuning-embeddinggemma-with-sentence-transformers
https://ai.google.dev/gemma/docs/embeddinggemma/model_card

Website: https://engineerprompt.ai/
RAG Beyond Basics Course: https://prompt-s-site.thinkific.com/courses/rag

Let's Connect:
🦾 Discord: https://discord.com/invite/t4eYQRUcXB
☕ Buy me a Coffee: https://ko-fi.com/promptengineering
🔴 Patreon: https://www.patreon.com/PromptEngineering
💼 Consulting: https://calendly.com/engineerprompt/consulting-call
📧 Business Contact: engineerprompt@gmail.com
Become Member: http://tinyurl.com/y5h28s6h

💻 Pre-configured localGPT VM: https://bit.ly/localGPT (use code: PromptEngineering for 50% off)
Signup for Newsletter, localgpt: https://tally.so/r/3y9bb0

TIMESTAMPS:
00:00 EmbeddingGemma
02:15 Comparison with Other Embedding Models
02:41 Google's Interesting Position
03:21 Dense Embeddings is Killing Retrieval
06:13 RAG with EmbeddingGemma
09:37 Fine-Tuning and Training the Model