Everyone's talking about RAG. If you feel left out, this is the only video you need to watch to catch up. In this video, we'll learn RAG in a super simplified manner, with visualizations that will make it easy for anyone to understand. No background knowledge in AI, AI models, coding, or programming is required. We'll start with the simplest explanation of RAG there is. Then we'll look into when to and when not to use RAG. We'll then look into what RAG is. We'll then understand some of the prerequisites, such as keyword search versus semantic search, embedding models, vector DBs, and chunking, using a simple use case, and finally bring all of that together into a RAG architecture. We'll then look into caching, monitoring, and error handling techniques, and close with exploring a brief setup of deploying RAG in production. But that's not all. This is
not just a theory course. We have
hands-on labs after each lecture that
will help you practice what you learned.
Our labs open up instantly right in the
browser. So there is no need to spend
time setting up an environment. These
labs are staged with challenges that
will help you think and learn by doing
and they come absolutely free with this
course. I'll let you know how to go
about the labs when we hit our first lab
session. For now, let's start with the
first topic. Let's start with the
simplest explanation of RAG. Say you were to ask ChatGPT, "What's the reimbursement policy for home office setup?" You already know when you ask this question that ChatGPT is going to give an incorrect answer, because it doesn't have access to our policy document that's private to our company. So an LLM like GPT would hallucinate and provide an incorrect or generic answer that's common to most companies. The
problem here is that it doesn't have the
necessary context of what you're asking
about. So what do you do? You look up
your internal policy document and get
the section of the policy that describes
home office setup by yourself. Then you
add that to your prompt and tell ChatGPT to refer to this policy. Now, with this additional information, ChatGPT is able to generate more accurate responses. And that is the simplest explanation of RAG, which stands for retrieval augmented
generation. The part where you look up
your internal policy documents and
retrieve the relevant information is
known as retrieval. The part where you
improve or augment your prompt with the
retrieved information is known as
augmenting. And the part where LLM
generates a response based on the
augmented prompt is known as generation.
And that is something you've done
unknowingly many times. Now, of course, that is a very simplified explanation of RAG. And when we talk about RAG systems, that is not what we typically refer to. So let's see what that is next. Now, you
don't want your users to have to locate
and retrieve relevant information by
themselves. Instead, you want your users
to simply ask the question, what's the
reimbursement policy for home office setup? And our system that's based on RAG should be able to do the lookup and
retrieval of relevant information,
improve or augment the user's prompt and
get an LLM to generate the right
response. Now, how exactly do we
retrieve relevant information? How do we
augment and how do we generate? And
that's what we're going to discuss
throughout the rest of this video. Now,
one of the common mistakes people make is to consider RAG as the solution for everything. RAG is not the solution to all problems. At the end of the day, we're all trying to get AI to generate better responses, and there are different ways to do that. We can prompt better; that's called prompt engineering. We can fine-tune models. And then there's RAG.
When to use what? Let's take a simple
use case to understand these better. So,
back to our use case. We've started to
notice a lot of people copy-pasting company policies into ChatGPT to get
answers. So we decided to build an
internal chatbot that can answer
people's questions. We call it the
policy copilot. It is a system that
users can simply ask a question such as
what's the reimbursement policy and our
chatbot system should be able to locate
the necessary information from the
internal policy documents and then
generate accurate responses and send
that back to the user. Now we also want
to add some restrictions and
limitations. We don't want the chatbot
to answer everything. Some questions
should be off limits like performance
review appeals or salary discussions.
And when those topics come up, we want
to direct users to HR directly instead
of giving them answers in the chat. We
also want our chatbot to have a specific
voice and style. So our CEO has this
warm Scottish accent and a particular
way of speaking that makes people feel a certain way. We want our policy copilot
to sound just like that, authoritative
and distinctly Scottish. So, when the users ask, "What's the reimbursement policy for home office setup?", it responds. When the users ask, "How many sick days do I get per year?", it says. When the user asks, "Can I work from home permanently?", it says. And when the users ask, "When are performance reviews conducted?", it responds. As you can see, it's not just the Scottish accent; there's this, what should I say, refreshing candor that tells it like it is. Let's look at how to solve each of these areas. The restrictions and security require us to define how the chatbot responds, what it must reveal, and what it must not. So these are strict instructions provided to the LLM to control its behavior based on the user's request, such as never to reveal personal employee information or confidential details. If someone asks about sensitive topics, politely redirect them to HR.
Prompt engineering best practices are a
good solution to this. Think of it as
the rule book that keeps our chatbot
safe and professional. Next, we look at
how to solve the problem of voice,
style, and language. Now, we know if we asked ChatGPT to simply respond to me in a Scottish accent, it would. But the accent, as we saw earlier, is not simply what we are going after here. We want it to speak like our Scottish CEO: use the words he usually uses, the tone, the language. So, we take all of his past speeches, emails he's written, blog posts, videos he's created, and fine-tune a new model that can respond in the same language and tone. A good
solution for this is fine-tuning.
Fine-tuning is the process where you
provide a model hundreds of sample
questions and sample answers and have it
respond to you in that way all the time.
Now, you might be wondering, why can't
fine-tuning solve this information
problem? Why can't we train a model with
all of the questions a user might ask
and answers it can generate? The problems are that the policies can change constantly, and when they do, you need to retrain the model every time, and trainings are not easy. They're expensive and slow. Retraining takes
time and computational resources. Users
can't verify where the answers came
from, so there's no citations possible.
The larger the training data, the lower
the accuracy. And then there's knowledge
cutoff. The model only knows what was in
the training data. Fine-tuning is great
for stable unchanging patterns like
communication style, but terrible for
dynamic factual information. And
finally, the best solution to get the
most accurate responses is RAG. RAG works because it retrieves information dynamically at query time, not at training time, because the whole point of RAG is retrieving the most relevant information for the user's query in real time. Next, we'll look at RAG in more detail. Let's now look at what RAG is in the first place. So far, we've decided
that we're going to build our policy
copilot system where employees can ask a question and it retrieves the relevant
information, augments prompts, and
generates a response. We'll now see how
each of these work. Let's look at
retrieval first. Retrieval is a process
of retrieving relevant information. But
how do you do that? There may be
hundreds of policy documents. How do you
find which one is the right one that has
context related to the user's question?
And what do you search for within these
files? First, we identify a few keywords
from the user's question. In this case,
we've identified reimbursement and home
office to be the relevant keywords. One
of the simplest ways is to use a grep command to search for specific terms in these files, such as reimbursement or home office, and hope that one of these files will have these terms.
Alternatively, if these files were
stored in a database, you could run a
query against it like this. Now, these
would only return content that exactly
matches the keywords we are looking for
and the chances of getting accurate
information every time is low. This
approach of searching the documents with
the exact words is known as keyword
search and it is a very popular
technique that's used by many of the
search platforms. To explain it simply,
this approach goes through all the
documents, identifies keywords and ranks
them based on their frequency. In this
case, it counts the occurrences of
reimbursement in all documents and
records them. So we have three
occurrences in the first document, none
in the middle two, but another three in
the third one. It then does the same for
home office and we see that it's only
present in the home office setup
document. Combining these two columns is
now able to identify the document that
has the maximum occurrences of these two
keywords and thus able to rightly select
the document that has these keywords.
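Here's a minimal sketch of that frequency-counting idea in Python. The documents and keywords are made-up stand-ins for the policy files shown on screen, not the actual lab files.

```python
# Hypothetical stand-ins for the policy documents shown on screen
documents = {
    "home_office_setup.md": "Reimbursement for home office setup. Home office reimbursement is capped per year.",
    "travel_policy.md": "Travel expenses are reimbursed within 30 days of the trip.",
    "pet_policy.md": "Dogs are allowed in the office on Fridays.",
}

keywords = ["reimbursement", "home office"]

# Count how often each keyword appears in each document and add the counts up
scores = {
    name: sum(text.lower().count(keyword) for keyword in keywords)
    for name, text in documents.items()
}

print(scores)
print("Best match:", max(scores, key=scores.get))
```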
Now that was a super simplified
explanation. Keyword search is a science
in itself and has a lot of complex
calculations that go in and there are
multiple proven approaches available.
Two of the most popular techniques used are known as TF-IDF and BM25. We won't go into the specifics of how these work; we'll just see how to work with them.
Let's see each of these in action. First, we import the TfidfVectorizer from the scikit-learn open-source Python library. Think of the scikit-learn library as a toolbox with pre-built algorithms that you can use without having to write them from scratch. We then define three sample documents. The documents are simple sentences for now; you could read the contents of a file in instead. We then create the TF-IDF vectorizer and call it analyzer. The word scores can then be calculated by running the fit_transform method. We then print the results on screen. The word scores show a two-dimensional array with the importance of each word in each sentence. The word office appears in all sentences, so it gets a score of 0.4. The first sentence identifies the words equipment and policy and gives them scores of 0.7 and 0.5. The second sentence identifies the words furniture and guidelines, and the third identifies the words travel and policy. Now that the vectors are created, we run a query. We use the analyzer's transform method on the query word furniture. What it does is return an array with a score that compares the query word furniture to each document.
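A minimal sketch of that flow, assuming scikit-learn is installed (the three sentences are illustrative, not the exact ones from the video):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three illustrative sample documents
docs = [
    "Office equipment purchase policy",
    "Office furniture guidelines",
    "Office travel policy",
]

# Create the TF-IDF vectorizer and score every word in every document
analyzer = TfidfVectorizer()
word_scores = analyzer.fit_transform(docs)
print(word_scores.toarray())             # one row per document, one column per word
print(analyzer.get_feature_names_out())

# Query: compare the word "furniture" against each document
query_vector = analyzer.transform(["furniture"])
similarities = (word_scores @ query_vector.T).toarray().ravel()
print(similarities)                      # the furniture document should score highest
```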
Now, let's see the same with the BM25 technique. We use the rank_bm25 library, which is a popular library that implements the BM25 algorithm. We then create what is known as the BM25 index and then get the word scores. In this case, we can see some differences. The word office gets a score of zero because the BM25 algorithm is a bit more strict in assigning scores, and because this word is present in all documents, it doesn't consider it to be very relevant. It then continues to assign a score to the most important and unique words in the sentences, like equipment in the first sentence, furniture and guidelines in the second, and travel in the third. And as before, we run a query, but this time using the get_scores method, and print the array.
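And a comparable minimal sketch with the rank_bm25 library, again using illustrative sentences:

```python
from rank_bm25 import BM25Okapi

docs = [
    "Office equipment purchase policy",
    "Office furniture guidelines",
    "Office travel policy",
]

# BM25 works on tokenized text, so split each document into lowercase words
tokenized_docs = [doc.lower().split() for doc in docs]
bm25 = BM25Okapi(tokenized_docs)   # build the BM25 index

# Score the query "furniture" against every document
query_tokens = "furniture".lower().split()
scores = bm25.get_scores(query_tokens)
print(scores)                      # the second document should score highest
```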
We can see it has again rightly identified the second document as the relevant document here. Well, it's
time to gain some hands-on practice on
what we just learned.
Follow the link in the description below
to gain free access to the labs
associated with this course. Create a
free account and click on enroll to
start the labs. On the left side of the
screen, you will see the list of labs.
Only start the lab when I ask you to.
We'll do only one lab at a time. Let's
start with the first lab. Click on start
to launch the lab. Give it a few seconds
to load. Once loaded, familiarize
yourself with the lab environment. On
the left hand side, you have a questions
portal that gives you the task to do. On
the right hand side, you have a VS Code
editor and terminal to the system.
Remember that this lab gives you access
to a real Linux system. Click on okay to
proceed to the first task. The first
task requires you to explore the
document collection. Open the TechCorp
documents in the VS Code editor. On the
right, we see there is a TechCorp docs
folder. Expand it to reveal the
subfolders. The ask is to count how many
documents are in the employee handbook.
This is what I call a warm-up question
that will help you explore and
familiarize yourself with the lab. The
real tasks are coming up. In this case,
it's three, so I select three as the
answer. Then proceed to the next task.
This is about performing a basic grep search. As we discussed in the lecture, we'll run a grep command to search for anything related to holiday in the folder and store the results in a file named extracted content. To open the
terminal, click anywhere in the panel
below and select terminal. This creates
a new file with the results. Click check
to check your work and continue to the
next task. The next task is to set up a
Python virtual environment and install
dependencies. I'll let you do that
yourself.
We'll move to the next task now. Here we explore the TF-IDF script. Here we first import the TfidfVectorizer from the scikit-learn library. We then transform the docs. Then we compare using cosine similarity, where cosine is one approach of comparing two vectors to identify similarities. And then we finally print the results. We now
execute the script and then we view the
results. And for now we'll just click
check to proceed to the next step. We
then move to the next step. Here the
question is to analyze the score printed
and identify the score of the top
results. So the ask is to search for pet
policy docs and identify the score for
the top result. Here we see the top
result is rightly identified as the pet
policy.md file with a score of 0.4676
whereas the other files have a score
less than 0.1. So the answer to this
question is 0.4676.
The next task is to review and execute
the BM25 script. Open the BM25 search.py file and inspect it. You'll see that we import the rank_bm25 package. We then create an index, and then for each query we call the BM25 get_scores method, and from the results we get the top three results, and we go through each result and print it. Finally, there is a hybrid approach that combines the TF-IDF and BM25 techniques using a weighted approach. I'll let you explore that by yourself. Let's get back to the next
topic. We just looked at keyword search. Let's now understand semantic search. Now, one of the challenges with keyword search is that if the exact keyword isn't there, the search fails. For example, instead of reimbursement, if we say allowance, it tries to find the exact word allowance. And instead of home office, if the user asks about work from home, it's unable to find that anywhere. This combination of keywords isn't found in the documents, and thus the document is not found. In our example code, if we say desk instead of furniture, it's not going to find any matches in the scores and is thus unable to find any matching document.
That's the limitation of keyword search
and that's where we need semantic
search. Semantic search searches documents based on the meaning of words and thus has a higher chance of locating the right documents based on the inputs, and that's what we will look at next. So
what exactly is semantic search? Think
of it as search that understands meaning
not just words. When you search for
allowance, semantic search can find
documents about allowance or
reimbursements or anything that has
similar meaning even if those exact
words aren't used. Similarly, if you
search for home office or work from
home, it can find documents that have anything to do with remote work. The
magic happens through something called
embeddings. We convert both your search
query and all the documents into
mathematical vectors. Think of them as
coordinates in a high-dimensional space.
Documents with similar meanings end up
close together in this space. So, when
you search, we find the closest matches
based on the meaning, not just word
overlap. We can measure how similar two
pieces of text are by calculating the
distance between their vectors. The
closer the vectors, the more similar the
meaning. So reimbursement and allowance
would have vectors that are close
together even though they're different
words. We'll see this in more detail
next. Let's now understand embedding
models. So if you look at machine
learning models, they can be categorized
at a high level based on use case, such as computer vision, NLP or natural language processing, and audio, among many others. And within each category, you have a number of models available. This is as shown on Hugging Face, which is a popular platform where you can discover models, datasets, and applications. Our interest here is the sentence similarity category within natural language processing. And within sentence similarity, one of the popular models is sentence-transformers/all-MiniLM-L6-v2. This model maps sentences and paragraphs to a 384-dimensional dense vector space
and can be used for clustering and
semantic search. It is also a 22 million
parameter model. Now, what does that
mean?
The parameter size reflects the brain
power of the model. Think of parameters
as the learned knowledge stored in the
AI's memory. Each parameter is a number
that the model learned during training
to understand language patterns. 22
million parameters means this model has
22 million learned values that help it
understand how words relate to each
other, what sentences mean semantically,
which concepts are similar or different.
Let's compare that to things we already
know like GPT models. Let's compare this
model to the GPT-3.5 and GPT-4 models that we use. The 22 million parameter size of our all-MiniLM model is very small compared to the 175 billion parameters of GPT-3.5 and the 1.8 trillion parameter size of GPT-4. The size of the model is proportional to that too. The all-MiniLM model is 90 megabytes in size; as such, it can be used locally on our laptops, while the sizes of GPT-3.5 and 4 are 350 GB and 3.6 TB respectively, and thus the use case differs. The all-MiniLM model is a perfect fit as an embedding model for our use case, whereas the GPT models are used for text generation and reasoning.
So we just mentioned embeddings. What
are they actually? In its simplest form, an embedding model takes text and converts it into numbers that represent meaning. So a sentence like "dogs are allowed in the office" is converted into an array of numbers known as a vector.
When you give the model a sentence like
dogs are allowed in the office, it
doesn't just look at the words. Instead,
it thinks about what this sentence
actually means. Is it about animals? Is
it about workplace policies? Is it about
permissions? The model then creates a
list of numbers that captures all these
different aspects of meaning. Each number represents something the model learned about language. Maybe the first number captures how animal-related the text is, the second number captures how workplace-related it is, and so on. And it then plots that in a graph. So dogs gets a number, 0.00005597, and is added to a section of the graph that represents animals; pets also falls into the same category. However, remote does not go there. Similarly,
office falls into the workplace area.
So, our first sentence moves closer into
the workplace section and so does the
second sentence because it is also
related to work. And the same applies to
the last sentence as that's also related
to the workplace. We then compute the
distance between these points. The
shorter the distance, the closer they
match. So, finally, if you look at these sentences, you'll see that the first two are similar. That's a similarity search explained in the simplest of forms. And this explanation only works for a two-dimensional array. But in most cases, there are too many dimensions for us to even imagine how it would look visually. In this case, the model we are using uses 384 dimensions. So we don't even know how to imagine this or plot it on a graph. So then how do we
calculate similarities between them?
This is where the magic of mathematics
comes in. Since we can't visualize 384
dimensions, we need a mathematical way
to measure how close two points are in
this high-dimensional space. The solution
is something called the dot product.
Think of it as a mathematical ruler that
can measure distance in any number of
dimensions, even ones we can't see. So
here's how it works in simple terms. For
the sake of simplicity, I'll convert the
vectors for each sentence into
two-dimensional vectors of simple
numbers. So "dogs are allowed in the office" gets a vector value of (1, 5), the second sentence gets (2, 4), and the third one gets (6, 1). The process involves multiplying the vectors, adding the products, and then normalizing them. Let's look at the first two. We first multiply the values in the vectors: we multiply 1 * 2 to get 2 and 5 * 4 to get 20. We do the same for the other two pairs. We multiply (1, 5) with (6, 1) to get 6 and 5, and then we multiply (2, 4) with (6, 1) to get 12 and 4. We then add the multiplied numbers together. So 2 + 20 gives us 22, and we get 11 and 16 for the others. And finally, these go through a normalization process to convert the numbers into something between 0 and 1, which also takes into consideration the total size of the vectors, among other things. Finally, the pair with the value closest to one is similar, and pairs far away from one are dissimilar. So that's a basic explanation of how sentences are compared for similarity.
Now, of course, you don't have to do all
of that math by yourself. We have
libraries that do that for you. NumPy is a powerful Python library for working with numbers and mathematical operations. We import numpy as np and then call the np.dot method and pass in the vectors for it to calculate the dot product between the two vectors. It returns a similarity score, which lands between zero and one when the vectors are normalized.
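A minimal sketch of that calculation with NumPy, reusing the toy two-dimensional vectors from above and normalizing them first:

```python
import numpy as np

# Toy two-dimensional vectors for the three sentences
dogs = np.array([1.0, 5.0])
pets = np.array([2.0, 4.0])
remote = np.array([6.0, 1.0])

def similarity(a, b):
    """Normalize both vectors, then take the dot product (i.e. cosine similarity)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return np.dot(a, b)

print(similarity(dogs, pets))    # close to 1: similar
print(similarity(dogs, remote))  # much further from 1: dissimilar
print(similarity(pets, remote))
```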
So let's take a closer look at that. First, we install the required libraries, such as the sentence-transformers and numpy libraries. The sentence-transformers library, as we saw, provides the SentenceTransformer class and the all-MiniLM model, and numpy provides the np.dot function for calculating dot products between vectors. So here we can see the complete code in action. We first import the sentence-transformers library and the numpy library. Then we load the all-MiniLM-L6-v2 model that we've been discussing. What it does is download the model, load the 22 million parameters into memory, and prepare the model to convert text into embeddings. We then define our three test sentences about dogs, pets, and remote work. And we then encode these sentences into embeddings using the embedding model. And finally, we calculate the similarity between each pair of sentences using NumPy's dot product function.
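A minimal sketch along those lines, assuming sentence-transformers and numpy are installed (the sentences and variable names are illustrative, not the exact lab code):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Load the 22M-parameter embedding model (downloaded on first use)
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "Dogs are allowed in the office on Fridays",    # dogs
    "Pets are welcome in the workplace",            # pets
    "Employees can work remotely two days a week",  # remote
]

# Convert each sentence into a 384-dimensional embedding,
# normalized so the dot product equals cosine similarity
embeddings = model.encode(sentences, normalize_embeddings=True)

print("dogs vs pets:  ", np.dot(embeddings[0], embeddings[1]))
print("dogs vs remote:", np.dot(embeddings[0], embeddings[2]))
print("pets vs remote:", np.dot(embeddings[1], embeddings[2]))
```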
Now, let's see what happens when we run this code. We print out the similarity scores between each pair of sentences, and the results are quite interesting. Looking at the
results, dogs versus pets shows 73.3%
similarity. That makes sense because
both are talking about animals in the
workplace. Dogs versus remote shows only
36.2% similarity. That makes sense, too,
because one is about animals, the other
is about work arrangements. Pets versus
remote shows 33.8% similarity. Again,
these are quite different topics. This
demonstrates exactly what we've been
talking about. The model can understand
semantic meaning, not just word
matching. Even though dogs and pets are
different words, the model recognizes
they're both about animals in the
workplace context. And it correctly
identifies that remote work policies are
quite different from animal policies.
And this is the foundation of how RAG systems are built. They can find semantically similar content even when the exact words don't match. This is what makes RAG so powerful compared to traditional keyword search. So, so far, we've been looking at sentence transformers and the all-MiniLM-L6-v2 model. But sentence transformers are
just one example of embedding models.
There are many other popular embedding
models out there that you can choose
from depending on your use case. Now,
let me clarify an important distinction.
The sentence transformers we've been
using are local models. They run on your
local machine. They're completely free
and they don't require an internet
connection. But there are also remote or API models, like OpenAI's embeddings, that run on external servers, where you pay per use and need an internet connection. In this sample code, you can see how we use the OpenAI library and the embeddings API endpoint to create a new embedding. The model is text-embedding-3-small, and it returns the embedding vector.
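A minimal sketch of that remote call, assuming the openai Python package is installed and an OPENAI_API_KEY is set in the environment:

```python
from openai import OpenAI

# Reads the API key from the OPENAI_API_KEY environment variable
client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-small",
    input="What's the reimbursement policy for home office setup?",
)

embedding = response.data[0].embedding  # a list of floats
print(len(embedding), embedding[:5])
```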
There's also this leaderboard of top embedding models posted by Hugging Face. We can see some of the most popular ones here: Gemini topping the chart, with Qwen3 and others following. Well, that's all for
now. Head over to the labs and practice
working with embedding models. All
right, we're now going to look into the
second lab. This is called embedding
models. So, I'm just going to click on
start lab to start the lab. We'll give
it a few minutes to load. Okay, so in this lab, we're going to look at embedding models. We'll explore semantic search using embedding models, which are the foundation of modern RAG systems. So let's go to the first task. The first task is about keyword search limitations. So first we navigate to the project, create a new virtual environment, and install the requirements.
I go to the terminal and we're going to set up the virtual environment. So our project is within this folder called rag project, and here we have the virtual environment that's being set up. Okay, once the virtual environment is set up, the next step is to run the keyword limitation demo. If you go to the rag project, you'll see the keyword limitation demo script. This is a simple script that searches for a word or keyword that does not exist in the documents and proves that pure keyword-based searches are less likely to yield the right results. For example, in this case, the query is distributed workforce policies, and none of the documents have something that's exactly like that. So, let's try running the script. If you look at the output, most of the scores are zero because the keywords distributed workforce policies do not really exist in any of the documents. So, the correct answer here is missing synonyms and context.
All right.
The next task is to install embedding dependencies. So we go to the RAG project; we're already in that project. We source the virtual environment and install the embedding packages. So I'm going to copy this command and run it. The packages are sentence-transformers, the Hugging Face Hub package, and openai. The next question is to run the local embedding script. The script name is semantic search demo, so let's look at the semantic search demo script. If you look into this, we can see that the first step is loading the documents. Then we load the local embedding model, which is all-MiniLM-L6-v2. Then we generate embeddings by calling the model's encode method and passing in the docs. Then we have the query, which is the same query we used before, distributed workforce policies, and we generate embeddings for the query. Then we calculate the similarities using the np.dot method, and we print the results. So let's run the script with uv run python on the semantic search demo script.
Now, as you can see, on the same set of documents, the script has now identified the relevant documents whose meaning is closer to the distributed workforce policies query that we are looking for. You can see that each document is given a rating, which means it's able to identify the document that has the closest semantic match. We'll go to the next question. The task is to look at the semantic search results and find the similarity score between remote work policy and distributed workforce policies. If you look at the first score, it says 0.3982, and that is the score for remote work policy. The next question is a multiple choice question that basically confirms our learning. The question is based on the comparison between semantic search and keyword search, which is the TF-IDF and BM25 that we saw earlier: which approach better understands the meaning of queries? Of course, we know that semantic search understands the meaning of queries better. And that's basically it for this lab. In the next lab, we'll explore vector databases.
Let's now understand vector databases.
So far we saw how we could use the
sentence transformer libraries and load
simple sentences into it to create
embeddings and then compare those
embeddings to each other in a super
simple way. However, we have a bigger task at hand: our policy copilot system, and it has hundreds or thousands of
large policy documents. Let's say we
have 500 policy documents each with
multiple sections. When a user asks,
"What's the reimbursement policy for
home office setup? Our system needs to
search through all of these documents to
find the most relevant ones." Now, if we
were to do this the naive way, comparing
the query embedding with every single
stored embedding, we'd have a big
problem because with 500 documents, each
with 384 dimensions, that's 192,000
calculations for every single query.
This is like searching through a phone
book page by page. It works for a small
phone book, but imagine trying to find a
specific number in a phone book with
millions of entries. You'd be there all
day. That's where vector databases come in. Think of them as having a smart
librarian who knows exactly where to
look. Vector databases can retrieve
relevant results instantly. They
efficiently use resources. They're
scalable and they do that by using smart
indexing algorithms. What does indexing
mean? Earlier we saw how we represented
documents or sentences on a vector graph
and then compared their similarities.
But when there are thousands of such
policies, it's going to be impossible to
compare them. And that's where indexing
comes in. Instead of checking every
single vector, we pre-organize them into
neighborhoods. In this case, the animal
policies are grouped together. All
health benefits are grouped together.
All remote work policies are grouped
together. That way, when someone asks
about bringing their dog to work, we
don't search the entire space. we go
directly to the animal policies
neighborhood and only search there.
Let's look at the three most popular
indexing algorithms used by vector
databases. HNSW or hierarchical
navigable small world is the most widely
used algorithm. It creates a graph
structure where each vector is connected
to its most similar neighbors. So when
searching, it starts from a random point
and follows the connections to find the
closest matches. It's fast and accurate,
which is why most vector databases use
it by default. IVF, or inverted file index, and LSH, or locality sensitive hashing, are other examples of such indexing algorithms. Let's now
look at some of the popular vector DB
implementations. Chroma is perfect for learning because it's open-source and Python friendly. You can install it on your computer and start experimenting immediately. It's free, which makes it great for students and small projects. Pinecone is a managed service, meaning they handle all the infrastructure for you. You just send your data and queries, and they take care of everything else. It's used by big companies in production, but you pay per use. There are other great options too; Weaviate, with its GraphQL API, is another example. But for learning, I recommend starting with Chroma. So the best approach is to start with Chroma for learning and experimentation and then move up to Pinecone or similar services for production use cases.
So first, we install the required library, the ChromaDB package. Then we import the chromadb library, connect to the client, and create a collection called policies. Chroma creates a new collection in memory, sets up the default embedding model (the all-MiniLM embedding model), and prepares storage for vectors and metadata. We then add policy documents to the collection using the collection add command. This converts the text to the 384-dimensional vectors that we spoke about earlier, saves the vectors in the collection, and adds them to the HNSW index structure. The document is immediately searchable. To search, we run the collection.query method and pass the query string.
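A minimal sketch of those steps, assuming the chromadb package is installed (the documents and IDs are illustrative):

```python
import chromadb

# In-memory client: data is lost when the program exits
client = chromadb.Client()
collection = client.create_collection(name="policies")

# Adding documents: Chroma embeds them with its default all-MiniLM model
collection.add(
    documents=[
        "Home office setup costs are reimbursed up to a fixed limit.",
        "Dogs are allowed in the office on Fridays.",
    ],
    ids=["policy-home-office", "policy-pets"],
)

# Query by meaning: returns the closest documents plus their distances
results = collection.query(
    query_texts=["What's the reimbursement policy for home office setup?"],
    n_results=1,
)
print(results["documents"], results["distances"])
```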
Now, let's talk about some important ChromaDB concepts. First, the default behavior of ChromaDB is that it's not persistent. When you create a client with just chromadb.Client(), it stores everything in memory. This means when your program stops, all your data is lost. This is fine for learning and experimentation but not for production. To make ChromaDB persistent, you need to use PersistentClient instead of Client. You specify a path where you want to store the database files. This way, your data survives program restarts and you can build up your vector database over time. You can also change the embedding model that ChromaDB uses. By default, it uses the all-MiniLM model, but you might want to use a different model for better performance or to match what you used during training. You can use OpenAI's embedding models or even create a custom embedding function using any model you want. In this case, we pass in a new parameter called embedding_function that passes in OpenAI's embedding function along with the API key.
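A minimal sketch of both of those options, assuming chromadb is installed and, for the OpenAI path, an API key is available (the path and model names are illustrative):

```python
import os
import chromadb
from chromadb.utils import embedding_functions

# Persistent client: embeddings are stored on disk and survive restarts
client = chromadb.PersistentClient(path="./chroma_db")

# Optional: swap the default all-MiniLM embedder for OpenAI's embedding model
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key=os.environ["OPENAI_API_KEY"],
    model_name="text-embedding-3-small",
)

collection = client.get_or_create_collection(
    name="policies",
    embedding_function=openai_ef,
)
```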
Let's head over to the labs and gain
hands-on experience.
Okay, let's now look at the lab on vector DBs. So I'm going to start the lab now. What I'm going to do is just go through a high-level overview of the lab and leave you to do most of it, but I'll explain how the lab functions. Right, so in this lab we're going to learn how to scale semantic search with vector databases. So let's get that going.
So the first task is to simply understand the concepts. Before we start building, let's understand what vector databases are. We already discussed that in the video, but here's a quick description of what they are and what they can help us do. And there's a question on what the primary advantage of using a vector database over storing embeddings in memory is. I'll let you answer that yourself. The next step is to navigate to the project directory, which is right here. Then we again activate the virtual environment and install the embedding model package, which is sentence-transformers, which we also did in the last lab. And then the next step is to install the vector database; in this case, we're going to use ChromaDB. So the task is to install the chromadb package.
Again, I'll just skip through that for now. The next task is to initialize a ChromaDB vector database. If you go here, there's a script called init vector DB, and if you look into the script, we first import the chromadb package. We also have sentence-transformers. We then create the ChromaDB client using the chromadb.Client method. Then we create a collection; we'll call it techcorp docs. Then we load the embedding model, which is the all-MiniLM-L6 model, and we test the model with a sample document. We have identified a test doc, which is really just a sentence that's given here. We'll then add the test document to the collection using the collection add method, print the results, and then print the count of documents within the collection. And that's basically it; that's a quick beginner-level script.
In the next one, there are a couple of questions being asked; you can answer those questions based on the results of the script. The next one is called store documents. This is where we store actual documents in the ChromaDB database. Again, this is another script that starts off and loads the model and client as we did before, but in this case we're reading the TechCorp docs documents using the tech corp docs method, which we have in the utilities file. That's what loads all the documents that are in the TechCorp docs folder. So now we're loading actual documents, and then we follow the same approach of adding those documents to the collection, and then we verify the collection. Again, it's just another layer on top of the basic script; in this case, we're just storing documents. We'll continue to the next task. This is where we perform a vector search against the documents. The script this time is vector search demo, so click on the vector search demo script. Here we have some sample documents, which are sentences, and then there's a query. I'll let you finish this one by yourself. Let's now
understand chunking. Now that we
understood how vector databases work, we
have a new challenge. We've been working
with simple sentences like dogs are
allowed in the office on Fridays. But
what happens when we have real policy
documents? What if we have a 50-page employee handbook that we want to add to
our vector database? Let's think about
this practically. We have an employee
handbook, 50 pages of policy content,
multiple sections per page, complex
policies with detailed explanations.
What happens when we try to add this
entire document to ChromaDB as a single entry? Well, technically it would work. ChromaDB would create an embedding for the entire document, but when someone asks what's the remote work policy, they'd get back the entire 50-page handbook. That's not very helpful.
This is what I call the precision
problem. Without chunking, when someone
asks what's the remote work policy, they get the entire 50-page handbook. The user
gets overwhelmed with irrelevant
information. They have to search through
everything to find what they actually
need. But with chunking, we break that
handbook into smaller focused pieces.
Now, when someone asks about remote
work, they get back just the specific
policy sections that are relevant. The
user gets exactly what they asked for.
clear focused answers. Now, how do we
actually break documents into chunks?
There are several strategies, but we'll
focus on some of the simplest ones. With
fixed size chunks, we simply take 500
characters per chunk. This is simple and
reliable for most use cases. We just
split the document into equal-sized pieces, which makes it easy to
understand and implement.
But there's a problem with this
approach. What happens when we split
right in the middle of a sentence? We
might end up with dogs are allowed in
one chunk and on Fridays in the other.
This breaks the meaning and makes it
hard for the system to understand the
complete information. That's where
overlap comes in. We add a 50 character
overlap between the chunks. So the end
of one chunk overlaps with the beginning
of the next. This way if we do split the
sentence, the important context is
preserved in both chunks.
Now there are other methods of chunking
like sentence-based chunking, where every sentence becomes a separate chunk, or paragraph-based chunking, where each paragraph becomes a
single chunk. Chunking might sound
simple but it's actually quite tricky.
The main challenge is finding the right
balance. If chunks are too small we lose
context. So as we saw earlier, if one
chunk has dogs are allowed and the other chunk has on Fridays, the user
would get incomplete information. We'd
have poor understanding because we're
missing important details and the
information would be fragmented.
On the other hand, if chunks are too
large, we have poor precision. If we put
an entire policy in one chunk, we're
back to the same problem we started
with. The search would be inefficient
because there's too much irrelevant
content and the results would be
overwhelming. So it's important to
choose the right strategy based on your
requirements. Apart from fixed-size chunking, there are other methods like the sentence-based and paragraph-based chunking that we saw, and even others like semantic chunking and agentic chunking that are, for now, out of scope for this video. Now let's build a simple chunking function. This function takes a document and splits it into overlapping chunks. The key features are: it tries to break at sentence boundaries when possible, it maintains the overlap for context, and it handles the end of the document properly. This is a simple chunking that can also be done by a Python library.
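A minimal sketch of such a function, written from the description above (the sizes and the sentence-boundary heuristic are assumptions, not the exact lab implementation):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks, preferring to break at sentence ends."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        # Try to end the chunk at the last sentence boundary inside the window
        window = text[start:end]
        last_period = window.rfind(". ")
        if last_period > 0 and end < len(text):
            end = start + last_period + 1
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break
        # Step forward, keeping `overlap` characters of context from the previous chunk
        start = max(end - overlap, start + 1)
    return chunks


handbook = "Dogs are allowed in the office on Fridays. " * 40
for i, chunk in enumerate(chunk_text(handbook)):
    print(i, len(chunk), chunk[:60])
```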
Now, let's see how chunking integrates with our vector database. The complete workflow is that we chunk our large policy document, add each chunk to the vector database with a unique ID, and then, when we query, we get back the specific chunks that are most relevant.
This gives us the best of both worlds.
We can handle large documents, but we
get precise, relevant answers. Instead
of searching through entire documents,
we are searching through focused chunks
that contain exactly what the user is
looking for. Let me share some key
principles for effective chunking. For
size guidelines, 200 to 500 characters is a good balance of context and precision, with a 50-to-100-character overlap to maintain continuity. You might need to adjust based on your content; technical documents might need different chunk sizes than general policies. For boundary rules, always try to split at sentences to maintain grammatical integrity, avoid mid-word breaks to keep words intact, and preserve paragraphs to maintain logical structure. Finally, always test with real queries to ensure your chunks actually answer questions, verify that the overlap preserves meaning, and monitor your search results to see if you need to adjust the chunk size. Remember,
chunking is all about finding the right
balance between context and precision.
It's not just about breaking documents
into pieces. It's about breaking them in
a way that makes sense for your users.
All right, let's look into the next lab on document chunking. Okay, in this lab, we're going to look at chunking techniques. We'll learn how to optimize RAG performance by breaking documents into focused, searchable chunks.
So, first we activate the virtual environment; this is something we have already done many times. All right, so first we're going to look at the chunking problem demo script. If you expand the RAG project, there should be a script called the chunking problem demo script. This script demonstrates the core problem of searching large documents in RAG systems. It creates a sample employee handbook and shows how searching for specific information, like internet speed requirements, returns the entire document instead of just the relevant section. So we'll see a large document stored as a single chunk, search queries that should find specific sections, and results that return the entire document. Here you can see there's a sample document that has multiple sections, and we're adding that document to the Chroma collection, and then we're doing a query for internet speed requirements. So let's run the script and see how it works. The script runs now. As you can see, it returns the entire document. It's truncated here, but the result shows the entire document. So that's the problem with this approach. The answer to this is large documents return irrelevant results. Next, we will
look at some of the libraries and dependencies that we'll be using. First, we have what is known as LangChain. If you don't know what LangChain is, we have other videos on our platform, and we have a future course coming up that will cover LangChain end to end, so do remember to subscribe to our channel to be notified when it comes out. LangChain is a powerful framework for building RAG applications. It provides the RecursiveCharacterTextSplitter for smart document chunking, and there's also spaCy, which is an advanced natural language processing library that provides a spaCy text splitter for sentence-aware chunking. So we'll use spaCy for sentence-aware chunking, and these libraries take care of chunk sizes, overlaps, separators, etc. We'll install the LangChain and spaCy dependencies. Okay, we'll go to the next question, where we'll first look at basic chunking.
If you open the basic chunking script, you'll see that it uses the LangChain text splitter module, from which we have the RecursiveCharacterTextSplitter class. Here we have a sample document, and this is where we are doing the splitting. As you can see, we specify a chunk size of 200 and a chunk overlap of 50; that's the 50 characters of overlap between the chunks, along with some separators that are defined. We then do a splitter.split_text to split the text into different chunks, and then we just go through the chunks and print them.
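A minimal sketch of that kind of splitter, assuming LangChain's text splitters are installed (depending on your LangChain version, the import may come from langchain.text_splitter instead):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

sample_document = (
    "Remote work policy. Employees may work from home up to three days a week. "
    "Internet speed requirements: at least 50 Mbps download. "
    "Home office setup costs are reimbursed up to a fixed limit."
)

# 200-character chunks with a 50-character overlap, splitting at paragraph,
# sentence, and word boundaries before falling back to raw characters
splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=50,
    separators=["\n\n", ". ", " ", ""],
)

chunks = splitter.split_text(sample_document)
for i, chunk in enumerate(chunks):
    print(i, len(chunk), repr(chunk))
```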
So I'll let you do that yourself. We'll go to the next one, where there's a bunch of questions that are asked; you have to read the script, understand it, and answer them. I'll let you do that by yourself. The next one we'll look at is sentence chunking. In sentence chunking, if you look at the script, we're using spaCy as the library, and then we have a question that's based on the output of that script. And then finally we look at chunked search. This is another script that performs a chunked vector search demo that connects everything we have learned so far. First we chunk the documents and then add these chunked documents to a collection, and there's a comparison between a collection with no chunking and a collection with chunking, and then we'll see the difference between the two. Again, I'll let you go through that by yourself; there's a question based on that. So, yep, that's a quick lab on chunking, and I'll see you in the video. Let's now bring it all
together to build our RAG system. Now that we understand all the individual components of RAG, that's retrieval, augmentation, and generation, it's time
to see how they all work together in a
real system. We've been building our
policy copilot system piece by piece.
But what does it look like when
everything is connected and running in
production? So, we know the basic flow.
User query goes to retrieval, then
augmentation, then generation, and
finally response. But this is just the high-level view. In a real system, there are
many more components working behind the
scenes to make this happen smoothly,
efficiently, and reliably. Now,
everything we spoke about so far, such as chunking, creating embeddings, storing them in a vector DB, etc., are things that need to be done before the user starts asking questions, because loading thousands of documents, chunking them, creating embeddings out of them, scoring them, and storing them in the DB all takes a lot of time. So they are done together beforehand, in a stage called the RAG pipeline. Let's take a closer look at that simple RAG pipeline.
The RAG pipeline gets the policy documents, chunks them into small pieces using a chunk size of 500 with an overlap of 50 characters, then converts them into embeddings using OpenAI's embedding models, and finally loads them into a vector DB. Now, when a query comes in, we search the vector DB and it gives us the necessary chunks of the documents. We then augment the user's query with those chunks and send that to the LLM to generate a response. So that's a super simplistic RAG pipeline.
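Here is a minimal end-to-end sketch of that flow, assuming chromadb and the openai package are installed and an OPENAI_API_KEY is set; the chunk sizes, prompt wording, and model name are illustrative choices, not the exact lab code:

```python
import chromadb
from openai import OpenAI

llm = OpenAI()  # uses OPENAI_API_KEY from the environment


def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Fixed-size chunking with overlap, as described earlier."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]


# --- Ingestion (the pipeline work that runs before users ask anything) ---
policy_text = "Home office setup costs are reimbursed up to 500 dollars per year. " * 30
client = chromadb.Client()
collection = client.create_collection(name="policies")
chunks = chunk(policy_text)
collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

# --- Query time: retrieve, augment, generate ---
question = "What's the reimbursement policy for home office setup?"
retrieved = collection.query(query_texts=[question], n_results=3)
context = "\n".join(retrieved["documents"][0])

prompt = (
    "Answer the question using only the policy excerpts below.\n\n"
    f"Policy excerpts:\n{context}\n\nQuestion: {question}"
)
response = llm.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```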
Let's head over to the labs and see this in
action. All right. So this is the last
lab in this course and this one is about
building a complete RAG pipeline. So
we'll learn how document chunking
integrates with vector search, how query
processing connects to retrieval, how
context augmentation feeds into response
generation, and how the complete RAG pipeline works end to end. So this
basically combines everything from the
first four labs that we've just done.
All right. So first we start with
setting up the virtual environment. So
the environment is already set up. You
just need to activate it.
All right. So, first we start by looking at the complete RAG demo script. We have a single script now that combines everything we've done so far, and we'll start looking at it section by section. There's the first section, which has the document loading and chunking, and there's a function for that. We have some sample documents, then we have a text splitter, and we have all the chunks that are created here. Then we have section two, which is the vector database setup. Here you can see we set up a ChromaDB vector database and store the document chunks there. Then we have the user query processing section; this is where we actually process the user queries and do the actual search. And then we have the context augmentation; this is where we build the augmented prompt with the retrieved context for the LLM. Here you can see how a prompt is generated with the context in place, which is basically the policies that were retrieved, then the actual question, the user's question itself, and some additional prompt engineering. Then we have the generate response function that generates a response using the LLM, and finally we have the complete RAG pipeline that calls each of those functions that we have written before, and then there's the main function. Well, I'll let you explore this lab by yourself. There's a lot of interesting questions and challenges throughout.
This section covers the essential
production concerns. Caching to make
systems fast, monitoring to know what's
happening, and error handling to keep
systems running when things go wrong.
Let's start with a fundamental problem.
RAG systems are slow. Every query
involves multiple expensive operations.
Generating embeddings, searching vector
databases, calling LLM APIs. Without
optimization, a single query can take
nearly a second. But here's the thing.
Most queries are repeated or are very
similar. People ask the same questions
over and over. What's the reimbursement
policy for home office setup? Gets asked
dozens of times. Caching solves this by
storing the results of expensive
operations and reusing them. Instead of taking 950 milliseconds, a cached response might then take just 5 milliseconds. That's 190 times faster. The
key insight is that we don't need to
recompute everything for every query.
We can cache at multiple levels, the
embeddings, the search results, or even
the final answers. So there are four
main types of caching that we can
implement in RAG systems, each solving a different performance bottleneck. Query
cache is the simplest. We store complete
question answer pairs. When someone asks
what's the remote work policy, again, we
return the exact same answer instantly.
This works great for frequently asked
questions. Embedding cache stores the
computed vectors for text. This is
useful because generating embeddings is
expensive and we often process the same
text multiple times like policy chunks
that appear in multiple searches. Vector
search cache stores the results of
database queries. This helps when
similar queries return the same results.
Remote work and working from home might
return identical chunks. LLM response
cache stores the generated answers. This
is the most expensive operation to cache
but also the most valuable since LLM
calls are typically the slowest part of
the pipeline. The key is to cache at the
right level, not too granular, not too
broad, and with appropriate expiration
times. Let's look at how to actually
implement caching. Well, Redis is a popular caching tool because it's fast, supports different data types, and has built-in expiration. The example shows a simple but effective caching strategy. We create a unique cache key by hashing the query and context together. This ensures that different queries get different cache entries, while repeated queries share the same entry. We check the cache first. If we find a cached response, we return it immediately. If not, we generate the response using our normal RAG pipeline, then store it in the cache with an expiration time. The TTL, or time to live, is crucial. We want to cache long enough to get performance benefits, but not so long that the data becomes stale. For policy documents, a longer TTL might be appropriate; for more dynamic content, we might use shorter times.
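A minimal sketch of that strategy, assuming a local Redis server and the redis Python package; the generate_answer function is a hypothetical stand-in for whatever pipeline produces the response:

```python
import hashlib
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)


def generate_answer(query: str, context: str) -> str:
    # Placeholder for the real retrieval + LLM call
    return f"Answer for: {query}"


def cached_answer(query: str, context: str, ttl_seconds: int = 3600) -> str:
    # Unique cache key from the query and context together
    key = "rag:" + hashlib.sha256(f"{query}|{context}".encode()).hexdigest()

    hit = cache.get(key)
    if hit is not None:
        return hit  # cache hit: skip the expensive pipeline entirely

    answer = generate_answer(query, context)
    cache.setex(key, ttl_seconds, answer)  # store with an expiration time (TTL)
    return answer


print(cached_answer("What's the reimbursement policy?", "home office policy chunks"))
```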
You can't manage what you don't measure. In
production, we need to monitor
everything to understand how our RAG system is performing and when problems occur. The basic metrics are response time (how fast we answer questions), throughput (how many queries we handle per second), and error rate (what percentage of requests fail), but RAG systems have their own specific metrics we need to track. Retrieval quality measures how relevant the returned chunks are to the user's question. Embedding performance tracks how long it takes to generate vectors. Chunking efficiency monitors how well we're breaking up documents. We set alerting thresholds to know immediately when something goes wrong. So if response time exceeds 2 seconds, there's a performance issue. If the error rate goes above 5%, there's a system problem. The key is to set realistic thresholds based on actual performance, not theoretical targets. We want alerts that indicate real problems, not false alarms that cause alert fatigue.
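A minimal sketch of that kind of threshold-based alerting, with made-up metric values and thresholds taken from the numbers mentioned above:

```python
# Illustrative metrics snapshot (in a real system these come from your monitoring stack)
metrics = {
    "response_time_seconds": 2.4,
    "error_rate_percent": 1.2,
    "retrieval_relevance": 0.81,
}

# Thresholds based on the targets discussed above
thresholds = {
    "response_time_seconds": ("above", 2.0),   # alert if slower than 2 s
    "error_rate_percent": ("above", 5.0),      # alert if more than 5% of requests fail
    "retrieval_relevance": ("below", 0.7),     # alert if retrieved chunks stop being relevant
}

for name, value in metrics.items():
    direction, limit = thresholds[name]
    breached = value > limit if direction == "above" else value < limit
    if breached:
        print(f"ALERT: {name}={value} breached threshold ({direction} {limit})")
    else:
        print(f"OK:    {name}={value}")
```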
Now, things will go wrong in production. Vector databases will go down. LLM services will be unavailable. Networks will have timeouts, and we need to handle these failures gracefully. The goal is graceful degradation: the system should still work even if not at full capacity, so users should get some answer rather than just an error message. The example shows a cascading fallback strategy. If the full RAG pipeline fails, we try keyword search. If that fails, we return the retrieved chunks directly. If even that fails, we use simple text matching. And as a last resort, we return a helpful error message. With a circuit breaker in front of a failing service, we also periodically test if the service is back by sending a few requests; this is the half-open state. If those succeed, we close the circuit and resume normal operation.
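A minimal sketch of that cascading fallback idea; every helper here (full_rag_answer, keyword_search_answer, and so on) is a hypothetical stand-in for the corresponding piece of your pipeline:

```python
def answer_with_fallbacks(query: str) -> str:
    """Try progressively simpler strategies so the user always gets something back."""
    strategies = [
        full_rag_answer,        # retrieval + augmentation + LLM generation
        keyword_search_answer,  # BM25/TF-IDF lookup, no LLM
        raw_chunks_answer,      # return the retrieved chunks directly
        simple_text_match,      # naive substring matching as a last technical resort
    ]
    for strategy in strategies:
        try:
            return strategy(query)
        except Exception as error:
            print(f"{strategy.__name__} failed ({error}), falling back...")
    return "Sorry, the policy assistant is temporarily unavailable. Please contact HR."


# Hypothetical stand-ins so the sketch runs on its own
def full_rag_answer(query):       raise RuntimeError("LLM service unavailable")
def keyword_search_answer(query): raise RuntimeError("vector DB down")
def raw_chunks_answer(query):     return "Relevant policy excerpt: home office costs are reimbursed..."
def simple_text_match(query):     return "Found a line mentioning 'home office' in the handbook."

print(answer_with_fallbacks("What's the reimbursement policy for home office setup?"))
```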
Let's now look at deploying RAG in production. Now that we understand the core RAG
architecture, we need to talk about what
happens when we put these systems into
production. Real-world RAG systems face
challenges that don't exist in our
simple examples. Performance issues,
failures, and the need to handle
thousands of users. So this diagram
shows a complete production RAG system
running on Kubernetes. And let me walk
you through each layer. So we have a
data layer, a RAG pipeline layer, and
the application layer, and a monitoring
stack. The data layer includes all our storage systems: ChromaDB for vectors, Redis for caching, PostgreSQL for metadata. The RAG pipeline layer contains the core RAG functionality broken down into microservices: query processing, chunking, embedding generation, retrieval, augmentation, and generation. Each service can scale independently based on demand. The application layer contains all the user-facing services: the web UI, the mobile app backend if there is any, the admin interface, etc. These services handle user interactions and present the RAG capabilities through different interfaces. And then we have our complete monitoring stack: Prometheus for metrics, Grafana for dashboards, Jaeger for tracing, and the ELK stack for logging. Now, this layered
architecture separates concerns clearly.
Applications handle user interactions.
The RAG pipeline processes the core
functionality and the data layer
provides storage. This can handle
thousands of concurrent users while
maintaining high availability and
performance. Well, that's a high-level overview. We haven't spoken about a lot of advanced topics like multimodal RAG, graph RAG, hybrid search techniques, federated RAG, reranking techniques, query expansion, and context compression. To learn more about AI and other related technologies, check out our AI learning path on KodeKloud. Well, thank you so
much for watching. Do subscribe to our
channel for more videos like this. Until
next time, goodbye.
🧪RAG Labs for Free: https://kode.wiki/3KfeX1a

Ever wondered how ChatGPT remembers your documents or how AI searches through company data? The secret is RAG (Retrieval Augmented Generation)! In this hands-on RAG tutorial, we will show you exactly how to build production-ready RAG systems from scratch. No fluff, just practical coding examples you can follow along with. What makes this video different? You get a real lab environment to practice everything we cover!

🧪RAG Labs for Free: https://kode.wiki/3KfeX1a

⚡ Quick Overview:
• RAG Components Overview
• Vector Search & Embedding Models
• ChromaDB and VectorDB
• Document Chunking Strategies
• Complete RAG Pipeline Build

🚨Start Your AI Journey with KodeKloud: https://kode.wiki/41NLyks

⏰ TIMESTAMPS:
00:00 - Introduction to RAG Tutorial
01:15 - Simplest RAG Explanation
03:32 - When not to RAG?
07:40 - What is RAG?
11:49 - Free Lab 1: Keyword Search (TF-IDF & BM25)
15:02 - What are Semantic Search?
16:54 - Understanding Embedding Models
19:00 - Embeddings and Vectors
21:00 - The Dot Product
26:00 - Lab 2: Embedding Models
29:50 - Vector Databases Explained
33:04 - ChromaDB Tutorial
34:45 - Lab 3: Vector Databases
38:17 - Chunking Explained
39:39 - Document Chunking Strategies
43:22 - Lab 4: Document Chunking
48:45 - Build your RAG Architecture
49:31 - Lab 5: Complete RAG Pipeline
51:50 - Caching, Monitoring and Error Handling
56:34 - RAG in Production
58:08 - Conclusion

#RAG #RetrievalAugmentedGeneration #Vectordb #AI #EmbeddingModels #VectorDatabase #ChromaDB #AITutorial #SemanticSearch #LLM #OpenAI #DocumentChunking