When you make a Google search, you're trying to get access to more information from disparate websites curated by a search engine. So even though you implicitly have knowledge in your brain, by using Google search you are essentially extending your knowledge by explicitly searching external data sources from the internet. Large language models that we love and use today, like ChatGPT, also
have the same predicament as you do when
it comes to knowledge. What typically
happens during training is the model is
initially pre-trained with trillions and
trillions of tokens to implicitly learn
from them. So practically every credible piece of information that exists on the internet is consumed by the model during the pre-training stage to build up its
implicit knowledge. And you might see
this and immediately go well the data
that goes in pre-training certainly
couldn't be all the data. What about my
company data? What about data from my
own computer? And that's an extremely
important point, because this is the core limitation with LLMs that certainly needs to be addressed: how exactly do we extend the LLM to have knowledge beyond its implicit knowledge? Welcome to KodeKloud, and today we're going to cover RAG as a crash course. We're
starting from covering some fundamental
understanding by looking at some of the
common misconceptions and when to use RAG and when not to. We'll also get into some real-life examples of how RAG can be used in different scenarios, like a law firm or a chatbot. Then we'll cover some metrics that we can use to actually evaluate the RAG system and cover some practical use cases of RAG. And finally
end with some future and emerging
concepts in rag. By the time you finish
this course, you'll have a comprehensive
view on RAG and have a solid
understanding of how to improve your
skill set with practical know-how that you can use right after the video. So, let's get started. Back in 2020, a concept called RAG was brought forward to address this challenge. At the time, the context windows of large language models were extremely limited; around 4,000 tokens was the norm. So we not only had the problem of not being able to process more data due to the small context window, but we also wanted to extend the models' ability to access external data and use it as a knowledge source, similar to how you run your Google search. RAG proposed a method that
allows the model to essentially augment
its knowledge by retrieving from an
external data source and generating its
answer from it. And that's how the term RAG was formed: retrieval-augmented generation. Since then, RAG has certainly matured into a much more comprehensive ecosystem, where we now have established databases specific to RAG use cases, called vector databases. Tools like Chroma and Pinecone are popular when you're trying to set up an external database for the model to use. We also have more established
methodologies when it comes to
converting a regular document into
vectors by using embedding models like
OpenAI's text-embedding-3-large and Cohere's embedding models. These embedding models are essential for converting regular text into a semantic representation, which is how an LLM is able to search the external documents stored in a vector database to get relevant information. As you can see, this type
of configuration is why RAG has become a dominant feature in the AI industry and why you should learn the fundamentals of how to structure RAG, since its use cases and applications are quite extensive. A common misconception about RAG is that it gives the LLM long-term
memory, but this isn't true. While RAG can certainly extend the model's ability by improving its context with more relevant data from the vector database, the retrieved data is ephemeral, which means it only persists during that turn. Where
the confusion comes from is the fact
that rag can appear as if the LLM can
persist its external knowledge since
this knowledge is stored in the
database. So, in a way, since the LLM
has access to this knowledge, as long as
the database is available, it can
certainly seem like the model has a
long-term memory. Another misconception
with RAG is that it has the ability to return all relevant data. While in theory this holds true, there's a limit to it. Typically, in a RAG setup, you have tens or hundreds of gigabytes of data in your vector database. But instead of being stored in raw format like in a structured database such as SQL, in a vector database the data is stored in a semantic space, in vector form. In other words, let's say your
given data is John Wick is a great
movie. Instead of storing this data directly in the database, the embedding model will convert it into a vector that captures its semantic meaning. And this is important because later, if we search the term "great film" instead of "great movie," even though the text doesn't contain the word "film," it's still able to retrieve that record. This is the power of RAG with
vector databases. Given that this is a
core concept in rag let's run through a
lab to actually dig into this so that we
can get a hands-on experience of how to
actually set up a rag system. So going
back to the earlier statement John Wick
is a great movie. We need to first
convert this into a vector embedding if
we want to use rag. So you might be
wondering, wait, why can't we just store
the sentence as is? That's because the
sentence is actually going to be stored
in a vector database so that the LLM can
retrieve it. And you might be asking,
well, can't the LLM just retrieve them
by the sentence as is? Well, if we do
that, then we might as well just use a
SQL database instead of a vector
database. But the whole point in
converting this sentence into a vector
embedding is that we can search it by
the semantic meaning rather than the
text that's contained in the sentence.
Let's look at how this might work in code so I can explain it a bit more. I'm going to create a variable called prompt, pass in the string "John Wick is a great movie," and use OpenAI's embedding model to create the sentence embedding for it. After importing the OpenAI client and adding my API key, the sentence embedding can be retrieved, and as you can see from the output, the embedding is a long list of decimal values. While a conventional database might just store the sentence as is, in a vector database we're going to store both the embedding and the value, so that the meaning and semantics can be searched, not just the value itself. While we will cover this more in detail in the labs embedded throughout the video, I will store the embedding we just made into a local variable to simulate an actual vector database, which would typically be Pinecone or Chroma.
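Here is a minimal sketch of that idea, assuming the openai Python package (v1+) and an OPENAI_API_KEY in your environment; the tiny in-memory "vector store" and the search helper are just for illustration, not the lab's actual code.

```python
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = "John Wick is a great movie"

# Create the sentence embedding with OpenAI's embedding model
resp = client.embeddings.create(model="text-embedding-3-large", input=prompt)
embedding = np.array(resp.data[0].embedding)

# Simulate a vector database with a local list of (vector, text) pairs
vector_store = [(embedding, prompt)]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query):
    """Rank stored texts by semantic similarity to the query."""
    q = np.array(client.embeddings.create(
        model="text-embedding-3-large", input=query).data[0].embedding)
    return sorted(((cosine(q, v), text) for v, text in vector_store), reverse=True)

print(search("great film"))  # matches even though the stored text says "movie"
```

Now that we understand how sentences and data are converted to embeddings so they can be searched by semantics, let's look at how the rest of the RAG pipeline comes together. So given our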
understanding of the basics of how vector embeddings and vector databases work and where they can be used by LLMs through RAG, let's look at situations
where RAG could actually be a bad tool
to use and when RAG is a good tool to
use. As you just saw when we converted a sentence into its vector embedding, what we did was capture its semantic representation. So when you are in a situation where you have to search the
situation where you have to search the
database by the meaning rather than the
text, using rag can be a really good way
to retrieve information that pertains to
its certain topic. And this is a great
way to extend the LLM's knowledge. Also,
when you have large sets of documents
that contain disparate information, rag
is a really good way to neutralize them
by essentially making all information
accessible into a single search. And
like we've seen before, you'll certainly
have to put in a lot of work upfront in
converting these documents into a vector
embedding to store them in the vector
database. But once you do get that done, the rest just works like magic, with instant retrieval of relevant information at your fingertips. Let's
dive into a quick lab to cover how the
basic embedding and vector databases
actually look to get a more concrete
understanding before jumping into deeper
concepts.
Let's start with the first lab. Click on
start to launch the lab. Give it a few
seconds to load. Once loaded,
familiarize yourself with the lab
environment. On the left hand side, you
have a questions portal that gives you
the task to do. On the right hand side,
you have a VS Code editor and terminal
to the system. Remember that this lab
gives you access to a real Linux system.
Click on OK to proceed to the
first task. The first task requires you
to explore the document collection. Open
the TechCorp documents in the VS Code
editor on the right. We see there is a
Tech Corp docs folder. Expand it to
reveal the subfolders. The ask is to
count how many documents are in the
employee handbook. This is what I call a
warm-up question that will help you
explore and familiarize yourself with
the lab. The real tasks are coming up.
In this case, it's three. So, I select
three as the answer. Then proceed to the
next task. This is about performing a basic grep search. As we discussed in the lecture, we'll run a grep command to search for anything related to holiday in the folder and store the results in a file named extracted content. To open the terminal, click anywhere in the panel below and select terminal. Running the command creates a new file with the results.
Click check to check your work and
continue to the next task. The next task
is to set up a Python virtual
environment and install dependencies.
I'll let you do that yourself. We'll
move to the next task now. Here we explore the TF-IDF script. We first import the TfidfVectorizer from the scikit-learn library. We then transform the docs, and then we compare them using cosine similarity. Cosine similarity is one approach for comparing two vectors to identify how similar they are. And then we finally print the results. We now execute the script
and then we view the results. For now, we'll just click check and move to the next step. Here the question is to analyze the printed scores and identify the score of the top result. The ask is to
search for pet policy docs and identify
the score for the top result. Here we
see the top result is rightly identified
as the pet policy.md file with a score
of 0.4676
whereas the other files have a score
less than 0.1. So the answer to this
question is 0.4676.
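As a rough illustration of what that TF-IDF script is doing (the documents here are made up for the example; the lab's real files differ), the core logic looks something like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for the TechCorp documents
docs = [
    "Employees may bring pets to the office on Fridays.",
    "Holiday schedule and paid time off policy.",
    "Expense reimbursement requires manager approval.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)          # one TF-IDF vector per doc

query_vector = vectorizer.transform(["pet policy"])   # vectorize the query
scores = cosine_similarity(query_vector, doc_vectors)[0]

# Print documents ranked by cosine similarity to the query
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.4f}  {doc}")
```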
The next task is to review and execute the BM25 script. Open the BM25 search.py file and inspect it. You'll see that we import the rank_bm25 package. We then create an index, and for each query we call the BM25 get_scores method; from the results we take the top three and print each one. Finally, there is a hybrid approach that combines the TF-IDF and BM25 techniques using a weighted score. I'll let you explore that by yourself.
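Before moving on, here is a minimal sketch of how the rank_bm25 package is typically used (again with made-up documents; the lab's real script differs in its details):

```python
from rank_bm25 import BM25Okapi

docs = [
    "Employees may bring pets to the office on Fridays.",
    "Holiday schedule and paid time off policy.",
    "Expense reimbursement requires manager approval.",
]

# BM25 works on tokenized text, so split each document into lowercase words
tokenized_docs = [doc.lower().split() for doc in docs]
bm25 = BM25Okapi(tokenized_docs)

query = "pet policy".lower().split()
scores = bm25.get_scores(query)  # one relevance score per document

# Show the top three documents by BM25 score
for score, doc in sorted(zip(scores, docs), reverse=True)[:3]:
    print(f"{score:.4f}  {doc}")
```

A hybrid score can then be formed as a weighted combination, for example taking something like 0.5 times the normalized TF-IDF score plus 0.5 times the normalized BM25 score. Let's get back to the next topic.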
Okay. So, in this lab, we're going to look at embedding models. We'll explore semantic search using embedding models, which are the foundation of modern RAG systems. Let's go to the first task. The first task is about keyword search limitations. First we navigate to the project, create a new virtual environment, and install the requirements. I go to the terminal and we set up the virtual environment. Our project is within this folder called rag project, and here we have the virtual environment being set up. Once the virtual environment is set up, the next step is to run the keyword limitation demo. If you go to
the rag project, you'll see the keyword limitation demo script. This is a simple script that searches for a word or keyword that does not exist in the documents, and it proves that pure keyword-based search is less likely to yield the right results. For example, in this case the query is "distributed workforce policies" and none of the documents contain something exactly like that. So, let's try running the script. If you look at the output, most of the scores are zero because the keywords "distributed workforce policies" don't really exist in any of the documents. So, the correct answer here is missing synonyms and context.
All right.
The next task is to install embedding dependencies. We go to the rag project, and we're already in that project. We source the virtual environment and install the embedding packages. I'm going to copy this command and run it. The packages are sentence-transformers, Hugging Face Hub, and OpenAI.
The next question is to run the local embedding script. The script name is semantic search demo, so let's look at it. Looking into this, we can see that the first step is loading the documents. Then we load the local embedding model, which is all-MiniLM-L6-v2. Then we generate embeddings by calling the model's encode method and passing in the docs. Then we have the query, which is the same query we used before, "distributed workforce policies," and we generate an embedding for the query as well. Then we calculate the similarities using NumPy and print the results. So let's run the script with uv run python on the semantic search demo.
Now, as you can see, on the same set of documents the script has identified the relevant documents whose meaning is closest to the "distributed workforce policies" query we are looking for. Each document is given a score, which means the script is able to identify the document with the closest semantic match. We'll go to the next question. The task is to look at the semantic search results and answer: what is the similarity score between the remote work policy and "distributed workforce policies"? If you look at the first score, it's 0.3982, and that is the score for the remote work policy.
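As a rough sketch of what the semantic search demo is doing (the document texts here are invented for illustration), the sentence-transformers flow looks like this:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

docs = [
    "Remote work policy: employees can work from home up to three days a week.",
    "Holiday schedule and paid time off policy.",
    "Expense reimbursement requires manager approval.",
]

# Load the local embedding model used in the lab
model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode documents and query; normalizing lets a dot product act as cosine similarity
doc_embeddings = model.encode(docs, normalize_embeddings=True)
query_embedding = model.encode("distributed workforce policies",
                               normalize_embeddings=True)

scores = np.dot(doc_embeddings, query_embedding)
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.4f}  {doc}")
```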
The next question is a multiple-choice question that basically confirms our learning. It is based on the comparison between semantic search and keyword search, that is, the TF-IDF and BM25 approaches we saw earlier: which approach better understands the meaning of queries? Of course, we know that semantic search understands the meaning of queries better. And that's basically it for this lab. In the next lab we'll explore vector databases.
I'm just going to go through a high-level overview of this lab and leave you to do most of it, but I'll explain how the lab functions. In this lab we're going to learn how to scale semantic search with vector databases. So let's get that going.
The first task is to simply understand the concepts. Before we start building, let's understand what vector databases are. We already discussed that in the video, but here's a quick description of what they are and what they can help us do. There's a question on what the primary advantage of using a vector database over storing embeddings in memory is, and I'll let you answer that yourself. The next step is to navigate to the project directory, which is right here. Then we again activate the virtual environment and install the embedding model package, which is sentence-transformers, as we also did in the last lab. The next step is to install the vector database. In this case we're going to use ChromaDB, so the task is to install the chromadb package.
Again, I'll just skip through that for now. The next task is to initialize a ChromaDB vector database. If you go here, there's a script called init vector db, and if you look into the script, we first import the chromadb package. We also have sentence-transformers. We then create the Chroma client using the chromadb.Client method and create a collection, which we'll call tech corp docs. Then we load the embedding model, which is the all-MiniLM-L6 model, and test it with a sample document. We have identified a test doc, which is really just the sentence given here. We then add the test document to the collection using the collection's add method, print the results, and print the count of documents within the collection, and that's basically it. So that's a quick beginner-level script.
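A stripped-down version of that init script might look roughly like this (the collection name and test sentence are placeholders, not the lab's exact values):

```python
import chromadb
from sentence_transformers import SentenceTransformer

client = chromadb.Client()  # in-memory Chroma client
collection = client.create_collection(name="techcorp_docs")

model = SentenceTransformer("all-MiniLM-L6-v2")

# Test the setup with a single sample document
test_doc = "TechCorp employees receive 20 days of paid vacation per year."
embedding = model.encode(test_doc).tolist()

collection.add(ids=["doc-1"], documents=[test_doc], embeddings=[embedding])
print("Documents in collection:", collection.count())
```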
In the next one, there are a couple of questions being asked, and you can answer them based on the results of the script. The next one is called store documents. This is where we store actual documents within the ChromaDB database. Again, this is another script that starts off and loads the model and client as we did before, but in this case we are reading the TechCorp documents using a helper method from the utilities module. That's what loads all the documents that are in the TechCorp docs folder. So now we're loading actual documents, and then we follow the same approach of adding those documents to the collection and verifying the collection. So again, it's just another layer on top of the basic script; in this case we're just storing documents. We'll continue to the next task. This is where we perform a vector search against the documents. The script this time is the vector search demo, so click on it. Here we have some sample documents, which are sentences, and then there's a query.
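Continuing the same sketch from the init step (the extra documents and the query are again invented), a vector search against the collection could look something like this:

```python
# Add a couple more documents so the search has something to rank
more_docs = [
    "Remote work is allowed up to three days per week.",
    "Expense reports must be submitted within 30 days.",
]
collection.add(
    ids=["doc-2", "doc-3"],
    documents=more_docs,
    embeddings=[model.encode(d).tolist() for d in more_docs],
)

# Embed the query with the same model and ask Chroma for the closest matches
query = "How many vacation days do employees get?"
results = collection.query(
    query_embeddings=[model.encode(query).tolist()], n_results=3)

# Chroma returns parallel lists of ids, documents, and distances
for doc_id, doc, dist in zip(results["ids"][0],
                             results["documents"][0],
                             results["distances"][0]):
    print(f"{dist:.4f}  {doc_id}  {doc}")
```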
Okay, what about when RAG isn't useful? In what kinds of scenarios should I avoid RAG? Because vector databases only work in a text space, when you want to store graphs, images, and charts, a vanilla vector embedding might not be a great option, since it can't capture what's actually contained in the image. So use cases where you have to search images and charts would not be the best fit for RAG, since it can't retrieve based on modalities other than text. Another
example is when you want to search by
the format of the document. For example,
a document can have multiple pages and
at each page there might be different
sections of the document and certain
sections might have different formatting
and tables of information. Meaning, when you want to search by a certain page of the document, or by where in the document something is located, RAG is not able to fulfill that, since it retrieves by the semantic meaning of the words rather than the position or format of the document, which can often be handled better by a vision model.
However, even though rag might not lend
itself to how the document is physically
formatted and preserve the physical
structure of the document, separating
the document into semantic structure is
actually very common practice in rag.
You might think, well, can I just store
the entire document in a vector database
and retrieve the whole document? While
this sentiment is certainly common in
regular SQL databases, where we can store objects row by row, things look slightly different in RAG. And here's
why. Large language models are typically limited in how much context they can hold, and this limitation is often referred to as the context window. This is why, when you go to ChatGPT and paste in a large PDF document, ChatGPT will say the document is too large. For the same
reason, we will need to chunk our
document in sections so that when we
retrieve the document, it doesn't
overload the LLM with the entire
document. And besides the context window, it's important to feed the LLM the correct context. And documents often
contain information that is irrelevant
to what we're actually trying to look
at. In other words, even if the context
window can fit the entire document,
chunking the document into its semantic
group is actually beneficial overall.
But how do we decide how to chunk the
document? One way to do this is by going
by fixed size chunking. You can chunk
the document in a naive way where you go
by certain character limit so that as
the document is being stored in the
database, each row essentially holds up
to say 5,000 characters per chunk. You
can also go by total number of words so
that each row contains up to certain
predefined number of words. Another way
is to go by tokens where instead of
chunking by a fixed size in original
language, you can actually chunk them by
the size of the tokens which is an
embedding representation of the
document. As you can see, fixed-size chunking is the simplest strategy, since you can just take an upper bound that's predefined up front. The same applies to sentence chunking or paragraph chunking, all based essentially on where you want to cut the document by some predefined threshold.
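As a tiny illustration of the fixed-size idea (with an overlap parameter thrown in, which previews the sliding-window approach discussed a bit later; the sample text is made up), a naive character-based chunker could be as simple as:

```python
def fixed_size_chunks(text, chunk_size=5000, overlap=0):
    """Split text into chunks of at most chunk_size characters,
    repeating the last `overlap` characters at the start of the next chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

handbook = "TechCorp employees may work remotely up to three days per week."
print(fixed_size_chunks(handbook, chunk_size=30, overlap=10))
```

But fixed-size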
chunking doesn't necessarily lead to the
best result in rag since it ignores the
semantic grouping of the document. In
other words, documents are often grouped
in sections or topics and fixed size
chunking ignores this and can abruptly
chunk the document regardless of what
the document might have contained. For
this reason, semantic chunking is
another popular method. Instead of
splitting the document into a fixed
number of characters, words, tokens, or
sentences, you can break the text in
where the meaning actually starts to
shift. This way, each row in the
database naturally respects the semantic
topics, which makes retrieving rowby row
very rich with context. One way to do this is by breaking the document into sentences and measuring the similarity between them, so that when the coherence between sentences starts to drop, which is an indication that the flow of context or topic has changed, you chunk the document at that point. Chunking the document this way ensures that the natural break points that exist in the document are preserved, so that you're not abruptly cutting the document just because of its size.
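A rough sketch of that similarity-based approach, using the same sentence-transformers model from earlier (the 0.5 threshold and the naive period-based sentence splitting are arbitrary choices for illustration):

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(text, threshold=0.5):
    """Start a new chunk whenever adjacent sentences drop below a similarity threshold."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    embeddings = model.encode(sentences, normalize_embeddings=True)

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(np.dot(embeddings[i - 1], embeddings[i]))
        if similarity < threshold:          # coherence dropped: topic likely shifted
            chunks.append(". ".join(current) + ".")
            current = []
        current.append(sentences[i])
    chunks.append(". ".join(current) + ".")
    return chunks

print(semantic_chunks(
    "Our pet policy allows dogs. Cats are also welcome. Expense reports are due monthly."))
```

The obvious drawback is that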
compared to naive approaches like fixed-size chunking, it adds engineering overhead to chunk and store the document based on its semantic flow. And one way to reduce
engineering overhead while keeping the
semantics between each row is by using
what's called an overlapping chunking or
sliding window technique. In this method, you process the document in a way that intentionally adds duplication between rows by adding an overlap between chunks, so that there's overlapping context coverage between each row. This way, every row contains a bit of semantic context from the previous and the subsequent row. But chunk overlap can be more art than science, since the exact amount of overlap can feel arbitrary and non-deterministic. But it is a great way
to ensure higher accuracy while still
trying to keep the engineering very
simple. While there are other methodologies that can make document chunking more flexible, one notable method that's becoming popular is agentic chunking, where you leverage AI to chunk the document for you. Essentially, you allow AI to pre-process the document, and since LLMs are extremely good at understanding text, instead of coming up with your own chunking method based on fixed sizes or natural breakpoints detected with a semantic algorithm like we just covered, we allow the AI to make the call for us. The
obvious drawback to this method is cost and speed, since you'll essentially have to front-load all the work, and any changes will require rerunning those agents to keep the chunking consistent. But using lower-grade or more affordable LLMs should do just fine, since the job is just splitting the document where it thinks the topic shifts.
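As a hedged sketch of what agentic chunking can look like in practice (the model name, prompt wording, and delimiter are all assumptions for illustration, not a prescribed recipe):

```python
from openai import OpenAI

client = OpenAI()

def agentic_chunks(text, model="gpt-4o-mini"):
    """Ask an LLM to split the text at topic boundaries, returning one chunk per topic."""
    instructions = (
        "Split the following document into coherent chunks, one per topic. "
        "Do not rewrite anything; return the original text with the marker "
        "'<<<CHUNK>>>' inserted between chunks.\n\n" + text
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": instructions}],
    )
    reply = response.choices[0].message.content
    return [chunk.strip() for chunk in reply.split("<<<CHUNK>>>") if chunk.strip()]
```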
Given that chunking is an important
concept to drill down, let's hop on to a
quick lab to get a more concrete
understanding.
Okay, in this lab we're going to look at chunking techniques. We'll learn how to optimize RAG performance by breaking documents into focused, searchable chunks.
So first we activate the virtual
environment. So this is something we
have already done many times. All right, so first we're going to look at the chunking problem demo script. If you expand the rag project, there should be a script called chunking problem demo. This script demonstrates the core problem of searching large documents in RAG systems. It creates a sample employee handbook and shows how searching for specific information, like internet speed requirements, returns the entire document instead of just the relevant section. So we'll see a large document stored as a single chunk, search queries that should find specific sections, and results that return the entire document. Here you can see there's a sample document that has multiple sections; we're adding that document to the ChromaDB collection and then querying for internet speed requirements.
So let's run the script and see
how it works.
The script runs now, and as you can see, it returns the entire document. It's truncated here, but the result shows the entire document. So that's the problem with this approach, and the answer to this question is that large documents return irrelevant results. Next we will look at some of the libraries and dependencies that we'll be using. First, we have what is known as LangChain. If you don't know what LangChain is, we have other videos on our platform, and we have an upcoming course that will cover LangChain end to end, so do remember to subscribe to our channel to be notified when it comes out. LangChain is a powerful framework for building RAG applications. It provides the recursive character text splitter for smart document chunking. There's also spaCy, which is an advanced natural language processing library; it provides a spaCy text splitter for sentence-aware chunking. So we'll use spaCy for sentence-aware chunking, and these libraries take care of chunk sizes, overlaps, separators, and so on. We'll install the LangChain and spaCy dependencies. Okay, we'll go to the next question, and we'll first look at basic chunking.
If you open the basic chunking script, you'll see that it uses the LangChain text splitters package, from which we import the RecursiveCharacterTextSplitter. Here we have a sample document, and this is where we do the splitting. As you can see, we specify a chunk size of 200 and a chunk overlap of 50, so 50 characters will overlap between chunks, along with some separators that are defined. We then call splitter.split_text to split the text into different chunks, and then we just go through the chunks and print them.
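In case you want to try it outside the lab, the core of that basic chunking script looks roughly like this (the sample text is made up; depending on your LangChain version the import path may be langchain.text_splitter instead):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

sample_document = (
    "TechCorp Employee Handbook. Remote work is allowed up to three days per week. "
    "Internet speed requirements: at least 50 Mbps for video calls. "
    "Expense reports must be submitted within 30 days of purchase."
)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,      # max characters per chunk
    chunk_overlap=50,    # characters shared between consecutive chunks
    separators=["\n\n", "\n", ". ", " "],
)

chunks = splitter.split_text(sample_document)
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} ({len(chunk)} chars): {chunk}")
```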
So I'll let you do that yourself. We'll
go to the next one and there's uh a
bunch of questions uh that are asked
that you can you have to read the script
and understand and answer. So I'll let
you do that uh by yourself.
The next one we'll look at is sentence chunking. In sentence chunking, if you look at the script, we're using spaCy as the library. And then we have a question that's based on the output of that script.
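Sentence-aware chunking with spaCy follows the same pattern; a minimal sketch (assuming the en_core_web_sm model is installed, and with an arbitrary per-chunk character budget) looks like this:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def sentence_chunks(text, max_chars=200):
    """Group whole sentences into chunks without ever splitting mid-sentence."""
    sentences = [sent.text.strip() for sent in nlp(text).sents]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```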
And then finally we look at chunked search. This is another script that performs a chunked vector search demo that connects everything we have learned so far. First we chunk the documents, then we add these chunked documents to a collection, and there's a comparison between a collection with no chunking and a collection with chunking, so we can see the difference between the two. Again, I'll let you go through that by yourself, and there's a question based on that. In this lab, task six covers
agentic chunking. This is the most
advanced chunking method. Instead of
splitting by character count or sentence
boundaries, an AI model analyzes the
document and decides optimal split
points based on semantic topic shifts.
Before running this demo, you need to source the bash profile to load the API configuration. Run the source command like this, and then run uv run python agentic chunking demo.py. You will see
the AI analyzing the document for
natural topic boundaries and creating semantically coherent chunks, where each chunk contains one complete topic or
idea. The question asks about the main
advantage of agentic chunking. The
answer is that it splits based on
semantic meaning and topic shifts. This
produces the highest quality chunks but
comes with a trade-off of higher cost and slower processing, since it requires LLM API calls. To summarize, we learned four chunking techniques, from basic to advanced. Basic chunking splits by character count, overlap preserves context at boundaries, sentence-aware chunking respects natural language structure, and agentic chunking uses AI to understand document semantics. Different methods suit different use cases. For most applications, sentence-aware chunking with overlap is a good balance. For high-value documents where quality matters most, agentic chunking provides the best results. So
you might be wondering what real use
cases for rag might be. And here are a
few potential use cases that rag can be
extremely valuable to an organization. A
common use case is a law firm. In a
traditional mid-size law firm, you're
going to have millions and millions of
documents stored inside a document
management software. And given that
different files from various matters
contain different information, you want
to have a system like Rag that can run
comprehensive searches through large
sets of documents. And for extra
security, you can make sure that the
privacy of those search results is
contained within a specific case rather
than the whole by adding additional
search parameters in how you store the
documents in the vector database.
Another example is using a chatbot that
can search through a company's knowledge
base and policies to either help internal staff understand more about the company's policies, or even power client-facing chat applications that can directly help answer questions about the company. All of this can be done by leveraging RAG.
As you can see, the use cases for rag
can extend to cases like law firms and
chatbots, which means you're going to
rely on RAG more and more. And just like anything else, any system that is important or heavily relied on is going to need to be evaluated. Evaluation
allows us to look at the system and make
sure that they are working properly and
also detect when things aren't working
as they should. But how do we measure
rag? What are the best ways to evaluate
our rag system that we can set up? Here
are some ways that you can add
evaluation in your RAG system. On the retrieval side, you want high relevance, high comprehensiveness, and high correctness. In other words, because the
to make sure that each time that we
retrieve information from the data
source, we're able to measure whether
the results are relevant to the question or query that was asked. And the same thing applies to comprehensiveness: does the set of retrieved documents cover
all information that is needed to
actually answer the prompt? And finally,
are the retrieved documents actually
correct and ranked properly? What's
important to keep in mind is that your specific setup might require measuring different aspects. But here are some of the more common and established methods to keep in mind if you want to take further steps beyond how they're introduced here. The first is recall at
K. It asks: given all the relevant documents that exist in your system, how many did the retriever actually find within the top K results? Another one is precision at K,
which measures out of the top results,
how many were actually relevant, which
is slightly different than recall at K
since it measures the quality of the
actual result itself. MRR, or mean reciprocal rank, measures how high up the first relevant document appears in your results. And finally, NDCG, or normalized discounted cumulative gain, measures whether the retriever is ranking relevant documents higher than irrelevant ones. As you can see, there
are many different ways to approach the metrics side of things when it comes to RAG. They will help you ensure that the system you're setting up is actually working properly as it should, especially as you rely more and more on it. Let's hop onto a lab to see how to evaluate RAG.
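To make the formulas concrete before the lab, here is a small self-contained sketch of all four metrics (the document IDs are invented, chosen to mirror the desk-reimbursement example discussed later):

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Of the top K retrieved documents, what fraction is relevant?"""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Of all relevant documents, what fraction shows up in the top K?"""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(retrieved, relevant):
    """1 / position of the first relevant document (0 if none found)."""
    for pos, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1 / pos
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Reward relevant documents that appear near the top of the ranking."""
    dcg = sum(1 / math.log2(pos + 1)
              for pos, d in enumerate(retrieved[:k], start=1) if d in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1 / math.log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

retrieved = ["policy_3", "policy_4", "policy_1"]   # hypothetical ranking
relevant = {"policy_3", "policy_1"}
print(precision_at_k(retrieved, relevant, 3))  # 0.667
print(recall_at_k(retrieved, relevant, 3))     # 1.0
print(mrr(retrieved, relevant))                # 1.0
print(ndcg_at_k(retrieved, relevant, 3))       # ~0.92
```

In this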
lab, we're going to learn how to
evaluate RAG systems using four
key metrics. When you build rag systems,
you need to measure how well it
retrieves relevant documents. We will
implement precision at K, recall at K,
mean reciprocal rank, and normalized
discounted commulative gain. Each metric
answers a different question about your
retrieval quality. This lab takes about
20 to 30 minutes to complete. The first
step is to set up our environment. In
this task, we're asked to
navigate to the project directory and
activate the virtual environment. Run
the following command to change the
directory to rag project and then source
the virtual environment and activate
it. After that, run python verify environment.py. This script will automatically install all the required packages, including chromadb, sentence-transformers, and the LangChain text splitters. It will also pre-download the embedding model so we do not face any timeout issues later. Wait until you see the environment setup completed message. Next, we have an informational section about ground truth data. Ground truth is
essential for evaluation because we need
to know what documents should be
retrieved to measure how well our system
actually performs. The file
contains six test queries mapped to
relevant document IDs. Feel free to run python ground_truth.py to see the data. Then click "got it" to
continue. Now we move to task number one
which is about precision at K. In this
question, we learn the theory behind
precision. Precision at K measures the
top K documents retrieved. How many are
actually relevant? The formula is
simple. Number of relevant docs in the
top K divided by K. This is important
because users hate irrelevant results.
If you search for vacation policy and
get five results, but only two are about
vacation, you're wasting time reading
three irrelevant documents. The
question asks, if you retrieve five
documents and two are relevant, what is
precision at five? The answer is 0.4
because 2 / 5 equals 0.4.
In the next task, we're asked to
implement precision at K. Open the file
1 precision atk.py.
You need to complete two TODOs. At line 50, replace the None with the count of relevant documents found in the top K results. At line 56, replace the None with the precision calculation, which is
the count divided by k. The hints are in
the comments. Run the script with uv run python one precision at k.py and verify you see the task one completed message. After running the script, we
have a question about the results.
Looking at your output for query about
travel expense policy, the rag retrieved
three documents, but only one is
relevant. So precision at three equals
0.333. This shows that two of the three
retrieved documents are noise. Moving to
task number two, which covers recall at
K. Recall measures among all the
relevant documents that exist for a
query, how many appear in the top K
retrieved results. This is different
from precision. Recall is about coverage
while precision is about quality.
Imagine searching for documents about a
legal case. Missing even one relevant
document could mean missing critical
evidence. The question asks if there are
four relevant documents and you found
three in your top five. What is recall
at five? The answer is 0.75 because 3 /
4 equals 0.75. Now we implement recall
at K. Open two recall at K.py
and complete the two to-dos. At line 58,
count how many relevant docs are found
in top K. In line 64, calculate recall
by dividing the found count by the total
number of relevant documents. Run the
script and verify completion. The
results question for recall is
interesting. For the query about home
office equipment reimbursement with K
equals 2, there are three relevant
documents total. Even if both top two
results are relevant, we can only find
two out of the three. So recall at 2
equals 0.667. This demonstrates the
trade-off: smaller K means potentially lower recall. Task three introduces mean reciprocal rank, or MRR.
This metric measures how high up the first relevant result appears. In Q&A systems, users typically only look at the first result. If the answer is buried at position five, users might give up. The formula is 1 divided by the position of the first relevant document. Position one gives an MRR of 1, position two gives an MRR of 0.5, and position three gives 0.333.
The theory question asks about a
scenario where the first relevant
document is at position three. The
answer is 0.33. For the implementation,
open three mean reciprocal rank.py. The
loop already finds the position of the
first relevant document. At line 59,
store this position in a variable. At line 65, calculate MRR by taking one divided by the position. Run the script to complete the task. The MRR results question is a bit different. When you look at the output, both test queries show MRR equals 1. The question asks why all queries get a perfect score. The answer is that the RAG system always found the first relevant document at position one. This is actually a good thing. It
shows the semantic search is working
well. Finally, we have task 4 covering
NDCG, or normalized discounted cumulative gain. This is the most sophisticated metric. Unlike MRR, which
only looks at the first result, NDCG
evaluates the entire ranking. It rewards
putting relevant documents at the top
and penalizes burying them lower.
Documents at higher position get more
credit. Position one gets the full
credit. Position two gets about 63%,
position 3 gets about 50%. The formula
involves logarithms. NDCG of one means
perfect ranking. The theory question
asks about two relevant docs at position
one and position two. The answer is one, because that is the ideal ranking. For the implementation, open four normalized DCG.py. At line 57, calculate DCG by
summing 1 / log base 2 of position + 1
for each relevant document. At line 63,
calculate IDCG, which is the ideal DCG
assuming all relevant documents are at
the top positions. Use math.log2 for the calculations. Run the script to
complete the task. The NDCG results
question examines a real output for the
desk reimbursement query. The retrieved
order is policy 3, policy 4, policy 1.
Two documents are relevant. Policy 3 and
policy 1. Policy 3 is in position one,
which is good. Policy 1 is at position
three instead of position two. An irrelevant document, policy 4, sits at position two. This is why NDCG equals
0.92 instead of one. The ranking is good
but not perfect. Before wrapping up, we
implemented four evaluation metrics.
Precision at K tells you how much noise is in your results. Recall at K tells you how much coverage you have. MRR tells you how quickly users find the first relevant result. And NDCG measures overall ranking quality. Each metric has its own use cases. Use precision when you want to minimize noise. Use recall when you cannot afford to miss documents. Use MRR for Q&A systems where only the top result matters. And use NDCG for search engines where the order of all results is important. Now that we
of all results is important. Now that we
covered some basic concepts when it
comes to RAG, let's look at some more
emerging concepts and techniques when it
comes to RAG. One of the biggest
challenges when it comes to RAG is
redundancy. What I mean by that is that
while rag as a system works totally fine
serving what's being asked for, users
tend to ask either the same questions or
similar questions over and over again.
And at that point, it starts to add
redundancy which can certainly be
removed. But what if we can store these
in a cache and basically add a memory
layer? The theory behind this is simple.
You can basically have the system first
check the cache and see if the prompt
being asked could realistically be answered by what's already stored in the cache, and only tap into the existing RAG pipeline if the cache
doesn't seem to be sufficient. This kind of setup is called CAG, or cache-augmented generation. The one important thing to keep in mind is how you actually manage this cache so that it invalidates properly. CAG can be extremely good for content that doesn't change often. But if the underlying data set constantly changes and the cache essentially needs to be refreshed every time so that the newest information is fed into the model, the use cases for CAG start to diminish.
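A very simplified sketch of that check-the-cache-first flow (the cache here is a plain dictionary keyed by the normalized prompt, and rag_pipeline is a stand-in for whatever retrieval-plus-generation function you already have):

```python
cache = {}  # maps a normalized prompt to a previously generated answer

def rag_pipeline(prompt):
    """Stand-in for your existing retrieve-then-generate pipeline."""
    return f"(answer generated via retrieval for: {prompt})"

def answer_with_cache(prompt):
    key = prompt.strip().lower()
    if key in cache:               # cache hit: skip retrieval and generation entirely
        return cache[key]
    answer = rag_pipeline(prompt)  # cache miss: fall back to the normal RAG pipeline
    cache[key] = answer            # remember it; invalidate when the source data changes
    return answer

print(answer_with_cache("What is our pet policy?"))  # goes through the RAG pipeline
print(answer_with_cache("What is our pet policy?"))  # served straight from the cache
```

Another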
emerging and popular use case of RAG is agentic RAG. The RAG pipeline that we covered in the video is single-shot, meaning it's a one-time event: you ask a question, RAG retrieves and generates the response, and it stops. But in most systems, and especially in agentic systems, instead of you being the driver, the agent itself is what actually initiates these requests. And the biggest difference here is that instead of using RAG as a task-based system, with agentic RAG you can now use it as a goal-based system, where you allow the model to formulate how to actually
get the relevant data that you're
looking for. As much as agents are
becoming more popular, the use cases for
agentic rag are actually more niche than
most might expect. And that's because
agentic rag tends to be slower than
traditional rag since it has to perform
more than one search to make sure that the facts contained in the vector database are extracted in the most optimal way that aligns with the goal. And it's for this reason that agentic RAG should be used in cases where slower responses are tolerated in exchange for potentially higher-quality data.
Another emerging concept here has to do
with how to actually cast a wider net
given a single question by leveraging
LLMs. In a conventional rag pipeline,
your prompt is taken and the burden of
actually decomposing your prompt to find
relevant context in the vector database
through embedding is done by the rag
system itself. But what if we want to
make sure that we can cast a wider net?
For example, if the user asked the
question, what are the security risks of
using rag with our customer data? Even
though the question itself is totally
and completely normal, depending on how
the question is phrased, the quality of
the results might be different. In other
words, the phrase security risk can be
phrased differently like risk of data
leakage or maybe risks of unauthorized
access and even risks of prompt
injection and so on. As you can see,
these variants are more specific cases and examples of the original question, just phrased differently. Multi-query RAG takes advantage of this by using an LLM to produce different variations of the original prompt, running them all through the RAG pipeline, and then merging and deduplicating the results before answering.
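A minimal sketch of that flow (the model name and prompt wording are assumptions, and retrieve is a stand-in for your existing vector-database search):

```python
from openai import OpenAI

client = OpenAI()

def retrieve(query):
    """Stand-in for a single vector-database search returning a list of text chunks."""
    return []  # plug in your collection.query(...) call here

def multi_query_rag(question, n_variants=3, model="gpt-4o-mini"):
    # Ask the LLM for alternative phrasings of the question
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content":
                   f"Rewrite this question {n_variants} different ways, one per line:\n{question}"}],
    )
    variants = [question] + [v.strip() for v in
                             response.choices[0].message.content.splitlines() if v.strip()]

    # Run retrieval for every variant, then merge and deduplicate the chunks
    merged = []
    for variant in variants:
        for chunk in retrieve(variant):
            if chunk not in merged:
                merged.append(chunk)
    return merged
```

Now, similar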
to the previously mentioned agentic rag,
the performance may be slower than a conventional RAG pipeline, and the results may be noisier than what the user asked for, especially if the intent was to only learn about a specific idea rather than its variations. However, multi-query RAG can be an extremely valuable tool for more generic use cases that might benefit from a setup like this. Another use case that is emerging is what's called hierarchical RAG, and this is a really cool way to map your information. Most
cool way to map your information. Most
documents, especially in corporate
settings, are inherently hierarchical.
They're organized by different
categories like company vision,
strategy, goals, or even hierarchically
like company, division, department, and
team, and so on. But the nature of RAG actually flattens these hierarchical structures and stores them as individual chunks that go into the database. As you can see, losing this hierarchical view of how information is grouped is something that can be improved upon. Hierarchical RAG tries to preserve it, so that when you retrieve information from the database, the first, coarse level of information is checked before it goes deeper into the hierarchy.
This kind of setup reduces the risk of
using far less relevant and far less
important detail. And a similar analogy
is like going from the globe to
continent to country to province and
then to a city instead of looking at all
of them all at once. But it's important
to keep in mind that this adds an
engineering overhead in not only setting
it up this way but also in maintaining
the structure as new information enters
into the system. Finally, we have what's
called multimodal rag. This is where we
extend beyond simple text but get into
other modalities like images, charts,
diagrams, and screenshots. Data is often stored in these modalities, and supporting them is going to be extremely important, since most documents contain images, diagrams, and charts, especially in corporate settings. Giving your RAG system the ability to search over them and retrieve relevant information stored in these modalities can be an extremely important feature to add. The idea is to use models that can understand both text and images by converting the input into an embedding, in the same way that we converted our text data into vector embeddings. While
the actual method in theory behind how
to actually set up a multimodal rag
system is certainly beyond a crash
course, being able to go beyond simple
text is going to be important for you to
know as you try to implement a rag
system for your project and your
company. So we just covered the extremely important concepts of RAG, starting from how LLMs extend their knowledge with external data all the way to more advanced concepts like CAG, agentic RAG, and more. And the biggest idea that I want to get across is that while RAG might appear as if it's extending the LLM's long-term memory, there's actually a lot more technique involved in fragmenting your facts and data into
retrievable chunks so that LLMs can
leverage your knowledge base stored in a
vector database to better assist your
use cases.