When you make a Google search, you're trying to get access to more information from disparate websites curated by a search engine. So even though you implicitly have knowledge in your brain, by using Google search you are essentially extending your knowledge by explicitly searching external data sources from the internet. Large language models that we love and use today, like ChatGPT, also
have the same predicament as you do when
it comes to knowledge. What typically
happens during training is the model is
initially pre-trained with trillions and
trillions of tokens to implicitly learn
from them. So practically every credible piece of information that exists on the internet is consumed by the model during the pre-training stage to build up its
implicit knowledge. And you might see
this and immediately go well the data
that goes in pre-training certainly
couldn't be all the data. What about my
company data? What about data from my
own computer? And that's an extremely
important point, because this is the core limitation with LLMs that certainly needs to be addressed: how exactly do we extend the LLM to have knowledge beyond its implicit knowledge? Welcome to KodeKloud, and today we're going to cover RAG as a crash course. We're
starting from covering some fundamental
understanding by looking at some of the
common misconceptions and when to use RAG and when not to. We'll also get into some real-life examples of how RAG can be used in different scenarios, like a law firm or a chatbot. Then we'll cover some metrics that we can use to actually evaluate the RAG system and cover some practical use cases of RAG. And finally
end with some future and emerging
concepts in rag. By the time you finish
this course, you'll have a comprehensive
view on RAG and have a solid
understanding of how to improve your
skill set with practical know-how that you can use right after the video. So, let's get started. Back in 2020, a concept called RAG was brought forward to address this challenge. At the time, the context windows of large language models were extremely limited; around 4,000 tokens was the norm. So we not only had the problem of not being able to process more data due to the small context window, but we also wanted to extend the models' ability to access external data and use it as a knowledge source, similar to how you run your Google search. RAG proposed a method that
allows the model to essentially augment
its knowledge by retrieving from an
external data source and generating its
answer from it. And that's how the term RAG was formed: retrieval-augmented generation. Since then, RAG has certainly matured into a much more comprehensive ecosystem, where we now have established databases specific to RAG use cases, called vector databases. Tools like Chroma and Pinecone are popular when you're trying to set up an external database for the model to use. We also have more established
methodologies when it comes to
converting a regular document into
vectors by using embedding models like
OpenAI's text-embedding-3-large and Cohere's embedding models. These embedding models are essential for converting regular text into a semantic representation, which is how an LLM is able to search the external documents stored in a vector database to get relevant information. As you can see, this type
of configuration is why RAG has become a dominant feature in the AI industry and why you should learn the fundamentals of how to structure RAG, since its use cases and applications are quite extensive. A common misconception about RAG is that it gives the LLM long-term
memory, but this isn't true. While RAG can certainly extend the model's ability by improving its context with more relevant data from the vector database, the retrieved data is ephemeral, which means it only persists during that turn. Where
the confusion comes from is the fact
that rag can appear as if the LLM can
persist its external knowledge since
this knowledge is stored in the
database. So, in a way, since the LLM
has access to this knowledge, as long as
the database is available, it can
certainly seem like the model has a
long-term memory. Another misconception
with RAG is that it has the ability to return all relevant data. While in theory this holds true, there's a limit to it. Typically, in a RAG setup, you have tens or hundreds of gigabytes of data in your vector database. But instead of being stored in raw format like in a structured database such as SQL, in a vector database the data is stored in a semantic space, in vector form. In other words, let's say your
given data is John Wick is a great
movie. Instead of storing this data directly in the database, the embedding model will convert it into a vector that captures its semantic meaning. And this is important because later, if we search the term "great film" instead of "great movie," even though the text doesn't contain the word "film," it's still able to retrieve that record. This is the power of RAG with
vector databases. Given that this is a
core concept in rag let's run through a
lab to actually dig into this so that we
can get a hands-on experience of how to
actually set up a rag system. So going
back to the earlier statement John Wick
is a great movie. We need to first
convert this into a vector embedding if
we want to use rag. So you might be
wondering, wait, why can't we just store
the sentence as is? That's because the
sentence is actually going to be stored
in a vector database so that the LLM can
retrieve it. And you might be asking,
well, can't the LLM just retrieve them
by the sentence as is? Well, if we do
that, then we might as well just use a
SQL database instead of a vector
database. But the whole point in
converting this sentence into a vector
embedding is that we can search it by
the semantic meaning rather than the
text that's contained in the sentence.
Let's look at how this might work in code so I can explain it a bit more. I'm going to create a variable called prompt, pass in the string "John Wick is a great movie," and use OpenAI's embedding model to create the sentence embedding for it. After importing the OpenAI client and adding my API key, the sentence embedding can be retrieved, and as you can see from the output, the embedding is a long list of decimal values. While a conventional database might just store the sentence as is, in a vector database we're going to store both the embedding and the value, so that the meaning and semantics can be searched, not just the value itself. While we will cover this more in detail in the labs embedded throughout the video, I will store the embedding we just made into a local variable to simulate an actual vector database, which would typically be Pinecone or Chroma.
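Here is a minimal sketch of that idea, assuming the openai Python package (v1+) and an OPENAI_API_KEY in your environment; the tiny in-memory "vector store" and the search helper are just for illustration, not the lab's actual code.

```python
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = "John Wick is a great movie"

# Create the sentence embedding with OpenAI's embedding model
resp = client.embeddings.create(model="text-embedding-3-large", input=prompt)
embedding = np.array(resp.data[0].embedding)

# Simulate a vector database with a local list of (vector, text) pairs
vector_store = [(embedding, prompt)]

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(query):
    """Rank stored texts by semantic similarity to the query."""
    q = np.array(client.embeddings.create(
        model="text-embedding-3-large", input=query).data[0].embedding)
    return sorted(((cosine(q, v), text) for v, text in vector_store), reverse=True)

print(search("great film"))  # matches even though the stored text says "movie"
```

Now that we understand how sentences and data are converted to embeddings so they can be searched by semantics, let's look at how the rest of the RAG pipeline comes together. So given our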
understanding of the basics of how vector embeddings and vector databases work and where they can be used by LLMs through RAG, let's look at situations
where RAG could actually be a bad tool
to use and when RAG is a good tool to
use. As you just saw when we converted a sentence into its vector embedding, what we did was capture its semantic representation. So when you are in a situation where you have to search the
situation where you have to search the
database by the meaning rather than the
text, using rag can be a really good way
to retrieve information that pertains to
its certain topic. And this is a great
way to extend the LLM's knowledge. Also,
when you have large sets of documents
that contain disparate information, rag
is a really good way to neutralize them
by essentially making all information
accessible into a single search. And
like we've seen before, you'll certainly
have to put in a lot of work upfront in
converting these documents into a vector
embedding to store them in the vector
database. But once you do get that done, the rest just works like magic, with instant retrieval of relevant information at your fingertips. Let's
dive into a quick lab to cover how the
basic embedding and vector databases
actually look to get a more concrete
understanding before jumping into deeper
concepts.
Let's start with the first lab. Click on
start to launch the lab. Give it a few
seconds to load. Once loaded,
familiarize yourself with the lab
environment. On the left hand side, you
have a questions portal that gives you
the task to do. On the right hand side,
you have a VS Code editor and terminal
to the system. Remember that this lab
gives you access to a real Linux system.
Click on OK to proceed to the
first task. The first task requires you
to explore the document collection. Open
the TechCorp documents in the VS Code
editor on the right. We see there is a
Tech Corp docs folder. Expand it to
reveal the subfolders. The ask is to
count how many documents are in the
employee handbook. This is what I call a
warm-up question that will help you
explore and familiarize yourself with
the lab. The real tasks are coming up.
In this case, it's three. So, I select
three as the answer. Then proceed to the
next task. This is about performing a basic grep search. As we discussed in the lecture, we'll run a grep command to search for anything related to holiday in the folder and store the results in a file named extracted content. To open the terminal, click anywhere in the panel below and select terminal. Running the command creates a new file with the results.
Click check to check your work and
continue to the next task. The next task
is to set up a Python virtual
environment and install dependencies.
I'll let you do that yourself. We'll
move to the next task now. Here we explore the TF-IDF script. We first import the TfidfVectorizer from the scikit-learn library. We then transform the docs, and then we compare them using cosine similarity. Cosine similarity is one approach for comparing two vectors to identify how similar they are. And then we finally print the results. We now execute the script
and then we view the results. For now, we'll just click check and move to the next step. Here the question is to analyze the printed scores and identify the score of the top result. The ask is to
search for pet policy docs and identify
the score for the top result. Here we
see the top result is rightly identified
as the pet policy.md file with a score
of 0.4676
whereas the other files have a score
less than 0.1. So the answer to this
question is 0.4676.
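As a rough illustration of what that TF-IDF script is doing (the documents here are made up for the example; the lab's real files differ), the core logic looks something like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-ins for the TechCorp documents
docs = [
    "Employees may bring pets to the office on Fridays.",
    "Holiday schedule and paid time off policy.",
    "Expense reimbursement requires manager approval.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)          # one TF-IDF vector per doc

query_vector = vectorizer.transform(["pet policy"])   # vectorize the query
scores = cosine_similarity(query_vector, doc_vectors)[0]

# Print documents ranked by cosine similarity to the query
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.4f}  {doc}")
```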
The next task is to review and execute the BM25 script. Open the BM25 search.py file and inspect it. You'll see that we import the rank_bm25 package. We then create an index, and for each query we call the BM25 get_scores method; from the results we take the top three and print each one. Finally, there is a hybrid approach that combines the TF-IDF and BM25 techniques using a weighted score. I'll let you explore that by yourself.
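Before moving on, here is a minimal sketch of how the rank_bm25 package is typically used (again with made-up documents; the lab's real script differs in its details):

```python
from rank_bm25 import BM25Okapi

docs = [
    "Employees may bring pets to the office on Fridays.",
    "Holiday schedule and paid time off policy.",
    "Expense reimbursement requires manager approval.",
]

# BM25 works on tokenized text, so split each document into lowercase words
tokenized_docs = [doc.lower().split() for doc in docs]
bm25 = BM25Okapi(tokenized_docs)

query = "pet policy".lower().split()
scores = bm25.get_scores(query)  # one relevance score per document

# Show the top three documents by BM25 score
for score, doc in sorted(zip(scores, docs), reverse=True)[:3]:
    print(f"{score:.4f}  {doc}")
```

A hybrid score can then be formed as a weighted combination, for example taking something like 0.5 times the normalized TF-IDF score plus 0.5 times the normalized BM25 score. Let's get back to the next topic.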
Okay. So, in this lab, we're going to look at embedding models. We'll explore semantic search using embedding models, which are the foundation of modern RAG systems. Let's go to the first task. The first task is about keyword search limitations. First we navigate to the project, create a new virtual environment, and install the requirements. I go to the terminal and we set up the virtual environment. Our project is within this folder called rag project, and here we have the virtual environment being set up. Once the virtual environment is set up, the next step is to run the keyword limitation demo. If you go to
the rag project, you'll see the keyword limitation demo script. This is a simple script that searches for a word or keyword that does not exist in the documents, and it proves that pure keyword-based search is less likely to yield the right results. For example, in this case the query is "distributed workforce policies" and none of the documents contain something exactly like that. So, let's try running the script. If you look at the output, most of the scores are zero because the keywords "distributed workforce policies" don't really exist in any of the documents. So, the correct answer here is missing synonyms and context.
All right.
The next task is to install embedding dependencies. We go to the rag project, and we're already in that project. We source the virtual environment and install the embedding packages. I'm going to copy this command and run it. The packages are sentence-transformers, Hugging Face Hub, and OpenAI.
The next question is to run the local embedding script. The script name is semantic search demo, so let's look at it. Looking into this, we can see that the first step is loading the documents. Then we load the local embedding model, which is all-MiniLM-L6-v2. Then we generate embeddings by calling the model's encode method and passing in the docs. Then we have the query, which is the same query we used before, "distributed workforce policies," and we generate an embedding for the query as well. Then we calculate the similarities using NumPy and print the results. So let's run the script with uv run python on the semantic search demo.
Now, as you can see, on the same set of documents the script has identified the relevant documents whose meaning is closest to the "distributed workforce policies" query we are looking for. Each document is given a score, which means the script is able to identify the document with the closest semantic match. We'll go to the next question. The task is to look at the semantic search results and answer: what is the similarity score between the remote work policy and "distributed workforce policies"? If you look at the first score, it's 0.3982, and that is the score for the remote work policy.
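As a rough sketch of what the semantic search demo is doing (the document texts here are invented for illustration), the sentence-transformers flow looks like this:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

docs = [
    "Remote work policy: employees can work from home up to three days a week.",
    "Holiday schedule and paid time off policy.",
    "Expense reimbursement requires manager approval.",
]

# Load the local embedding model used in the lab
model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode documents and query; normalizing lets a dot product act as cosine similarity
doc_embeddings = model.encode(docs, normalize_embeddings=True)
query_embedding = model.encode("distributed workforce policies",
                               normalize_embeddings=True)

scores = np.dot(doc_embeddings, query_embedding)
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.4f}  {doc}")
```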
The next question is a multiple-choice question that basically confirms our learning. It is based on the comparison between semantic search and keyword search, that is, the TF-IDF and BM25 approaches we saw earlier: which approach better understands the meaning of queries? Of course, we know that semantic search understands the meaning of queries better. And that's basically it for this lab. In the next lab we'll explore vector databases.
I'm just going to go through a high-level overview of this lab and leave you to do most of it, but I'll explain how the lab functions. In this lab we're going to learn how to scale semantic search with vector databases. So let's get that going.
The first task is to simply understand the concepts. Before we start building, let's understand what vector databases are. We already discussed that in the video, but here's a quick description of what they are and what they can help us do. There's a question on what the primary advantage of using a vector database over storing embeddings in memory is, and I'll let you answer that yourself. The next step is to navigate to the project directory, which is right here. Then we again activate the virtual environment and install the embedding model package, which is sentence-transformers, as we also did in the last lab. The next step is to install the vector database. In this case we're going to use ChromaDB, so the task is to install the chromadb package.
Again, I'll just skip through that for now. The next task is to initialize a ChromaDB vector database. If you go here, there's a script called init vector db, and if you look into the script, we first import the chromadb package. We also have sentence-transformers. We then create the Chroma client using the chromadb.Client method and create a collection, which we'll call tech corp docs. Then we load the embedding model, which is the all-MiniLM-L6 model, and test it with a sample document. We have identified a test doc, which is really just the sentence given here. We then add the test document to the collection using the collection's add method, print the results, and print the count of documents within the collection, and that's basically it. So that's a quick beginner-level script.
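A stripped-down version of that init script might look roughly like this (the collection name and test sentence are placeholders, not the lab's exact values):

```python
import chromadb
from sentence_transformers import SentenceTransformer

client = chromadb.Client()  # in-memory Chroma client
collection = client.create_collection(name="techcorp_docs")

model = SentenceTransformer("all-MiniLM-L6-v2")

# Test the setup with a single sample document
test_doc = "TechCorp employees receive 20 days of paid vacation per year."
embedding = model.encode(test_doc).tolist()

collection.add(ids=["doc-1"], documents=[test_doc], embeddings=[embedding])
print("Documents in collection:", collection.count())
```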
In the next one, there are a couple of questions being asked, and you can answer them based on the results of the script. The next one is called store documents. This is where we store actual documents within the ChromaDB database. Again, this is another script that starts off and loads the model and client as we did before, but in this case we are reading the TechCorp documents using a helper method from the utilities module. That's what loads all the documents that are in the TechCorp docs folder. So now we're loading actual documents, and then we follow the same approach of adding those documents to the collection and verifying the collection. So again, it's just another layer on top of the basic script; in this case we're just storing documents. We'll continue to the next task. This is where we perform a vector search against the documents. The script this time is the vector search demo, so click on it. Here we have some sample documents, which are sentences, and then there's a query.
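Continuing the same sketch from the init step (the extra documents and the query are again invented), a vector search against the collection could look something like this:

```python
# Add a couple more documents so the search has something to rank
more_docs = [
    "Remote work is allowed up to three days per week.",
    "Expense reports must be submitted within 30 days.",
]
collection.add(
    ids=["doc-2", "doc-3"],
    documents=more_docs,
    embeddings=[model.encode(d).tolist() for d in more_docs],
)

# Embed the query with the same model and ask Chroma for the closest matches
query = "How many vacation days do employees get?"
results = collection.query(
    query_embeddings=[model.encode(query).tolist()], n_results=3)

# Chroma returns parallel lists of ids, documents, and distances
for doc_id, doc, dist in zip(results["ids"][0],
                             results["documents"][0],
                             results["distances"][0]):
    print(f"{dist:.4f}  {doc_id}  {doc}")
```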
Okay, what about when RAG isn't useful? In what kinds of scenarios should I avoid RAG? Because vector databases only work in a text space, when you want to store graphs, images, and charts, a vanilla vector embedding might not be a great option, since it can't capture what's actually contained in the image. So use cases where you have to search images and charts would not be the best fit for RAG, since it can't retrieve based on modalities other than text. Another
example is when you want to search by
the format of the document. For example,
a document can have multiple pages and
at each page there might be different
sections of the document and certain
sections might have different formatting
and tables of information. Meaning, when you want to search by a certain page of the document, or by where in the document something is located, RAG is not able to fulfill that, since it retrieves by the semantic meaning of the words rather than the position or format of the document, which can often be handled better by a vision model.
However, even though rag might not lend
itself to how the document is physically
formatted and preserve the physical
structure of the document, separating
the document into semantic structure is
actually very common practice in rag.
You might think, well, can I just store
the entire document in a vector database
and retrieve the whole document? While
this sentiment is certainly common in
regular SQL databases, where we can store objects row by row, things look slightly different in RAG. And here's
why. Large language models are typically limited in how much context they can hold, and this limitation is often referred to as the context window. This is why, when you go to ChatGPT and paste in a large PDF document, ChatGPT will say the document is too large. For the same
reason, we will need to chunk our
document in sections so that when we
retrieve the document, it doesn't
overload the LLM with the entire
document. And besides the context window, it's important to feed the LLM the correct context. And documents often
contain information that is irrelevant
to what we're actually trying to look
at. In other words, even if the context
window can fit the entire document,
chunking the document into its semantic
group is actually beneficial overall.
But how do we decide how to chunk the
document? One way to do this is by going
by fixed size chunking. You can chunk
the document in a naive way where you go
by certain character limit so that as
the document is being stored in the
database, each row essentially holds up
to say 5,000 characters per chunk. You
can also go by total number of words so
that each row contains up to certain
predefined number of words. Another way
is to go by tokens where instead of
chunking by a fixed size in original
language, you can actually chunk them by
the size of the tokens which is an
embedding representation of the
document. As you can see, fixed-size chunking is the simplest strategy, since you can just take an upper bound that's predefined up front. The same applies to sentence chunking or paragraph chunking, all based essentially on where you want to cut the document by some predefined threshold.
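As a tiny illustration of the fixed-size idea (with an overlap parameter thrown in, which previews the sliding-window approach discussed a bit later; the sample text is made up), a naive character-based chunker could be as simple as:

```python
def fixed_size_chunks(text, chunk_size=5000, overlap=0):
    """Split text into chunks of at most chunk_size characters,
    repeating the last `overlap` characters at the start of the next chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

handbook = "TechCorp employees may work remotely up to three days per week."
print(fixed_size_chunks(handbook, chunk_size=30, overlap=10))
```

But fixed-size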
chunking doesn't necessarily lead to the
best result in rag since it ignores the
semantic grouping of the document. In
other words, documents are often grouped
in sections or topics and fixed size
chunking ignores this and can abruptly
chunk the document regardless of what
the document might have contained. For
this reason, semantic chunking is
another popular method. Instead of
splitting the document into a fixed
number of characters, words, tokens, or
sentences, you can break the text in
where the meaning actually starts to
shift. This way, each row in the
database naturally respects the semantic
topics, which makes retrieving rowby row
very rich with context. One way to do this is by breaking the document into sentences and measuring the similarity between them, so that when the coherence between sentences starts to drop, which is an indication that the flow of context or topic has changed, you chunk the document at that point. Chunking the document this way ensures that the natural break points that exist in the document are preserved, so that you're not abruptly cutting the document just because of its size.
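A rough sketch of that similarity-based approach, using the same sentence-transformers model from earlier (the 0.5 threshold and the naive period-based sentence splitting are arbitrary choices for illustration):

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(text, threshold=0.5):
    """Start a new chunk whenever adjacent sentences drop below a similarity threshold."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    embeddings = model.encode(sentences, normalize_embeddings=True)

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(np.dot(embeddings[i - 1], embeddings[i]))
        if similarity < threshold:          # coherence dropped: topic likely shifted
            chunks.append(". ".join(current) + ".")
            current = []
        current.append(sentences[i])
    chunks.append(". ".join(current) + ".")
    return chunks

print(semantic_chunks(
    "Our pet policy allows dogs. Cats are also welcome. Expense reports are due monthly."))
```

The obvious drawback is that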
compared to naive approaches like fixed-size chunking, it adds engineering overhead to chunk and store the document based on its semantic flow. And one way to reduce
engineering overhead while keeping the
semantics between each row is by using
what's called an overlapping chunking or
sliding window technique. In this method, you process the document in a way that intentionally adds duplication between rows by adding an overlap between chunks, so that there's overlapping context coverage between each row. This way, every row contains a bit of semantic context from the previous and the subsequent row. But chunk overlap can be more art than science, since the exact amount of overlap can feel arbitrary and non-deterministic. But it is a great way
to ensure higher accuracy while still
trying to keep the engineering very
simple. While there are other methodologies that can make document chunking more flexible, one notable method that's becoming popular is agentic chunking, where you leverage AI to chunk the document for you. Essentially, you allow AI to pre-process the document, and since LLMs are extremely good at understanding text, instead of coming up with your own chunking method based on fixed sizes or natural breakpoints detected with a semantic algorithm like we just covered, we allow the AI to make the call for us. The
obvious drawback to this method is cost and speed, since you'll essentially have to front-load all the work, and any changes will require rerunning those agents to keep the chunking consistent. But using lower-grade or more affordable LLMs should do just fine, since the job is just splitting the document where it thinks the topic shifts.
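As a hedged sketch of what agentic chunking can look like in practice (the model name, prompt wording, and delimiter are all assumptions for illustration, not a prescribed recipe):

```python
from openai import OpenAI

client = OpenAI()

def agentic_chunks(text, model="gpt-4o-mini"):
    """Ask an LLM to split the text at topic boundaries, returning one chunk per topic."""
    instructions = (
        "Split the following document into coherent chunks, one per topic. "
        "Do not rewrite anything; return the original text with the marker "
        "'<<<CHUNK>>>' inserted between chunks.\n\n" + text
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": instructions}],
    )
    reply = response.choices[0].message.content
    return [chunk.strip() for chunk in reply.split("<<<CHUNK>>>") if chunk.strip()]
```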
Given that chunking is an important
concept to drill down, let's hop on to a
quick lab to get a more concrete
understanding.
Okay, in this lab we're going to look at chunking techniques. We'll learn how to optimize RAG performance by breaking documents into focused, searchable chunks.
So first we activate the virtual
environment. So this is something we
have already done many times. All right, so first we're going to look at the chunking problem demo script. If you expand the rag project, there should be a script called chunking problem demo. This script demonstrates the core problem of searching large documents in RAG systems. It creates a sample employee handbook and shows how searching for specific information, like internet speed requirements, returns the entire document instead of just the relevant section. So we'll see a large document stored as a single chunk, search queries that should find specific sections, and results that return the entire document. Here you can see there's a sample document that has multiple sections; we're adding that document to the ChromaDB collection and then querying for internet speed requirements.
So let's run the script and see
how it works.
The script runs now, and as you can see, it returns the entire document. It's truncated here, but the result shows the entire document. So that's the problem with this approach, and the answer to this question is that large documents return irrelevant results. Next we will look at some of the libraries and dependencies that we'll be using. First, we have what is known as LangChain. If you don't know what LangChain is, we have other videos on our platform, and we have an upcoming course that will cover LangChain end to end, so do remember to subscribe to our channel to be notified when it comes out. LangChain is a powerful framework for building RAG applications. It provides the recursive character text splitter for smart document chunking. There's also spaCy, which is an advanced natural language processing library; it provides a spaCy text splitter for sentence-aware chunking. So we'll use spaCy for sentence-aware chunking, and these libraries take care of chunk sizes, overlaps, separators, and so on. We'll install the LangChain and spaCy dependencies. Okay, we'll go to the next question, and we'll first look at basic chunking.
If you open the basic chunking script, you'll see that it uses the LangChain text splitters package, from which we import the RecursiveCharacterTextSplitter. Here we have a sample document, and this is where we do the splitting. As you can see, we specify a chunk size of 200 and a chunk overlap of 50, so 50 characters will overlap between chunks, along with some separators that are defined. We then call splitter.split_text to split the text into different chunks, and then we just go through the chunks and print them.
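In case you want to try it outside the lab, the core of that basic chunking script looks roughly like this (the sample text is made up; depending on your LangChain version the import path may be langchain.text_splitter instead):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

sample_document = (
    "TechCorp Employee Handbook. Remote work is allowed up to three days per week. "
    "Internet speed requirements: at least 50 Mbps for video calls. "
    "Expense reports must be submitted within 30 days of purchase."
)

splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,      # max characters per chunk
    chunk_overlap=50,    # characters shared between consecutive chunks
    separators=["\n\n", "\n", ". ", " "],
)

chunks = splitter.split_text(sample_document)
for i, chunk in enumerate(chunks, 1):
    print(f"Chunk {i} ({len(chunk)} chars): {chunk}")
```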
So I'll let you do that yourself. We'll
go to the next one and there's uh a
bunch of questions uh that are asked
that you can you have to read the script
and understand and answer. So I'll let
you do that uh by yourself.
The next one we'll look at is sentence chunking. In sentence chunking, if you look at the script, we're using spaCy as the library. And then we have a question that's based on the output of that script.
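Sentence-aware chunking with spaCy follows the same pattern; a minimal sketch (assuming the en_core_web_sm model is installed, and with an arbitrary per-chunk character budget) looks like this:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def sentence_chunks(text, max_chars=200):
    """Group whole sentences into chunks without ever splitting mid-sentence."""
    sentences = [sent.text.strip() for sent in nlp(text).sents]
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```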
And then finally we look at chunked search. This is another script that performs a chunked vector search demo that connects everything we have learned so far. First we chunk the documents, then we add these chunked documents to a collection, and there's a comparison between a collection with no chunking and a collection with chunking, so we can see the difference between the two. Again, I'll let you go through that by yourself, and there's a question based on that. In this lab, task six covers
agentic chunking. This is the most
advanced chunking method. Instead of
splitting by character count or sentence
boundaries, an AI model analyzes the
document and decides optimal split
points based on semantic topic shifts.
Before running this demo, you need to source the bash profile to load the API configuration. Run the source command like this, and then run uv run python agentic chunking demo.py. You will see
the AI analyzing the document for
natural topic boundaries and creating semantically coherent chunks, where each chunk contains one complete topic or
idea. The question asks about the main
advantage of agentic chunking. The
answer is that it splits based on
semantic meaning and topic shifts. This
produces the highest quality chunks but
comes with a trade-off of higher cost and slower processing, since it requires LLM API calls. To summarize, we learned four chunking techniques, from basic to advanced. Basic chunking splits by character count, overlap preserves context at boundaries, sentence-aware chunking respects natural language structure, and agentic chunking uses AI to understand document semantics. Different methods suit different use cases. For most applications, sentence-aware chunking with overlap is a good balance. For high-value documents where quality matters most, agentic chunking provides the best results. So
you might be wondering what real use
cases for rag might be. And here are a
few potential use cases that rag can be
extremely valuable to an organization. A
common use case is a law firm. In a
traditional mid-size law firm, you're
going to have millions and millions of
documents stored inside a document
management software. And given that
different files from various matters
contain different information, you want
to have a system like Rag that can run
comprehensive searches through large
sets of documents. And for extra
security, you can make sure that the
privacy of those search results is
contained within a specific case rather
than the whole by adding additional
search parameters in how you store the
documents in the vector database.
Another example is using a chatbot that
can search through a company's knowledge
base and policies to either help internal staff understand more about the company's policies, or even power client-facing chat applications that can directly help answer questions about the company. All of this can be done by leveraging RAG.
As you can see, the use cases for rag
can extend to cases like law firms and
chatbots, which means you're going to
rely on RAG more and more. And just like anything else, any system that is important or heavily relied on is going to need to be evaluated. Evaluation
allows us to look at the system and make
sure that they are working properly and
also detect when things aren't working
as they should. But how do we measure
rag? What are the best ways to evaluate
our rag system that we can set up? Here
are some ways that you can add
evaluation in your RAG system. On the retrieval side, you want high relevance, high comprehensiveness, and high correctness. In other words, because the
to make sure that each time that we
retrieve information from the data
source, we're able to measure whether
the results are relevant to the question or query that was asked. And the same thing applies to comprehensiveness: does the set of retrieved documents cover
all information that is needed to
actually answer the prompt? And finally,
are the retrieved documents actually
correct and ranked properly? What's
important to keep in mind is that your specific setup might require measuring different aspects. But here are some of the more common and established methods to keep in mind if you want to take further steps beyond how they're introduced here. The first is recall at
K. It asks: given all the relevant documents that exist in your system, how many did the retriever actually find within the top K results? Another one is precision at K,
which measures out of the top results,
how many were actually relevant, which
is slightly different than recall at K
since it measures the quality of the
actual result itself. MRR, or mean reciprocal rank, measures how high up the first relevant document appears in your results. And finally, NDCG, or normalized discounted cumulative gain, measures whether the retriever is ranking relevant documents higher than irrelevant ones. As you can see, there
are many different ways to approach the metrics side of things when it comes to RAG. They will help you ensure that the system you're setting up is actually working properly as it should, especially as you rely more and more on it. Let's hop onto a lab to see how to evaluate RAG.
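To make the formulas concrete before the lab, here is a small self-contained sketch of all four metrics (the document IDs are invented, chosen to mirror the desk-reimbursement example discussed later):

```python
import math

def precision_at_k(retrieved, relevant, k):
    """Of the top K retrieved documents, what fraction is relevant?"""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Of all relevant documents, what fraction shows up in the top K?"""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(retrieved, relevant):
    """1 / position of the first relevant document (0 if none found)."""
    for pos, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1 / pos
    return 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Reward relevant documents that appear near the top of the ranking."""
    dcg = sum(1 / math.log2(pos + 1)
              for pos, d in enumerate(retrieved[:k], start=1) if d in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1 / math.log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

retrieved = ["policy_3", "policy_4", "policy_1"]   # hypothetical ranking
relevant = {"policy_3", "policy_1"}
print(precision_at_k(retrieved, relevant, 3))  # 0.667
print(recall_at_k(retrieved, relevant, 3))     # 1.0
print(mrr(retrieved, relevant))                # 1.0
print(ndcg_at_k(retrieved, relevant, 3))       # ~0.92
```

In this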
lab, we're going to learn how to
evaluate RAG systems using four
key metrics. When you build rag systems,
you need to measure how well it
retrieves relevant documents. We will
implement precision at K, recall at K,
mean reciprocal rank, and normalized
discounted commulative gain. Each metric
answers a different question about your
retrieval quality. This lab takes about
20 to 30 minutes to complete. The first
step is to set up our environment. In
this task, we're asked to
navigate to the project directory and
activate the virtual environment. Run
the following command to change the
directory to rag project and then source
the virtual environment and activate
it. After that, run python verify environment.py. This script will automatically install all the required packages, including chromadb, sentence-transformers, and the LangChain text splitters. It will also pre-download the embedding model so we do not face any timeout issues later. Wait until you see the environment setup completed message. Next, we have an informational section about ground truth data. Ground truth is
essential for evaluation because we need
to know what documents should be
retrieved to measure how well our system
actually performs. The file
contains six test queries mapped to
relevant document IDs. Feel free to run python ground_truth.py to see the data. Then click "got it" to
continue. Now we move to task number one
which is about precision at K. In this
question, we learn the theory behind
precision. Precision at K measures the
top K documents retrieved. How many are
actually relevant? The formula is
simple. Number of relevant docs in the
top K divided by K. This is important
because users hate irrelevant results.
If you search for vacation policy and
get five results, but only two are about
vacation, you're wasting time reading
three irrelevant documents. The
question asks, if you retrieve five
documents and two are relevant, what is
precision at five? The answer is 0.4
because 2 / 5 equals 0.4.
In the next task, we're asked to
implement precision at K. Open the file
1 precision atk.py.
You need to complete two TODOs. At line 50, replace the None with the count of relevant documents found in the top K results. At line 56, replace the None with the precision calculation, which is
the count divided by k. The hints are in
the comments. Run the script with uv run python one precision at k.py and verify you see the task one completed message. After running the script, we
have a question about the results.
Looking at your output for query about
travel expense policy, the rag retrieved
three documents, but only one is
relevant. So precision at three equals
0.333. This shows that two of the three
retrieved documents are noise. Moving to
task number two, which covers recall at
K. Recall measures among all the
relevant documents that exist for a
query, how many appear in the top K
retrieved results. This is different
from precision. Recall is about coverage
while precision is about quality.
Imagine searching for documents about a
legal case. Missing even one relevant
document could mean missing critical
evidence. The question asks if there are
four relevant documents and you found
three in your top five. What is recall
at five? The answer is 0.75 because 3 /
4 equals 0.75. Now we implement recall
at K. Open two recall at K.py
and complete the two to-dos. At line 58,
count how many relevant docs are found
in top K. In line 64, calculate recall
by dividing the found count by the total
number of relevant documents. Run the
script and verify completion. The
results question for recall is
interesting. For the query about home
office equipment reimbursement with K
equals 2, there are three relevant
documents total. Even if both top two
results are relevant, we can only find
two out of the three. So recall at 2
equals 0.667. This demonstrates the
trade-off: smaller K means potentially lower recall. Task three introduces mean reciprocal rank, or MRR.
This metric measures how high up the first relevant result appears. In Q&A systems, users typically only look at the first result. If the answer is buried at position five, users might give up. The formula is 1 divided by the position of the first relevant document. Position one gives an MRR of 1, position two gives an MRR of 0.5, and position three gives 0.333.
The theory question asks about a
scenario where the first relevant
document is at position three. The
answer is 0.33. For the implementation,
open three mean reciprocal rank.py. The
loop already finds the position of the
first relevant document. At line 59,
store this position in a variable. At line 65, calculate MRR by taking one divided by the position. Run the script to complete the task. The MRR results question is a bit different. When you look at the output, both test queries show MRR equals 1. The question asks why all queries get a perfect score. The answer is that the RAG system always found the first relevant document at position one. This is actually a good thing. It
shows the semantic search is working
well. Finally, we have task 4 covering
NDCG, or normalized discounted cumulative gain. This is the most sophisticated metric. Unlike MRR, which
only looks at the first result, NDCG
evaluates the entire ranking. It rewards
putting relevant documents at the top
and penalizes burying them lower.
Documents at higher position get more
credit. Position one gets the full
credit. Position two gets about 63%,
position 3 gets about 50%. The formula
involves logarithms. NDCG of one means
perfect ranking. The theory question
asks about two relevant docs at position
one and position two. The answer is one, because that is the ideal ranking. For the implementation, open four normalized DCG.py. At line 57, calculate DCG by
summing 1 / log base 2 of position + 1
for each relevant document. At line 63,
calculate IDCG, which is the ideal DCG
assuming all relevant documents are at
the top positions. Use math.log2 for the calculations. Run the script to
complete the task. The NDCG results
question examines a real output for the
desk reimbursement query. The retrieved
order is policy 3, policy 4, policy 1.
Two documents are relevant. Policy 3 and
policy 1. Policy 3 is in position one,
which is good. Policy 1 is at position
three instead of position two. An irrelevant document, policy 4, sits at position two. This is why NDCG equals
0.92 instead of one. The ranking is good
but not perfect. Before wrapping up, we
implemented four evaluation metrics.
Precision at K tells you how much noise is in your results. Recall at K tells you how much coverage you have. MRR tells you how quickly users find the first relevant result. And NDCG measures overall ranking quality. Each metric has its own use cases. Use precision when you want to minimize noise. Use recall when you cannot afford to miss documents. Use MRR for Q&A systems where only the top result matters. And use NDCG for search engines where the order of all results is important. Now that we
of all results is important. Now that we
covered some basic concepts when it
comes to RAG, let's look at some more
emerging concepts and techniques when it
comes to RAG. One of the biggest
challenges when it comes to RAG is
redundancy. What I mean by that is that
while rag as a system works totally fine
serving what's being asked for, users
tend to ask either the same questions or
similar questions over and over again.
And at that point, it starts to add
redundancy which can certainly be
removed. But what if we can store these
in a cache and basically add a memory
layer? The theory behind this is simple.
You can basically have the system first
check the cache and see if the prompt
being asked could realistically be answered by what's already stored in the cache, and only tap into the existing RAG pipeline if the cache
doesn't seem to be sufficient. This kind of setup is called CAG, or cache-augmented generation. The one important thing to keep in mind is how you actually manage this cache so that it invalidates properly. CAG can be extremely good for content that doesn't change often. But if the underlying data set constantly changes and the cache essentially needs to be refreshed every time so that the newest information is fed into the model, the use cases for CAG start to diminish.
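A very simplified sketch of that check-the-cache-first flow (the cache here is a plain dictionary keyed by the normalized prompt, and rag_pipeline is a stand-in for whatever retrieval-plus-generation function you already have):

```python
cache = {}  # maps a normalized prompt to a previously generated answer

def rag_pipeline(prompt):
    """Stand-in for your existing retrieve-then-generate pipeline."""
    return f"(answer generated via retrieval for: {prompt})"

def answer_with_cache(prompt):
    key = prompt.strip().lower()
    if key in cache:               # cache hit: skip retrieval and generation entirely
        return cache[key]
    answer = rag_pipeline(prompt)  # cache miss: fall back to the normal RAG pipeline
    cache[key] = answer            # remember it; invalidate when the source data changes
    return answer

print(answer_with_cache("What is our pet policy?"))  # goes through the RAG pipeline
print(answer_with_cache("What is our pet policy?"))  # served straight from the cache
```

Another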
emerging and popular use case of RAG is agentic RAG. The RAG pipeline that we covered in the video is single-shot, meaning it's a one-time event: you ask a question, RAG retrieves and generates the response, and it stops. But in most systems, and especially in agentic systems, instead of you being the driver, the agent itself is what actually initiates these requests. And the biggest difference here is that instead of using RAG as a task-based system, with agentic RAG you can now use it as a goal-based system, where you allow the model to formulate how to actually
get the relevant data that you're
looking for. As much as agents are
becoming more popular, the use cases for
agentic rag are actually more niche than
most might expect. And that's because
agentic rag tends to be slower than
traditional rag since it has to perform
more than one search to make sure that the facts contained in the vector database are extracted in the most optimal way that aligns with the goal. And it's for this reason that agentic RAG should be used in cases where slower responses are tolerated in exchange for potentially higher-quality data.
Another emerging concept here has to do
with how to actually cast a wider net
given a single question by leveraging
LLMs. In a conventional rag pipeline,
your prompt is taken and the burden of
actually decomposing your prompt to find
relevant context in the vector database
through embedding is done by the rag
system itself. But what if we want to
make sure that we can cast a wider net?
For example, if the user asked the
question, what are the security risks of
using rag with our customer data? Even
though the question itself is totally
and completely normal, depending on how
the question is phrased, the quality of
the results might be different. In other
words, the phrase security risk can be
phrased differently like risk of data
leakage or maybe risks of unauthorized
access and even risks of prompt
injection and so on. As you can see,
these variants are more specific cases and examples of the original question, just phrased differently. Multi-query RAG takes advantage of this by using an LLM to produce different variations of the original prompt, running them all through the RAG pipeline, and then merging and deduplicating the results before answering.
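A minimal sketch of that flow (the model name and prompt wording are assumptions, and retrieve is a stand-in for your existing vector-database search):

```python
from openai import OpenAI

client = OpenAI()

def retrieve(query):
    """Stand-in for a single vector-database search returning a list of text chunks."""
    return []  # plug in your collection.query(...) call here

def multi_query_rag(question, n_variants=3, model="gpt-4o-mini"):
    # Ask the LLM for alternative phrasings of the question
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content":
                   f"Rewrite this question {n_variants} different ways, one per line:\n{question}"}],
    )
    variants = [question] + [v.strip() for v in
                             response.choices[0].message.content.splitlines() if v.strip()]

    # Run retrieval for every variant, then merge and deduplicate the chunks
    merged = []
    for variant in variants:
        for chunk in retrieve(variant):
            if chunk not in merged:
                merged.append(chunk)
    return merged
```

Now, similar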
to the previously mentioned agentic rag,
the performance may be slower than a conventional RAG pipeline, and the results may be noisier than what the user asked for, especially if the intent was to only learn about a specific idea rather than its variations. However, multi-query RAG can be an extremely valuable tool for more generic use cases that might benefit from a setup like this. Another use case that is emerging is what's called hierarchical RAG, and this is a really cool way to map your information. Most
cool way to map your information. Most
documents, especially in corporate
settings, are inherently hierarchical.
They're organized by different
categories like company vision,
strategy, goals, or even hierarchically
like company, division, department, and
team, and so on. But the nature of RAG actually flattens these hierarchical structures and stores them as individual chunks that go into the database. As you can see, losing this hierarchical view of how information is grouped is something that can be improved upon. Hierarchical RAG tries to preserve it, so that when you retrieve information from the database, the first, coarse level of information is checked before it goes deeper into the hierarchy.
This kind of setup reduces the risk of
using far less relevant and far less
important detail. And a similar analogy
is like going from the globe to
continent to country to province and
then to a city instead of looking at all
of them all at once. But it's important
to keep in mind that this adds an
engineering overhead in not only setting
it up this way but also in maintaining
the structure as new information enters
into the system. Finally, we have what's
called multimodal rag. This is where we
extend beyond simple text but get into
other modalities like images, charts,
diagrams, and screenshots. Data is often stored in these modalities, and supporting them is going to be extremely important, since most documents contain images, diagrams, and charts, especially in corporate settings. Giving your RAG system the ability to search over them and retrieve relevant information stored in these modalities can be an extremely important feature to add. The idea is to use models that can understand both text and images by converting the input into an embedding, in the same way that we converted our text data into vector embeddings. While
the actual method in theory behind how
to actually set up a multimodal rag
system is certainly beyond a crash
course, being able to go beyond simple
text is going to be important for you to
know as you try to implement a rag
system for your project and your
company. So we just covered the extremely important concepts of RAG, starting from how LLMs extend their knowledge with external data all the way to more advanced concepts like CAG, agentic RAG, and more. And the biggest idea that I want to get across is that while RAG might appear as if it's extending the LLM's long-term memory, there's actually a lot more technique involved in fragmenting your facts and data into
retrievable chunks so that LLMs can
leverage your knowledge base stored in a
vector database to better assist your
use cases.