Retrieval augmented generation is the
way to give your AI agents the ability
to search and leverage your knowledge
and documents, but it feels like there
are a million strategies for RAG out
there. How do you know which is best for
your use case? What even are all the RAG
strategies that we can pick from? And
should we be combining some together? It gets overwhelming pretty fast when you try to optimize a RAG system for your use case. But don't worry, because I've got you covered, answering all of those questions for you in this video. Now, beware: this is a very informationally dense video. I want to try a new format where I have shorter content that's super value-packed for you.
So, as I'm going through these RAG strategies, please let me know in the comments if there's one you want me to make a dedicated video for. And if I already have one, I'll link to that video when I'm covering the strategy. The goal I have for you right now is just to get you started thinking about the strategies that will apply to your use case and how you can combine them, because usually the optimal solution is going to combine around three to five
RAG strategies. So hopefully, going into this, you understand RAG at least at a high level. We have our data preparation phase and the actual retrieval augmented generation phase. For data preparation, we take our documents, chunk them up into bite-sized pieces of information for our LLM, and embed them to put into our vector database, or potentially a knowledge graph, which I'll talk about later as well. Then for our query process, this is where we take a question from a user, like "what are the action items from the meeting?", which is a very common RAG use case. We embed that query and then search our vector database to find similar chunks, which we pass into the large language model so it can leverage them as extra context to augment its answer. That's why it's called retrieval augmented generation; the answer comes back as, for example, "the action items from the meeting are X, Y, and Z." So that's RAG at a high level, but there are so many different ways to do the data preparation, with different chunking strategies, and a million ways to search the vector database, and we can even store information in alternate formats like a knowledge graph.
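To make that base flow concrete, here's a minimal sketch in Python, assuming a Postgres table called chunks with a pgvector embedding column and using OpenAI's client for the embedding and chat calls; the table name, model names, and connection string are placeholder choices, not anything from the repo:

```python
# Minimal RAG flow: embed the query, find similar chunks in Postgres/pgvector,
# then pass them to the LLM as extra context. Names here are placeholders.
from openai import OpenAI
import psycopg2

client = OpenAI()

def embed(text: str) -> str:
    vec = client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding
    return "[" + ",".join(map(str, vec)) + "]"   # pgvector's text format

def answer(question: str) -> str:
    with psycopg2.connect("postgresql://user:pass@host/db") as conn, conn.cursor() as cur:
        # Find the 5 chunks closest to the query embedding (<=> is pgvector's cosine distance)
        cur.execute(
            "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT 5",
            (embed(question),),
        )
        context = "\n\n".join(row[0] for row in cur.fetchall())

    # Augment the LLM's answer with the retrieved chunks as extra context
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(answer("What are the action items from the meeting?"))
```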
And so that is what I'm going to cover with you here, starting with number one: re-ranking. This is the first strategy that I use for almost every RAG implementation. As a resource to go along with all of these strategies, I have a GitHub repo that I'll link to below. It has a README that dives deeper into all 11 of the strategies we have today, some research in the docs folder, and pseudocode examples for you to reference. There's also a full implementation that's not production ready, since it's not ideal to try to combine as many strategies as possible; I'm including it just as a reference for you. Feel free to give it to an AI coding assistant to use as a starting point as well.
And so, with that: re-ranking. With this strategy, we have a two-step retrieval. First, we pull a large number of chunks from our vector database. Then we use a specialized re-ranker model, often a cross-encoder, to find the ones that are actually relevant to our query and return just some of those. In the end, the large language model only gets a few of the chunks, but they're the most relevant ones. This is so important because if we went to our LLM and just gave it 20, 50, or more chunks right away, we would completely overwhelm it. By having this specialized model consider more context and then reduce it for the LLM, we're able to take more knowledge into account without overwhelming the model. It is going to be slightly more expensive because of the second model, but not by much. I love using re-ranking in most of my RAG implementations, and I've got a code example you can pause and take a look at right here if you're interested; I'll have this for each of the strategies that I cover.
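As a reference, here's a minimal sketch of that two-stage setup using the CrossEncoder class from sentence-transformers; the vector_search helper and the model choice are placeholder assumptions:

```python
# Two-stage retrieval: over-fetch with vector search, then let a cross-encoder
# re-rank and keep only the best few chunks for the LLM.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve(query: str, top_k: int = 5) -> list[str]:
    candidates = vector_search(query, limit=50)    # hypothetical helper: stage 1, cast a wide net
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]  # stage 2: only the most relevant survive
```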
Next up, we have agentic RAG. I've covered this a ton on my channel before; link to a video right here. It's all about giving our agent the ability to choose how it searches our knowledge base. Maybe it can do a classic semantic search, but also, if it wants to, it can read the entire text of a single document. I'll show you this right now in a live project. So I'm here in my Neon dashboard, which is quickly becoming my go-to for Postgres, and I love using Postgres with pgvector for most of my RAG AI agents. I have one table here for our chunks and another table that stores the higher-level information for each individual document, and my agent can pick and choose where it searches based on the question. This makes RAG very flexible, but it is going to be less predictable as well. So you want to incorporate agentic RAG when you have very clear instructions for when you want the agent to search the knowledge in the different ways you give it. And here is a code example if you want to pause and take a look at that.
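Here's a rough sketch of that setup: two retrieval tools over a chunks table and a documents table (hypothetical names mirroring the setup above) that you'd register with whatever agent framework you're using, so the agent can choose between them:

```python
# Agentic RAG: expose several retrieval tools and let the agent decide which one to call.
# Table and column names are placeholder assumptions, not a specific schema from the video.
import psycopg2

DSN = "postgresql://user:pass@host/db"

def semantic_search(query_embedding: list[float], limit: int = 5) -> list[str]:
    """Tool 1: classic similarity search over individual chunks."""
    vec = "[" + ",".join(map(str, query_embedding)) + "]"
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (vec, limit),
        )
        return [row[0] for row in cur.fetchall()]

def read_full_document(document_id: str) -> str:
    """Tool 2: read the entire text of one document when the agent needs full context."""
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute("SELECT content FROM documents WHERE id = %s", (document_id,))
        row = cur.fetchone()
        return row[0] if row else ""

# Register both functions as tools with your agent framework; the system prompt should
# spell out when to use each one, since that's what keeps agentic RAG predictable.
```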
Next up, we have knowledge graphs, another thing I've covered a lot in my content; link to a video right here. Here we're combining traditional vector search, which is what I showed in the diagram, with a new type of database: a graph database that stores entity relationships. So our agent can not only do similarity search, it can also traverse the relationships between all of the entities in our knowledge. You generally end up with a graph that looks like this, and you're usually using a large language model to build it, extracting the entities and relationships from the raw text you feed in. Knowledge graphs are fantastic for interconnected data, but keep in mind that since we're usually using an LLM to extract from documents, it's going to be a lot slower and more expensive to create them. So take a look at this: this is the pseudocode if you want to see an example using Graphiti, which is my favorite library for working with knowledge graphs.
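Here's a minimal sketch in the spirit of that pseudocode, using Graphiti's add_episode and search pattern; the exact parameters and connection details are assumptions on my part, so check the current Graphiti docs:

```python
# Sketch of building and querying a knowledge graph with Graphiti.
# Method names follow Graphiti's documented add_episode/search pattern, but the
# exact keyword arguments here are assumptions - verify against the current API.
import asyncio
from datetime import datetime, timezone
from graphiti_core import Graphiti

async def main():
    # Graphiti uses Neo4j as the backing graph database
    graphiti = Graphiti("bolt://localhost:7687", "neo4j", "password")

    # Ingest raw text; an LLM extracts entities and relationships behind the scenes
    await graphiti.add_episode(
        name="meeting-notes-2024-06-01",
        episode_body="Alice agreed to ship the billing migration by Friday...",
        source_description="meeting transcript",
        reference_time=datetime.now(timezone.utc),
    )

    # Hybrid search over the graph: returns facts/edges relevant to the query
    results = await graphiti.search("What did Alice commit to?")
    for result in results:
        print(result.fact)

asyncio.run(main())
```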
Next, we have contextual retrieval, which is something Anthropic has done a lot of research on; they have some very enticing statistics for how much it helps the overall retrieval process. What we're doing here is using a large language model to enrich each chunk with information, placed at the start of the chunk, that describes how it fits with the rest of the document. So, back in my Neon dashboard, I'll show you what this looks like in a real database. For all of the chunks I have stored here, if I click into any one of them, take a look at this: we have text prepended that describes how this specific chunk fits with the document, then a triple dash, and then the content of the actual chunk. This is embedded along with the rest of the information for every chunk that we have, so there's just more context with everything we store. But we're now using a large language model to create every chunk, so it's going to be a lot slower and more expensive, like knowledge graphs.
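As a sketch, this is roughly how those prepended contexts get generated at indexing time; the prompt wording and model names are placeholder choices:

```python
# Contextual retrieval (indexing side): ask an LLM to describe how each chunk fits
# into its document, prepend that context plus "---", then embed the combined text.
from openai import OpenAI

client = OpenAI()

def contextualize_chunk(document: str, chunk: str) -> str:
    prompt = (
        f"Here is a document:\n{document}\n\n"
        f"Here is one chunk from it:\n{chunk}\n\n"
        "Write one or two sentences situating this chunk within the overall document."
    )
    context = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    return f"{context}\n---\n{chunk}"            # this combined text is what gets embedded

def embed_for_storage(document: str, chunk: str) -> list[float]:
    enriched = contextualize_chunk(document, chunk)
    return client.embeddings.create(model="text-embedding-3-small", input=enriched).data[0].embedding
```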
Next is query expansion. This is one of the simplest. All we're doing here is taking the user query and, before we send it into the search, using a large language model to expand it, making it more specific in ways we know will lead to pulling more relevant chunks from the knowledge base. We define instructions for how to improve precision by adding more relevant details. Obviously, the trade-off is that it's going to be slower, because we have an extra large language model call for every single search we perform.
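Here's a minimal sketch of that single expansion call; the instructions in the system prompt are the part you'd tailor to your own knowledge base:

```python
# Query expansion: one LLM call rewrites the user query with more specific detail
# before it ever hits the vector database.
from openai import OpenAI

client = OpenAI()

def expand_query(user_query: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Rewrite the user's search query to be more specific. Add relevant "
                "synonyms, entities, and technical terms. Return only the rewritten query."
            )},
            {"role": "user", "content": user_query},
        ],
    )
    return response.choices[0].message.content

# "what were the action items?" might come back as something like
# "action items, decisions, and follow-up tasks assigned in the team meeting"
expanded = expand_query("what were the action items?")
```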
And another simple, kind of similar RAG strategy is multi-query RAG. Instead of using a large language model to expand upon one query, we're using an LLM to generate multiple different variants and sending them into our search in parallel. It gives us more comprehensive coverage, obviously at the cost of having an LLM call before each search again, plus more database queries overall. And so here is a quick code example.
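Here's a rough sketch of that: generating variants, searching with all of them in parallel, and de-duplicating the merged results (vector_search is the same hypothetical helper as before):

```python
# Multi-query RAG: generate several variants of the question, search with all of them
# in parallel, and merge the de-duplicated results before handing them to the agent.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()

def generate_variants(query: str, n: int = 3) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Write {n} different rephrasings of this search query, one per line:\n{query}"}],
    )
    return [line.strip() for line in response.choices[0].message.content.splitlines() if line.strip()]

def multi_query_search(query: str) -> list[str]:
    variants = [query] + generate_variants(query)
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda q: vector_search(q, limit=10), variants))  # hypothetical helper

    seen, merged = set(), []
    for chunk in (c for results in result_lists for c in results):
        if chunk not in seen:
            seen.add(chunk)
            merged.append(chunk)
    return merged
```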
Now, on to context-aware chunking. This one's a little different because, up until this point, we've only been talking about strategies for the query process, but it's also important to have solid strategies for data preparation. This is speaking to how we split up our documents to put into our knowledge base, because we definitely need to split them: if we don't break our documents into bite-sized pieces of information, our embeddings are inaccurate and our agents pull way too much information. But when we split, we want to make sure we maintain the document structure. So what we're doing here is using an embedding model to find the natural boundaries in our document so that we can split there; it's fast and cheap, and we maintain our document structure. It's obviously more complex than just splitting every 1,000 characters or something like that, but I find it to be very, very worth it. Docling is a library I use in Python that makes it very easy to implement hybrid chunking, which is a form of context-aware chunking; I've got a video on that right here.
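Here's a minimal sketch with Docling's HybridChunker; I'd double-check the import path and the serialization call against the current Docling docs, since these have shifted between versions:

```python
# Hybrid (context-aware) chunking with Docling: convert the document, then let the
# chunker split along the document's natural structure instead of fixed character counts.
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

doc = DocumentConverter().convert("report.pdf").document
chunker = HybridChunker()  # optionally pass a tokenizer matched to your embedding model

for chunk in chunker.chunk(doc):
    # serialize() prepends heading/section context so each chunk carries its place in
    # the document; the exact method name may vary by Docling version
    text_to_embed = chunker.serialize(chunk=chunk)
    print(text_to_embed[:120], "...")
```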
Now, for another chunking strategy, we have late chunking. Full disclosure: this is the only one I haven't used myself. It's also definitely the most complicated, but I wanted to include it here because I think it's fascinating. The idea is that we apply the embedding model to the document before we chunk it, unlike most chunking strategies, and then we chunk up the token embeddings. What we get out of this is that each chunk still maintains the context of the rest of the document. So it leads to better preservation of full document context, and it leverages long-context embedding models. Of course, the trade-off is that it's a lot more complex. In fact, you might even be thinking to yourself, "Cole, whoa, this is insane, what are you even talking about?" Well, just let me know if you want me to make a video on late chunking specifically, like I said, for any of these strategies.
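To give you a feel for it anyway, here's a rough sketch of late chunking with Hugging Face transformers; the model choice and the fixed-size token windows are placeholder assumptions, and a real implementation would align chunk boundaries with sentences:

```python
# Late chunking: embed the WHOLE document first so every token "sees" full context,
# then split the token embeddings into chunks and pool each span into one vector.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "jinaai/jina-embeddings-v2-base-en"      # any long-context embedding model
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True)

def late_chunk(document: str, window: int = 256):
    # 1) Run the full document through the embedding model once
    inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        token_embeddings = model(**inputs).last_hidden_state[0]    # (num_tokens, dim)

    # 2) Only now split: pool contiguous spans of token embeddings into chunk vectors
    chunks = []
    for start in range(0, token_embeddings.shape[0], window):
        span = token_embeddings[start:start + window]
        chunk_text = tokenizer.decode(inputs["input_ids"][0][start:start + window])
        chunks.append((chunk_text, span.mean(dim=0)))              # mean-pooled chunk vector
    return chunks
```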
Next, we have hierarchical RAG. The idea here is that we have different layers of our knowledge stored in our database. We can have parent-child chunk relationships, and generally we store these relationships as metadata on all of our chunks. So we can search small to be very precise, like searching individual paragraphs, but then pull the entire document for a specific chunk that we find. We're balancing precision (searching small) with context (returning big). You could argue that hierarchical RAG is sort of a subset of agentic RAG; this sounds very similar to what I was showing you in Neon earlier. Going back to Neon really quick, I'll actually show you this. Let's say our search finds this chunk right here. We can look at the metadata and see that this chunk came from this specific file, so then we can go to the document metadata table and pull the content of that entire file. For a system where you want to do precise search but then look at larger sets of context, assuming your documents aren't too big to read in full, this is an awesome approach. It obviously adds more complexity and a little bit of unpredictability, like agentic RAG, but it's a very powerful strategy as well.
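As a sketch, that "search small, return big" lookup can be as simple as this; the table and column names are assumptions mirroring the two-table setup I showed:

```python
# Hierarchical RAG: search precise chunks first, then use their metadata to pull the
# full parent document for broader context. Schema names here are placeholders.
import psycopg2

def search_small_return_big(query_embedding: list[float]) -> list[dict]:
    vec = "[" + ",".join(map(str, query_embedding)) + "]"
    with psycopg2.connect("postgresql://user:pass@host/db") as conn, conn.cursor() as cur:
        # Precision: match against individual paragraph-sized chunks
        cur.execute(
            "SELECT content, metadata->>'document_id' FROM chunks "
            "ORDER BY embedding <=> %s::vector LIMIT 3",
            (vec,),
        )
        hits = cur.fetchall()

        results = []
        for chunk_text, document_id in hits:
            # Context: pull the whole parent document the chunk came from
            cur.execute("SELECT content FROM documents WHERE id = %s", (document_id,))
            parent = cur.fetchone()
            results.append({"chunk": chunk_text, "full_document": parent[0] if parent else None})
        return results
```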
Next is self-reflective RAG. After the last couple, I just want to show you another simple one, because all we have here is a self-correcting search loop. We perform our initial search, then we call upon a large language model, given the chunks and the question, to produce some kind of grade, maybe on a one-to-five scale. If it's less than three, for example, then we call the RAG tool again with a refined search to try to get more relevant chunks. So it's self-correcting, just at the cost of more LLM calls, because after every search we now need to call into a secondary LLM before returning chunks to our agent, and then potentially retry.
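Here's a minimal sketch of that grading loop, reusing the hypothetical vector_search and expand_query helpers from earlier; the one-to-five scale and the threshold of three are just the example numbers from above:

```python
# Self-reflective RAG: grade the retrieved chunks with an LLM and retry the search
# with a refined query if the grade is too low.
from openai import OpenAI

client = OpenAI()

def grade(question: str, chunks: list[str]) -> int:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": (
            f"Question: {question}\n\nRetrieved chunks:\n" + "\n---\n".join(chunks) +
            "\n\nOn a scale of 1-5, how relevant are these chunks to the question? Reply with a single digit."
        )}],
    )
    return int(response.choices[0].message.content.strip()[0])

def self_reflective_search(question: str, max_retries: int = 2) -> list[str]:
    query = question
    for _ in range(max_retries + 1):
        chunks = vector_search(query, limit=5)    # hypothetical helper from earlier
        if grade(question, chunks) >= 3:
            return chunks                         # good enough, hand these to the agent
        query = expand_query(question)            # refine the search and try again
    return chunks
```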
Last but not least, we have fine-tuned embeddings. This applies both to embedding during the query process and during indexing, because what you can do with embedding models, just like with large language models, is fine-tune them on a domain-specific dataset, like legal or medical. From my research, you can see 5 to 10% accuracy gains, and you can make it so that smaller embedding models, even open source ones, outperform larger, more generic ones on your specific use case. Now, this requires a lot of data to train on, plus infrastructure and ongoing maintenance, since it is your embedding model. But it's very powerful when you have a dataset you can use to train a model. For example, you might have a use case where you want similarity to be based more on sentiment than on the semantic similarity of the text. For a pre-trained embedding model, "my order was late" is going to be similar to "shipping was fast", right? Because both are about the order itself rather than the individual items. But you can fine-tune the embedding model so that "my order was late" is most similar to "items are always sold out", because now it's based more on sentiment. You can use a sentiment-based training set to make your embedding model operate like this instead.
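For reference, here's roughly what fine-tuning an open source embedding model on sentiment pairs looks like with the classic sentence-transformers training API; the tiny dataset here is just a stand-in for the large labeled set you'd actually need:

```python
# Fine-tuning an embedding model so similarity tracks sentiment rather than topic.
# Real training needs thousands of labeled pairs; these few only show the shape.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")   # small open source base model

train_examples = [
    # label = how similar we WANT the pair to be (1.0 = same sentiment)
    InputExample(texts=["my order was late", "items are always sold out"], label=0.9),
    InputExample(texts=["my order was late", "shipping was fast"], label=0.1),
    InputExample(texts=["great customer support", "the staff was very helpful"], label=0.95),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=4, warmup_steps=100)
model.save("sentiment-embeddings-v1")
```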
So there you go. That is the rundown I have for you on all the main RAG strategies and their pros and cons. If you want to dive deeper into any of them, again, check out this repository. I've got all these examples with pseudocode, focusing on using Postgres with pgvector, because, especially with Neon, that is my go-to right now for my RAG agents. And one last golden nugget to leave you with: if you want to focus on three RAG strategies to start (remember, I recommend combining three to five for the most accurate use cases), I would look at re-ranking, agentic RAG, and context-aware chunking. Specifically, hybrid chunking with Docling has been killing it for me. That's my super tactical recommendation to end things off. So, with that, if you appreciated this video and you're looking forward to more on AI agents and RAG, I'd really appreciate a like and a subscribe. And with that, I will see you in the next video.
Retrieval Augmented Generation is THE way to give your AI agents the ability to search and leverage your documents and knowledge. But there are a million strategies for RAG out there - how do you know what is ideal for your use case? What even are all of the RAG strategies I can pick from? Should I combine some together? I've got you covered with all of these questions in this video. Beware, this is very informationally dense, but it'll get you started thinking about which strategies to apply to your use cases. Also let me know in the comments which RAG strategies you want me to cover more on my channel!

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- If you want to master AI coding assistants and learn how to build systems for reliable and repeatable results, check out the new Agentic Coding Course in Dynamous: https://dynamous.ai/agentic-coding-course

- Neon is the Postgres platform I used to showcase real RAG data - check it out here: https://get.neon.com/jR4SxEE

- Here is the repo that outlines all the RAG strategies with examples: https://github.com/coleam00/ottomator-agents/tree/main/all-rag-strategies

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

00:00 - Intro: The Importance of RAG
01:08 - RAG Explained in 1 Minute
02:22 - Resources for Each RAG Strategy
02:51 - RAG Strategy #1 - Reranking
03:46 - RAG Strategy #2 - Agentic RAG (includes Hybrid Search)
04:42 - RAG Strategy #3 - Knowledge Graphs
05:33 - RAG Strategy #4 - Contextual Retrieval
06:26 - RAG Strategy #5 - Query Expansion
06:56 - RAG Strategy #6 - Multi-Query RAG
07:22 - RAG Strategy #7 - Context-Aware Chunking
08:20 - RAG Strategy #8 - Late Chunking
09:08 - RAG Strategy #9 - Hierarchical RAG (Using Metadata)
10:15 - RAG Strategy #10 - Self-Reflective RAG
10:51 - RAG Strategy #11 - Fine-tuned Embeddings
12:00 - Final Thoughts

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Join me as I push the limits of what is possible with AI. I'll be uploading videos weekly - at least every Wednesday at 7:00 PM CDT!