Retrieval augmented generation is the
way to give your AI agents the ability
to search and leverage your knowledge
and documents, but it feels like there
are a million strategies for RAG out
there. How do you know which is best for
your use case? What even are all the RAG
strategies that we can pick from? And
should we be combining some together? It gets overwhelming pretty fast when you try to optimize a RAG system for your use case. But don't worry, because I've got you covered, answering all of those questions for you in this video. Now, beware: this is a very informationally dense video. I want to try a new format where I have shorter content that's super value-packed for you.
So, as I'm going through these RAG strategies, please let me know in the comments if there's one you want me to make a dedicated video for. And if I already have one, I'll link to that video when I'm covering the strategy. The goal I have for you right now is just to get you started thinking about the strategies that will apply to your use case and how you can combine them, because usually the optimal solution is going to combine around three to five
RAG strategies. So hopefully, going into this, you understand RAG at least at a high level. We have our data preparation phase and the actual retrieval augmented generation phase. For data preparation, we take our documents, chunk them up into bite-sized pieces of information for our LLM, and embed them to put into our vector database, or potentially a knowledge graph, which I'll talk about later as well. Then for our query process, this is where we take a question from a user, like "what are the action items from the meeting?", which is a very common RAG use case. We embed that query and then search our vector database to find similar chunks, which we pass into the large language model so it can leverage them as extra context to augment its answer. That's why it's called retrieval augmented generation; the answer comes back as, for example, "the action items from the meeting are X, Y, and Z." So that's RAG at a high level, but there are so many different ways to do the data preparation, with different chunking strategies, and a million ways to search the vector database, and we can even store information in alternate formats like a knowledge graph.
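To make that base flow concrete, here's a minimal sketch in Python, assuming a Postgres table called chunks with a pgvector embedding column and using OpenAI's client for the embedding and chat calls; the table name, model names, and connection string are placeholder choices, not anything from the repo:

```python
# Minimal RAG flow: embed the query, find similar chunks in Postgres/pgvector,
# then pass them to the LLM as extra context. Names here are placeholders.
from openai import OpenAI
import psycopg2

client = OpenAI()

def embed(text: str) -> str:
    vec = client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding
    return "[" + ",".join(map(str, vec)) + "]"   # pgvector's text format

def answer(question: str) -> str:
    with psycopg2.connect("postgresql://user:pass@host/db") as conn, conn.cursor() as cur:
        # Find the 5 chunks closest to the query embedding (<=> is pgvector's cosine distance)
        cur.execute(
            "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT 5",
            (embed(question),),
        )
        context = "\n\n".join(row[0] for row in cur.fetchall())

    # Augment the LLM's answer with the retrieved chunks as extra context
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

print(answer("What are the action items from the meeting?"))
```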
And so that is what I'm going to cover with you here, starting with number one: re-ranking. This is the first strategy that I use for almost every RAG implementation. As a resource to go along with all of these strategies, I have a GitHub repo that I'll link to below. It has a README that dives deeper into all 11 of the strategies we have today, some research in the docs folder, and pseudocode examples for you to reference. There's also a full implementation that's not production ready, since it's not ideal to try to combine as many strategies as possible; I'm including it just as a reference for you. Feel free to give it to an AI coding assistant to use as a starting point as well.
And so, with that: re-ranking. With this strategy, we have a two-step retrieval. First, we pull a large number of chunks from our vector database. Then we use a specialized re-ranker model, often a cross-encoder, to find the ones that are actually relevant to our query and return just some of those. In the end, the large language model only gets a few of the chunks, but they're the most relevant ones. This is so important because if we went to our LLM and just gave it 20, 50, or more chunks right away, we would completely overwhelm it. By having this specialized model consider more context and then reduce it for the LLM, we're able to take more knowledge into account without overwhelming the model. It is going to be slightly more expensive because of the second model, but not by much. I love using re-ranking in most of my RAG implementations, and I've got a code example you can pause and take a look at right here if you're interested; I'll have this for each of the strategies that I cover.
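As a reference, here's a minimal sketch of that two-stage setup using the CrossEncoder class from sentence-transformers; the vector_search helper and the model choice are placeholder assumptions:

```python
# Two-stage retrieval: over-fetch with vector search, then let a cross-encoder
# re-rank and keep only the best few chunks for the LLM.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve(query: str, top_k: int = 5) -> list[str]:
    candidates = vector_search(query, limit=50)    # hypothetical helper: stage 1, cast a wide net
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]  # stage 2: only the most relevant survive
```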
Next up, we have agentic RAG. I've covered this a ton on my channel before; link to a video right here. It's all about giving our agent the ability to choose how it searches our knowledge base. Maybe it can do a classic semantic search, but also, if it wants to, it can read the entire text of a single document. I'll show you this right now in a live project. So I'm here in my Neon dashboard, which is quickly becoming my go-to for Postgres, and I love using Postgres with pgvector for most of my RAG AI agents. I have one table here for our chunks and another table that stores the higher-level information for each individual document, and my agent can pick and choose where it searches based on the question. This makes RAG very flexible, but it is going to be less predictable as well. So you want to incorporate agentic RAG when you have very clear instructions for when you want the agent to search the knowledge in the different ways you give it. And here is a code example if you want to pause and take a look at that.
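Here's a rough sketch of that setup: two retrieval tools over a chunks table and a documents table (hypothetical names mirroring the setup above) that you'd register with whatever agent framework you're using, so the agent can choose between them:

```python
# Agentic RAG: expose several retrieval tools and let the agent decide which one to call.
# Table and column names are placeholder assumptions, not a specific schema from the video.
import psycopg2

DSN = "postgresql://user:pass@host/db"

def semantic_search(query_embedding: list[float], limit: int = 5) -> list[str]:
    """Tool 1: classic similarity search over individual chunks."""
    vec = "[" + ",".join(map(str, query_embedding)) + "]"
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (vec, limit),
        )
        return [row[0] for row in cur.fetchall()]

def read_full_document(document_id: str) -> str:
    """Tool 2: read the entire text of one document when the agent needs full context."""
    with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
        cur.execute("SELECT content FROM documents WHERE id = %s", (document_id,))
        row = cur.fetchone()
        return row[0] if row else ""

# Register both functions as tools with your agent framework; the system prompt should
# spell out when to use each one, since that's what keeps agentic RAG predictable.
```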
Next up, we have knowledge graphs, another thing I've covered a lot in my content; link to a video right here. Here we're combining traditional vector search, which is what I showed in the diagram, with a new type of database: a graph database that stores entity relationships. So our agent can not only do similarity search, it can also traverse the relationships between all of the entities in our knowledge. You generally end up with a graph that looks like this, and you're usually using a large language model to build it, extracting the entities and relationships from the raw text you feed in. Knowledge graphs are fantastic for interconnected data, but keep in mind that since we're usually using an LLM to extract from documents, it's going to be a lot slower and more expensive to create them. So take a look at this: this is the pseudocode if you want to see an example using Graphiti, which is my favorite library for working with knowledge graphs.
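Here's a minimal sketch in the spirit of that pseudocode, using Graphiti's add_episode and search pattern; the exact parameters and connection details are assumptions on my part, so check the current Graphiti docs:

```python
# Sketch of building and querying a knowledge graph with Graphiti.
# Method names follow Graphiti's documented add_episode/search pattern, but the
# exact keyword arguments here are assumptions - verify against the current API.
import asyncio
from datetime import datetime, timezone
from graphiti_core import Graphiti

async def main():
    # Graphiti uses Neo4j as the backing graph database
    graphiti = Graphiti("bolt://localhost:7687", "neo4j", "password")

    # Ingest raw text; an LLM extracts entities and relationships behind the scenes
    await graphiti.add_episode(
        name="meeting-notes-2024-06-01",
        episode_body="Alice agreed to ship the billing migration by Friday...",
        source_description="meeting transcript",
        reference_time=datetime.now(timezone.utc),
    )

    # Hybrid search over the graph: returns facts/edges relevant to the query
    results = await graphiti.search("What did Alice commit to?")
    for result in results:
        print(result.fact)

asyncio.run(main())
```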
Next, we have contextual retrieval, which is something Anthropic has done a lot of research on; they have some very enticing statistics for how much it helps the overall retrieval process. What we're doing here is using a large language model to enrich each chunk with information, placed at the start of the chunk, that describes how it fits with the rest of the document. So, back in my Neon dashboard, I'll show you what this looks like in a real database. For all of the chunks I have stored here, if I click into any one of them, take a look at this: we have text prepended that describes how this specific chunk fits with the document, then a triple dash, and then the content of the actual chunk. This is embedded along with the rest of the information for every chunk that we have, so there's just more context with everything we store. But we're now using a large language model to create every chunk, so it's going to be a lot slower and more expensive, like knowledge graphs.
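As a sketch, this is roughly how those prepended contexts get generated at indexing time; the prompt wording and model names are placeholder choices:

```python
# Contextual retrieval (indexing side): ask an LLM to describe how each chunk fits
# into its document, prepend that context plus "---", then embed the combined text.
from openai import OpenAI

client = OpenAI()

def contextualize_chunk(document: str, chunk: str) -> str:
    prompt = (
        f"Here is a document:\n{document}\n\n"
        f"Here is one chunk from it:\n{chunk}\n\n"
        "Write one or two sentences situating this chunk within the overall document."
    )
    context = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    return f"{context}\n---\n{chunk}"            # this combined text is what gets embedded

def embed_for_storage(document: str, chunk: str) -> list[float]:
    enriched = contextualize_chunk(document, chunk)
    return client.embeddings.create(model="text-embedding-3-small", input=enriched).data[0].embedding
```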
Next is query expansion. This is one of the simplest. All we're doing here is taking the user query and, before we send it into the search, using a large language model to expand it, making it more specific in ways we know will lead to pulling more relevant chunks from the knowledge base. We define instructions for how to improve precision by adding more relevant details. Obviously, the trade-off is that it's going to be slower, because we have an extra large language model call for every single search we perform.
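Here's a minimal sketch of that single expansion call; the instructions in the system prompt are the part you'd tailor to your own knowledge base:

```python
# Query expansion: one LLM call rewrites the user query with more specific detail
# before it ever hits the vector database.
from openai import OpenAI

client = OpenAI()

def expand_query(user_query: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Rewrite the user's search query to be more specific. Add relevant "
                "synonyms, entities, and technical terms. Return only the rewritten query."
            )},
            {"role": "user", "content": user_query},
        ],
    )
    return response.choices[0].message.content

# "what were the action items?" might come back as something like
# "action items, decisions, and follow-up tasks assigned in the team meeting"
expanded = expand_query("what were the action items?")
```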
And another simple, kind of similar RAG strategy is multi-query RAG. Instead of using a large language model to expand upon one query, we're using an LLM to generate multiple different variants and sending them into our search in parallel. It gives us more comprehensive coverage, obviously at the cost of having an LLM call before each search again, plus more database queries overall. And so here is a quick code example.
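Here's a rough sketch of that: generating variants, searching with all of them in parallel, and de-duplicating the merged results (vector_search is the same hypothetical helper as before):

```python
# Multi-query RAG: generate several variants of the question, search with all of them
# in parallel, and merge the de-duplicated results before handing them to the agent.
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()

def generate_variants(query: str, n: int = 3) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Write {n} different rephrasings of this search query, one per line:\n{query}"}],
    )
    return [line.strip() for line in response.choices[0].message.content.splitlines() if line.strip()]

def multi_query_search(query: str) -> list[str]:
    variants = [query] + generate_variants(query)
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda q: vector_search(q, limit=10), variants))  # hypothetical helper

    seen, merged = set(), []
    for chunk in (c for results in result_lists for c in results):
        if chunk not in seen:
            seen.add(chunk)
            merged.append(chunk)
    return merged
```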
Now, on to context-aware chunking. This one's a little different because, up until this point, we've only been talking about strategies for the query process, but it's also important to have solid strategies for data preparation. This is speaking to how we split up our documents to put into our knowledge base, because we definitely need to split them: if we don't break our documents into bite-sized pieces of information, our embeddings are inaccurate and our agents pull way too much information. But when we split, we want to make sure we maintain the document structure. So what we're doing here is using an embedding model to find the natural boundaries in our document so that we can split there; it's fast and cheap, and we maintain our document structure. It's obviously more complex than just splitting every 1,000 characters or something like that, but I find it to be very, very worth it. Docling is a library I use in Python that makes it very easy to implement hybrid chunking, which is a form of context-aware chunking; I've got a video on that right here.
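Here's a minimal sketch with Docling's HybridChunker; I'd double-check the import path and the serialization call against the current Docling docs, since these have shifted between versions:

```python
# Hybrid (context-aware) chunking with Docling: convert the document, then let the
# chunker split along the document's natural structure instead of fixed character counts.
from docling.document_converter import DocumentConverter
from docling.chunking import HybridChunker

doc = DocumentConverter().convert("report.pdf").document
chunker = HybridChunker()  # optionally pass a tokenizer matched to your embedding model

for chunk in chunker.chunk(doc):
    # serialize() prepends heading/section context so each chunk carries its place in
    # the document; the exact method name may vary by Docling version
    text_to_embed = chunker.serialize(chunk=chunk)
    print(text_to_embed[:120], "...")
```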
Now, for another chunking strategy, we have late chunking. Full disclosure: this is the only one I haven't used myself. It's also definitely the most complicated, but I wanted to include it here because I think it's fascinating. The idea is that we apply the embedding model to the document before we chunk it, unlike most chunking strategies, and then we chunk up the token embeddings. What we get out of this is that each chunk still maintains the context of the rest of the document. So it leads to better preservation of full document context, and it leverages long-context embedding models. Of course, the trade-off is that it's a lot more complex. In fact, you might even be thinking to yourself, "Cole, whoa, this is insane, what are you even talking about?" Well, just let me know if you want me to make a video on late chunking specifically, like I said, for any of these strategies.
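To give you a feel for it anyway, here's a rough sketch of late chunking with Hugging Face transformers; the model choice and the fixed-size token windows are placeholder assumptions, and a real implementation would align chunk boundaries with sentences:

```python
# Late chunking: embed the WHOLE document first so every token "sees" full context,
# then split the token embeddings into chunks and pool each span into one vector.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "jinaai/jina-embeddings-v2-base-en"      # any long-context embedding model
tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL, trust_remote_code=True)

def late_chunk(document: str, window: int = 256):
    # 1) Run the full document through the embedding model once
    inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=8192)
    with torch.no_grad():
        token_embeddings = model(**inputs).last_hidden_state[0]    # (num_tokens, dim)

    # 2) Only now split: pool contiguous spans of token embeddings into chunk vectors
    chunks = []
    for start in range(0, token_embeddings.shape[0], window):
        span = token_embeddings[start:start + window]
        chunk_text = tokenizer.decode(inputs["input_ids"][0][start:start + window])
        chunks.append((chunk_text, span.mean(dim=0)))              # mean-pooled chunk vector
    return chunks
```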
Next, we have hierarchical RAG. The idea here is that we have different layers of our knowledge stored in our database. We can have parent-child chunk relationships, and generally we store these relationships as metadata on all of our chunks. So we can search small to be very precise, like searching individual paragraphs, but then pull the entire document for a specific chunk that we find. We're balancing precision (searching small) with context (returning big). You could argue that hierarchical RAG is sort of a subset of agentic RAG; this sounds very similar to what I was showing you in Neon earlier. Going back to Neon really quick, I'll actually show you this. Let's say our search finds this chunk right here. We can look at the metadata and see that this chunk came from this specific file, so then we can go to the document metadata table and pull the content of that entire file. For a system where you want to do precise search but then look at larger sets of context, assuming your documents aren't too big to read in full, this is an awesome approach. It obviously adds more complexity and a little bit of unpredictability, like agentic RAG, but it's a very powerful strategy as well.
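As a sketch, that "search small, return big" lookup can be as simple as this; the table and column names are assumptions mirroring the two-table setup I showed:

```python
# Hierarchical RAG: search precise chunks first, then use their metadata to pull the
# full parent document for broader context. Schema names here are placeholders.
import psycopg2

def search_small_return_big(query_embedding: list[float]) -> list[dict]:
    vec = "[" + ",".join(map(str, query_embedding)) + "]"
    with psycopg2.connect("postgresql://user:pass@host/db") as conn, conn.cursor() as cur:
        # Precision: match against individual paragraph-sized chunks
        cur.execute(
            "SELECT content, metadata->>'document_id' FROM chunks "
            "ORDER BY embedding <=> %s::vector LIMIT 3",
            (vec,),
        )
        hits = cur.fetchall()

        results = []
        for chunk_text, document_id in hits:
            # Context: pull the whole parent document the chunk came from
            cur.execute("SELECT content FROM documents WHERE id = %s", (document_id,))
            parent = cur.fetchone()
            results.append({"chunk": chunk_text, "full_document": parent[0] if parent else None})
        return results
```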
Next is self-reflective RAG. After the last couple, I just want to show you another simple one, because all we have here is a self-correcting search loop. We perform our initial search, then we call upon a large language model, given the chunks and the question, to produce some kind of grade, maybe on a one-to-five scale. If it's less than three, for example, then we call the RAG tool again with a refined search to try to get more relevant chunks. So it's self-correcting, just at the cost of more LLM calls, because after every search we now need to call into a secondary LLM before returning chunks to our agent, and then potentially retry.
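Here's a minimal sketch of that grading loop, reusing the hypothetical vector_search and expand_query helpers from earlier; the one-to-five scale and the threshold of three are just the example numbers from above:

```python
# Self-reflective RAG: grade the retrieved chunks with an LLM and retry the search
# with a refined query if the grade is too low.
from openai import OpenAI

client = OpenAI()

def grade(question: str, chunks: list[str]) -> int:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": (
            f"Question: {question}\n\nRetrieved chunks:\n" + "\n---\n".join(chunks) +
            "\n\nOn a scale of 1-5, how relevant are these chunks to the question? Reply with a single digit."
        )}],
    )
    return int(response.choices[0].message.content.strip()[0])

def self_reflective_search(question: str, max_retries: int = 2) -> list[str]:
    query = question
    for _ in range(max_retries + 1):
        chunks = vector_search(query, limit=5)    # hypothetical helper from earlier
        if grade(question, chunks) >= 3:
            return chunks                         # good enough, hand these to the agent
        query = expand_query(question)            # refine the search and try again
    return chunks
```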
Last but not least, we have fine-tuned embeddings. This applies both to embedding during the query process and during indexing, because what you can do with embedding models, just like with large language models, is fine-tune them on a domain-specific dataset, like legal or medical. From my research, you can see 5 to 10% accuracy gains, and you can make it so that smaller embedding models, even open source ones, outperform larger, more generic ones on your specific use case. Now, this requires a lot of data to train on, plus infrastructure and ongoing maintenance, since it is your embedding model. But it's very powerful when you have a dataset you can use to train a model. For example, you might have a use case where you want similarity to be based more on sentiment than on the semantic similarity of the text. For a pre-trained embedding model, "my order was late" is going to be similar to "shipping was fast", right? Because both are about the order itself rather than the individual items. But you can fine-tune the embedding model so that "my order was late" is most similar to "items are always sold out", because now it's based more on sentiment. You can use a sentiment-based training set to make your embedding model operate like this instead.
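For reference, here's roughly what fine-tuning an open source embedding model on sentiment pairs looks like with the classic sentence-transformers training API; the tiny dataset here is just a stand-in for the large labeled set you'd actually need:

```python
# Fine-tuning an embedding model so similarity tracks sentiment rather than topic.
# Real training needs thousands of labeled pairs; these few only show the shape.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")   # small open source base model

train_examples = [
    # label = how similar we WANT the pair to be (1.0 = same sentiment)
    InputExample(texts=["my order was late", "items are always sold out"], label=0.9),
    InputExample(texts=["my order was late", "shipping was fast"], label=0.1),
    InputExample(texts=["great customer support", "the staff was very helpful"], label=0.95),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=4, warmup_steps=100)
model.save("sentiment-embeddings-v1")
```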
So there you go. That is the rundown I have for you on all the main RAG strategies and their pros and cons. If you want to dive deeper into any of them, again, check out this repository. I've got all these examples with pseudocode, focusing on using Postgres with pgvector, because, especially with Neon, that is my go-to right now for my RAG agents. And one last golden nugget to leave you with: if you want to focus on three RAG strategies to start (remember, I recommend combining three to five for the most accurate use cases), I would look at re-ranking, agentic RAG, and context-aware chunking. Specifically, hybrid chunking with Docling has been killing it for me. That's my super tactical recommendation to end things off. So, with that, if you appreciated this video and you're looking forward to more on AI agents and RAG, I'd really appreciate a like and a subscribe. And with that, I will see you in the next video.
Retrieval Augmented Generation is THE way to give your AI agents the ability to search and leverage your documents and knowledge. But there are a million strategies for RAG out there - how do you know what is ideal for your use case? What even are all of the RAG strategies I can pick from? Should I combine some together? I've got you covered with all of these questions in this video. Beware, this is very informationally dense, but it'll get you started thinking about which strategies to apply to your use cases. Also let me know in the comments which RAG strategies you want me to cover more on my channel!

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- If you want to master AI coding assistants and learn how to build systems for reliable and repeatable results, check out the new Agentic Coding Course in Dynamous: https://dynamous.ai/agentic-coding-course

- Neon is the Postgres platform I used to showcase real RAG data - check it out here: https://get.neon.com/jR4SxEE

- Here is the repo that outlines all the RAG strategies with examples: https://github.com/coleam00/ottomator-agents/tree/main/all-rag-strategies

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

00:00 - Intro: The Importance of RAG
01:08 - RAG Explained in 1 Minute
02:22 - Resources for Each RAG Strategy
02:51 - RAG Strategy #1 - Reranking
03:46 - RAG Strategy #2 - Agentic RAG (includes Hybrid Search)
04:42 - RAG Strategy #3 - Knowledge Graphs
05:33 - RAG Strategy #4 - Contextual Retrieval
06:26 - RAG Strategy #5 - Query Expansion
06:56 - RAG Strategy #6 - Multi-Query RAG
07:22 - RAG Strategy #7 - Context-Aware Chunking
08:20 - RAG Strategy #8 - Late Chunking
09:08 - RAG Strategy #9 - Hierarchical RAG (Using Metadata)
10:15 - RAG Strategy #10 - Self-Reflective RAG
10:51 - RAG Strategy #11 - Fine-tuned Embeddings
12:00 - Final Thoughts

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Join me as I push the limits of what is possible with AI. I'll be uploading videos weekly - at least every Wednesday at 7:00 PM CDT!