There's no one-size-fits-all approach
when it comes to building RAG systems.
It's highly dependent on your use case,
and if you choose the wrong pattern to
start, you're setting yourself up to
fail. The key element here is you need
to strike the right balance between
model intelligence versus speed and
cost. Think about it. A customer-facing
chatbot needs lightning-fast
responses, otherwise you've lost your
customer. So, you can't have a frontier
model like Claude Sonnet thinking for 5
minutes before it comes up with an
answer. Whereas a RAG agent deployed in
a legal department absolutely
prioritizes accuracy over speed. And if
you take a fully local RAG system, for
example, here you're massively
constrained by resources. So you need to
design your RAG system around the models
you can actually afford to run in-house.
This is the main reason why there are
hundreds of RAG strategies and tactics
out there. And I've distilled them down
to nine key RAG design patterns that
cover all use cases. And that's what
we'll be going through today. We'll be
starting with the most basic, which is
naive RAG, and work our way up to the
state-of-the-art multi-agent RAG
systems. And this isn't just theory.
Today, I'll be showing you how this can
be implemented in n8n to give you the
knowledge you need for your RAG project
to succeed. Let's get into it. The
models you pick for your RAG system will
heavily influence how you design
everything else. And the reason for this
is generally speaking, the larger the
LLM is, the more parameters it has, the
higher the quality responses you're
going to get out of it. This is because
larger models are generally better at
instruction following, tool calling,
complex reasoning, and context handling.
So by and large, you do get better
responses out of larger models compared
to smaller models. And what I found is
small language models from let's say 1
to 10 billion parameters can't usually
reliably call tools and in a lot of
cases can only handle basic instruction
following. But then there's a steep rise
after let's say the 15 billion parameter
mark where the model's abilities around
tool calling and reasoning and context
handling improve all the way up then to
the main frontier models like Claude
Sonnet and GPT-5. That's why I think for
RAG systems there's somewhat of a sweet
spot in between these small and frontier
models. A model like GPT-OSS 120B, for
example, really is highly capable and
nowhere near as expensive as the likes
of Claude Sonnet or GPT-5. But
the crucial detail here is even small
language models of 15 to 20 billion
parameters can outperform large language
models if they have high quality
retrieval. So, it's all down to your RAG
system design. If you can inject the
right content into the context of a small
language model, it will always produce a
better answer than a large language
model that doesn't have the information
to base an answer off. Before we get
into the design patterns, let's look at
four RAG use cases and focus on the
different priorities that will dictate
the system design. First up, we have a
customer-facing RAG chatbot. So you can
imagine that this is embedded on let's
say an e-commerce website where
customers can chat and ask questions
about the product catalog or delivery
information. A key priority of a
customer-facing chatbot is speed. It
has to be fast and that priority will
dictate which model you use. So you're
going to need to use something like
Gemini 2.5 Flash, which is a smaller
model that'll give you much faster
responses. And as these models are
cheaper to run, your bill is going to be
lower, which is a good thing. But
because this is a public-facing
chatbot, this will be deployed at scale.
You could have thousands of customers
asking questions of it. So you're also
constrained by that, too. You can't have
frontier models that cost $15 per
million output tokens. So these
constraints, these priorities dictate
the system design. And as a result, then
you'll likely use a mixture of agentic
and non-agentic RAG, which I'll be going
through in a few minutes. You'll need
guard rails because you don't want this
chatbot chatting about anything. It
needs to be highly specific and highly
specialized on, let's say, an e-commerce
catalog. You'll most likely need query
routing so that you can route the
different types of questions to the
right tools. I'll explain all of these
in a few minutes when we go through the
design patterns. And a key pattern you
would likely use here is the verify
answer pattern because you want to make
sure that what you're telling the
customer is actually true and is
grounded in the retrieved data from your
systems. In other words, you're better
off not saying anything as opposed to
saying something false. Next up, we have
the idea of an AI assistant or co-pilot
like the AI agent deployed in a legal
department. So here, accuracy is
absolutely crucial, particularly if
you're dealing with complex documents.
So with accuracy being the guiding light
in this project, speed is going to be
reduced because you will want complex
reasoning over the documents. You may
need iterative and recursive retrieval.
And if you're using more expensive
models and larger context windows,
that's all going to come at a greater
cost. So it will be more expensive per
user. But because this isn't public
facing, you'll likely have a smaller
user base. So the overall cost won't be
too high. So this type of system could
be deployed in lots of different ways.
So you could have an MCP trigger node in
n8n, and then if Claude Desktop is deployed
within a business they can simply add it
as a tool and then chat to their
documentation that way. Or it could be
deployed in something like Open WebUI,
or this could be a Slack agent or a
Microsoft Teams agent. So that way you
don't need to deploy software to
employees in a business. The types of
techniques you would likely need here
include agentic RAG, multi-agent RAG,
hybrid RAG, and that verify answer
pattern as well. We'll be deep diving on
all of these shortly. Another use case
is an AI automation with RAG embedded
within it. This is an agentic blogging
system I created a few months ago. And
within this sequential flow, we have an
agentic rag which retrieves information
from a knowledge base. And all of that
feeds into article outlines and article
generation, which then goes through a
deterministic pattern to build out an
article to be published on WordPress. So
it's just another RAG use case where
there are again different priorities. So
this type of automation will be running
in the background. So speed is not a
problem. The key thing is it's
resilient. So you can apply larger
models with reasoning. So we have agentic
RAG and non-agentic RAG here in a
deterministic flow and a lot of prompt
chaining going on as well. And finally
onto our fully local RAG system here.
Your model selection is highly limited
based on the internal hardware that's
available to run these models. So with a
$3,000 graphics card, you may only be
able to run a 20 billion parameter
model. And then that massively
influences the design of your system. So
while there is an initial capital cost
up front, the actual cost per user is
quite low over the long term and how
fast the model inference is really
depends on the graphics cards that you
have in play. These types of solutions
could be deployed again using the likes
of Open Web UI because you want to keep
everything local. I put days of work
into pulling this video together. So if
you're finding value in it, I'd really
appreciate it if you gave the video a like
and subscribed to our channel for more
deep AI and n8n content. Now that you
have an appreciation for the different
types of RAG use cases, let's dive into
our deterministic RAG system designs.
First up, naive RAG. And naive RAG is
the most basic way of running a RAG
system. Essentially, you have a message
that comes in, it goes straight to a
vector store to retrieve relevant
results, which go straight to an LLM to
generate a response, and that's output
to the user. So, it's ultra fast but
ultra basic. And this is how it looks in
n8n: a chat trigger straight into
Supabase, aggregate all of the chunks that
we get back from the vector store, and
straight into an LLM call. Now I have
this hooked up to a large vector store.
So my documents table here has 215,000
chunks. And here I'm going to
demonstrate the exact problem with naive
RAG. So I'm going to ask this question.
How to change the power levels on my GE
Advantium oven. So here I ingested
10,000 product manuals of kitchen
appliances. And if you want to see how
we did this, check out our RAG at Scale
series on our channel, which includes
these two videos. I'll leave links for
these in the description below. So if we
ask this question, you'll see that we
quickly get chunks from the vector
store. Quickly get an answer. The answer
being sorry, I don't know. And the
reason for that is that the chunks that
we got back from the vector store are
not relevant. And you can see that the
chunks are related to a different GE
model. There's a cafe model. It seems to
be here again a cafe model. So it has
returned chunks that are related to
power levels of an oven. But it's the
wrong model. And within our system
prompt, we are being ultra specific. The
LLM needs to analyze the retrieved chunks
to determine if they're relevant. And if
they are, base the answer off those
chunks. If they're not, just say sorry,
I don't know. And that in a nutshell is
the limitation of naive RAG. It's a
single pass through to generate an
answer. Even though that query could be
quite conversational with lots of stop
words and irrelevant words that are
going to negatively impact the quality
of the embedding that's sent into the
vector store to find the nearest
matches. But obviously the benefit of
this approach is simplicity and speed.
End to end, this took 2.5 seconds and
included a vector search and an LLM
inference. So it's highly efficient, if a
little blunt.
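To make that single-pass shape concrete, here's a minimal naive RAG sketch in Python. The embed, vector_search, and generate helpers are hypothetical stand-ins for whatever embedding model, vector store, and LLM you're using, so treat it as an illustration of the flow rather than a drop-in implementation.

```python
def naive_rag(question: str, embed, vector_search, generate) -> str:
    """Single-pass RAG: embed the raw question, fetch chunks, answer once."""
    query_embedding = embed(question)                  # raw, untransformed query
    chunks = vector_search(query_embedding, top_k=10)  # nearest-neighbour chunks
    context = "\n\n".join(chunk["text"] for chunk in chunks)
    prompt = (
        "Answer ONLY from the context below. "
        "If the context is not relevant, say 'Sorry, I don't know.'\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)                             # one LLM call, no retries
```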
So as a result of these shortcomings, a
lot of techniques have been invented
around the idea of transforming the
original user's query.
So in our next design pattern, we
implement a lot of these query
transformation techniques just to give
you an idea of how they all work
together. If you'd like to get access to
the nine different RAG design patterns
that I discussed in today's video, then
check out the link in the description to
our community, the AI Automators, where
you can join hundreds of fellow builders
all looking to create production rag
applications. If you're serious about
building AI agents, this is the place to
be. So, our second pattern is a
deterministic RAG flow with query
transformation and the verify answer
pattern that I talked about previously.
So, here there's no tool calling. So we
need to decompose the entire retrieval
flow. This type of flow would work with
most LLMs from 5 billion parameters up
to trillions of parameters. And here
I'll be demonstrating how when a user's
question comes in, we first decompose
that query because there might be
multiple asks within that question. And
then for each of those sub queries, we
expand upon it. So we find different
angles to actually search the knowledge
base for that question. We then carry
out multiple vector searches and then
use a technique called RAG fusion, or
multi-query RAG, where we actually fuse
those result sets together and carry out
reranking. This then gives us the top
results from the much larger candidate
set that was retrieved. That's then sent
to the LLM to generate an answer. And
then our verify answer call checks the
answer against the retrieved context
from this stage to make sure that it's
actually accurate. If it is, it outputs
it to the user. If it's not, it goes
back to the LLM to generate an updated
answer, along with guidance on where it
went wrong.
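Here's a rough sketch of the decomposition and expansion steps, assuming a hypothetical call_llm helper that returns the model's text and that you ask the model for JSON; in n8n these would be separate LLM nodes with structured output parsers.

```python
import json

def decompose_query(question: str, call_llm) -> dict:
    """Split a multi-part question into sub-queries and flag whether retrieval is needed."""
    prompt = (
        "Break the user question into standalone sub-queries and decide if "
        "retrieval from the knowledge base is required. Respond as JSON with "
        'keys "needs_retrieval" (bool) and "sub_queries" (list of strings).\n\n'
        f"Question: {question}"
    )
    return json.loads(call_llm(prompt))

def expand_query(sub_query: str, call_llm) -> list[str]:
    """Rewrite one sub-query into several phrasings/synonym variants for search."""
    prompt = (
        "Rewrite this search query 3 ways using synonyms, related terms and "
        "different phrasing. Respond as a JSON list of strings.\n\n"
        f"Query: {sub_query}"
    )
    return [sub_query] + json.loads(call_llm(prompt))
```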
So, let's see it in action. And for this,
unlike our previous question, let's ask a
multifaceted question: how to change the power levels
on my GE Abacus mixer? Now, an abacus
mixer doesn't exist. And how do I clean
it? So, let's ask it here. And off it
goes. Now I've tied in chat history as
well so that we can retrieve the chat
history and there is nothing yet because
I've refreshed the session. We've got
back a quick answer which is great. But
let's just look at what happened. So
firstly we go to this query intent and
decomposition step. And what we're doing
here is we're breaking down this
question into multiple sub queries or
sub questions. And on the right hand
side here you can see number one it's
outputting that it needs retrieval which
is crucial and number two it's
outputting the sub queries. The first
one is how to change the power levels on
a GE abacus mixer and the second is how
to clean it. It then goes through this
retrieval gate because if someone is
just asking a question like hi how are
you? You don't need to go to a vector
store. You can simply just output an
answer up here which is ideal for small
talk as you can see. But because this
question does require retrieval, it goes
through this gate and we then go to our
query rewriting and expansion call. And
here we're transforming the original
query or the subquery in this case by
adding synonyms, related terms,
different phrasing, trying to come at it
from different angles to get different
chunks back from the vector store for a
more rounded candidate set. So within
this node, on the left you can see we
have "how to change the power levels" and
"how to clean it", and on the right you can see
the different rewritten queries that come
at the search from different angles. The
other thing we're doing is we're
asking the LLM to classify the incoming
message based off a product category
because this will help us narrow the
search within the vector store. We have
200,000 chunks. So, ideally, we should
be able to narrow this with the metadata
filter. And that way, we'll have a much
more focused search for these terms to
be able to get back the right chunks.
And as you can see here, it's saying
that it's unsure what product category
this product fits into. So, a lot of
this is in the system prompting. So,
here I'm saying the only valid options
are ovens, washing machines, toasters,
and unsure. But you could list all of
your metadata values here. Okay. So from
here then we're splitting out again. And
here I have a simple code node that
checks to see if we are unsure of any of
the categories, because if we are unsure,
we should just go straight back to the
user and ask for clarification. And this
brings us to this If node, and because we
are unsure of what this Abacus mixer is,
we come down to a query clarification
call. And here we've identified an
ambiguous or an underspecified query and
we're going to ask the user to actually
clarify. And it has output the response:
"To help me provide the most accurate
information, please clarify what type of
mixer you're referring to." And that's
effectively then what's output to the
user. And then we've updated the chat
history. So you can see then within the
chat messages, what the LLM generated is
output to the user: are you asking
about a food mixer or an audio mixer? So
this is a great way to ask clarifying
questions of the user to help you refine
your search within the vector store
based off metadata filters. So let's
just follow up and say sorry I meant the
GE Advantium oven. Okay. And that's
working its way back through the
process. And now it has identified the
actual product category. So let's now
take it up here. So we've now passed
this gate and it's true. And now we're
into a technique called query routing.
So here we're going to direct the
queries to the most appropriate metadata
filter in this case. So the switch has
ovens, washing machines, and toasters.
So here we're coming up through the
ovens path which sets the metadata
filter. So we're setting the product
category as oven. And when we go to the
vector search, we're actually passing
that product category of oven along with
the prompt.
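As an illustration of what that filtered search can look like under the hood, here's a hedged pgvector-style sketch in Python; the documents table, the product_category metadata key, and the cosine-distance operator reflect a typical Supabase/pgvector setup, but your schema and column names will differ.

```python
def build_filtered_search(query_embedding: list[float], category: str, top_k: int = 10):
    """Return SQL + params for a metadata-filtered similarity search (pgvector-style)."""
    sql = """
        SELECT id, content, metadata
        FROM documents
        WHERE metadata->>'product_category' = %(category)s   -- narrow 200k chunks first
        ORDER BY embedding <=> %(query_embedding)s::vector   -- cosine distance
        LIMIT %(top_k)s;
    """
    return sql, {"category": category, "query_embedding": query_embedding, "top_k": top_k}
```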
And because we've expanded out the
prompts previously, this is just one of
five searches we're going to have on this
vector store. So this one is GE Advantium oven change
power levels. But if we look back here
at the query rewriting and expansion,
you can see that we have multiple
queries that we're going to search the
vector store off. How to change power
levels, how to adjust cooking power, how
to clean the oven, cleaning
instructions, maintenance guides. So all
of these keywords are going to be sent
into the vector store to get back a
diverse collection of chunks. So you can
see this is already way better than
naive RAG. Naive RAG sent in a really
long, rambling query from the user. Whereas
here, not only is the query rewritten,
it's also expanded into multiple queries
to cover a wider range. As a result, we
end up with 50 items coming back across
all of the different queries because
we're getting 10 chunks per search. We
then go through the process of RAG
fusion. So with these multiple queries,
we're going to end up with a lot of
duplicate results. So we need a way of
deduplicating and fusing the results
together into an ordered list. And
that's what this reciprocal rank fusion
step does. So essentially if one chunk
appears in multiple lists, it'll rank
higher than a chunk that only appears in
one of the lists. The lists being the
five queries that we ran against the
vector store. So of the 50 chunks that
we retrieved from the vector store,
there are only 20 unique chunks, and they
are now ordered based off that fusion
process.
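Reciprocal rank fusion itself is only a few lines; here's a minimal sketch, where each result list is the ordered chunk IDs returned for one of the rewritten queries and k=60 is the commonly used smoothing constant.

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one deduplicated, ordered list."""
    scores: dict[str, float] = {}
    for ranked_ids in result_lists:
        for rank, chunk_id in enumerate(ranked_ids):
            # A chunk appearing high up in several lists accumulates a bigger score.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```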
But that fusion process is pretty crude,
as I just described, so we also carry out
a re-ranking stage. And here we're using
Cohere Rerank 3.5. You could use a local
re-ranking model if you wanted. But what
re-ranking does is send in the user's
question, as well as the 20 items in this
case, into a cross-encoder model, which is
similar to a basic LLM. And what we're
doing is asking it to take the 20 items
and return only the 10 most relevant
ones, in order. And that's what it has
done there.
up then to send into the LLM, which is
what we do to generate the answer. And
you can see we're sending in the user's
question, the chat history, as well as
the top 10 chunks ordered by relevance
from the re-ranker. And that's helping
to produce this quite accurate answer.
And finally, we go through our verify
answer pattern with the feedback loop.
So what we do is we send in that answer
that was just generated along with the
user's question along with the chunks
that we got back from the vector store
and we're asking the LLM to make a
judgment on is this answer fully
grounded in the context that was
retrieved. Are there any contradictions?
Are there any unsupported claims? And
from there then there's a decision. If
it's grounded we can just output the
answer to the user. If it's not
grounded, then we can go back to the LLM
and pass the feedback to say you need to
make these changes and it works its way
back through the process. It only does
this a couple of times, because you don't
want to get caught in an infinite loop.
And finally, it outputs a response to
the user and updates the memory.
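The verify answer loop boils down to an LLM-as-judge call plus a capped retry; here's a hedged sketch with hypothetical generate_answer and judge_grounding helpers, where the judge is assumed to return a verdict plus feedback.

```python
def answer_with_verification(question: str, context: str,
                             generate_answer, judge_grounding,
                             max_attempts: int = 2) -> str:
    """Generate an answer, check it is grounded in the retrieved context, retry with feedback."""
    feedback = ""
    answer = generate_answer(question, context, feedback)
    for _ in range(max_attempts):
        verdict = judge_grounding(question, context, answer)  # e.g. {"grounded": bool, "feedback": str}
        if verdict["grounded"]:
            return answer
        feedback = verdict["feedback"]                         # tell the LLM what to fix
        answer = generate_answer(question, context, feedback)
    return answer  # cap the loop so we never spin forever
```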
So that's what the end-to-end process
looks like for a deterministic RAG flow
with query transformation and the verify
answer pattern. And what's brilliant
about this is a small language model can
actually run this. There's no tool
calling. Any of the really complex
system prompts have been broken into
multiple calls to LLMs. And time-wise,
if we look at the logs here, you can see
that end to end, this took 30 seconds to
retrieve quite an accurate answer. And
I've baked a lot of query transformation
strategies in here. So you could
probably strip some of these out and
actually get that even faster. And it is
worth calling out that this is what
production RAG systems look like. It's
highly focused LLM calls. It's super
fast vector searches with chunks
injected into context to get ultra fast
answers. And if you look through the
logs here, the real bottlenecks are
actually in the verify answer pattern,
which takes 9 seconds because there's a
bit of reasoning required over the data.
It's absolutely possible to use agents
within this type of flow, so you can end
up with a hybrid agentic-yet-deterministic
workflow. Our third strategy is
deterministic RAG using
iterative retrieval. And it's a similar
flow to before with the exception that
after the re-ranking stage before
generating the answer, we have an LLM
call that analyzes the results and it
determines whether we actually need to
retrieve more from the vector store
because the quality of results isn't
good enough to formulate an answer. And
then you can keep a count on the number
of iterations that you progress through
to again avoid an infinite loop similar
to the verify answer pattern. And once
you're happy that you've retrieved
enough or you've hit the limit, you can
then go to an LLM to generate a response
to the user. I haven't mapped this one
out in n8n, but essentially you'll be
injecting the LLM call here. This would
be your analyze results call. And then
you would have an if node. That if node
is should you retrieve more and if we
should retrieve more then we need a
counter to make sure that we're not
stuck in a loop. So should we stop we'll
drop that in there and if we should stop
then generate the answer. If we
shouldn't stop if we can go again then
we go back to the vector store and we
pass in the new queries that were
generated from this analyze results
node. And if the retrieve more says
we're done we can also just connect that
up to the generate answer. So that's the
type of implementation you would need.
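In code form, the iterative version is just a bounded loop around the search; this sketch assumes hypothetical search, analyze_results, and generate_answer helpers, with the analysis step returning whether to keep going and which new queries to run.

```python
def iterative_retrieval(question: str, search, analyze_results, generate_answer,
                        max_iterations: int = 3) -> str:
    """Keep retrieving until the analysis step says the context is good enough (or we hit the cap)."""
    queries = [question]
    collected: list[str] = []
    for _ in range(max_iterations):
        for query in queries:
            collected.extend(search(query))
        decision = analyze_results(question, collected)  # e.g. {"retrieve_more": bool, "new_queries": [...]}
        if not decision["retrieve_more"]:
            break
        queries = decision["new_queries"]                # dig deeper with refined queries
    return generate_answer(question, collected)
```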
Of course, this would also increase the
time it takes to generate a response. So
again, back to the use cases. If you
really need high-quality responses, this
is a good pattern to use. Another
variation on this deterministic flow is
adaptive retrieval as opposed to
iterative retrieval. I really like this
design pattern, and I already have some
of this implemented in what I've shown
you so far. So with adaptive retrieval,
what we're doing is we have a query
classifier. So you saw it earlier where
I said should I retrieve yes or no?
That's essentially this piece here. So
if the query classifier runs and it
determines that no retrieval is
required, you can simply just use what's
in the LLM's training data. Then you
just generate a simple response. Whereas
if retrieval is required, it then
defines it as a single step retrieval or
multi-step. If it's a single stage
retrieval, you just send in the single
query or an expanded query and get your
results and output the response.
Whereas, if it's a multi-stage
retrieval, you end up again in a kind of
a recursive pattern. So, you send in
your first query into the vector store,
you get your results and rerank them.
And then based off the retrieval
strategy, if it's multi-step, you go to
an analyze results node and it then
analyzes to see is retrieval complete.
Has it hit enough steps? Because it's a
complex multi-step query. Have you done
enough research? Have you done enough
digging? And if you haven't, it
generates more queries, which go back
to the vector store, and it cycles
through and cycles through until it's
complete and an answer is generated.
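Here's a sketch of that classifier-driven branching, assuming a hypothetical classify_query helper that returns one of three labels, and reusing the iterative loop idea from earlier for the multi-step branch.

```python
def adaptive_rag(question: str, classify_query, answer_from_model,
                 single_step_rag, multi_step_rag) -> str:
    """Route the query based on a retrieval-strategy classification."""
    strategy = classify_query(question)  # "no_retrieval" | "single_step" | "multi_step"
    if strategy == "no_retrieval":
        return answer_from_model(question)  # answer straight from the LLM's own knowledge
    if strategy == "single_step":
        return single_step_rag(question)    # one search pass, then answer
    return multi_step_rag(question)         # recursive retrieve-analyze-retrieve loop
```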
So this is kind of similar to how GPT-5
operates its router. If you interact
with ChatGPT and ask a very
simple question, then behind the scenes,
OpenAI is sending that really simple
question to a small model to generate an
answer. Whereas if it's a moderately
complex question, it'll go to a decent
model, but probably not a frontier
model. Whereas if it's a really complex
question, it's going to hit its largest
model. It's going to use reasoning and
it'll take a couple of minutes to
generate a response. So that's
essentially this pattern, adaptive
retrieval, and within n8n we have some of
this already covered. So the query
intent is essentially your query
classifier and the retrieval decision
here. First determine if this query
requires retrieval from the knowledge
base. Select no retrieval or needs
retrieval. Instead you would have no
retrieval, single-step retrieval, or
multi-step retrieval. And then if no
retrieval you just come back to the
fallback message and output to the user.
Whereas down here, again, you would just
split at this mark, and you would have an
If node, which is
your retrieval strategy. And let's say
if the retrieval strategy is single step
which is the top leg here then you just
generate your answer and off you go.
Whereas if it's multi-step then you
would come back down here. You have an
analyze results. You decide whether you
need to retrieve more. Have you gone
deep enough? If you have, should you
stop? Because you don't want to get
caught in a loop. So, exactly the same
idea. So, you would tie that one there.
You don't need to retrieve more. Or if
you do and you're happy to continue,
that will go back to the vector store
with the updated queries. So, again, the
same type of recursive iteration and
looping, but now it's based off a
retrieval strategy based off a query
classifier. So you can see a lot of
these RAG strategies and tactics are
kind of variations of each other.
Adaptive retrieval is similar to
iterative retrieval. So that's our first
four RAG system designs covered. And
just to reiterate that while all of
these can run on small language models,
they also can run on large language
models. You can have AI agents buried
within flows, deterministic flows like
this. And the beauty of a deterministic
workflow is that it's highly reliable.
It will do the same thing every single
time it runs, which is totally different
to the idea of an AI agent, because with
AI agents, through a system prompt, you're
trying to persuade it to do what you
want, whereas with a deterministic
workflow, it has to do what you want
because you've designed it that way. So
this hybrid approach of a deterministic
yet agentic workflow is highly powerful
in a production RAG setting. Now that
you've seen deterministic RAG system
designs, let's look at some agentic RAG
patterns. Let's start off with standard
agentic RAG and hybrid RAG. And on the
face of it, your standard agentic RAG is
quite simple. It's an AI agent that
receives a message from a user. It has
various tools that it can call to help
retrieve context to generate a response.
So here it could have a vector search
tool, a database search tool, a web
search tool, and you can load in quite a
complex system prompt to try to
convince it to do what you want it to do.
You could give it a standard operating
procedure. You could use prompt
engineering techniques to really urge
the agent to follow the right series of
steps if that's the approach you want to
take. Based off all of that, it can then
output a response to the user. And this
is what it looks like in N8. This is
probably pretty familiar. Standard chat
message, standard AI agent. It has a
vector search tool with an embedding
model hooked up. And then agents also
have memory. So here we've just assigned
a simple memory, but you could have
different types of memory like Postgres
or Zep. And then we have an AI model
hooked up. And as I mentioned earlier,
small language models under 10 billion
parameters can struggle with reliable
tool calling. So your selection of model
here is important. I've hooked up a
Frontier model Claude Sonnet 4.5. So
let's ask the same question. How to
change the power levels on a GE Abacus
mixer. And the beauty of these agents is
that it can hit a vector store numerous
times with different variations of a
query. So this is the same as query
rewriting for example or query
decomposition. It can break up the query
and hit the vector store numerous times.
Sometimes it just does that off its own
bat. Other times you need to prompt it
in the system prompt, but because it has
this function calling loop, it can do it
numerous times. And as you can see, it's
saying it can't find any information
about the abacus mixer because it
doesn't exist. So I'll say, "Sorry, I
meant the GE Advantium oven." And
because there's memory attached to this
agent, similar to the way we were
updating the chat history in the
previous flow, it's able to fill in the
blanks and figure out that it needs to
retrieve cleaning information as well as
how to change the power levels of the
oven. So, it's hit the vector store six
times already and has now output an
answer, which looks quite good. And if
we have a look at the vector search, you
can see the different queries that have
passed: Advantium oven power levels,
cleaning instructions, settings
adjustments, cooking, interior and
exterior care, maintenance.
So, it's just passing lots of different
keywords to try to get a broad range of
candidate chunks that it can actually
feed into its context to output an
accurate answer. And this is why
Frontier models are great. There was no
need for any of the query transformation
steps that I showed you in the previous
design patterns because it's such a
smart model. It can figure that out.
Claude Sonnet 4.5 is probably one of the
best models out there and one of the
most expensive. So by way of a
comparison, let's try this out on the
GPT-OSS 20B model. This is an
order of magnitude smaller than Sonnet
4.5. So I've set the GPT-OSS
Safeguard 20B model there. And
let's ask the same question. So it has
hit the vector search. So it is able to
actually call tools which is great. It
is getting a decent answer. I'm sorry I
don't have the specific instructions.
Let's ask it again just to see do we get
the same response. We probably will
actually. It's in memory. No, we didn't
actually. We are getting some
instructions on how to change the power
levels on a GE abacus mixer which
doesn't exist. So we got it right the
first time, but when I asked the same
question again, it didn't get it right.
Let's reset the chat session and we'll
just try that question again. So, it's
gone to the vector store numerous times.
Again, it's a good answer. It doesn't
actually exist. But just when I repeated
the question, it must have thought that
I really wanted an answer even though it
wasn't in context. Let me try that with
Sonnet 4.5. I'll ask it a couple of times.
Okay. So, we get our apologies. Doesn't
exist. Please clarify. So, let's ask
this exact same question again. Yeah.
And it still doesn't have the
information. So, I think that's kind of
a good example of smaller models. They
are less reliable than the larger more
intelligent models. For some reason,
GPT-OSS 20B thought it was okay to
fabricate the information because I
asked the question twice. Whereas a
larger model is smart enough to realize
it still doesn't exist. It's still not
in context. And that is why benchmarks
exist. It is a quantitative mechanism to
compare these models. Even though by all
accounts they are being gamed by the
various model providers. But either way,
you can see how agentic RAG works and
the idea of multiple tool calls with
different queries. So a lot of the query
transformation is wrapped up in the
reasoning of the model itself. Hybrid
RAG is essentially the same thing. The
only difference is you don't just have
semantic search attached. You also have
different representations of data within
different systems. So here we have a
database search tool. So this is a
Postgres database and you would require
the hybrid RAG agent to write a SQL
query to actually fetch results from
that knowledge base. As for this one,
this is a graph search tool using Neo4j.
So the hybrid RAG agent would need to
write a Cypher query to be able to
traverse the knowledge graph and return
the results. So that's what hybrid RAG
is: retrieval from different types of
data stores, such as vector, graph, or
database.
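For the graph side, the tool behind the agent typically just runs the Cypher that the model writes; here's a hedged sketch using the official neo4j Python driver, with the connection details and the example Cypher query purely illustrative.

```python
from neo4j import GraphDatabase

def run_cypher(cypher: str, params: dict | None = None) -> list[dict]:
    """Execute a Cypher query written by the agent and return plain dict rows."""
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    try:
        with driver.session() as session:
            return [record.data() for record in session.run(cypher, params or {})]
    finally:
        driver.close()

# Illustrative only: a made-up schema linking products to the manuals that document them.
rows = run_cypher(
    "MATCH (p:Product {name: $name})-[:DOCUMENTED_IN]->(m:Manual) RETURN m.title AS title",
    {"name": "GE Advantium"},
)
```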
If you want more information on database
search tool calling, Allan
has a video on our channel called
Agentic Databases, which is well worth a
watch. And on the graph search, I just
released a video on graph agents a
couple of weeks ago where I show how to
set up Neo4j and actually configure your
n8n agent to traverse through a custom
knowledge graph that you can create.
I'll leave links for these in the
description below. So on to pattern
seven, which is our multi-agent RAG
system using sub-agents. You've probably
seen a lot of this on YouTube with n8n
videos, but here, instead of tool calls,
which are outbound calls external to
different services, we're calling another
agent, and
that agent has its own system prompt and
has its own context window. And this is
a really good pattern because if we go
back up to our agentic RAG setup, you
could have quite a complex system prompt
for this AI agent. And if you have a
really smart model like Sonnet 4.5, it
will by and large follow that prompt.
Particularly if you enable reasoning and
it can think about it a bit. But at the
same time, there is a limit to the
complexity of system instructions that a
model can actually follow. At a certain
point, you need to break that apart into
different specialized agents. And that's
what this multi-agent RAG setup is. So
here we can simplify the system prompt
of our agent and farm out some of those
responsibilities to sub-agents. So if
we're looking at our database example,
you could have a database sub-agent that
has numerous SQL tool calls. Sometimes
these SQL queries can be incredibly
complicated, and with few-shot prompting,
all of that buried in a system prompt at
the AI agent level would totally
overwhelm it, and it wouldn't know where
to focus to actually follow the correct
set of instructions. So instead, we can
dedicate a sub-agent
just for the database queries and we can
give it all of the examples that it
needs to effectively trigger these SQL
tools. So that's a good example of
simplifying the system prompts. Another
benefit of this approach is protecting
the context window. I released this
video 3 weeks ago on a technique called
context expansion. And what this means
is you have the ability to load up an
entire document into context if the
query actually requires it. So let's say
a question came in asking can you
summarize this entire document and it
provides a link to a document. For an
LLM to summarize it, you can't really
use rag because rag will only give you
segments of the document. If you need to
summarize it, you need visibility of the
entire document which means you need to
load the entire document into context.
The problem with that is there is a
limited amount of space in context. So
if your main agent has, let's say, five
steps in a procedure to follow to output
a result, and one of those steps is to
actually load an entire document, then
once it does that, it might get confused
because it doesn't know where to focus
within the context itself, and it might
end up messing up steps three, four, and
five as a result. So with this pattern,
you can offload the context overhead to
a sub-agent. This document researcher,
its only job is to load a document,
summarize it, and send back the summary.
And at that point, it can just forget
it. So the agent will never see the full
document. It'll only get the summary to
continue to steps three, four, and five.
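Here's a sketch of that idea: a document-researcher sub-agent that loads the full document into its own context and hands back only the summary, so the main agent's context stays clean. The fetch_full_document and call_llm helpers are hypothetical.

```python
def document_researcher(document_id: str, fetch_full_document, call_llm) -> str:
    """Sub-agent: load the whole document in its own context window, return only a summary."""
    full_text = fetch_full_document(document_id)  # could be thousands of tokens
    summary = call_llm(f"Summarize this document in a few paragraphs:\n\n{full_text}")
    # The full text never reaches the main agent, so its remaining procedure
    # steps aren't drowned out by a huge document sitting in context.
    return summary
```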
So there are lots of benefits to having
a multi-agent setup with sub-agents. And
this is what it would look like in n8n,
where you have your agentic RAG agent,
and as tools, you then have your
sub-agents. And
this is pretty easy. You just click on
the plus and just type in agent and you
have this AI agent tool and then you can
just attach your model attach memory if
you need to and attach the various
tools. So for my database sub-agent, you
could have lots of different kinds of SQL
Postgres tools with different prompting.
The document researcher could just have
a fetch full document tool and that
could be Google Drive, that could be a
Google Cloud bucket, or it could even be
a Postgres tool that fetches all of the
chunks based off a document ID. So then
when this agent is called it can trigger
the sub-agent to carry out the work,
provide the responses and it can
continue on its merry way to generate an
answer to the user. So it's a pretty
good pattern. I would have one word of
warning though which is that you do need
to be highly specific around the roles
and responsibilities within this type of
structure. A few months ago, I created a
vast multi-agent system called HAL 90001,
as you can see here, and within this
setup I had 25 sub-agents for a main HAL
orchestrator agent. It was incredibly
unwieldy, because each of the agents
needs to be tightly configured and the
tools need to be tightly defined.
Otherwise, it becomes an unreliable mess
trying to actually get responses that
are in any way accurate, and it can get
lost in conversation between the various
agents. So, you do need to be quite
conservative when you're actually
building out multi-agent RAG systems. I
would honestly start with just a single
sub-agent, get that
working really well, then add another
sub-agent and regression-test the whole
thing and continue on. Conservative is
best, I think, with this one. Another
pattern for multi-agent RAG is with
sequential chaining. So this is more
aligned to the deterministic workflows
that I showed earlier. Here you would
have a chat input. It would go to one
agent who would produce a response after
interrogating various tools and that
would then feed another agent to carry
out another action to produce a response
to the user. So it is more sequential.
Again you can have simpler prompts. Each
agent can be more specialized. So for
example, you could have a research agent
which carries out research based off
various tools and then that goes to a
writer agent who creates a blog post and
that goes to a publication agent who
publishes it to social media or to
WordPress. And this doesn't need to be a
chat trigger. This is more aligned to
the AI RAG automation that I talked
about earlier in the use cases section.
This is our agentic RAG blogging system.
And here the actual trigger is NocoDB,
or it runs on a schedule. NocoDB is
an open-source Airtable equivalent. And
depending on what action was triggered,
if the blog article is ready for
outlining, it goes up to this agentic
researcher. And this is a rag researcher
that has access to various tools like a
semantic search database, a structured
SQL list based off a table. We have deep
research with perplexity and a deep
search with Jina. And this agentic
researcher is part of a sequential chain
because once the outline is created, it
can then go and actually generate the
article. Now there is human in the loop
built in as well. But from here then
it's a full chain of AI nodes and
different features to actually build out
the article: generating images or
searching for stock images, writing and
uploading the article to WordPress. Here
we're using another AI agent, uploading
the featured image, writing the social
posts, updating the publication system
in NocoDB. So you can see the
benefit of a fully sequentially chained
multi-agent system. And it can be a
hybrid as well. These don't need to be
agents. These could be kind of one-off
LLM calls that generate text responses
or JSON outputs that feed the next step
in an automation. And finally, the last
pattern for today is a little bit of a
mix of everything. It's a multi-agent
RAG system with routing and sequential
chaining. So, for example, if it's a
very simple question, we can just drop
it into the simple response and send it
back to the user. Whereas, if it's a
question about writing, it can go to
this agent. About research, it can go to
that agent. And this is a great way of
speeding up inference of an agent
because if you go to a multi-agent setup
with sub-agents, there's a lot of
supervision and orchestration of the
responses. Whereas here, you can just
route the query directly to the agent
who actually can answer the question.
And you still have separation of
concerns. You still have specialized
agents with their simpler system prompts
and specialized tools to carry out the
tasks. So this is how it looks in n8n.
Again, standard chat trigger. We have
our query classifier and router. Off the
back of that, then we get a structured
output that we can then funnel the user
down a path. So, if it's a simple
question, we can then just generate a
friendly response and output the answer.
If it's a question about publishing a
piece of content, for example, it could
go to the publication agent. And you can
mix and match the sequential side of it,
too. If it's a task about researching
and drafting a piece of text, then it
could go to a research and writing agent
that are then linked in a standard
sequential chain and then you get the
output response. So, it's quite
simplistic here, but you get the design
pattern that we're trying to achieve,
which is this idea of query routing in a
multi-agent setup. There are other
patterns that I didn't dive into on this
video. Self-RAG, for example, uses
reflection tokens to determine the
quality of retrieval and whether it
needs to go and retrieve again. So it's
kind of a variation of iterative
retrieval. Corrective RAG is an approach
that uses web search as a fallback if
the content can't be found in the
knowledge base. Guardrails are a way of
protecting your system from prompt
injection attacks or PII leakage. Human
in the loop is a way of escalating and
bringing in a human to actually interact
with the flow. And this can be quite
useful for the likes of these rag
automations as opposed to the chat
interfaces. Human handoff is another
variation of this where the actual chat
message is directed to a real-life human
when the agent can't actually answer the
question of the user. That obviously
works great if you have a support team
on standby waiting for messages to come
in. Multi-step flows is another pattern
which is more aligned to traditional
conversational chat flows where
questions are asked in sequence and then
the answers feed the next stage of the
sequence. And the key challenge there in
a conversational interface is keeping
things on track because if people break
off mid-flow, then how do you know what
they're now talking about because they
were in the tunnel completing a
particular task within the actual
chatbot. Context expansion is the
technique I discuss in my Next Level RAG
video, where you grab neighboring
chunks, document sections, subsections
or even the full document itself. And
then Deep RAG is an emerging trend
around this idea of deep agents and
having a retrieval planning system and
executing that plan over a longer time
horizon than you typically would find in
a conversational chatbot. Finally,
there's one technique I didn't discuss
in this video, and not enough people talk
about it, which is lexical keyword
search with dynamic hybrid search. If you
want to learn more about that, then
click on this thumbnail.
Get our advanced RAG workflows and learn how to implement these patterns in n8n, in our community: https://www.theaiautomators.com/?utm_source=youtube&utm_medium=video&utm_campaign=tutorial&utm_content=design_patterns

There's no one-size-fits-all approach to RAG, and choosing the wrong pattern from the start sets you up to fail. In this comprehensive deep-dive, you'll learn the 9 essential RAG design patterns that cover every use case, from basic naive RAG to state-of-the-art multi-agent systems. I'll show you exactly when to use each pattern, why your model selection dictates your entire system design, and how to balance the critical trade-offs between intelligence, speed, and cost.

What You'll Master:
- Why model selection (small vs frontier LLMs) fundamentally shapes your RAG architecture
- How to match RAG patterns to your specific use case priorities
- When to prioritize speed vs accuracy vs cost in your system design
- 9 complete RAG design patterns with real n8n implementations
- Advanced multi-agent orchestration strategies
- Query routing, transformation, RAG fusion and verification techniques
- How to avoid the pitfalls that make RAG systems unreliable

Real-World Use Cases Covered:
1. Customer-Facing Chatbot - Lightning-fast responses with query routing and answer verification
2. AI Assistant/Co-Pilot - Deep accuracy for legal departments with iterative retrieval
3. AI Automation - Background RAG systems embedded in content workflows
4. Fully Local RAG - Resource-constrained systems with optimal model selection

The 9 RAG Design Patterns:
1. Naive RAG - The foundation (and why it's rarely enough)
2. Query Transformation, RAG Fusion & Verify Answer - Ensuring grounded, accurate responses
3. Iterative Retrieval - When one pass isn't enough
4. Adaptive Retrieval - Dynamic decision-making for retrieval
5. Agentic RAG - Non-deterministic systems with control over retrieval
6. Hybrid RAG - Systems that can retrieve from different types of knowledge bases (graph, structured, semantic, etc.)
7. Multi-Agent RAG with Sub-Agents - Distributing cognitive load
8. Multi-Agent RAG with Sequential Chaining - Specialized agents in deterministic flows
9. Multi-Agent RAG with Routing - Intelligent query distribution

Plus bonus coverage of: Self-RAG, Corrective RAG, Guardrails, Human-in-the-Loop, Context Expansion, and Deep RAG patterns!

Useful Links:
Context Expansion Video: https://www.youtube.com/watch?v=y72TrpffdSk
RAG at Scale Video: https://www.youtube.com/watch?v=sn0SjjkRhxI&t=285s
Dynamic Hybrid Search: https://www.youtube.com/watch?v=FgUJ2kzhmKQ&t=613s

Timestamps:
00:00 - Intro
02:48 - Use Cases & Priorities
07:21 - 1. Naive RAG
10:13 - 2. Query Transformation & Verify Answer
19:48 - 3. Iterative Retrieval
21:14 - 4. Adaptive Retrieval
25:25 - 5-6. Agentic + Hybrid RAG
31:24 - 7. Multi-Agent RAG with Sub-Agents
36:10 - 8. Multi-Agent RAG with Sequential Chaining
38:28 - 9. Multi-Agent RAG with Routing
40:02 - Bonus: Other Patterns

Questions or Comments? Drop them below! I read every comment and love hearing about what RAG systems you're building and which patterns work best for your use cases.

If you found this valuable:
- Like this video
- Subscribe for more advanced AI automation content
- Share with someone building RAG systems