There's no one-size-fits-all approach
when it comes to building RAG systems.
It's highly dependent on your use case,
and if you choose the wrong pattern to
start, you're setting yourself up to
fail. The key element here is you need
to strike the right balance between
model intelligence versus speed and
cost. Think about it. A customer-facing
chatbot needs lightning-fast
responses, otherwise you've lost your
customer. So, you can't have a frontier
model like Claude Sonnet thinking for 5
minutes before it comes up with an
answer. Whereas a RAG agent deployed in
a legal department absolutely
prioritizes accuracy over speed. And if
you take a fully local RAG system, for
example, here you're massively
constrained by resources. So you need to
design your RAG system around the models
you can actually afford to run in-house.
This is the main reason why there are
hundreds of RAG strategies and tactics
out there. And I've distilled them down
to nine key RAG design patterns that
cover all use cases. And that's what
we'll be going through today. We'll be
starting with the most basic, which is
naive RAG, and work our way up to the
state-of-the-art multi-agent RAG
systems. And this isn't just theory.
Today, I'll be showing you how this can
be implemented in n8n to give you the
knowledge you need for your RAG project
to succeed. Let's get into it. The
models you pick for your RAG system will
heavily influence how you design
everything else. And the reason for this
is generally speaking, the larger the
LLM is, the more parameters it has, the
higher the quality responses you're
going to get out of it. This is because
larger models are generally better at
instruction following, tool calling,
complex reasoning, and context handling.
So by and large, you do get better
responses out of larger models compared
to smaller models. And what I found is
small language models from let's say 1
to 10 billion parameters can't usually
reliably call tools and in a lot of
cases can only handle basic instruction
following. But then there's a steep rise
after let's say the 15 billion parameter
mark where the model's abilities around
tool calling and reasoning and context
handling improve all the way up then to
the main frontier models like Claude
Sonnet and GPT-5. That's why I think for
RAG systems there's somewhat of a sweet
spot in between these small and frontier
models. A model like GPT-OSS 120B, for
example, really is highly capable and
nowhere near as expensive as the likes
of Claude Sonnet or GPT-5. But
the crucial detail here is even small
language models of 15 to 20 billion
parameters can outperform large language
models if they have high quality
retrieval. So, it's all down to your RAG
system design. If you can inject the
right content into the context of a small
language model, it will always produce a
better answer than a large language
model that doesn't have the information
to base an answer off. Before we get
into the design patterns, let's look at
four RAG use cases and focus on the
different priorities that will dictate
the system design. First up, we have a
customer-facing RAG chatbot. So you can
imagine that this is embedded on let's
say an e-commerce website where
customers can chat and ask questions
about the product catalog or delivery
information. A key priority of a
customer-facing chatbot is speed. It
has to be fast and that priority will
dictate which model you use. So you're
going to need to use something like
Gemini 2.5 Flash, which is a smaller
model that'll give you much faster
responses. And as these models are
cheaper to run, your bill is going to be
lower, which is a good thing. But
because this is a public-facing
chatbot, this will be deployed at scale.
You could have thousands of customers
asking questions of it. So you're also
constrained by that, too. You can't have
frontier models that cost $15 per
million output tokens. So these
constraints, these priorities dictate
the system design. And as a result, then
you'll likely use a mixture of agentic
and non-agentic RAG, which I'll be going
through in a few minutes. You'll need
guard rails because you don't want this
chatbot chatting about anything. It
needs to be highly specific and highly
specialized on, let's say, an e-commerce
catalog. You'll most likely need query
routing so that you can route the
different types of questions to the
right tools. I'll explain all of these
in a few minutes when we go through the
design patterns. And a key pattern you
would likely use here is the verify
answer pattern because you want to make
sure that what you're telling the
customer is actually true and is
grounded in the retrieved data from your
systems. In other words, you're better
off not saying anything as opposed to
saying something false. Next up, we have
the idea of an AI assistant or co-pilot
like the AI agent deployed in a legal
department. So here, accuracy is
absolutely crucial, particularly if
you're dealing with complex documents.
So with accuracy being the guiding light
in this project, speed is going to be
reduced because you will want complex
reasoning over the documents. You may
need iterative and recursive retrieval.
And if you're using more expensive
models and larger context windows,
that's all going to come at a greater
cost. So it will be more expensive per
user. But because this isn't public
facing, you'll likely have a smaller
user base. So the overall cost won't be
too high. So this type of system could
be deployed in lots of different ways.
So you could have an MCP trigger node in
n8n, and then if Claude Desktop is deployed
within a business they can simply add it
as a tool and then chat to their
documentation that way. Or it could be
deployed in something like Open WebUI,
or this could be a Slack agent or a
Microsoft Teams agent. So that way you
don't need to deploy software to
employees in a business. The types of
techniques you would likely need here
include agentic RAG, multi-agent RAG,
hybrid RAG, and that verify answer
pattern as well. We'll be deep diving on
all of these shortly. Another use case
is an AI automation with RAG embedded
within it. This is an agentic blogging
system I created a few months ago. And
within this sequential flow, we have an
agentic rag which retrieves information
from a knowledge base. And all of that
feeds into article outlines and article
generation, which then goes through a
deterministic pattern to build out an
article to be published on WordPress. So
it's just another RAG use case where
there are again different priorities. So
this type of automation will be running
in the background. So speed is not a
problem. The key thing is it's
resilient. So you can apply larger
models with reasoning. So we have agentic
RAG and non-agentic RAG here in a
deterministic flow and a lot of prompt
chaining going on as well. And finally
onto our fully local RAG system here.
Your model selection is highly limited
based on the internal hardware that's
available to run these models. So with a
$3,000 graphics card, you may only be
able to run a 20 billion parameter
model. And then that massively
influences the design of your system. So
while there is an initial capital cost
up front, the actual cost per user is
quite low over the long term and how
fast the model inference is really
depends on the graphics cards that you
have in play. These types of solutions
could be deployed again using the likes
of Open Web UI because you want to keep
everything local. I put days of work
into pulling this video together. So if
you're finding value in it, I'd really
appreciate it if you gave the video a like
and subscribed to our channel for more
deep AI and n8n content. Now that you
have an appreciation for the different
types of RAG use cases, let's dive into
our deterministic RAG system designs.
First up, naive RAG. And naive RAG is
the most basic way of running a RAG
system. Essentially, you have a message
that comes in, it goes straight to a
vector store to retrieve relevant
results, which go straight to an LLM to
generate a response, and that's output
to the user. So, it's ultra fast but
ultra basic. And this is how it looks in
n8n: a chat trigger straight into
Supabase, aggregate all of the chunks that
we get back from the vector store, and
straight into an LLM call. Now I have
this hooked up to a large vector store.
So my documents table here has 215,000
chunks. And here I'm going to
demonstrate the exact problem with naive
RAG. So I'm going to ask this question.
How to change the power levels on my GE
Advantium oven. So here I ingested
10,000 product manuals of kitchen
appliances. And if you want to see how
we did this, check out our RAG at Scale
series on our channel, which includes
these two videos. I'll leave links for
these in the description below. So if we
ask this question, you'll see that we
quickly get chunks from the vector
store. Quickly get an answer. The answer
being sorry, I don't know. And the
reason for that is that the chunks that
we got back from the vector store are
not relevant. And you can see that the
chunks are related to a different GE
model. There's a cafe model. It seems to
be here again a cafe model. So it has
returned chunks that are related to
power levels of an oven. But it's the
wrong model. And within our system
prompt, we are being ultra specific. The
LLM needs to analyze the retrieved chunks
to determine if they're relevant. And if
they are, base the answer off those
chunks. If they're not, just say sorry,
I don't know. And that in a nutshell is
the limitation of naive RAG. It's a
single pass through to generate an
answer. Even though that query could be
quite conversational with lots of stop
words and irrelevant words that are
going to negatively impact the quality
of the embedding that's sent into the
vector store to find the nearest
matches. But obviously the benefit of
this approach is simplicity and speed.
End to end, this took 2.5 seconds and
included a vector search and an LLM
inference. So it's highly efficient, if a
little blunt.
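To make that single-pass shape concrete, here's a minimal naive RAG sketch in Python. The embed, vector_search, and generate helpers are hypothetical stand-ins for whatever embedding model, vector store, and LLM you're using, so treat it as an illustration of the flow rather than a drop-in implementation.

```python
def naive_rag(question: str, embed, vector_search, generate) -> str:
    """Single-pass RAG: embed the raw question, fetch chunks, answer once."""
    query_embedding = embed(question)                  # raw, untransformed query
    chunks = vector_search(query_embedding, top_k=10)  # nearest-neighbour chunks
    context = "\n\n".join(chunk["text"] for chunk in chunks)
    prompt = (
        "Answer ONLY from the context below. "
        "If the context is not relevant, say 'Sorry, I don't know.'\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)                             # one LLM call, no retries
```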
So as a result of these shortcomings, a
lot of techniques have been invented
around the idea of transforming the
original user's query.
So in our next design pattern, we
implement a lot of these query
transformation techniques just to give
you an idea of how they all work
together. If you'd like to get access to
the nine different RAG design patterns
that I discussed in today's video, then
check out the link in the description to
our community, the AI Automators, where
you can join hundreds of fellow builders
all looking to create production rag
applications. If you're serious about
building AI agents, this is the place to
be. So, our second pattern is a
deterministic RAG flow with query
transformation and the verify answer
pattern that I talked about previously.
So, here there's no tool calling. So we
need to decompose the entire retrieval
flow. This type of flow would work with
most LLMs from 5 billion parameters up
to trillions of parameters. And here
I'll be demonstrating how when a user's
question comes in, we first decompose
that query because there might be
multiple asks within that question. And
then for each of those sub queries, we
expand upon it. So we find different
angles to actually search the knowledge
base for that question. We then carry
out multiple vector searches and then
use a technique called RAG fusion, or
multi-query RAG, where we actually fuse
those result sets together and carry out
reranking. This then gives us the top
results from the much larger candidate
set that was retrieved. That's then sent
to the LLM to generate an answer. And
then our verify answer call checks the
answer against the retrieved context
from this stage to make sure that it's
actually accurate. If it is, it outputs
it to the user. If it's not, it goes
back to the LLM to generate an updated
answer, along with guidance on where it
went wrong.
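Here's a rough sketch of the decomposition and expansion steps, assuming a hypothetical call_llm helper that returns the model's text and that you ask the model for JSON; in n8n these would be separate LLM nodes with structured output parsers.

```python
import json

def decompose_query(question: str, call_llm) -> dict:
    """Split a multi-part question into sub-queries and flag whether retrieval is needed."""
    prompt = (
        "Break the user question into standalone sub-queries and decide if "
        "retrieval from the knowledge base is required. Respond as JSON with "
        'keys "needs_retrieval" (bool) and "sub_queries" (list of strings).\n\n'
        f"Question: {question}"
    )
    return json.loads(call_llm(prompt))

def expand_query(sub_query: str, call_llm) -> list[str]:
    """Rewrite one sub-query into several phrasings/synonym variants for search."""
    prompt = (
        "Rewrite this search query 3 ways using synonyms, related terms and "
        "different phrasing. Respond as a JSON list of strings.\n\n"
        f"Query: {sub_query}"
    )
    return [sub_query] + json.loads(call_llm(prompt))
```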
So, let's see it in action. And for this,
unlike our previous question, let's ask a
multifaceted question: how to change the power levels
on my GE Abacus mixer? Now, an abacus
mixer doesn't exist. And how do I clean
it? So, let's ask it here. And off it
goes. Now I've tied in chat history as
well so that we can retrieve the chat
history and there is nothing yet because
I've refreshed the session. We've got
back a quick answer which is great. But
let's just look at what happened. So
firstly we go to this query intent and
decomposition step. And what we're doing
here is we're breaking down this
question into multiple sub queries or
sub questions. And on the right hand
side here you can see number one it's
outputting that it needs retrieval which
is crucial and number two it's
outputting the sub queries. The first
one is how to change the power levels on
a GE abacus mixer and the second is how
to clean it. It then goes through this
retrieval gate because if someone is
just asking a question like hi how are
you? You don't need to go to a vector
store. You can simply just output an
answer up here which is ideal for small
talk as you can see. But because this
question does require retrieval, it goes
through this gate and we then go to our
query rewriting and expansion call. And
here we're transforming the original
query or the subquery in this case by
adding synonyms, related terms,
different phrasing, trying to come at it
from different angles to get different
chunks back from the vector store for a
more rounded candidate set. So within
this node, on the left you can see we
have "how to change the power levels" and
"how to clean it", and on the right you can see
the different rewritten queries that come
at the search from different angles. The
other thing we're doing is we're
asking the LLM to classify the incoming
message based off a product category
because this will help us narrow the
search within the vector store. We have
200,000 chunks. So, ideally, we should
be able to narrow this with the metadata
filter. And that way, we'll have a much
more focused search for these terms to
be able to get back the right chunks.
And as you can see here, it's saying
that it's unsure what product category
this product fits into. So, a lot of
this is in the system prompting. So,
here I'm saying the only valid options
are ovens, washing machines, toasters,
and unsure. But you could list all of
your metadata values here. Okay. So from
here then we're splitting out again. And
here I have a simple code node that
checks to see if we are unsure of any of
the categories, because if we are unsure,
we should just go straight back to the
user and ask for clarification. And this
brings us to this If node, and because we
are unsure of what this Abacus mixer is,
we come down to a query clarification
call. And here we've identified an
ambiguous or an underspecified query and
we're going to ask the user to actually
clarify. And it has output the response:
"To help me provide the most accurate
information, please clarify what type of
mixer you're referring to." And that's
effectively then what's output to the
user. And then we've updated the chat
history. So you can see then within the
chat messages, what the LLM generated is
output to the user: are you asking
about a food mixer or an audio mixer? So
this is a great way to ask clarifying
questions of the user to help you refine
your search within the vector store
based off metadata filters. So let's
just follow up and say sorry I meant the
GE Advantium oven. Okay. And that's
working its way back through the
process. And now it has identified the
actual product category. So let's now
take it up here. So we've now passed
this gate and it's true. And now we're
into a technique called query routing.
So here we're going to direct the
queries to the most appropriate metadata
filter in this case. So the switch has
ovens, washing machines, and toasters.
So here we're coming up through the
ovens path which sets the metadata
filter. So we're setting the product
category as oven. And when we go to the
vector search, we're actually passing
that product category of oven along with
the prompt.
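As an illustration of what that filtered search can look like under the hood, here's a hedged pgvector-style sketch in Python; the documents table, the product_category metadata key, and the cosine-distance operator reflect a typical Supabase/pgvector setup, but your schema and column names will differ.

```python
def build_filtered_search(query_embedding: list[float], category: str, top_k: int = 10):
    """Return SQL + params for a metadata-filtered similarity search (pgvector-style)."""
    sql = """
        SELECT id, content, metadata
        FROM documents
        WHERE metadata->>'product_category' = %(category)s   -- narrow 200k chunks first
        ORDER BY embedding <=> %(query_embedding)s::vector   -- cosine distance
        LIMIT %(top_k)s;
    """
    return sql, {"category": category, "query_embedding": query_embedding, "top_k": top_k}
```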
And because we've expanded out the
prompts previously, this is just one of
five searches we're going to have on this
vector store. So this one is GE Advantium oven change
power levels. But if we look back here
at the query rewriting and expansion,
you can see that we have multiple
queries that we're going to search the
vector store off. How to change power
levels, how to adjust cooking power, how
to clean the oven, cleaning
instructions, maintenance guides. So all
of these keywords are going to be sent
into the vector store to get back a
diverse collection of chunks. So you can
see this is already way better than
naive RAG. Naive RAG sent in a really
long, rambling query from the user. Whereas
here, not only is the query rewritten,
it's also expanded into multiple queries
to cover a wider range. As a result, we
end up with 50 items coming back across
all of the different queries because
we're getting 10 chunks per search. We
then go through the process of RAG
fusion. So with these multiple queries,
we're going to end up with a lot of
duplicate results. So we need a way of
deduplicating and fusing the results
together into an ordered list. And
that's what this reciprocal rank fusion
step does. So essentially if one chunk
appears in multiple lists, it'll rank
higher than a chunk that only appears in
one of the lists. The lists being the
five queries that we ran against the
vector store. So of the 50 chunks that
we retrieved from the vector store,
there are only 20 unique chunks, and they
are now ordered based off that fusion
process.
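Reciprocal rank fusion itself is only a few lines; here's a minimal sketch, where each result list is the ordered chunk IDs returned for one of the rewritten queries and k=60 is the commonly used smoothing constant.

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one deduplicated, ordered list."""
    scores: dict[str, float] = {}
    for ranked_ids in result_lists:
        for rank, chunk_id in enumerate(ranked_ids):
            # A chunk appearing high up in several lists accumulates a bigger score.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```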
But that fusion process is pretty crude,
as I just described, so we also carry out
a re-ranking stage. And here we're using
Cohere Rerank 3.5. You could use a local
re-ranking model if you wanted. But what
re-ranking does is send in the user's
question, as well as the 20 items in this
case, into a cross-encoder model, which is
similar to a basic LLM. And what we're
doing is asking it to take the 20 items
and return only the 10 most relevant
ones, in order. And that's what it has
done there.
up then to send into the LLM, which is
what we do to generate the answer. And
you can see we're sending in the user's
question, the chat history, as well as
the top 10 chunks ordered by relevance
from the re-ranker. And that's helping
to produce this quite accurate answer.
And finally, we go through our verify
answer pattern with the feedback loop.
So what we do is we send in that answer
that was just generated along with the
user's question along with the chunks
that we got back from the vector store
and we're asking the LLM to make a
judgment on is this answer fully
grounded in the context that was
retrieved. Are there any contradictions?
Are there any unsupported claims? And
from there then there's a decision. If
it's grounded we can just output the
answer to the user. If it's not
grounded, then we can go back to the LLM
and pass the feedback to say you need to
make these changes and it works its way
back through the process. It only does
this a couple of times, because you don't
want to get caught in an infinite loop.
And finally, it outputs a response to
the user and updates the memory.
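The verify answer loop boils down to an LLM-as-judge call plus a capped retry; here's a hedged sketch with hypothetical generate_answer and judge_grounding helpers, where the judge is assumed to return a verdict plus feedback.

```python
def answer_with_verification(question: str, context: str,
                             generate_answer, judge_grounding,
                             max_attempts: int = 2) -> str:
    """Generate an answer, check it is grounded in the retrieved context, retry with feedback."""
    feedback = ""
    answer = generate_answer(question, context, feedback)
    for _ in range(max_attempts):
        verdict = judge_grounding(question, context, answer)  # e.g. {"grounded": bool, "feedback": str}
        if verdict["grounded"]:
            return answer
        feedback = verdict["feedback"]                         # tell the LLM what to fix
        answer = generate_answer(question, context, feedback)
    return answer  # cap the loop so we never spin forever
```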
So that's what the end-to-end process
looks like for a deterministic RAG flow
with query transformation and the verify
answer pattern. And what's brilliant
about this is a small language model can
actually run this. There's no tool
calling. Any of the really complex
system prompts have been broken into
multiple calls to LLMs. And time-wise,
if we look at the logs here, you can see
that end to end, this took 30 seconds to
retrieve quite an accurate answer. And
I've baked a lot of query transformation
strategies in here. So you could
probably strip some of these out and
actually get that even faster. And it is
worth calling out that this is what
production RAG systems look like. It's
highly focused LLM calls. It's super
fast vector searches with chunks
injected into context to get ultra fast
answers. And if you look through the
logs here, the real bottlenecks are
actually in the verify answer pattern,
which takes 9 seconds because there's a
bit of reasoning required over the data.
It's absolutely possible to use agents
within this type of flow, so you can end
up with a hybrid agentic-yet-deterministic
workflow. Our third strategy is
deterministic RAG using
iterative retrieval. And it's a similar
flow to before with the exception that
after the re-ranking stage before
generating the answer, we have an LLM
call that analyzes the results and it
determines whether we actually need to
retrieve more from the vector store
because the quality of results isn't
good enough to formulate an answer. And
then you can keep a count on the number
of iterations that you progress through
to again avoid an infinite loop similar
to the verify answer pattern. And once
you're happy that you've retrieved
enough or you've hit the limit, you can
then go to an LLM to generate a response
to the user. I haven't mapped this one
out in n8n, but essentially you'll be
injecting the LLM call here. This would
be your analyze results call. And then
you would have an if node. That if node
is should you retrieve more and if we
should retrieve more then we need a
counter to make sure that we're not
stuck in a loop. So should we stop we'll
drop that in there and if we should stop
then generate the answer. If we
shouldn't stop if we can go again then
we go back to the vector store and we
pass in the new queries that were
generated from this analyze results
node. And if the retrieve more says
we're done we can also just connect that
up to the generate answer. So that's the
type of implementation you would need.
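In code form, the iterative version is just a bounded loop around the search; this sketch assumes hypothetical search, analyze_results, and generate_answer helpers, with the analysis step returning whether to keep going and which new queries to run.

```python
def iterative_retrieval(question: str, search, analyze_results, generate_answer,
                        max_iterations: int = 3) -> str:
    """Keep retrieving until the analysis step says the context is good enough (or we hit the cap)."""
    queries = [question]
    collected: list[str] = []
    for _ in range(max_iterations):
        for query in queries:
            collected.extend(search(query))
        decision = analyze_results(question, collected)  # e.g. {"retrieve_more": bool, "new_queries": [...]}
        if not decision["retrieve_more"]:
            break
        queries = decision["new_queries"]                # dig deeper with refined queries
    return generate_answer(question, collected)
```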
Of course, this would also increase the
time it takes to generate a response. So
again, back to the use cases. If you
really need high-quality responses, this
is a good pattern to use. Another
variation on this deterministic flow is
adaptive retrieval as opposed to
iterative retrieval. I really like this
design pattern, and I already have some
of this implemented in what I've shown
you so far. So with adaptive retrieval,
what we're doing is we have a query
classifier. So you saw it earlier where
I said should I retrieve yes or no?
That's essentially this piece here. So
if the query classifier runs and it
determines that no retrieval is
required, you can simply just use what's
in the LLM's training data. Then you
just generate a simple response. Whereas
if retrieval is required, it then
defines it as a single step retrieval or
multi-step. If it's a single stage
retrieval, you just send in the single
query or an expanded query and get your
results and output the response.
Whereas, if it's a multi-stage
retrieval, you end up again in a kind of
a recursive pattern. So, you send in
your first query into the vector store,
you get your results and rerank them.
And then based off the retrieval
strategy, if it's multi-step, you go to
an analyze results node and it then
analyzes to see is retrieval complete.
Has it hit enough steps? Because it's a
complex multi-step query. Have you done
enough research? Have you done enough
digging? And if you haven't, it
generates more queries, which go back
to the vector store, and it cycles
through and cycles through until it's
complete and an answer is generated.
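Here's a sketch of that classifier-driven branching, assuming a hypothetical classify_query helper that returns one of three labels, and reusing the iterative loop idea from earlier for the multi-step branch.

```python
def adaptive_rag(question: str, classify_query, answer_from_model,
                 single_step_rag, multi_step_rag) -> str:
    """Route the query based on a retrieval-strategy classification."""
    strategy = classify_query(question)  # "no_retrieval" | "single_step" | "multi_step"
    if strategy == "no_retrieval":
        return answer_from_model(question)  # answer straight from the LLM's own knowledge
    if strategy == "single_step":
        return single_step_rag(question)    # one search pass, then answer
    return multi_step_rag(question)         # recursive retrieve-analyze-retrieve loop
```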
So this is kind of similar to how GPT-5
operates its router. If you interact
with ChatGPT and ask a very
simple question, then behind the scenes,
OpenAI is sending that really simple
question to a small model to generate an
answer. Whereas if it's a moderately
complex question, it'll go to a decent
model, but probably not a frontier
model. Whereas if it's a really complex
question, it's going to hit its largest
model. It's going to use reasoning and
it'll take a couple of minutes to
generate a response. So that's
essentially this pattern, adaptive
retrieval, and within n8n we have some of
this already covered. So the query
intent is essentially your query
classifier and the retrieval decision
here. First determine if this query
requires retrieval from the knowledge
base. Select no retrieval or needs
retrieval. Instead you would have no
retrieval, single-step retrieval, or
multi-step retrieval. And then if no
retrieval you just come back to the
fallback message and output to the user.
Whereas down here, again, you would just
split at this mark, and you would have an
If node, which is
your retrieval strategy. And let's say
if the retrieval strategy is single step
which is the top leg here then you just
generate your answer and off you go.
Whereas if it's multi-step then you
would come back down here. You have an
analyze results. You decide whether you
need to retrieve more. Have you gone
deep enough? If you have, should you
stop? Because you don't want to get
caught in a loop. So, exactly the same
idea. So, you would tie that one there.
You don't need to retrieve more. Or if
you do and you're happy to continue,
that will go back to the vector store
with the updated queries. So, again, the
same type of recursive iteration and
looping, but now it's based off a
retrieval strategy based off a query
classifier. So you can see a lot of
these RAG strategies and tactics are
kind of variations of each other.
Adaptive retrieval is similar to
iterative retrieval. So that's our first
four RAG system designs covered. And
just to reiterate that while all of
these can run on small language models,
they also can run on large language
models. You can have AI agents buried
within flows, deterministic flows like
this. And the beauty of a deterministic
workflow is that it's highly reliable.
It will do the same thing every single
time it runs, which is totally different
to the idea of an AI agent, because with
AI agents, through a system prompt, you're
trying to persuade it to do what you
want, whereas with a deterministic
workflow, it has to do what you want
because you've designed it that way. So
this hybrid approach of a deterministic
yet agentic workflow is highly powerful
in a production RAG setting. Now that
you've seen deterministic RAG system
designs, let's look at some agentic RAG
patterns. Let's start off with standard
agentic RAG and hybrid RAG. And on the
face of it, your standard agentic RAG is
quite simple. It's an AI agent that
receives a message from a user. It has
various tools that it can call to help
retrieve context to generate a response.
So here it could have a vector search
tool, a database search tool, a web
search tool, and you can load in quite a
complex system prompt to try to
convince it to do what you want it to do.
You could give it a standard operating
procedure. You could use prompt
engineering techniques to really urge
the agent to follow the right series of
steps if that's the approach you want to
take. Based off all of that, it can then
output a response to the user. And this
is what it looks like in N8. This is
probably pretty familiar. Standard chat
message, standard AI agent. It has a
vector search tool with an embedding
model hooked up. And then agents also
have memory. So here we've just assigned
a simple memory, but you could have
different types of memory like Postgres
or Zep. And then we have an AI model
hooked up. And as I mentioned earlier,
small language models under 10 billion
parameters can struggle with reliable
tool calling. So your selection of model
here is important. I've hooked up a
Frontier model Claude Sonnet 4.5. So
let's ask the same question. How to
change the power levels on a GE Abacus
mixer. And the beauty of these agents is
that it can hit a vector store numerous
times with different variations of a
query. So this is the same as query
rewriting for example or query
decomposition. It can break up the query
and hit the vector store numerous times.
Sometimes it just does that off its own
bat. Other times you need to prompt it
in the system prompt, but because it has
this function calling loop, it can do it
numerous times. And as you can see, it's
saying it can't find any information
about the abacus mixer because it
doesn't exist. So I'll say, "Sorry, I
meant the GE Advantium oven." And
because there's memory attached to this
agent, similar to the way we were
updating the chat history in the
previous flow, it's able to fill in the
blanks and figure out that it needs to
retrieve cleaning information as well as
how to change the power levels of the
oven. So, it's hit the vector store six
times already and has now output an
answer, which looks quite good. And if
we have a look at the vector search, you
can see the different queries that have
passed: Advantium oven power levels,
cleaning instructions, settings
adjustments, cooking, interior and
exterior care, maintenance.
So, it's just passing lots of different
keywords to try to get a broad range of
candidate chunks that it can actually
feed into its context to output an
accurate answer. And this is why
Frontier models are great. There was no
need for any of the query transformation
steps that I showed you in the previous
design patterns because it's such a
smart model. It can figure that out.
Claude Sonnet 4.5 is probably one of the
best models out there and one of the
most expensive. So by way of a
comparison, let's try this out on the
GPT-OSS 20B model. This is an
order of magnitude smaller than Sonnet
4.5. So I've set the GPT-OSS
Safeguard 20B model there. And
let's ask the same question. So it has
hit the vector search. So it is able to
actually call tools which is great. It
is getting a decent answer. I'm sorry I
don't have the specific instructions.
Let's ask it again just to see do we get
the same response. We probably will
actually. It's in memory. No, we didn't
actually. We are getting some
instructions on how to change the power
levels on a GE abacus mixer which
doesn't exist. So we got it right the
first time, but when I asked the same
question again, it didn't get it right.
Let's reset the chat session and we'll
just try that question again. So, it's
gone to the vector store numerous times.
Again, it's a good answer. It doesn't
actually exist. But just when I repeated
the question, it must have thought that
I really wanted an answer even though it
wasn't in context. Let me try that with
Sonnet 4.5. I'll ask it a couple of times.
Okay. So, we get our apologies. Doesn't
exist. Please clarify. So, let's ask
this exact same question again. Yeah.
And it still doesn't have the
information. So, I think that's kind of
a good example of smaller models. They
are less reliable than the larger more
intelligent models. For some reason,
GPT-OSS 20B thought it was okay to
fabricate the information because I
asked the question twice. Whereas a
larger model is smart enough to realize
it still doesn't exist. It's still not
in context. And that is why benchmarks
exist. It is a quantitative mechanism to
compare these models. Even though by all
accounts they are being gamed by the
various model providers. But either way,
you can see how agentic RAG works and
the idea of multiple tool calls with
different queries. So a lot of the query
transformation is wrapped up in the
reasoning of the model itself. Hybrid
RAG is essentially the same thing. The
only difference is you don't just have
semantic search attached. You also have
different representations of data within
different systems. So here we have a
database search tool. So this is a
Postgres database and you would require
the hybrid RAG agent to write a SQL
query to actually fetch results from
that knowledge base. As for this one,
this is a graph search tool using Neo4j.
So the hybrid RAG agent would need to
write a Cypher query to be able to
traverse the knowledge graph and return
the results. So that's what hybrid RAG
is: retrieval from different types of
data stores, such as vector, graph, or
database.
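For the graph side, the tool behind the agent typically just runs the Cypher that the model writes; here's a hedged sketch using the official neo4j Python driver, with the connection details and the example Cypher query purely illustrative.

```python
from neo4j import GraphDatabase

def run_cypher(cypher: str, params: dict | None = None) -> list[dict]:
    """Execute a Cypher query written by the agent and return plain dict rows."""
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    try:
        with driver.session() as session:
            return [record.data() for record in session.run(cypher, params or {})]
    finally:
        driver.close()

# Illustrative only: a made-up schema linking products to the manuals that document them.
rows = run_cypher(
    "MATCH (p:Product {name: $name})-[:DOCUMENTED_IN]->(m:Manual) RETURN m.title AS title",
    {"name": "GE Advantium"},
)
```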
If you want more information on database
search tool calling, Allan
has a video on our channel called
Agentic Databases, which is well worth a
watch. And on the graph search, I just
released a video on graph agents a
couple of weeks ago where I show how to
set up Neo4j and actually configure your
n8n agent to traverse through a custom
knowledge graph that you can create.
I'll leave links for these in the
description below. So on to pattern
seven, which is our multi-agent RAG
system using sub-agents. You've probably
seen a lot of this on YouTube with n8n
videos, but here, instead of tool calls,
which are outbound calls external to
different services, we're calling another
agent, and
that agent has its own system prompt and
has its own context window. And this is
a really good pattern because if we go
back up to our agentic RAG setup, you
could have quite a complex system prompt
for this AI agent. And if you have a
really smart model like Sonnet 4.5, it
will by and large follow that prompt.
Particularly if you enable reasoning and
it can think about it a bit. But at the
same time, there is a limit to the
complexity of system instructions that a
model can actually follow. At a certain
point, you need to break that apart into
different specialized agents. And that's
what this multi-agent RAG setup is. So
here we can simplify the system prompt
of our agent and farm out some of those
responsibilities to sub-agents. So if
we're looking at our database example,
you could have a database sub-agent that
has numerous SQL tool calls. Sometimes
these SQL queries can be incredibly
complicated, and with few-shot prompting,
all of that buried in a system prompt at
the AI agent level would totally
overwhelm it, and it wouldn't know where
to focus to actually follow the correct
set of instructions. So instead, we can
dedicate a sub-agent
just for the database queries and we can
give it all of the examples that it
needs to effectively trigger these SQL
tools. So that's a good example of
simplifying the system prompts. Another
benefit of this approach is protecting
the context window. I released this
video 3 weeks ago on a technique called
context expansion. And what this means
is you have the ability to load up an
entire document into context if the
query actually requires it. So let's say
a question came in asking can you
summarize this entire document and it
provides a link to a document. For an
LLM to summarize it, you can't really
use rag because rag will only give you
segments of the document. If you need to
summarize it, you need visibility of the
entire document which means you need to
load the entire document into context.
The problem with that is there is a
limited amount of space in context. So
if your main agent has, let's say, five
steps in a procedure to follow to output
a result, and one of those steps is to
actually load an entire document, then
once it does that, it might get confused
because it doesn't know where to focus
within the context itself, and it might
end up messing up steps three, four, and
five as a result. So with this pattern,
you can offload the context overhead to
a sub-agent. This document researcher,
its only job is to load a document,
summarize it, and send back the summary.
And at that point, it can just forget
it. So the agent will never see the full
document. It'll only get the summary to
continue to steps three, four, and five.
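Here's a sketch of that idea: a document-researcher sub-agent that loads the full document into its own context and hands back only the summary, so the main agent's context stays clean. The fetch_full_document and call_llm helpers are hypothetical.

```python
def document_researcher(document_id: str, fetch_full_document, call_llm) -> str:
    """Sub-agent: load the whole document in its own context window, return only a summary."""
    full_text = fetch_full_document(document_id)  # could be thousands of tokens
    summary = call_llm(f"Summarize this document in a few paragraphs:\n\n{full_text}")
    # The full text never reaches the main agent, so its remaining procedure
    # steps aren't drowned out by a huge document sitting in context.
    return summary
```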
So there are lots of benefits to having
a multi-agent setup with sub-agents. And
this is what it would look like in n8n,
where you have your agentic RAG agent,
and as tools, you then have your
sub-agents. And
this is pretty easy. You just click on
the plus and just type in agent and you
have this AI agent tool and then you can
just attach your model attach memory if
you need to and attach the various
tools. So for my database sub-agent, you
could have lots of different kinds of SQL
Postgres tools with different prompting.
The document researcher could just have
a fetch full document tool and that
could be Google Drive, that could be a
Google Cloud bucket, or it could even be
a Postgres tool that fetches all of the
chunks based off a document ID. So then
when this agent is called it can trigger
the sub-agent to carry out the work,
provide the responses and it can
continue on its merry way to generate an
answer to the user. So it's a pretty
good pattern. I would have one word of
warning though which is that you do need
to be highly specific around the roles
and responsibilities within this type of
structure. A few months ago, I created a
vast multi-agent system called HAL 90001,
as you can see here, and within this
setup I had 25 sub-agents for a main HAL
orchestrator agent. It was incredibly
unwieldy, because each of the agents
needs to be tightly configured and the
tools need to be tightly defined.
Otherwise, it becomes an unreliable mess
trying to actually get responses that
are in any way accurate, and it can get
lost in conversation between the various
agents. So, you do need to be quite
conservative when you're actually
building out multi-agent RAG systems. I
would honestly start with just a single
sub-agent, get that
working really well, then add another
sub-agent and regression-test the whole
thing and continue on. Conservative is
best, I think, with this one. Another
pattern for multi-agent RAG is with
sequential chaining. So this is more
aligned to the deterministic workflows
that I showed earlier. Here you would
have a chat input. It would go to one
agent who would produce a response after
interrogating various tools and that
would then feed another agent to carry
out another action to produce a response
to the user. So it is more sequential.
Again you can have simpler prompts. Each
agent can be more specialized. So for
example, you could have a research agent
which carries out research based off
various tools and then that goes to a
writer agent who creates a blog post and
that goes to a publication agent who
publishes it to social media or to
WordPress. And this doesn't need to be a
chat trigger. This is more aligned to
the AI RAG automation that I talked
about earlier in the use cases section.
This is our agentic RAG blogging system.
And here the actual trigger is NocoDB,
or it runs on a schedule. NocoDB is
an open-source Airtable equivalent. And
depending on what action was triggered,
if the blog article is ready for
outlining, it goes up to this agentic
researcher. And this is a rag researcher
that has access to various tools like a
semantic search database, a structured
SQL list based off a table. We have deep
research with perplexity and a deep
search with Jina. And this agentic
researcher is part of a sequential chain
because once the outline is created, it
can then go and actually generate the
article. Now there is human in the loop
built in as well. But from here then
it's a full chain of AI nodes and
different features to actually build out
the article: generating images or
searching for stock images, writing and
uploading the article to WordPress. Here
we're using another AI agent, uploading
the featured image, writing the social
posts, updating the publication system
in NocoDB. So you can see the
benefit of a fully sequentially chained
multi-agent system. And it can be a
hybrid as well. These don't need to be
agents. These could be kind of one-off
LLM calls that generate text responses
or JSON outputs that feed the next step
in an automation. And finally, the last
pattern for today is a little bit of a
mix of everything. It's a multi-agent
RAG system with routing and sequential
chaining. So, for example, if it's a
very simple question, we can just drop
it into the simple response and send it
back to the user. Whereas, if it's a
question about writing, it can go to
this agent. About research, it can go to
that agent. And this is a great way of
speeding up inference of an agent
because if you go to a multi-agent setup
with sub-agents, there's a lot of
supervision and orchestration of the
responses. Whereas here, you can just
route the query directly to the agent
who actually can answer the question.
And you still have separation of
concerns. You still have specialized
agents with their simpler system prompts
and specialized tools to carry out the
tasks. So this is how it looks in n8n.
Again, standard chat trigger. We have
our query classifier and router. Off the
back of that, then we get a structured
output that we can then funnel the user
down a path. So, if it's a simple
question, we can then just generate a
friendly response and output the answer.
If it's a question about publishing a
piece of content, for example, it could
go to the publication agent. And you can
mix and match the sequential side of it,
too. If it's a task about researching
and drafting a piece of text, then it
could go to a research and writing agent
that are then linked in a standard
sequential chain and then you get the
output response. So, it's quite
simplistic here, but you get the design
pattern that we're trying to achieve,
which is this idea of query routing in a
multi-agent setup. There are other
patterns that I didn't dive into on this
video. Self-RAG, for example, uses
reflection tokens to determine the
quality of retrieval and whether it
needs to go and retrieve again. So it's
kind of a variation of iterative
retrieval. Corrective RAG is an approach
that uses web search as a fallback if
the content can't be found in the
knowledge base. Guardrails are a way of
protecting your system from prompt
injection attacks or PII leakage. Human
in the loop is a way of escalating and
bringing in a human to actually interact
with the flow. And this can be quite
useful for the likes of these rag
automations as opposed to the chat
interfaces. Human handoff is another
variation of this where the actual chat
message is directed to a real-life human
when the agent can't actually answer the
question of the user. That obviously
works great if you have a support team
on standby waiting for messages to come
in. Multi-step flows is another pattern
which is more aligned to traditional
conversational chat flows where
questions are asked in sequence and then
the answers feed the next stage of the
sequence. And the key challenge there in
a conversational interface is keeping
things on track because if people break
off mid-flow, then how do you know what
they're now talking about because they
were in the tunnel completing a
particular task within the actual
chatbot. Context expansion is the
technique I discuss in my Next Level RAG
video, where you grab neighboring
chunks, document sections, subsections
or even the full document itself. And
then Deep RAG is an emerging trend
around this idea of deep agents and
having a retrieval planning system and
executing that plan over a longer time
horizon than you typically would find in
a conversational chatbot. Finally,
there's one technique I didn't discuss
in this video, and not enough people talk
about it, which is lexical keyword
search with dynamic hybrid search. If you
want to learn more about that, then
click on this thumbnail.
Get our advanced RAG workflows and learn how to implement these patterns in n8n, in our community: https://www.theaiautomators.com/?utm_source=youtube&utm_medium=video&utm_campaign=tutorial&utm_content=design_patterns

There's no one-size-fits-all approach to RAG, and choosing the wrong pattern from the start sets you up to fail. In this comprehensive deep-dive, you'll learn the 9 essential RAG design patterns that cover every use case, from basic naive RAG to state-of-the-art multi-agent systems. I'll show you exactly when to use each pattern, why your model selection dictates your entire system design, and how to balance the critical trade-offs between intelligence, speed, and cost.

What You'll Master:
- Why model selection (small vs frontier LLMs) fundamentally shapes your RAG architecture
- How to match RAG patterns to your specific use case priorities
- When to prioritize speed vs accuracy vs cost in your system design
- 9 complete RAG design patterns with real n8n implementations
- Advanced multi-agent orchestration strategies
- Query routing, transformation, RAG fusion and verification techniques
- How to avoid the pitfalls that make RAG systems unreliable

Real-World Use Cases Covered:
1. Customer-Facing Chatbot - Lightning-fast responses with query routing and answer verification
2. AI Assistant/Co-Pilot - Deep accuracy for legal departments with iterative retrieval
3. AI Automation - Background RAG systems embedded in content workflows
4. Fully Local RAG - Resource-constrained systems with optimal model selection

The 9 RAG Design Patterns:
1. Naive RAG - The foundation (and why it's rarely enough)
2. Query Transformation, RAG Fusion & Verify Answer - Ensuring grounded, accurate responses
3. Iterative Retrieval - When one pass isn't enough
4. Adaptive Retrieval - Dynamic decision-making for retrieval
5. Agentic RAG - Non-deterministic systems with control over retrieval
6. Hybrid RAG - Systems that can retrieve from different types of knowledge bases (graph, structured, semantic, etc.)
7. Multi-Agent RAG with Sub-Agents - Distributing cognitive load
8. Multi-Agent RAG with Sequential Chaining - Specialized agents in deterministic flows
9. Multi-Agent RAG with Routing - Intelligent query distribution

Plus bonus coverage of: Self-RAG, Corrective RAG, Guardrails, Human-in-the-Loop, Context Expansion, and Deep RAG patterns!

Useful Links:
Context Expansion Video: https://www.youtube.com/watch?v=y72TrpffdSk
RAG at Scale Video: https://www.youtube.com/watch?v=sn0SjjkRhxI&t=285s
Dynamic Hybrid Search: https://www.youtube.com/watch?v=FgUJ2kzhmKQ&t=613s

Timestamps:
00:00 - Intro
02:48 - Use Cases & Priorities
07:21 - 1. Naive RAG
10:13 - 2. Query Transformation & Verify Answer
19:48 - 3. Iterative Retrieval
21:14 - 4. Adaptive Retrieval
25:25 - 5-6. Agentic + Hybrid RAG
31:24 - 7. Multi-Agent RAG with Sub-Agents
36:10 - 8. Multi-Agent RAG with Sequential Chaining
38:28 - 9. Multi-Agent RAG with Routing
40:02 - Bonus: Other Patterns

Questions or Comments? Drop them below! I read every comment and love hearing about what RAG systems you're building and which patterns work best for your use cases.

If you found this valuable:
- Like this video
- Subscribe for more advanced AI automation content
- Share with someone building RAG systems