There's a myth in the AI world that
vector search is the silver bullet to
ground your AI agents in your private
company knowledge. But the reality is
quite different. If you're building AI
agents that rely solely on semantic
search, you're leaving massive gaps in
your retrieval. Gaps that can lead to
hallucinations, incomplete answers, and
unreliable results. That's not to say
vector search is bad. It's brilliant for
conceptual, semantic-style queries. But
there's an entire subset of other
queries that require different retrieval
strategies to output the right answer.
And in essence, this is retrieval
engineering, designing your retrieval
strategies based on the specialized
scope and capabilities of your system.
We've worked with hundreds of community members who are building production-grade RAG agents, and we are seeing the same patterns over and over again. So in this video, I'll show you nine real-world examples of types of questions where vector search fails, and I'll demonstrate the retrieval strategies that actually work in these examples. As I said, vector search isn't bad. It's just one tool in the toolbox. There's a lot more to RAG, so let's get into it.
Building AI agents in n8n is incredibly easy. It's a simple canvas: you add an AI agent node, connect up a model, give it a system prompt and a couple of tools, and away you go. And that's great
for a quick proof of concept, but in
reality, building accurate and reliable
AI agents in any platform is hard. And
the reason for this is their natural
language interface. You can literally
ask an agent any possible question in
any number of different ways. And it
needs to figure out what you need. And
some questions can be answered directly
from the model's training data, while
other questions require diving into a
knowledge base. "What was our Q3 revenue?", for example, clearly requires diving deep into a financial system. And this
is where things can get tricky when
you're trying to create an accurate and
reliable system because questions asked
of the agent can get quite complex. Some questions require synthesizing information across multiple documents, comparing information from different data sources, interpreting and analyzing data, extracting, summarizing, inferring, evaluating, and the list is endless. It's infinite. And the critical
thing is each type of question might
require a completely different approach
to retrieve the right information to
generate an accurate answer. And on the
one hand, this is what makes AI agents
so powerful. But on the other, this is
why creating accurate and reliable
agents can be quite difficult. And in
essence, this retrieval strategy will
ultimately determine whether your
project succeeds or fails. And if you
strip an AI agent back to its core
there's essentially a simple decision
loop that's at play. So when a user's
question comes in, it hits an AI agent
powered by an LLM in the context of a
conversation, so there's memory. It needs to reason and decide: does it have all of the information it needs in context to answer the question, or does it need to retrieve information or carry out an action first? And it's a loop, so this can happen multiple times. Now, of course, this is a little bit simplified, but you get the idea. And
when we think about retrieval, retrieval within the context of an AI agent is simply a tool call, no different to an API call to create a calendar entry or draft an email, for example. So querying a vector store is just another tool call.
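To make that concrete, here's a minimal sketch of that decision loop in Python, with retrieval as just another tool call. The call_llm() and tool functions are hypothetical placeholders, not any particular framework's API.

```python
# Minimal sketch of the agent decision loop described above.
# call_llm() and the tool functions are hypothetical placeholders,
# not any particular framework's API.

def search_vector_store(query: str) -> str:
    return "...chunks retrieved from the vector store..."

def create_calendar_entry(details: str) -> str:
    return "...calendar entry created..."

TOOLS = {
    "vector_search": search_vector_store,
    "create_calendar_entry": create_calendar_entry,
}

def call_llm(messages: list[dict]) -> dict:
    # Placeholder: a real implementation calls your LLM here and returns
    # either {"answer": ...} or {"tool": ..., "input": ...}.
    return {"answer": "stubbed answer"}

def run_agent(question: str, memory: list[dict], max_steps: int = 5) -> str:
    messages = memory + [{"role": "user", "content": question}]
    for _ in range(max_steps):
        decision = call_llm(messages)
        if "answer" in decision:            # enough context already: answer directly
            return decision["answer"]
        tool_fn = TOOLS[decision["tool"]]   # otherwise retrieval/action is just a tool call
        observation = tool_fn(decision["input"])
        messages.append({"role": "tool", "content": observation})
    return "Could not answer within the step limit."
```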
But it is important to state that there's more to RAG than just vector search. And don't get me wrong, I would
love if vector search was the silver
bullet for chatting to your data. But
here's the major problem. Vector search
operates using a similarity algorithm.
When you query a vector store, you are
getting the most similar results back.
But similarity is not the same as
relevance. Relevance is highly
subjective depending on the question
that was asked. If you queried a vector
store, for example, looking for
information on error code 221, the
vector store would happily send you back
information on error 220, 221, and 222
because they're all relatively similar.
Clearly, the most relevant result you're looking for is the one for error 221. This lack of exactness around vector search is both its major strength, because you can find the right information even when you don't use the exact right words in the query, and its major weakness when you need exactness, like in this case.
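To illustrate the difference, here's a toy sketch (not a real vector store) of how an exact keyword filter narrows similarity results down to the one relevant chunk; the chunk texts are invented for the example.

```python
# Toy illustration of similarity vs. exactness; chunk texts are invented.
# Assume these chunks came back from a similarity search for "error code 221":
similar_chunks = [
    "Error 220: printer offline, check the network cable.",
    "Error 221: toner cartridge not detected, reseat the cartridge.",
    "Error 222: paper jam in tray 2, clear the feed rollers.",
]

# A keyword / exact-match filter keeps only the truly relevant chunk:
exact_matches = [c for c in similar_chunks if "error 221" in c.lower()]
print(exact_matches)
# ['Error 221: toner cartridge not detected, reseat the cartridge.']
```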
And this is why there's an array of retrieval methods that you can leverage for different types of questions. Vector search we've spoken about, but if you need exactness, keyword search and pattern matching are a great approach; SQL queries for structured data searches; graph databases for relationships and concepts; API calls if you need to retrieve information from other software systems; file system scans if you need to grab information from disk. And this
is why I would define all of these, including API calls, as RAG: if, through these methods, the retrieved information is fed into context and used by the model to synthesize the answer, then that is retrieval augmented generation. This entire video was
inspired by an article I read from Amit
Verma, who's the head of engineering at Neuron7, where he described retrieval engineering as a distinct discipline that's going to emerge over the coming years. And just as machine learning ops has matured, so too will practices
around hybrid ranking, graph
construction, and more. And all of this
leads to the nine common question types
where vector search may fail, and you
need other retrieval strategies to
answer those questions. And it's worth
calling out two sources I used when
researching this video. One is IBM's "Know Your RAG" research, as well as the Comprehensive RAG Benchmark (CRAG). I'll leave
links for these in the description
below. First up are summary questions.
And this is the most common question
type that trips up our members when
building out systems. In this example
we have a database of meeting
transcripts. And the question is, what
decisions were made in the leadership
meeting? The key thing about this
question is to answer it accurately, you
need to analyze the full transcript of
the specific meeting. That way you can
extract out the various decisions that
were made. So it has multiple units of
information. And if we look at an
example knowledge base here, you can
have different documents that represent
different transcripts. You'll have
different chunks because meetings can go
on quite long. They may be chunked and
decisions are sprinkled across the
transcript. And the other thing is
they're not specifically called out as a
decision in a lot of cases. So with standard vector search without metadata filtering, for example, you would just search for the word "decision", and any time it was mentioned in any of the transcripts, it would be pulled back. So at the very least, you would need to use metadata filtering to narrow in on the specific meeting that the user is talking about.
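As a rough sketch of that idea, here's what metadata filtering looks like in principle. The Chunk structure and field names are assumptions for illustration; real vector stores like Pinecone or Supabase/pgvector expose an equivalent filter on their query calls.

```python
# Sketch of metadata filtering layered on top of similarity search.
# The Chunk structure and metadata fields are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    metadata: dict
    score: float  # similarity score already computed against the query

def filtered_search(chunks: list[Chunk], meta_filter: dict, top_k: int = 5) -> list[Chunk]:
    # 1. Keep only chunks whose metadata matches exactly (the filter step).
    candidates = [c for c in chunks
                  if all(c.metadata.get(k) == v for k, v in meta_filter.items())]
    # 2. Rank the survivors by similarity (the vector step).
    return sorted(candidates, key=lambda c: c.score, reverse=True)[:top_k]

chunks = [
    Chunk("We agreed to pause the EU rollout.", {"meeting": "leadership", "date": "2025-03-12"}, 0.82),
    Chunk("Decision: hire two more support engineers.", {"meeting": "support sync", "date": "2025-03-10"}, 0.90),
]
print(filtered_search(chunks, {"meeting": "leadership", "date": "2025-03-12"}))
```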
But even then, searching for decisions across that meeting transcript isn't really going to yield a complete answer. Now, there is a case
where it will: maybe at the end of the meeting, someone taking notes summarized all of the decisions, and in that case vector search will return the right result. And this is the greyness of these questions.
Sometimes vector search will actually
output the right answer depending on the
content that's in the vector store or
depending on the iterative retrieval
that an agent will undertake to actually
research the topic. So we have a number
of possible retrieval strategies here.
Agentic RAG, as I mentioned, is iterative retrieval, so the agent can search across the vector store multiple times; but again, it doesn't really know what to search for, because it doesn't know what decision it's looking for if it's not called out as a decision, though it can apply metadata filtering to at least narrow the scope. Query expansion works for a more traditional RAG system that's not agent-based, but again, you have the same problem here because you don't know what you're looking for. Context expansion is a technique I went into in a previous video, and it allows you to look at the structure of a document and load up the parent section, for example. But this
isn't really a document that has a
formal structure. But in reality, to get
the most comprehensive answer to this
question, you need to load the full
transcript. And if the transcript is too
large, you need a way of batch
summarizing the transcript to extract
out decisions. And this idea of loading
a full document is covered in that
context expansion video that you see
here. And something else to call out
here is if someone is looking for
decisions that were made at a leadership
meeting, they should have the right
level of access to actually retrieve
that information. I have a full video on
zero-trust RAG, where I go deep into how
you secure AI agents to make sure people
have the right access privileges when
requesting the information. And that's
our first example of a summary question
where in reality you need to process the
entire document or the entire transcript
to output a comprehensive answer. If
you're enjoying the video, make sure to
give it a like below and subscribe to
our channel for more deep AI and n8n content. It really helps us out. Another
example is, let's say, a knowledge base of documentation for a cloud storage service, and the question might be, "what are the main features of the service?" The answer could be embedded in
features scattered across the
documentation. So again, multiple units
of information spread out and not
specifically called out as features. You
could have version control, encryption, device syncing, and organization. And if you
carry out a vector search, you might
pull some of these, but it's unlikely
that you're going to pull everything.
And that can yield a partially correct
answer, not necessarily a hallucination
but just not comprehensive. And vector
search can work here if for example
there's a page in the documentation that
lists the features and that's similar to
the example of decisions made at the
leadership meeting. If someone goes to
the effort of summarizing the decisions
then it's just raw retrieval to fetch
the list, and it's the same here. So the heart of the issue is that with vector search, you're talking about raw retrieval of information that's already there in a knowledge base, whereas these types of questions require document processing. So if someone has already listed the features of the cloud storage service, that processing has already taken place, and it's there in the vector store to be retrieved. But if it hasn't, you're going to need to do that processing yourself: load the full documentation, or iteratively process the documentation to extract the features.
And this leads us to our third example
of a summary question, which is, let's
say, a database of internal reports. And
someone is asking for a comprehensive
summary of a specific report. And to
fetch this, you need to synthesize
information from every single section of
that document. And if you miss any
section, it's potentially misleading or
incomplete. So it's the same problem
again. Vector search will retrieve a
limited number of chunks of the
document. Agentic retrieval will fetch more chunks, but you are only
essentially retrieving segments or
pieces of the document. And the document
might have an executive summary at the
front. So that might be the first chunk
that's pulled back. And that's how
vector search can produce an answer, but
just not a comprehensive answer.
Approaches to create a comprehensive summary of a report include loading the full document into context, or instead having a document-processing sub-agent that you delegate the task to; it loads the full document into its own context, and that way you don't pollute the context window of your main agent answering the question. But if you are dealing with
documents that are just too big for an LLM's context window, you need to look at techniques like map-reduce summarization or hierarchical summarization. These are top-down and bottom-up approaches to summarizing large documents. If you'd like me to do a dedicated video on this, then just drop me a note in the comments below.
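For reference, here's a rough sketch of the map-reduce variant; summarize() is a hypothetical wrapper around a single LLM call, not any library's API, and the chunking is deliberately naive.

```python
# Rough sketch of map-reduce summarization. summarize() is a hypothetical
# stand-in for a single LLM call; chunking here is naive and character-based.

def summarize(text: str, instruction: str) -> str:
    # Placeholder: call your LLM of choice with `instruction` + `text`.
    raise NotImplementedError

def chunk(text: str, size: int = 8000) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def map_reduce_summary(document: str) -> str:
    # Map: summarize each chunk independently (these calls could run in parallel).
    partial = [summarize(c, "Summarize this section, keeping every decision.")
               for c in chunk(document)]
    # Reduce: merge the partial summaries into one comprehensive summary.
    return summarize("\n\n".join(partial),
                     "Merge these section summaries into one report summary.")
```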
We have other videos on our channel where we go through some of these retrieval strategies. In our RAG design patterns
master class, I go deep into that idea of sub-agents and delegating tasks to a sub-agent, while my CAG versus RAG video goes into the idea of cache augmented generation, where you load full documents into context and use prompt caching as a result. We have n8n
workflows for the vast majority of
retrieval strategies that you see in
today's video. If you'd like to get
access to those, then check out the link
in the description to our community, the
AI Automators, where you can join
hundreds of fellow builders, all
creating production-grade RAG agents.
So, summary questions are notoriously
difficult for vector search to actually
answer comprehensively. But even simple
questions can trip up vector search. So
with the example of a company knowledge
base, for a question like when was our
company founded, vector search should
perform pretty well here. This answer
has one unit of information. It cannot
be partially correct. It has to be fully
correct. And here, within the example documents, vector search finds, within an "About the company" page, that the company was founded in 1972. So here, the answer appears verbatim in the documents. The query embedding for "when was our company founded" largely matches the embedding of the chunk which includes "founded in 1972". So that should perform pretty
well. But if you have queries that
include rare terms that aren't actually
in the training data of the embedding
model, you can run into problems. So for this example query, "who created the blue sheet system?", "blue sheet" is a company term. It's a system that the company created, and they just dreamt up a name. And the blue sheet project might be contained within the knowledge base. But the problem here is that vector embeddings are going to struggle on this domain-specific term, because it was totally absent or is under-represented in the embedding model's training data. And this
is where approaches like lexical search
and hybrid search make a lot of sense
because you can get an exact match on a
term like blue sheet which has very
little meaning in a semantic sense. For those types of terms, you could also have a company glossary and a more structured data lookup, for example.
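As a rough sketch of how hybrid search can rescue a term like "blue sheet", here's reciprocal rank fusion combining a lexical ranking with a semantic ranking; the document IDs and rankings are invented for illustration.

```python
# Minimal sketch of hybrid search via reciprocal rank fusion (RRF).
# The two input rankings are invented; in practice they'd come from a
# BM25/full-text index and a vector store respectively.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["doc_blue_sheet_spec", "doc_blue_sheet_faq", "doc_forms_overview"]  # exact match on "blue sheet"
semantic = ["doc_colors", "doc_blue_sheet_spec", "doc_forms_overview"]         # embedding similarity only
print(rrf([lexical, semantic]))
# The exact lexical match pushes doc_blue_sheet_spec to the top,
# even though embeddings alone ranked it below doc_colors.
```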
Another simple question example is this type of query: "explain", and then you just have a random code that's specific to that company. So, "explain 15 CFR 744.21". This could be a company regulation or a policy or something like that, and it'll be contained in those documents.
It has one unit of information. It
cannot be partially correct. But the
embedding model has very little chance
of representing this identifier as it
was totally absent from its training
data. It means nothing really. So here
you would need to use the likes of
pattern matching to actually find this
code because even hybrid search can
actually fail with this. And the reason is that hybrid search tokenizes the text, so the spaces in "15 CFR 744.21" will result in the code being split. There is then no exact match possible using lexical search, because this is now several tokens instead of one. So hybrid search or lexical search can fail here, and that's why you might need pattern matching, which means wildcards or regex, or again that idea of a structured glossary or structured data lookup.
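Here's a small sketch of that pattern-matching idea with a regex; the chunk texts are invented, and the point is that the citation is matched as one literal string rather than being tokenized apart.

```python
# Sketch of pattern matching for exact identifiers like "15 CFR 744.21".
# The chunk texts are invented for illustration.
import re

chunks = [
    "Export restrictions are covered under 15 CFR 744.21 for military end uses.",
    "See also 15 CFR 746.8, which addresses sanctioned destinations.",
    "General licensing policy is described in the compliance handbook.",
]

# re.escape keeps the dots as literal characters, and the \b word boundaries
# anchor the whole citation, so only the exact code matches.
pattern = re.compile(rf"\b{re.escape('15 CFR 744.21')}\b")

print([c for c in chunks if pattern.search(c)])  # only the first chunk is returned
```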
I've gone deep into these topics on this channel. I have
a hybrid search video where I show you
how to set up hybrid search on Supabase and Pinecone. And I also have a lexical and pattern-matching search video called High Precision RAG. So as you can see,
even simple questions can trip up vector
search. But not all simple questions are
simple. They can have conditions. Here
we have what looks like a simple
question. Who is the CEO of our company?
So we should be getting an answer with
one unit of information. It can't be
partially correct. But there's a catch, which is that it's recency-dependent. This
company might have had 10 CEOs over the
last 20 or 30 years. Standard vector
search is going to search for the word
CEO or chief executive officer and it'll
pull out lots of different names from
lots of different documents over the
last 20 years. So, how does it know who
is the current CEO? Because that's the
implication here. Who is the CEO of our
company? The person's most likely
looking for the current CEO. Now, that
could be clarified by the agent, but I
think that's the implication here. So a
possible issue would be that for
whatever reason the current CEO Sarah
Patel might not actually be returned at
all by vector search. Maybe the vector
search will return previous CEOs from
older documents because generally
speaking vector search doesn't favor
newer documents over older ones. It's
all about similarity. But a good approach would be to at least have, let's say, the document title within the metadata of the chunks. So, if "2025 merger" was in the metadata that was fed to the agent and Sarah Patel was returned, then the LLM will more likely describe Sarah as the current CEO, James as the previous one, and Michael back in 2015. And that's
where metadata can be used to influence
the AI when generating the response, but
it could also be used for filtering.
That's where you could tag and filter
documents, maybe based off publication
date. So you could check all documents
in 2025 and if nothing is returned then
try 2024. Or you could filter documents
by type. So maybe only look at employee
records or org charts. Hybrid search
might work here because CEO is an
abbreviation not full natural language.
Although I think CEO is pretty well
represented in the training data of
these embedding models. A structured
data lookup could work of an org chart
that could be in a knowledge graph or in
a database table. And re-ranking is an approach here as well. There are new
re-ranking models that are actually
promptable. So the AI agent could prompt
a re-ranker to say, "Rank these chunks
based off recency." Now, you just need
to make sure that the actual date of the
document is dropped in there as well. If you'd like me to do a video on that type of reranker, again, let me know in the comments below.
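As a quick sketch of that filter-by-date idea, here's the "try the newest year first, then fall back" pattern; search() is a hypothetical wrapper around a metadata-filtered vector or hybrid query.

```python
# Sketch of recency-first retrieval with a year-by-year fallback.
# search() is a hypothetical wrapper around a metadata-filtered query.
from datetime import date

def search(query: str, year: int) -> list[str]:
    # Placeholder: run a vector/hybrid query restricted to documents
    # published in the given year.
    raise NotImplementedError

def recency_first_search(query: str, max_years_back: int = 5) -> list[str]:
    current_year = date.today().year
    for year in range(current_year, current_year - max_years_back, -1):
        results = search(query, year)
        if results:          # stop at the newest year that returns anything
            return results
    return []
```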
So, you can see that even simple questions with conditions can actually cause problems for vector search. And here's another one: what was our revenue in Q2 2024? It is a
simple question. However, it's tabular
by nature and that's where vector search
falls down. So there could be PDF reports in the knowledge base that talk about the financial returns for the company's different quarters, and they might each have different revenue figures. But
again, it's a little bit of a gamble on
what actually will be returned by the
vector store. And this ties into the
need to build accurate and reliable
agents because an AI agent might
actually get this right. It might
actually pull the correct report based
off the data queried, but then it might
not. It might pull a different report.
So, it's that lack of reliability that
then ties into the trust that people
have with the system and whether they
can actually take the information at
face value. So here, the answer appears embedded in tables within documents, so markdown OCR makes a lot of sense: if we are only relying on PDF financial reports, at least get them into a format that is LLM-friendly. The likes of Mistral or Docling will work well here. But I think structured data makes a lot of sense too. A direct lookup of a database table that actually contains these results would be a lot more reliable.
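Here's a minimal sketch of that structured lookup; an in-memory SQLite table with a made-up figure stands in for the real financial database, and the table and column names are assumptions.

```python
# Sketch of the structured-data lookup: the quarterly figures live in a table,
# so the agent's tool runs SQL instead of a similarity search.
# The table, columns, and revenue figure are made-up sample data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE quarterly_results (year INT, quarter TEXT, revenue REAL)")
conn.execute("INSERT INTO quarterly_results VALUES (2024, 'Q2', 4200000)")

row = conn.execute(
    "SELECT revenue FROM quarterly_results WHERE quarter = ? AND year = ?",
    ("Q2", 2024),
).fetchone()
print(row[0])  # an exact figure, not a "similar" chunk
```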
Again, the reranking approach could work
here. Metadata filtering could work or
even an API call to a financial system
that actually contains this information.
Again, we have lots of videos on these
strategies on our channel. Allan has
gone deep on database agents and
spreadsheet agents. So, there's a few
videos there. We have a metadata
filtering video, how to extract markdown
tables using the likes of Docling, LlamaParse, and Mistral. And I also have a re-ranking video which explains what it actually does. Another type of
question that can trip up vector search
are aggregation questions. So here for
example, if someone asks how many
customer support tickets were closed
last month, the answer requires
computing or counting across lots of
different documents. So there is a
quantitative output and it's not
necessarily embedded in the text of the
documents. So for example, each support
ticket might be embedded in a vector
store so that it can be searched across.
But if you do need to know the number of
tickets closed in a specific month, this
naturally aligns to SQL queries. So
structured data is the best approach
here by far. An API call or an MCP call
could also work if the support ticket software has an API endpoint that actually answers this question.
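And here's a similarly hedged sketch of the aggregation itself as a SQL COUNT; again, an in-memory table with sample rows stands in for the real ticketing database.

```python
# Sketch of aggregation as a SQL COUNT rather than a similarity search.
# Sample rows are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INT, status TEXT, closed_at TEXT)")
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?, ?)",
    [(1, "closed", "2025-05-03"), (2, "closed", "2025-05-21"), (3, "open", None)],
)

(count,) = conn.execute(
    "SELECT COUNT(*) FROM tickets "
    "WHERE status = 'closed' AND closed_at BETWEEN '2025-05-01' AND '2025-05-31'"
).fetchone()
print(count)  # 2 -- computed, not retrieved from any document's text
```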
There's a category of questions that I would describe as global questions, which
essentially span the entire knowledge
base. So for example, what are the
recurring operational challenges
mentioned across all team
retrospectives? So here we need to
identify patterns and themes across
massive document collections. So there's
no single document that contains the
answer. Within our example documents, we
just have all of our various team retros
where various team issues are talked
about. Vector search would essentially
return a random subset of
retrospectives. So it would be able to
provide a partial answer that reflects
some of the retrospectives, but it can't
really speak to the recurring
operational challenges because there's
just too much documentation to actually
work through. So for possible retrieval strategies here, GraphRAG is probably the best approach, because GraphRAG extracts out these global concepts as entities, and it can interlink everything and create community summaries. So the likes of deployment issues or communication gaps, for example, might be interlinked multiple times. If you didn't want to go down a knowledge graph approach, you could use the map-reduce summarization method, where you would process documents in batches, extract themes, and then aggregate everything. But that would be a long-running job to actually undertake, whereas with GraphRAG, that's all precomputed up front. I
have a GraphRAG video on our channel where I talk through setting up LightRAG, which extracts out entities and relationships. There's an entity resolution process, LLM summarization and merging, and you can query LightRAG to inject the most relevant entities and relationships into context to generate an answer. With global questions, we're
talking about aggregating knowledge
across lots of different documents in a
corpus. With multi-hop questions, on the other hand, we're talking about chaining information across documents. So in this
example question, what projects will be
affected if Sarah goes on maternity
leave? We need to chain information to
generate the answer. So we need to
figure out Sarah's current role, what
projects she's on, what's the status of
the various dependencies in that
project. So in our example knowledge
base, we could have an employee
directory. Maybe we have projects and
team structures. To come up with this
answer, we need to be able to traverse
through this chain to figure out that
Sarah is on the API team, which is a key
part of project Phoenix, which has a
certain project status and is to go live
at a certain date. So, if she went on
maternity leave, there would be a major
impact. And simple vector search can
return these isolated chunks, but it
isn't able to return the connections
between these entities. And in fairness, agentic RAG can actually get there. It can perform multi-step reasoning: it could look for Sarah's projects, try to figure out the timelines, and try to figure out the dependencies to come up with the answer. But it lacks the reliability that the likes of a knowledge graph would have, where the data is actually modeled correctly; that way, you're able to traverse the graph to come up with the right answer. So this is one of those cases where smart reasoning models using vector search can actually get there, but there may be a trust issue, because you don't know whether you can actually trust the answer you're getting, whereas with a knowledge graph it's a lot easier to actually stand over the data.
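To make the chaining idea concrete, here's a toy traversal over a handful of invented triples; a real system would build the graph with something like LightRAG or a graph database.

```python
# Toy knowledge-graph traversal for the multi-hop example above.
# The triples are invented for illustration.

triples = [
    ("Sarah", "member_of", "API Team"),
    ("API Team", "works_on", "Project Phoenix"),
    ("Project Phoenix", "status", "go-live scheduled for Q3"),
]

def neighbors(entity: str) -> list[tuple[str, str]]:
    return [(rel, obj) for subj, rel, obj in triples if subj == entity]

def traverse(start: str, depth: int = 3) -> list[str]:
    """Follow the chain of relationships outward from an entity."""
    facts, frontier = [], [start]
    for _ in range(depth):
        next_frontier = []
        for entity in frontier:
            for rel, obj in neighbors(entity):
                facts.append(f"{entity} --{rel}--> {obj}")
                next_frontier.append(obj)
        frontier = next_frontier
    return facts

print("\n".join(traverse("Sarah")))
# Sarah --member_of--> API Team
# API Team --works_on--> Project Phoenix
# Project Phoenix --status--> go-live scheduled for Q3
```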
A lot of our members need images to be returned inline in the responses of the AI agent. So with this example question, how
do I replace the toner cartridge in the
third floor printer? Show me the
diagram. The answer requires retrieving
and returning visual information. And in
the example documents here, there might
be equipment photos as well as an
inventory of the office equipment and
printer manuals, for example. So, this
is multi-stage retrieval and it is
actually suited to vector search if the
images are embedded in the chunks or
they're in the metadata so that they can
be injected into the chat. So that's
where multimodal RAG kicks in, where you
extract out images from source documents
and make them accessible within the chat
widget. Metadata filtering would also be
important here because you might need to
filter by the equipment model. That way
you're providing the right image. And agentic RAG would also be important for that iterative multi-stage retrieval, and maybe even for generating signed image URLs.
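Here's a rough sketch of that pattern: image URLs travel in the chunk metadata and get rendered into the chat response. The chunk structure and the sign_url() helper are assumptions for illustration.

```python
# Sketch of inline image return: each chunk carries image references in its
# metadata, and the agent surfaces them alongside the text answer.
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def sign_url(url: str) -> str:
    # Placeholder: generate a short-lived signed URL for the stored image.
    return url + "?signature=..."

def build_answer(chunks: list[Chunk]) -> str:
    parts = []
    for chunk in chunks:
        parts.append(chunk.text)
        for image in chunk.metadata.get("images", []):
            # Markdown image tag that a chat widget can render inline.
            parts.append(f"![{image['caption']}]({sign_url(image['url'])})")
    return "\n\n".join(parts)

chunks = [Chunk(
    text="1. Open the front cover. 2. Release the toner lock lever...",
    metadata={"images": [{"caption": "Toner replacement diagram",
                          "url": "https://files.example.com/toner-diagram.png"}]},
)]
print(build_answer(chunks))
```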
Alan has a multimodal RAG video on our channel which is definitely worth a
watch. And I also have a video on how to
create a Slack agent because internal
staff in a company might be using an
instant messaging platform like Slack
when engaging with a system like this.
Some of the questions that we come
across require heavy post-processing to
actually come up with the answer. For
example, is our customer churn rate
trending up or down over the past 6
months? If this information has not been
calculated before, the answer would
require significant reasoning and
analysis to actually figure out. And for this, you need to load lots of raw data to calculate the trends and compare values. So ideally, you might have monthly reports and you'd be able to pull everything together, but even then, vector search might struggle to find those. In reality, you'd likely need some sort of SQL tool where you can pull structured data, and then a calculator tool so that the LLM could generate accurate calculations. So agentic RAG with reasoning is important here. But for a question like this, if you really want to trust the answer you're getting back, you're better off having precomputed the actual answer; that way, the agent is simply retrieving the answer, and it doesn't need to calculate anything. Or again, it might be an API call that you can make to a CRM or to a financial system to get the precomputed answer back.
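As a small sketch of the "pull structured data, then calculate" idea, here's a trend check over made-up monthly churn figures; in practice the figures would come from a SQL query or a CRM API call.

```python
# Sketch of the post-processing step: calculate the trend from structured data.
# The churn figures are made-up sample values.

monthly_churn = {  # month -> churn rate (%)
    "2025-01": 3.1, "2025-02": 3.0, "2025-03": 2.8,
    "2025-04": 2.9, "2025-05": 2.6, "2025-06": 2.4,
}

months = sorted(monthly_churn)
first_half = sum(monthly_churn[m] for m in months[:3]) / 3
second_half = sum(monthly_churn[m] for m in months[3:]) / 3

trend = "down" if second_half < first_half else "up"
print(f"Churn is trending {trend}: {first_half:.2f}% -> {second_half:.2f}% average")
```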
And finally, there's a category of questions where the actual starting premise is false.
And it's important that the LLM doesn't
go off track and stays honest to the
data. So for example, which VP led the
Berlin office before it closed? In this
case, the question contains a false
assumption. The company never had a
Berlin office. So the AI agent needs to
recognize and correct the premise rather
than hallucinating the answer to fit
into the user's question. So in the example documents here, we might have office locations, and Berlin is not mentioned. We might have a leadership history section where the VPs are mentioned, and there may have been expansion plans for Berlin, but no timeline was set. So
it all comes back to the chunks that are
returned from the vector store. It's
important that the agent stays true to
the information that is in these chunks.
Because if the retrieved documents
mentioned Berlin and VP along with
expansion plans, it could quite easily hallucinate the connection: "Sarah Chen led operations before the Berlin office
closed." So for strategies here, carrying out an exhaustive search is important. Context expansion, that video I mentioned earlier, could help here as well. And agentic RAG with verification could work here, because I went through the verify-answer pattern in our RAG design patterns video; this would allow a verification step to check the answer versus the retrieved context, to make sure it's staying true to what it actually learned.
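For reference, a minimal sketch of that verification step might look like this; call_llm() is a hypothetical placeholder, and the prompt wording is just one way to frame the check.

```python
# Sketch of the verify-answer step: a second LLM pass that checks the draft
# answer against the retrieved chunks before it is returned to the user.
# call_llm() is a hypothetical placeholder.

def call_llm(prompt: str) -> str:
    # Placeholder: call your LLM and return its text response.
    raise NotImplementedError

def verify_answer(question: str, draft_answer: str, retrieved_chunks: list[str]) -> bool:
    prompt = (
        "Answer strictly YES or NO. Is every claim in the draft answer "
        "supported by the context below?\n\n"
        f"Question: {question}\n"
        f"Draft answer: {draft_answer}\n"
        "Context:\n" + "\n---\n".join(retrieved_chunks)
    )
    return call_llm(prompt).strip().upper().startswith("YES")

# If verification fails, the agent should retrieve again or admit it doesn't
# know, rather than inventing a Berlin office that never existed.
```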
But it all comes back to eval and ground-truth testing. You
need to keep your specialized agents on
track. They need to stay true to the
data and not dive into their training
data or infer connections from
disconnected information. I have a full video on evaluating AI systems using the open-source library DeepEval, so that one is definitely worth a watch. And if
you'd like to learn more about RAG design patterns, including this verify-answer pattern that I just described, then click on this video here to learn all about it. Thanks for watching, and I'll see you in the next one.
👉 Get access to our Hybrid Retrieval n8n workflows in our community: https://www.theaiautomators.com/?utm_source=youtube&utm_medium=video&utm_campaign=tutorial&utm_content=retrieval-engineering

Vector Search isn't the silver bullet everyone thinks it is. After working with hundreds of community members who are building production-grade RAG agents, we keep seeing the same failures. Questions that should be simple get hallucinated answers. Queries that need exact matches return "similar" results. Systems that work in demos break in production. Vector search is brilliant for semantic queries, but there's an entire category of questions that need completely different retrieval strategies. In this deep-dive, I'll show you 9 real-world examples where Vector Search fails, and demonstrate the retrieval engineering strategies that actually work in production.

🎯 What You'll Learn:
✅ Why similarity ≠ relevance (and what to do about it)
✅ 9 types of queries where vector search breaks down
✅ Exact match strategies for error codes and IDs
✅ Structured data approaches for tabular queries
✅ Aggregation techniques for counting and computing
✅ GraphRAG for global knowledge patterns
✅ Multi-hop reasoning with knowledge graphs
✅ Multimodal RAG for image retrieval
✅ Handling false premise questions without hallucinations
✅ Complete evaluation strategies using DeepEval

🔗 Useful Links:
Amit Verma Article: https://venturebeat.com/ai/from-shiny-object-to-sober-reality-the-vector-database-story-two-years-later
IBM Know Your RAG Research: https://arxiv.org/html/2411.19710v1
CRAG Benchmark: https://arxiv.org/html/2406.04744v2

⏱️ Timestamps:
00:00 - The Vector Search Myth
05:20 - #1 Summary Questions
12:27 - #2 Simple Questions
15:25 - #3 Simple Questions with Conditions
19:55 - #4 Aggregation Questions
20:41 - #5 Global Questions
22:21 - #6 Multi-Hop Questions
23:59 - #7 Multi-Modal Questions
25:12 - #8 Post-Processing Questions
26:18 - #9 False Premise Questions

💬 Questions or Comments?
What retrieval challenges are you facing in your RAG systems? Which of these 9 problems have you encountered? Drop your thoughts below!