I've heard about five different variations of my first name: Joseph, Joe, Josef with an
F, José from my friends in Guadalajara, and then there's that one guy in grad school who
called me Giuseppe. Meanwhile, I don't think I've ever had anything similar with my last name, just
Washington. The retrieval in retrieval augmented generation or RAG is kind of like
that. We all agree on the augmented generation part of the name, but retrieval comes in
multiple flavors, and the retrieval strategy you choose can make or break your agentic AI
system. This matters in any generic RAG system: a user comes in with a query
to your application, which itself is connected to your LLM, and you want to
give that LLM access to different knowledge sources.
RAG works by fetching relevant chunks from your knowledge base and feeding them into the LLM. The
quality of that retrieval method, though, determines how factual and relevant the answers
will be. Some methods are lightning fast, while others are more flexible
when it comes to synonyms, context and data that spans different modalities. So
let's count down the top three retrieval strategies and end with the one that most teams
are betting on today. Okay. Starting with number three, sparse
retrieval. This is a foundational, classic method. It's fairly old, about 50
years, and it relies on keyword search. Sparse retrieval uses methods like TF-IDF,
the well-known classic, as well as BM25.
It counts how often query terms appear in your document and then scores the documents
accordingly. Its pros are it's simple, fast and scalable,
but it doesn't handle synonyms or context very well. Still, in some cases, BM25
can outperform more expensive deep learning models on domain-specific terms. Question: when
should you use it? Any situation where exact wording matters, so short,
well-defined queries, code, search logs or legal clauses are all
examples. And it doesn't require embeddings. So it's cost-effective, and it scales really well.
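To make the scoring concrete, here's a minimal BM25 (Okapi) scorer over pre-tokenized documents. This is a toy sketch, not a production implementation like Lucene's; the example documents and the default k1 and b values are illustrative:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in docs against query_terms with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    # Document frequency: how many docs contain each term.
    df = Counter()
    for d in docs:
        for term in set(d):
            df[term] += 1
    scores = []
    for d in docs:
        tf = Counter(d)  # term frequency within this document
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # Term-frequency saturation (k1) and length normalization (b).
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    ["breach", "of", "contract", "clause"],
    ["pasta", "recipe", "with", "basil"],
]
print(bm25_scores(["contract", "clause"], docs))  # first doc matches, second scores 0.0
```

Because the score is zero for any document sharing no terms with the query, you can see directly why exact wording matters so much here.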
You're probably already using open-source examples like Elasticsearch and Apache Lucene,
both built on BM25, and even Milvus now supports BM25, in
addition to vector embeddings. Now on to number two.
Dense retrieval, aka the semantic
workhorse. This technology is about 5 to 10 years old,
so fairly recent. And in dense retrieval, both queries and documents are
mapped into high-dimensional vector space.
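In miniature, that mapping looks like this. The 3-dimensional vectors below are made up for illustration; a real system would get them from an embedding model, and they would have hundreds of dimensions:

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings" (a real embedding model would produce these vectors).
doc_vectors = {
    "How do I reset my password?": [0.9, 0.1, 0.0],
    "Best hiking trails near Denver": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # stands in for embedding "I forgot my login"

# Rank documents by semantic similarity to the query.
ranked = sorted(doc_vectors,
                key=lambda d: cosine(query_vector, doc_vectors[d]),
                reverse=True)
```

Note that the query shares no words with the top document; it wins purely because its vector points in a similar direction.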
And results are found based on the semantic similarity, i.e., the meanings of the words instead
of exact matches. So this depends on embedding models. An embedding model, like the open-source
sentence-transformers models, takes text and converts it into a vector
of numbers. Texts with similar meaning land close together in that vector space, where
similarity is calculated using algorithms like approximate nearest neighbor or
k-nearest neighbor. Open-source examples include FAISS from Meta and
JVector, an open-source, high-performance Java library that speeds up dense search,
or dense retrieval, in enterprise RAG systems. Dense retrieval makes natural language queries
shine. It's perfect for chatbots, customer service and research over unstructured knowledge bases
where people might phrase things in many different ways. It's powerful and context-aware,
but it can miss rare or jargon-heavy terms. It's also not good with short, few-word
queries. On to number one: hybrid
retrieval, aka the current state of the art.
This one is the new kid on the block. It's only about 2 to 3 years old
in practical deployments, and it combines the best of both worlds, vector plus
keyword search.
The semantic matching handles synonyms and concepts, while the keyword matching
ensures that rare but critical terms don't get lost. Benchmarks
show hybrid retrieval consistently outperforming dense-only retrieval,
boosting both precision and recall. So how does it work? The query runs both ways in
parallel: once as a vector embedding against your embedded knowledge set and again as a keyword
search. It then uses a fusion algorithm to merge results based on scores
from both. The most common fusion algorithm is a weighted sum, which
strikes a balance between, for example, 70% dense and
30% sparse. Another very popular method is
reciprocal rank fusion, or RRF, which doesn't use raw scores
but instead merges based on the ranked positions from each retriever.
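Both fusion methods fit in a few lines. This is a sketch, not any particular library's API; the 70/30 weighting and the RRF constant k=60 are the commonly cited defaults, and the document ids are illustrative:

```python
def weighted_sum(dense_scores, sparse_scores, w_dense=0.7):
    """Blend per-document scores from the two retrievers (70% dense, 30% sparse)."""
    docs = set(dense_scores) | set(sparse_scores)
    return {d: w_dense * dense_scores.get(d, 0.0)
               + (1 - w_dense) * sparse_scores.get(d, 0.0)
            for d in docs}

def rrf(rankings, k=60):
    """Reciprocal rank fusion: ignore raw scores, merge by rank position.

    Each ranking is a list of doc ids, best first; k=60 is the usual constant
    that damps the influence of top-ranked outliers.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return scores

# One ranked list from the dense retriever, one from the sparse retriever.
fused = rrf([["a", "b", "c"], ["b", "a", "d"]])
```

A document near the top of both lists accumulates the largest RRF score, so agreement between retrievers is rewarded even when their raw scores aren't on comparable scales, which is exactly why RRF is popular when the two retrievers score things so differently.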
Hybrid retrieval works across use cases, but it especially shines in domains with
specialized jargon, such as the legal, technical and medical fields.
Hybrid is number one because it balances speed, precision and recall. That's why
it has become the default choice for serious RAG deployments, and also why offerings like
Elasticsearch, Milvus, Weaviate and DataStax Astra DB have all made it easy to
experiment with hybrid retrieval. For some of you, this may feel like a Taylor Swift Eras
tour, but with retrieval strategies and with the eras spanning the last 50 years.
If you're a data scientist or a developer, I encourage you to embrace the hybrid retrieval era.
Sparse retrieval is fast and exact, and dense retrieval is context-aware
and flexible, but hybrid retrieval gives you the best of both worlds, and that
is why it's top of the list.
Joseph Washington breaks down Sparse, Dense, and Hybrid Retrieval, showing how they enhance precision, recall, and context awareness.