Want to learn real AI Engineering? Go here: https://go.datalumina.com/QpP01LX
Want to start freelancing? Let me help: https://go.datalumina.com/jOYILqO

🔗 GitHub Repository
https://github.com/daveebbelaar/ai-cookbook/tree/main/knowledge/hybrid-retrieval

🛠️ My VS Code / Cursor Setup
https://youtu.be/mpk4Q5feWaw

⏱️ Timestamps
00:00 Hybrid Retrieval Overview
01:00 Meet the Finance QA Data
03:29 Exploring Queries and Corpus
06:09 Mapping Questions to Documents
09:37 Retrieval Pipeline Roadmap
10:05 BM25 Keyword Retrieval
14:44 Tokenizing the Corpus
17:11 Building the BM25 Index
19:25 Querying with BM25
23:56 Why Dense Embeddings Help
25:13 Creating Dense Embeddings
32:11 Dense Search in Python
36:45 Dense Retrieval Compared
37:12 Reciprocal Rank Fusion
40:51 Fusing Search Results
43:56 Adding the Re-Ranker
46:44 Re-Ranking Hybrid Candidates
49:36 Evaluating Retrieval Quality
54:27 Tuning for Your Own Data

📌 Description
In this lecture, I build a production-style hybrid retrieval system from scratch, combining BM25, dense embeddings (OpenAI text-embedding-3-small), reciprocal rank fusion (RRF), and Cohere's re-ranker into a single pipeline. Using the FinanceQA dataset from the BEIR benchmark, I walk through each stage: loading and inspecting the corpus, building a BM25 index, generating dense embeddings, fusing rankings with RRF, and re-ranking the top candidates. The final section evaluates all four approaches with NDCG@10, showing how the full hybrid plus re-ranker stack outperforms each method on its own.

👋🏻 About Me
Hi! I'm Dave, AI Engineer and founder of Datalumina®. On this channel, I share practical tutorials that teach developers how to build production-ready AI systems that actually work in the real world. Beyond these tutorials, I also help people start successful freelancing careers. Check out the links above to learn more!
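
💡 Quick Sketch: RRF and NDCG@10
For a rough idea of the fusion and evaluation steps named in the description, here is a minimal Python sketch. It is not the code from the video or the repository. The function names, the document IDs, and the relevance labels are hypothetical, `k=60` is only a commonly used RRF constant (the video may use a different value), and the NDCG variant shown uses linear gain, which is an assumption.

```python
import math
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores the sum of 1 / (k + rank) over the lists it
    appears in, with ranks starting at 1. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k with linear gain: the relevance label is used as the gain."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(position + 1)
        for position, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(
        gain / math.log2(position + 1)
        for position, gain in enumerate(ideal_gains, start=1)
    )
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical rankings from a keyword retriever and a dense retriever.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])

# Hypothetical relevance labels for one query: only d3 is relevant.
score = ndcg_at_k(fused, {"d3": 1}, k=10)
print(fused, round(score, 4))
```

Documents that appear near the top of both lists (here `d1` and `d3`) rise above documents found by only one retriever, which is the reason for fusing ranks instead of raw scores: BM25 scores and embedding similarities are not on a comparable scale.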