The RAG Hype is Dead... Unless You Master These Optimizations.

Senior engineers and architects: stop "vibe checking" your RAG! This deep dive with Stephen Batifol cuts through the marketing noise to deliver the brutal truth about what makes Retrieval-Augmented Generation (RAG) and self-hosted multimodal systems (using Pixtral and vLLM) fail in production. He reveals the critical, often-overlooked decisions you must make on embedding models, vector indexing, and inference optimization to achieve scale, speed, and accuracy.

⏱️ Video Timestamps (For Navigation)
0:00 - Introduction & Multimodal RAG with Pixtral & vLLM
1:45 - What Vector Search Actually Is and Why It Matters
4:10 - Indexing Deep Dive: FLAT, IVF, and HNSW Explained
7:50 - The Index Tradeoff Matrix: Speed vs. Accuracy vs. Cost
9:45 - Embedding Models: Stop Using the Wrong Ones!
13:30 - The RAG Pipeline: From Unstructured Data to Retrieval
14:20 - Why Vibe Coding Your RAG is a Disaster (Proper Evals)
16:45 - RAG vs. The Long Context LLM Myth (Llama 4, Gemini 2.5 Benchmarks)
19:10 - Hybrid Search (BM25 + Similarity) and Metadata Filtering
22:30 - Building the Self-Hosted Multimodal RAG Stack (Pixtral, Milvus, vLLM)
25:05 - The Inference Challenge: Latency, Throughput, and Batching
27:15 - Model Parallelism: Why You Need to Split the Model (Tensor Parallelism)
29:30 - Optimization Secrets: Quantization & Paged Attention (KV Cache)
31:50 - Live Demo (Architecture & Setup)
33:05 - Q&A: Chunking Strategy, CAG, and More

🔗 Transcript available on InfoQ: https://bit.ly/4hPcuHr

#RAG #LLMOps #VectorDatabase

💬 Discussion Question: What is the biggest indexing or embedding challenge you face right now at scale? Let us know your model/index choice in the comments below! 👇