You built a RAG system. The answers look correct. So you ship it. That's not evaluation. That's **vibe checking**. In this video, I break down what a recent comprehensive research paper teaches us about **how RAG systems should actually be evaluated**, and why most systems fail silently.

We cover:

* Why prompt tweaks create regression loops
* RAG as an **open-book exam**, not a black box
* Evaluation layers: pre-processing, retrieval, generation, safety, and efficiency
* Why **not all metrics should be used in every system**
* The only two real evaluation methods today (both sketched below):
  * datasets + mathematical metrics
  * LLMs as judges

No hype. No tool pushing. Just the mental model you need to stop guessing and start verifying. If you're building RAG systems and fixing bugs by "adjusting the prompt until it feels right," this video is for you.

**Hashtags:** #rag #retrievalaugmentedgeneration #LLMEvaluation #aiengineering #generativeai #machinelearning #aiarchitecture #mlops #aisystems #hallucinations #promptengineering #airesearch #llms
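A minimal sketch of the first method (datasets + mathematical metrics), scoring retrieval with hit rate and mean reciprocal rank over a small labeled set. The `eval_set`, chunk IDs, and function names here are illustrative assumptions, not anything from the video:

```python
def hit_rate(retrieved: list[str], relevant: set[str]) -> float:
    """1.0 if any retrieved chunk is relevant to the query, else 0.0."""
    return 1.0 if any(doc in relevant for doc in retrieved) else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk (0.0 if none found)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# Tiny hypothetical eval set: each query maps to retrieved chunk IDs
# plus the chunk IDs that actually answer it (the ground-truth labels).
eval_set = [
    {"retrieved": ["c3", "c7", "c1"], "relevant": {"c1"}},
    {"retrieved": ["c9", "c2"], "relevant": {"c4"}},
]

avg_hit = sum(hit_rate(ex["retrieved"], ex["relevant"]) for ex in eval_set) / len(eval_set)
avg_mrr = sum(mrr(ex["retrieved"], ex["relevant"]) for ex in eval_set) / len(eval_set)
print(f"hit rate: {avg_hit:.2f}, MRR: {avg_mrr:.2f}")
```

The point of metrics like these is that they are deterministic: rerun them after every prompt or index change and a regression shows up as a number, not a feeling.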
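And a minimal sketch of the second method (an LLM as judge), grading an answer's faithfulness to the retrieved context. `call_llm`, `JUDGE_PROMPT`, and `judge_faithfulness` are hypothetical names; the dummy `call_llm` stands in for whatever provider API you actually use:

```python
JUDGE_PROMPT = """You are a strict evaluator. Given a question, retrieved context,
and an answer, reply with a single digit 1-5 for faithfulness:
5 = fully supported by the context, 1 = contradicts or ignores it.

Question: {question}
Context: {context}
Answer: {answer}
Score:"""

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in: wire this to your provider's chat/completions call.
    return "5"

def judge_faithfulness(question: str, context: str, answer: str) -> int:
    reply = call_llm(JUDGE_PROMPT.format(question=question, context=context, answer=answer))
    return int(reply.strip()[0])  # take the leading digit as the score

print(judge_faithfulness(
    "What is RAG?",
    "RAG augments an LLM with retrieved documents.",
    "RAG retrieves documents and feeds them to the model.",
))
```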