How do you know if your RAG system is actually performing well? In traditional machine learning, we rely on simple accuracy scores. But in the world of Generative AI, where outputs are free-form text, "accuracy" isn't enough. In this video, we explore the critical discipline of RAG Evaluation and how to measure the quality of your AI responses using production-grade metrics.

In this session, we cover:

1. The Shift in Evaluation: Why we move away from fixed labels to measuring Relevance, Grounding, and Factual Consistency.
2. Decoupling Evaluation: A key LLM Ops principle: keep your evaluation system separate from your inference pipeline to ensure unbiased signals.
3. Core RAG Metrics:
   - Answer Relevancy: Does the model actually address the user's question?
   - Faithfulness: Is the answer grounded in the retrieved context, or is the model hallucinating extra info?
4. Structured Data Evaluation: How to wrap queries and contexts into datasets for automated evaluation frameworks (like Ragas).
5. From Demo to System: A summary of how we've moved from a simple script to an observable, controllable, and reliable production architecture.
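To make the two core metrics and the dataset-wrapping step concrete, here is a minimal sketch. The token-overlap scorers below are illustrative stand-ins, not the Ragas implementation (Ragas uses LLM-based judges); the function names and the dict-of-lists dataset shape are assumptions chosen to mirror the question / answer / contexts columns such frameworks typically consume.

```python
import re

def _tokens(text):
    # Lowercase and strip punctuation so overlap isn't broken by "RAG?" vs "RAG".
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def faithfulness(answer, contexts):
    """Fraction of answer tokens that appear in the retrieved contexts.
    A low score suggests the model is adding ungrounded (hallucinated) info."""
    answer_toks = _tokens(answer)
    context_toks = set().union(*(_tokens(c) for c in contexts))
    if not answer_toks:
        return 0.0
    return len(answer_toks & context_toks) / len(answer_toks)

def answer_relevancy(question, answer):
    """Fraction of question tokens echoed in the answer: a crude proxy
    for whether the answer actually addresses the question."""
    question_toks = _tokens(question)
    if not question_toks:
        return 0.0
    return len(question_toks & _tokens(answer)) / len(question_toks)

# Wrap queries, answers, and retrieved contexts into a columnar dataset,
# the shape automated evaluation frameworks generally expect (hypothetical sample).
eval_dataset = {
    "question": ["What is RAG?"],
    "answer":   ["RAG combines retrieval with generation."],
    "contexts": [["RAG combines a retriever with a generator model."]],
}

for q, a, ctxs in zip(eval_dataset["question"],
                      eval_dataset["answer"],
                      eval_dataset["contexts"]):
    print(f"faithfulness={faithfulness(a, ctxs):.2f} "
          f"relevancy={answer_relevancy(q, a):.2f}")
```

Because evaluation runs over this dataset rather than inside the inference pipeline, the scoring code stays decoupled from the system it measures, which is exactly the unbiased-signal principle covered in the session.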