Your RAG pipeline "seems fine" - but have you actually measured it? In this video, we build a complete evaluation framework for RAG systems using RAGAS, DeepEval, and the latest LLM-as-judge techniques. You'll walk away with real numbers, a decision framework, and a working pipeline you can run today.

▶ WHAT'S COVERED:
— Why RAG evaluation is fundamentally harder than standard LLM eval (the two-component problem)
— Why BLEU and ROUGE scores fail completely for RAG systems
— The 4 core RAGAS metrics: Faithfulness, Answer Relevancy, Context Precision, Context Recall
— Extended metrics: Factual Correctness, Noise Sensitivity, Agent eval (2025 updates)
— Full framework comparison: RAGAS vs DeepEval vs TruLens vs LangSmith vs MLflow
— LLM-as-judge: how it works, which biases to watch for, when to trust it
— Building a production-grade evaluation pipeline with CI/CD integration
— Hot takes: what the industry is getting wrong about RAG evaluation

📸 SCREENSHOT MOMENTS IN THIS VIDEO:
— The RAG Evaluation Matrix (failure modes + the metrics that catch them)
— The Four Core RAGAS Metrics Quick Reference
— Framework Comparison Table (RAGAS vs DeepEval vs TruLens vs LangSmith vs MLflow)
— The Evaluation Pipeline Flowchart

⏱️ TIMESTAMPS:
0:00 - Intro — "Your RAG pipeline seems fine. That's the problem."
1:45 - Why RAG Evaluation Is Uniquely Hard
3:30 - The Two-Component Problem (Retriever + Generator fail independently)
5:00 - Why BLEU and ROUGE Fail for RAG
6:00 - The 4 Core RAGAS Metrics Overview
6:45 - Faithfulness (your hallucination detector)
9:00 - Answer Relevancy (does the answer address the question?)
11:00 - Context Precision (retriever ranking quality)
12:30 - Context Recall (did you miss important docs?)
14:00 - Extended Metrics: Beyond the Big Four
14:30 - Factual Correctness, Noise Sensitivity, Agent Metrics
16:00 - Metric Selection Decision Tree
16:30 - Framework Comparison: RAGAS vs DeepEval vs TruLens vs LangSmith
17:00 - RAGAS Deep Dive
18:00 - DeepEval (pytest for LLMs)
18:15 - TruLens (explainability approach)
19:00 - LangSmith & MLflow
19:30 - Framework Comparison Table 🖼️ SCREENSHOT THIS
20:00 - LLM-as-Judge: The Meta Problem
21:30 - Why It Actually Works (correlation with human raters)
22:00 - Known Biases: position, verbosity, self-preference
22:30 - Practical Recommendations for LLM Judges
23:00 - Building a Real Evaluation Pipeline
23:30 - Step 1: Building Your Test Dataset (manual vs synthetic vs production)
24:00 - Step 2: Choosing Your Metrics Mix
24:30 - Step 3: CI Thresholds and Automated Gates
25:00 - Step 4: Production Monitoring
26:00 - Hot Takes 🔥
27:00 - Wrap-Up, Homework & CTA

🔗 TOOLS REFERENCED:
— RAGAS: https://docs.ragas.io
— DeepEval by Confident AI: https://deepeval.com
— TruLens: https://www.trulens.org
— LangSmith: https://smith.langchain.com
— Arize Phoenix: https://phoenix.arize.com

📚 RAG/AI SERIES PLAYLIST: https://youtube.com/playlist?list=PLFSggxLuQqWLH-o1Phiq2gBiiDPbqmW_N&si=f2lFge6O7WtOzgJY

💬 Drop a comment: What score did YOUR RAG pipeline get when you ran RAGAS? I want to see the community baseline.

🔔 Subscribe for the full RAG/AI series — new video every [cadence].

#RAG #RAGAS #LLMEvaluation #VectorDatabase #AIEngineering #DeepEval #LangChain #LLM #RetrievalAugmentedGeneration #AITutorial
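
🧪 WANT TO RUN THE FOUR CORE METRICS BEFORE WATCHING?
Here is a minimal sketch of the kind of evaluation run covered in the video. It assumes the classic RAGAS evaluate() API (pip install ragas datasets) and an OPENAI_API_KEY in your environment, since these metrics use an LLM judge under the hood. The sample data and column values are illustrative only; swap in your own questions, retrieved contexts, answers, and reference answers.

# Minimal RAGAS run over a tiny hand-built test set (illustrative data, not from the video).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,        # is the answer grounded in the retrieved contexts?
    answer_relevancy,    # does the answer actually address the question?
    context_precision,   # are the relevant chunks ranked near the top?
    context_recall,      # did retrieval cover what the reference answer needs?
)

# One record per RAG interaction: question, retrieved contexts, generated answer,
# and a reference answer (ground_truth) for the recall-style metrics.
samples = {
    "question": ["What does context precision measure?"],
    "contexts": [[
        "Context precision measures whether the chunks relevant to the question "
        "are ranked near the top of the retrieved results."
    ]],
    "answer": ["It measures how well the retriever ranks relevant chunks at the top."],
    "ground_truth": ["Context precision scores the ranking quality of retrieved chunks."],
}

result = evaluate(
    Dataset.from_dict(samples),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # aggregate per-metric scores; result.to_pandas() gives row-level detail

Run it on 20-50 real questions from your own pipeline and you'll have the baseline numbers the comment section is asking for.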