Level: Advanced

🎙️ Bot Thoughts Podcast – Episode P035
RAG Evaluation in Production: Why Your Accuracy Score Is Lying

A pattern showing up across 2026 production RAG deployments: the internal eval reports 90% accuracy, then customers file tickets the moment the system meets real traffic. The eval score isn't measuring what you think it is.

In this episode, Alex and Sam break down:

• The four independent RAG failure modes a single accuracy number averages over
• What RAGAS actually measures (faithfulness, answer relevancy, context precision, context recall) – and where it falls short
• Why your eval set is software and needs versioning, ownership, and continuous maintenance
• Tiered eval design: a 50-question gold set, 500-question weekly QA, and 5,000-question monthly drift detection
• Catching judge model drift: Cohen's kappa, parallel judges, and the Anthropic Opus 3 judge that silently dropped from 0.81 to 0.64
• Online evaluation patterns: implicit feedback, sample-and-grade pipelines, and regret budgets
• The real 2026 stack: RAGAS, TruLens, Phoenix, Patronus, ARES – and when to use which
• Three eval anti-patterns that keep showing up in production retrospectives

The concrete experiment for this week: pull 200 real production questions, run RAGAS, and look at the four metrics separately. The lowest one tells you which failure mode is actually dominant in your system.

──────────────

🎧 Listen on Spotify: https://open.spotify.com/show/2X82OW5nzyaXT0AQ7HZhHh
📝 More from AmtocSoft: https://amtocsoft.blogspot.com
☕ Support the show: https://buymeacoffee.com/amtocsoft

──────────────

Bot Thoughts is the AmtocSoft podcast – practical, opinionated takes on AI engineering, software architecture, and the reality of running technology in production.

#AIEngineering #RAG #LLMOps #RAGAS #LLMEval #Podcast
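If you want a head start on this week's experiment, the triage step is small enough to sketch. The snippet below is illustrative only: the metric-to-failure-mode mapping is an assumed reading of the four RAGAS metrics (not an official RAGAS taxonomy), the function names are made up, and the Cohen's kappa helper is the plain textbook formula discussed in the judge-drift segment, not a library call.

```python
# Illustrative sketch (hypothetical names): triage the four RAGAS metric
# scores from your 200-question sample, and compute Cohen's kappa between
# two judge models' binary verdicts.

# Assumed interpretation of each metric's dominant failure mode:
FAILURE_MODES = {
    "faithfulness": "generation hallucinates beyond the retrieved context",
    "answer_relevancy": "answer drifts away from the question asked",
    "context_precision": "retrieval returns noisy or irrelevant chunks",
    "context_recall": "retrieval misses the evidence needed to answer",
}

def dominant_failure_mode(scores: dict[str, float]) -> tuple[str, str]:
    """Return (metric, failure mode) for the lowest-scoring metric."""
    worst = min(scores, key=scores.get)
    return worst, FAILURE_MODES[worst]

def cohens_kappa(judge_a: list[int], judge_b: list[int]) -> float:
    """Chance-corrected agreement between two judges' binary verdicts."""
    n = len(judge_a)
    # Observed agreement: fraction of items where both judges agree.
    p_o = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Expected chance agreement from each judge's marginal label rates.
    labels = set(judge_a) | set(judge_b)
    p_e = sum((judge_a.count(l) / n) * (judge_b.count(l) / n) for l in labels)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)
```

Run the triage on your own scores and track kappa between your primary and a parallel judge week over week: a slide from around 0.8 toward the mid-0.6s is exactly the silent-drift pattern the episode describes.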