Level: Advanced

🎙️ Bot Thoughts Podcast – Episode P035
RAG Evaluation in Production: Why Your Accuracy Score Is Lying

A pattern showing up across 2026 production RAG deployments: the internal eval reports 90% accuracy, then customers file tickets the moment the system meets real traffic. The eval score isn't measuring what you think it is.

In this episode, Alex and Sam break down:

• The four independent RAG failure modes a single accuracy number averages over
• What RAGAS actually measures (faithfulness, answer relevancy, context precision, context recall) – and where it falls short
• Why your eval set is software and needs versioning, ownership, and continuous maintenance
• Tiered eval design: a 50-question gold set, 500-question weekly QA, and 5,000-question monthly drift detection
• Catching judge model drift: Cohen's kappa, parallel judges, and the Anthropic Opus 3 judge that silently dropped from 0.81 to 0.64
• Online evaluation patterns: implicit feedback, sample-and-grade pipelines, and regret budgets
• The real 2026 stack: RAGAS, TruLens, Phoenix, Patronus, ARES – and when to use which
• Three eval anti-patterns that keep showing up in production retrospectives

The concrete experiment for this week: pull 200 real production questions, run RAGAS, and look at the four metrics separately. The lowest one tells you which failure mode is actually dominant in your system.

──────────────

🎧 Listen on Spotify: https://open.spotify.com/show/2X82OW5nzyaXT0AQ7HZhHh
📝 More from AmtocSoft: https://amtocsoft.blogspot.com
☕ Support the show: https://buymeacoffee.com/amtocsoft

──────────────

Bot Thoughts is the AmtocSoft podcast – practical, opinionated takes on AI engineering, software architecture, and the reality of running technology in production.

#AIEngineering #RAG #LLMOps #RAGAS #LLMEval #Podcast
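If you want a head start on this week's experiment, the triage step is small enough to sketch. The snippet below is illustrative only: the metric-to-failure-mode mapping is an assumed reading of the four RAGAS metrics (not an official RAGAS taxonomy), the function names are made up, and the Cohen's kappa helper is the plain textbook formula discussed in the judge-drift segment, not a library call.

```python
# Illustrative sketch (hypothetical names): triage the four RAGAS metric
# scores from your 200-question sample, and compute Cohen's kappa between
# two judge models' binary verdicts.

# Assumed interpretation of each metric's dominant failure mode:
FAILURE_MODES = {
    "faithfulness": "generation hallucinates beyond the retrieved context",
    "answer_relevancy": "answer drifts away from the question asked",
    "context_precision": "retrieval returns noisy or irrelevant chunks",
    "context_recall": "retrieval misses the evidence needed to answer",
}

def dominant_failure_mode(scores: dict[str, float]) -> tuple[str, str]:
    """Return (metric, failure mode) for the lowest-scoring metric."""
    worst = min(scores, key=scores.get)
    return worst, FAILURE_MODES[worst]

def cohens_kappa(judge_a: list[int], judge_b: list[int]) -> float:
    """Chance-corrected agreement between two judges' binary verdicts."""
    n = len(judge_a)
    # Observed agreement: fraction of items where both judges agree.
    p_o = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Expected chance agreement from each judge's marginal label rates.
    labels = set(judge_a) | set(judge_b)
    p_e = sum((judge_a.count(l) / n) * (judge_b.count(l) / n) for l in labels)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)
```

Run the triage on your own scores and track kappa between your primary and a parallel judge week over week: a slide from around 0.8 toward the mid-0.6s is exactly the silent-drift pattern the episode describes.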