In the previous video, we covered why AI evaluation matters and what breaks down. In this video, we cover how to actually build a pipeline that measures what matters and catches failures before users do.

🧠 WHAT'S COVERED:
- Golden datasets: how to build a reliable test set from production logs, how many examples you actually need, and why you should rotate it every 6 months
- RAGAS in practice: the 4 metrics that independently diagnose retriever and generator failures in RAG pipelines
- Trace-level agent evaluation: why evaluating only the final output misses everything, and the 3-level framework that gives you a complete picture
- Eval-driven development: write the test before the prompt, and why every production failure should immediately become a test case
- CI/CD quality gates: cheap deterministic checks on every commit, LLM-as-judge at deploy time, and how to set thresholds that don't block velocity
- Judge calibration: why correlation alone isn't sufficient, what Cohen's Kappa tells you that correlation doesn't, and how multi-judge consensus achieves near-human reliability

📊 KEY DATA:
- Golden dataset minimum: ~100 examples for statistical reliability; ~500 for segment-level analysis (LXT.ai)
- RAGAS scores ≥ 0.8 on all four metrics = strong RAG pipeline performance
- LLM judge accuracy drops from ~93% on short traces to ~75% on 50+ step agent traces (GUIDE paper, Apr 2026)
- Cohen's Kappa ≥ 0.80 = "very strong" alignment threshold; fewer than half of 54 tested LLMs cleared this bar (ICLR 2026)
- 3-judge consensus achieves Cohen's Kappa ~0.95 and Macro F1 of 97–98%

📚 SOURCES:
- GUIDE paper: trajectory-aware evaluation (arXiv, Apr 2026)
- Judge's Verdict Benchmark, Han et al. (ICLR 2026)
- RAGAS official documentation (Dec 2025)
- LXT.ai benchmark analysis (Mar 2026)
- LangChain agent eval readiness checklist (Mar 2026)
- Braintrust LLM evaluation guide (Feb 2026)

⏱️ TIMESTAMPS:
0:00 - Introduction
0:35 - Golden Datasets
2:05 - RAGAS in Practice
3:35 - Trace-Level Agent Evaluation
5:10 - Eval-Driven Development
6:30 - CI/CD Quality Gates
7:45 - Judge Calibration
9:05 - Recap

#AIEvals #LLMEvaluation #RAGAS #AIAgents #CICDPipeline #GoldenDataset #AgentEvaluation #AIQuality #AIInfrastructure #machinelearning

Subscribe to Scrollypedia for more technical deep dives into AI infrastructure.

DISCLAIMER: This content is for educational purposes. All statistics are sourced from publicly available reports and company announcements as of April 2026. Market projections are based on industry research reports and should not be considered investment advice.

© 2026 Scrollypedia