Learn how to replace "looks right to me" with a repeatable, automatable evaluation signal for your RAG pipelines and AI agents. We cover the full eval stack:
- Building a golden dataset
- Measuring retrieval quality with Precision@k, Recall@k, and MRR
- Evaluating generated answers with semantic similarity and LLM-as-judge
- Diagnosing agent failures by measuring tool routing and end-to-end quality separately

GitHub Repo: https://github.com/CumulusCycles/AI_Engineering_Hands-On
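
The retrieval metrics above are simple enough to compute without a framework. Below is a minimal, dependency-free sketch; the golden-dataset shape (a ranked list of retrieved doc IDs plus a set of ground-truth relevant IDs per query) is illustrative, not the structure used in the repo.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved docs that are actually relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant docs that show up in the top-k."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant doc (0.0 if none is found)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# One golden-dataset entry: the retriever's ranked output vs. ground truth.
retrieved = ["doc_7", "doc_2", "doc_9", "doc_4", "doc_1"]
relevant = {"doc_2", "doc_4"}

print(precision_at_k(retrieved, relevant, k=5))  # 0.4
print(recall_at_k(retrieved, relevant, k=5))     # 1.0
print(mrr(retrieved, relevant))                  # 0.5 (first hit at rank 2)
```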
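
For scoring generated answers by semantic similarity, one common approach is cosine similarity between sentence embeddings. This sketch assumes the sentence-transformers package; the model name and the 0.8 pass threshold are assumptions, not values from the video.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def semantic_similarity(generated: str, reference: str) -> float:
    """Cosine similarity between the generated and reference answers."""
    emb = model.encode([generated, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

score = semantic_similarity(
    "RAG retrieves documents and feeds them to the model as context.",
    "Retrieval-augmented generation grounds the LLM in retrieved documents.",
)
print(score, "PASS" if score >= 0.8 else "FAIL")  # threshold is an assumption
```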
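
Semantic similarity misses factual errors that reuse the reference's wording, which is where LLM-as-judge comes in. Here is a hedged sketch using the OpenAI SDK; the model name, prompt, and 1-5 rubric are illustrative assumptions, and any capable judge model and rubric can be swapped in.

```python
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

JUDGE_PROMPT = """Rate the ANSWER against the REFERENCE on a 1-5 scale
for factual correctness. Respond with only the number.
QUESTION: {question}
REFERENCE: {reference}
ANSWER: {answer}"""

def judge(question: str, reference: str, answer: str) -> int:
    """Ask a judge model to grade an answer; returns the 1-5 score."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model, swap as needed
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, answer=answer)}],
        temperature=0,  # deterministic grading
    )
    return int(response.choices[0].message.content.strip())
```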
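
Finally, diagnosing agents means scoring tool routing separately from end-to-end answer quality, so a failure can be traced to a routing miss versus a bad generation. A minimal sketch follows; the trace fields are illustrative, not a specific framework's schema.

```python
# Each trace records which tool the agent should have called, which it
# actually called, and whether the final answer passed evaluation.
traces = [
    {"expected_tool": "search_docs", "chosen_tool": "search_docs", "answer_ok": True},
    {"expected_tool": "calculator",  "chosen_tool": "search_docs", "answer_ok": False},
    {"expected_tool": "search_docs", "chosen_tool": "search_docs", "answer_ok": False},
]

routing_acc = sum(t["expected_tool"] == t["chosen_tool"] for t in traces) / len(traces)
e2e_quality = sum(t["answer_ok"] for t in traces) / len(traces)

print(f"tool routing accuracy: {routing_acc:.2f}")  # 0.67
print(f"end-to-end quality:    {e2e_quality:.2f}")  # 0.33
# The third trace routed correctly but still failed, so that bug lives in
# generation, not routing: the two metrics localize failures differently.
```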