Today’s Build

What We’re Building:
- Automated RAG evaluation pipeline measuring groundedness, relevance, and completeness
- Real-time metrics dashboard showing RAGAS scores across evaluation datasets
- Comparative benchmarking system tracking performance across model configurations
- Integration with L26’s conversation system for multi-turn evaluation
- Synthetic test data generator creating realistic question-answer-context triplets

Building on L26: We extend the ConversationBufferMemory and RAG chain from L26 by adding quantitative evaluation. Instead of subjectively assessing multi-turn performance, we now measure it with metrics like faithfulness scores and context recall.

Enabling L28: The evaluation framework we build today becomes critical for L28’s tool-equipped agent. When agents combine retrieval with tool calls, evaluation complexity explodes—you need to verify both retrieval quality AND tool execution correctness. Today’s metrics foundation makes that feasible.
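To make the pipeline shape concrete before we wire in RAGAS, here is a minimal, self-contained sketch of evaluating question-answer-context triplets. The names (`EvalSample`, `score_groundedness`, `score_relevance`, `evaluate_dataset`) are illustrative, not the RAGAS API, and the lexical-overlap scoring is a crude stand-in for the LLM-judged metrics we use later; the point is the harness structure: samples in, per-metric scores out, averaged into a report.

```python
# Illustrative sketch only: approximates groundedness/relevance with token
# overlap so the evaluation-harness shape is clear without API keys or RAGAS.
from dataclasses import dataclass

@dataclass
class EvalSample:
    question: str
    answer: str
    contexts: list  # retrieved passages the answer should be grounded in

def _tokens(text: str) -> set:
    # Lowercase, strip trailing punctuation — a deliberately naive tokenizer.
    return {t.strip(".,?!").lower() for t in text.split() if t}

def score_groundedness(sample: EvalSample) -> float:
    """Fraction of answer tokens that appear in some retrieved context."""
    answer_toks = _tokens(sample.answer)
    context_toks = set().union(*(_tokens(c) for c in sample.contexts))
    if not answer_toks:
        return 0.0
    return len(answer_toks & context_toks) / len(answer_toks)

def score_relevance(sample: EvalSample) -> float:
    """Question/answer token overlap as a crude relevance proxy."""
    q, a = _tokens(sample.question), _tokens(sample.answer)
    if not q:
        return 0.0
    return len(q & a) / len(q)

def evaluate_dataset(samples: list) -> dict:
    """Average each metric across the dataset, RAGAS-report style."""
    n = len(samples)
    return {
        "groundedness": sum(score_groundedness(s) for s in samples) / n,
        "relevance": sum(score_relevance(s) for s in samples) / n,
    }

samples = [
    EvalSample(
        question="What port does the API listen on?",
        answer="The API listens on port 8080.",
        contexts=["The API listens on port 8080 by default."],
    ),
]
report = evaluate_dataset(samples)  # e.g. {"groundedness": 1.0, ...}
```

Swapping `score_groundedness` for RAGAS's faithfulness metric (or `score_relevance` for answer relevancy) later only changes the scoring functions; the dataset and reporting loop stay the same, which is what makes the L28 extension tractable.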