Your RAG system scored 87%. Looks good. But what is it hiding?

One number across all queries is the most dangerous number in evaluation.

What is Stratified Evaluation?
Instead of one aggregate score, you break your eval set into groups and score each separately.
Slice by category → different query types fail differently
Slice by difficulty → easy queries show if it works, hard queries show if it actually works
The cell with the lowest score in your matrix is your highest priority fix

Example: University student portal RAG system, 4 query categories
Exam schedules → 91% correctness
Fee and payments → 78% correctness
Course registration → 64% correctness
Hostel and facilities → 71% correctness

Aggregate: 76%. Looks acceptable.
But students making real decisions about course registration are getting wrong answers 36% of the time. The aggregate would have let you ship that.

What a low stratified score is telling you:
Low hit rate in a category → retrieval problem. Fix chunking or embeddings
High hit rate, low faithfulness → generation problem. Fix your system prompt
High hit rate, high faithfulness, low correctness → interpretation problem. Fix your prompt engineering or check your golden dataset

Why this matters:
Every query category has a different failure mode
Every failure mode has a different fix
Without stratification you are guessing. With it, you are debugging.

Key idea: Slice your eval until the actionable insight is visible. The aggregate score is where debugging ends before it begins.

This is part of a series where I'm building a corpus of what I know, one concept at a time.

Connect with me: http://www.linkedin.com/in/chaithra-n-a91192225
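
For anyone who wants to try this on their own eval set, here is a minimal Python sketch of the slicing and the diagnosis rules from the post. Everything in it is an assumption for illustration: the record fields (`category`, `correct`), the function names `stratified_scores` and `diagnose`, the 0.75 threshold, and the synthetic data, which assumes 100 queries per category so that the per-category numbers match the portal example.

```python
from collections import defaultdict


def stratified_scores(results, key="category", metric="correct"):
    """Return (aggregate, per_slice) mean of `metric`, grouped by `key`.

    `results` is a list of dicts; the field names are hypothetical.
    """
    buckets = defaultdict(list)
    for row in results:
        buckets[row[key]].append(float(row[metric]))
    per_slice = {name: sum(vals) / len(vals) for name, vals in buckets.items()}
    total = sum(len(vals) for vals in buckets.values())
    aggregate = sum(sum(vals) for vals in buckets.values()) / total
    return aggregate, per_slice


def diagnose(hit_rate, faithfulness, correctness, threshold=0.75):
    """Map one slice's metrics to the likely failure stage.

    The threshold is an assumed cut-off, not a standard value.
    """
    if hit_rate < threshold:
        return "retrieval"       # fix chunking or embeddings
    if faithfulness < threshold:
        return "generation"      # fix the system prompt
    if correctness < threshold:
        return "interpretation"  # fix prompting or check the golden dataset
    return "ok"


# Synthetic eval results: 100 queries per category (assumed), with the
# number of correct answers taken from the portal example in the post.
correct_counts = {
    "Exam schedules": 91,
    "Fee and payments": 78,
    "Course registration": 64,
    "Hostel and facilities": 71,
}
results = [
    {"category": name, "correct": i < n_correct}
    for name, n_correct in correct_counts.items()
    for i in range(100)
]

aggregate, per_slice = stratified_scores(results)
worst = min(per_slice, key=per_slice.get)

print(f"Aggregate: {aggregate:.0%}")
for name, score in sorted(per_slice.items(), key=lambda kv: kv[1]):
    print(f"{name}: {score:.0%}")
print(f"Highest priority fix: {worst}")
```

The same `stratified_scores` call works for a difficulty slice by passing a different `key`, assuming your records carry that field. `diagnose` needs hit rate and faithfulness per slice as well, which this synthetic data does not include.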