Stratified Evaluation in RAG — Stop Trusting Your Aggregate Score | DailyDevLists

AI with Chaithra

7 days ago

4:58

AI Evaluation & Monitoring

Rank #3

Description

Your RAG system scored 87%. Looks good. But what is it hiding?

👉 One number across all queries is the most dangerous number in evaluation.

🧠 What is Stratified Evaluation?
Instead of one aggregate score, you break your eval set into groups and score each separately.
Slice by category: different query types fail differently.
Slice by difficulty: easy queries show if it works, hard queries show if it actually works.
The cell with the lowest score in your matrix is your highest-priority fix.

🔍 Example: University student portal RAG system, 4 query categories
Exam schedules → 91% correctness ✅
Fee and payments → 78% correctness ⚠️
Course registration → 64% correctness ❌
Hostel and facilities → 71% correctness ⚠️
Aggregate: 76%. Looks acceptable. But students making real decisions about course registration are getting wrong answers 36% of the time. The aggregate would have let you ship that.

🎯 What a low stratified score is telling you:
Low hit rate in a category → retrieval problem. Fix chunking or embeddings.
High hit rate, low faithfulness → generation problem. Fix your system prompt.
High hit rate, high faithfulness, low correctness → interpretation problem. Fix your prompt engineering or check your golden dataset.

⚠️ Why this matters:
Every query category has a different failure mode.
Every failure mode has a different fix.
Without stratification you are guessing. With it, you are debugging.

💡 Key idea: Slice your eval until the actionable insight is visible. The aggregate score is where debugging ends before it begins.

This is part of a series where I'm building a corpus of what I know, one concept at a time.

🔗 Connect with me: http://www.linkedin.com/in/chaithra-n-a91192225
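
The slicing and diagnosis rules described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation. The four correctness figures come from the description, and the 76% aggregate matches their unweighted mean. Everything else is an assumption made for the sketch: the per-query records, the category keys, the function names `stratified_scores` and `diagnose`, and the 0.75 pass threshold.

```python
from collections import defaultdict

# Per-category correctness from the student portal example in the description.
correctness = {
    "Exam schedules": 0.91,
    "Fee and payments": 0.78,
    "Course registration": 0.64,
    "Hostel and facilities": 0.71,
}
# Unweighted mean across categories (assumes each category counts equally).
aggregate = sum(correctness.values()) / len(correctness)
# The lowest-scoring cell is the highest-priority fix.
worst_category = min(correctness, key=correctness.get)

# Hypothetical per-query results: (category, hit, faithful, correct), 1 = pass.
RECORDS = [
    ("exam_schedules", 1, 1, 1), ("exam_schedules", 1, 1, 1),
    ("exam_schedules", 1, 1, 1), ("exam_schedules", 1, 1, 1),
    ("course_registration", 1, 1, 1), ("course_registration", 0, 0, 0),
    ("course_registration", 0, 0, 0), ("course_registration", 1, 1, 1),
    ("fees", 1, 1, 1), ("fees", 1, 0, 0),
    ("fees", 1, 0, 0), ("fees", 1, 1, 1),
    ("hostel", 1, 1, 1), ("hostel", 1, 1, 0),
    ("hostel", 1, 1, 1), ("hostel", 1, 1, 0),
]
METRICS = ("hit", "faithful", "correct")


def stratified_scores(records):
    """Group records by category and average each metric within the group."""
    groups = defaultdict(list)
    for category, *flags in records:
        groups[category].append(flags)
    return {
        category: {
            metric: sum(row[i] for row in rows) / len(rows)
            for i, metric in enumerate(METRICS)
        }
        for category, rows in groups.items()
    }


def diagnose(scores, threshold=0.75):
    """Apply the description's rules in order; the threshold is an assumption."""
    if scores["hit"] < threshold:
        return "retrieval"        # fix chunking or embeddings
    if scores["faithful"] < threshold:
        return "generation"       # fix the system prompt
    if scores["correct"] < threshold:
        return "interpretation"   # fix prompting or check the golden dataset
    return "ok"


by_category = stratified_scores(RECORDS)
diagnoses = {category: diagnose(s) for category, s in by_category.items()}

print(f"aggregate correctness: {aggregate:.0%}, weakest: {worst_category}")
for category, label in diagnoses.items():
    print(f"{category}: {by_category[category]} -> {label}")
```

The checks run in a fixed order (hit rate, then faithfulness, then correctness) because a retrieval failure usually drags the later metrics down with it, so the earliest failing metric is the one worth fixing first. Slicing by difficulty as well would only require adding a difficulty field to each record and grouping on the (category, difficulty) pair.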
