LLM-as-Judge is everywhere, but most teams use it wrong. This video shows you how to calibrate cheap LLM judges against ground truth so your evals actually mean something. We call this Causal Judge Evaluation (CJE).

What you'll learn:
- Why raw judge scores mislead you (preference inversion)
- The 3 failure modes that break LLM evaluation
- How to calibrate S→Y and monitor for drift
- When to collect more human labels

Timestamps:
0:00 - Cold open
0:14 - The evaluation ladder
0:40 - Preference inversion
0:55 - Three failure classes
1:25 - How calibration works
1:42 - Why it works
2:02 - Monitoring for drift
2:50 - Residual analysis
3:10 - The recipe

Links:
Install: pip install cje-eval
Paper: https://arxiv.org/abs/2512.11150
Site: cimolabs.com

#LLM #AIevaluation #MachineLearning
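
For anyone who wants to try the S→Y calibration idea before watching: a minimal sketch using scikit-learn's IsotonicRegression as a stand-in, with made-up data. It illustrates the concept, not the cje-eval API.

```python
# Illustration only: fit a monotone map from raw judge scores S to
# human labels Y on a small labeled slice, then apply it to new scores.
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Hypothetical labeled slice: raw judge scores and matching human labels.
judge_scores = np.array([0.20, 0.35, 0.50, 0.60, 0.80, 0.90])
human_labels = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

# Isotonic regression learns a monotone S -> Y map, so calibrated scores
# land on the human-label scale while preserving the judge's ranking.
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(judge_scores, human_labels)

# Score fresh, unlabeled outputs on the calibrated scale.
new_scores = np.array([0.30, 0.55, 0.85])
print(calibrator.predict(new_scores))  # calibrated estimates of Y
```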