How to Test GenAI Agents in Production: MLflow Tracing & Evaluation Deep Dive | DailyDevLists

VectorLab

129 days ago

1:08:26

AI Evaluation & Monitoring

Rank #1

Description

Dive into the critical, yet challenging, topic of GenAI Agent Quality with Samraj Moorjani (Engineer at Databricks on the MLflow and Agent Quality teams). Learn why comprehensive testing is non-negotiable for GenAI applications and discover how MLflow provides the foundational tools to build high-quality, observable, and trustworthy agents.

💡 What You Will Learn:
- The Problem: Why testing GenAI is difficult (non-deterministic outputs, latency/cost trade-offs, subjective quality).
- Observability with Tracing: The foundation of quality. See a live demo of using MLflow Tracing (built on OpenTelemetry) to debug complex multi-agent architectures step by step.
- Scaling Feedback with Evals: How to move beyond "vibe checks." Learn to use high-quality datasets (even small ones!) to pressure-test your agent and identify issues like misrouting.

MLflow Quality Features:
- AI Insights: Using an agent to automatically read traces, root-cause issues, and prioritize fixes (Agent of an Agent!).
- Labeling Sessions: Easy collaboration with non-technical Domain Experts (SMEs) to collect high-quality ground truth.
- Synthetic Generation: Bootstrap your evaluation datasets when human feedback is scarce.
- LLM as a Judge & Custom Scores: Systematically improve your agent by defining your specific quality criteria (e.g., routing accuracy, resolution).
- Agent as a Judge & Judge Alignment: Advanced techniques to simplify complex evaluation metrics and keep your LLM judges aligned with human expectations.
- Production Monitoring: Set up automated scoring and use Unity Catalog tables to deliver KPIs and SQL alerts for continuous quality assurance and regression detection.

Key Timestamps:
00:56 Why GenAI quality is hard (non-determinism).
03:50 MLflow Tracing: The foundation of observability.
17:08 AI Insights: Agent-assisted root cause analysis with MLflow MCP server.
23:00 Collaborating with Domain Experts & Ground Truth.
30:50 Defining and using Evaluation Scores (LLM as a Judge).
42:30 Continuous Production Monitoring.
50:00 Agent as a Judge & Judge Alignment deep dive.

Discussed Videos:
- Agent Evaluation: https://youtu.be/2hZW0aFKSnU
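The description mentions custom scores such as routing accuracy and using a small dataset to surface misrouting. A minimal plain-Python sketch of that idea follows. It is not MLflow's API and not the code shown in the video: `EvalCase`, `toy_router`, `routing_accuracy`, and the four example cases are hypothetical names and data made up for illustration, and the keyword router is only a stand-in for a real agent's routing step.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalCase:
    """One row of a small evaluation dataset (hypothetical structure)."""
    question: str
    expected_route: str  # ground-truth label, e.g. collected from a domain expert


def toy_router(question: str) -> str:
    """Hypothetical stand-in for an agent's routing step (keyword-based)."""
    q = question.lower()
    if "refund" in q or "invoice" in q:
        return "billing"
    if "error" in q or "crash" in q:
        return "technical"
    return "general"


def routing_accuracy(
    router: Callable[[str], str], cases: List[EvalCase]
) -> Tuple[float, List[Tuple[EvalCase, str]]]:
    """Return the fraction of correctly routed cases and the misrouted ones."""
    misrouted = []
    for case in cases:
        actual = router(case.question)
        if actual != case.expected_route:
            misrouted.append((case, actual))
    accuracy = (len(cases) - len(misrouted)) / len(cases)
    return accuracy, misrouted


# Illustrative dataset; these examples are invented, not taken from the video.
cases = [
    EvalCase("I need a refund for my last order", "billing"),
    EvalCase("The app shows an error on login", "technical"),
    EvalCase("What are your opening hours?", "general"),
    EvalCase("I was charged twice this month", "billing"),
]

accuracy, misrouted = routing_accuracy(toy_router, cases)
print(f"routing accuracy: {accuracy:.2f}")
for case, actual in misrouted:
    print(f"misrouted: {case.question!r} -> {actual} (expected {case.expected_route})")
```

Even with four cases, the score points at a concrete failure (a billing question with no matching keyword falls through to the general route), which is the kind of signal a "vibe check" tends to miss.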

Video Details

Featured Date

January 22, 2026
