Learn how to build and evaluate a production-style Retrieval-Augmented Generation (RAG) agent with MLflow. This is Part 1 of a two-part series on a complete workflow: register prompts and the agent, capture execution traces with ground-truth expectations, and run evaluations across multiple frameworks from a single MLflow interface.

What this video covers:

Use case: A “school assistant” agent that answers children’s questions about school policies (cell phones, attendance, and more) in a child-friendly tone.

👉 Stack: LangChain, FAISS, Amazon Bedrock, MLflow

Workflow highlights:
• Prompt registration in the MLflow Prompt Registry (versioning + aliases like "production" so prompts can change without redeploying code)
• Agent definition using MLflow’s standardized agent base class (logging, versioning, deployment patterns)
• Trace capture on evaluation questions, including retrieved context and final outputs
• Ground truth expectations from subject matter experts, logged with traces for evaluation reference
• Multi-framework evaluation in one place: Custom MLflow LLM judge, Ragas, Arize Phoenix, and a deterministic retriever scorer

Results: Aggregated and per-trace metrics with judge rationales, plus tracking over time (including moving averages) to monitor iteration.

Coming in Part 2: Aligning a custom judge with human SME feedback using natural language when generic LLM judges are less reliable in domain-specific settings.
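To make the "deterministic retriever scorer" and the moving-average tracking more concrete, here is a minimal plain-Python sketch. It is not the code from the video or the repo, and it does not use any MLflow API: the function names `retrieval_recall` and `moving_average`, the idea of comparing retrieved document IDs against SME-provided expected IDs, and the trailing-window average are all assumptions made for illustration.

```python
from typing import List, Sequence


def retrieval_recall(retrieved_ids: Sequence[str], expected_ids: Sequence[str]) -> float:
    """Hypothetical deterministic scorer: fraction of expected documents that were retrieved."""
    expected = set(expected_ids)
    if not expected:
        # Nothing was expected, so nothing could be missed.
        return 1.0
    return len(expected & set(retrieved_ids)) / len(expected)


def moving_average(values: Sequence[float], window: int) -> List[float]:
    """Trailing moving average; early entries average over the values seen so far."""
    if window < 1:
        raise ValueError("window must be at least 1")
    averages: List[float] = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just fell out of the window.
            running -= values[i - window]
        averages.append(running / min(i + 1, window))
    return averages


# Example: one score per evaluation run, smoothed over the last two runs.
scores = [
    retrieval_recall(["attendance", "phones"], ["attendance", "phones"]),
    retrieval_recall(["lunch"], ["attendance"]),
    retrieval_recall(["phones", "lunch"], ["phones"]),
]
smoothed = moving_average(scores, window=2)
```

Because the scorer involves no LLM call, the same inputs always give the same score, which makes it a useful stable baseline alongside the LLM-judge metrics.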
🎤 Speaker: Joana Mesquita, MLflow Ambassador
🔗 Repo with the code: https://github.com/joanacmesquitaf/rag-agent-mlflow-evaluation
📖 Read the accompanying blog post for a deep-dive tutorial and code breakdowns: https://medium.com/@joana.c.mesquita.f/evaluating-generative-ai-with-mlflow-from-development-to-deployment-validation-85bc2bd5e7a9

Timestamps:
0:00 – Introduction & The Problem of Fragmented Evaluation
2:15 – Introduction to the MLflow GenAI Module
5:30 – Step 1: Setting up the MLflow Environment
8:45 – Step 2: Defining the Agent & Prediction Function
12:10 – Step 3: Structuring the Evaluation Dataset & Ground Truth
15:40 – Step 4: Configuring Scorers (Built-in & Custom Metrics)
18:55 – Step 5: Running mlflow.genai.evaluate() & UI Walkthrough
21:30 – Wrap-up & Preview of Part II