## Meeting Purpose

Review ML model monitoring, observability, and retraining strategies for the GCP ML Engineer exam.

## Key Takeaways

- ML monitoring extends beyond infrastructure metrics (CPU, latency) to track model accuracy. The "ML monitoring gap" is that a model can be healthy but inaccurate if data or relationships shift.
- Three types of drift degrade models: data drift (input features), concept drift (feature-label relationship), and prediction drift (output distribution). Each requires a specific detection and response strategy.
- Vertex AI automates drift detection, while Cloud Monitoring provides unified observability for infrastructure, application, and custom metrics.
- Effective alerting balances sensitivity and specificity. Use composite alerts (multiple conditions) and anomaly detection (comparison against baselines) to reduce alert fatigue.

## Topics

### The ML Monitoring Gap

- Traditional monitoring (CPU, latency, error rate) is insufficient for ML.
- Problem: A model can be healthy but inaccurate if the underlying data or relationships change (e.g., fraudsters adapt tactics).
- Solution: Monitor prediction quality and data integrity to ensure the model remains effective.

### Types of Model Drift

- Data Drift: Input feature distributions change between training and production.
  - Detection: Compare distributions using statistical tests (KL divergence, PSI, chi-square).
  - Response: Retrain with recent data if the drift is persistent.
- Concept Drift: The relationship between features and labels changes.
  - Detection: Monitor performance on labeled data; track prediction distributions; use proxy metrics (e.g., conversion rate) when labels are delayed.
  - Response: Retrain immediately.
- Prediction Drift: Model output distributions change despite stable inputs.
  - Detection: Alert on significant deviations from historical output patterns.
  - Response: Investigate the root cause (pipeline failure, feature corruption) or retrain.

### Vertex AI Model Monitoring

- Automates drift detection for deployed endpoints.
- Features:
  - Training-Serving Skew: Compares production vs. training feature distributions.
  - Prediction Drift: Monitors production output distributions vs. historical patterns.
- Configuration:
  - Monitoring frequency (hourly, daily, custom).
  - Drift thresholds (statistical significance).
  - Training data reference (baseline distributions).
  - Notification channels (email, Slack, PagerDuty).

### Cloud Monitoring & Alerting

- Vertex AI exports metrics to Cloud Monitoring for unified observability.
- Metric Types:
  - Infrastructure: CPU, GPU, memory (OOM risk at >90% CPU or swap usage).
  - Application: Request rate, latency (P50, P95, P99), error rate (4xx client, 5xx server).
- Alerting Strategies:
  - Fixed Threshold: Simple but brittle (e.g., latency > 500 ms).
  - Anomaly Detection: Robust; alerts on deviations from historical baselines (e.g., error rate 3x higher than last week).
  - Rate of Change: Catches issues early; alerts on rapid shifts (e.g., latency +50% in 10 minutes).
  - Composite Alerts: High signal-to-noise; require multiple conditions (e.g., error rate > 1% AND latency > 200 ms for 5 minutes).

### Retraining Triggers

- Systematic retraining maintains model accuracy.
- Strategies:
  - Scheduled: Retrain on a fixed cadence (e.g., monthly).
  - Drift-Based: Retrain when drift exceeds thresholds.
    - Implementation: Cloud Monitoring alert → Cloud Function → Vertex AI Pipelines.
  - Performance-Based: Retrain when performance degrades on ground-truth labels (e.g., precision drops from 92% to 88%).
  - Hybrid: Combines strategies for an optimal balance of freshness and cost.

### Model Performance Metrics

- Track standard metrics when ground-truth labels are available.
- Classification:
  - Precision: Fraction of positive predictions that are correct (minimizes false positives).
  - Recall: Fraction of actual positives identified (minimizes false negatives).
  - F1 Score: Harmonic mean of precision and recall.
  - AUC-ROC: Measures the classifier's ability to distinguish classes across all thresholds.
- Regression:
  - MAE (Mean Absolute Error): Interpretable in target units.
  - RMSE (Root Mean Squared Error): Penalizes larger errors more; sensitive to outliers.
  - R-squared: Proportion of variance explained (1.0 = perfect fit).

### Logging & Debugging

- Cloud Logging captures prediction requests, responses, and errors.
- Challenge: Logging every request is costly.
- Solution: Use sampling to control costs while capturing key events.
  - Uniform: Logs a random percentage of all requests (e.g., 1%).
  - Stratified: Logs 100% of low-confidence predictions and 1% of high-confidence ones.
  - Error-Focused: Logs all errors and 1% of successes (best for debugging).
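The three sampling strategies above can be sketched as a single decision function. This is a minimal illustration, not a Cloud Logging API: the function name, the `confidence` field, and the thresholds are assumptions chosen for the example.

```python
import random


def should_log(prediction, is_error, strategy="error_focused",
               confidence_threshold=0.8, base_rate=0.01):
    """Decide whether to log one prediction request under a sampling strategy.

    prediction: dict with a "confidence" score in [0, 1] (illustrative schema).
    is_error: whether the request failed.
    base_rate: fraction of "uninteresting" requests to log anyway (e.g., 1%).
    """
    if strategy == "uniform":
        # Log a fixed random fraction of all requests.
        return random.random() < base_rate
    if strategy == "stratified":
        # Always log low-confidence predictions; sample the confident ones.
        if prediction["confidence"] < confidence_threshold:
            return True
        return random.random() < base_rate
    if strategy == "error_focused":
        # Always log errors; sample successes for a healthy-traffic baseline.
        if is_error:
            return True
        return random.random() < base_rate
    raise ValueError(f"unknown strategy: {strategy}")
```

In a serving wrapper, the prediction handler would call `should_log(...)` before writing to Cloud Logging, so the cost of logging scales with the interesting events (errors, uncertain predictions) rather than with raw traffic.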