Skip to content

ML Model Assets

Machine learning models and AI systems

ML Model assets represent machine learning models, training datasets, features, and experiments. OpenMetadata models ML assets with a two-level hierarchy for MLOps platforms.


Hierarchy Overview

graph TD
    A[MLModelService<br/>MLflow, SageMaker, Vertex AI] --> B1[MLModel:<br/>customer_churn_predictor]
    A --> B2[MLModel:<br/>fraud_detection]
    A --> B3[MLModel:<br/>product_recommendation]

    B1 --> C1[Features]
    B1 --> C2[Hyperparameters]
    B1 --> C3[Metrics]

    C1 --> D1[Feature: recency<br/>Source: customers.last_purchase_date]
    C1 --> D2[Feature: frequency<br/>Source: orders.count_30d]
    C1 --> D3[Feature: monetary<br/>Source: orders.total_amount_30d]

    C2 --> E1[learning_rate: 0.01]
    C2 --> E2[max_depth: 6]
    C2 --> E3[n_estimators: 100]

    C3 --> F1[Accuracy: 92%]
    C3 --> F2[Precision: 0.85]
    C3 --> F3[Recall: 0.88]
    C3 --> F4[AUC: 0.93]

    B1 -.->|trained on| G1[Snowflake<br/>customer_features]
    B1 -.->|predictions to| G2[Snowflake<br/>churn_scores]
    B2 -.->|monitors| G3[Dashboard]

    style A fill:#667eea,color:#fff
    style B1 fill:#4facfe,color:#fff
    style B2 fill:#4facfe,color:#fff
    style B3 fill:#4facfe,color:#fff
    style C1 fill:#00f2fe,color:#fff
    style C2 fill:#00f2fe,color:#fff
    style C3 fill:#00f2fe,color:#fff
    style D1 fill:#43e97b,color:#fff
    style D2 fill:#43e97b,color:#fff
    style D3 fill:#43e97b,color:#fff
    style E1 fill:#fa709a,color:#fff
    style E2 fill:#fa709a,color:#fff
    style E3 fill:#fa709a,color:#fff
    style F1 fill:#f093fb,color:#fff
    style F2 fill:#f093fb,color:#fff
    style F3 fill:#f093fb,color:#fff
    style F4 fill:#f093fb,color:#fff
    style G1 fill:#764ba2,color:#fff
    style G2 fill:#764ba2,color:#fff
    style G3 fill:#764ba2,color:#fff

Why This Hierarchy?

ML Model Service

Purpose: Represents the ML platform or model registry

An ML Model Service is the platform that tracks and serves ML models. It contains configuration for connecting to the MLOps tool and discovering models.

Examples:

  • mlflow-prod - Production MLflow instance
  • sagemaker-models - AWS SageMaker model registry
  • vertex-ai - Google Cloud Vertex AI
  • databricks-ml - Databricks ML workspace

Why needed: Organizations use different ML platforms for different teams and use cases (MLflow for experimentation, SageMaker for production, Vertex AI for Google Cloud). The service level organizes models by platform.

Supported Platforms: MLflow, AWS SageMaker, Azure ML, Google Vertex AI, Databricks ML, Kubeflow, Weights & Biases, Neptune, H2O

View ML Model Service Specification →


ML Model

Purpose: Represents a trained machine learning model

An ML Model is a trained algorithm that makes predictions. It has features, training data, performance metrics, versions, and deployment information.

Examples:

  • customer_churn_predictor - Predicts customer churn risk
  • product_recommendation - Recommends products to users
  • fraud_detection - Identifies fraudulent transactions
  • demand_forecasting - Forecasts product demand

Key Metadata:

  • Algorithm: Model type (RandomForest, XGBoost, Neural Network, etc.)
  • Features: Input variables used for predictions
  • Training Data: Tables/datasets used for training
  • Performance Metrics: Accuracy, precision, recall, AUC, etc.
  • Versions: Model iterations with performance tracking
  • Deployment: Where and how the model is served
  • Lineage: Training data → Model → Predictions table
  • Owner: Data science team or individual

Why needed: ML models are critical data assets. Tracking them enables: - Understanding which data trains which models - Impact analysis (what breaks if training data changes?) - Model governance (bias detection, compliance) - Performance monitoring and model drift detection - Reproducibility and experiment tracking

View ML Model Specification →


ML Model Lifecycle

graph LR
    A[Training Data] -->|Train| B[Experiment]
    B -->|Best Model| C[Registered Model]
    C -->|Deploy| D[Production Model]
    D -->|Predictions| E[Predictions Table]

    A -.->|Version 1.0| A1[customers_features v1]
    C -.->|Version 2.0| C1[Model v2 - 92% accuracy]
    C -.->|Version 1.0| C2[Model v1 - 88% accuracy]

    style A fill:#0061f2,color:#fff
    style B fill:#6900c7,color:#fff
    style C fill:#4facfe,color:#fff
    style D fill:#00ac69,color:#fff
    style E fill:#f5576c,color:#fff

Stages: 1. Experimentation: Train models on different datasets and hyperparameters 2. Registration: Register best-performing models in model registry 3. Deployment: Deploy models to production (API, batch scoring, edge) 4. Monitoring: Track predictions and model performance


Common Patterns

Pattern 1: MLflow Churn Prediction

MLflow Service → customer_churn_predictor Model → Algorithm: XGBoost
                                                 → Features: [recency, frequency, monetary]
                                                 → Training Data: customer_features table
                                                 → Metrics: AUC 0.85, Precision 0.78

Classification model predicting customer churn.

Pattern 2: SageMaker Recommendation Engine

SageMaker Service → product_recommendation Model → Algorithm: Collaborative Filtering
                                                  → Features: [user_id, product_views, purchases]
                                                  → Training Data: user_product_interactions
                                                  → Deployment: Real-time endpoint

Recommendation model served via API.

Pattern 3: Vertex AI Demand Forecasting

Vertex AI Service → demand_forecasting Model → Algorithm: LSTM Neural Network
                                              → Features: [historical_sales, seasonality, promotions]
                                              → Training Data: sales_history table
                                              → Predictions: future_demand table

Time series forecasting model for inventory planning.


Real-World Example

Here's how a data science team builds a fraud detection model:

graph LR
    A[Snowflake<br/>transactions] --> P1[Feature Pipeline]
    B[Snowflake<br/>customers] --> P1

    P1 --> C[fraud_features<br/>Table]
    C -->|Train| D[MLflow<br/>fraud_detection Model]

    D -->|Deploy| E[SageMaker<br/>Production Endpoint]
    E -->|Predictions| F[fraud_scores<br/>Table]

    F -->|Alert| G[Fraud Alert System]

    D -.->|Algorithm| H[Random Forest]
    D -.->|Metrics| I[AUC: 0.93, Precision: 0.85]
    D -.->|Owner| J[Data Science Team]

    style A fill:#0061f2,color:#fff
    style B fill:#0061f2,color:#fff
    style C fill:#00ac69,color:#fff
    style D fill:#4facfe,color:#fff
    style E fill:#f5576c,color:#fff
    style F fill:#6900c7,color:#fff

Flow: 1. Feature Engineering: Pipeline creates features from transactions and customer data 2. Training: Model trained on fraud_features table 3. Registration: Model registered in MLflow with metrics 4. Deployment: Model deployed to SageMaker endpoint 5. Scoring: Real-time predictions written to fraud_scores table 6. Action: Alerts trigger for high fraud scores

Benefits:

  • Lineage: Trace predictions back to training data
  • Impact Analysis: Know which models break if transactions table changes
  • Governance: Track model performance and bias metrics
  • Reproducibility: Know exact data and code used for training

ML Model Lineage

ML models create complex lineage across the data platform:

graph TD
    A[Raw Events] --> B[Feature Pipeline]
    B --> C[Features Table]
    C --> D[ML Model Training]

    D --> E[Registered Model v1.0]
    D --> F[Registered Model v2.0]

    F --> G[Production Model]
    G --> H[Predictions API]
    H --> I[Predictions Table]

    I --> J[Dashboard]
    I --> K[Alerting System]

    style D fill:#6900c7,color:#fff
    style E fill:#4facfe,color:#fff
    style F fill:#4facfe,color:#fff
    style G fill:#00ac69,color:#fff

Data → Features → Model → Predictions → Decisions


Model Features

Features are the input variables for ML models. OpenMetadata tracks features and their sources:

{
  "modelName": "customer_churn_predictor",
  "features": [
    {
      "name": "recency",
      "dataType": "integer",
      "source": "customers.last_purchase_date",
      "description": "Days since last purchase"
    },
    {
      "name": "frequency",
      "dataType": "integer",
      "source": "orders.count_30d",
      "description": "Number of orders in last 30 days"
    },
    {
      "name": "monetary",
      "dataType": "float",
      "source": "orders.total_amount_30d",
      "description": "Total spend in last 30 days"
    }
  ]
}

Feature Lineage: Track which table columns become which model features.


Model Versions

ML models evolve over time. OpenMetadata tracks versions:

Version Algorithm Training Data Accuracy Deployed Date
v1.0 Logistic Regression customers_2024_01 82% No 2024-01-15
v2.0 Random Forest customers_2024_03 88% No 2024-03-10
v3.0 XGBoost customers_2024_06 92% Yes 2024-06-20

Version Metadata: Each version has its own metrics, training data, and deployment status.


Model Governance

Track important governance metadata:

  • Fairness Metrics: Bias detection across demographic groups
  • Explainability: SHAP values, feature importance
  • Compliance: GDPR, model cards, audit logs
  • Data Lineage: Ensure training data quality and provenance
  • Performance Monitoring: Drift detection, accuracy over time

Entity Specifications

Entity Description Specification
ML Model Service MLOps platform View Spec
ML Model Trained model View Spec

Each specification includes: - Complete field reference - JSON Schema definition - RDF/OWL ontology representation - JSON-LD context and examples - Integration with ML platforms


Supported ML Platforms

OpenMetadata supports metadata extraction from:

  • MLflow - Open-source ML lifecycle platform
  • AWS SageMaker - Fully managed ML service
  • Azure Machine Learning - Enterprise ML platform
  • Google Vertex AI - Unified ML platform
  • Databricks ML - ML on lakehouse platform
  • Kubeflow - ML toolkit for Kubernetes
  • Weights & Biases - ML experiment tracking
  • Neptune.ai - ML metadata store
  • H2O.ai - AutoML platform
  • Hugging Face - Model hub for transformers

Next Steps

  1. Explore specifications - Click through ML Model entities above
  2. See ML lineage - Check out feature lineage to predictions
  3. ML governance - Learn about model fairness and compliance