Getting Started with MLflow

April 28, 2026

Part 6 of 6

In the previous posts, we covered the ML lifecycle and data versioning. We keep mentioning "track your experiments" and "register your models" - MLflow is the tool that makes both of those practical.

MLflow is an open-source platform for managing the ML lifecycle. It does three things well: tracking experiments, packaging models, and managing a model registry. You can start using it in 10 minutes, and it scales from a single notebook to a production team.

Why You Need Experiment Tracking

Without tracking, ML experimentation looks like this:

model_v1.pkl
model_v2.pkl
model_v2_final.pkl
model_v2_final_ACTUAL.pkl
model_v3_better.pkl
model_v3_better_lr001.pkl

You trained 20 models. You don't remember which hyperparameters produced the best result. The model file names are meaningless. Your notebook has 15 cells that you ran in random order. And the model that worked best was trained three days ago, but you overwrote it.

This is not a workflow. This is chaos with extra steps.

Experiment tracking solves this by automatically logging every training run with its parameters, metrics, and artifacts.

What MLflow Tracks

For each training run, MLflow records:

  • Parameters - Hyperparameters you chose (learning rate, batch size, epochs, model type)
  • Metrics - Results the model produced (accuracy, loss, F1, RMSE)
  • Artifacts - Files the run generated (model weights, plots, preprocessors)
  • Tags - Custom metadata (dataset name, experiment owner, environment)
  • Source - The code that ran (git commit, script path)

All of this is queryable, comparable, and reproducible.
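"Queryable" is literal: mlflow.search_runs() returns runs as a pandas DataFrame you can filter and sort in code. A minimal sketch, assuming an experiment like the house-price one used later in this post:

import mlflow

# Fetch every run in the experiment as a DataFrame, best RMSE first
runs = mlflow.search_runs(
    experiment_names=["house-price-prediction"],
    order_by=["metrics.rmse ASC"],
)
print(runs[["run_id", "params.n_estimators", "metrics.rmse"]].head())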

MLflow Components

MLflow has four main components. You don't need all of them to start.

1. MLflow Tracking

The core component. Logs experiments and lets you compare runs.

import mlflow
import mlflow.sklearn

mlflow.set_experiment("house-price-prediction")

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("model_type", "random_forest")
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    mlflow.log_param("learning_rate", 0.01)

    # Train your model
    model = train_model(X_train, y_train)

    # Evaluate, then log metrics
    # (evaluate is a placeholder assumed to return all three values)
    accuracy, rmse, mae = evaluate(model, X_test, y_test)
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("rmse", rmse)
    mlflow.log_metric("mae", mae)

    # Log the model as an artifact
    mlflow.sklearn.log_model(model, "model")

That's it. Every time this script runs, MLflow creates a new run with all the details logged. No spreadsheets. No naming files manually.

2. MLflow UI

MLflow comes with a built-in web UI. Start it with:

mlflow ui

This serves a dashboard at http://localhost:5000 where you can:

  • See all experiments and runs
  • Compare runs side by side
  • Sort by any metric (find the run with highest accuracy)
  • View parameter combinations
  • Download artifacts

The UI is where experiment tracking becomes useful. Instead of scrolling through terminal output, you see a table of every run with sortable columns.
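The same command takes flags if the defaults don't fit; for example (the port number and database path here are just placeholders):

mlflow ui --port 8080 --backend-store-uri sqlite:///mlflow.db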

3. MLflow Models

MLflow packages models in a standard format that works across frameworks.

# Log a scikit-learn model
mlflow.sklearn.log_model(model, "model")

# Log a PyTorch model
mlflow.pytorch.log_model(model, "model")

# Log a TensorFlow model
mlflow.tensorflow.log_model(model, "model")

The logged model includes:

  • The model weights
  • The conda/pip environment needed to run it
  • A standard interface for loading and predicting

Loading a model later is one line:

model = mlflow.sklearn.load_model("runs:/abc123/model")
predictions = model.predict(new_data)

This standardization means you don't need to write custom save/load code for each framework.
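That standard interface is the pyfunc flavor: any logged model, whatever the training framework, can be loaded as a generic Python function with a uniform predict() method:

import mlflow.pyfunc

# Framework-agnostic load of the same model
model = mlflow.pyfunc.load_model("runs:/abc123/model")
predictions = model.predict(new_data)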

4. Model Registry

The model registry is where models go when they're ready for production. It adds:

  • Versioning - Each registered model can have multiple versions
  • Stage management - Mark models as "Staging", "Production", or "Archived"
  • Lineage - Every registry version links back to the exact run that produced it

# Register a model from a run
mlflow.register_model(
    "runs:/abc123/model",
    "house-price-model"
)

The workflow looks like this:

Experiment tracking (many runs)
     ↓
Best run identified
     ↓
Register model (version 1)
     ↓
Move to "Staging"
     ↓
Test in staging environment
     ↓
Move to "Production"
     ↓
Old version moves to "Archived"

This gives you a clear promotion path instead of "someone copied a model file to the production server."
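The promotions themselves are API calls. A sketch using MlflowClient (recent MLflow releases are moving from stages toward model version aliases, but the stage API below still works and matches the flow above):

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Promote version 1 of the registered model to Staging
client.transition_model_version_stage(
    name="house-price-model",
    version=1,
    stage="Staging",
)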

A Complete Example

Here's a realistic workflow from start to finish:

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# Set the experiment
mlflow.set_experiment("house-price-prediction")

# Load and split data
X_train, X_test, y_train, y_test = load_and_split_data()

# Define hyperparameter combinations to try
configs = [
    {"n_estimators": 50, "max_depth": 5},
    {"n_estimators": 100, "max_depth": 10},
    {"n_estimators": 200, "max_depth": 15},
    {"n_estimators": 100, "max_depth": 20},
]

for config in configs:
    with mlflow.start_run():
        # Log parameters
        mlflow.log_params(config)

        # Train
        model = RandomForestRegressor(**config)
        model.fit(X_train, y_train)

        # Evaluate
        predictions = model.predict(X_test)
        mae = mean_absolute_error(y_test, predictions)
        rmse = np.sqrt(mean_squared_error(y_test, predictions))

        # Log metrics
        mlflow.log_metric("mae", mae)
        mlflow.log_metric("rmse", rmse)

        # Log model
        mlflow.sklearn.log_model(model, "model")

        print(f"Config: {config} | MAE: {mae:.2f} | RMSE: {rmse:.2f}")

After running this, open the MLflow UI. You'll see four runs. Sort by MAE. The best one is your candidate for production.
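That last step can be code instead of clicking. A sketch that queries the experiment for the lowest MAE and registers that run's model:

import mlflow

# Grab the single best run by MAE and register its model
best = mlflow.search_runs(
    experiment_names=["house-price-prediction"],
    order_by=["metrics.mae ASC"],
    max_results=1,
)
best_run_id = best.loc[0, "run_id"]
mlflow.register_model(f"runs:/{best_run_id}/model", "house-price-model")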

Autologging

For supported frameworks, MLflow can log everything automatically:

import mlflow

mlflow.autolog()

# Just train your model normally
model = RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)

With autolog(), MLflow automatically captures:

  • All hyperparameters passed to the model
  • Training metrics
  • The model artifact
  • Feature importance plots (for tree-based models)
  • The training dataset signature

No manual log_param or log_metric calls needed. This is the fastest way to get started.

Supported frameworks include scikit-learn, TensorFlow, Keras, PyTorch, XGBoost, LightGBM, and more.
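Autologging can also be scoped to a single framework, which avoids noise from other libraries you happen to import:

import mlflow.sklearn

# Autolog scikit-learn fits only; other frameworks stay silent
mlflow.sklearn.autolog()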

MLflow in a Team

Tracking server

For a team, you run MLflow as a central server instead of locally:

mlflow server \
    --host 0.0.0.0 \
    --port 5000 \
    --backend-store-uri postgresql://user:pass@db:5432/mlflow \
    --default-artifact-root s3://mlflow-artifacts/

This stores experiment metadata in PostgreSQL and artifacts in S3. Everyone on the team points their code to this server:

mlflow.set_tracking_uri("http://mlflow-server:5000")

Now all experiments are in one place. No more "which model did you train yesterday?" messages.
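The URI doesn't have to live in code: MLflow also reads the MLFLOW_TRACKING_URI environment variable, which keeps the server address out of scripts (train.py below is a stand-in for your own entry point):

export MLFLOW_TRACKING_URI=http://mlflow-server:5000
python train.py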

Organizing experiments

Use experiments to group related work:

  • house-price-prediction - All runs for the house price model
  • churn-prediction - All runs for customer churn
  • fraud-detection-v2 - Second generation of the fraud model

Use tags for finer organization:

mlflow.set_tag("engineer", "ahmad")
mlflow.set_tag("dataset_version", "v2.3")
mlflow.set_tag("purpose", "hyperparameter_tuning")

MLflow + DVC Together

In the data versioning post, we covered DVC. MLflow and DVC complement each other:

  • DVC versions the data
  • MLflow tracks the experiments and models

Tag each MLflow run with the DVC data version to tie the two together:

with mlflow.start_run():
    # Log the DVC data version
    mlflow.set_tag("data_version", "abc123")  # DVC hash

    # Log parameters
    mlflow.log_params(config)

    # Train, evaluate, and log the model
    model = train(X_train, y_train)
    accuracy = evaluate(model, X_test, y_test)
    mlflow.sklearn.log_model(model, "model")
    mlflow.log_metric("accuracy", accuracy)

Now every MLflow run links to an exact data version. You can trace any model back to the exact code, data, and parameters that produced it.

Code version:  git commit 7a3f2b1
Data version:  dvc hash abc123
Parameters:    lr=0.001, epochs=50, batch=32
Metrics:       accuracy=94.2%, F1=0.91
Model:         mlflow run xyz789

Full reproducibility. No guessing.
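You can capture those identifiers automatically instead of pasting them by hand. A minimal sketch that shells out to git at run time (assumes the training script runs inside the repo; the tag names are just conventions):

import subprocess
import mlflow

def current_git_commit() -> str:
    # Ask git which commit the running code came from
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()

with mlflow.start_run():
    mlflow.set_tag("git_commit", current_git_commit())
    # The DVC hash can be tagged the same way, read from the .dvc file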

Common Mistakes

  • Not setting an experiment name - Everything goes into the "Default" experiment. After 200 runs, you can't find anything. Always use set_experiment().
  • Logging too few parameters - You logged learning rate but not batch size. When you need to reproduce the best run, you can't. Log everything that could affect the result.
  • Logging metrics only at the end - For long training runs, log metrics at each epoch so you can see training curves and catch overfitting early (see the sketch after this list).
  • Skipping the model registry - Experiment tracking alone doesn't tell you which model is in production. The registry creates a clear promotion path.
  • Running MLflow locally in a team - Each person has their own local experiments. Nobody can see each other's results. Set up a shared tracking server early.
  • Not tagging runs - After 500 runs, parameters and metrics aren't enough to find what you need. Tags like purpose, dataset_version, and engineer make runs searchable.
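
On the per-epoch point: log_metric takes a step argument, so a training loop produces curves instead of single values. A minimal sketch, where train_one_epoch and validate are placeholder helpers:

import mlflow

with mlflow.start_run():
    for epoch in range(50):
        train_loss = train_one_epoch(model, train_loader)  # placeholder helper
        val_loss = validate(model, val_loader)             # placeholder helper

        # step= turns each metric into a curve in the MLflow UI
        mlflow.log_metric("train_loss", train_loss, step=epoch)
        mlflow.log_metric("val_loss", val_loss, step=epoch)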

Key Takeaways

  • MLflow solves the "which model was that" problem - Every run is logged with parameters, metrics, and artifacts
  • Start with autologging - One line of code gives you full experiment tracking
  • Use the UI to compare runs - Sort, filter, and compare side by side instead of scrolling through terminal output
  • The model registry adds production discipline - Staging → Production → Archived is a clear promotion path
  • Pair with DVC for full reproducibility - MLflow tracks experiments, DVC tracks data, Git tracks code
  • Set up a shared server for teams - Local tracking doesn't scale past one person

MLflow turns "I think that model from last Tuesday was better" into "run #42, accuracy 94.2%, here are the exact parameters and data."