Algorithm-agnostic model building with Mlflow

A beginner-friendly step-by-step guide to creating generic ML pipelines using mlflow.pyfunc

A common challenge in MLOps is the difficulty of migrating between different algorithms or frameworks. This beginner-friendly article will help you tackle the challenge by leveraging algorithm-agnostic model building using mlflow.pyfunc.

Why algorithm-agnostic model building?

Consider this scenario: we have a sklearn model currently in production for a particular use case. Later, we discover that a deep learning model performs even better. If the sklearn model was deployed in its native format, the transition to the deep learning model might be a hassle 🤪 because the two model artifacts are very different.

Algorithm-agnostic model building with MLflow. Image generated by asking Gemini

To address this challenge, the mlflow.pyfunc model flavor provides a versatile and generic approach to building and deploying machine learning models in Python. 😎

1. Generic model building: The pyfunc model flavor provides a uniform way to build models, regardless of the framework or library used to build them.

2. Encapsulation of the ML pipeline: With pyfunc we can encapsulate the model with the associated pre- and post-processing steps or other custom logic desired while using the model.

3. Uniform model representation: We can implement a model, a machine learning pipeline, or an arbitrary Python function using pyfunc without worrying about the underlying format of the model. Such a unified representation simplifies the deployment, re-deployment, and downstream scoring of the model.

Sounds interesting? If so, this article is here to help you get started with mlflow.pyfunc. 🥂

  • First, let’s look at a simple example of creating the mlflow.pyfunc class.
  • Next, we define a mlflow.pyfunc class that encapsulates a machine learning pipeline (an estimator plus some preprocessing logic as an example). We also train, log, and load this ML pipeline for inference.
  • Finally, let’s dig deeper into the encapsulated mlflow.pyfunc object, explore the extensive metadata and artifacts that mlflow automatically maintains for us, and get a better sense of the full power of mlflow.pyfunc.

🔗 All code and configuration are available on GitHub. 🧰

{pyfunc} Simple toy model

Let’s first create a simple mlflow.pyfunc toy model and then use it with the mlflow workflow.

  • Step 1: Make the model
  • Step 2: Log the model
  • Step 3: Load the logged model to perform inference
import mlflow.pyfunc

# Step 1: Create a mlflow.pyfunc model
class ToyModel(mlflow.pyfunc.PythonModel):
    """
    ToyModel is a simple example implementation of an MLflow Python model.
    """

    def predict(self, context, model_input):
        """
        A basic predict function that takes a model_input list and returns a new list
        where each element is increased by one.

        Parameters:
        - context (Any): An optional context parameter provided by MLflow.
        - model_input (list of int or float): A list of numerical values that the model will use for prediction.

        Returns:
        - list of int or float: A list with each element of model_input increased by one.
        """
        return [x + 1 for x in model_input]

As you can see in the example above, you can create a mlflow.pyfunc model to implement any custom Python function you see fit for your ML solution. It doesn’t have to be an off-the-shelf machine learning algorithm.

You can then log this model and load it later to perform the inference.

# Step 2: log this model as an mlflow run
with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=ToyModel()
    )
    run_id = mlflow.active_run().info.run_id

# Step 3: load the logged model to perform inference
model = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")
# dummy new data
x_new = [1, 2, 3]
# model inference for the new data
print(model.predict(x_new))
[2, 3, 4]

{pyfunc} Encapsulated XGBoost Pipeline

Now let’s create an ML pipeline that includes an estimator with additional, custom logic.

In the example below, the XGB_PIPELINE class is a wrapper that integrates the estimator with preprocessing steps, which may be desirable for some MLOps implementations. This wrapper uses mlflow.pyfunc, is estimator agnostic, and provides a unified model representation. More specifically,

  • fit(): Instead of exposing XGBoost’s native training API (xgboost.train()) directly, this class provides a .fit() method that conforms to sklearn conventions, allowing easy integration into sklearn pipelines and ensuring a consistent interface across different estimators.
  • DMatrix(): DMatrix is a core data structure in XGBoost that optimizes data for training and prediction. The step that transforms a pandas DataFrame into a DMatrix is wrapped inside the class, allowing seamless integration with pandas DataFrames, just like any other sklearn estimator.
  • predict(): This is the universal inference API of the mlflow.pyfunc model. It is consistent across this ML pipeline, across the toy model above, across any machine learning algorithms or custom logic we wrap in a mlflow.pyfunc model.
import json
import xgboost as xgb
import mlflow.pyfunc
from typing import Any, Dict, Union
import pandas as pd


class XGB_PIPELINE(mlflow.pyfunc.PythonModel):
    """
    XGB_PIPELINE is an example implementation of an MLflow Python model with XGBoost.
    """

    def __init__(self, params: Dict[str, Union[str, int, float]]):
        """
        Initialize the model with given parameters.

        Parameters:
        - params (Dict[str, Union[str, int, float]]): Parameters for the XGBoost model.
        """
        self.params = params
        self.xgb_model = None
        self.config = None

    def preprocess_input(self, model_input: pd.DataFrame) -> pd.DataFrame:
        """
        Preprocess the input data.

        Parameters:
        - model_input (pd.DataFrame): The input data to preprocess.

        Returns:
        - pd.DataFrame: The preprocessed input data.
        """
        processed_input = model_input.copy()
        # put any desired preprocessing logic here
        processed_input.drop(processed_input.columns[0], axis=1, inplace=True)

        return processed_input

    def fit(self, X_train: pd.DataFrame, y_train: pd.Series):
        """
        Train the XGBoost model.

        Parameters:
        - X_train (pd.DataFrame): The training input data.
        - y_train (pd.Series): The target values.
        """
        processed_model_input = self.preprocess_input(X_train.copy())
        dtrain = xgb.DMatrix(processed_model_input, label=y_train)
        self.xgb_model = xgb.train(self.params, dtrain)

    def predict(self, context: Any, model_input: pd.DataFrame) -> Any:
        """
        Predict using the trained XGBoost model.

        Parameters:
        - context (Any): An optional context parameter provided by MLflow.
        - model_input (pd.DataFrame): The input data for making predictions.

        Returns:
        - Any: The prediction results.
        """
        processed_model_input = self.preprocess_input(model_input.copy())
        dmatrix = xgb.DMatrix(processed_model_input)
        return self.xgb_model.predict(dmatrix)

Now let’s train and log this model.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import pandas as pd

# Generate synthetic datasets for demo
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# train and log the model
with mlflow.start_run(run_name="xgb_demo") as run:

    # Create an instance of XGB_PIPELINE
    params = {
        'objective': 'reg:squarederror',
        'max_depth': 3,
        'learning_rate': 0.1,
    }
    model = XGB_PIPELINE(params)

    # Fit the model
    model.fit(X_train=pd.DataFrame(X_train), y_train=y_train)

    # Log the model
    model_info = mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=model,
    )

    run_id = mlflow.active_run().info.run_id

The model has been successfully logged. ✌️ Now we will load it for inference.

loaded_model = mlflow.pyfunc.load_model(model_uri=model_info.model_uri)
loaded_model.predict(pd.DataFrame(X_test))
array([ 4.11692047e+00,  7.30551958e+00, -2.36042137e+01, -1.31888123e+02,
...
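As a quick sanity check, we can compare the loaded pipeline’s predictions against the held-out targets. This evaluation step is not part of the original walkthrough, just a minimal sketch that reuses the test split created above:

import numpy as np
from sklearn.metrics import mean_squared_error

# predictions from the loaded pyfunc wrapper on the held-out features
preds = loaded_model.predict(pd.DataFrame(X_test))

# root mean squared error against the true targets
rmse = np.sqrt(mean_squared_error(y_test, preds))
print(f"RMSE on the test split: {rmse:.3f}")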

Dive deeper into the mlflow.pyfunc object

The above process is pretty smooth, right? This represents the basic functionality of the mlflow.pyfunc object. Now let’s dive deeper into the full power that mlflow.pyfunc has to offer.

1. model_info

In the above example, the model_info object returned by mlflow.pyfunc.log_model() is an instance of the mlflow.models.model.ModelInfo class. It contains metadata and information about the logged model. For example

Some features of the model_info object (screenshot)

Feel free to run dir(model_info) to explore further or view the source code for all defined attributes. The attribute I use most is model_uri, which indicates where the logged model can be found within the mlflow tracking system.
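For example, here is a small sketch that prints a few of these attributes for the model logged above:

# a few commonly used attributes of the ModelInfo object
print(model_info.model_uri)   # e.g. "runs:/<run_id>/model", used to load the model back
print(model_info.run_id)      # the run under which the model was logged
print(model_info.flavors)     # the flavors the model can be loaded with (e.g. python_function)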

2. loaded_model

It’s worth clarifying that loaded_model is not an instance of the XGB_PIPELINE class, but rather a wrapper object provided by mlflow.pyfunc for algorithm-agnostic inference. As shown below, attempting to retrieve attributes of the XGB_PIPELINE class from loaded_model will return an error.

print(loaded_model.params)
AttributeError: 'PyFuncModel' object has no attribute 'params'
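What the wrapper does expose is the unified predict() API plus the logged metadata; a quick sketch:

# the wrapper is a generic PyFuncModel, not our XGB_PIPELINE class
print(type(loaded_model))            # <class 'mlflow.pyfunc.PyFuncModel'>
print(loaded_model.metadata.run_id)  # metadata of the logged model is still available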

3. unwrapped_model

Okay, you might be wondering, where is the trained instance of XGB_PIPELINE then? Is it also captured and retrieved via mlflow?

Don’t worry, it is stored safely for easy unwrapping, as shown below.

unwrapped_model = loaded_model.unwrap_python_model()
print(unwrapped_model.params)
{'objective': 'reg:squarederror', 'max_depth': 3, 'learning_rate': 0.1}

That’s how it’s done. 😎 With the unwrapped model, you have access to all the properties and methods of your custom ML pipeline, just like this one! Sometimes I add methods like explain_model or post_processing to the custom pipeline, or features that trace the model training process and provide diagnostics 🤩… I’ll stop here and save those for the next articles. Suffice it to say, feel free to customize your ML pipeline to your use case, knowing that

  1. You will have access to all these custom methods and attributes for downstream use and
  2. This custom model is wrapped in the unified mlflow.pyfunc inference API and can therefore be smoothly migrated to other estimators if needed.
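For instance, because XGB_PIPELINE stores its trained booster on self.xgb_model, nothing stops us from reaching the native XGBoost API through the unwrapped model. A small sketch (not part of the original pipeline):

# the unwrapped object is our XGB_PIPELINE instance, so custom attributes are available
booster = unwrapped_model.xgb_model                # the native xgboost.Booster trained in fit()
print(booster.get_score(importance_type="gain"))   # e.g. feature importance by gain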

4. Context

You may have noticed that there is a context parameter for the predict methods in both mlflow.pyfunc classes defined above. But interestingly, this parameter is not required when we are making predictions with the loaded model. Why❓

loaded_model = mlflow.pyfunc.load_model(model_uri)
# the context parameter is not needed when calling `predict`
loaded_model.predict(model_input)

This is because the loaded_model above is a wrapper object provided by mlflow. If we use the unwrapped model instead, we need to explicitly define the context as shown below; otherwise, the code will return an error.

unwrapped_model = loaded_model.unwrap_python_model()
# need to provide the context manually
unwrapped_model.predict(context=None, model_input=model_input)

So, what is this context? And what role does it play in the predict method?

The context is a PythonModelContext object that contains artifacts the pyfunc model can use when performing inference. It is created implicitly and automatically by the log_model() method.
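In the pipelines above we never pass anything through the context, which is why context=None works fine. But if you log supporting files via the artifacts argument of log_model(), MLflow makes them available through context.artifacts at load and inference time. Here is a hypothetical sketch; the model class and the local file name are made up for illustration:

class ModelWithArtifact(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # context.artifacts maps artifact names to local paths of the logged files
        with open(context.artifacts["feature_names"]) as f:
            self.feature_names = f.read().splitlines()

    def predict(self, context, model_input):
        # keep only the columns listed in the logged artifact
        return model_input[self.feature_names]

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=ModelWithArtifact(),
        artifacts={"feature_names": "feature_names.txt"},  # hypothetical local file
    )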

Navigate to the mlruns subfolder in your project repo, which is automatically created by mlflow when you log an mlflow model. Find the folder named after the model’s run_id. Inside you will find the model artifacts that have been automatically logged for you, as shown below.

# get run_id of a loaded model
print(loaded_model.metadata.run_id)
38a617d0f30645e8ae95eea4642a03c2

The artifacts folder of a logged `mlflow.pyfunc` model (screenshot)
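If you’d rather inspect these artifacts programmatically than browse the mlruns folder, the tracking client can list them; a small sketch:

from mlflow.tracking import MlflowClient

client = MlflowClient()
# list the files logged under the "model" artifact path of this run
for artifact in client.list_artifacts(loaded_model.metadata.run_id, path="model"):
    print(artifact.path)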

Pretty cool, right? 😁 Feel free to explore these artifacts at your leisure; below are screenshots of the requirements and the MLmodel files, for your reference.

The requirements below specify the versions of dependencies needed to recreate the environment for running the model.

The file `requirements.txt` in the artifacts folder (screenshot)

The MLmodel file below defines, in YAML format, the metadata and configuration required to load and serve the model.

The file `MLmodel` in the artifacts folder (screenshot)

Conclusion

There you have it: the mlflow.pyfunc approach to model building. It’s a lot of information, so let’s summarize:

  1. mlflow.pyfunc provides a unified model representation that is not affected by the underlying framework or libraries used to build the model.
  2. We can even encapsulate extensive custom logic in an mlflow.pyfunc model to tailor each use case while keeping the inference API consistent and unified.
  3. The underlying model can be extracted from the loaded mlflow.pyfunc model, allowing us to leverage more custom methods/features tailored to each use case.
  4. An mlflow.pyfunc model object is captured with extensive metadata and artifacts that are automatically maintained by mlflow.
  5. This unified mlflow.pyfunc model representation can streamline the process of experimenting and migrating between different algorithms to achieve optimal performance (more on this in the following articles, see below)

Next steps

Now that we have the basics down, let’s move on to more advanced uses of mlflow.pyfunc in the next few articles. 😎 Below are a few topics that come to mind. Feel free to leave a comment and let me know what you’d like to see. 🥰

  1. Use the unified API to experiment with different algorithms and identify the optimal solution for a use case.
  2. Hyperparameter tuning with custom mlflow.pyfunc models.
  3. Encapsulate custom logic in a mlflow.pyfunc ML pipeline to customize model consumption and diagnostics.

If you enjoyed reading this article, please follow me on Medium. 😁

💼LinkedIn | 😺GitHub | 🕊️Twitter/X

Unless otherwise stated, all images are by the author.

