Algorithm-agnostic model building with Mlflow

A beginner-friendly step-by-step guide to creating generic ML pipelines using mlflow.pyfunc

A common challenge in MLOps is the difficulty of migrating between different algorithms or frameworks. This beginner-friendly article will help you tackle the challenge by leveraging algorithm-agnostic model building using mlflow.pyfunc.

Why algorithm-agnostic model building?

Consider this scenario: we have a sklearn model currently in production for a particular use case. Later, we discover that a deep learning model performs even better. If the sklearn model was deployed in its native format, the transition to the deep learning model might be a hassle 🤪 because the two model artifacts are very different.

Algorithm-agnostic model building with MLflow. Image generated by asking Gemini

To address this challenge, the mlflow.pyfunc model flavor provides a versatile and generic approach to building and deploying machine learning models in Python. 😎

1. Generic model building: The pyfunc model flavor provides a uniform way to build models, regardless of the framework or library used to build them.

2. Encapsulation of the ML pipeline: With pyfunc we can encapsulate the model with the associated pre- and post-processing steps or other custom logic desired while using the model.

3. Uniform model representation: We can implement a model, a machine learning pipeline, or an arbitrary Python function using pyfunc without worrying about the underlying format of the model. Such a unified representation simplifies the deployment, re-deployment, and downstream scoring of the model.

Sounds interesting? If so, this article is here to help you get started with mlflow.pyfunc. 🥂

  • First, let’s look at a simple example of creating the mlflow.pyfunc class.
  • Next, we define a mlflow.pyfunc class that encapsulates a machine learning pipeline (an estimator plus some preprocessing logic as an example). We also train, log, and load this ML pipeline for inference.
  • Finally, let’s dig deeper into the encapsulated mlflow.pyfunc object, explore the extensive metadata and artifacts that mlflow automatically maintains for us, and get a better sense of the full power of mlflow.pyfunc.

🔗 All code and configuration are available on GitHub. 🧰

{pyfunc} Simple toy model

Let’s first create a simple mlflow.pyfunc toy model and then use it with the mlflow workflow.

  • Step 1: Make the model
  • Step 2: Log the model
  • Step 3: Load the logged model to perform inference
import mlflow.pyfunc

# Step 1: Create a mlflow.pyfunc model
class ToyModel(mlflow.pyfunc.PythonModel):
    """
    ToyModel is a simple example implementation of an MLflow Python model.
    """

    def predict(self, context, model_input):
        """
        A basic predict function that takes a model_input list and returns a new list
        where each element is increased by one.

        Parameters:
        - context (Any): An optional context parameter provided by MLflow.
        - model_input (list of int or float): A list of numerical values that the model will use for prediction.

        Returns:
        - list of int or float: A list with each element of model_input increased by one.
        """
        return [x + 1 for x in model_input]

As you can see in the example above, you can create a mlflow.pyfunc model to implement any custom Python function you see fit for your ML solution. It doesn’t have to be an off-the-shelf machine learning algorithm.

You can then log this model and load it later to perform the inference.

# Step 2: log this model as an mlflow run
with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=ToyModel()
    )
    run_id = mlflow.active_run().info.run_id

# Step 3: load the logged model to perform inference
model = mlflow.pyfunc.load_model(f"runs:/{run_id}/model")
# dummy new data
x_new = [1, 2, 3]
# model inference for the new data
print(model.predict(x_new))
[2, 3, 4]

{pyfunc} Encapsulated XGBoost Pipeline

Now let’s create an ML pipeline that includes an estimator with additional, custom logic.

In the example below, the XGB_PIPELINE class is a wrapper that integrates the estimator with preprocessing steps, which may be desirable for some MLOps implementations. This wrapper uses mlflow.pyfunc, is estimator agnostic, and provides a unified model representation. More specifically,

  • fit(): Instead of exposing XGBoost’s native training API (xgboost.train()) directly, this class provides a .fit() method that conforms to sklearn conventions, allowing easy integration into sklearn pipelines and ensuring a consistent interface across different estimators.
  • DMatrix(): DMatrix is a core data structure in XGBoost that optimizes data for training and prediction. The step that transforms a pandas DataFrame into a DMatrix is wrapped inside the class, allowing seamless integration with pandas DataFrames, just like any other sklearn estimator.
  • predict(): This is the universal inference API of the mlflow.pyfunc model. It is consistent across this ML pipeline, across the toy model above, across any machine learning algorithms or custom logic we wrap in a mlflow.pyfunc model.
import json
import xgboost as xgb
import mlflow.pyfunc
from typing import Any, Dict, Union
import pandas as pd


class XGB_PIPELINE(mlflow.pyfunc.PythonModel):
    """
    XGB_PIPELINE is an example implementation of an MLflow Python model with XGBoost.
    """

    def __init__(self, params: Dict[str, Union[str, int, float]]):
        """
        Initialize the model with given parameters.

        Parameters:
        - params (Dict[str, Union[str, int, float]]): Parameters for the XGBoost model.
        """
        self.params = params
        self.xgb_model = None
        self.config = None

    def preprocess_input(self, model_input: pd.DataFrame) -> pd.DataFrame:
        """
        Preprocess the input data.

        Parameters:
        - model_input (pd.DataFrame): The input data to preprocess.

        Returns:
        - pd.DataFrame: The preprocessed input data.
        """
        processed_input = model_input.copy()
        # put any desired preprocessing logic here
        processed_input.drop(processed_input.columns[0], axis=1, inplace=True)

        return processed_input

    def fit(self, X_train: pd.DataFrame, y_train: pd.Series):
        """
        Train the XGBoost model.

        Parameters:
        - X_train (pd.DataFrame): The training input data.
        - y_train (pd.Series): The target values.
        """
        processed_model_input = self.preprocess_input(X_train.copy())
        dtrain = xgb.DMatrix(processed_model_input, label=y_train)
        self.xgb_model = xgb.train(self.params, dtrain)

    def predict(self, context: Any, model_input: pd.DataFrame) -> Any:
        """
        Predict using the trained XGBoost model.

        Parameters:
        - context (Any): An optional context parameter provided by MLflow.
        - model_input (pd.DataFrame): The input data for making predictions.

        Returns:
        - Any: The prediction results.
        """
        processed_model_input = self.preprocess_input(model_input.copy())
        dmatrix = xgb.DMatrix(processed_model_input)
        return self.xgb_model.predict(dmatrix)

Now let’s train and log this model.

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
import pandas as pd

# Generate synthetic datasets for demo
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# train and log the model
with mlflow.start_run(run_name="xgb_demo") as run:

    # Create an instance of XGB_PIPELINE
    params = {
        'objective': 'reg:squarederror',
        'max_depth': 3,
        'learning_rate': 0.1,
    }
    model = XGB_PIPELINE(params)

    # Fit the model
    model.fit(X_train=pd.DataFrame(X_train), y_train=y_train)

    # Log the model
    model_info = mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=model,
    )

    run_id = mlflow.active_run().info.run_id

The model has been successfully logged. ✌️ Now we will load it for inference.

loaded_model = mlflow.pyfunc.load_model(model_uri=model_info.model_uri)
loaded_model.predict(pd.DataFrame(X_test))
array([ 4.11692047e+00,  7.30551958e+00, -2.36042137e+01, -1.31888123e+02,
...
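As a quick sanity check, we can compare the loaded pipeline’s predictions against the held-out targets. This evaluation step is not part of the original walkthrough, just a minimal sketch that reuses the test split created above:

import numpy as np
from sklearn.metrics import mean_squared_error

# predictions from the loaded pyfunc wrapper on the held-out features
preds = loaded_model.predict(pd.DataFrame(X_test))

# root mean squared error against the true targets
rmse = np.sqrt(mean_squared_error(y_test, preds))
print(f"RMSE on the test split: {rmse:.3f}")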

Dive deeper into the mlflow.pyfunc object

The above process is pretty smooth, right? This represents the basic functionality of the mlflow.pyfunc object. Now let’s dive deeper into the full power that mlflow.pyfunc has to offer.

1. model_info

In the above example, the model_info object returned by mlflow.pyfunc.log_model() is an instance of the mlflow.models.model.ModelInfo class. It contains metadata and information about the logged model. For example

Some features of the model_info object (screenshot)

Feel free to run dir(model_info) to explore further or view the source code for all defined attributes. The attribute I use most is model_uri, which indicates where the logged model can be found within the mlflow tracking system.
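For example, here is a small sketch that prints a few of these attributes for the model logged above:

# a few commonly used attributes of the ModelInfo object
print(model_info.model_uri)   # e.g. "runs:/<run_id>/model", used to load the model back
print(model_info.run_id)      # the run under which the model was logged
print(model_info.flavors)     # the flavors the model can be loaded with (e.g. python_function)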

2. loaded_model

It’s worth clarifying that loaded_model is not an instance of the XGB_PIPELINE class, but rather a wrapper object provided by mlflow.pyfunc for algorithm-agnostic inference. As shown below, attempting to retrieve attributes of the XGB_PIPELINE class from loaded_model will return an error.

print(loaded_model.params)
AttributeError: 'PyFuncModel' object has no attribute 'params'
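What the wrapper does expose is the unified predict() API plus the logged metadata; a quick sketch:

# the wrapper is a generic PyFuncModel, not our XGB_PIPELINE class
print(type(loaded_model))            # <class 'mlflow.pyfunc.PyFuncModel'>
print(loaded_model.metadata.run_id)  # metadata of the logged model is still available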

3. unwrapped_model

Okay, you might be wondering, where is the trained instance of XGB_PIPELINE then? Is it also captured and retrieved via mlflow?

Don’t worry, it is stored safely for easy unwrapping, as shown below.

unwrapped_model = loaded_model.unwrap_python_model()
print(unwrapped_model.params)
{'objective': 'reg:squarederror', 'max_depth': 3, 'learning_rate': 0.1}

That’s how it’s done. 😎 With the unwrapped model, you have access to all the properties and methods of your custom ML pipeline, just like this one! Sometimes I add methods like explain_model or post_processing to the custom pipeline, or features that trace the model training process and provide diagnostics 🤩… I’ll stop here and save those for the next articles. Suffice it to say, feel free to customize your ML pipeline to your use case, knowing that

  1. You will have access to all these custom methods and attributes for downstream use and
  2. This custom model is wrapped in the unified mlflow.pyfunc inference API and can therefore be smoothly migrated to other estimators if needed.
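For instance, because XGB_PIPELINE stores its trained booster on self.xgb_model, nothing stops us from reaching the native XGBoost API through the unwrapped model. A small sketch (not part of the original pipeline):

# the unwrapped object is our XGB_PIPELINE instance, so custom attributes are available
booster = unwrapped_model.xgb_model                # the native xgboost.Booster trained in fit()
print(booster.get_score(importance_type="gain"))   # e.g. feature importance by gain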

4. Context

You may have noticed that there is a context parameter for the predict methods in both mlflow.pyfunc classes defined above. But interestingly, this parameter is not required when we are making predictions with the loaded model. Why❓

loaded_model = mlflow.pyfunc.load_model(model_uri)
# the context parameter is not needed when calling `predict`
loaded_model.predict(model_input)

This is because the loaded_model above is a wrapper object provided by mlflow. If we use the unwrapped model instead, we need to explicitly define the context as shown below; otherwise, the code will return an error.

unwrapped_model = loaded_model.unwrap_python_model()
# need to provide the context manually
unwrapped_model.predict(context=None, model_input=model_input)

So, what is this context? And what role does it play in the predict method?

The context is a PythonModelContext object that contains artifacts the pyfunc model can use when performing inference. It is created implicitly and automatically by the log_model() method.
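In the pipelines above we never pass anything through the context, which is why context=None works fine. But if you log supporting files via the artifacts argument of log_model(), MLflow makes them available through context.artifacts at load and inference time. Here is a hypothetical sketch; the model class and the local file name are made up for illustration:

class ModelWithArtifact(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # context.artifacts maps artifact names to local paths of the logged files
        with open(context.artifacts["feature_names"]) as f:
            self.feature_names = f.read().splitlines()

    def predict(self, context, model_input):
        # keep only the columns listed in the logged artifact
        return model_input[self.feature_names]

with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=ModelWithArtifact(),
        artifacts={"feature_names": "feature_names.txt"},  # hypothetical local file
    )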

Navigate to the mlruns subfolder in your project repo, which is automatically created by mlflow when you log an mlflow model. Find the folder named after the model’s run_id. Inside you will find the model artifacts that have been automatically logged for you, as shown below.

# get run_id of a loaded model
print(loaded_model.metadata.run_id)
38a617d0f30645e8ae95eea4642a03c2

The artifacts folder of a logged `mlflow.pyfunc` model (screenshot)
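If you’d rather inspect these artifacts programmatically than browse the mlruns folder, the tracking client can list them; a small sketch:

from mlflow.tracking import MlflowClient

client = MlflowClient()
# list the files logged under the "model" artifact path of this run
for artifact in client.list_artifacts(loaded_model.metadata.run_id, path="model"):
    print(artifact.path)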

Pretty cool, right? 😁 Feel free to explore these artifacts at your leisure; below are screenshots of the requirements and the MLmodel files, for your reference.

The requirements below specify the versions of dependencies needed to recreate the environment for running the model.

The file `requirements.txt` in the artifacts folder (screenshot)

The MLmodel file below defines, in YAML format, the metadata and configuration required to load and serve the model.

The file `MLmodel` in the artifacts folder (screenshot)

Conclusion

There you have it: the mlflow.pyfunc approach to model building. It’s a lot of information, so let’s summarize:

  1. mlflow.pyfunc provides a unified model representation that is not affected by the underlying framework or libraries used to build the model.
  2. We can even encapsulate extensive custom logic in an mlflow.pyfunc model to tailor each use case while keeping the inference API consistent and unified.
  3. The underlying model can be extracted from the loaded mlflow.pyfunc model, allowing us to leverage more custom methods/features tailored to each use case.
  4. An mlflow.pyfunc model object is captured with extensive metadata and artifacts that are automatically maintained by mlflow.
  5. This unified mlflow.pyfunc model representation can streamline the process of experimenting and migrating between different algorithms to achieve optimal performance (more on this in the following articles, see below)

Next steps

Now that we have the basics down, let’s move on to more advanced uses of mlflow.pyfunc in the next few articles. 😎 Below are a few topics that come to mind. Feel free to leave a comment and let me know what you’d like to see. 🥰

  1. Use the unified API to experiment with different algorithms and identify the optimal solution for a use case.
  2. Hyperparameter tuning with custom mlflow.pyfunc models.
  3. Encapsulate custom logic in a mlflow.pyfunc ML pipeline to customize model consumption and diagnostics.

If you enjoyed reading this article, please follow me on Medium. 😁

💼LinkedIn | 😺GitHub | 🕊️Twitter/X

Unless otherwise stated, all images are by the author.

