
KernelSHAP can be misleading with correlated predictors

A concrete case study

“Like many other permutation-based interpretation methods, the Shapley value method suffers from the inclusion of unrealistic data instances when features are correlated. To simulate that a feature value is missing from a coalition, we marginalize the feature. … When features are dependent, we may sample feature values that do not make sense for this instance.” — Interpretable Machine Learning, Christoph Molnar.

SHAP (SHapley Additive exPlanations) values are designed to fairly attribute a model's prediction to its input features, based on the concept of Shapley values from cooperative game theory. The Shapley value framework has several desirable theoretical properties and can, in principle, handle any predictive model. However, SHAP values can be misleading, especially when computed with the KernelSHAP approximation method: when predictors are correlated, the approximation can be inaccurate and can even flip the sign of a feature's attribution.

In this blog post, I will show how exact SHAP values, computed from their original definition, can differ significantly from the approximations produced by the SHAP framework, in particular KernelSHAP. I will also discuss the reasons behind these differences.

Case Study: Churn Rate

Imagine a scenario where we want to predict the churn rate of leases in an office building based on two key factors: occupancy rate and the number of reported issues.

Occupancy has a significant impact on churn rate. For example, if occupancy is too low, tenants may leave because the office is underutilized. Conversely, if occupancy is too high, tenants may leave due to overcrowding, looking for better options elsewhere.

Furthermore, let us assume that the reported problem rate is strongly correlated with the occupancy rate: specifically, the reported problem rate is the square of the occupancy rate.

We define the churn rate function as follows:

Image by Author: churn rate function
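
The formula image is not reproduced here; reading it off the prediction function predict_fn in the code below (the constants C_base, C_churn, C_occ and C_problem come from the post's setup), the churn rate is:

\text{churn\_rate} = C_{base} + C_{churn}\,\big(C_{occ}\cdot x_1 - x_2 - 0.6\big)^2 + C_{problem}\cdot x_2

where x_1 denotes the occupancy rate and x_2 the reported problem rate.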

The behavior of this function with respect to the two variables is shown in the following illustrations:

Image by Author: Churn on Two Variables
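
The post does not show how the synthetic data is generated. A minimal sketch consistent with the setup (the constant values below are illustrative assumptions, not the author's originals) could look like this:

import numpy as np

rng = np.random.default_rng(42)

# Illustrative constants (assumed values; the original post defines its own)
C_base, C_churn, C_occ, C_problem = 0.2, 1.0, 1.0, 0.1

# Occupancy rates in [0, 1]; reported problems are the square of occupancy
occupancy_rates = rng.uniform(0, 1, 1000)
reported_problem_rates = occupancy_rates ** 2
churn_rates = (
    C_base
    + C_churn * (C_occ * occupancy_rates - reported_problem_rates - 0.6) ** 2
    + C_problem * reported_problem_rates
)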

Differences between original SHAP and Kernel SHAP

SHAP values calculated using Kernel SHAP

We now use the following code to calculate the SHAP values of the predictors:

import pandas as pd
import shap
from sklearn.model_selection import train_test_split

# Define the dataframe
churn_df = pd.DataFrame(
    {
        "occupancy_rate": occupancy_rates,
        "reported_problem_rate": reported_problem_rates,
        "churn_rate": churn_rates,
    }
)
X = churn_df.drop("churn_rate", axis=1)
y = churn_df["churn_rate"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Append one special point
X_test = pd.concat(
    objs=[X_test, pd.DataFrame({"occupancy_rate": [0.8], "reported_problem_rate": [0.64]})]
)

# Define the prediction function
def predict_fn(data):
    occupancy_rates = data[:, 0]
    reported_problem_rates = data[:, 1]
    churn_rate = (
        C_base
        + C_churn * (C_occ * occupancy_rates - reported_problem_rates - 0.6) ** 2
        + C_problem * reported_problem_rates
    )
    return churn_rate

# Create the SHAP KernelExplainer using the prediction function
background_data = shap.sample(X_train, 100)
explainer = shap.KernelExplainer(predict_fn, background_data)
shap_values = explainer(X_test)

The above code performs the following tasks:

  1. Data Preparation: A DataFrame named churn_df is created with the columns occupancy_rate, reported_problem_rate and churn_rate. The feature matrix X and the target y (churn_rate) are then derived from churn_df, and the data is split into training and test sets, with 80% for training and 20% for testing. Note that a special data point with occupancy_rate 0.8 and reported_problem_rate 0.64 is appended to the test set X_test.
  2. Definition of the Prediction Function: A function predict_fn is defined to calculate the churn rate using a specific formula with predefined constants.
  3. SHAP Analysis: A SHAP KernelExplainer is initialized using the prediction function and background_data sampled from X_train. SHAP values for X_test are calculated using the explainer.
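
To inspect the attribution for the appended special point (0.8, 0.64), one can index the last row of the result (a small sketch using the array-returning shap_values API):

# SHAP values for the special point, the last row of X_test
shap_values_arr = explainer.shap_values(X_test)
print(shap_values_arr[-1])  # one value each for occupancy_rate and reported_problem_rate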

Below is a summary SHAP bar chart, showing the average SHAP values for X_test:

Image by author: average SHAP values

In particular, we see that at the data point (0.8, 0.64) the SHAP values of the two features are 0.10 and -0.03, as illustrated by the following force plot:

Image by author: force plot of one data point

SHAP values according to the original definition

Let’s take a step back and calculate the exact SHAP values step by step, according to their original definition. The general formula for the SHAP value of feature i is:

\phi_i = \sum_{S \subseteq \{1,\dots,M\} \setminus \{i\}} \frac{|S|!\,(M - |S| - 1)!}{M!}\,\Big[f\big(X_S \cup \{x_i\}\big) - f\big(X_S\big)\Big]

where S is a subset of all feature indices except i, |S| is the size of S, M is the total number of features, f(X_S ∪ {x_i}) is the model evaluated with the features in S and with x_i present, and f(X_S) is the model evaluated with the features in S and with x_i absent; absent features are averaged out via conditional expectations.

Now let us calculate the SHAP values for the two features, occupancy rate (denoted x_1) and reported problem rate (denoted x_2), at the data point (0.8, 0.64). Recall that x_1 and x_2 are related by x_2 = x_1².

With M = 2, the SHAP value for the occupancy rate at the data point is:

\phi_1 = \tfrac{1}{2}\Big(\mathbb{E}\big[f \mid X_1 = 0.8\big] - \mathbb{E}[f]\Big) + \tfrac{1}{2}\Big(f(0.8, 0.64) - \mathbb{E}\big[f \mid X_2 = 0.64\big]\Big)

and, similarly, for the reported problem rate:

\phi_2 = \tfrac{1}{2}\Big(\mathbb{E}\big[f \mid X_2 = 0.64\big] - \mathbb{E}[f]\Big) + \tfrac{1}{2}\Big(f(0.8, 0.64) - \mathbb{E}\big[f \mid X_1 = 0.8\big]\Big)

Let’s evaluate the four terms for the occupancy rate one by one:

  1. The first term is the expectation of the model output when X_1 is fixed at 0.8 and X_2 is averaged over its conditional distribution. Given the relation x_2 = x_1², X_2 is fully determined once X_1 is fixed, so this expectation is simply the model output at the specific point (0.8, 0.64).
  2. The second term is the unconditional expectation of the model output, where both X_1 and X_2 are averaged over their distributions. It can be calculated by averaging the outputs over all data points in the background dataset.
  3. The third term is the model output at the specific point (0.8, 0.64).
  4. The last term is the expectation of the model output when X_1 is averaged over its conditional distribution, given that X_2 is fixed at 0.64. Because of the relationship x_2 = x_1² (so x_1 = √x_2 for these nonnegative rates), this expectation again corresponds to the model output at (0.8, 0.64), just as in the first step.
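
Putting the four terms together (a worked simplification, using the fact that the first and last terms both collapse to the model output at the point):

\phi_1 = \tfrac{1}{2}\big(f(0.8, 0.64) - \mathbb{E}[f]\big) + \tfrac{1}{2}\big(f(0.8, 0.64) - f(0.8, 0.64)\big) = \tfrac{1}{2}\big(f(0.8, 0.64) - \mathbb{E}[f]\big)

The same simplification holds for \phi_2, which is why the two exact SHAP values coincide.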

The SHAP values calculated from the original definition for the two features, occupancy rate and reported problem rate, at the data point (0.8, 0.64) are -0.0375 and -0.0375, respectively, which is quite different from the values given by Kernel SHAP (0.10 and -0.03).

Where do the discrepancies come from?

Cause of discrepancies in SHAP values

As you may have noticed, the discrepancy between the two methods mainly arises from the first and fourth steps, where we need to calculate conditional expectations, for example the expectation of the model output over X_2 when X_1 is fixed at 0.8.

  • Exact SHAP: When calculating exact SHAP values, dependencies between features (such as x_2 = x_1² in our example) are explicitly taken into account, because the conditional expectations only average over feature values that are consistent with the instance being explained.
  • Kernel SHAP: By default, Kernel SHAP assumes feature independence and replaces the conditional expectations with marginal ones, which can lead to inaccurate SHAP values when features are in fact dependent. According to the paper A Unified Approach to Interpreting Model Predictions, this assumption is a simplification; in practice features are often correlated, making accurate approximation with Kernel SHAP difficult. A small numeric illustration follows below the screenshot.

Screenshot from the article
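
To see what the independence assumption does in this example, compare marginal and conditional imputation of the "missing" feature X_2 while X_1 is held at 0.8 (a minimal illustration, not the shap library's internal code):

import numpy as np

# Background marginal of X_2 (reported problem rate = occupancy rate squared)
x2_background = occupancy_rates ** 2

# Marginal imputation (Kernel SHAP default): X_2 is drawn ignoring X_1 = 0.8,
# producing synthetic coalitions like (0.8, 0.01) that violate x_2 = x_1**2
marginal_samples = np.column_stack(
    [np.full_like(x2_background, 0.8), x2_background]
)

# Conditional imputation (exact SHAP): x_2 = 0.8**2 = 0.64 is the only
# value consistent with this instance
conditional_sample = np.array([[0.8, 0.64]])

print(marginal_samples[:3])  # unrealistic data instances
print(conditional_sample)    # the realistic one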

Possible solutions

Unfortunately, calculating SHAP values directly from their original definition can be computationally expensive. Here are some alternative approaches to consider:

TreeSHAP

  • TreeSHAP is specifically designed for tree-based models such as random forests and gradient boosting machines. It calculates SHAP values efficiently and can handle feature dependencies.
  • The method is optimized for tree ensembles, making it faster and more scalable than model-agnostic SHAP computations.
  • When using TreeSHAP within the SHAP framework, pay attention to the feature_perturbation parameter: "tree_path_dependent" approximates conditional expectations from the trees' own cover statistics and thus respects feature dependencies, while "interventional" deliberately breaks dependencies by averaging over a background dataset. A short usage sketch follows this list.
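
A minimal usage sketch (the gradient boosting model below is an illustrative choice, trained on the synthetic churn data from earlier):

from sklearn.ensemble import GradientBoostingRegressor
import shap

# Fit a tree ensemble on the synthetic churn data
model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)

# Path-dependent TreeSHAP: conditional expectations approximated from tree cover
tree_explainer = shap.TreeExplainer(model, feature_perturbation="tree_path_dependent")
tree_shap_values = tree_explainer.shap_values(X_test)
print(tree_shap_values[-1])  # attribution for the special point (0.8, 0.64)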

Extending Kernel SHAP for Dependent Features

  • To address feature dependencies, one proposed extension of Kernel SHAP assumes that the feature vector follows a multivariate Gaussian distribution. In this approach:
  • Conditional distributions are modeled as multivariate Gaussians.
  • Samples are generated from these conditional Gaussian distributions, with parameters estimated from the training data.
  • The integral in the approximation is calculated based on these samples.
  • The Gaussian assumption is not always applicable in realistic scenarios, where features may exhibit other dependency structures. A sketch of the conditional sampling step follows this list.
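
A sketch of the conditional sampling step under the Gaussian assumption (standard bivariate Gaussian conditioning; here estimating E[f | X_1 = 0.8]):

import numpy as np

X_np = X_train.to_numpy()
mu = X_np.mean(axis=0)            # estimated mean vector
cov = np.cov(X_np, rowvar=False)  # estimated covariance matrix

# Conditional distribution of X_2 given X_1 = x1 under a bivariate Gaussian
x1 = 0.8
cond_mean = mu[1] + cov[1, 0] / cov[0, 0] * (x1 - mu[0])
cond_var = cov[1, 1] - cov[1, 0] ** 2 / cov[0, 0]

# Draw X_2 samples that respect the (Gaussian-approximated) dependency
rng = np.random.default_rng(0)
x2_samples = rng.normal(cond_mean, np.sqrt(cond_var), size=1000)

# Average model outputs over the sampled coalitions
coalition = np.column_stack([np.full(1000, x1), x2_samples])
cond_expectation = predict_fn(coalition).mean()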

Improving Kernel SHAP accuracy

  • Improve the accuracy of Kernel SHAP by ensuring that the background dataset used for the approximation is representative of the true data distribution; the closer the features are to independent, the less harm the independence assumption causes. One practical tactic is sketched below.
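
One practical tactic is to summarize the training data into representative, weighted background points with k-means (shap.kmeans is the library helper; the cluster count of 50 is an assumption):

import shap

# Summarize the training data into 50 weighted centroids for the background
background_summary = shap.kmeans(X_train, 50)
explainer_km = shap.KernelExplainer(predict_fn, background_summary)
shap_values_km = explainer_km.shap_values(X_test)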

Using these methods, you can address the computational challenges associated with calculating SHAP values and improve their accuracy in practical applications. However, it is important to note that no single solution is universally optimal for all scenarios.

Conclusion

In this blog post, we explored how SHAP values, despite their strong theoretical basis and versatility in various predictive models, can suffer from accuracy issues when predictors are correlated, especially when using approximations such as KernelSHAP. Understanding these limitations is crucial for effectively interpreting SHAP values. By recognizing the potential discrepancies and selecting the most appropriate approximation methods, we can achieve more accurate and reliable feature attribution in our models.

