This article is based on information in 「機械学習を解釈する技術 ~Techniques for Interpreting Machine Learning~」 by Mitsunosuke Morishita. In the book, the author does not implement all of the methods in R, so I decided to write a brief note with R scripts.
Permutation Feature Importance (PFI) is defined as the decrease in a model score when a single feature's values are randomly shuffled (the definition comes from the scikit-learn documentation). This procedure breaks the relationship between the feature and the target, so the drop in the model score indicates how much the model depends on that feature. The procedure boils down to five simple steps: (1) train a model, (2) measure its baseline score, (3) shuffle the values of one feature, (4) measure the score again, and (5) take the drop in score as that feature's importance; repeat for every feature.
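Before using a package, it helps to see the idea spelled out by hand. Below is a minimal sketch of those five steps on a toy dataset (mtcars with a small ranger forest); this is my own illustration, not the model or data used in the rest of the article.
# Hand-rolled PFI on a toy example: importance = increase in RMSE after shuffling
# one feature. Uses mtcars and ranger purely for illustration.
library(ranger)
data(mtcars)
fit_toy <- ranger(mpg ~ ., data = mtcars, num.trees = 200, seed = 1)
baseline_rmse <- sqrt(mean((mtcars$mpg - predict(fit_toy, mtcars)$predictions)^2))
pfi_toy <- sapply(setdiff(names(mtcars), "mpg"), function(v) {
  shuffled <- mtcars
  shuffled[[v]] <- sample(shuffled[[v]])     # break the feature-target relationship
  rmse <- sqrt(mean((mtcars$mpg - predict(fit_toy, shuffled)$predictions)^2))
  rmse - baseline_rmse                       # score drop = importance of v
})
sort(pfi_toy, decreasing = TRUE)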
Now, let’s see how to run PFI on an actual dataset.
# Set up
library(mlbench)
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom 1.0.2 ✔ recipes 1.0.3
## ✔ dials 1.1.0 ✔ rsample 1.1.1
## ✔ dplyr 1.0.10 ✔ tibble 3.1.8
## ✔ ggplot2 3.4.0 ✔ tidyr 1.3.0
## ✔ infer 1.0.4 ✔ tune 1.0.1
## ✔ modeldata 1.0.1 ✔ workflows 1.1.2
## ✔ parsnip 1.0.3 ✔ workflowsets 1.0.0
## ✔ purrr 1.0.1 ✔ yardstick 1.1.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ recipes::step() masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
library(DALEX)
## Welcome to DALEX (version: 2.4.2).
## Find examples and detailed introduction at: http://ema.drwhy.ai/
##
## Attaching package: 'DALEX'
## The following object is masked from 'package:dplyr':
##
## explain
library(ranger)
library(Rcpp)
##
## Attaching package: 'Rcpp'
## The following object is masked from 'package:rsample':
##
## populate
library(corrplot)
## corrplot 0.92 loaded
data("BostonHousing")
df = BostonHousing
`%notin%` <- Negate(`%in%`)
Here is an overview of the dataset.
Details of Variables
head(df)
## crim zn indus chas nox rm age dis rad tax ptratio b lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
medv is our response variable; this is what we predict.
hist(df$medv, breaks = 20, main = 'Histogram of medv', xlab = 'medv ($ in 1,000)')
We won’t cover model building in this article. I used a random forest model (the ranger engine through parsnip).
split = initial_split(df, 0.8)
train = training(split)
test = testing(split)
model = rand_forest(trees = 100, min_n = 1, mtry = 13) %>%
set_engine(engine = "ranger", seed = 25) %>%
set_mode("regression")
fit = model %>%
fit(medv ~., data=train)
fit
## parsnip model object
##
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~13, x), num.trees = ~100, min.node.size = min_rows(~1, x), ~seed(25), num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
##
## Type: Regression
## Number of trees: 100
## Sample size: 404
## Number of independent variables: 13
## Mtry: 13
## Target node size: 1
## Variable importance mode: none
## Splitrule: variance
## OOB prediction error (MSE): 10.70344
## R squared (OOB): 0.8676068
result = test %>%
select(medv) %>%
bind_cols(predict(fit, test))
metrics = metric_set(rmse, rsq)
result %>%
metrics(medv, .pred)
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 4.24
## 2 rsq standard 0.823
Use the function explain to create an explainer object that helps us to interpret the model.
explainer = fit %>%
explain(
data = test %>% select(-medv),
y = test$medv
)
## Preparation of a new explainer is initiated
## -> model label : model_fit ( default )
## -> data : 102 rows 13 cols
## -> target variable : 102 values
## -> predict function : yhat.model_fit will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package parsnip , ver. 1.0.3 , task regression ( default )
## -> predicted values : numerical, min = 9.321 , mean = 22.52986 , max = 46.632
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -23.066 , mean = 0.1956275 , max = 17.366
## A new explainer has been created!
Use the model_parts function to get PFI. Here you can see that rm and lstat are the two most important variables for predicting medv. The dark blue box plots show the distribution of the loss drop, since we calculate it multiple times (B = 10).
pfi = explainer %>%
model_parts(
loss_function = loss_root_mean_square,
B = 10,
type = "difference"
)
plot(pfi)
Method | Function |
---|---|
Permutation Feature Importance(PFI) | model_parts() |
Partial Dependence(PD) | model_profile() |
Individual Conditional Expectation(ICE) | predict_profile() |
SHAP | predict_parts() |
If some explanatory variables are correlated with each other, PFI won’t work well. Let’s say \(X_0\) and \(X_1\) are correlated. While calculating the importance of \(X_0\), the model can still use \(X_1\) for prediction, so the performance of the model does not drop much. Thus, PFI will underestimate the importance of \(X_0\) (and, by the same argument, of \(X_1\)). In the plot below, rad (index of accessibility to radial highways) and tax (full-value property-tax rate per $10,000) are strongly correlated. In a situation like this, we should shuffle both variables together, which is called Grouped Permutation Feature Importance (GPFI). We should also use GPFI when variables are one-hot encoded, or when dealing with data such as latitudes and longitudes.
cor <- df[,unlist(lapply(df, is.numeric)) ]
corrplot(cor(cor), method = 'number', order = 'alphabet')
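As a sanity check on the idea of shuffling a group together, here is a minimal hand-rolled sketch using the explainer created above; grouped_rmse_drop is a hypothetical helper of mine, not a DALEX function.
# Minimal sketch of grouped permutation: shuffle the correlated columns with the
# SAME row permutation so their joint relationship with the target is broken together.
grouped_rmse_drop <- function(explainer, group) {
  X <- explainer$data
  y <- explainer$y
  base <- sqrt(mean((y - predict(explainer, X))^2))
  idx <- sample(nrow(X))                       # one shared permutation for the group
  X_perm <- X
  X_perm[group] <- X[idx, group, drop = FALSE]
  sqrt(mean((y - predict(explainer, X_perm))^2)) - base
}
# grouped_rmse_drop(explainer, c("tax", "rad"))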
So let’s run GPFI on our dataset. The model_parts function has a variable_groups argument, which takes a list. So make a list that contains the names of the explanatory variables to be grouped, in this case rad and tax1. The source code of feature_importance is here.
# Make list
paired_var = list(c("tax","rad"))
# Make a vector of explanatory variables. Do not forget to take out your response variable
all_vars = c(colnames(df)[colnames(df) %notin% c("tax","rad","medv")])
# Gather
var_list = c(all_vars, paired_var)
pfi = explainer %>%
model_parts(
loss_function = loss_root_mean_square,
B = 10,
type = "difference",
variable_groups = var_list
)
# Plot grouped PFI
plot(pfi)
If you keep tax and rad individually in the plot as well, you can see that the importance of tax and rad is dispersed between them when they are permuted separately.
# Make list
paired_var = list(c("tax","rad"))
# Make a vector of explanatory variables. Do not forget to take out your response variable
all_vars = c(colnames(df)[colnames(df) %notin% c("medv")])
# Gather
var_list = c(all_vars, paired_var)
pfi = explainer %>%
model_parts(
loss_function = loss_root_mean_square,
B = 10,
type = "difference",
variable_groups = var_list
)
# Plot grouped PFI
plot(pfi)
PFI and GPFI are efficient methods for calculating the importance of the explanatory variables in a model. On the other hand, PFI does not explain how each variable affects the model’s predictions. That can be done with Partial Dependence (PD).
Partial Dependence (PD) is a method for showing the dependence between the response variable and an explanatory variable. By changing the value of the explanatory variable while keeping track of the prediction, we can understand the relationship between the explanatory variable and the response variable. For example, if we increase the value of explanatory variable \(X_0\) and the prediction also increases, these variables have a positive relationship. In addition, by plotting these outputs, we can determine whether the relationship is linear or non-linear.
Here comes your favorite part. Let’s say your trained model is \(\hat{f}(X_0,X_1,X_2)\) and you would like to find the PD of \(X_0\) with respect to the prediction \(\hat{y}\). In this situation, we change only the value of \(X_0\) and average the predictions.
\[ \large \hat{PD_0}(x_0) = \displaystyle \frac{1}{N} \sum_{i = 1}^{N} \hat{f}(x_0,x_{i,1},x_{i,2}) \]
We use lower-case \(x\) to express a directly specified value. For example, if we set \(X_0\) to 1, we substitute \(x_0 = 1\), so the function looks like \(\hat{PD_0}(1) = \displaystyle \frac{1}{N} \sum_{i = 1}^{N} \hat{f}(1,x_{i,1},x_{i,2})\). Or, if we set \(X_0\) to 4, then \(\hat{PD_0}(4) = \displaystyle \frac{1}{N} \sum_{i = 1}^{N} \hat{f}(4,x_{i,1},x_{i,2})\). Here \(x_{i,1}\) and \(x_{i,2}\) are the \(i\)th observed values of \(X_1\) and \(X_2\).
To generalize, define your set of explanatory variables as \(X = (X_0,...,X_J)\) and your trained model as \(\hat{f}(X)\). Your target explanatory variable is \(X_j\), and \(X\) without \(X_j\) is \(X_{-j} = (X_0,...,X_{j-1},X_{j+1},...,X_J)\). The observed value of \(X_j\) for the \(i\)th observation is written \(x_{i,j}\), and the \(i\)th observation without \(x_{i,j}\) is \(x_{i,-j} = (x_{i,0},...,x_{i,j-1},x_{i,j+1},...,x_{i,J})\). When the explanatory variable is fixed at \(X_j = x_j\), \(\hat{PD_j}(x_j)\), i.e. the mean prediction, is
\[
\large \hat{PD_j}(x_j) = \displaystyle \frac{1}{N} \sum_{i = 1}^{N} \hat{f}(x_j,x_{i,-j})
\]

### FYI
A method like this, which calculates the effect of the variable of interest by averaging over the other variables to ignore their effects, is called marginalization. If you want to learn more about marginalization, visit here.
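To make the formula concrete, here is a hand-rolled sketch of \(\hat{PD_j}\); predict_fun and X are hypothetical placeholders (any function that returns predictions for a data frame of predictors), not objects defined in this article.
# Hand-rolled partial dependence: fix x_j for every observation, predict, average.
pd_hat <- function(predict_fun, X, j, grid = sort(unique(X[[j]]))) {
  sapply(grid, function(xj) {
    X_tmp <- X
    X_tmp[[j]] <- xj              # substitute x_j for all rows ...
    mean(predict_fun(X_tmp))      # ... and average over x_{i,-j}
  })
}
# Example usage (hypothetical):
# grid <- seq(min(X$rm), max(X$rm), length.out = 20)
# plot(grid, pd_hat(predict_fun, X, "rm", grid), type = "l", xlab = "rm", ylab = "PD")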
Now, let’s see how to run PD on an actual dataset.
# Set up
library(mlbench)
library(tidymodels)
library(DALEX)
library(ranger)
library(Rcpp)
library(corrplot)
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
data("BostonHousing")
df = BostonHousing
`%notin%` <- Negate(`%in%`)
Here is an overview of the dataset.
Details of Variables
head(df)
## crim zn indus chas nox rm age dis rad tax ptratio b lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
medv is our response variable; this is what we predict.
hist(df$medv, breaks = 20, main = 'Histogram of medv', xlab = 'medv ($ in 1,000)')
We won’t cover model building in this article. I used a random forest model (the ranger engine through parsnip).
split = initial_split(df, 0.8)
train = training(split)
test = testing(split)
model = rand_forest(trees = 100, min_n = 1, mtry = 13) %>%
set_engine(engine = "ranger", seed = 25) %>%
set_mode("regression")
fit = model %>%
fit(medv ~., data=train)
fit
## parsnip model object
##
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~13, x), num.trees = ~100, min.node.size = min_rows(~1, x), ~seed(25), num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
##
## Type: Regression
## Number of trees: 100
## Sample size: 404
## Number of independent variables: 13
## Mtry: 13
## Target node size: 1
## Variable importance mode: none
## Splitrule: variance
## OOB prediction error (MSE): 11.63041
## R squared (OOB): 0.8625487
result = test %>%
select(medv) %>%
bind_cols(predict(fit, test))
metrics = metric_set(rmse, rsq)
result %>%
metrics(medv, .pred)
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 3.10
## 2 rsq standard 0.889
Use the function explain to create an explainer object that helps us to interpret the model.
explainer = fit %>%
explain(
data = test %>% select(-medv),
y = test$medv
)
## Preparation of a new explainer is initiated
## -> model label : model_fit ( default )
## -> data : 102 rows 13 cols
## -> target variable : 102 values
## -> predict function : yhat.model_fit will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package parsnip , ver. 1.0.3 , task regression ( default )
## -> predicted values : numerical, min = 8.595 , mean = 22.59445 , max = 49.12
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -8.21 , mean = 0.3594706 , max = 12.805
## A new explainer has been created!
Use the model_profile function to get the PD plot. Here you can see that lstat, rm, and dis (the three most important predictors according to PFI) have relationships with the prediction. The source code of partial_dependence is here.
pd = explainer %>%
model_profile()
plot(pd)
You can choose which variables to plot by passing a vector of variable names to the variables argument.
pd = explainer %>%
model_profile(
variables = c("lstat", "rm", "dis", "crim")
)
plot(pd)
Method | Function |
---|---|
Permutation Feature Importance(PFI) | model_parts() |
Partial Dependence(PD) | model_profile() |
Individual Conditional Expectation(ICE) | predict_profile() |
SHAP | predict_parts() |
Some of you might ask whether PD is just the same as looking at a scatter plot like the one below. However, there is a big difference between a scatter plot and a PD plot.
cor <- df[,unlist(lapply(df, is.numeric))]
pairs(cor, pch = 19, upper.panel = NULL, panel=panel.smooth)
For example, take a look at the scatter plots of dis (weighted distance to five Boston employment centers), crim (per capita crime rate by town), and medv (median value of owner-occupied homes in $1,000s). In the first plot (dis vs. medv), we can observe that the distance to employment centers and medv have a positive relationship. Since employment centers are located in the center of the city, this suggests that home prices increase as you move away from the city, which is not very intuitive. Looking at crim and dis (the middle plot), the crime rate decreases as the distance increases: in Boston, the neighborhood gets safer as you move away from the city center. Accordingly, in the third plot, the price of the house increases as the crime rate decreases. This explains the apparent positive relationship between home price and distance from the center.
p1 <- qplot(dis, medv, data = df) +
geom_smooth()
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
p2 <- qplot(dis, log(crim), data = df) +
geom_smooth()
p3 <- qplot(log(crim), medv, data = df) +
geom_smooth()
grid.arrange(p1,p2,p3,
nrow = 1,
top = "medv:price of house, dis:distance, crim:crime rate")
In the PD plot, medv (y) and dis (x) show the opposite relationship to the scatter plot. As you can see, the PD plot uncovers the hidden relationship between the variables.
pd = explainer %>%
model_profile(
variables = c("dis")
)
plot(pd)
PD is a useful method for observing variable relationships in the model. However, PD averages over all observations to visualize the relationship. If individual observations have different effects on the response variable, PD will not be able to catch those effects. In a situation like that, Individual Conditional Expectation (ICE) can handle them.
Individual Conditional Expectation (ICE) is a method for visualizing and understanding the relationship between the response variable and individual observations. Calculating ICE is pretty simple if you are familiar with calculating Partial Dependence (PD): PD is just the mean of the ICE curves. PD is calculated by changing the value of an explanatory variable while keeping track of the prediction; for example, if we increase the value of explanatory variable \(X_0\) and the prediction also increases, these variables have a positive relationship. In PD we average over all observations to get an overview of the relationship, but in ICE we plot every observation individually.
The ICE formula looks like the formula for PD. For review, PD is
\[ \large \hat{PD_j}(x_j) = \displaystyle \frac{1}{N} \sum_{i = 1}^{N} \hat{f}(x_j,x_{i,-j}) \]
Your set of explanatory variables is \(X = (X_0,...,X_J)\), and your trained model is \(\hat{f}(X)\). Your target explanatory variable is \(X_j\), and \(X\) without \(X_j\) is \(X_{-j} = (X_0,...,X_{j-1},X_{j+1},...,X_J)\). The observed value of \(X_j\) for the \(i\)th observation is \(x_{i,j}\), and the \(i\)th observation without \(x_{i,j}\) is \(x_{i,-j} = (x_{i,0},...,x_{i,j-1},x_{i,j+1},...,x_{i,J})\). \(\hat{PD_j}(x_j)\) is the mean prediction when the explanatory variable is fixed at \(X_j = x_j\).
For ICE, we drop the averaging part \(\displaystyle \frac{1}{N} \sum_{i = 1}^{N}\) and add the subscript \(i\) for each individual observation.
\[ \large \hat{ICE_{i,j}}(x_j) = \hat{f}(x_j,x_{i,-j}) \]
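To connect the two formulas, here is a hand-rolled sketch of ICE in the same style as the PD sketch earlier; predict_fun and X are again hypothetical placeholders, not objects from this article.
# Hand-rolled ICE: same substitution as PD, but without the averaging, so every
# observation i keeps its own curve.
ice_hat <- function(predict_fun, X, j, grid = sort(unique(X[[j]]))) {
  sapply(grid, function(xj) {
    X_tmp <- X
    X_tmp[[j]] <- xj
    predict_fun(X_tmp)        # one prediction per observation (no mean)
  })                          # rows = observations i, columns = grid values of x_j
}
# PD is recovered by averaging the ICE curves: colMeans(ice_hat(predict_fun, X, "rm"))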
Let’s say you are predicting customer satisfaction at a restaurant. The restaurant has customers of all ages. One of your explanatory variables is the amount of food per dish (g). You assume that a younger customer would be more satisfied with more food on the dish, while older customers might not care about the amount of food. In a situation like that, if you only look at PD, you will miss the relationship between individual observations and the response variable.
Now, let’s see how to run ICE on an actual dataset.
# Set up
library(mlbench)
library(tidymodels)
library(DALEX)
library(ranger)
library(Rcpp)
library(corrplot)
library(ggplot2)
library(gridExtra)
data("BostonHousing")
df = BostonHousing
`%notin%` <- Negate(`%in%`)
Here is an overview of the dataset.
Details of Variables
head(df)
## crim zn indus chas nox rm age dis rad tax ptratio b lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
medv is our response variable; this is what we predict.
hist(df$medv, breaks = 20, main = 'Histogram of medv', xlab = 'medv ($ in 1,000)')
We won’t cover model building in this article. I used a random forest model (the ranger engine through parsnip).
split = initial_split(df, 0.8)
train = training(split)
test = testing(split)
model = rand_forest(trees = 100, min_n = 1, mtry = 13) %>%
set_engine(engine = "ranger", seed = 25) %>%
set_mode("regression")
fit = model %>%
fit(medv ~., data=train)
fit
## parsnip model object
##
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~13, x), num.trees = ~100, min.node.size = min_rows(~1, x), ~seed(25), num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
##
## Type: Regression
## Number of trees: 100
## Sample size: 404
## Number of independent variables: 13
## Mtry: 13
## Target node size: 1
## Variable importance mode: none
## Splitrule: variance
## OOB prediction error (MSE): 13.19693
## R squared (OOB): 0.8420363
result = test %>%
select(medv) %>%
bind_cols(predict(fit, test))
metrics = metric_set(rmse, rsq)
result %>%
metrics(medv, .pred)
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 3.52
## 2 rsq standard 0.860
Use the function explain to create an explainer object that helps us to interpret the model.
explainer = fit %>%
explain(
data = test %>% select(-medv),
y = test$medv
)
## Preparation of a new explainer is initiated
## -> model label : model_fit ( default )
## -> data : 102 rows 13 cols
## -> target variable : 102 values
## -> predict function : yhat.model_fit will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package parsnip , ver. 1.0.3 , task regression ( default )
## -> predicted values : numerical, min = 7.011 , mean = 23.07132 , max = 47.021
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -12.856 , mean = -0.05857843 , max = 12.788
## A new explainer has been created!
Use the predict_profile function to get the ICE plot. The source code of predict_profile is here. You can choose which variables to plot by passing a vector of variable names to the variables argument of the plot function. (The blue dots show the actual value of the variable and the corresponding prediction for each observation.)
ice = explainer %>%
predict_profile(
new_observation = test
)
plot(ice,variables = c("lstat", "rm", "dis", "crim"), title = "ICE for lstat, rm, dis, crim")
Method | Function |
---|---|
Permutation Feature Importance(PFI) | model_parts() |
Partial Dependence(PD) | model_profile() |
Individual Conditional Expectation(ICE) | predict_profile() |
SHAP | predict_parts() |
plot(ice,variables = c("rm"), title = "ICE for rm")
If you look at the outlier in the rm plot, while the other observations follow a common path, there is one observation whose prediction has the opposite relationship with rm.
cor <- df[,unlist(lapply(df, is.numeric))]
pairs(cor, pch = 19, upper.panel = NULL, panel=panel.smooth, col = dplyr::case_when(row(cor) == 373 ~"red",T ~"grey"))
# Merge prediction and test data.
test_pred = cbind(test, explainer$y_hat)
outlier = test_pred[(test_pred$`explainer$y_hat`>35)&(test_pred$rm<6),]
outlier
## crim zn indus chas nox rm age dis rad tax ptratio b lstat
## 369 4.89822 0 18.1 0 0.631 4.970 100.0 1.3325 24 666 20.2 375.52 3.26
## 373 8.26725 0 18.1 1 0.668 5.875 89.6 1.1296 24 666 20.2 347.88 8.88
## medv explainer$y_hat
## 369 50 37.212
## 373 50 39.172
index | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | b | lstat | medv | y_hat |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
373 | 8.26725 | 0 | 18.1 | 1 | 0.668 | 5.875 | 89.6 | 1.1296 | 24 | 666 | 20.2 | 347.88 | 8.88 | 50 | 40.461 |
With its extremely high rad and tax and low dis, we can assume this house is in the center of the city. Maybe there are not many expensive houses with fewer than 6 rooms, so our model could not do a good job on this observation (extrapolation). Or, as the number of rooms increases, the size of each room gets smaller and smaller, so the house price decreases along with the value of the house.
While PD cannot visualize individual relationships, ICE is a great way to visualize each observation’s relationship with the response variable. ICE is the substance behind PD (PD is just its average), so if you understand PD, I bet ICE was straightforward for you.
SHapley Additive exPlanations (SHAP) is a method for understanding how our model came to a certain decision. For example, if your task is to build a model for a loan screening process, you need to know why the model reached its conclusion, because rejected applicants will be desperate to know why they failed. The idea of SHAP is to allocate credit for the prediction among the explanatory variables. The analogy comes from cooperative game theory: a player can choose to join or not join a game, and if all players join except player A, the difference between the score of the game with all players and the game without player A is the credit attributed to player A. SHAP applies a similar process to the model to help us understand its behavior.
SHAP computes these credits by comparing the average prediction with an individual prediction. Consider a model \(\hat{f}(X)\) with explanatory variables \(X = (X_1,...,X_J)\), so the explanatory variables for observation \(i\) are \(x_i=(x_{i,1},...,x_{i,J})\). In SHAP, the credits are broken down into an additive form.
\[ \hat{f}(x_i) - \mathbb{E}\big[\hat{f}(X)\big] = \sum_{j = 1}^{J} \phi_{i,j} \]
Breaking the prediction down into a summation like this is called an Additive Feature Attribution Method, and SHAP is one of them. Often, by setting \(\phi_0 = \mathbb{E}\big[\hat{f}(X)\big]\), SHAP is written as
\[ \hat{f}(x_i) = \phi_{0} + \sum_{j = 1}^{J} \phi_{i,j} \]
For example, let’s say our goal is to predict annual salary from education, job title, job position, and English skill. The prediction formula would be
\[ \hat{f}(x_i) = \phi_0 + \phi_{i,\mathrm{education}} + \phi_{i,\mathrm{title}} + \phi_{i,\mathrm{position}} + \phi_{i,\mathrm{english}} \]
In this particular observation, we have someone who holds a Master’s degree, works as a Data Scientist, is a Manager, and does not speak English. The predicted salary is 10 million yen.
According to our model, if he improves his English, he has a better chance of being predicted a higher salary.
Let’s say there are three players, A, B, and C, in some cooperative game, and we want to calculate the contribution of each player. Here is a table of final scores for each set of players in the game.
Player | Score |
---|---|
A | 6 |
B | 4 |
C | 2 |
A,B | 20 |
A,C | 15 |
B,C | 10 |
A,B,C | 24 |
To calculate the contributions, we use the marginal contribution: the average difference between the game scores with and without a certain player. For example, to calculate the marginal contribution of player A:
Game 1: Player A plays the game alone and gets a score of 6. (None -> A)
\(Contribution_A = 6 - 0 = 6\)
Game 2: Player B is already playing alone (4), then A joins. A and B get a score of 20. (B -> A,B)
\(Contribution_A = 20 - 4 = 16\)
Game 3: Player C is already playing alone (2), then A joins. A and C get a score of 15. (C -> A,C)
\(Contribution_A = 15 - 2 = 13\)
Game 4: Players B and C are already playing (10), then A joins. A, B, and C get a score of 24. (B,C -> A,B,C)
\(Contribution_A = 24 - 10 = 14\)
Note that the contribution of A changes depending on who is already in the game. Therefore, we need to average the contributions over all participation orders.
Here is a table of marginal contribution with all possible participation order.
Participation Order | Contribution A | Contribution B | Contribution C |
---|---|---|---|
A-B-C | 6 | 14 | 4 |
A-C-B | 6 | 9 | 9 |
B-A-C | 16 | 4 | 4 |
B-C-A | 14 | 4 | 6 |
C-A-B | 13 | 9 | 2 |
C-B-A | 14 | 8 | 2 |
By taking average we can calculate the Shapley Value
\[ \phi_{A} = \dfrac{6+6+16+14+13+14}{6}=11.5 \]

##### Formula for Shapley Value
\[ \phi_{j} = \dfrac{1}{|\mathbf{J}|!} \sum_{\mathbf{S}\subseteq\mathbf{J}\setminus\{j\}} |\mathbf{S}|!\,(|\mathbf{J}|-|\mathbf{S}|-1)!\,\big(v(\mathbf{S}\cup\{j\})-v(\mathbf{S})\big) \]
\(\mathbf{J} = \{1,...,J\}\) is the set of players.
- In the case above, \(\mathbf{J} = \{A,B,C\}\)
\(|\mathbf{J}|\) is the number of elements in the set \(\mathbf{J}\).
- In the case above, \(|\mathbf{J}| = 3\)
\(\mathbf{S}\) ranges over the subsets of \(\mathbf{J}\) with player \(j\) removed.
- For the Shapley value of A, \(\mathbf{S}\) is \(\emptyset,\{B\},\{C\},\{B,C\}\)
\(|\mathbf{S}|\) is the number of elements in the set \(\mathbf{S}\).
- For \(\emptyset,\{B\},\{C\},\{B,C\}\), \(|\mathbf{S}|\) would be \(0,1,1,2\)
\(v\) is the game score.
- If A and B are in the game, the score is written \(v(\{A,B\})\)
\(v(\mathbf{S}\cup\{j\}) - v(\mathbf{S})\) is the difference in score when player \(j\) joins the game.
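To make the arithmetic concrete, here is a short R snippet that reproduces the marginal-contribution table and the Shapley values for the toy game above (the score vector v is taken directly from the table).
# Worked Shapley-value calculation for the toy cooperative game.
v <- c("A" = 6, "B" = 4, "C" = 2,
       "A,B" = 20, "A,C" = 15, "B,C" = 10, "A,B,C" = 24)
score <- function(players) {
  if (length(players) == 0) return(0)
  v[[paste(sort(players), collapse = ",")]]
}
orders <- list(c("A","B","C"), c("A","C","B"), c("B","A","C"),
               c("B","C","A"), c("C","A","B"), c("C","B","A"))
shapley <- sapply(c("A", "B", "C"), function(p) {
  mean(sapply(orders, function(ord) {
    before <- ord[seq_len(which(ord == p) - 1)]   # players already in the game
    score(c(before, p)) - score(before)           # marginal contribution of p
  }))
})
shapley   # A = 11.5 as calculated above (B = 8, C = 4.5)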
Now, let’s see how to run SHAP on an actual dataset.
# Set up
library(mlbench)
library(tidymodels)
library(DALEX)
library(ranger)
library(Rcpp)
library(corrplot)
library(ggplot2)
library(gridExtra)
library(SHAPforxgboost)
data("BostonHousing")
df = BostonHousing
d_mx <- as.matrix(df)
Here is an overview of the dataset.
Details of Variables
head(df)
## crim zn indus chas nox rm age dis rad tax ptratio b lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
medv is our response variable; this is what we predict.
hist(df$medv, breaks = 20, main = 'Histogram of medv', xlab = 'medv ($ in 1,000)')
We won’t cover model building in this article. I used a random forest model (the ranger engine through parsnip); xgboost is used separately at the end for the SHAPforxgboost summary plot.
split = initial_split(df, 0.8)
train = training(split)
test = testing(split)
model = rand_forest(trees = 100, min_n = 1, mtry = 13) %>%
set_engine(engine = "ranger", seed = 25) %>%
set_mode("regression")
fit = model %>%
fit(medv ~., data=train)
fit
## parsnip model object
##
## Ranger result
##
## Call:
## ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~13, x), num.trees = ~100, min.node.size = min_rows(~1, x), ~seed(25), num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
##
## Type: Regression
## Number of trees: 100
## Sample size: 404
## Number of independent variables: 13
## Mtry: 13
## Target node size: 1
## Variable importance mode: none
## Splitrule: variance
## OOB prediction error (MSE): 10.82015
## R squared (OOB): 0.8646467
result = test %>%
select(medv) %>%
bind_cols(predict(fit, test))
metrics = metric_set(rmse, rsq)
result %>%
metrics(medv, .pred)
## # A tibble: 2 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 3.06
## 2 rsq standard 0.912
Use the function explain to create an explainer object that helps us to interpret the model.
explainer = fit %>%
explain(
data = test %>% select(-medv),
y = test$medv
)
## Preparation of a new explainer is initiated
## -> model label : model_fit ( default )
## -> data : 102 rows 13 cols
## -> target variable : 102 values
## -> predict function : yhat.model_fit will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package parsnip , ver. 1.0.3 , task regression ( default )
## -> predicted values : numerical, min = 8.019 , mean = 22.78642 , max = 49.006
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -10.137 , mean = -0.2021078 , max = 9.822
## A new explainer has been created!
Use the predict_parts function to get the SHAP plot.
shap = explainer %>%
predict_parts(
new_observation = test %>% dplyr::slice(11),
type = "shap",
B = 25)
plot(shap)
By looking at the plot, we can tell that lstat has the highest contribution to this prediction. To interpret it, lstat pushes the predicted medv up, while rm pushes it down, and so on.
Method | Function |
---|---|
Permutation Feature Importance(PFI) | model_parts() |
Partial Dependence(PD) | model_profile() |
Individual Conditional Expectation(ICE) | predict_profile() |
SHAP | predict_parts() |
The SHAP approach can also be used to get a macro-level overview of the model by taking the mean absolute SHAP value over all observations. Thanks to Yang Liu for the all-in-one package that builds a model and draws SHAP plots. For more information, visit here.
# Note: The functions shap.score.rank, shap_long_hd and plot.shap.summary were
# originally published at https://liuyanguu.github.io/post/2018/10/14/shap-visualization-for-xgboost/
# All the credits to the author.
## functions for plot
# return matrix of shap score and mean ranked score list
shap.score.rank <- function(xgb_model = xgb_mod, shap_approx = TRUE,
X_train = mydata$train_mm){
require(xgboost)
require(data.table)
shap_contrib <- predict(xgb_model, X_train,
predcontrib = TRUE, approxcontrib = shap_approx)
shap_contrib <- as.data.table(shap_contrib)
shap_contrib[,BIAS:=NULL]
cat('make SHAP score by decreasing order\n\n')
mean_shap_score <- colMeans(abs(shap_contrib))[order(colMeans(abs(shap_contrib)), decreasing = T)]
return(list(shap_score = shap_contrib,
mean_shap_score = (mean_shap_score)))
}
# Note: The function var_importance was
# originally published at https://github.com/pablo14/shap-values/blob/master/shap.R
# All the credits to the author.
var_importance <- function(shap_result, top_n=10)
{
var_importance=tibble(var=names(shap_result$mean_shap_score), importance=shap_result$mean_shap_score)
var_importance=var_importance[1:top_n,]
ggplot(var_importance, aes(x=reorder(var,importance), y=importance)) +
geom_bar(stat = "identity") +
coord_flip() +
theme_light() +
theme(axis.title.y=element_blank())
}
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ✔ stringr 1.5.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor() masks scales::col_factor()
## ✖ gridExtra::combine() masks dplyr::combine()
## ✖ purrr::discard() masks scales::discard()
## ✖ DALEX::explain() masks dplyr::explain()
## ✖ dplyr::filter() masks stats::filter()
## ✖ stringr::fixed() masks recipes::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ readr::spec() masks yardstick::spec()
library(xgboost)
##
## Attaching package: 'xgboost'
##
## The following object is masked from 'package:dplyr':
##
## slice
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following objects are masked from 'package:yardstick':
##
## precision, recall, sensitivity, specificity
##
## The following object is masked from 'package:purrr':
##
## lift
dummy <- dummyVars(" ~ .", data=df)
newdata <- data.frame(predict(dummy, newdata = df)) %>%
select(-medv)
model_wine <- xgboost(data = as.matrix(newdata),
label = df$medv,
objective = "reg:squarederror",
nrounds = 100)
## [1] train-rmse:17.059007
## [2] train-rmse:12.263687
## [3] train-rmse:8.931146
## [4] train-rmse:6.572166
## [5] train-rmse:4.907308
## [6] train-rmse:3.733522
## [7] train-rmse:2.912337
## [8] train-rmse:2.362943
## [9] train-rmse:1.953651
## [10] train-rmse:1.684866
## [11] train-rmse:1.528102
## [12] train-rmse:1.410980
## [13] train-rmse:1.307722
## [14] train-rmse:1.230047
## [15] train-rmse:1.126096
## [16] train-rmse:1.064402
## [17] train-rmse:1.036869
## [18] train-rmse:0.972606
## [19] train-rmse:0.902468
## [20] train-rmse:0.873432
## [21] train-rmse:0.848563
## [22] train-rmse:0.795342
## [23] train-rmse:0.771540
## [24] train-rmse:0.718620
## [25] train-rmse:0.693590
## [26] train-rmse:0.639754
## [27] train-rmse:0.617747
## [28] train-rmse:0.575608
## [29] train-rmse:0.554515
## [30] train-rmse:0.540946
## [31] train-rmse:0.514530
## [32] train-rmse:0.499325
## [33] train-rmse:0.483672
## [34] train-rmse:0.471477
## [35] train-rmse:0.454060
## [36] train-rmse:0.445097
## [37] train-rmse:0.416921
## [38] train-rmse:0.393740
## [39] train-rmse:0.369955
## [40] train-rmse:0.357643
## [41] train-rmse:0.345047
## [42] train-rmse:0.337411
## [43] train-rmse:0.327211
## [44] train-rmse:0.313934
## [45] train-rmse:0.295142
## [46] train-rmse:0.286331
## [47] train-rmse:0.276405
## [48] train-rmse:0.262147
## [49] train-rmse:0.251315
## [50] train-rmse:0.245113
## [51] train-rmse:0.239266
## [52] train-rmse:0.224254
## [53] train-rmse:0.221726
## [54] train-rmse:0.216645
## [55] train-rmse:0.213014
## [56] train-rmse:0.203666
## [57] train-rmse:0.198869
## [58] train-rmse:0.187178
## [59] train-rmse:0.184667
## [60] train-rmse:0.176635
## [61] train-rmse:0.168293
## [62] train-rmse:0.162576
## [63] train-rmse:0.154729
## [64] train-rmse:0.153375
## [65] train-rmse:0.145142
## [66] train-rmse:0.143298
## [67] train-rmse:0.137444
## [68] train-rmse:0.135575
## [69] train-rmse:0.130472
## [70] train-rmse:0.128452
## [71] train-rmse:0.123708
## [72] train-rmse:0.115479
## [73] train-rmse:0.112092
## [74] train-rmse:0.107142
## [75] train-rmse:0.106020
## [76] train-rmse:0.099164
## [77] train-rmse:0.094449
## [78] train-rmse:0.090660
## [79] train-rmse:0.088310
## [80] train-rmse:0.085793
## [81] train-rmse:0.083700
## [82] train-rmse:0.077640
## [83] train-rmse:0.072620
## [84] train-rmse:0.069856
## [85] train-rmse:0.067911
## [86] train-rmse:0.066260
## [87] train-rmse:0.062791
## [88] train-rmse:0.060610
## [89] train-rmse:0.057933
## [90] train-rmse:0.055341
## [91] train-rmse:0.053233
## [92] train-rmse:0.051275
## [93] train-rmse:0.049004
## [94] train-rmse:0.048386
## [95] train-rmse:0.046329
## [96] train-rmse:0.045065
## [97] train-rmse:0.042863
## [98] train-rmse:0.041238
## [99] train-rmse:0.039000
## [100] train-rmse:0.037987
shap_result_wine <- shap.score.rank(xgb_model = model_wine,
X_train = as.matrix(newdata),
shap_approx = F
)
## Loading required package: data.table
##
## Attaching package: 'data.table'
##
## The following object is masked from 'package:purrr':
##
## transpose
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## make SHAP score by decreasing order
var_importance(shap_result_wine, top_n = ncol(d_mx) - 1 )
Just like for that single observation, lstat and rm are the most important features across the whole dataset.
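If you prefer to stay within DALEX instead of refitting an xgboost model, a rough alternative (my own sketch, not from the book or the DALEX documentation) is to loop predict_parts() over a sample of observations and average the absolute contributions per variable; I am assuming here that the returned shap object exposes variable_name and contribution columns.
# Rough DALEX-only sketch of mean |SHAP| per variable over a sample of rows.
library(purrr)
library(dplyr)
mean_abs_shap <- map_dfr(1:20, function(i) {
  sh <- predict_parts(explainer,
                      new_observation = test[i, ],
                      type = "shap", B = 5)
  data.frame(variable = sh$variable_name, contribution = sh$contribution)
}) %>%
  group_by(variable) %>%
  summarise(mean_abs_contribution = mean(abs(contribution))) %>%
  arrange(desc(mean_abs_contribution))
mean_abs_shap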
SHAP is an effective method for observing each variable’s contribution to an individual prediction, and it can also be helpful for getting an overview of the whole model. However, SHAP cannot explain the relationship between an explanatory variable and the response variable, which can be explained with ICE.
It may not be right to pair up tax and rad variables without decent causal inference.↩︎