Introduction

This article is based on 「機械学習を解釈する技術 ~Techniques for Interpreting Machine Learning~」 by Mitsunosuke Morishita. The book does not go through all of the methods in R, so I decided to write a brief note with R scripts.

Permutation Feature Importance

Permutation Feature Importance (PFI) is defined as the decrease in a model score when a single feature value is randomly shuffled (scikit-learn). This procedure breaks the relationship between the feature and the target, so the drop in the model score indicates how much the model depends on that feature. Here are five simple steps for PFI (a minimal hand-rolled R sketch follows the list):

  1. Predict the target with ALL explanatory variables and calculate the prediction error; this is the baseline
  2. Pick one explanatory variable and permute/shuffle it in the dataset. Predict the target and calculate the prediction error
  3. Calculate the difference between the prediction errors from steps 1 and 2. This difference is the importance of the variable picked in step 2
  4. Repeat steps 2 and 3 for all explanatory variables
  5. Compare the importance of all variables and analyze
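
To make the steps concrete, here is a minimal hand-rolled sketch in R. It assumes a parsnip regression fit and a test data frame like the fit and test objects created later in this article, and that predict() returns a .pred column; DALEX::model_parts(), used below, does all of this for us, so this is illustration only.

# Minimal PFI sketch: importance = increase in RMSE after shuffling one variable
pfi_manual <- function(fit, data, target = "medv", n_shuffle = 10) {
  rmse_of <- function(d) sqrt(mean((d[[target]] - predict(fit, d)$.pred)^2))
  baseline <- rmse_of(data)                       # step 1: baseline error
  vars <- setdiff(colnames(data), target)
  sapply(vars, function(v) {                      # step 4: loop over variables
    mean(replicate(n_shuffle, {
      shuffled <- data
      shuffled[[v]] <- sample(shuffled[[v]])      # step 2: shuffle one variable
      rmse_of(shuffled) - baseline                # step 3: error increase
    }))
  })
}
# sort(pfi_manual(fit, test), decreasing = TRUE)  # step 5: compare importances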

Execution with Real Data

Now, let’s see how to run PFI with an actual dataset.

Get Dataset

# Set up
library(mlbench)
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom        1.0.2      ✔ recipes      1.0.3 
## ✔ dials        1.1.0      ✔ rsample      1.1.1 
## ✔ dplyr        1.0.10     ✔ tibble       3.1.8 
## ✔ ggplot2      3.4.0      ✔ tidyr        1.3.0 
## ✔ infer        1.0.4      ✔ tune         1.0.1 
## ✔ modeldata    1.0.1      ✔ workflows    1.1.2 
## ✔ parsnip      1.0.3      ✔ workflowsets 1.0.0 
## ✔ purrr        1.0.1      ✔ yardstick    1.1.0
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ purrr::discard() masks scales::discard()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ✖ recipes::step()  masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
library(DALEX)
## Welcome to DALEX (version: 2.4.2).
## Find examples and detailed introduction at: http://ema.drwhy.ai/
## 
## Attaching package: 'DALEX'
## The following object is masked from 'package:dplyr':
## 
##     explain
library(ranger)
library(Rcpp)
## 
## Attaching package: 'Rcpp'
## The following object is masked from 'package:rsample':
## 
##     populate
library(corrplot)
## corrplot 0.92 loaded
data("BostonHousing")
df = BostonHousing
`%notin%` <- Negate(`%in%`)

Overview of the Dataset

Here is an overview of the dataset.

Details of Variables

head(df)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7

medv is our response variable; this is what we predict.

hist(df$medv, breaks = 20, main = 'Histogram of medv', xlab = 'medv ($ in 1,000)')

Build a Model

We won’t cover model building in detail in this article. I used a random forest model with the ranger engine.

split = initial_split(df, 0.8)
train = training(split)
test = testing(split)

model = rand_forest(trees = 100, min_n = 1, mtry = 13) %>% 
  set_engine(engine = "ranger", seed(25)) %>% 
  set_mode("regression")

fit = model %>% 
  fit(medv ~., data=train)
fit
## parsnip model object
## 
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~13,      x), num.trees = ~100, min.node.size = min_rows(~1, x), ~seed(25),      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1)) 
## 
## Type:                             Regression 
## Number of trees:                  100 
## Sample size:                      404 
## Number of independent variables:  13 
## Mtry:                             13 
## Target node size:                 1 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       10.70344 
## R squared (OOB):                  0.8676068

Predict medv

result = test %>% 
  select(medv) %>% 
  bind_cols(predict(fit, test))

metrics = metric_set(rmse, rsq)

result %>% 
  metrics(medv, .pred)
## # A tibble: 2 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       4.24 
## 2 rsq     standard       0.823

Interpret Feature Importance

Use the function explain to create an explainer object that helps us to interpret the model.

explainer = fit %>% 
  explain(
    data = test %>% select(-medv),
    y = test$medv
  )
## Preparation of a new explainer is initiated
##   -> model label       :  model_fit  (  default  )
##   -> data              :  102  rows  13  cols 
##   -> target variable   :  102  values 
##   -> predict function  :  yhat.model_fit  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package parsnip , ver. 1.0.3 , task regression (  default  ) 
##   -> predicted values  :  numerical, min =  9.321 , mean =  22.52986 , max =  46.632  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -23.066 , mean =  0.1956275 , max =  17.366  
##   A new explainer has been created!

Use the model_parts function to get PFI. Here you can see that rm and lstat are the top two most important variables for predicting medv. The dark blue box plots show the distribution of the loss, because we calculate it multiple times.

  • loss_function: the evaluation metric (loss function)
  • B: the number of shuffles
  • type: the method of calculating importance; “difference” or “ratio” are applicable
pfi = explainer %>% 
  model_parts(
    loss_function = loss_root_mean_square,
    B = 10,
    type = "difference"
  )

plot(pfi)

FYI

Method Function
Permutation Feature Importance(PFI) model_parts()
Partial Dependence(PD) model_profile()
Individual Conditional Expectation(ICE) predict_profile()
SHAP predict_parts()

Grouped Permutation Feature Importance (GPFI)

If some explanatory variables are correlated with each other, PFI won’t work well. Let’s say \(X0\) and \(X1\) are correlated. While calculating the importance of \(X0\), the model can still use \(X1\) for prediction, so the model’s performance does not decrease much and PFI underestimates the importance of \(X0\) (and, by the same argument, of \(X1\)). In the plot below, rad (index of accessibility to radial highways) and tax (full-value property-tax rate per $10,000) are strongly correlated. In a situation like this, we should shuffle both variables together. We should also use GPFI when variables are one-hot encoded, or when dealing with data like latitudes and longitudes.
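
As a sketch of the idea (reusing the hypothetical pfi_manual() helper from the PFI section; illustration only), shuffling a group jointly just means permuting all grouped columns with the same random row order:

# Grouped PFI sketch: shuffle all columns in `group` with one common permutation
gpfi_manual <- function(fit, data, group = c("tax", "rad"),
                        target = "medv", n_shuffle = 10) {
  rmse_of <- function(d) sqrt(mean((d[[target]] - predict(fit, d)$.pred)^2))
  baseline <- rmse_of(data)
  mean(replicate(n_shuffle, {
    shuffled <- data
    idx <- sample(nrow(shuffled))
    shuffled[group] <- shuffled[idx, group]   # keep tax and rad paired while shuffling
    rmse_of(shuffled) - baseline
  }))
}
# gpfi_manual(fit, test)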

cor  <- df[,unlist(lapply(df, is.numeric))  ]
corrplot(cor(cor), method = 'number', order = 'alphabet')

Rad and Tax

So let’s run GPFI on our dataset. The model_parts function has a variable_groups argument. It takes a list object, so we make a list that contains the names of the explanatory variables to group, in this case rad and tax.1 The source code of feature_importance is here.

# Make list
paired_var = list(c("tax","rad"))

# Make a vector of explanatory variables. Do not forget to take out your response variable

all_vars = c(colnames(df)[colnames(df) %notin% c("tax","rad","medv")])

# Gather 
var_list = c(all_vars, paired_var)
                  
pfi = explainer %>% 
  model_parts(
    loss_function = loss_root_mean_square,
    B = 10,
    type = "difference",
    variable_groups = var_list
  )

# Plot the grouped importance
plot(pfi)

If you also keep tax and rad individually in the plot, you can see that the importance of tax and rad is dispersed between them.

# Make list
paired_var = list(c("tax","rad"))

# Make a vector of explanatory variables. Do not forget to take out your response variable

all_vars = c(colnames(df)[colnames(df) %notin% c("medv")])

# Gather 
var_list = c(all_vars, paired_var)
                  
pfi = explainer %>% 
  model_parts(
    loss_function = loss_root_mean_square,
    B = 10,
    type = "difference",
    variable_groups = var_list
  )

# Plot the grouped importance
plot(pfi)

Conclusion

PFI and GPFI are efficient methods for calculating the importance of explanatory variables in a model. On the other hand, PFI does not explain how each variable affects the model’s predictions. That can be done with Partial Dependence (PD).

Partial Dependence

Partial Dependence (PD) is a method to show the dependence between the response variable and an explanatory variable. By changing the value of the explanatory variable while keeping track of the prediction, we can understand the relationship between the explanatory variable and the response variable. For example, if we increase the value of explanatory variable \(X0\) and the prediction also increases, these variables have a positive relationship. In addition, by plotting these outputs, we can determine whether the relationship is linear or non-linear.

Formula for PD

Here comes your favorite part. Let’s say your trained model is \(\hat{f}(X_0,X_1,X_2)\) and you would like to find the PD of \(X_0\) with respect to the prediction \(\hat{y}\). In this situation, we change only the value of \(X_0\) and average the predictions.

\[ \large \hat{PD_0}(x_0) = \displaystyle \frac{1}{N} \sum_{i = 1}^{N} \hat{f}(x_0,x_{i,1},x_{i,2}) \]

We use lowercase \(x\) to express a directly specified value.
For example, if we specify \(X_0\) to be 1, we substitute \(x_0 = 1\), so the function looks like
\(\hat{PD_0}(1) = \displaystyle \frac{1}{N} \sum_{i = 1}^{N} \hat{f}(1,x_{i,1},x_{i,2})\)
Or, if we specify \(X_0\) to be 4, then
\(\hat{PD_0}(4) = \displaystyle \frac{1}{N} \sum_{i = 1}^{N} \hat{f}(4,x_{i,1},x_{i,2})\)
Here \(x_{i,1}\) and \(x_{i,2}\) take the \(i\)th observation of \(X_1\) and \(X_2\).

Or, to generalize, define your set of explanatory variables as \(X = (X_0,...,X_J)\) and your trained model as \(\hat{f}(X)\). Your target explanatory variable is \(X_j\), and \(X\) without \(X_j\) is \(X_{-j} = (X_0,...,X_{j-1},X_{j+1},...,X_J)\). The value of \(X_j\) for the \(i\)th observation is \(x_{i,j}\), and the \(i\)th observation without \(x_{i,j}\) is \(x_{i,-j} = (x_{i,0},...,x_{i,j-1},x_{i,j+1},...,x_{i,J})\).
When the explanatory variable is fixed at \(X_j = x_j\), \(\hat{PD_j}(x_j)\), i.e. the mean prediction, is

\[ \large \hat{PD_j}(x_j) = \displaystyle \frac{1}{N} \sum_{i = 1}^{N} \hat{f}(x_j,x_{i,-j}) \]

FYI

A method like this, which calculates the effect of the variable of interest by averaging over the other variables to cancel out their effects, is called marginalization. If you want to learn more about marginalization, visit here.
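
For illustration only, here is a minimal hand-rolled sketch of this formula, assuming a parsnip regression fit and a data frame of explanatory variables like the fit and test objects built in the next section; model_profile(), used below, computes this for us (with its own grid and sampling).

# Minimal PD sketch: fix X_j = x_j for every observation and average the predictions
pd_manual <- function(fit, data, var, grid = NULL) {
  if (is.null(grid)) grid <- seq(min(data[[var]]), max(data[[var]]), length.out = 20)
  sapply(grid, function(x0) {
    data[[var]] <- x0                  # set X_j = x_j for all observations
    mean(predict(fit, data)$.pred)     # average prediction = PD_j(x_j)
  })
}
# pd_manual(fit, test %>% select(-medv), "rm")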

Execution with Real Data

Now, let’s see how to run PD with an actual dataset.

Get Dataset

# Set up
library(mlbench)
library(tidymodels)
library(DALEX)
library(ranger)
library(Rcpp)
library(corrplot)
library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
data("BostonHousing")
df = BostonHousing
`%notin%` <- Negate(`%in%`)

Overview of the Dataset

Here is an overview of the dataset.

Details of Variables

head(df)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7

medv is our response variable; this is what we predict.

hist(df$medv, breaks = 20, main = 'Histogram of medv', xlab = 'medv ($ in 1,000)')

Build a Model

We won’t cover model building in detail in this article. I used a random forest model with the ranger engine.

split = initial_split(df, 0.8)
train = training(split)
test = testing(split)

model = rand_forest(trees = 100, min_n = 1, mtry = 13) %>% 
  set_engine(engine = "ranger", seed(25)) %>% 
  set_mode("regression")

fit = model %>% 
  fit(medv ~., data=train)
fit
## parsnip model object
## 
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~13,      x), num.trees = ~100, min.node.size = min_rows(~1, x), ~seed(25),      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1)) 
## 
## Type:                             Regression 
## Number of trees:                  100 
## Sample size:                      404 
## Number of independent variables:  13 
## Mtry:                             13 
## Target node size:                 1 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       11.63041 
## R squared (OOB):                  0.8625487

Predict medv

result = test %>% 
  select(medv) %>% 
  bind_cols(predict(fit, test))

metrics = metric_set(rmse, rsq)

result %>% 
  metrics(medv, .pred)
## # A tibble: 2 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       3.10 
## 2 rsq     standard       0.889

Interpret PD

Use the function explain to create an explainer object that helps us to interpret the model.

explainer = fit %>% 
  explain(
    data = test %>% select(-medv),
    y = test$medv
  )
## Preparation of a new explainer is initiated
##   -> model label       :  model_fit  (  default  )
##   -> data              :  102  rows  13  cols 
##   -> target variable   :  102  values 
##   -> predict function  :  yhat.model_fit  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package parsnip , ver. 1.0.3 , task regression (  default  ) 
##   -> predicted values  :  numerical, min =  8.595 , mean =  22.59445 , max =  49.12  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -8.21 , mean =  0.3594706 , max =  12.805  
##   A new explainer has been created!

Use the model_profile function to get the PD plot. Here you can see that lstat, rm, and dis (the top three predictors by PFI) have clear relationships with the prediction. The source code of partial_dependence is here.

pd = explainer %>% 
  model_profile()
plot(pd)

You can designate which variables to plot by giving the variables argument a vector of variable names.

pd = explainer %>% 
  model_profile(
    variables = c("lstat", "rm", "dis", "crim")
  )
plot(pd)

FYI

Method Function
Permutation Feature Importance(PFI) model_parts()
Partial Dependence(PD) model_profile()
Individual Conditional Expectation(ICE) predict_profile()
SHAP predict_parts()

dis, crim, and medv

Some of you might ask whether PD is just the same thing as looking at scatter plots like the ones below. However, there is a big difference between a scatter plot and a PD plot.

cor  <- df[,unlist(lapply(df, is.numeric))]

pairs(cor, pch = 19, upper.panel = NULL, panel=panel.smooth)

For example, take a look at the scatter plots of dis (weighted distances to five Boston employment centers), crim (per capita crime rate by town), and medv (median value of owner-occupied homes in $1,000s). Looking at the first plot, we can observe that dis and medv have a positive relationship. Since employment centers are located in the center of the city, this suggests that home prices increase as you move away from the city, which is not very intuitive. Looking at crim and dis (the middle plot), as the distance increases, the crime rate decreases; in Boston, the neighborhood gets safer as you move away from the city center. Accordingly, in the third plot, the house price is higher where the crime rate is lower. This explains the positive relationship between the home price and the distance from the center.

p1 <- qplot(dis, medv, data = df) +
  geom_smooth()
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
p2 <- qplot(dis, log(crim), data = df) +
  geom_smooth()

p3 <- qplot(log(crim), medv, data = df) +
  geom_smooth()

grid.arrange(p1,p2,p3,
  nrow = 1,
  top = "medv:price of house, dis:distance, crim:crime rate")

In the PD plot, medv (y) and dis (x) have the opposite relationship to the one in the scatter plot. As you can see, the PD plot reveals the relationship that is hidden in the raw scatter plot.

pd = explainer %>% 
  model_profile(
    variables = c("dis")
  )
plot(pd)

Conclusion

PD is an efficient method for observing variable relationships in a model. However, PD averages over all observations to visualize the relationship. If individual observations have different effects on the response variable, PD is not able to catch those effects. In a situation like that, Individual Conditional Expectation is capable of handling them.

Individual Conditional Expectation (ICE)

Individual Conditional Expectation (ICE) is a method for visualizing and understanding the relationship between the response variable and an individual observation. The technique for calculating ICE is pretty simple if you are familiar with calculating Partial Dependence (PD): PD is just the mean of ICE. PD is calculated by changing the value of an explanatory variable while keeping track of the prediction. For example, if we increase the value of explanatory variable \(X0\) and the prediction also increases, these variables have a positive relationship. In PD, we average over all observations to get an overview of the relationship, but in ICE, we plot each observation individually.

Formula for ICE

The ICE formula looks like the formula for PD. For review, PD is

\[ \large \hat{PD_j}(x_j) = \displaystyle \frac{1}{N} \sum_{i = 1}^{N} \hat{f}(x_j,x_{i,-j}) \]

Your set of explanatory variables is \(X = (X_0,...,X_J)\), and your trained model is \(\hat{f}(X)\). Your target explanatory variable is \(X_j\), and \(X\) without \(X_j\) is \(X_{-j} = (X_0,...,X_{j-1},X_{j+1},...,X_J)\). The value of \(X_j\) for the \(i\)th observation is \(x_{i,j}\), and the \(i\)th observation without \(x_{i,j}\) is \(x_{i,-j} = (x_{i,0},...,x_{i,j-1},x_{i,j+1},...,x_{i,J})\). When the explanatory variable is fixed at \(X_j = x_j\), the mean prediction is \(\hat{PD_j}(x_j)\), as above.

For ICE, we drop the averaging part \(\displaystyle \frac{1}{N} \sum_{i = 1}^{N}\) and add a subscript \(i\) for the individual observation.

\[ \large \hat{ICE_{i,j}}(x_j) = \hat{f}(x_j,x_{i,-j}) \]
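
For illustration only, here is a minimal sketch of this formula under the same assumptions as the earlier hand-rolled helpers (a parsnip fit and a data frame of explanatory variables); predict_profile(), used later, does this for us.

# Minimal ICE sketch: one curve per observation (rows), one column per grid value
ice_manual <- function(fit, data, var, grid = NULL) {
  if (is.null(grid)) grid <- seq(min(data[[var]]), max(data[[var]]), length.out = 20)
  sapply(grid, function(x0) {
    data[[var]] <- x0            # set X_j = x_j for observation i
    predict(fit, data)$.pred     # keep individual predictions, no averaging
  })
}
# Averaging the rows of ice_manual() over observations recovers the PD curve.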

Why? - Limitations of PD

Let’s say you are predicting customer satisfaction at a restaurant. The restaurant has a variety of customers of all ages. One of your explanatory variables is the amount of food per dish (in grams). You might assume that a younger customer would be more satisfied with more food on the dish, while older people might not care about the amount of food. In a situation like that, if you only look at PD, you will miss the relationship between the individual observations and the response variable.

Execution with Real Data

Now, let’s see how to run ICE with an actual dataset.

Get Dataset

# Set up
library(mlbench)
library(tidymodels)
library(DALEX)
library(ranger)
library(Rcpp)
library(corrplot)
library(ggplot2)
library(gridExtra)

data("BostonHousing")
df = BostonHousing
`%notin%` <- Negate(`%in%`)

Overview of the Dataset

Here is an overview of the dataset.

Details of Variables

head(df)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7

medv is our response variable; this is what we predict.

hist(df$medv, breaks = 20, main = 'Histogram of medv', xlab = 'medv ($ in 1,000)')

Build a Model

We won’t cover model building in detail in this article. I used a random forest model with the ranger engine.

split = initial_split(df, 0.8)
train = training(split)
test = testing(split)

model = rand_forest(trees = 100, min_n = 1, mtry = 13) %>% 
  set_engine(engine = "ranger", seed(25)) %>% 
  set_mode("regression")

fit = model %>% 
  fit(medv ~., data=train)
fit
## parsnip model object
## 
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~13,      x), num.trees = ~100, min.node.size = min_rows(~1, x), ~seed(25),      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1)) 
## 
## Type:                             Regression 
## Number of trees:                  100 
## Sample size:                      404 
## Number of independent variables:  13 
## Mtry:                             13 
## Target node size:                 1 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       13.19693 
## R squared (OOB):                  0.8420363

Predict medv

result = test %>% 
  select(medv) %>% 
  bind_cols(predict(fit, test))

metrics = metric_set(rmse, rsq)

result %>% 
  metrics(medv, .pred)
## # A tibble: 2 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       3.52 
## 2 rsq     standard       0.860

Interpret ICE

Use the function explain to create an explainer object that helps us to interpret the model.

explainer = fit %>% 
  explain(
    data = test %>% select(-medv),
    y = test$medv
  )
## Preparation of a new explainer is initiated
##   -> model label       :  model_fit  (  default  )
##   -> data              :  102  rows  13  cols 
##   -> target variable   :  102  values 
##   -> predict function  :  yhat.model_fit  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package parsnip , ver. 1.0.3 , task regression (  default  ) 
##   -> predicted values  :  numerical, min =  7.011 , mean =  23.07132 , max =  47.021  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -12.856 , mean =  -0.05857843 , max =  12.788  
##   A new explainer has been created!

Use the predict_profile function to get the ICE plot. The source code of predict_profile is here. You can designate which variables to plot by giving the variables argument of the plot function a vector of variable names. (The blue dots are the actual prediction and variable value for each observation.)

ice = explainer %>% 
  predict_profile(
    new_observation = test
  )
plot(ice,variables = c("lstat", "rm", "dis", "crim"), title = "ICE for lstat, rm, dis, crim")

FYI

Method Function
Permutation Feature Importance(PFI) model_parts()
Partial Dependence(PD) model_profile()
Individual Conditional Expectation(ICE) predict_profile()
SHAP predict_parts()

rm

plot(ice,variables = c("rm"), title = "ICE for rm")

If you look at the outlier in the rm plot, while the other observations follow a common path, there is one unique observation whose prediction has the opposite relationship with rm.

cor  <- df[,unlist(lapply(df, is.numeric))]
pairs(cor, pch = 19, upper.panel = NULL, panel=panel.smooth, col = dplyr::case_when(row(cor) == 373 ~"red",T ~"grey"))

# Merge prediction and test data.  
test_pred = cbind(test, explainer$y_hat)
outlier = test_pred[(test_pred$`explainer$y_hat`>35)&(test_pred$rm<6),]
outlier
##        crim zn indus chas   nox    rm   age    dis rad tax ptratio      b lstat
## 369 4.89822  0  18.1    0 0.631 4.970 100.0 1.3325  24 666    20.2 375.52  3.26
## 373 8.26725  0  18.1    1 0.668 5.875  89.6 1.1296  24 666    20.2 347.88  8.88
##     medv explainer$y_hat
## 369   50          37.212
## 373   50          39.172
index crim zn indus chas nox rm age dis rad tax ptratio b lstat medv y_hat
373 8.26725 0 18.1 1 0.668 5.875 89.6 1.1296 24 666 20.2 347.88 8.88 50 40.461

With its extremely high rad and tax and its low dis, we can assume the house is in the center of the city. Maybe there are not many similar houses with more than six rooms, so our model could not do a good job there (extrapolation). Or, as the number of rooms increases, each room gets smaller and smaller, so the price drops because the value of the house decreases.

Conclusion

While PD cannot visualize individual relationships, ICE is a great way to visualize each observation’s relationship with the response variable. ICE is the building block of PD, so if you understand PD, I bet ICE was straightforward for you.

SHapley Additive exPlanations (SHAP)

SHapley Additive exPlanations (SHAP) is a method for understanding how our AI model came to a certain decision. For example, if your task is to build an AI for a loan screening process, you need to be able to explain why the AI came to its conclusion, because your applicants will be desperate to know why they failed. The idea of SHAP is to allocate/calculate the credit of each explanatory variable for the prediction. In cooperative game theory, a player can choose to “join” or “not join” a game. If all players except player A join the game, the difference between the result of the game with all players and the game without player A is the credit for player A. SHAP applies a similar process to the model to help us understand its behavior.

Additive Feature Attribution Method

SHAP computes credits by calculating the difference between the overall (expected) prediction and an individual prediction. Consider a model \(\hat{f}(X)\) with explanatory variables \(X = (X_1,...,X_J)\), so the explanatory variables for observation \(i\) are \(x_i=(x_{i,1},...,x_{i,J})\). In SHAP, the credits are broken down into an additive form.

\[ \hat{f}(x_i) - \mathbb{E}\big[\hat{f}(X)\big] = \sum_{j = 1}^{J} \phi_{i,j} \]

Breaking the prediction down into a summation form like this is called an Additive Feature Attribution Method, and SHAP is one such method. Often, by setting \(\phi_0 = \mathbb{E}\big[\hat{f}(X)\big]\), SHAP is written as

\[ \hat{f}(x_i) = \phi_{0} + \sum_{j = 1}^{J} \phi_{i,j} \]

For example, let’s say our goal is to predict annual salary from education, job title, job position, and English skill. The prediction formula would be

\[ \hat{f}(x_i) = \phi_0 + \phi_{i,\mathrm{education}} + \phi_{i,\mathrm{title}} + \phi_{i,\mathrm{position}} + \phi_{i,\mathrm{english}} \]

For a particular observation, suppose we have someone who has a Master’s degree, works as a Data Scientist, is a Manager, and does not speak English, and the predicted salary is 10 million yen.
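
As a purely hypothetical decomposition (all numbers are made up for illustration, in millions of yen), the credits might look like this, with English contributing negatively:

\[ \underbrace{10}_{\hat{f}(x_i)} = \underbrace{5.5}_{\phi_0} + \underbrace{1.5}_{\phi_{i,\mathrm{education}}} + \underbrace{2.0}_{\phi_{i,\mathrm{title}}} + \underbrace{2.0}_{\phi_{i,\mathrm{position}}} + \underbrace{(-1.0)}_{\phi_{i,\mathrm{english}}} \]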

According to our AI model, if he improves his English, he has a better chance of being predicted a higher salary.

Shapley Value

  • “The Shapley value is a solution concept used in game theory that involves fairly distributing both gains and costs to several actors working in coalition. Game theory is when two or more players or factors are involved in a strategy to achieve a desired outcome or payoff.” - Will Kenton, Investopedia

Let’s say there are three players A, B, and C in some cooperative game, and we want to somehow calculate the contribution of each player. Here is a table of final scores for each set of players in the game.

Player Score
A 6
B 4
C 2
A,B 20
A,C 15
B,C 10
A,B,C 24

To calculate the contribution, we use the marginal contribution. The marginal contribution measures a player’s effort as the average difference between the game scores with and without that player. For example, if you want to calculate the marginal contribution of player A:

Game 1
Player A plays the game alone; A gets a score of 6. None -> A
\(Contribution_A = 6 - 0 = 6\)

Game 2
Player B plays the game alone (4), then A joins the game. A and B get a score of 20. B -> A,B
\(Contribution_A = 20 - 4 = 16\)

Game 3
Player C plays the game alone (2), then A joins the game. A and C get a score of 15. C -> A,C
\(Contribution_A = 15 - 2 = 13\)

Game 4
Players B and C play the game (10), then A joins the game. A, B, and C get a score of 24. B,C -> A,B,C
\(Contribution_A = 24 - 10 = 14\)

Notice that the contribution of A changes depending on who is already in the game. Therefore, we need to average these contributions.

Here is a table of marginal contributions for all possible participation orders.

Participation Order Contribution A Contribution B Contribution C
A-B-C 6 14 4
A-C-B 6 9 9
B-A-C 16 4 4
B-C-A 14 4 6
C-A-B 13 9 2
C-B-A 14 8 2

By taking the average, we can calculate the Shapley value:

\[ \phi_{A} = \dfrac{6+6+16+14+13+14}{6}=11.5 \]

Formula for Shapley Value

\[ \phi_{j} = \dfrac{1}{|\mathbf{J}|!} \sum_{\mathbf{S}\subseteq\mathbf{J}-j} (|\mathbf{S}|!(|\mathbf{J}|-|\mathbf{S}|-1)!) (v(\mathbf{S}\cup\{j\})-v(\mathbf{S})) \]

\(\mathbf{J} = \{1,...,J\}\) is a set of players.
- In the case above, \(\mathbf{J} = \{A,B,C\}\)

\(|\mathbf{J}|\) is the number of elements in the set \(\mathbf{J}\).
- In the case above, \(|\mathbf{J}| =3\)

\(\mathbf{S}\) is a subset of \(\mathbf{J}\) that excludes player \(j\).
- Thinking about the Shapley value of A, \(\mathbf{S}\) ranges over \(\emptyset,\{B\},\{C\},\{B,C\}\)

\(|\mathbf{S}|\) is the number of elements in the set \(\mathbf{S}\).
- For \(\emptyset,\{B\},\{C\},\{B,C\}\), \(|\mathbf{S}|\) would be \(0,1,1,2\)

\(v\) is the game score.
- If A and B are in the game, the score is notated as \(v(\{A,B\})\)

\(v(\mathbf{S}\cup\{j\}) - v(\mathbf{S})\) is the difference in scores when player \(j\) is in and out of the game.
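
To make the formula concrete, here is a small R sketch (illustration only, not how DALEX computes SHAP) that reproduces player A’s Shapley value of 11.5 from the score table above:

# Coalition scores from the table above
scores <- list("A" = 6, "B" = 4, "C" = 2,
               "A,B" = 20, "A,C" = 15, "B,C" = 10, "A,B,C" = 24)

# Value of a coalition; the empty coalition scores 0
v <- function(set) {
  if (length(set) == 0) return(0)
  scores[[paste(sort(set), collapse = ",")]]
}

# All subsets of a set of players
powerset <- function(x) {
  sets <- list(character(0))
  for (el in x) sets <- c(sets, lapply(sets, function(s) c(s, el)))
  sets
}

shapley <- function(player, players = c("A", "B", "C")) {
  J <- length(players)
  contributions <- sapply(powerset(setdiff(players, player)), function(S) {
    weight <- factorial(length(S)) * factorial(J - length(S) - 1) / factorial(J)
    weight * (v(c(S, player)) - v(S))   # weighted marginal contribution
  })
  sum(contributions)
}

shapley("A")   # 11.5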

Execution with Real Data

Now, let’s see how to run SHAP with an actual dataset.

Get Dataset

# Set up
library(mlbench)
library(tidymodels)
library(DALEX)
library(ranger)
library(Rcpp)
library(corrplot)
library(ggplot2)
library(gridExtra)
library(SHAPforxgboost)

data("BostonHousing")
df = BostonHousing
d_mx <- as.matrix(df)

Overview of the Dataset

Here is an overview of the dataset.

Details of Variables

head(df)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7

medv is our response variable; this is what we predict.

hist(df$medv, breaks = 20, main = 'Histogram of medv', xlab = 'medv ($ in 1,000)')

Build a Model

We won’t cover model building in detail in this article. I used a random forest model with the ranger engine.

split = initial_split(df, 0.8)
train = training(split)
test = testing(split)

model = rand_forest(trees = 100, min_n = 1, mtry = 13) %>% 
  set_engine(engine = "ranger", seed(25)) %>% 
  set_mode("regression")

fit = model %>% 
  fit(medv ~., data=train)
fit
## parsnip model object
## 
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~13,      x), num.trees = ~100, min.node.size = min_rows(~1, x), ~seed(25),      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1)) 
## 
## Type:                             Regression 
## Number of trees:                  100 
## Sample size:                      404 
## Number of independent variables:  13 
## Mtry:                             13 
## Target node size:                 1 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       10.82015 
## R squared (OOB):                  0.8646467

Predict medv

result = test %>% 
  select(medv) %>% 
  bind_cols(predict(fit, test))

metrics = metric_set(rmse, rsq)

result %>% 
  metrics(medv, .pred)
## # A tibble: 2 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard       3.06 
## 2 rsq     standard       0.912

Interpret SHAP

Use the function explain to create an explainer object that helps us to interpret the model.

explainer = fit %>% 
  explain(
    data = test %>% select(-medv),
    y = test$medv
  )
## Preparation of a new explainer is initiated
##   -> model label       :  model_fit  (  default  )
##   -> data              :  102  rows  13  cols 
##   -> target variable   :  102  values 
##   -> predict function  :  yhat.model_fit  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package parsnip , ver. 1.0.3 , task regression (  default  ) 
##   -> predicted values  :  numerical, min =  8.019 , mean =  22.78642 , max =  49.006  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -10.137 , mean =  -0.2021078 , max =  9.822  
##   A new explainer has been created!

Use predict_parts function to get SHAP plot.

  • new_observation: the observation for which you want to calculate SHAP
  • type: the method to use (here “shap”)
  • B: the number of sampled “participation orders” (orderings of variables)
shap = explainer %>% 
  predict_parts(
    new_observation = test %>% dplyr::slice(11),
    type = "shap",
    B = 25)
plot(shap)

Looking at the plot, we can tell that lstat has the highest contribution to this prediction. To interpret: lstat pushes the predicted medv higher, rm pushes it lower, and so on…

FYI

Method Function
Permutation Feature Importance(PFI) model_parts()
Partial Dependence(PD) model_profile()
Individual Conditional Expectation(ICE) predict_profile()
SHAP predict_parts()

Global Feature Importance

The SHAP approach can also be used to get a macro-level overview of the model by taking the mean absolute SHAP value over all observations. Thanks to Yang Liu for making this all-in-one package that handles building a model and plotting SHAP. For more information, visit here.

# Note: The functions shap.score.rank, shap_long_hd and plot.shap.summary were 
# originally published at https://liuyanguu.github.io/post/2018/10/14/shap-visualization-for-xgboost/
# All the credits to the author.


## functions for plot
# return matrix of shap score and mean ranked score list
shap.score.rank <- function(xgb_model = xgb_mod, shap_approx = TRUE, 
                            X_train = mydata$train_mm){
  require(xgboost)
  require(data.table)
  shap_contrib <- predict(xgb_model, X_train,
                          predcontrib = TRUE, approxcontrib = shap_approx)
  shap_contrib <- as.data.table(shap_contrib)
  shap_contrib[,BIAS:=NULL]
  cat('make SHAP score by decreasing order\n\n')
  mean_shap_score <- colMeans(abs(shap_contrib))[order(colMeans(abs(shap_contrib)), decreasing = T)]
  return(list(shap_score = shap_contrib,
              mean_shap_score = (mean_shap_score)))
}

# Note: The function var_importance was 
# originally published at https://github.com/pablo14/shap-values/blob/master/shap.R
# All the credits to the author.

var_importance <- function(shap_result, top_n=10)
{
  var_importance=tibble(var=names(shap_result$mean_shap_score), importance=shap_result$mean_shap_score)
  
  var_importance=var_importance[1:top_n,]
  
  ggplot(var_importance, aes(x=reorder(var,importance), y=importance)) + 
    geom_bar(stat = "identity") + 
    coord_flip() + 
    theme_light() + 
    theme(axis.title.y=element_blank()) 
}
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ readr   2.1.3     ✔ forcats 0.5.2
## ✔ stringr 1.5.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ readr::col_factor()  masks scales::col_factor()
## ✖ gridExtra::combine() masks dplyr::combine()
## ✖ purrr::discard()     masks scales::discard()
## ✖ DALEX::explain()     masks dplyr::explain()
## ✖ dplyr::filter()      masks stats::filter()
## ✖ stringr::fixed()     masks recipes::fixed()
## ✖ dplyr::lag()         masks stats::lag()
## ✖ readr::spec()        masks yardstick::spec()
library(xgboost)
## 
## Attaching package: 'xgboost'
## 
## The following object is masked from 'package:dplyr':
## 
##     slice
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following objects are masked from 'package:yardstick':
## 
##     precision, recall, sensitivity, specificity
## 
## The following object is masked from 'package:purrr':
## 
##     lift
dummy <- dummyVars(" ~ .", data=df)
newdata <- data.frame(predict(dummy, newdata = df)) %>% 
  select(-medv)

model_wine <- xgboost(data = as.matrix(newdata), 
                      label = df$medv, 
                      objective = "reg:squarederror", 
                      nrounds = 100)
## [1]  train-rmse:17.059007 
## [2]  train-rmse:12.263687 
## [3]  train-rmse:8.931146 
## [4]  train-rmse:6.572166 
## [5]  train-rmse:4.907308 
## [6]  train-rmse:3.733522 
## [7]  train-rmse:2.912337 
## [8]  train-rmse:2.362943 
## [9]  train-rmse:1.953651 
## [10] train-rmse:1.684866 
## [11] train-rmse:1.528102 
## [12] train-rmse:1.410980 
## [13] train-rmse:1.307722 
## [14] train-rmse:1.230047 
## [15] train-rmse:1.126096 
## [16] train-rmse:1.064402 
## [17] train-rmse:1.036869 
## [18] train-rmse:0.972606 
## [19] train-rmse:0.902468 
## [20] train-rmse:0.873432 
## [21] train-rmse:0.848563 
## [22] train-rmse:0.795342 
## [23] train-rmse:0.771540 
## [24] train-rmse:0.718620 
## [25] train-rmse:0.693590 
## [26] train-rmse:0.639754 
## [27] train-rmse:0.617747 
## [28] train-rmse:0.575608 
## [29] train-rmse:0.554515 
## [30] train-rmse:0.540946 
## [31] train-rmse:0.514530 
## [32] train-rmse:0.499325 
## [33] train-rmse:0.483672 
## [34] train-rmse:0.471477 
## [35] train-rmse:0.454060 
## [36] train-rmse:0.445097 
## [37] train-rmse:0.416921 
## [38] train-rmse:0.393740 
## [39] train-rmse:0.369955 
## [40] train-rmse:0.357643 
## [41] train-rmse:0.345047 
## [42] train-rmse:0.337411 
## [43] train-rmse:0.327211 
## [44] train-rmse:0.313934 
## [45] train-rmse:0.295142 
## [46] train-rmse:0.286331 
## [47] train-rmse:0.276405 
## [48] train-rmse:0.262147 
## [49] train-rmse:0.251315 
## [50] train-rmse:0.245113 
## [51] train-rmse:0.239266 
## [52] train-rmse:0.224254 
## [53] train-rmse:0.221726 
## [54] train-rmse:0.216645 
## [55] train-rmse:0.213014 
## [56] train-rmse:0.203666 
## [57] train-rmse:0.198869 
## [58] train-rmse:0.187178 
## [59] train-rmse:0.184667 
## [60] train-rmse:0.176635 
## [61] train-rmse:0.168293 
## [62] train-rmse:0.162576 
## [63] train-rmse:0.154729 
## [64] train-rmse:0.153375 
## [65] train-rmse:0.145142 
## [66] train-rmse:0.143298 
## [67] train-rmse:0.137444 
## [68] train-rmse:0.135575 
## [69] train-rmse:0.130472 
## [70] train-rmse:0.128452 
## [71] train-rmse:0.123708 
## [72] train-rmse:0.115479 
## [73] train-rmse:0.112092 
## [74] train-rmse:0.107142 
## [75] train-rmse:0.106020 
## [76] train-rmse:0.099164 
## [77] train-rmse:0.094449 
## [78] train-rmse:0.090660 
## [79] train-rmse:0.088310 
## [80] train-rmse:0.085793 
## [81] train-rmse:0.083700 
## [82] train-rmse:0.077640 
## [83] train-rmse:0.072620 
## [84] train-rmse:0.069856 
## [85] train-rmse:0.067911 
## [86] train-rmse:0.066260 
## [87] train-rmse:0.062791 
## [88] train-rmse:0.060610 
## [89] train-rmse:0.057933 
## [90] train-rmse:0.055341 
## [91] train-rmse:0.053233 
## [92] train-rmse:0.051275 
## [93] train-rmse:0.049004 
## [94] train-rmse:0.048386 
## [95] train-rmse:0.046329 
## [96] train-rmse:0.045065 
## [97] train-rmse:0.042863 
## [98] train-rmse:0.041238 
## [99] train-rmse:0.039000 
## [100]    train-rmse:0.037987
shap_result_wine <- shap.score.rank(xgb_model = model_wine, 
                                    X_train = as.matrix(newdata),
                                    shap_approx = F
                                    )
## Loading required package: data.table
## 
## Attaching package: 'data.table'
## 
## The following object is masked from 'package:purrr':
## 
##     transpose
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## make SHAP score by decreasing order
var_importance(shap_result_wine, top_n = ncol(d_mx) - 1 )

Just like for that one observation, lstat and rm are the most important features across the whole dataset.

Conclusion

SHAP is an efficient method for observing each variable’s contribution to an individual prediction. At the same time, SHAP can also be helpful for getting an overview of the whole model. However, SHAP does not explain the shape of the relationship between an explanatory variable and the response variable, which can be explained with ICE.


  1. It may not be right to pair up tax and rad variables without decent causal inference.↩︎