DrWhy.AI is a collection of tools for working with predictive models. Companion book: https://ema.drwhy.ai

The DrWhy.AI family

  1. Model adapters : a uniform interface to tools such as scikit-learn, h2o, and mlr.
  2. Model agnostic explainers : model-exploration methods that do not depend on a model's internal structure.
  3. Model specific explainers : exploration methods tailored to a specific class of models.
  4. Automated exploration : packages that combine a series of model-exploration techniques and automatically generate report websites for model exploration.

1. DALEX

DALEX helps you understand how complex models work.

Summary of the key functions:

  1. explain : wrap a model into an explainer,
  2. model_parts : explain the importance of variables,
  3. model_profile : explain the relationship between the model and a variable,
  4. predict_parts : explain a single prediction.

Let's look at an example.

library("DALEX")
## Welcome to DALEX (version: 2.4.2).
## Find examples and detailed introduction at: http://ema.drwhy.ai/
head(titanic_imputed)
##   gender age class    embarked  fare sibsp parch survived
## 1   male  42   3rd Southampton  7.11     0     0        0
## 2   male  13   3rd Southampton 20.05     0     2        0
## 3   male  16   3rd Southampton 20.05     1     1        0
## 4 female  39   3rd Southampton 20.05     1     1        1
## 5 female  16   3rd Southampton  7.13     0     0        1
## 6   male  25   3rd Southampton  7.13     0     0        1

Build a classification model.

library("ranger")
model_titanic_rf <- ranger(survived ~ gender + age + class + embarked + fare + sibsp + parch,  data = titanic_imputed,
                           classification = TRUE)
model_titanic_rf
## Ranger result
## 
## Call:
##  ranger(survived ~ gender + age + class + embarked + fare + sibsp +      parch, data = titanic_imputed, classification = TRUE) 
## 
## Type:                             Classification 
## Number of trees:                  500 
## Sample size:                      2207 
## Number of independent variables:  7 
## Mtry:                             2 
## Target node size:                 1 
## Variable importance mode:         none 
## Splitrule:                        gini 
## OOB prediction error:             19.48 %

Create an explainer to understand this random forest model.

library("DALEX")
explain_titanic_rf <- explain(model_titanic_rf, 
                      data = titanic_imputed,
                      y = titanic_imputed$survived, 
                      label = "Random Forest v7",
                      colorize = FALSE)
## Preparation of a new explainer is initiated
##   -> model label       :  Random Forest v7 
##   -> data              :  2207  rows  8  cols 
##   -> target variable   :  2207  values 
##   -> predict function  :  yhat.ranger  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package ranger , ver. 0.14.1 , task classification (  default  ) 
##   -> predicted values  :  numerical, min =  0 , mean =  0.2129588 , max =  1  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -1 , mean =  0.109198 , max =  1  
##   A new explainer has been created!

Analyze variable importance:

vi_rf <- model_parts(explain_titanic_rf)
head(vi_rf)
##       variable mean_dropout_loss            label
## 1 _full_model_         0.2139575 Random Forest v7
## 2     survived         0.2139575 Random Forest v7
## 3        parch         0.2299543 Random Forest v7
## 4        sibsp         0.2327692 Random Forest v7
## 5     embarked         0.2385094 Random Forest v7
## 6         fare         0.2579893 Random Forest v7
plot(vi_rf)

The most important variable is gender. Next, let's analyze the relationship between the response and the explanatory variables.

vr_age  <- model_profile(explain_titanic_rf, variables =  "age")
head(vr_age)
## $cp_profiles
## Top profiles    : 
##       gender        age class    embarked fare sibsp parch survived _yhat_
## 682     male  0.1666667   3rd Southampton 7.11     0     0        0      0
## 682.1   male  2.0000000   3rd Southampton 7.11     0     0        0      1
## 682.2   male  4.0000000   3rd Southampton 7.11     0     0        0      1
## 682.3   male  7.0000000   3rd Southampton 7.11     0     0        0      1
## 682.4   male  9.0000000   3rd Southampton 7.11     0     0        0      1
## 682.5   male 13.0000000   3rd Southampton 7.11     0     0        0      0
##       _vname_ _ids_          _label_
## 682       age   682 Random Forest v7
## 682.1     age   682 Random Forest v7
## 682.2     age   682 Random Forest v7
## 682.3     age   682 Random Forest v7
## 682.4     age   682 Random Forest v7
## 682.5     age   682 Random Forest v7
## 
## 
## Top observations:
##      gender age            class    embarked     fare sibsp parch survived
## 682    male  26              3rd Southampton   7.1100     0     0        0
## 1894   male  30 engineering crew Southampton   0.0000     0     0        1
## 2124   male  32 engineering crew Southampton   0.0000     0     0        0
## 348  female  27              1st   Cherbourg 247.1005     1     1        1
## 269    male  27              3rd Southampton   7.1711     0     0        0
## 1204   male  20              3rd  Queenstown   7.1500     0     0        0
##      _yhat_          _label_ _ids_
## 682       0 Random Forest v7     1
## 1894      0 Random Forest v7     2
## 2124      0 Random Forest v7     3
## 348       1 Random Forest v7     4
## 269       0 Random Forest v7     5
## 1204      0 Random Forest v7     6
## 
## $agr_profiles
## Top profiles    : 
##   _vname_          _label_        _x_ _yhat_ _ids_
## 1     age Random Forest v7  0.1666667   0.56     0
## 2     age Random Forest v7  2.0000000   0.60     0
## 3     age Random Forest v7  4.0000000   0.63     0
## 4     age Random Forest v7  7.0000000   0.53     0
## 5     age Random Forest v7  9.0000000   0.46     0
## 6     age Random Forest v7 13.0000000   0.30     0
## 
## $color
## [1] "#4378bf"
plot(vr_age)

Explain an individual prediction:

new_passanger <- data.frame(
  class = factor("1st", levels = c("1st", "2nd", "3rd", "deck crew", "engineering crew", "restaurant staff", "victualling crew")),
  gender = factor("male", levels = c("female", "male")),
  age = 8,
  sibsp = 0,
  parch = 0,
  fare = 72,
  embarked = factor("Southampton", levels = c("Belfast", "Cherbourg", "Queenstown", "Southampton"))
)

sp_rf <- predict_parts(explain_titanic_rf, new_passanger)
plot(sp_rf)

For this passenger the most important features appear to be age and gender. His predicted chance of survival is higher than that of the average passenger, mainly because of his young age, despite him being male.

2. DALEXtra

DALEXtra extends the previous package with adapters for scikit-learn, keras, H2O, mljar, mlr, and tidymodels.

Let's use an H2O model as an example.

library(h2o)
## 
## ----------------------------------------------------------------------
## 
## Your next step is to start H2O:
##     > h2o.init()
## 
## For H2O package documentation, ask for help:
##     > ??h2o
## 
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit https://docs.h2o.ai
## 
## ----------------------------------------------------------------------
## 
## Attaching package: 'h2o'
## The following objects are masked from 'package:stats':
## 
##     cor, sd, var
## The following objects are masked from 'package:base':
## 
##     &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
##     colnames<-, ifelse, is.character, is.factor, is.numeric, log,
##     log10, log1p, log2, round, signif, trunc
library(DALEXtra)

# data <- DALEX::titanic_imputed

# init h2o
 cluster <- try(h2o::h2o.init())
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         4 hours 5 minutes 
##     H2O cluster timezone:       Asia/Shanghai 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.38.0.1 
##     H2O cluster version age:    1 month and 21 days  
##     H2O cluster name:           H2O_started_from_R_milin_wbn438 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.92 GB 
##     H2O cluster total cores:    8 
##     H2O cluster allowed cores:  8 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 4.2.1 (2022-06-23)
# stop h2o progress printing
 h2o.no_progress()

# split the data
# h2o_split <- h2o.splitFrame(as.h2o(data))
# train <- h2o_split[[1]]
# test <- as.data.frame(h2o_split[[2]])
# h2o automl takes target as factor
# train$survived <- as.factor(train$survived)

# fit a model
# automl <- h2o.automl(y = "survived",
#                   training_frame = train,
#                    max_runtime_secs = 30)


# create an explainer for the model
# explainer <- explain_h2o(automl,
#                        data = test,
#                         y = test$survived,
#                          label = "h2o")


titanic_test <- read.csv(system.file("extdata", "titanic_test.csv", package = "DALEXtra"))
titanic_train <- read.csv(system.file("extdata", "titanic_train.csv", package = "DALEXtra"))
titanic_h2o <- h2o::as.h2o(titanic_train)
titanic_h2o["survived"] <- h2o::as.factor(titanic_h2o["survived"])
titanic_test_h2o <- h2o::as.h2o(titanic_test)
model <- h2o::h2o.gbm(
training_frame = titanic_h2o,
y = "survived",
distribution = "bernoulli",
ntrees = 500,
max_depth = 4,
min_rows =  12,
learn_rate = 0.001
)

ex_h2o <- explain_h2o(model, titanic_test[,1:17], titanic_test[,18])
## Preparation of a new explainer is initiated
##   -> model label       :  H2OBinomialModel  (  default  )
##   -> data              :  524  rows  17  cols 
##   -> target variable   :  524  values 
##   -> predict function  :  yhat.H2OBinomialModel  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package h2o , ver. 3.38.0.1 , task classification (  default  ) 
##   -> predicted values  :  numerical, min =  0.1994385 , mean =  0.3299534 , max =  0.5876636  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -0.5847857 , mean =  -0.01888466 , max =  0.8005615  
##   A new explainer has been created!
ex_h2o
## Model label:  H2OBinomialModel 
## Model class:  H2OBinomialModel 
## Data head  :
##   gender.female gender.male age class.1st class.2nd class.3rd class.deck.crew
## 1             1           0  16         0         0         1               0
## 2             0           1  25         0         0         1               0
##   class.engineering.crew class.restaurant.staff class.victualling.crew
## 1                      0                      0                      0
## 2                      0                      0                      0
##   embarked.Belfast embarked.Cherbourg embarked.Queenstown embarked.Southampton
## 1                0                  0                   0                    1
## 2                0                  0                   0                    1
##   fare sibsp parch
## 1 7.13     0     0
## 2 7.13     0     0

Analyze variable importance:

imp <- model_parts(ex_h2o)
imp
##                  variable mean_dropout_loss            label
## 1            _full_model_         0.2027514 H2OBinomialModel
## 2                   sibsp         0.1965960 H2OBinomialModel
## 3      embarked.Cherbourg         0.2026987 H2OBinomialModel
## 4             gender.male         0.2027514 H2OBinomialModel
## 5               class.2nd         0.2027514 H2OBinomialModel
## 6  class.engineering.crew         0.2027514 H2OBinomialModel
## 7  class.victualling.crew         0.2027514 H2OBinomialModel
## 8        embarked.Belfast         0.2027514 H2OBinomialModel
## 9     embarked.Queenstown         0.2027514 H2OBinomialModel
## 10   embarked.Southampton         0.2027514 H2OBinomialModel
## 11                  parch         0.2027514 H2OBinomialModel
## 12 class.restaurant.staff         0.2109886 H2OBinomialModel
## 13                   fare         0.2112248 H2OBinomialModel
## 14                    age         0.2129132 H2OBinomialModel
## 15              class.1st         0.2267874 H2OBinomialModel
## 16        class.deck.crew         0.2277595 H2OBinomialModel
## 17              class.3rd         0.2407041 H2OBinomialModel
## 18          gender.female         0.4084505 H2OBinomialModel
## 19             _baseline_         0.4980346 H2OBinomialModel
plot(imp)

Explore the relationship between a variable and the response.

pre_h2o <- model_profile(ex_h2o,variables = "age")

plot(pre_h2o)

Explain a prediction. Note that the data passed here must be a data frame, not an H2O frame.

sp_rf <- predict_parts(ex_h2o,titanic_test[1,])
plot(sp_rf)

3. ingredients

The ingredients package is a toolkit for analyzing variable importance and variable effects. The key functions are:

  1. feature_importance() : variable importance,
  2. ceteris_paribus() : Ceteris Paribus / What-If profiles,
  3. partial_dependency() : Partial Dependence profiles,
  4. conditional_dependency() : Conditional Dependence profiles,
  5. accumulated_dependency() : Accumulated Local Effects,
  6. cluster_profiles() : aggregation of Ceteris Paribus profiles into clusters,
  7. describe() : natural-language descriptions of the results.

Let's look at an example.

First, build the model.

library("DALEX")
head(titanic_imputed)
##   gender age class    embarked  fare sibsp parch survived
## 1   male  42   3rd Southampton  7.11     0     0        0
## 2   male  13   3rd Southampton 20.05     0     2        0
## 3   male  16   3rd Southampton 20.05     1     1        0
## 4 female  39   3rd Southampton 20.05     1     1        1
## 5 female  16   3rd Southampton  7.13     0     0        1
## 6   male  25   3rd Southampton  7.13     0     0        1
library("ranger")
model_titanic_rf <- ranger(survived ~ gender + age + class + embarked +
                           fare + sibsp + parch,
                           data = titanic_imputed, probability = TRUE)
model_titanic_rf
## Ranger result
## 
## Call:
##  ranger(survived ~ gender + age + class + embarked + fare + sibsp +      parch, data = titanic_imputed, probability = TRUE) 
## 
## Type:                             Probability estimation 
## Number of trees:                  500 
## Sample size:                      2207 
## Number of independent variables:  7 
## Mtry:                             2 
## Target node size:                 10 
## Variable importance mode:         none 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1420633

Create an explainer for the model:

library("DALEX")
explain_titanic_rf <- explain(model_titanic_rf,
                              data = titanic_imputed[,-8],
                              y = titanic_imputed[,8],
                              label = "Random Forest")
## Preparation of a new explainer is initiated
##   -> model label       :  Random Forest 
##   -> data              :  2207  rows  7  cols 
##   -> target variable   :  2207  values 
##   -> predict function  :  yhat.ranger  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package ranger , ver. 0.14.1 , task classification (  default  ) 
##   -> predicted values  :  numerical, min =  0.0141245 , mean =  0.3218769 , max =  0.9918776  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -0.7839062 , mean =  0.0002798463 , max =  0.8830055  
##   A new explainer has been created!

Compute variable importance:

library("ingredients")
## 
## Attaching package: 'ingredients'
## The following object is masked from 'package:DALEX':
## 
##     feature_importance
fi_rf <- feature_importance(explain_titanic_rf)
head(fi_rf)
##       variable mean_dropout_loss         label
## 1 _full_model_         0.3406770 Random Forest
## 2        parch         0.3507961 Random Forest
## 3        sibsp         0.3516941 Random Forest
## 4     embarked         0.3524095 Random Forest
## 5          age         0.3765780 Random Forest
## 6         fare         0.3832976 Random Forest
plot(fi_rf)

Compute the relationship between the explanatory variables and the response.

pp_age  <- partial_dependence(explain_titanic_rf, variables =  c("age", "fare"))
head(pp_age)
plot(pp_age)
describe(pp_age)

Children around five years old have a higher predicted probability of survival.

Conditional Dependence profiles

cp_age  <- conditional_dependence(explain_titanic_rf, variables =  c("age", "fare"))
 head(cp_age)
## Top profiles    : 
##                                _vname_       _label_        _x_    _yhat_ _ids_
## age.Random Forest.0.1666666667     age Random Forest  0.1666667 0.5218634     0
## age.Random Forest.2                age Random Forest  2.0000000 0.5506402     0
## age.Random Forest.4                age Random Forest  4.0000000 0.5563661     0
## age.Random Forest.7                age Random Forest  7.0000000 0.5056455     0
## age.Random Forest.9                age Random Forest  9.0000000 0.4910581     0
## age.Random Forest.13               age Random Forest 13.0000000 0.4280250     0
plot(cp_age)

Accumulated Local Effect Profiles

ap_age  <- accumulated_dependence(explain_titanic_rf, variables =  c("age", "fare"))
plot(ap_age)

Analyzing an individual prediction

A break-down explanation of the model prediction for an 8-year-old boy traveling first class who embarked at Southampton.

new_passanger <- data.frame(
  class = factor("1st", levels = c("1st", "2nd", "3rd", "deck crew", "engineering crew", "restaurant staff", "victualling crew")),
  gender = factor("male", levels = c("female", "male")),
  age = 8,
  sibsp = 0,
  parch = 0,
  fare = 72,
  embarked = factor("Southampton", levels = c("Belfast", "Cherbourg", "Queenstown", "Southampton"))
)

sp_rf <- ceteris_paribus(explain_titanic_rf, new_passanger)
plot(sp_rf) +
  show_observations(sp_rf)

describe(sp_rf,variables = "fare")
## Random Forest predicts that for the selected instance, prediction is equal to 0.462
## 
## The highest prediction occurs for (fare = 26.11), while the lowest for (fare = 51.117764).
## Breakpoint is identified at (fare = 34.011).
## 
## Average model responses are *higher* for variable values *lower* than breakpoint (= 34.011).

Profiles for selected categorical variables. Note that sibsp is numeric, but here it is displayed as a categorical variable.

plot(sp_rf,
     variables = c("class", "embarked", "gender", "sibsp"),
     variable_type = "categorical")

Again, for this passenger the most important features appear to be age and gender: his predicted chance of survival is higher than the average passenger's, mainly because of his young age, despite him being male.

Clustering profiles

passangers <- select_sample(titanic, n = 100)

sp_rf <- ceteris_paribus(explain_titanic_rf, passangers)
clust_rf <- cluster_profiles(sp_rf, k = 3)
head(clust_rf)
## Top profiles    : 
##   _vname_         _label_       _x_ _cluster_    _yhat_ _ids_
## 1    fare Random Forest_1 0.0000000         1 0.2353243     0
## 2   parch Random Forest_1 0.0000000         1 0.1672138     0
## 3   sibsp Random Forest_1 0.0000000         1 0.1703429     0
## 4     age Random Forest_1 0.1666667         1 0.4574496     0
## 5   parch Random Forest_1 1.0000000         1 0.2350421     0
## 6   sibsp Random Forest_1 1.0000000         1 0.1471561     0
plot(sp_rf, alpha = 0.1) +
  show_aggregated_profiles(clust_rf, color = "_label_", size = 2)

describe() can help us interpret the results.

library("DALEX")
library("ingredients")
library("ranger")


model_titanic_rf <- ranger(survived ~., data = titanic_imputed, probability = TRUE)

explain_titanic_rf <- explain(model_titanic_rf,
                              data = titanic_imputed[,-8],
                              y = titanic_imputed[,8],
                              label = "ranger forest",
                              verbose = FALSE)

selected_passangers <- select_sample(titanic_imputed, n = 10)
cp_rf <- ceteris_paribus(explain_titanic_rf, selected_passangers)
pdp <- aggregate_profiles(cp_rf, type = "partial", variable_type = "categorical") # aggregate the profiles
pdp
## Top profiles    : 
##   _vname_       _label_       _x_    _yhat_ _ids_
## 8  gender ranger forest    female 0.7163783     0
## 9  gender ranger forest      male 0.2523073     0
## 1   class ranger forest       1st 0.4573005     0
## 2   class ranger forest       2nd 0.3722280     0
## 3   class ranger forest       3rd 0.2941767     0
## 6   class ranger forest deck crew 0.3651602     0
describe(pdp, variables = "gender")
## Ranger forest's mean prediction is equal to 0.313.
## 
## Model's prediction would decrease substantially if the value of gender variable would change to "female".
## The largest change would be marked if gender variable would change to "male".
## 
## Other variables are with less importance and they do not change prediction by more than 0.06%.

4. iBreakDown

The iBreakDown package is a model-agnostic tool for explaining predictions of black-box ML models. Break-down tables show the contribution of each variable to the final prediction, and break-down plots present variable contributions in a concise graphical form. SHAP (SHapley Additive exPlanations) values are computed as averages over random break-down profiles. The package works for binary classifiers and for regression models.

library("DALEX")
head(titanic)
##   gender age class    embarked       country  fare sibsp parch survived
## 1   male  42   3rd Southampton United States  7.11     0     0       no
## 2   male  13   3rd Southampton United States 20.05     0     2       no
## 3   male  16   3rd Southampton United States 20.05     1     1       no
## 4 female  39   3rd Southampton       England 20.05     1     1      yes
## 5 female  16   3rd Southampton        Norway  7.13     0     0      yes
## 6   male  25   3rd Southampton United States  7.13     0     0      yes
# prepare model
library("randomForest")
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ranger':
## 
##     importance
titanic <- na.omit(titanic)
model_titanic_rf <- randomForest(survived == "yes" ~ gender + age + class + embarked +
                                   fare + sibsp + parch,  data = titanic)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
model_titanic_rf
## 
## Call:
##  randomForest(formula = survived == "yes" ~ gender + age + class +      embarked + fare + sibsp + parch, data = titanic) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 0.1430273
##                     % Var explained: 34.74
library("DALEX")
explain_titanic_rf <- explain(model_titanic_rf, 
                      data = titanic[,-9],
                      y = titanic$survived == "yes", 
                      label = "Random Forest v7")
## Preparation of a new explainer is initiated
##   -> model label       :  Random Forest v7 
##   -> data              :  2099  rows  8  cols 
##   -> target variable   :  2099  values 
##   -> predict function  :  yhat.randomForest  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package randomForest , ver. 4.7.1.1 , task regression (  default  ) 
##   -> model_info        :  Model info detected regression task but 'y' is a logical .  (  WARNING  )
##   -> model_info        :  By deafult regressions tasks supports only numercical 'y' parameter. 
##   -> model_info        :  Consider changing to numerical vector.
##   -> model_info        :  Otherwise I will not be able to calculate residuals or loss function.
##   -> predicted values  :  numerical, min =  0.008045946 , mean =  0.3239705 , max =  0.9926026  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -0.7924928 , mean =  0.0004697528 , max =  0.9044595  
##   A new explainer has been created!

Let's look at the break-down of the model prediction for an 8-year-old boy traveling first class who embarked at Southampton.

new_passanger <- data.frame(
  class = factor("1st", levels = c("1st", "2nd", "3rd", "deck crew", "engineering crew", "restaurant staff", "victualling crew")),
  gender = factor("male", levels = c("female", "male")),
  age = 8,
  sibsp = 0,
  parch = 0,
  fare = 72,
  embarked = factor("Southampton", levels = c("Belfast", "Cherbourg", "Queenstown", "Southampton"))
)

Compute the variable attributions

library("iBreakDown")
## 
## Attaching package: 'iBreakDown'
## The following objects are masked from 'package:ingredients':
## 
##     describe, plotD3
rf_la <- local_attributions(explain_titanic_rf, new_passanger)
rf_la
##                                          contribution
## Random Forest v7: intercept                     0.324
## Random Forest v7: age = 8                       0.191
## Random Forest v7: class = 1st                   0.071
## Random Forest v7: gender = male                -0.055
## Random Forest v7: fare = 72                    -0.037
## Random Forest v7: embarked = Southampton       -0.019
## Random Forest v7: sibsp = 0                    -0.001
## Random Forest v7: parch = 0                    -0.030
## Random Forest v7: prediction                    0.445
describe(rf_la)
## Random Forest v7 predicts, that the prediction for the selected instance is 0.445 which is higher than the average model prediction.
## 
## The most important variable that increase the prediction is age.
## 
## Other variables are with less importance. The contribution of all other variables is -0.087.

Plot the attributions

plot(rf_la)

plotD3(rf_la)

Compute the uncertainty of the variable attributions

rf_la_un <- break_down_uncertainty(explain_titanic_rf, new_passanger,path = "average")

describe(rf_la_un)
## Random Forest v7 predicts, that the prediction for the selected instance is 0.445 which is higher than the average model prediction.For the selected instance model's prediction is higher, than for 74% of all observations.
## 
## The most important variable that increase the prediction is age.
## 
## Other variables are with less importance. The contribution of all other variables is -0.061.
plot(rf_la_un)

plotD3(rf_la, max_features = 3)

About the describe() function:

Explanations produced for ML models can still be hard for domain experts to read. The describe() function addresses this by automatically generating natural-language descriptions of the explanations produced with the iBreakDown package.

iBreakDown generates feature-attribution explanations. These justify a model's prediction by showing which variables affect it and by how much: each variable is assigned an importance coefficient, and the coefficients sum approximately to the model's prediction.

iBreakDown offers two approaches: shap() and break_down().

The shap() function produces a SHAP explanation, i.e., it assigns a Shapley value to each variable. The break_down() function uses the break-down algorithm to produce an efficient approximation of Shapley values.

library("DALEX")
library("iBreakDown")
library("randomForest")
titanic <- na.omit(titanic)

model_titanic_rf <- randomForest(survived == "yes" ~ .,
                                 data = titanic 
                                 )
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
explain_titanic_rf <- explain(model_titanic_rf,
                            data = titanic[,-9],
                            y = titanic$survived == "yes",
                            label = "Random Forest")
## Preparation of a new explainer is initiated
##   -> model label       :  Random Forest 
##   -> data              :  2099  rows  8  cols 
##   -> target variable   :  2099  values 
##   -> predict function  :  yhat.randomForest  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package randomForest , ver. 4.7.1.1 , task regression (  default  ) 
##   -> model_info        :  Model info detected regression task but 'y' is a logical .  (  WARNING  )
##   -> model_info        :  By deafult regressions tasks supports only numercical 'y' parameter. 
##   -> model_info        :  Consider changing to numerical vector.
##   -> model_info        :  Otherwise I will not be able to calculate residuals or loss function.
##   -> predicted values  :  numerical, min =  0.008234269 , mean =  0.324069 , max =  0.9922297  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -0.8192322 , mean =  0.0003711799 , max =  0.8905284  
##   A new explainer has been created!
passanger <- titanic[sample(nrow(titanic), 1) ,-9]
passanger
##      gender age            class    embarked country fare sibsp parch
## 1455   male  40 engineering crew Southampton England    0     0     0

Now we are ready to generate shap() and break_down() explanations.

bd_rf <- break_down(explain_titanic_rf,
                    passanger,
                    keep_distributions = TRUE) # distributions should be kept
shap_rf <- shap(explain_titanic_rf,
                passanger)

plot(bd_rf)

plot(shap_rf)

Generate the descriptions

describe(bd_rf)
## Random Forest predicts, that the prediction for the selected instance is 0.185 which is lower than the average model prediction. For the selected instance model's prediction is lower, than for 57% of all observations.
## 
## The most important variable that decrease the prediction is gender.
## 
## Other variables are with less importance. The contribution of all other variables is 0.017.
describe(shap_rf)
## Random Forest predicts, that the prediction for the selected instance is 0.185 which is lower than the average model prediction. For the selected instance model's prediction is lower, than for 57% of all observations.
## 
## The most important variable that decrease the prediction is gender.
## 
## Other variables are with less importance. The contribution of all other variables is 0.01.

Parameters of describe() (a combined example follows the list):

  1. nonsignificance_treshold : adjusts the non-significance threshold, which controls which predictions count as close to the average prediction. With a higher value, more predictions are described as close to the average model prediction and more variables are described as non-significant.
  2. label : changes the label of the prediction for a more specific description, e.g. label = "the passanger survived with probability".
  3. short_description = TRUE : generates a short description, which can be useful e.g. as a plot subtitle.
  4. display_values = TRUE : displays variable values, which easily makes the description more informative.
  5. display_numbers = TRUE : displays the numbers, which changes the style of the argument and makes the description longer, e.g. describe(rf_la, display_numbers = TRUE).
  6. display_distribution_details = TRUE : describes distribution details, useful for a full picture of how other instances behave.
  7. display_shap = TRUE : adds extra SHAP-based information.
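A minimal sketch combining several of these options on the rf_la object created above (the label string is only an illustrative example):

describe(rf_la,
         label = "the passanger survived with probability",
         display_values = TRUE,
         display_numbers = TRUE)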

5. auditor

The auditor package is a model-agnostic validation tool. It helps assess and compare the goodness of fit and performance of models, and it can also be used to analyze the similarity of residuals and to identify outliers and influential observations.

Paper: https://arxiv.org/abs/1809.07763

Example

library(auditor)
## 
## Attaching package: 'auditor'
## The following object is masked from 'package:iBreakDown':
## 
##     plotD3
## The following object is masked from 'package:ingredients':
## 
##     plotD3
## The following object is masked from 'package:DALEX':
## 
##     model_performance
library(randomForest)
data(mtcars)

# fitting models
model_lm <- lm(mpg ~ ., data = mtcars)
set.seed(123)
model_rf <- randomForest(mpg ~ ., data = mtcars)

# creating objects with 'explain' function from the package DALEX
# that contains all necessary components required for further processing
exp_lm <- DALEX::explain(model_lm, data = mtcars, y = mtcars$mpg,  verbose = FALSE)
exp_rf <- DALEX::explain(model_rf, data = mtcars, y = mtcars$mpg, label = "rf", verbose = FALSE)

# create explanation  objects
mr_lm <- model_residual(exp_lm)
mr_rf <- model_residual(exp_rf)

# generating plots
plot_residual(mr_lm, mr_rf, variable = "wt", smooth = TRUE)

Now let's look at an example with a binary classification model.

model_glm <- glm(survived ~ ., family = binomial, data = titanic_imputed)

glm_audit <- audit(model_glm,
                   data = titanic_imputed,
                   y = titanic_imputed$survived)
## Preparation of a new explainer is initiated
##   -> model label       :  lm  (  default  )
##   -> data              :  2207  rows  8  cols 
##   -> target variable   :  2207  values 
##   -> predict function  :  yhat.glm  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package stats , ver. 4.2.1 , task classification (  default  ) 
##   -> predicted values  :  numerical, min =  0.008128381 , mean =  0.3221568 , max =  0.9731431  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -0.9628583 , mean =  -2.569729e-10 , max =  0.9663346  
##   A new explainer has been created!
# validate a model with auditor
eva_glm <- model_evaluation(glm_audit)

# plot results
plot_prc(eva_glm)
## Warning: Removed 1 row containing missing values (`geom_step()`).

plot(eva_glm)

Cheatsheet: https://github.com/ModelOriented/auditor/blob/master/materials/auditor_cheatsheet.pdf

6. fairmodels

A flexible tool for bias detection, visualization, and mitigation. Fairness metrics for classification are computed from the confusion matrix.

Website: https://fairmodels.drwhy.ai/

The key function is fairness_check().

Suppose you are building an algorithm for a court system that predicts whether someone will reoffend. You collect the data, build a model, and get satisfactory accuracy. That is good, but the model appears more likely to say that African Americans will reoffend: it was trained on data with visible racial bias, and a racially biased algorithm is of no use to a court system. So we have two options: collect more reliable data, or try to understand the bias and choose the least biased model. We will go with the second option.

library(fairmodels)
data("compas")

head(compas)
##   Two_yr_Recidivism Number_of_Priors Age_Above_FourtyFive Age_Below_TwentyFive
## 1                 0                0                    1                    0
## 2                 1                0                    0                    0
## 3                 1                4                    0                    1
## 4                 0                0                    0                    0
## 5                 1               14                    0                    0
## 6                 0                3                    0                    0
##   Misdemeanor        Ethnicity  Sex
## 1           0            Other Male
## 2           0 African_American Male
## 3           0 African_American Male
## 4           1            Other Male
## 5           0        Caucasian Male
## 6           0            Other Male

For the fairmodels package to work as intended, we flip the factor levels of the target variable so that the model predicts the positive outcome (not being a recidivist).

compas$Two_yr_Recidivism <- as.factor(ifelse(compas$Two_yr_Recidivism == '1', '0', '1'))

Train a classification model (the code below fits a ranger random forest).

library(DALEX)
library(ranger)

# train
rf_compas <- ranger(Two_yr_Recidivism ~., data = compas, probability = TRUE)

# numeric target values
y_numeric <- as.numeric(compas$Two_yr_Recidivism)-1

# explainer
rf_explainer <- explain(rf_compas, data = compas[,-1], y = y_numeric, colorize = FALSE)
## Preparation of a new explainer is initiated
##   -> model label       :  ranger  (  default  )
##   -> data              :  6172  rows  6  cols 
##   -> target variable   :  6172  values 
##   -> predict function  :  yhat.ranger  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package ranger , ver. 0.14.1 , task classification (  default  ) 
##   -> predicted values  :  numerical, min =  0.1638799 , mean =  0.5448618 , max =  0.8663325  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -0.8463783 , mean =  1.83503e-05 , max =  0.7790153  
##   A new explainer has been created!

Then we create a fairness object with fairness_check(); this function can aggregate many explainers.

fobject <- fairness_check(rf_explainer,                         # explainer
                          protected = compas$Ethnicity,         # protected variable as factor
                          privileged = "Caucasian",             # level in protected variable, potentially more privileged
                          cutoff = 0.5,                         # cutoff - optional, default = 0.5
                          colorize = FALSE)                         
## Creating fairness classification object
## -> Privileged subgroup       : character ( Ok  )
## -> Protected variable        : factor ( Ok  ) 
## -> Cutoff values for explainers  : 0.5 ( for all subgroups )
## -> Fairness objects      : 0 objects 
## -> Checking explainers       : 1 in total (  compatible  )
## -> Metric calculation        : 13/13 metrics calculated for all models
##  Fairness object created succesfully

Let's check whether our model is biased.

print(fobject, colorize = FALSE)
## 
## Fairness check for models: ranger 
## 
## ranger passes 2/5 metrics
## Total loss :  3.713474
plot(fobject)

If a bar reaches the red zone on the left, it signals bias against the corresponding unprivileged subgroup; if it reaches the red zone on the right, the model is biased in favor of the privileged one.

Why is the model biased? It simply learned from biased data, as we can see in the plot below.

plot_density(fobject)

As we can see, the model is more likely to classify African Americans as recidivists. See the official tutorial for more.

7. vivo

The vivo package helps compute instance-level variable importance (a measure of local sensitivity).

https://github.com/ModelOriented/vivo#intuition

Ceteris paribus is a Latin phrase meaning "other things held constant". Ceteris Paribus plots show how the model response depends on changes in a single input variable while all other variables are held fixed. They work with any machine learning model and allow model comparison, giving better insight into how black-box models work.

In other words: how does this variable affect the model's prediction when every other variable stays unchanged?

library("randomForest")
apartments_rf_model <- randomForest(m2.price ~ construction.year + surface + floor +
                                      no.rooms, data = apartments)
explainer_rf <- explain(apartments_rf_model,
                        data = apartmentsTest[,2:5], y = apartmentsTest$m2.price)
## Preparation of a new explainer is initiated
##   -> model label       :  randomForest  (  default  )
##   -> data              :  9000  rows  4  cols 
##   -> target variable   :  9000  values 
##   -> predict function  :  yhat.randomForest  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package randomForest , ver. 4.7.1.1 , task regression (  default  ) 
##   -> predicted values  :  numerical, min =  2093.116 , mean =  3513.504 , max =  5379.316  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -1177.235 , mean =  -1.980654 , max =  2166.446  
##   A new explainer has been created!

Compute global variable importance.

profiles <- model_profile(explainer_rf)
plot(profiles)

library("vivo")
measure <- global_variable_importance(profiles)
plot(measure)

We can see that the most important variable is surface.

Create a new observation.

new_apartment <- data.frame(construction.year = 1998, surface = 88, floor = 2L, no.rooms = 3)
predict(apartments_rf_model, new_apartment)
##       1 
## 3895.99

Compute the Ceteris Paribus profile

library("ingredients")
profiles <- predict_profile(explainer_rf, new_apartment)
plot(profiles) + show_observations(profiles)

Compute the local variable-importance measure

library("vivo")
measure <- local_variable_importance(profiles, apartments[,2:5],
            absolute_deviation = TRUE, point = TRUE, density = TRUE)

plot(measure)

8. randomForestExplainer

The randomForestExplainer package helps you understand what is happening inside a random forest model.

Cheatsheet: https://github.com/ModelOriented/randomForestExplainer/blob/master/materials/cheatsheet.pdf

Tutorial: https://modeloriented.github.io/randomForestExplainer/articles/randomForestExplainer.html

library(randomForest)
library(randomForestExplainer)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# localImp = TRUE stores local importance: the importance of each variable
# for the classification of each individual observation
ran <- randomForest(Species ~ ., data = iris, localImp = TRUE)

foresImp <- measure_importance(ran)

plot_importance_ggpairs(foresImp)

plot_importance_rankings(foresImp)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

plot_multi_way_importance(foresImp)

measure_importance() computes a wide range of importance measures.

9. xspliner

The xspliner package is a set of tools for training interpretable surrogate machine learning models.

The main functions include:

  1. xspline() or model_surrogate_xspliner() : trains the surrogate model,
  2. plot_model_comparison() or the plot generic : visual comparison of the predictions of the surrogate and the original ML model,
  3. plot_variable_transition() or the plot generic : graphical presentation of variable profiles and related information,
  4. summary() : statistical comparison of the surrogate and the original ML model,
  5. print() : detailed information about the components of the surrogate model.

https://modeloriented.github.io/xspliner/articles/xspliner.html

Motivation

In regression and classification problems, the main question is which model best satisfies our requirements. On one hand, we want a solution with the best statistical properties, such as accuracy, RMSE, or R-squared; on the other hand, we care about the model's interpretability.

On the first criterion, black-box models such as random forests or XGBoost outperform other models, while the second is dominated by linear models. The xspliner package aims to combine the two approaches: it uses knowledge extracted from a black-box model to build an interpretable linear model.

# devtools::install_github("ModelOriented/xspliner")
library(xspliner)
## 
## Attaching package: 'xspliner'
## The following object is masked from 'package:graphics':
## 
##     xspline
library(pdp)
data(boston)
str(boston)
## 'data.frame':    506 obs. of  16 variables:
##  $ lon    : num  -71 -71 -70.9 -70.9 -70.9 ...
##  $ lat    : num  42.3 42.3 42.3 42.3 42.3 ...
##  $ cmedv  : num  24 21.6 34.7 33.4 36.2 28.7 22.9 22.1 16.5 18.9 ...
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : int  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ b      : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
boston_rf <- randomForest(cmedv ~ rm + lstat + nox, data = boston)


rf_effect <- pdp::partial(boston_rf, "nox", grid.resolution = 40)
head(rf_effect)
##         nox     yhat
## 1 0.3850000 23.45756
## 2 0.3974615 23.46405
## 3 0.4099231 23.53536
## 4 0.4223846 23.69231
## 5 0.4348462 23.63515
## 6 0.4473077 23.75913
xp_model <- xspline(
  cmedv ~ rm + lstat +
    xs(nox, 
       effect = list(type = "pdp", grid.resolution = 40), 
       transition = list(k = 10, bs = "cr")),
  model = boston_rf
)

summary(xp_model)
## 
## Call:
## stats::glm(formula = cmedv ~ rm + lstat + xs(nox), family = family, 
##     data = data)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -13.4004   -3.3024   -0.9269    2.0446   27.1490  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -33.73798    5.03303  -6.703 5.49e-11 ***
## rm            5.24996    0.41654  12.604  < 2e-16 ***
## lstat        -0.51949    0.04388 -11.838  < 2e-16 ***
## xs(nox)       1.28752    0.16044   8.025 7.24e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 26.87613)
## 
##     Null deviance: 42578  on 505  degrees of freedom
## Residual deviance: 13492  on 502  degrees of freedom
## AIC: 3107.3
## 
## Number of Fisher Scoring iterations: 2

10. shapper

shapper is an R wrapper around the Python SHAP library.
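A minimal usage sketch following the pattern of the shapper README; it assumes a working Python environment with the shap library installed (install_shap() can set one up) and reuses the titanic_imputed data from earlier sections:

library("shapper")
# install_shap()  # one-time setup: installs the Python 'shap' library via reticulate
library("DALEX")
library("randomForest")

model_rf <- randomForest(as.factor(survived) ~ ., data = titanic_imputed, ntree = 50)
p_fun <- function(model, data) predict(model, newdata = data, type = "prob")

# Shapley values for one observation, computed by the Python backend
ive_rf <- individual_variable_effect(model_rf,
                                     data = titanic_imputed[, -8],
                                     predict_function = p_fun,
                                     new_observation = titanic_imputed[1, -8],
                                     nsamples = 50)
plot(ive_rf)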

11. drifter

A tool for identifying concept drift in model structure or in data structure.

Machine learning models are typically fitted and validated on historical data under the assumption that the data are stationary. The most popular validation techniques (k-fold cross-validation, repeated cross-validation, and so on) test models on data drawn from the same distribution as the training data.

In many practical applications, however, deployed models work in a constantly changing environment. After some time, model performance can degrade as the environment shifts, because the model may become less reliable.

Concept drift refers to changes over time in the distribution of the data or in the relationships between variables. Consider a model of a school's energy consumption: over time the school may acquire more equipment, or more energy-efficient equipment, and either change can affect the model's performance.

https://rpubs.com/liam/drifterConceptDriftInR
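A minimal sketch, assuming the calculate_covariate_drift() and check_drift() interfaces described in the drifter README; the apartments and apartments_test data from DALEX stand in for "old" and "new" data:

library("DALEX")
library("drifter")

# distribution drift between "old" (training) and "new" (current) data
calculate_covariate_drift(apartments, apartments_test)

# joint check for covariate, residual, and model drift between two models
model_old <- lm(m2.price ~ ., data = apartments)
model_new <- lm(m2.price ~ ., data = apartments_test)
check_drift(model_old, model_new,
            apartments, apartments_test,
            apartments$m2.price, apartments_test$m2.price)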

12. EIX

The EIX package implements a set of techniques to explore and explain XGBoost and LightGBM models.

Cheatsheet: https://github.com/ModelOriented/EIX#cheatsheets
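A minimal sketch, assuming the importance() and lollipop() interfaces from the EIX README; titanic_imputed is one-hot encoded because xgboost requires a numeric matrix:

library("xgboost")
library("Matrix")
library("EIX")

sm <- sparse.model.matrix(survived ~ . - 1, data = DALEX::titanic_imputed)
model_xgb <- xgboost(data = sm, label = DALEX::titanic_imputed$survived,
                     nrounds = 50, objective = "binary:logistic", verbose = 0)

# importance of single variables and of variable interactions
imp <- importance(model_xgb, sm, option = "both")
plot(imp, top = 10)

# interactions visualized tree by tree
lolli <- lollipop(model_xgb, sm)
plot(lolli, labels = "topAll", log_scale = TRUE)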

13. modelStudio

The modelStudio package automates the explanatory analysis of machine learning predictive models. With a single line of code it generates an advanced, interactive, serverless HTML site for model exploration. The tool is model-agnostic, so it is compatible with most black-box predictive models and frameworks (e.g. mlr/mlr3, xgboost, caret, h2o, parsnip, tidymodels, scikit-learn, lightgbm, keras/tensorflow).

https://modelstudio.drwhy.ai/#demo

library("DALEX")
library("ranger")
library("modelStudio")

# fit a model
model <- ranger(score ~., data = happiness_train)

# create an explainer for the model    
explainer <- explain(model,
                     data = happiness_test,
                     y = happiness_test$score,
                     label = "Random Forest")

# make a studio for the model
modelStudio(explainer)

14. arenar

Arena is an interactive tool that lets you explore and compare any models, regardless of their internal structure.

The arenar package can run in two modes: live (R runs in the background and computes all necessary explanations on demand) and serverless (all necessary explanations are computed up front).

https://arenar.drwhy.ai/articles/arena_intro_titanic.html

library("ranger")

titanic_rf <- ranger(survived ~ ., 
              data = titanic_imputed,
              probability = TRUE, 
              classification = TRUE)

arenar, like all packages in the DrWhy.AI family, works on unified model wrappers, which we create with the explain() function from DALEX.

library("DALEX")

titanic_ex <- explain(
    titanic_rf,
    y = titanic_imputed$survived,
    data = titanic_imputed)
## Preparation of a new explainer is initiated
##   -> model label       :  ranger  (  default  )
##   -> data              :  2207  rows  8  cols 
##   -> target variable   :  2207  values 
##   -> predict function  :  yhat.ranger  will be used (  default  )
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package ranger , ver. 0.14.1 , task classification (  default  ) 
##   -> predicted values  :  numerical, min =  0.01403656 , mean =  0.3224658 , max =  0.9881752  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  -0.7797319 , mean =  -0.0003090208 , max =  0.8932716  
##   A new explainer has been created!

Create the Arena

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
## 
##     combine
## The following object is masked from 'package:DALEX':
## 
##     explain
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("arenar")

titanic_ar <- create_arena(live = TRUE) %>%
    push_model(titanic_ex) 

Run the live Arena server

run_server(titanic_ar)

A few official examples:

  1. https://arena.drwhy.ai/?demo=0
  2. https://arena.drwhy.ai/?demo=1
  3. https://arena.drwhy.ai/?demo=2

15. modelDown

The modelDown package generates a website with HTML summaries of predictive models.

Paper: https://joss.theoj.org/papers/10.21105/joss.01444

http://github.com/ModelOriented/modelDown

devtools::install_github("ModelOriented/modelDown")
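A minimal sketch: modelDown() accepts one or more DALEX explainers and renders a static HTML report; here it reuses the exp_lm and exp_rf explainers built in the auditor section:

library("modelDown")
modelDown(exp_lm, exp_rf)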

16. rSAFE

The rSAFE package is a model-agnostic tool for making an interpretable white-box model more accurate using an alternative black-box model, called a surrogate model. New features are extracted from a complex model, such as a neural network or a random forest, and then used when fitting a simpler, interpretable model, improving its overall performance.

Article: https://www.sciencedirect.com/science/article/pii/S016792362100066X
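A minimal sketch of the rSAFE workflow, assuming the safe_extraction() and safely_transform_data() interfaces from the package README; it reuses the apartments data from the vivo section:

library("rSAFE")
library("randomForest")
library("DALEX")

# the black-box model whose knowledge we want to transfer
model_rf <- randomForest(m2.price ~ ., data = apartments)
explainer_rf <- explain(model_rf, data = apartments[, -1],
                        y = apartments$m2.price, verbose = FALSE)

# extract interpretable feature transformations from the black box ...
safe_extractor <- safe_extraction(explainer_rf, penalty = 25, verbose = FALSE)

# ... then fit a simple white-box model on the transformed features
apartments_safe <- safely_transform_data(safe_extractor, apartments, verbose = FALSE)
model_lm <- lm(m2.price ~ ., data = apartments_safe)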

17. EloML

The EloML package provides an Elo rating system for machine learning models. The Elo Predictive Power (EPP) score helps build an Elo-style ranking of models based on their performance.

Article: https://arxiv.org/abs/1908.09213

https://github.com/ModelOriented/EloML

18. archivist

The archivist package automates the serialization and deserialization of R objects. Objects are stored together with metadata, which supports the reproducibility and governance of data science projects.
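A minimal sketch of the basic workflow (the repository directory name is an arbitrary example):

library("archivist")

# create a local repository and archive an object together with its metadata
createLocalRepo(repoDir = "arepo")
model_lm <- lm(m2.price ~ ., data = DALEX::apartments)
hash <- saveToLocalRepo(model_lm, repoDir = "arepo")

# later, restore exactly the same object by its md5 hash
model_restored <- loadFromLocalRepo(hash, repoDir = "arepo", value = TRUE)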

19. Other R packages commonly used along the way

19.1 Data preparation

19.1.1 dataMaid

A suite of checks for identifying potential errors in a data frame, as part of the data screening process.

https://cran.r-project.org/web/packages/dataMaid/index.html

19.1.2 inspectdf

A collection of utilities for column-wise summarizing, comparing, and visualizing data frames.

https://cran.r-project.org/web/packages/inspectdf/index.html

19.1.3 validate

Professional data validation for the R environment.

https://github.com/data-cleaning/validate

19.1.4 errorlocate

Find and replace erroneous fields in data using validation rules. https://github.com/data-cleaning/errorlocate

19.1.5 ggplot2

A declarative system for creating graphics, based on the grammar of graphics. https://ggplot2.tidyverse.org/

19.2 Data understanding

19.2.2 vip

Variable importance plots.

https://github.com/koalaverse/vip
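Since vip exposes a one-line interface for many model classes, a short sketch (reusing the apartments_rf_model fitted in the vivo section) may help:

library("vip")
vip(apartments_rf_model, num_features = 4)  # ranks the four predictors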

19.2.3 SAFE

Surrogate learning: train a flexible model and extract feature transformations. https://github.com/MI2DataLab/SAFE

19.2.4 xspliner

Use a surrogate black box to train interpretable spline-based additive models. https://github.com/ModelOriented/xspliner

19.2.5 factorMerger

factorMerger is a set of tools supporting post-hoc testing that extracts a hierarchical structure of factor levels with respect to a given response.

https://github.com/ModelOriented/factorMerger

19.2.6 ingredients

A toolset for model-level feature effects and feature importance. https://github.com/ModelOriented/ingredients

19.4 Model audit

19.4.1 auditor

Model verification, validation, and error analysis. https://github.com/ModelOriented/auditor

19.4.2 DALEX

Descriptive machine learning explanations. https://github.com/ModelOriented/DALEX

19.4.3 iml

An R package for interpretable machine learning. https://github.com/christophM/iml

19.4.4 randomForestExplainer

A set of tools to understand what is happening inside a random forest. https://github.com/MI2DataLab/randomForestExplainer