DrWhy.AI is a collection of tools for predictive models. Companion book: https://ema.drwhy.ai
These packages help us understand how complex models work. The key DALEX functions used below are explain(), model_parts(), model_profile(), and predict_parts(). Let's look at an example.
library("DALEX")
## Welcome to DALEX (version: 2.4.2).
## Find examples and detailed introduction at: http://ema.drwhy.ai/
head(titanic_imputed)
## gender age class embarked fare sibsp parch survived
## 1 male 42 3rd Southampton 7.11 0 0 0
## 2 male 13 3rd Southampton 20.05 0 2 0
## 3 male 16 3rd Southampton 20.05 1 1 0
## 4 female 39 3rd Southampton 20.05 1 1 1
## 5 female 16 3rd Southampton 7.13 0 0 1
## 6 male 25 3rd Southampton 7.13 0 0 1
Build a classification model.
library("ranger")
model_titanic_rf <- ranger(survived ~ gender + age + class + embarked + fare + sibsp + parch,
                           data = titanic_imputed, classification = TRUE)
model_titanic_rf
## Ranger result
##
## Call:
## ranger(survived ~ gender + age + class + embarked + fare + sibsp + parch, data = titanic_imputed, classification = TRUE)
##
## Type: Classification
## Number of trees: 500
## Sample size: 2207
## Number of independent variables: 7
## Mtry: 2
## Target node size: 1
## Variable importance mode: none
## Splitrule: gini
## OOB prediction error: 19.48 %
Now let's explain this random forest model.
library("DALEX")
explain_titanic_rf <- explain(model_titanic_rf,
                              data = titanic_imputed,
                              y = titanic_imputed$survived,
                              label = "Random Forest v7",
                              colorize = FALSE)
## Preparation of a new explainer is initiated
## -> model label : Random Forest v7
## -> data : 2207 rows 8 cols
## -> target variable : 2207 values
## -> predict function : yhat.ranger will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package ranger , ver. 0.14.1 , task classification ( default )
## -> predicted values : numerical, min = 0 , mean = 0.2129588 , max = 1
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -1 , mean = 0.109198 , max = 1
## A new explainer has been created!
Analyze variable importance.
vi_rf <- model_parts(explain_titanic_rf)
head(vi_rf)
## variable mean_dropout_loss label
## 1 _full_model_ 0.2139575 Random Forest v7
## 2 survived 0.2139575 Random Forest v7
## 3 parch 0.2299543 Random Forest v7
## 4 sibsp 0.2327692 Random Forest v7
## 5 embarked 0.2385094 Random Forest v7
## 6 fare 0.2579893 Random Forest v7
plot(vi_rf)
The most important variable is gender. Next, let's examine the relationship between the response and the explanatory variables.
vr_age <- model_profile(explain_titanic_rf, variables = "age")
head(vr_age)
## $cp_profiles
## Top profiles :
## gender age class embarked fare sibsp parch survived _yhat_
## 682 male 0.1666667 3rd Southampton 7.11 0 0 0 0
## 682.1 male 2.0000000 3rd Southampton 7.11 0 0 0 1
## 682.2 male 4.0000000 3rd Southampton 7.11 0 0 0 1
## 682.3 male 7.0000000 3rd Southampton 7.11 0 0 0 1
## 682.4 male 9.0000000 3rd Southampton 7.11 0 0 0 1
## 682.5 male 13.0000000 3rd Southampton 7.11 0 0 0 0
## _vname_ _ids_ _label_
## 682 age 682 Random Forest v7
## 682.1 age 682 Random Forest v7
## 682.2 age 682 Random Forest v7
## 682.3 age 682 Random Forest v7
## 682.4 age 682 Random Forest v7
## 682.5 age 682 Random Forest v7
##
##
## Top observations:
## gender age class embarked fare sibsp parch survived
## 682 male 26 3rd Southampton 7.1100 0 0 0
## 1894 male 30 engineering crew Southampton 0.0000 0 0 1
## 2124 male 32 engineering crew Southampton 0.0000 0 0 0
## 348 female 27 1st Cherbourg 247.1005 1 1 1
## 269 male 27 3rd Southampton 7.1711 0 0 0
## 1204 male 20 3rd Queenstown 7.1500 0 0 0
## _yhat_ _label_ _ids_
## 682 0 Random Forest v7 1
## 1894 0 Random Forest v7 2
## 2124 0 Random Forest v7 3
## 348 1 Random Forest v7 4
## 269 0 Random Forest v7 5
## 1204 0 Random Forest v7 6
##
## $agr_profiles
## Top profiles :
## _vname_ _label_ _x_ _yhat_ _ids_
## 1 age Random Forest v7 0.1666667 0.56 0
## 2 age Random Forest v7 2.0000000 0.60 0
## 3 age Random Forest v7 4.0000000 0.63 0
## 4 age Random Forest v7 7.0000000 0.53 0
## 5 age Random Forest v7 9.0000000 0.46 0
## 6 age Random Forest v7 13.0000000 0.30 0
##
## $color
## [1] "#4378bf"
plot(vr_age)
Explain an individual model prediction.
new_passanger <- data.frame(
  class = factor("1st", levels = c("1st", "2nd", "3rd", "deck crew", "engineering crew", "restaurant staff", "victualling crew")),
  gender = factor("male", levels = c("female", "male")),
  age = 8,
  sibsp = 0,
  parch = 0,
  fare = 72,
  embarked = factor("Southampton", levels = c("Belfast", "Cherbourg", "Queenstown", "Southampton"))
)
sp_rf <- predict_parts(explain_titanic_rf, new_passanger)
plot(sp_rf)
The most important features for this passenger appear to be age and gender. His predicted survival odds are higher than those of an average passenger, mainly because of his young age, despite him being male.
DALEXtra extends the previous package and integrates with scikit-learn, keras, H2O, mljar, mlr, and tidymodels.
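For instance, a model trained in Python and serialized with pickle can be wrapped directly. A minimal sketch, assuming the explain_scikitlearn() function from DALEXtra; the pickle file name here is hypothetical:
library("DALEXtra")
# wrap a scikit-learn model serialized with pickle ("titanic_gbm.pkl" is a placeholder)
explainer_sk <- explain_scikitlearn("titanic_gbm.pkl",
                                    data = titanic_imputed[, -8],
                                    y = titanic_imputed$survived,
                                    label = "sklearn GBM")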
Here we take an H2O model as the example.
library(h2o)
##
## ----------------------------------------------------------------------
##
## Your next step is to start H2O:
## > h2o.init()
##
## For H2O package documentation, ask for help:
## > ??h2o
##
## After starting H2O, you can use the Web UI at http://localhost:54321
## For more information visit https://docs.h2o.ai
##
## ----------------------------------------------------------------------
##
## Attaching package: 'h2o'
## The following objects are masked from 'package:stats':
##
## cor, sd, var
## The following objects are masked from 'package:base':
##
## &&, %*%, %in%, ||, apply, as.factor, as.numeric, colnames,
## colnames<-, ifelse, is.character, is.factor, is.numeric, log,
## log10, log1p, log2, round, signif, trunc
library(DALEXtra)
# data <- DALEX::titanic_imputed
# init h2o
cluster <- try(h2o::h2o.init())
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 4 hours 5 minutes
## H2O cluster timezone: Asia/Shanghai
## H2O data parsing timezone: UTC
## H2O cluster version: 3.38.0.1
## H2O cluster version age: 1 month and 21 days
## H2O cluster name: H2O_started_from_R_milin_wbn438
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.92 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## R Version: R version 4.2.1 (2022-06-23)
# stop h2o progress printing
h2o.no_progress()
# split the data
# h2o_split <- h2o.splitFrame(as.h2o(data))
# train <- h2o_split[[1]]
# test <- as.data.frame(h2o_split[[2]])
# h2o automl takes target as factor
# train$survived <- as.factor(train$survived)
# fit a model
# automl <- h2o.automl(y = "survived",
# training_frame = train,
# max_runtime_secs = 30)
# create an explainer for the model
# explainer <- explain_h2o(automl,
# data = test,
# y = test$survived,
# label = "h2o")
titanic_test <- read.csv(system.file("extdata", "titanic_test.csv", package = "DALEXtra"))
titanic_train <- read.csv(system.file("extdata", "titanic_train.csv", package = "DALEXtra"))
titanic_h2o <- h2o::as.h2o(titanic_train)
titanic_h2o["survived"] <- h2o::as.factor(titanic_h2o["survived"])
titanic_test_h2o <- h2o::as.h2o(titanic_test)
model <- h2o::h2o.gbm(
  training_frame = titanic_h2o,
  y = "survived",
  distribution = "bernoulli",
  ntrees = 500,
  max_depth = 4,
  min_rows = 12,
  learn_rate = 0.001
)
ex_h2o <- explain_h2o(model, titanic_test[,1:17], titanic_test[,18])
## Preparation of a new explainer is initiated
## -> model label : H2OBinomialModel ( default )
## -> data : 524 rows 17 cols
## -> target variable : 524 values
## -> predict function : yhat.H2OBinomialModel will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package h2o , ver. 3.38.0.1 , task classification ( default )
## -> predicted values : numerical, min = 0.1994385 , mean = 0.3299534 , max = 0.5876636
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -0.5847857 , mean = -0.01888466 , max = 0.8005615
## A new explainer has been created!
ex_h2o
## Model label: H2OBinomialModel
## Model class: H2OBinomialModel
## Data head :
## gender.female gender.male age class.1st class.2nd class.3rd class.deck.crew
## 1 1 0 16 0 0 1 0
## 2 0 1 25 0 0 1 0
## class.engineering.crew class.restaurant.staff class.victualling.crew
## 1 0 0 0
## 2 0 0 0
## embarked.Belfast embarked.Cherbourg embarked.Queenstown embarked.Southampton
## 1 0 0 0 1
## 2 0 0 0 1
## fare sibsp parch
## 1 7.13 0 0
## 2 7.13 0 0
Analyze variable importance.
imp <- model_parts(ex_h2o)
imp
## variable mean_dropout_loss label
## 1 _full_model_ 0.2027514 H2OBinomialModel
## 2 sibsp 0.1965960 H2OBinomialModel
## 3 embarked.Cherbourg 0.2026987 H2OBinomialModel
## 4 gender.male 0.2027514 H2OBinomialModel
## 5 class.2nd 0.2027514 H2OBinomialModel
## 6 class.engineering.crew 0.2027514 H2OBinomialModel
## 7 class.victualling.crew 0.2027514 H2OBinomialModel
## 8 embarked.Belfast 0.2027514 H2OBinomialModel
## 9 embarked.Queenstown 0.2027514 H2OBinomialModel
## 10 embarked.Southampton 0.2027514 H2OBinomialModel
## 11 parch 0.2027514 H2OBinomialModel
## 12 class.restaurant.staff 0.2109886 H2OBinomialModel
## 13 fare 0.2112248 H2OBinomialModel
## 14 age 0.2129132 H2OBinomialModel
## 15 class.1st 0.2267874 H2OBinomialModel
## 16 class.deck.crew 0.2277595 H2OBinomialModel
## 17 class.3rd 0.2407041 H2OBinomialModel
## 18 gender.female 0.4084505 H2OBinomialModel
## 19 _baseline_ 0.4980346 H2OBinomialModel
plot(imp)
Explore the relationship between an explanatory variable and the response.
pre_h2o <- model_profile(ex_h2o,variables = "age")
plot(pre_h2o)
Explain a prediction. Note that the new observation must be a data frame here, not an H2O frame object.
sp_rf <- predict_parts(ex_h2o,titanic_test[1,])
plot(sp_rf)
The ingredients package is a collection of tools for analyzing variable importance and variable effects. Key functions demonstrated below include feature_importance(), partial_dependence(), conditional_dependence(), accumulated_dependence(), ceteris_paribus(), and describe().
Let's look at an example. First, build a model.
library("DALEX")
head(titanic_imputed)
## gender age class embarked fare sibsp parch survived
## 1 male 42 3rd Southampton 7.11 0 0 0
## 2 male 13 3rd Southampton 20.05 0 2 0
## 3 male 16 3rd Southampton 20.05 1 1 0
## 4 female 39 3rd Southampton 20.05 1 1 1
## 5 female 16 3rd Southampton 7.13 0 0 1
## 6 male 25 3rd Southampton 7.13 0 0 1
library("ranger")
model_titanic_rf <- ranger(survived ~ gender + age + class + embarked + fare + sibsp + parch,
                           data = titanic_imputed, probability = TRUE)
model_titanic_rf
## Ranger result
##
## Call:
## ranger(survived ~ gender + age + class + embarked + fare + sibsp + parch, data = titanic_imputed, probability = TRUE)
##
## Type: Probability estimation
## Number of trees: 500
## Sample size: 2207
## Number of independent variables: 7
## Mtry: 2
## Target node size: 10
## Variable importance mode: none
## Splitrule: gini
## OOB prediction error (Brier s.): 0.1420633
Create an explainer for the model:
library("DALEX")
explain_titanic_rf <- explain(model_titanic_rf,
                              data = titanic_imputed[, -8],
                              y = titanic_imputed[, 8],
                              label = "Random Forest")
## Preparation of a new explainer is initiated
## -> model label : Random Forest
## -> data : 2207 rows 7 cols
## -> target variable : 2207 values
## -> predict function : yhat.ranger will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package ranger , ver. 0.14.1 , task classification ( default )
## -> predicted values : numerical, min = 0.0141245 , mean = 0.3218769 , max = 0.9918776
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -0.7839062 , mean = 0.0002798463 , max = 0.8830055
## A new explainer has been created!
Compute variable importance.
library("ingredients")
##
## Attaching package: 'ingredients'
## The following object is masked from 'package:DALEX':
##
## feature_importance
fi_rf <- feature_importance(explain_titanic_rf)
head(fi_rf)
## variable mean_dropout_loss label
## 1 _full_model_ 0.3406770 Random Forest
## 2 parch 0.3507961 Random Forest
## 3 sibsp 0.3516941 Random Forest
## 4 embarked 0.3524095 Random Forest
## 5 age 0.3765780 Random Forest
## 6 fare 0.3832976 Random Forest
plot(fi_rf)
Compute the relationship between explanatory variables and the response.
pp_age <- partial_dependence(explain_titanic_rf, variables = c("age", "fare"))
head(pp_age)
plot(pp_age)
describe(pp_age)
Children around age 5 have a higher survival probability.
Conditional dependence profiles
cp_age <- conditional_dependence(explain_titanic_rf, variables = c("age", "fare"))
head(cp_age)
## Top profiles :
## _vname_ _label_ _x_ _yhat_ _ids_
## age.Random Forest.0.1666666667 age Random Forest 0.1666667 0.5218634 0
## age.Random Forest.2 age Random Forest 2.0000000 0.5506402 0
## age.Random Forest.4 age Random Forest 4.0000000 0.5563661 0
## age.Random Forest.7 age Random Forest 7.0000000 0.5056455 0
## age.Random Forest.9 age Random Forest 9.0000000 0.4910581 0
## age.Random Forest.13 age Random Forest 13.0000000 0.4280250 0
plot(cp_age)
Accumulated Local Effect Profiles
ap_age <- accumulated_dependence(explain_titanic_rf, variables = c("age", "fare"))
plot(ap_age)
Analyze an individual prediction. Here is a break-down explanation of the model prediction for an 8-year-old boy traveling first class who embarked at Southampton.
new_passanger <- data.frame(
  class = factor("1st", levels = c("1st", "2nd", "3rd", "deck crew", "engineering crew", "restaurant staff", "victualling crew")),
  gender = factor("male", levels = c("female", "male")),
  age = 8,
  sibsp = 0,
  parch = 0,
  fare = 72,
  embarked = factor("Southampton", levels = c("Belfast", "Cherbourg", "Queenstown", "Southampton"))
)
sp_rf <- ceteris_paribus(explain_titanic_rf, new_passanger)
plot(sp_rf) +
show_observations(sp_rf)
describe(sp_rf,variables = "fare")
## Random Forest predicts that for the selected instance, prediction is equal to 0.462
##
## The highest prediction occurs for (fare = 26.11), while the lowest for (fare = 51.117764).
## Breakpoint is identified at (fare = 34.011).
##
## Average model responses are *higher* for variable values *lower* than breakpoint (= 34.011).
For selected categorical variables. Note that sibsp is numeric, but it is displayed here as a categorical variable.
plot(sp_rf,
     variables = c("class", "embarked", "gender", "sibsp"),
     variable_type = "categorical")
Again, the most important features for this passenger appear to be age and gender. His predicted survival odds are higher than those of an average passenger, mainly because of his young age, despite him being male.
Profile clustering
passangers <- select_sample(titanic, n = 100)
sp_rf <- ceteris_paribus(explain_titanic_rf, passangers)
clust_rf <- cluster_profiles(sp_rf, k = 3)
head(clust_rf)
## Top profiles :
## _vname_ _label_ _x_ _cluster_ _yhat_ _ids_
## 1 fare Random Forest_1 0.0000000 1 0.2353243 0
## 2 parch Random Forest_1 0.0000000 1 0.1672138 0
## 3 sibsp Random Forest_1 0.0000000 1 0.1703429 0
## 4 age Random Forest_1 0.1666667 1 0.4574496 0
## 5 parch Random Forest_1 1.0000000 1 0.2350421 0
## 6 sibsp Random Forest_1 1.0000000 1 0.1471561 0
plot(sp_rf, alpha = 0.1) +
show_aggregated_profiles(clust_rf, color = "_label_", size = 2)
describe() helps us interpret the results.
library("DALEX")
library("ingredients")
library("ranger")
model_titanic_rf <- ranger(survived ~., data = titanic_imputed, probability = TRUE)
explain_titanic_rf <- explain(model_titanic_rf,
                              data = titanic_imputed[, -8],
                              y = titanic_imputed[, 8],
                              label = "ranger forest",
                              verbose = FALSE)
selected_passangers <- select_sample(titanic_imputed, n = 10)
cp_rf <- ceteris_paribus(explain_titanic_rf, selected_passangers)
pdp <- aggregate_profiles(cp_rf, type = "partial", variable_type = "categorical") # aggregate the profiles
pdp
## Top profiles :
## _vname_ _label_ _x_ _yhat_ _ids_
## 8 gender ranger forest female 0.7163783 0
## 9 gender ranger forest male 0.2523073 0
## 1 class ranger forest 1st 0.4573005 0
## 2 class ranger forest 2nd 0.3722280 0
## 3 class ranger forest 3rd 0.2941767 0
## 6 class ranger forest deck crew 0.3651602 0
describe(pdp, variables = "gender")
## Ranger forest's mean prediction is equal to 0.313.
##
## Model's prediction would decrease substantially if the value of gender variable would change to "female".
## The largest change would be marked if gender variable would change to "male".
##
## Other variables are with less importance and they do not change prediction by more than 0.06%.
The iBreakDown package is a model-agnostic tool for explaining the predictions of black-box ML models. Break-down tables show the contribution of each variable to the final prediction; break-down plots present those contributions in a concise graphical form. SHAP (SHapley Additive exPlanations) values are computed as averages over random break-down profiles. The package works for binary classifiers and for regression models.
library("DALEX")
head(titanic)
## gender age class embarked country fare sibsp parch survived
## 1 male 42 3rd Southampton United States 7.11 0 0 no
## 2 male 13 3rd Southampton United States 20.05 0 2 no
## 3 male 16 3rd Southampton United States 20.05 1 1 no
## 4 female 39 3rd Southampton England 20.05 1 1 yes
## 5 female 16 3rd Southampton Norway 7.13 0 0 yes
## 6 male 25 3rd Southampton United States 7.13 0 0 yes
# prepare model
library("randomForest")
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ranger':
##
## importance
titanic <- na.omit(titanic)
model_titanic_rf <- randomForest(survived == "yes" ~ gender + age + class + embarked + fare + sibsp + parch,
                                 data = titanic)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
model_titanic_rf
##
## Call:
## randomForest(formula = survived == "yes" ~ gender + age + class + embarked + fare + sibsp + parch, data = titanic)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 0.1430273
## % Var explained: 34.74
library("DALEX")
explain_titanic_rf <- explain(model_titanic_rf,
                              data = titanic[, -9],
                              y = titanic$survived == "yes",
                              label = "Random Forest v7")
## Preparation of a new explainer is initiated
## -> model label : Random Forest v7
## -> data : 2099 rows 8 cols
## -> target variable : 2099 values
## -> predict function : yhat.randomForest will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package randomForest , ver. 4.7.1.1 , task regression ( default )
## -> model_info : Model info detected regression task but 'y' is a logical . ( WARNING )
## -> model_info : By deafult regressions tasks supports only numercical 'y' parameter.
## -> model_info : Consider changing to numerical vector.
## -> model_info : Otherwise I will not be able to calculate residuals or loss function.
## -> predicted values : numerical, min = 0.008045946 , mean = 0.3239705 , max = 0.9926026
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -0.7924928 , mean = 0.0004697528 , max = 0.9044595
## A new explainer has been created!
Let's look at the break-down of the model prediction for an 8-year-old boy traveling first class who embarked at Southampton.
new_passanger <- data.frame(
  class = factor("1st", levels = c("1st", "2nd", "3rd", "deck crew", "engineering crew", "restaurant staff", "victualling crew")),
  gender = factor("male", levels = c("female", "male")),
  age = 8,
  sibsp = 0,
  parch = 0,
  fare = 72,
  embarked = factor("Southampton", levels = c("Belfast", "Cherbourg", "Queenstown", "Southampton"))
)
Compute variable attributions.
library("iBreakDown")
##
## Attaching package: 'iBreakDown'
## The following objects are masked from 'package:ingredients':
##
## describe, plotD3
rf_la <- local_attributions(explain_titanic_rf, new_passanger)
rf_la
## contribution
## Random Forest v7: intercept 0.324
## Random Forest v7: age = 8 0.191
## Random Forest v7: class = 1st 0.071
## Random Forest v7: gender = male -0.055
## Random Forest v7: fare = 72 -0.037
## Random Forest v7: embarked = Southampton -0.019
## Random Forest v7: sibsp = 0 -0.001
## Random Forest v7: parch = 0 -0.030
## Random Forest v7: prediction 0.445
describe(rf_la)
## Random Forest v7 predicts, that the prediction for the selected instance is 0.445 which is higher than the average model prediction.
##
## The most important variable that increase the prediction is age.
##
## Other variables are with less importance. The contribution of all other variables is -0.087.
Plot the attributions.
plot(rf_la)
plotD3(rf_la)
Compute the uncertainty of the variable attributions.
rf_la_un <- break_down_uncertainty(explain_titanic_rf, new_passanger, path = "average")
describe(rf_la_un)
## Random Forest v7 predicts, that the prediction for the selected instance is 0.445 which is higher than the average model prediction.For the selected instance model's prediction is higher, than for 74% of all observations.
##
## The most important variable that increase the prediction is age.
##
## Other variables are with less importance. The contribution of all other variables is -0.061.
plot(rf_la_un)
plotD3(rf_la, max_features = 3)
This addresses a gap in interpretability for domain experts: the describe() function automatically generates a natural-language description of explanations produced with the iBreakDown package.
The iBreakDown package produces feature-attribution explanations. These justify a model's prediction by showing which variables affect it and by how much: each variable receives an importance coefficient, and the coefficients sum approximately to the model's prediction.
iBreakDown offers two approaches: shap() and break_down().
shap() generates a SHAP explanation, i.e. it assigns a Shapley value to each variable. break_down() uses the break-down algorithm to generate an efficient approximation of Shapley values.
library("DALEX")
library("iBreakDown")
library("randomForest")
titanic <- na.omit(titanic)
model_titanic_rf <- randomForest(survived == "yes" ~ ., data = titanic)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
explain_titanic_rf <- explain(model_titanic_rf,
                              data = titanic[, -9],
                              y = titanic$survived == "yes",
                              label = "Random Forest")
## Preparation of a new explainer is initiated
## -> model label : Random Forest
## -> data : 2099 rows 8 cols
## -> target variable : 2099 values
## -> predict function : yhat.randomForest will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package randomForest , ver. 4.7.1.1 , task regression ( default )
## -> model_info : Model info detected regression task but 'y' is a logical . ( WARNING )
## -> model_info : By deafult regressions tasks supports only numercical 'y' parameter.
## -> model_info : Consider changing to numerical vector.
## -> model_info : Otherwise I will not be able to calculate residuals or loss function.
## -> predicted values : numerical, min = 0.008234269 , mean = 0.324069 , max = 0.9922297
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -0.8192322 , mean = 0.0003711799 , max = 0.8905284
## A new explainer has been created!
passanger <- titanic[sample(nrow(titanic), 1), -9]
passanger
## gender age class embarked country fare sibsp parch
## 1455 male 40 engineering crew Southampton England 0 0 0
Now we are ready to generate the shap() and break_down() explanations.
bd_rf <- break_down(explain_titanic_rf,
                    passanger,
                    keep_distributions = TRUE) # distributions should be kept
shap_rf <- shap(explain_titanic_rf,
                passanger)
plot(bd_rf)
plot(shap_rf)
Generate natural-language descriptions of both explanations.
describe(bd_rf)
## Random Forest predicts, that the prediction for the selected instance is 0.185 which is lower than the average model prediction. For the selected instance model's prediction is lower, than for 57% of all observations.
##
## The most important variable that decrease the prediction is gender.
##
## Other variables are with less importance. The contribution of all other variables is 0.017.
describe(shap_rf)
## Random Forest predicts, that the prediction for the selected instance is 0.185 which is lower than the average model prediction. For the selected instance model's prediction is lower, than for 57% of all observations.
##
## The most important variable that decrease the prediction is gender.
##
## Other variables are with less importance. The contribution of all other variables is 0.01.
describe() accepts several arguments that control the generated description.
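A sketch of some commonly used options, assuming the argument names documented in iBreakDown (label, display_values, display_numbers, short_description):
describe(bd_rf,
         label = "The passenger's survival probability", # opening phrase of the description
         display_values = TRUE,      # mention variable values, not only variable names
         display_numbers = TRUE,     # quote the numeric contributions
         short_description = FALSE)  # TRUE would return a one-sentence summary instead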
The auditor package is a model-agnostic validation tool. It helps assess and compare the goodness of fit and performance of models, and can also be used to analyze the similarity of residuals and to identify outliers and influential observations.
Paper: https://arxiv.org/abs/1809.07763
Example
library(auditor)
##
## Attaching package: 'auditor'
## The following object is masked from 'package:iBreakDown':
##
## plotD3
## The following object is masked from 'package:ingredients':
##
## plotD3
## The following object is masked from 'package:DALEX':
##
## model_performance
library(randomForest)
data(mtcars)
# fitting models
model_lm <- lm(mpg ~ ., data = mtcars)
set.seed(123)
model_rf <- randomForest(mpg ~ ., data = mtcars)
# creating objects with 'explain' function from the package DALEX
# that contains all necessary components required for further processing
exp_lm <- DALEX::explain(model_lm, data = mtcars, y = mtcars$mpg, verbose = FALSE)
exp_rf <- DALEX::explain(model_rf, data = mtcars, y = mtcars$mpg, label = "rf", verbose = FALSE)
# create explanation objects
mr_lm <- model_residual(exp_lm)
mr_rf <- model_residual(exp_rf)
# generating plots
plot_residual(mr_lm, mr_rf, variable = "wt", smooth = TRUE)
Let's look at another example, this time a binary classification model.
model_glm <- glm(survived ~ ., family = binomial, data = titanic_imputed)
glm_audit <- audit(model_glm,
                   data = titanic_imputed,
                   y = titanic_imputed$survived)
## Preparation of a new explainer is initiated
## -> model label : lm ( default )
## -> data : 2207 rows 8 cols
## -> target variable : 2207 values
## -> predict function : yhat.glm will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package stats , ver. 4.2.1 , task classification ( default )
## -> predicted values : numerical, min = 0.008128381 , mean = 0.3221568 , max = 0.9731431
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -0.9628583 , mean = -2.569729e-10 , max = 0.9663346
## A new explainer has been created!
# validate a model with auditor
eva_glm <- model_evaluation(glm_audit)
# plot results
plot_prc(eva_glm)
## Warning: Removed 1 row containing missing values (`geom_step()`).
plot(eva_glm)
Cheatsheet: https://github.com/ModelOriented/auditor/blob/master/materials/auditor_cheatsheet.pdf
fairmodels provides flexible tools for bias detection, visualization, and mitigation. Fairness metrics for classification are computed from the confusion matrix.
Official website: https://fairmodels.drwhy.ai/
The key function is fairness_check().
Suppose you are building an algorithm for a court system that predicts whether someone will reoffend. You collect the data, build the model, and predict outcomes, reaching satisfactory accuracy. That is good, but the model appears more likely to say that African Americans will reoffend: it was trained on data with evident racial bias, and a racist algorithm is no good for a court system. We now have two options: collect more reliable data, or try to understand the bias and choose the least biased model. We pursue the second option.
library(fairmodels)
data("compas")
head(compas)
## Two_yr_Recidivism Number_of_Priors Age_Above_FourtyFive Age_Below_TwentyFive
## 1 0 0 1 0
## 2 1 0 0 0
## 3 1 4 0 1
## 4 0 0 0 0
## 5 1 14 0 0
## 6 0 3 0 0
## Misdemeanor Ethnicity Sex
## 1 0 Other Male
## 2 0 African_American Male
## 3 0 African_American Male
## 4 1 Other Male
## 5 0 Caucasian Male
## 6 0 Other Male
For fairmodels to work as intended, we flip the factor levels of the target variable so that the model predicts the positive outcome (not being a recidivist).
compas$Two_yr_Recidivism <- as.factor(ifelse(compas$Two_yr_Recidivism == '1', '0', '1'))
Train a random forest model.
library(DALEX)
library(ranger)
# train
rf_compas <- ranger(Two_yr_Recidivism ~., data = compas, probability = TRUE)
# numeric target values
y_numeric <- as.numeric(compas$Two_yr_Recidivism)-1
# explainer
rf_explainer <- explain(rf_compas, data = compas[,-1], y = y_numeric, colorize = FALSE)
## Preparation of a new explainer is initiated
## -> model label : ranger ( default )
## -> data : 6172 rows 6 cols
## -> target variable : 6172 values
## -> predict function : yhat.ranger will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package ranger , ver. 0.14.1 , task classification ( default )
## -> predicted values : numerical, min = 0.1638799 , mean = 0.5448618 , max = 0.8663325
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -0.8463783 , mean = 1.83503e-05 , max = 0.7790153
## A new explainer has been created!
Then we create a fairness object with fairness_check(); this function can aggregate many explainers.
fobject <- fairness_check(rf_explainer,                  # explainer
                          protected = compas$Ethnicity,  # protected variable, as a factor
                          privileged = "Caucasian",      # potentially more privileged level of the protected variable
                          cutoff = 0.5,                  # optional cutoff, default = 0.5
                          colorize = FALSE)
## Creating fairness classification object
## -> Privileged subgroup : character ( Ok )
## -> Protected variable : factor ( Ok )
## -> Cutoff values for explainers : 0.5 ( for all subgroups )
## -> Fairness objects : 0 objects
## -> Checking explainers : 1 in total ( compatible )
## -> Metric calculation : 13/13 metrics calculated for all models
## Fairness object created succesfully
Let's see whether our model is biased.
print(fobject, colorize = FALSE)
##
## Fairness check for models: ranger
##
## ranger passes 2/5 metrics
## Total loss : 3.713474
plot(fobject)
If a bar reaches the red zone on the left, it signals bias against the corresponding unprivileged subgroup; if it reaches the zone on the right, it signals bias in favor of the privileged group.
Why is the model biased? It simply learned from biased data, as the plot below shows.
plot_density(fobject)
As we can see, the model is more likely to classify African Americans as recidivists. See the official tutorial for more, including bias mitigation; a sketch of one mitigation technique follows.
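Here is a sketch of one pre-processing approach, assuming the reweight() function as documented in fairmodels; it returns case weights that de-correlate the protected attribute from the target:
# compute weights, then retrain the model with them
weights <- reweight(protected = compas$Ethnicity, y = y_numeric)
rf_compas_weighted <- ranger(Two_yr_Recidivism ~ ., data = compas,
                             probability = TRUE, case.weights = weights)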
The vivo package helps compute instance-level variable importance (a measure of local sensitivity).
https://github.com/ModelOriented/vivo#intuition
Ceteris paribus is a Latin phrase meaning "other things held constant" or "all else unchanged". Ceteris-paribus plots show how the model response depends on changes in a single input variable while all other variables are kept fixed. They work for any machine learning model and allow model comparison, to better understand how a black-box model works.
In short: how does this one variable affect the prediction when everything else stays the same?
library("randomForest")
apartments_rf_model <- randomForest(m2.price ~ construction.year + surface + floor + no.rooms,
                                    data = apartments)
explainer_rf <- explain(apartments_rf_model,
                        data = apartmentsTest[, 2:5], y = apartmentsTest$m2.price)
## Preparation of a new explainer is initiated
## -> model label : randomForest ( default )
## -> data : 9000 rows 4 cols
## -> target variable : 9000 values
## -> predict function : yhat.randomForest will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package randomForest , ver. 4.7.1.1 , task regression ( default )
## -> predicted values : numerical, min = 2093.116 , mean = 3513.504 , max = 5379.316
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -1177.235 , mean = -1.980654 , max = 2166.446
## A new explainer has been created!
Compute global variable importance.
profiles <- model_profile(explainer_rf)
plot(profiles)
library("vivo")
measure <- global_variable_importance(profiles)
plot(measure)
We can see that the most important variable is surface. Next, create a new observation.
new_apartment <- data.frame(construction.year = 1998, surface = 88, floor = 2L, no.rooms = 3)
predict(apartments_rf_model, new_apartment)
## 1
## 3895.99
Compute ceteris-paribus profiles.
library("ingredients")
profiles <- predict_profile(explainer_rf, new_apartment)
plot(profiles) + show_observations(profiles)
Compute a measure of local variable importance.
library("vivo")
measure <- local_variable_importance(profiles, apartments[, 2:5],
                                     absolute_deviation = TRUE, point = TRUE, density = TRUE)
plot(measure)
The randomForestExplainer package helps us understand what happens inside a random forest model.
Cheatsheet: https://github.com/ModelOriented/randomForestExplainer/blob/master/materials/cheatsheet.pdf
Tutorial: https://modeloriented.github.io/randomForestExplainer/articles/randomForestExplainer.html
library(randomForest)
library(randomForestExplainer)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
# localImp = TRUE stores the local (casewise) importance of each variable
ran <- randomForest(Species ~ ., data = iris, localImp = TRUE)
foresImp <- measure_importance(ran)
plot_importance_ggpairs(foresImp)
plot_importance_rankings(foresImp)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
plot_multi_way_importance(foresImp)
measure_importance() computes a variety of importance measures.
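The same package offers further views into the forest. A short sketch using min_depth_distribution(), described in the tutorial above, which summarizes how close to the root each variable tends to split:
md <- min_depth_distribution(ran)
# variables with a low minimal depth split near the roots of the trees,
# i.e. they matter for the largest fraction of observations
plot_min_depth_distribution(md)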
The xspliner package is a set of tools for training interpretable surrogate machine learning models. Its main function is xspline(); xs() marks the formula terms to be transformed, as the example below shows.
https://modeloriented.github.io/xspliner/articles/xspliner.html
In regression and classification problems, the main task is to choose the model that best meets our requirements. On the one hand, we want the solution with the best statistical properties, such as accuracy, RMSE, or R-squared; on the other hand, we care about interpretability.
In the first case, black-box models such as random forests or XGBoost outperform the rest, while the second is dominated by linear models. The xspliner package aims to combine the two approaches: use knowledge gathered from a black-box model to build an interpretable linear model.
# devtools::install_github("ModelOriented/xspliner")
library(xspliner)
##
## Attaching package: 'xspliner'
## The following object is masked from 'package:graphics':
##
## xspline
library(pdp)
data(boston)
str(boston)
## 'data.frame': 506 obs. of 16 variables:
## $ lon : num -71 -71 -70.9 -70.9 -70.9 ...
## $ lat : num 42.3 42.3 42.3 42.3 42.3 ...
## $ cmedv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 22.1 16.5 18.9 ...
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : int 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ b : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
boston_rf <- randomForest(cmedv ~ rm + lstat + nox, data = boston)
rf_effect <- pdp::partial(boston_rf, "nox", grid.resolution = 40)
head(rf_effect)
## nox yhat
## 1 0.3850000 23.45756
## 2 0.3974615 23.46405
## 3 0.4099231 23.53536
## 4 0.4223846 23.69231
## 5 0.4348462 23.63515
## 6 0.4473077 23.75913
xp_model <- xspline(
  cmedv ~ rm + lstat +
    xs(nox,
       effect = list(type = "pdp", grid.resolution = 40),
       transition = list(k = 10, bs = "cr")),
  model = boston_rf
)
summary(xp_model)
##
## Call:
## stats::glm(formula = cmedv ~ rm + lstat + xs(nox), family = family,
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -13.4004 -3.3024 -0.9269 2.0446 27.1490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -33.73798 5.03303 -6.703 5.49e-11 ***
## rm 5.24996 0.41654 12.604 < 2e-16 ***
## lstat -0.51949 0.04388 -11.838 < 2e-16 ***
## xs(nox) 1.28752 0.16044 8.025 7.24e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 26.87613)
##
## Null deviance: 42578 on 505 degrees of freedom
## Residual deviance: 13492 on 502 degrees of freedom
## AIC: 3107.3
##
## Number of Fisher Scoring iterations: 2
shapper is an R wrapper around the SHAP Python library.
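A minimal usage sketch, assuming the individual_variable_effect() function from the shapper README; the Python shap library must be installed first (shapper ships install_shap() for that):
library("shapper")
# install_shap()  # one-time setup of the Python dependency
ive_rf <- individual_variable_effect(model_titanic_rf,
                                     data = titanic[, -9],
                                     new_observation = titanic[1, -9])
plot(ive_rf)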
The drifter package identifies concept drift in the model structure or the data structure.
Machine learning models are usually fitted to and validated on historical data, under the assumption that the data are stationary. The most popular validation techniques (k-fold cross-validation, repeated cross-validation, and so on) test models on data drawn from the same distribution as the training data.
However, in many practical applications deployed models work in a changing environment. After some time the model's performance can degrade, because the model may become less reliable as the environment changes.
Concept drift refers to a change over time in the data distribution or in the relationships between variables. Consider a model of a school's energy consumption: over time the school may get more equipment, or more energy-efficient equipment, which can affect the model's performance.
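A minimal sketch of a drift check, assuming the calculate_covariate_drift() function from drifter; the two apartments datasets from DALEX stand in for "old" and "new" data:
library("drifter")
# compare covariate distributions between "old" and "new" data;
# large values flag variables whose distribution has shifted
calculate_covariate_drift(DALEX::apartments, DALEX::apartmentsTest)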
The modelStudio package automates the explanatory analysis of machine learning predictive models. With a single line of code, it generates advanced, interactive model explanations in the form of a serverless HTML site. The tool is model-agnostic, and therefore compatible with most black-box predictive models and frameworks (e.g. mlr/mlr3, xgboost, caret, h2o, parsnip, tidymodels, scikit-learn, lightgbm, keras/tensorflow).
https://modelstudio.drwhy.ai/#demo
library("DALEX")
library("ranger")
library("modelStudio")
# fit a model
model <- ranger(score ~., data = happiness_train)
# create an explainer for the model
explainer <- explain(model,
                     data = happiness_test,
                     y = happiness_test$score,
                     label = "Random Forest")
# make a studio for the model
modelStudio(explainer)
Arena is an interactive tool that lets you explore and compare any models, regardless of their internal structure.
The arenar package can run in two modes: live (R runs in the background and computes all necessary explanations on demand) and serverless (all necessary explanations are computed up front).
https://arenar.drwhy.ai/articles/arena_intro_titanic.html
library("ranger")
titanic_rf <- ranger(survived ~ .,
                     data = titanic_imputed,
                     probability = TRUE,
                     classification = TRUE)
arenar, like every package in the DrWhy family, works on unified model wrappers, which we create with the explain() function from the DALEX package.
library("DALEX")
titanic_ex <- explain(
  titanic_rf,
  y = titanic_imputed$survived,
  data = titanic_imputed)
## Preparation of a new explainer is initiated
## -> model label : ranger ( default )
## -> data : 2207 rows 8 cols
## -> target variable : 2207 values
## -> predict function : yhat.ranger will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package ranger , ver. 0.14.1 , task classification ( default )
## -> predicted values : numerical, min = 0.01403656 , mean = 0.3224658 , max = 0.9881752
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -0.7797319 , mean = -0.0003090208 , max = 0.8932716
## A new explainer has been created!
Create the Arena.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following object is masked from 'package:DALEX':
##
## explain
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("arenar")
titanic_ar <- create_arena(live = TRUE) %>%
  push_model(titanic_ex)
Run the live Arena server.
run_server(titanic_ar)
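For the serverless mode, a plausible sketch (assuming the push_observations() and save_arena() functions from the arenar reference) precomputes everything into a static file that the Arena front end can load:
# precompute all explanations and write them to a static JSON file
create_arena(live = FALSE) %>%
  push_model(titanic_ex) %>%
  push_observations(titanic_imputed[1:5, ]) %>%
  save_arena("arena.json")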
Several official examples are available in the arenar documentation.
The modelDown package generates a website with an HTML summary of predictive models.
Paper: https://joss.theoj.org/papers/10.21105/joss.01444
http://github.com/ModelOriented/modelDown
devtools::install_github("ModelOriented/modelDown")
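A one-line usage sketch, reusing an explainer built earlier; modelDown() accepts one or more DALEX explainers, and output_folder is its documented output location:
library("modelDown")
# generates a static HTML site summarizing the model
modelDown(explain_titanic_rf, output_folder = "modelDown_titanic")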
The rSAFE package is a model-agnostic tool for making an interpretable white-box model more accurate with the help of an auxiliary black-box model, called the surrogate model. New features are extracted from a complex model, such as a neural network or a random forest, and then used when fitting a simpler, interpretable model, improving its overall performance.
Article: https://www.sciencedirect.com/science/article/pii/S016792362100066X
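A short sketch of the workflow, assuming the safe_extraction() and safely_transform_data() functions from the rSAFE documentation, reusing the apartments explainer built in the vivo section:
library("rSAFE")
# learn interpretable feature transformations from the black-box explainer
safe_extractor <- safe_extraction(explainer_rf, penalty = 25, verbose = FALSE)
# apply them to the data before fitting a simpler white-box model
apartments_transformed <- safely_transform_data(safe_extractor, apartments, verbose = FALSE)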
The EloML package provides an Elo rating system for machine learning models: the Elo Predictive Power (EPP) score rates models on the basis of their empirical performance.
The archivist package automates the serialization and deserialization of R objects. Objects are stored together with metadata to support the reproducibility and governance of data science projects.
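A minimal sketch of the archivist workflow; createLocalRepo(), saveToLocalRepo(), and loadFromLocalRepo() are its core documented functions:
library("archivist")
createLocalRepo(repoDir = "arepo")
# archiving returns an md5 hash that identifies the object...
hash <- saveToLocalRepo(titanic_rf, repoDir = "arepo")
# ...and later restores it, together with its metadata
titanic_rf_restored <- loadFromLocalRepo(hash, repoDir = "arepo", value = TRUE)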
A suite of checks for identifying potential errors in a data frame, as part of the data screening process.
inspectdf: a collection of utilities for columnwise summarizing, comparing, and visualizing data frames.
https://cran.r-project.org/web/packages/inspectdf/index.html
validate: professional data validation for the R environment.
errorlocate: find and replace erroneous fields in data using validation rules. https://github.com/data-cleaning/errorlocate
ggplot2: a system for declaratively creating graphics, based on the grammar of graphics. https://ggplot2.tidyverse.org/
SAFE: surrogate learning = train an elastic model and extract feature transformations. https://github.com/MI2DataLab/SAFE
xspliner: use surrogate black boxes to train interpretable spline-based additive models. https://github.com/ModelOriented/xspliner
factorMerger: a set of tools supporting post-hoc testing, to extract a hierarchy of factor levels with respect to a given response.
ingredients: a toolset for model-level feature effects and feature importance. https://github.com/ModelOriented/ingredients
caret: https://github.com/topepo/caret Article: https://www.jstatsoft.org/article/view/v028i05 Book: https://topepo.github.io/caret/
auditor: model verification, validation, and error analysis. https://github.com/ModelOriented/auditor
DALEX: descriptive machine learning explanations. https://github.com/ModelOriented/DALEX
iml: an R package for interpretable machine learning. https://github.com/christophM/iml
randomForestExplainer: a set of tools for understanding what happens inside a random forest. https://github.com/MI2DataLab/randomForestExplainer
https://github.com/ModelOriented/DrWhy/blob/master/README.md#6-model-delivery