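Concept drift is a common problem in machine learning: while a model is in production, the data distribution or the relationships between variables change over time, so the model's performance degrades or the model stops working altogether. It is therefore important to detect concept drift promptly.
drifter is an R package for detecting concept drift.
Here, covariate drift is defined as the non-intersection distance between two distributions. The formula is: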
\[ d(P,Q) = 1 - \sum_i \min(P_i, Q_i) \]
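To make the formula concrete, here is a minimal base R sketch of the non-intersection distance for one numeric variable. The helper name `non_intersection_distance` and the quantile-based binning are assumptions made for illustration; the binning that drifter uses internally may differ.

```r
# Non-intersection distance between two samples of one numeric variable.
# Assumption: both samples are binned on a shared grid of pooled quantiles.
non_intersection_distance <- function(x_old, x_new, bins = 20) {
  pooled <- c(x_old, x_new)
  breaks <- unique(quantile(pooled, probs = seq(0, 1, length.out = bins + 1)))
  # Relative frequency of each bin in the old (P) and new (Q) sample
  p <- as.numeric(table(cut(x_old, breaks, include.lowest = TRUE))) / length(x_old)
  q <- as.numeric(table(cut(x_new, breaks, include.lowest = TRUE))) / length(x_new)
  # d(P, Q) = 1 - sum_i min(P_i, Q_i)
  1 - sum(pmin(p, q))
}

set.seed(1)
same    <- non_intersection_distance(rnorm(1000), rnorm(1000))
shifted <- non_intersection_distance(rnorm(1000), rnorm(1000, mean = 2))
```

The distance is 0 when the two binned distributions coincide and 1 when they share no bin, so `shifted` should come out clearly larger than `same`.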
library(drifter)
library("DALEX")
#> Welcome to DALEX (version: 2.4.2).
#> Find examples and detailed introduction at: http://ema.drwhy.ai/
head(apartments, 3)
#>   m2.price construction.year surface floor no.rooms    district
#> 1     5897              1953      25     3        1 Srodmiescie
#> 2     1818              1992     143     9        5     Bielany
#> 3     3643              1937      56     1        2       Praga
head(apartments_test, 3)
#>      m2.price construction.year surface floor no.rooms    district
#> 1001     4644              1976     131     3        5 Srodmiescie
#> 1002     3082              1978     112     9        4     Mokotow
#> 1003     2498              1958     100     7        4     Bielany
# Data without drift
d <- calculate_covariate_drift(apartments, apartments_test)
d
#>               Variable  Shift
#> -------------------------------------
#>               m2.price    4.9
#>      construction.year    6.0
#>                surface    6.8
#>                  floor    4.9
#>               no.rooms    2.8
#>               district    2.8
# Data with drift
d <- calculate_covariate_drift(dragons, dragons_test)
d
#>               Variable  Shift
#> -------------------------------------
#>          year_of_birth    8.9
#>                 height   15.3  .
#>                 weight   14.7  .
#>                  scars    4.6
#>                 colour   18.0  .
#>      year_of_discovery   97.4  ***
#>   number_of_lost_teeth    6.3
#>            life_length    8.6
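Model drift is measured as the difference between the partial dependence profile (PDP) curves of the old and the new model.
library("DALEX")
# Old model: fitted on the original data; new model: fitted on the first 1000 rows of the test data
model_old <- lm(m2.price ~ ., data = apartments)
model_new <- lm(m2.price ~ ., data = apartments_test[1:1000,])
calculate_model_drift(model_old, model_new,
                      apartments_test[1:1000,],
                      apartments_test[1:1000,]$m2.price)
#> Preparation of a new explainer is initiated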
#> -> model label : model_old
#> -> data : 1000 rows 6 cols
#> -> target variable : not specified! ( WARNING )
#> -> predict function : predict_function
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package stats , ver. 4.2.1 , task regression ( default )
#> -> model_info : Model info detected regression task but 'y' is a NULL . ( WARNING )
#> -> model_info : By deafult regressions tasks supports only numercical 'y' parameter.
#> -> model_info : Consider changing to numerical vector.
#> -> model_info : Otherwise I will not be able to calculate residuals or loss function.
#> -> predicted values : numerical, min = 1867.309 , mean = 3488.057 , max = 6241.447
#> -> residual function : difference between y and yhat ( default )
#> A new explainer has been created!
#> Preparation of a new explainer is initiated
#> -> model label : model_new
#> -> data : 1000 rows 6 cols
#> -> target variable : not specified! ( WARNING )
#> -> predict function : predict_function
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package stats , ver. 4.2.1 , task regression ( default )
#> -> model_info : Model info detected regression task but 'y' is a NULL . ( WARNING )
#> -> model_info : By deafult regressions tasks supports only numercical 'y' parameter.
#> -> model_info : Consider changing to numerical vector.
#> -> model_info : Otherwise I will not be able to calculate residuals or loss function.
#> -> predicted values : numerical, min = 1867.309 , mean = 3488.057 , max = 6241.447
#> -> residual function : difference between y and yhat ( default )
#> A new explainer has been created!
#>               Variable    Shift  Scaled
#> -----------------------------------------------
#>                  floor    40.35     4.5
#>               no.rooms    53.54     6.0
#>                surface    63.60     7.1
#>               m2.price    53.16     6.0
#>      construction.year    53.85     6.0
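The idea of comparing PDP curves can be sketched in base R without any extra packages. The snippet below is an illustration only: the helpers `pdp_curve` and `pdp_distance`, the synthetic data, and the root-mean-square distance between curves are all assumptions, not a copy of what `calculate_model_drift` does internally.

```r
# Partial dependence of a model on one numeric variable: set the variable
# to each grid value for every row and average the predictions.
pdp_curve <- function(model, data, variable, grid) {
  sapply(grid, function(value) {
    data[[variable]] <- value
    mean(predict(model, newdata = data))
  })
}

# Root-mean-square distance between the PDP curves of two models
# (this choice of distance is an assumption made for the sketch).
pdp_distance <- function(model_old, model_new, data, variable, grid_points = 25) {
  grid <- seq(min(data[[variable]]), max(data[[variable]]), length.out = grid_points)
  old_curve <- pdp_curve(model_old, data, variable, grid)
  new_curve <- pdp_curve(model_new, data, variable, grid)
  sqrt(mean((old_curve - new_curve)^2))
}

# Synthetic example: the effect of x flips sign, the effect of z stays the same
set.seed(42)
n <- 500
old_data <- data.frame(x = runif(n, -1, 1), z = runif(n, -1, 1))
old_data$y <- 2 * old_data$x + old_data$z + rnorm(n, sd = 0.1)
new_data <- data.frame(x = runif(n, -1, 1), z = runif(n, -1, 1))
new_data$y <- -1 * new_data$x + new_data$z + rnorm(n, sd = 0.1)

m_old <- lm(y ~ x + z, data = old_data)
m_new <- lm(y ~ x + z, data = new_data)

dist_x <- pdp_distance(m_old, m_new, new_data, "x")  # relation changed
dist_z <- pdp_distance(m_old, m_new, new_data, "z")  # relation unchanged
```

In this setup `dist_x` should be much larger than `dist_z`, which mirrors how a per-variable table of shifts points at the variables whose relationship with the target has changed.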
The same comparison with tree-based regression models built with ranger:
library("ranger")
# ranger returns a prediction object, so extract the numeric predictions
predict_function <- function(m,x,...) predict(m, x, ...)$predictions
model_old <- ranger(m2.price ~ ., data = apartments)
model_new <- ranger(m2.price ~ ., data = apartments_test)
calculate_model_drift(model_old, model_new,
                      apartments_test,
                      apartments_test$m2.price,
                      predict_function = predict_function)
#> Preparation of a new explainer is initiated
#> -> model label : model_old
#> -> data : 9000 rows 6 cols
#> -> target variable : not specified! ( WARNING )
#> -> predict function : predict_function
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package ranger , ver. 0.14.1 , task regression ( default )
#> -> model_info : Model info detected regression task but 'y' is a NULL . ( WARNING )
#> -> model_info : By deafult regressions tasks supports only numercical 'y' parameter.
#> -> model_info : Consider changing to numerical vector.
#> -> model_info : Otherwise I will not be able to calculate residuals or loss function.
#> -> predicted values : numerical, min = 1901.578 , mean = 3515.837 , max = 6072.946
#> -> residual function : difference between y and yhat ( default )
#> A new explainer has been created!
#> Preparation of a new explainer is initiated
#> -> model label : model_new
#> -> data : 9000 rows 6 cols
#> -> target variable : not specified! ( WARNING )
#> -> predict function : predict_function
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package ranger , ver. 0.14.1 , task regression ( default )
#> -> model_info : Model info detected regression task but 'y' is a NULL . ( WARNING )
#> -> model_info : By deafult regressions tasks supports only numercical 'y' parameter.
#> -> model_info : Consider changing to numerical vector.
#> -> model_info : Otherwise I will not be able to calculate residuals or loss function.
#> -> predicted values : numerical, min = 1901.578 , mean = 3515.837 , max = 6072.946
#> -> residual function : difference between y and yhat ( default )
#> A new explainer has been created!
#>               Variable    Shift  Scaled
#> -----------------------------------------------
#>                  floor   161.13    17.9  .
#>               no.rooms    93.58    10.4  .
#>                surface    82.10     9.1
#>               m2.price   114.35    12.7  .
#>      construction.year    90.32    10.0  .
# Classification: return the predicted probability of the first class
predict_function <- function(m,x,...) predict(m, x, ..., probability=TRUE)$predictions[,1]
# Old model is trained on male employees, new model on female employees
data_old = HR[HR$gender == "male", -1]
data_new = HR[HR$gender == "female", -1]
model_old <- ranger(status ~ ., data = data_old, probability=TRUE)
model_new <- ranger(status ~ ., data = data_new, probability=TRUE)
calculate_model_drift(model_old, model_new,
                      HR_test,
                      HR_test$status == "fired",
                      predict_function = predict_function)
#> Preparation of a new explainer is initiated
#> -> model label : model_old
#> -> data : 7897 rows 6 cols
#> -> target variable : not specified! ( WARNING )
#> -> predict function : predict_function
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package ranger , ver. 0.14.1 , task classification ( default )
#> -> model_info : Model info detected classification task but 'y' is a NULL . ( WARNING )
#> -> model_info : By deafult classification tasks supports only numercical 'y' parameter.
#> -> model_info : Consider changing to numerical vector with 0 and 1 values.
#> -> model_info : Otherwise I will not be able to calculate residuals or loss function.
#> -> predicted values : numerical, min = 0 , mean = 0.3813948 , max = 0.9990851
#> -> residual function : difference between y and yhat ( default )
#> A new explainer has been created!
#> Preparation of a new explainer is initiated
#> -> model label : model_new
#> -> data : 7897 rows 6 cols
#> -> target variable : not specified! ( WARNING )
#> -> predict function : predict_function
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package ranger , ver. 0.14.1 , task classification ( default )
#> -> model_info : Model info detected classification task but 'y' is a NULL . ( WARNING )
#> -> model_info : By deafult classification tasks supports only numercical 'y' parameter.
#> -> model_info : Consider changing to numerical vector with 0 and 1 values.
#> -> model_info : Otherwise I will not be able to calculate residuals or loss function.
#> -> predicted values : numerical, min = 0 , mean = 0.3813948 , max = 0.9990851
#> -> residual function : difference between y and yhat ( default )
#> A new explainer has been created!
#>               Variable    Shift  Scaled
#> -----------------------------------------------
#>                 salary     0.03     7.2
#>             evaluation     0.06    13.1  .
#>                    age     0.02     4.8
#>                  hours     0.02     4.6
library("ingredients")
#>
#> Attaching package: 'ingredients'
#> The following object is masked from 'package:DALEX':
#>
#> feature_importance
prof_old <- partial_dependency(model_old,
                               data = data_new[1:500,],
                               label = "model_old",
                               predict_function = predict_function,
                               grid_points = 101,
                               variable_splits = NULL)
prof_new <- partial_dependency(model_new,
                               data = data_new[1:500,],
                               label = "model_new",
                               predict_function = predict_function,
                               grid_points = 101,
                               variable_splits = NULL)
plot(prof_old, prof_new, color = "_label_")
library("DALEX")
model_old <- lm(m2.price ~ ., data = apartments)
model_new <- lm(m2.price ~ ., data = apartments_test[1:1000,])
calculate_model_drift(model_old, model_new,
                      apartments_test[1:1000,],
                      apartments_test[1:1000,]$m2.price)
#> Preparation of a new explainer is initiated
#> -> model label : model_old
#> -> data : 1000 rows 6 cols
#> -> target variable : not specified! ( WARNING )
#> -> predict function : predict_function
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package stats , ver. 4.2.1 , task regression ( default )
#> -> model_info : Model info detected regression task but 'y' is a NULL . ( WARNING )
#> -> model_info : By deafult regressions tasks supports only numercical 'y' parameter.
#> -> model_info : Consider changing to numerical vector.
#> -> model_info : Otherwise I will not be able to calculate residuals or loss function.
#> -> predicted values : numerical, min = 1867.309 , mean = 3488.057 , max = 6241.447
#> -> residual function : difference between y and yhat ( default )
#> A new explainer has been created!
#> Preparation of a new explainer is initiated
#> -> model label : model_new
#> -> data : 1000 rows 6 cols
#> -> target variable : not specified! ( WARNING )
#> -> predict function : predict_function
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package stats , ver. 4.2.1 , task regression ( default )
#> -> model_info : Model info detected regression task but 'y' is a NULL . ( WARNING )
#> -> model_info : By deafult regressions tasks supports only numercical 'y' parameter.
#> -> model_info : Consider changing to numerical vector.
#> -> model_info : Otherwise I will not be able to calculate residuals or loss function.
#> -> predicted values : numerical, min = 1867.309 , mean = 3488.057 , max = 6241.447
#> -> residual function : difference between y and yhat ( default )
#> A new explainer has been created!
#>               Variable    Shift  Scaled
#> -----------------------------------------------
#>                  floor    35.07     3.9
#>               no.rooms     8.11     0.9
#>                surface    36.63     4.1
#>               m2.price     6.22     0.7
#>      construction.year     6.83     0.8
library("ranger")
predict_function <- function(m,x,...) predict(m, x, ...)$predictions
model_old <- ranger(m2.price ~ ., data = apartments)
# Residual drift between two halves of the same test set
calculate_residuals_drift(model_old,
                          apartments_test[1:4000,], apartments_test[4001:8000,],
                          apartments_test$m2.price[1:4000], apartments_test$m2.price[4001:8000],
                          predict_function = predict_function)
#>               Variable  Shift
#> -------------------------------------
#>              Residuals    3.9
# Residual drift between the training data and the test data
calculate_residuals_drift(model_old,
                          apartments, apartments_test,
                          apartments$m2.price, apartments_test$m2.price,
                          predict_function = predict_function)
#>               Variable  Shift
#> -------------------------------------
#>              Residuals   34.2  **
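The residual check can be illustrated with a small base R sketch: fit one model, compute its residuals on an old and a new data set, and compare the two residual distributions with the non-intersection distance. Everything here is an assumption for illustration (the helper `distance`, the quantile binning, and the synthetic data); it is not taken from the drifter source.

```r
# Non-intersection distance between two numeric samples,
# binned on shared pooled quantiles (binning is an assumption).
distance <- function(a, b, bins = 20) {
  breaks <- unique(quantile(c(a, b), probs = seq(0, 1, length.out = bins + 1)))
  p <- as.numeric(table(cut(a, breaks, include.lowest = TRUE))) / length(a)
  q <- as.numeric(table(cut(b, breaks, include.lowest = TRUE))) / length(b)
  1 - sum(pmin(p, q))
}

set.seed(7)
n <- 1000
train <- data.frame(x = runif(n))
train$y <- 3 * train$x + rnorm(n, sd = 0.2)
model <- lm(y ~ x, data = train)

# New data from the same process, and new data whose target level has shifted
stable <- data.frame(x = runif(n))
stable$y <- 3 * stable$x + rnorm(n, sd = 0.2)
drifted <- data.frame(x = runif(n))
drifted$y <- 3 * drifted$x + 1 + rnorm(n, sd = 0.2)

# Residuals of the single fitted model on a given data set
res <- function(d) d$y - predict(model, newdata = d)

drift_stable  <- distance(res(train), res(stable))
drift_drifted <- distance(res(train), res(drifted))
```

Here the covariate `x` has the same distribution in all three data sets, so a covariate check alone would not flag anything; only the residuals reveal that the model no longer fits the drifted data.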
library("DALEX")
model_old <- lm(m2.price ~ ., data = apartments)
model_new <- lm(m2.price ~ ., data = apartments_test[1:1000,])
check_drift(model_old, model_new,
            apartments, apartments_test,
            apartments$m2.price, apartments_test$m2.price)
#> Preparation of a new explainer is initiated
#> -> model label : model_old
#> -> data : 9000 rows 6 cols
#> -> target variable : not specified! ( WARNING )
#> -> predict function : predict_function
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package stats , ver. 4.2.1 , task regression ( default )
#> -> model_info : Model info detected regression task but 'y' is a NULL . ( WARNING )
#> -> model_info : By deafult regressions tasks supports only numercical 'y' parameter.
#> -> model_info : Consider changing to numerical vector.
#> -> model_info : Otherwise I will not be able to calculate residuals or loss function.
#> -> predicted values : numerical, min = 1792.597 , mean = 3506.836 , max = 6241.447
#> -> residual function : difference between y and yhat ( default )
#> A new explainer has been created!
#> Preparation of a new explainer is initiated
#> -> model label : model_new
#> -> data : 9000 rows 6 cols
#> -> target variable : not specified! ( WARNING )
#> -> predict function : predict_function
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package stats , ver. 4.2.1 , task regression ( default )
#> -> model_info : Model info detected regression task but 'y' is a NULL . ( WARNING )
#> -> model_info : By deafult regressions tasks supports only numercical 'y' parameter.
#> -> model_info : Consider changing to numerical vector.
#> -> model_info : Otherwise I will not be able to calculate residuals or loss function.
#> -> predicted values : numerical, min = 1792.597 , mean = 3506.836 , max = 6241.447
#> -> residual function : difference between y and yhat ( default )
#> A new explainer has been created!
#> -------------------------------------
#> NULL
#> -------------------------------------
#> NULL
#> -----------------------------------------------
#> NULL
library("ranger")
predict_function <- function(m,x,...) predict(m, x, ...)$predictions
model_old <- ranger(m2.price ~ ., data = apartments)
model_new <- ranger(m2.price ~ ., data = apartments_test)
check_drift(model_old, model_new,
            apartments, apartments_test,
            apartments$m2.price, apartments_test$m2.price,
            predict_function = predict_function)
#> Preparation of a new explainer is initiated
#> -> model label : model_old
#> -> data : 9000 rows 6 cols
#> -> target variable : not specified! ( WARNING )
#> -> predict function : predict_function
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package ranger , ver. 0.14.1 , task regression ( default )
#> -> model_info : Model info detected regression task but 'y' is a NULL . ( WARNING )
#> -> model_info : By deafult regressions tasks supports only numercical 'y' parameter.
#> -> model_info : Consider changing to numerical vector.
#> -> model_info : Otherwise I will not be able to calculate residuals or loss function.
#> -> predicted values : numerical, min = 1909.66 , mean = 3514.777 , max = 6105.554
#> -> residual function : difference between y and yhat ( default )
#> A new explainer has been created!
#> Preparation of a new explainer is initiated
#> -> model label : model_new
#> -> data : 9000 rows 6 cols
#> -> target variable : not specified! ( WARNING )
#> -> predict function : predict_function
#> -> predicted values : No value for predict function target column. ( default )
#> -> model_info : package ranger , ver. 0.14.1 , task regression ( default )
#> -> model_info : Model info detected regression task but 'y' is a NULL . ( WARNING )
#> -> model_info : By deafult regressions tasks supports only numercical 'y' parameter.
#> -> model_info : Consider changing to numerical vector.
#> -> model_info : Otherwise I will not be able to calculate residuals or loss function.
#> -> predicted values : numerical, min = 1909.66 , mean = 3514.777 , max = 6105.554
#> -> residual function : difference between y and yhat ( default )
#> A new explainer has been created!
#> -------------------------------------
#> NULL
#> -------------------------------------
#> NULL
#> -----------------------------------------------
#> NULL