drifter: Concept Drift in R

MiLin

2022-11-09

Introduction

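Concept drift is a problem that arises in machine learning: while a model is in use, the data distribution or the relationships between variables change over time, so the model's performance degrades or the model stops working altogether. It is therefore important to detect concept drift promptly.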

drifter is an R package for detecting concept drift.

1. Computing covariate drift between two data frames

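Here, covariate drift is defined as the non-intersection distance between two distributions; the larger the distance, the more the two distributions differ. The formula is: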

\[ d(P,Q) = 1 - \sum_i \min(P_i, Q_i) \]
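
The formula can be tried out by hand on simulated data. The following is a minimal sketch in base R, not drifter's own implementation: the helper name `non_intersection_distance`, the equal-width binning, and the default of 20 bins are all assumptions made for illustration. P_i and Q_i are taken to be the shares of observations from each sample that fall into bin i.

```r
# Non-intersection distance d(P, Q) = 1 - sum_i min(P_i, Q_i) for two numeric samples.
# The equal-width binning and `bins = 20` are assumptions of this sketch.
non_intersection_distance <- function(x_old, x_new, bins = 20) {
  # Shared break points so that both histograms use the same cells
  rng <- range(c(x_old, x_new))
  breaks <- seq(rng[1], rng[2], length.out = bins + 1)
  # Relative frequencies P_i and Q_i per cell
  p <- as.numeric(table(cut(x_old, breaks, include.lowest = TRUE))) / length(x_old)
  q <- as.numeric(table(cut(x_new, breaks, include.lowest = TRUE))) / length(x_new)
  1 - sum(pmin(p, q))
}

set.seed(1)
x_old <- rnorm(1000)
same  <- rnorm(1000)            # drawn from the same distribution as x_old
moved <- rnorm(1000, mean = 3)  # drawn from a shifted distribution

d_same  <- non_intersection_distance(x_old, same)
d_moved <- non_intersection_distance(x_old, moved)
print(c(same = d_same, moved = d_moved))
```

The distance is 0 when the two binned distributions coincide and approaches 1 when they barely overlap, so the shifted sample should score far higher than the sample drawn from the same distribution.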

library(drifter)
library("DALEX")
#> Welcome to DALEX (version: 2.4.2).
#> Find examples and detailed introduction at: http://ema.drwhy.ai/

head(apartments,3)
#>   m2.price construction.year surface floor no.rooms    district
#> 1     5897              1953      25     3        1 Srodmiescie
#> 2     1818              1992     143     9        5     Bielany
#> 3     3643              1937      56     1        2       Praga
head(apartments_test,3)
#>      m2.price construction.year surface floor no.rooms    district
#> 1001     4644              1976     131     3        5 Srodmiescie
#> 1002     3082              1978     112     9        4     Mokotow
#> 1003     2498              1958     100     7        4     Bielany

# No drift in the data
d <- calculate_covariate_drift(apartments, apartments_test)
d
#>                   Variable  Shift
#>   -------------------------------------
#>                   m2.price    4.9  
#>          construction.year    6.0  
#>                    surface    6.8  
#>                      floor    4.9  
#>                   no.rooms    2.8  
#>                   district    2.8
# Drift in the data
d <- calculate_covariate_drift(dragons, dragons_test)
d
#>                   Variable  Shift
#>   -------------------------------------
#>              year_of_birth    8.9  
#>                     height   15.3  .
#>                     weight   14.7  .
#>                      scars    4.6  
#>                     colour   18.0  .
#>          year_of_discovery   97.4  ***
#>       number_of_lost_teeth    6.3  
#>                life_length    8.6

2. Computing model drift

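Model drift is measured as the difference between the partial dependence profile (PDP) curves of the old and the new model.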
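
To make the idea concrete, the sketch below computes partial dependence curves by hand for two linear models and measures how far apart they are. It uses simulated data and base R only. The helpers `pdp_curve` and `pdp_distance`, the grid of 21 points, and the root-mean-square distance between the curves are assumptions chosen for illustration and should not be read as drifter's internal method.

```r
# Partial dependence of `model` on `variable`: for each grid value, set the
# variable to that value in every row and average the predictions.
pdp_curve <- function(model, data, variable, grid) {
  sapply(grid, function(v) {
    data_v <- data
    data_v[[variable]] <- v
    mean(predict(model, newdata = data_v))
  })
}

set.seed(42)
n <- 500
old <- data.frame(x1 = runif(n), x2 = runif(n))
old$y <- 2 * old$x1 + old$x2 + rnorm(n, sd = 0.1)
new <- data.frame(x1 = runif(n), x2 = runif(n))
# In the new data the effect of x1 is reversed; the effect of x2 is unchanged
new$y <- 2 * (1 - new$x1) + new$x2 + rnorm(n, sd = 0.1)

model_old <- lm(y ~ x1 + x2, data = old)
model_new <- lm(y ~ x1 + x2, data = new)

grid <- seq(0, 1, length.out = 21)
# Root-mean-square distance between the two PDP curves of one variable
pdp_distance <- function(variable) {
  diff <- pdp_curve(model_old, new, variable, grid) -
    pdp_curve(model_new, new, variable, grid)
  sqrt(mean(diff^2))
}

dist_x1 <- pdp_distance("x1")
dist_x2 <- pdp_distance("x2")
print(c(x1 = dist_x1, x2 = dist_x2))
```

Because only the effect of `x1` changes between the two simulated data sets, the distance for `x1` should be much larger than the one for `x2`.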

2.1 Regression models

library("DALEX")
 model_old <- lm(m2.price ~ ., data = apartments)
 model_new <- lm(m2.price ~ ., data = apartments_test[1:1000,])
 calculate_model_drift(model_old, model_new,
                  apartments_test[1:1000,],
                  apartments_test[1:1000,]$m2.price)
#> Preparation of a new explainer is initiated
#>   -> model label       :  model_old 
#>   -> data              :  1000  rows  6  cols 
#>   -> target variable   :  not specified! (  WARNING  )
#>   -> predict function  :  predict_function 
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package stats , ver. 4.2.1 , task regression (  default  ) 
#>   -> model_info        :  Model info detected regression task but 'y' is a NULL .  (  WARNING  )
#>   -> model_info        :  By deafult regressions tasks supports only numercical 'y' parameter. 
#>   -> model_info        :  Consider changing to numerical vector.
#>   -> model_info        :  Otherwise I will not be able to calculate residuals or loss function.
#>   -> predicted values  :  numerical, min =  1867.309 , mean =  3488.057 , max =  6241.447  
#>   -> residual function :  difference between y and yhat (  default  )
#>   A new explainer has been created!  
#> Preparation of a new explainer is initiated
#>   -> model label       :  model_new 
#>   -> data              :  1000  rows  6  cols 
#>   -> target variable   :  not specified! (  WARNING  )
#>   -> predict function  :  predict_function 
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package stats , ver. 4.2.1 , task regression (  default  ) 
#>   -> model_info        :  Model info detected regression task but 'y' is a NULL .  (  WARNING  )
#>   -> model_info        :  By deafult regressions tasks supports only numercical 'y' parameter. 
#>   -> model_info        :  Consider changing to numerical vector.
#>   -> model_info        :  Otherwise I will not be able to calculate residuals or loss function.
#>   -> predicted values  :  numerical, min =  1867.309 , mean =  3488.057 , max =  6241.447  
#>   -> residual function :  difference between y and yhat (  default  )
#>   A new explainer has been created!
#>                   Variable    Shift  Scaled
#>   -----------------------------------------------
#>                      floor    40.35     4.5  
#>                   no.rooms    53.54     6.0  
#>                    surface    63.60     7.1  
#>                   m2.price    53.16     6.0  
#>          construction.year    53.85     6.0

Using ranger to fit a random forest for regression

library("ranger")
 predict_function <- function(m,x,...) predict(m, x, ...)$predictions
 model_old <- ranger(m2.price ~ ., data = apartments)
 model_new <- ranger(m2.price ~ ., data = apartments_test)
 calculate_model_drift(model_old, model_new,
                  apartments_test,
                  apartments_test$m2.price,
                  predict_function = predict_function)
#> Preparation of a new explainer is initiated
#>   -> model label       :  model_old 
#>   -> data              :  9000  rows  6  cols 
#>   -> target variable   :  not specified! (  WARNING  )
#>   -> predict function  :  predict_function 
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package ranger , ver. 0.14.1 , task regression (  default  ) 
#>   -> model_info        :  Model info detected regression task but 'y' is a NULL .  (  WARNING  )
#>   -> model_info        :  By deafult regressions tasks supports only numercical 'y' parameter. 
#>   -> model_info        :  Consider changing to numerical vector.
#>   -> model_info        :  Otherwise I will not be able to calculate residuals or loss function.
#>   -> predicted values  :  numerical, min =  1901.578 , mean =  3515.837 , max =  6072.946  
#>   -> residual function :  difference between y and yhat (  default  )
#>   A new explainer has been created!  
#> Preparation of a new explainer is initiated
#>   -> model label       :  model_new 
#>   -> data              :  9000  rows  6  cols 
#>   -> target variable   :  not specified! (  WARNING  )
#>   -> predict function  :  predict_function 
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package ranger , ver. 0.14.1 , task regression (  default  ) 
#>   -> model_info        :  Model info detected regression task but 'y' is a NULL .  (  WARNING  )
#>   -> model_info        :  By deafult regressions tasks supports only numercical 'y' parameter. 
#>   -> model_info        :  Consider changing to numerical vector.
#>   -> model_info        :  Otherwise I will not be able to calculate residuals or loss function.
#>   -> predicted values  :  numerical, min =  1901.578 , mean =  3515.837 , max =  6072.946  
#>   -> residual function :  difference between y and yhat (  default  )
#>   A new explainer has been created!
#>                   Variable    Shift  Scaled
#>   -----------------------------------------------
#>                      floor   161.13    17.9  .
#>                   no.rooms    93.58    10.4  .
#>                    surface    82.10     9.1  
#>                   m2.price   114.35    12.7  .
#>          construction.year    90.32    10.0  .

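2.2 Classification

# Return the predicted probability of the first class level
predict_function <- function(m,x,...) predict(m, x, ..., probability=TRUE)$predictions[,1]
 # Old and new data: male vs. female employees, with the gender column dropped
 data_old <- HR[HR$gender == "male", -1]
 data_new <- HR[HR$gender == "female", -1]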
 model_old <- ranger(status ~ ., data = data_old, probability=TRUE)
 model_new <- ranger(status ~ ., data = data_new, probability=TRUE)
 calculate_model_drift(model_old, model_new,
                  HR_test,
                  HR_test$status == "fired",
                  predict_function = predict_function)
#> Preparation of a new explainer is initiated
#>   -> model label       :  model_old 
#>   -> data              :  7897  rows  6  cols 
#>   -> target variable   :  not specified! (  WARNING  )
#>   -> predict function  :  predict_function 
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package ranger , ver. 0.14.1 , task classification (  default  ) 
#>   -> model_info        :  Model info detected classification task but 'y' is a NULL .  (  WARNING  )
#>   -> model_info        :  By deafult classification tasks supports only numercical 'y' parameter. 
#>   -> model_info        :  Consider changing to numerical vector with 0 and 1 values.
#>   -> model_info        :  Otherwise I will not be able to calculate residuals or loss function.
#>   -> predicted values  :  numerical, min =  0 , mean =  0.3813948 , max =  0.9990851  
#>   -> residual function :  difference between y and yhat (  default  )
#>   A new explainer has been created!  
#> Preparation of a new explainer is initiated
#>   -> model label       :  model_new 
#>   -> data              :  7897  rows  6  cols 
#>   -> target variable   :  not specified! (  WARNING  )
#>   -> predict function  :  predict_function 
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package ranger , ver. 0.14.1 , task classification (  default  ) 
#>   -> model_info        :  Model info detected classification task but 'y' is a NULL .  (  WARNING  )
#>   -> model_info        :  By deafult classification tasks supports only numercical 'y' parameter. 
#>   -> model_info        :  Consider changing to numerical vector with 0 and 1 values.
#>   -> model_info        :  Otherwise I will not be able to calculate residuals or loss function.
#>   -> predicted values  :  numerical, min =  0 , mean =  0.3813948 , max =  0.9990851  
#>   -> residual function :  difference between y and yhat (  default  )
#>   A new explainer has been created!
#>                   Variable    Shift  Scaled
#>   -----------------------------------------------
#>                     salary     0.03     7.2  
#>                 evaluation     0.06    13.1  .
#>                        age     0.02     4.8  
#>                      hours     0.02     4.6

2.3 Visualization

library("ingredients")
#> 
#> Attaching package: 'ingredients'
#> The following object is masked from 'package:DALEX':
#> 
#>     feature_importance
prof_old <- partial_dependency(model_old,
                                     data = data_new[1:500,],
                                     label = "model_old",
                                     predict_function = predict_function,
                                     grid_points = 101,
                                     variable_splits = NULL)
 prof_new <- partial_dependency(model_new,
                                     data = data_new[1:500,],
                                     label = "model_new",
                                     predict_function = predict_function,
                                     grid_points = 101,
                                     variable_splits = NULL)
 plot(prof_old, prof_new, color = "_label_")

3. Computing residual drift

library("DALEX")
 model_old <- lm(m2.price ~ ., data = apartments)
 model_new <- lm(m2.price ~ ., data = apartments_test[1:1000,])
 calculate_model_drift(model_old, model_new,
                  apartments_test[1:1000,],
                  apartments_test[1:1000,]$m2.price)
#> Preparation of a new explainer is initiated
#>   -> model label       :  model_old 
#>   -> data              :  1000  rows  6  cols 
#>   -> target variable   :  not specified! (  WARNING  )
#>   -> predict function  :  predict_function 
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package stats , ver. 4.2.1 , task regression (  default  ) 
#>   -> model_info        :  Model info detected regression task but 'y' is a NULL .  (  WARNING  )
#>   -> model_info        :  By deafult regressions tasks supports only numercical 'y' parameter. 
#>   -> model_info        :  Consider changing to numerical vector.
#>   -> model_info        :  Otherwise I will not be able to calculate residuals or loss function.
#>   -> predicted values  :  numerical, min =  1867.309 , mean =  3488.057 , max =  6241.447  
#>   -> residual function :  difference between y and yhat (  default  )
#>   A new explainer has been created!  
#> Preparation of a new explainer is initiated
#>   -> model label       :  model_new 
#>   -> data              :  1000  rows  6  cols 
#>   -> target variable   :  not specified! (  WARNING  )
#>   -> predict function  :  predict_function 
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package stats , ver. 4.2.1 , task regression (  default  ) 
#>   -> model_info        :  Model info detected regression task but 'y' is a NULL .  (  WARNING  )
#>   -> model_info        :  By deafult regressions tasks supports only numercical 'y' parameter. 
#>   -> model_info        :  Consider changing to numerical vector.
#>   -> model_info        :  Otherwise I will not be able to calculate residuals or loss function.
#>   -> predicted values  :  numerical, min =  1867.309 , mean =  3488.057 , max =  6241.447  
#>   -> residual function :  difference between y and yhat (  default  )
#>   A new explainer has been created!
#>                   Variable    Shift  Scaled
#>   -----------------------------------------------
#>                      floor    35.07     3.9  
#>                   no.rooms     8.11     0.9  
#>                    surface    36.63     4.1  
#>                   m2.price     6.22     0.7  
#>          construction.year     6.83     0.8
 
 library("ranger")
 predict_function <- function(m,x,...) predict(m, x, ...)$predictions
 model_old <- ranger(m2.price ~ ., data = apartments)
 calculate_residuals_drift(model_old,
                       apartments_test[1:4000,], apartments_test[4001:8000,],
                       apartments_test$m2.price[1:4000], apartments_test$m2.price[4001:8000],
                       predict_function = predict_function)
#>                   Variable  Shift
#>   -------------------------------------
#>                  Residuals    3.9
 calculate_residuals_drift(model_old,
                       apartments, apartments_test,
                       apartments$m2.price, apartments_test$m2.price,
                       predict_function = predict_function)
#>                   Variable  Shift
#>   -------------------------------------
#>                  Residuals   34.2  **
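
The same kind of comparison can be reproduced by hand to see what a residual shift looks like. The following is a minimal sketch on simulated data using base R only; it is not drifter's implementation. The helpers `non_intersection_distance`, `simulate`, and `residuals_on`, as well as the 20 equal-width bins, are assumptions made for illustration.

```r
# Non-intersection distance on a shared equal-width grid (the binning is an assumption)
non_intersection_distance <- function(a, b, bins = 20) {
  breaks <- seq(min(a, b), max(a, b), length.out = bins + 1)
  p <- as.numeric(table(cut(a, breaks, include.lowest = TRUE))) / length(a)
  q <- as.numeric(table(cut(b, breaks, include.lowest = TRUE))) / length(b)
  1 - sum(pmin(p, q))
}

set.seed(7)
simulate <- function(n, slope) {
  x <- runif(n)
  data.frame(x = x, y = slope * x + rnorm(n, sd = 0.2))
}
train   <- simulate(2000, slope = 1)
stable  <- simulate(2000, slope = 1)  # same relationship as the training data
drifted <- simulate(2000, slope = 3)  # relationship between x and y has changed

# One fixed model, evaluated on old and new data
model <- lm(y ~ x, data = train)
residuals_on <- function(data) data$y - predict(model, newdata = data)

drift_stable  <- non_intersection_distance(residuals_on(train), residuals_on(stable))
drift_drifted <- non_intersection_distance(residuals_on(train), residuals_on(drifted))
print(c(stable = drift_stable, drifted = drift_drifted))
```

When the relationship between `x` and `y` is unchanged, the residual distributions on both data sets nearly coincide and the distance stays small; when the relationship changes, the old model's residuals on the new data move away from zero and the distance grows.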

4. Computing all three kinds of drift at once

library("DALEX")
 model_old <- lm(m2.price ~ ., data = apartments)
 model_new <- lm(m2.price ~ ., data = apartments_test[1:1000,])
 check_drift(model_old, model_new,
                  apartments, apartments_test,
                  apartments$m2.price, apartments_test$m2.price)
#> Preparation of a new explainer is initiated
#>   -> model label       :  model_old 
#>   -> data              :  9000  rows  6  cols 
#>   -> target variable   :  not specified! (  WARNING  )
#>   -> predict function  :  predict_function 
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package stats , ver. 4.2.1 , task regression (  default  ) 
#>   -> model_info        :  Model info detected regression task but 'y' is a NULL .  (  WARNING  )
#>   -> model_info        :  By deafult regressions tasks supports only numercical 'y' parameter. 
#>   -> model_info        :  Consider changing to numerical vector.
#>   -> model_info        :  Otherwise I will not be able to calculate residuals or loss function.
#>   -> predicted values  :  numerical, min =  1792.597 , mean =  3506.836 , max =  6241.447  
#>   -> residual function :  difference between y and yhat (  default  )
#>   A new explainer has been created!  
#> Preparation of a new explainer is initiated
#>   -> model label       :  model_new 
#>   -> data              :  9000  rows  6  cols 
#>   -> target variable   :  not specified! (  WARNING  )
#>   -> predict function  :  predict_function 
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package stats , ver. 4.2.1 , task regression (  default  ) 
#>   -> model_info        :  Model info detected regression task but 'y' is a NULL .  (  WARNING  )
#>   -> model_info        :  By deafult regressions tasks supports only numercical 'y' parameter. 
#>   -> model_info        :  Consider changing to numerical vector.
#>   -> model_info        :  Otherwise I will not be able to calculate residuals or loss function.
#>   -> predicted values  :  numerical, min =  1792.597 , mean =  3506.836 , max =  6241.447  
#>   -> residual function :  difference between y and yhat (  default  )
#>   A new explainer has been created!
#>    -------------------------------------
#> NULL
#>    -------------------------------------
#> NULL
#>    -----------------------------------------------
#> NULL
 
 library("ranger")
 predict_function <- function(m,x,...) predict(m, x, ...)$predictions
 model_old <- ranger(m2.price ~ ., data = apartments)
 model_new <- ranger(m2.price ~ ., data = apartments_test)
 check_drift(model_old, model_new,
                  apartments, apartments_test,
                  apartments$m2.price, apartments_test$m2.price,
                  predict_function = predict_function)
#> Preparation of a new explainer is initiated
#>   -> model label       :  model_old 
#>   -> data              :  9000  rows  6  cols 
#>   -> target variable   :  not specified! (  WARNING  )
#>   -> predict function  :  predict_function 
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package ranger , ver. 0.14.1 , task regression (  default  ) 
#>   -> model_info        :  Model info detected regression task but 'y' is a NULL .  (  WARNING  )
#>   -> model_info        :  By deafult regressions tasks supports only numercical 'y' parameter. 
#>   -> model_info        :  Consider changing to numerical vector.
#>   -> model_info        :  Otherwise I will not be able to calculate residuals or loss function.
#>   -> predicted values  :  numerical, min =  1909.66 , mean =  3514.777 , max =  6105.554  
#>   -> residual function :  difference between y and yhat (  default  )
#>   A new explainer has been created!  
#> Preparation of a new explainer is initiated
#>   -> model label       :  model_new 
#>   -> data              :  9000  rows  6  cols 
#>   -> target variable   :  not specified! (  WARNING  )
#>   -> predict function  :  predict_function 
#>   -> predicted values  :  No value for predict function target column. (  default  )
#>   -> model_info        :  package ranger , ver. 0.14.1 , task regression (  default  ) 
#>   -> model_info        :  Model info detected regression task but 'y' is a NULL .  (  WARNING  )
#>   -> model_info        :  By deafult regressions tasks supports only numercical 'y' parameter. 
#>   -> model_info        :  Consider changing to numerical vector.
#>   -> model_info        :  Otherwise I will not be able to calculate residuals or loss function.
#>   -> predicted values  :  numerical, min =  1909.66 , mean =  3514.777 , max =  6105.554  
#>   -> residual function :  difference between y and yhat (  default  )
#>   A new explainer has been created!
#>    -------------------------------------
#> NULL
#>    -------------------------------------
#> NULL
#>    -----------------------------------------------
#> NULL