Explore Data

Our modeling goal is to predict whether a policy is Lapsed or Inforce, based on the policy information, customer demographics, and interactions with policy events.

Let’s take a closer look at the policy data:

Summary of the policy data:
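The summary below looks like output from the skimr package; a minimal sketch, assuming skimr is installed and the data frame is named policy:

skimr::skim(policy) # column-type breakdown plus per-variable summary statistics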

Data summary

Name                     policy
Number of rows           1341
Number of columns        19
_______________________
Column type frequency:
  factor                 10
  numeric                9
________________________
Group variables          None

Variable type: factor

skim_variable        n_missing  complete_rate  ordered  n_unique  top_counts                               pct
Lapsed                       0           1.00    FALSE         2  Inf: 907, Lap: 434                       Inf: 0.68, Lap: 0.32
PO_Sex                       0           1.00    FALSE         2  mal: 709, fem: 632                       mal: 0.53, fem: 0.47
PO_Married                   0           1.00    FALSE         2  Mar: 802, Sin: 539                       Mar: 0.60, Sin: 0.40
Occupation                   0           1.00    FALSE         4  Grp: 486, Grp: 425, Grp: 263, Grp: 167   Grp: 0.36, Grp: 0.32, Grp: 0.20, Grp: 0.12
Phone_registered            12           0.99    FALSE         2  Yes: 935, No: 394                        Yes: 0.70, No: 0.30
PO_is_INS                    0           1.00    FALSE         2  No: 1158, Yes: 183                       No: 0.86, Yes: 0.14
INS_Sex                      0           1.00    FALSE         2  mal: 693, fem: 648                       mal: 0.52, fem: 0.48
CoveragePeriod               0           1.00    FALSE         3  5-1: 617, >10: 412, 1-5: 312             5-1: 0.46, >10: 0.31, 1-5: 0.23
PaymentTerm                  0           1.00    FALSE         4  Qua: 442, Ann: 421, Sem: 258, Mon: 220   Qua: 0.33, Ann: 0.31, Sem: 0.19, Mon: 0.16
DistributionChannel          0           1.00    FALSE         5  Com: 550, Ban: 401, Cor: 199, Gen: 126   Com: 0.41, Ban: 0.30, Cor: 0.15, Gen: 0.09, Oth: 0.05

Variable type: numeric

skim_variable    n_missing  complete_rate     mean       sd   p0   p25   p50   p75   p100  hist
ID                       0              1   671.00   387.26    1   336   671  1006   1341  ▇▇▇▇▇
NumOfReinstated          0              1     0.68     1.08    0     0     0     1      5  ▇▁▁▁▁
NumOfClaims              2              1     0.56     1.06    0     0     0     1      5  ▇▁▁▁▁
NumOfEmails              1              1     1.29     1.04    0     1     1     2      5  ▇▁▁▁▁
NumOfCalls               4              1     1.13     1.13    0     0     1     2      5  ▇▁▁▁▁
PO_Age                   0              1    43.31     8.86   22    36    43    51     59  ▂▇▇▇▇
INS_Age                  0              1    39.89    13.58   18    28    40    52     64  ▇▆▇▇▆
Premium                  0              1  2825.38  2489.13  224  1025  2021  3761  12754  ▇▂▁▁▁
AgentYearSVR             0              1     2.11     1.06    1     1     2     3      6  ▇▂▁▁▁

The EDA steps were done in a separate document. Check here for the EDA.


Let’s move straight to the prediction steps.

Feature engineering

Split data and resample

The first step of our analysis is to split the data into two separate sets: a “training” set and a “testing” set. We then resample the training data using vfold_cv().

set.seed(1234)
policy_split <- policy %>%
  na.omit() %>%
  initial_split(prop = 0.80, strata = Lapsed)
policy_train <- training(policy_split)
policy_test <- testing(policy_split)

set.seed(234)
# Cross-validation folds (default v = 10), with 3 repeats.
folds <- vfold_cv(policy_train, strata = Lapsed, 
                  repeats = 3)

Pre-processing

Before feeding our data to the models, we need to pre-process it using recipe():

# Create recipe
pol_rec <-
  recipe(Lapsed ~ ., data = policy_train) %>%
  update_role(ID, new_role = "id pol") %>% # keep ID as an identifier rather than a predictor
  step_corr(all_numeric()) %>% # filter out highly correlated numeric variables
  step_dummy(all_nominal(), -all_outcomes()) %>% # convert nominal variables to dummies, leaving the outcome alone
  # missing-value handling by median imputation, if needed:
  # step_impute_median(PO_Age) %>%
  step_zv(all_numeric()) %>% # filter out zero-variance variables
  step_normalize(all_numeric_predictors()) # center and scale numeric predictors (mean 0, SD 1)
# pol_rec

# The recipe is only a specification at this point; prep() estimates the required statistics from the training data:
pol_prep <- pol_rec %>%  
  prep()
pol_prep
## Recipe
## 
## Inputs:
## 
##       role #variables
##     id pol          1
##    outcome          1
##  predictor         17
## 
## Training data contained 1059 data points and no missing data.
## 
## Operations:
## 
## Correlation filter on <none> [trained]
## Dummy variables from PO_Sex, PO_Married, Occupation, Phone_registered, PO_is_I... [trained]
## Zero variance filter removed <none> [trained]
## Centering and scaling for NumOfReinstated, NumOfClaims, NumOfEmails, NumO... [trained]

Build Models

Model specification

We will build models based on the following algorithms: logistic regression, k-nearest neighbors, decision tree, and random forest. Let’s create the model specifications:

set.seed(123)
# Logistic regression
glm_spec <- logistic_reg() %>%
  set_engine("glm") # declare the computation engine used to fit the model

# k-nearest neighbors (knn)
knn_spec <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")

# Decision tree
tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

# Random forest
rf_spec <- rand_forest(trees = 1000) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

Fit model specs to the dataset

Fitting the GLM model specification to the pre-processed training data:

# glm
glm_fit <- glm_spec %>%
  fit(Lapsed ~., data = juice(pol_prep))
glm_fit
## parsnip model object
## 
## 
## Call:  stats::glm(formula = Lapsed ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
##                        (Intercept)                                  ID  
##                           0.929216                            0.000107  
##                    NumOfReinstated                         NumOfClaims  
##                          -0.719950                           -0.518235  
##                        NumOfEmails                          NumOfCalls  
##                           0.357085                            0.583129  
##                             PO_Age                             INS_Age  
##                           0.174404                            0.437994  
##                            Premium                        AgentYearSVR  
##                          -1.419263                            0.217049  
##                        PO_Sex_male                   PO_Married_Single  
##                          -0.021519                            0.078064  
##                   Occupation_Grp_2                    Occupation_Grp_3  
##                          -0.139905                            0.242438  
##                   Occupation_Grp_4                Phone_registered_Yes  
##                           0.381843                           -0.064816  
##                      PO_is_INS_Yes                        INS_Sex_male  
##                          -0.198616                           -0.167991  
##            CoveragePeriod_X1.5.yrs            CoveragePeriod_X5.10.yrs  
##                           0.333567                            0.290028  
##                PaymentTerm_Monthly                PaymentTerm_Quartely  
##                          -0.220612                           -0.140353  
##            PaymentTerm_Semi.annual   DistributionChannel_Company.Agent  
##                          -0.067266                           -0.010849  
##           DistributionChannel_Corp  DistributionChannel_General.Agency  
##                          -0.046781                           -0.064555  
##      DistributionChannel_Others.PD  
##                          -0.263041  
## 
## Degrees of Freedom: 1058 Total (i.e. Null);  1032 Residual
## Null Deviance:       1335 
## Residual Deviance: 853.7     AIC: 907.7

Fitting the remaining model specifications (knn, tree, and rf):

knn_fit <- knn_spec %>%
  fit(Lapsed ~., data = juice(pol_prep))

tree_fit <- tree_spec %>%
  fit(Lapsed ~., data = juice(pol_prep))

rf_fit <- rf_spec %>%
  fit(Lapsed ~., data = juice(pol_prep))

Create Workflow

# Create the main workflow and attach the recipe
pol_wf <- workflow() %>%
  add_recipe(pol_rec)

Modeling with resampled data

We now fit the models to the resamples. First, fit the GLM model and display the 10 folds with 3 repeats:

doParallel::registerDoParallel() # enable parallel processing across cores
set.seed(456)
glm_rs <- pol_wf %>%
  add_model(glm_spec) %>% # add the glm spec to the workflow
  fit_resamples(
    resamples = folds, # fit across the 10-fold CV resamples (3 repeats)
    metrics = metric_set(accuracy, roc_auc, sens, spec),
    control = control_resamples(save_pred = TRUE))
glm_rs
## # Resampling results
## # 10-fold cross-validation repeated 3 times using stratification 
## # A tibble: 30 × 6
##    splits            id      id2    .metrics         .notes           .predict…¹
##    <list>            <chr>   <chr>  <list>           <list>           <list>    
##  1 <split [952/107]> Repeat1 Fold01 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>  
##  2 <split [952/107]> Repeat1 Fold02 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>  
##  3 <split [952/107]> Repeat1 Fold03 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>  
##  4 <split [952/107]> Repeat1 Fold04 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>  
##  5 <split [953/106]> Repeat1 Fold05 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>  
##  6 <split [954/105]> Repeat1 Fold06 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>  
##  7 <split [954/105]> Repeat1 Fold07 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>  
##  8 <split [954/105]> Repeat1 Fold08 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>  
##  9 <split [954/105]> Repeat1 Fold09 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>  
## 10 <split [954/105]> Repeat1 Fold10 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>  
## # … with 20 more rows, and abbreviated variable name ¹​.predictions

The performance of the model can be viewed via .metrics and .predictions:

# display the metrics (accuracy, roc_auc, sens, spec) for each fold
glm_rs %>%
  unnest(.metrics) 
## # A tibble: 120 × 9
##    splits            id      id2    .metric  .estimator .esti…¹ .config .notes  
##    <list>            <chr>   <chr>  <chr>    <chr>        <dbl> <chr>   <list>  
##  1 <split [952/107]> Repeat1 Fold01 accuracy binary       0.822 Prepro… <tibble>
##  2 <split [952/107]> Repeat1 Fold01 sens     binary       0.571 Prepro… <tibble>
##  3 <split [952/107]> Repeat1 Fold01 spec     binary       0.944 Prepro… <tibble>
##  4 <split [952/107]> Repeat1 Fold01 roc_auc  binary       0.827 Prepro… <tibble>
##  5 <split [952/107]> Repeat1 Fold02 accuracy binary       0.850 Prepro… <tibble>
##  6 <split [952/107]> Repeat1 Fold02 sens     binary       0.686 Prepro… <tibble>
##  7 <split [952/107]> Repeat1 Fold02 spec     binary       0.931 Prepro… <tibble>
##  8 <split [952/107]> Repeat1 Fold02 roc_auc  binary       0.919 Prepro… <tibble>
##  9 <split [952/107]> Repeat1 Fold03 accuracy binary       0.813 Prepro… <tibble>
## 10 <split [952/107]> Repeat1 Fold03 sens     binary       0.629 Prepro… <tibble>
## # … with 110 more rows, 1 more variable: .predictions <list>, and abbreviated
## #   variable name ¹​.estimate
glm_rs %>%
  unnest(.predictions)
## # A tibble: 3,177 × 11
##    splits            id    id2   .metrics .notes   .pred…¹  .row .pred…² .pred…³
##    <list>            <chr> <chr> <list>   <list>   <fct>   <int>   <dbl>   <dbl>
##  1 <split [952/107]> Repe… Fold… <tibble> <tibble> Inforce     1 0.475     0.525
##  2 <split [952/107]> Repe… Fold… <tibble> <tibble> Inforce    22 0.0988    0.901
##  3 <split [952/107]> Repe… Fold… <tibble> <tibble> Inforce    36 0.316     0.684
##  4 <split [952/107]> Repe… Fold… <tibble> <tibble> Inforce    37 0.265     0.735
##  5 <split [952/107]> Repe… Fold… <tibble> <tibble> Lapsed     42 0.773     0.227
##  6 <split [952/107]> Repe… Fold… <tibble> <tibble> Inforce    52 0.224     0.776
##  7 <split [952/107]> Repe… Fold… <tibble> <tibble> Inforce    56 0.0781    0.922
##  8 <split [952/107]> Repe… Fold… <tibble> <tibble> Inforce    62 0.00663   0.993
##  9 <split [952/107]> Repe… Fold… <tibble> <tibble> Inforce    64 0.00436   0.996
## 10 <split [952/107]> Repe… Fold… <tibble> <tibble> Inforce    66 0.194     0.806
## # … with 3,167 more rows, 2 more variables: Lapsed <fct>, .config <chr>, and
## #   abbreviated variable names ¹​.pred_class, ²​.pred_Lapsed, ³​.pred_Inforce

Continue fitting the other models to the resamples:

set.seed(456)
tree_rs <- pol_wf %>%
  add_model(tree_spec) %>%
  fit_resamples(
    resamples = folds,
    metrics = metric_set(accuracy,roc_auc,sens, spec),
    control = control_resamples(save_pred = TRUE))

set.seed(456)
knn_rs <- pol_wf %>%
  add_model(knn_spec) %>% 
  fit_resamples(
    resamples = folds,
    metrics = metric_set(accuracy,roc_auc,sens, spec),
    control = control_resamples(save_pred = TRUE))

set.seed(456)
rf_rs <- pol_wf %>%
  add_model(rf_spec) %>%
  fit_resamples(
    resamples = folds,
    metrics = metric_set(accuracy,roc_auc,sens, spec),
    control = control_resamples(save_pred = TRUE))

Evaluate models

Evaluate metrics from models

Four models have been built. We will evaluate model performance by comparing their metrics:

options(digits = 3)
glm_rs %>% collect_metrics() %>%
  select(.metric,mean) %>%
  rename("glm" = "mean") %>%
  bind_cols(collect_metrics(tree_rs) %>%
            select(mean) %>%
              rename("tree" = "mean")
  ) %>%
  bind_cols(collect_metrics(rf_rs) %>%
            select(mean) %>%
              rename("rf" = "mean")
  ) %>%
  bind_cols(collect_metrics(knn_rs) %>%
            select(mean) %>%
              rename("knn" = "mean")
  ) %>%
  knitr::kable(caption = "Metric evaluation")
Metric evaluation

.metric    glm    tree   rf     knn
accuracy   0.813  0.823  0.825  0.696
roc_auc    0.852  0.822  0.867  0.737
sens       0.611  0.657  0.612  0.512
spec       0.910  0.903  0.927  0.784

It looks like:

  • Random forest did best on roc_auc and overall accuracy. It also has the highest specificity.

  • The decision tree also did well on overall accuracy and has the highest sensitivity.

ROC curve

Plot the ROC curve for each model:
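The *_roc plot objects are not created in the code shown above; a minimal sketch of how they might be built from the resample predictions (assuming yardstick and ggplot2 are loaded, and that Lapsed is the first factor level, as the sens results suggest):

glm_roc <- glm_rs %>%
  collect_predictions() %>%
  roc_curve(Lapsed, .pred_Lapsed) %>% # Lapsed as the event of interest
  autoplot()
# likewise tree_roc, rf_roc and knn_roc from tree_rs, rf_rs and knn_rs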

ggpubr::ggarrange(glm_roc, tree_roc, rf_roc, knn_roc,
                  labels = c("GLM", "Tree", "RF", "Knn"),
                  nrow = 2, ncol = 2)

Confusion matrix

From the comparison, we can set aside the knn algorithm. We then try to find the policies that lapsed, by checking the confusion matrices:
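The fractional frequencies below suggest confusion matrices averaged over the 30 resamples; a minimal sketch, assuming they come from tune’s conf_mat_resampled():

tree_rs %>% conf_mat_resampled() # averaged confusion matrix; likewise for rf_rs and glm_rs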

Tree
## # A tibble: 4 × 3
##   Prediction Truth    Freq
##   <fct>      <fct>   <dbl>
## 1 Lapsed     Lapsed  226  
## 2 Lapsed     Inforce  69.7
## 3 Inforce    Lapsed  118  
## 4 Inforce    Inforce 645.

Policies which did lapse and were predicted to lapse (Freq = 226) number about twice the lapsed policies which were predicted wrongly (118). So about 2/3 of lapsed policies were predicted correctly, which is why the sensitivity is 0.657 (226 / (226 + 118) ≈ 0.657).

The specificity tells us that policies still in force were predicted correctly at a rate of 0.903.

RF
## # A tibble: 4 × 3
##   Prediction Truth    Freq
##   <fct>      <fct>   <dbl>
## 1 Lapsed     Lapsed  211. 
## 2 Lapsed     Inforce  52.3
## 3 Inforce    Lapsed  133. 
## 4 Inforce    Inforce 663.

The lapse detection (sensitivity) rate of the random forest is 0.612.

GLM
## # A tibble: 4 × 3
##   Prediction Truth    Freq
##   <fct>      <fct>   <dbl>
## 1 Lapsed     Lapsed   210.
## 2 Lapsed     Inforce   64 
## 3 Inforce    Lapsed   134.
## 4 Inforce    Inforce  651

The lapse detection (sensitivity) rate of the GLM is 0.611.

Given that the objective is to predict which policies lapse, it seems the decision tree did better.

Finalize model

Fit the final model to the training set and evaluate it on the test set. Then we can select the final model.

Tree - the selected model

tree_final <- pol_wf %>% # take the workflow defined above
  # add the model specification again
  add_model(tree_spec) %>%
  # fit on the training set and evaluate on the test set in one step
  last_fit(policy_split) # using the initial train/test split

last_fit() fits on the training set and evaluates on the test set, giving us the final predictions.

We will compare the metrics on the testing data with the metrics from resampling on the training data:

collect_metrics(tree_final) # metrics from final tree model with testing data
## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.838 Preprocessor1_Model1
## 2 roc_auc  binary         0.808 Preprocessor1_Model1
collect_metrics(tree_rs) # metrics on resampling on training
## # A tibble: 4 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.823    30 0.00775 Preprocessor1_Model1
## 2 roc_auc  binary     0.822    30 0.0106  Preprocessor1_Model1
## 3 sens     binary     0.657    30 0.0196  Preprocessor1_Model1
## 4 spec     binary     0.903    30 0.00663 Preprocessor1_Model1

The accuracy on the testing set is just a little better than on the resampled training data.

View predictions

collect_predictions(tree_final) %>%
  conf_mat(Lapsed, .pred_class)
##           Truth
## Prediction Lapsed Inforce
##    Lapsed      55      12
##    Inforce     31     167

This looks pretty good: the model does a decent job of recognizing both Lapsed and Inforce policies. We therefore keep the decision tree as the selected model.

RF final
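rf_final is not created in the code shown above; presumably it was built the same way as tree_final, e.g.:

rf_final <- pol_wf %>%
  add_model(rf_spec) %>%
  last_fit(policy_split) # fit on training, evaluate on testing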

# Performance of RF model on testing
collect_metrics(rf_final)
## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.819 Preprocessor1_Model1
## 2 roc_auc  binary         0.859 Preprocessor1_Model1
# Compare to resamples on training
collect_metrics(rf_rs) 
## # A tibble: 4 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.825    30 0.00588 Preprocessor1_Model1
## 2 roc_auc  binary     0.867    30 0.00885 Preprocessor1_Model1
## 3 sens     binary     0.612    30 0.0167  Preprocessor1_Model1
## 4 spec     binary     0.927    30 0.00565 Preprocessor1_Model1
collect_predictions(rf_final) %>%
  conf_mat(Lapsed, .pred_class)
##           Truth
## Prediction Lapsed Inforce
##    Lapsed      47       9
##    Inforce     39     170

GLM final
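Likewise, a sketch for glm_final:

glm_final <- pol_wf %>%
  add_model(glm_spec) %>%
  last_fit(policy_split)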

# Performance of GLM model on testing
collect_metrics(glm_final)
## # A tibble: 2 × 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.808 Preprocessor1_Model1
## 2 roc_auc  binary         0.856 Preprocessor1_Model1
# Compare to resamples on training
collect_metrics(glm_rs)
## # A tibble: 4 × 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.813    30 0.00596 Preprocessor1_Model1
## 2 roc_auc  binary     0.852    30 0.00667 Preprocessor1_Model1
## 3 sens     binary     0.611    30 0.0171  Preprocessor1_Model1
## 4 spec     binary     0.910    30 0.00641 Preprocessor1_Model1
collect_predictions(glm_final) %>%
  conf_mat(Lapsed, .pred_class)
##           Truth
## Prediction Lapsed Inforce
##    Lapsed      49      14
##    Inforce     37     165

Variable importance

Plot the variable importance scores for the predictors in each model:

library(vip)
vip_tree <-
  tree_final %>%
  extract_fit_engine() %>% # extract engine from the final model
  vip() + ggtitle("Tree")

vip_rf <-
  rf_final %>%
  extract_fit_engine() %>% # extract engine from the final model
  vip() + ggtitle("RF")

vip_glm <-
  glm_final %>%
  extract_fit_engine() %>% # extract engine from the final model
  vip() + ggtitle("GLM")

ggpubr::ggarrange(vip_tree, vip_rf, vip_glm,
                  ncol = 3, nrow = 1) 

Premium, NumOfReinstated, NumOfCalls, NumOfEmails, and Occupation carry the largest importance scores for classifying Lapsed vs. Inforce.

Conclusion and Prediction

  • The selected model, the decision tree, reaches an accuracy of 0.838 on the test set. That means the model classifies correctly about 8 out of 10 times whether a policy is Lapsed or Inforce.

  • The sensitivity rate (~0.64, i.e. 55 / (55 + 31) from the test-set confusion matrix) tells us that about 6 out of 10 lapsed policies are caught by the model.

  • The following table shows the predicted probability of Lapsed/Inforce for each test policy, compared with its true status:

# predicting with testing data
library(DT)
tree_fit %>%
  predict(new_data = bake(pol_prep, new_data = policy_test),
          type = "prob") %>%
  mutate(PolicyID = policy_test$ID, .before = 1) %>%
  mutate(Truth = policy_test$Lapsed) %>%
#  filter(Truth == "Lapsed") %>%
  arrange(desc(.pred_Lapsed)) %>%
  datatable(caption = "Prediction rate of Lapsed/Inforce", filter = "top") %>%
  formatRound(2:3,digits = 3)

Bonus: Plot the selected decision tree
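A minimal sketch, assuming the rpart.plot package is available:

library(rpart.plot)
tree_final %>%
  extract_fit_engine() %>% # pull out the underlying rpart object
  rpart.plot(roundint = FALSE) # roundint = FALSE avoids rounding warnings on the scaled predictors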