Explore Data

Our modeling goal is to predict whether a policy is Lapsed or In-force based on policy information, customer demographics, and the frequency of interactions around policy events.

Summary of the policy data:

Data summary
Name policy
Number of rows 1341
Number of columns 19
_______________________
Column type frequency:
factor 10
numeric 9
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts pct
Lapsed 0 1.00 FALSE 2 Inf: 907, Lap: 434 Inf: 0.68, Lap: 0.32
PO_Sex 0 1.00 FALSE 2 mal: 709, fem: 632 mal: 0.53, fem: 0.47
PO_Married 0 1.00 FALSE 2 Mar: 802, Sin: 539 Mar: 0.60, Sin: 0.40
Occupation 0 1.00 FALSE 4 Grp: 486, Grp: 425, Grp: 263, Grp: 167 Grp: 0.36, Grp: 0.32, Grp: 0.20, Grp: 0.12
Phone_registered 12 0.99 FALSE 2 Yes: 935, No: 394 Yes: 0.70, No: 0.30
PO_is_INS 0 1.00 FALSE 2 No: 1158, Yes: 183 No: 0.86, Yes: 0.14
INS_Sex 0 1.00 FALSE 2 mal: 693, fem: 648 mal: 0.52, fem: 0.48
CoveragePeriod 0 1.00 FALSE 3 5-1: 617, >10: 412, 1-5: 312 5-1: 0.46, >10: 0.31, 1-5: 0.23
PaymentTerm 0 1.00 FALSE 4 Qua: 442, Ann: 421, Sem: 258, Mon: 220 Qua: 0.33, Ann: 0.31, Sem: 0.19, Mon: 0.16
DistributionChannel 0 1.00 FALSE 5 Com: 550, Ban: 401, Cor: 199, Gen: 126 Com: 0.41, Ban: 0.30, Cor: 0.15, Gen: 0.09, Oth: 0.05

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ID 0 1 671.00 387.26 1 336 671 1006 1341 ▇▇▇▇▇
NumOfReinstated 0 1 0.68 1.08 0 0 0 1 5 ▇▁▁▁▁
NumOfClaims 2 1 0.56 1.06 0 0 0 1 5 ▇▁▁▁▁
NumOfEmails 1 1 1.29 1.04 0 1 1 2 5 ▇▁▁▁▁
NumOfCalls 4 1 1.13 1.13 0 0 1 2 5 ▇▁▁▁▁
PO_Age 0 1 43.31 8.86 22 36 43 51 59 ▂▇▇▇▇
INS_Age 0 1 39.89 13.58 18 28 40 52 64 ▇▆▇▇▆
Premium 0 1 2825.38 2489.13 224 1025 2021 3761 12754 ▇▂▁▁▁
AgentYearSVR 0 1 2.11 1.06 1 1 2 3 6 ▇▂▁▁▁

The EDA steps were carried out in a separate document; check here for the EDA.

Feature engineering

Split data and resample

The first step of our analysis is to split the data into two separate sets: a training set and a testing set. The training set will be used to train the models, while the testing set will be used to evaluate the final model.

To prevent overfitting during training, we resample the training data using k-fold cross-validation with vfold_cv(). With v = 10 folds (repeated three times), we expect a more reliable estimate of each model’s performance.

set.seed(123)
policy_split <- policy %>%
  drop_na() %>%
  initial_split(prop = 0.80, strata = Lapsed)
train <- training(policy_split)
test <- testing(policy_split)

# Cross-validation folds (default v = 10), repeated 3 times
folds <- vfold_cv(train, strata = Lapsed,
                  repeats = 3)

Pre-processing

Before feeding the data to the models, we need to pre-process it using recipe():

pol_rec <-
  recipe(Lapsed ~., data= train) %>%
  update_role(ID, new_role = "id pol") %>% # treat ID as an identifier, not a predictor
  step_corr(all_numeric()) %>% # remove highly correlated numeric predictors
  step_dummy(all_nominal(), -all_outcomes()) %>% # convert nominal predictors to dummy variables
  step_zv(all_numeric()) # remove zero-variance predictors
#  step_normalize(all_numeric_predictors()) # normalization is generally not needed for tree-based methods
pol_rec 
## Recipe
## 
## Inputs:
## 
##       role #variables
##     id pol          1
##    outcome          1
##  predictor         17
## 
## Operations:
## 
## Correlation filter on all_numeric()
## Dummy variables from all_nominal(), -all_outcomes()
## Zero variance filter on all_numeric()
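
To sanity-check what the recipe produces, we can prep() it and bake the training set; a quick sketch (the glimpse() call is only for inspection):

pol_rec %>%
  prep() %>%                  # estimate the pre-processing steps on the training data
  bake(new_data = NULL) %>%   # return the pre-processed training set
  glimpse()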

Build models

Type of models

We will build models using tree-based methods: Decision Tree decision_tree(), Bagged Trees bag_tree(), Random Forest rand_forest(), and Boosted Trees boost_tree().
Other models can be found at https://www.tidymodels.org/find/parsnip/

Model Examples

Decision Tree

Specify model
set.seed(123)
tree_spec <- decision_tree() %>%
  set_engine("rpart") %>% 
  set_mode("classification")
Create workflow
tree_wf <- workflow() %>%
  add_recipe(pol_rec) %>%
  add_model(tree_spec)
Fit the model
set.seed(123)
#future::plan(multisession)

tree_fit <- fit_resamples(
    tree_wf,
    resamples = folds,
    metrics = metric_set(accuracy,roc_auc,sens, spec),
    control = control_resamples(save_pred = TRUE))

Bagged Trees

Specify model
library(baguette)
set.seed(123)
bag_spec <- bag_tree() %>%
  set_engine("rpart", times = 10) %>% # 10 boostrap resample 
  set_mode("classification")
Create workflow
bag_wf <- workflow() %>%
  add_recipe(pol_rec) %>%
  add_model(bag_spec)
Fit the model
set.seed(123)
#future::plan(multisession)

bag_fit <- fit_resamples(
    bag_wf,
    resamples = folds,
    metrics = metric_set(accuracy,roc_auc,sens, spec),
    control = control_resamples(verbose = TRUE,
                                save_pred = TRUE))

Random Forest

Specify model
set.seed(123)
rf_spec <- rand_forest() %>%
  set_engine("ranger",  importance = "impurity",
             verbose = TRUE) %>%
  set_mode ("classification") %>%
  set_args(trees = 1000)
Create workflow
rf_wf <- workflow() %>%
  add_recipe(pol_rec) %>%
  add_model(rf_spec)
Fit the model
set.seed(123)
#future::plan(multisession)

rf_fit <- fit_resamples(
    rf_wf,
    resamples = folds,
    metrics = metric_set(accuracy,roc_auc,sens, spec),
    control = control_resamples(save_pred = TRUE))

Boosted Trees

Specify model
set.seed(123)
xgb_spec <-
  boost_tree(
#    trees = tune(),
#    mtry = tune(),
#    min_n = tune(),
#    learn_rate = 0.01
  ) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
Create workflow
xgb_wf <- workflow() %>%
  add_recipe(pol_rec) %>%
  add_model(xgb_spec)
Fit the model
set.seed(123)
#future::plan(multisession)
xgb_fit <- fit_resamples(
    xgb_wf,
    resamples = folds,
    metrics = metric_set(accuracy,roc_auc,sens, spec),
    control = control_resamples(save_pred = TRUE))

Evaluate models

Four models have been built. We will evaluate their performance by comparing their metrics:

The metrics

After running these four methods, it’s time to evaluate their performance with the collect_metrics() function. Below is the comparison table:

options(digits = 3)
tree_fit %>% collect_metrics() %>%
  select(.metric,mean) %>%
  rename("tree" = "mean") %>%
  bind_cols(collect_metrics(bag_fit) %>%
            select(mean) %>%
              rename("bag" = mean)) %>%
  bind_cols(collect_metrics(rf_fit) %>%
            select(mean) %>%
              rename("rf" = "mean")) %>%
  bind_cols(collect_metrics(xgb_fit) %>%
            select(mean) %>%
              rename("xgb" = mean)
            ) %>%
  knitr::kable(caption = "Metric evaluation")
Metric evaluation
.metric tree bag rf xgb
accuracy 0.809 0.788 0.824 0.820
roc_auc 0.810 0.812 0.859 0.854
sens 0.658 0.655 0.595 0.647
spec 0.882 0.851 0.934 0.904

It looks like:

  • Random Forest did better on roc_auc and overall accuracy. It also has the highest specificity but the lowest sensitivity.

  • Decision Tree also performed well on overall accuracy and has the highest sensitivity.

ROC Curve
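
One way to compare the models visually is to plot ROC curves from the out-of-fold predictions saved during resampling; a sketch using the fit objects above:

bind_rows(
  collect_predictions(tree_fit) %>% mutate(model = "tree"),
  collect_predictions(bag_fit)  %>% mutate(model = "bag"),
  collect_predictions(rf_fit)   %>% mutate(model = "rf"),
  collect_predictions(xgb_fit)  %>% mutate(model = "xgb")
) %>%
  group_by(model) %>%
  roc_curve(Lapsed, .pred_Lapsed) %>% # "Lapsed" is the first factor level, i.e. the event of interest
  autoplot()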

Finalize model

The final step is to fit each workflow on the full training set and evaluate it on the testing set using the last_fit() function.

fn_metrics <- metric_set(roc_auc, accuracy, sens, spec)
# Decision Tree
tree_final <- last_fit(
  tree_wf,
  split = policy_split,
  metrics = fn_metrics
)
# Bagging Trees
bag_final <- last_fit(
  bag_wf,
  split = policy_split,
  metrics = fn_metrics
)
# Random forest
rf_final <- last_fit(
  rf_wf,
  split = policy_split,
  metrics = fn_metrics
)
# boosted tree
xgb_final <- last_fit(
  xgb_wf,
  split = policy_split,
  metrics = fn_metrics
)

The table below shows the actual out-of-sample performance for each of our four models.

tree_final %>% collect_metrics() %>%
  select(.metric,.estimate) %>%
  rename("tree" = .estimate) %>%
  bind_cols(collect_metrics(bag_final) %>%
            select(.estimate) %>%
              rename("bag" = .estimate)) %>%
  bind_cols(collect_metrics(rf_final) %>%
            select(.estimate) %>%
              rename("rf" = .estimate)) %>%
  bind_cols(collect_metrics(xgb_final) %>%
            select(.estimate) %>%
              rename("xgb" = .estimate)
            ) %>%
  knitr::kable(caption = "Metric evaluation")
Metric evaluation
.metric tree bag rf xgb
accuracy 0.823 0.804 0.830 0.819
sens 0.767 0.709 0.686 0.698
spec 0.849 0.849 0.899 0.877
roc_auc 0.840 0.840 0.879 0.877

After applying the four trained models to the unseen test data, the performance is similar to the resampling metrics on the training data. It looks like:

  • Random Forest did better on overall accuracy and roc_auc, but it has the lowest sensitivity. If we use accuracy/roc_auc as the selection criterion, then Random Forest is our finalized model.

  • If the objective is to identify policies likely to lapse, we should consider the model with the highest sensitivity. In this case, Decision Tree is the selected model:

    • The selected model, Decision Tree, reaches an accuracy of 0.823, meaning it correctly classifies a policy as Lapsed or In-force about 8 times out of 10.

    • The sensitivity of 0.767, higher than the resampling estimate on the training data (0.658), tells us that roughly 7.7 out of 10 lapsed policies are correctly identified by the model.

A closer look at the predictions from the Decision Tree:

Confusion matrix

Confusion matrix with Decision Tree

tree_final %>% collect_predictions() %>%
  conf_mat(Lapsed,.pred_class)
##           Truth
## Prediction Lapsed Inforce
##    Lapsed      66      27
##    Inforce     20     152
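
From this matrix, sensitivity = 66 / (66 + 20) ≈ 0.767 and specificity = 152 / (152 + 27) ≈ 0.849, consistent with the test metrics reported above.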

Variable importance

Plot the variable importance scores for the predictors in the model:

library(vip)
tree_final %>%
  extract_fit_engine() %>% # extract the underlying rpart fit from the final workflow
  vip() + ggtitle("Tree")

Prediction

The following table shows the predicted probabilities of Lapsed/In-force for each test policy, together with its true status:

# predicting with testing data
library(DT)
tree_final %>% collect_predictions() %>%
  select(.pred_Lapsed,.pred_Inforce,.pred_class,Lapsed) %>%
  rename("Truth" = Lapsed,
         "Predicted" = .pred_class) %>%
  mutate(PolicyID = test$ID, .before=1) %>%
    arrange(desc(.pred_Lapsed)) %>%
  datatable(caption = "Prediction rate of Lapsed/Inforce", filter = "top") %>%
  formatRound(2:3,digits = 3)

From this table, we can filter the policies that are currently in force (Truth = “Inforce”) but are predicted to lapse (Predicted = “Lapsed”), and we can also view the lapse probability for those policies.
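
For example, a quick way to pull out those policies (a sketch on the same predictions object):

tree_final %>% collect_predictions() %>%
  filter(Lapsed == "Inforce", .pred_class == "Lapsed") %>% # in force today but predicted to lapse
  arrange(desc(.pred_Lapsed)) %>%
  select(.pred_Lapsed, .pred_Inforce, .pred_class, Lapsed)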

Bonus: Plot the selected Decision Tree
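
One way to draw it, assuming the rpart.plot package is available (a sketch):

library(rpart.plot)
tree_final %>%
  extract_fit_engine() %>%      # the underlying rpart object
  rpart.plot(roundint = FALSE)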