Our modeling goal is to predict whether a policy is Lapsed or In-force, based on policy information, customer demographics, and the frequency of interactions with policy events.
Summary policy data:
| Name | policy |
|---|---|
| Number of rows | 1341 |
| Number of columns | 19 |
| Column type frequency: factor | 10 |
| Column type frequency: numeric | 9 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts | pct |
|---|---|---|---|---|---|---|
| Lapsed | 0 | 1.00 | FALSE | 2 | Inf: 907, Lap: 434 | Inf: 0.68, Lap: 0.32 |
| PO_Sex | 0 | 1.00 | FALSE | 2 | mal: 709, fem: 632 | mal: 0.53, fem: 0.47 |
| PO_Married | 0 | 1.00 | FALSE | 2 | Mar: 802, Sin: 539 | Mar: 0.60, Sin: 0.40 |
| Occupation | 0 | 1.00 | FALSE | 4 | Grp: 486, Grp: 425, Grp: 263, Grp: 167 | Grp: 0.36, Grp: 0.32, Grp: 0.20, Grp: 0.12 |
| Phone_registered | 12 | 0.99 | FALSE | 2 | Yes: 935, No: 394 | Yes: 0.70, No: 0.30 |
| PO_is_INS | 0 | 1.00 | FALSE | 2 | No: 1158, Yes: 183 | No: 0.86, Yes: 0.14 |
| INS_Sex | 0 | 1.00 | FALSE | 2 | mal: 693, fem: 648 | mal: 0.52, fem: 0.48 |
| CoveragePeriod | 0 | 1.00 | FALSE | 3 | 5-1: 617, >10: 412, 1-5: 312 | 5-1: 0.46, >10: 0.31, 1-5: 0.23 |
| PaymentTerm | 0 | 1.00 | FALSE | 4 | Qua: 442, Ann: 421, Sem: 258, Mon: 220 | Qua: 0.33, Ann: 0.31, Sem: 0.19, Mon: 0.16 |
| DistributionChannel | 0 | 1.00 | FALSE | 5 | Com: 550, Ban: 401, Cor: 199, Gen: 126 | Com: 0.41, Ban: 0.30, Cor: 0.15, Gen: 0.09, Oth: 0.05 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1 | 671.00 | 387.26 | 1 | 336 | 671 | 1006 | 1341 | ▇▇▇▇▇ |
| NumOfReinstated | 0 | 1 | 0.68 | 1.08 | 0 | 0 | 0 | 1 | 5 | ▇▁▁▁▁ |
| NumOfClaims | 2 | 1 | 0.56 | 1.06 | 0 | 0 | 0 | 1 | 5 | ▇▁▁▁▁ |
| NumOfEmails | 1 | 1 | 1.29 | 1.04 | 0 | 1 | 1 | 2 | 5 | ▇▁▁▁▁ |
| NumOfCalls | 4 | 1 | 1.13 | 1.13 | 0 | 0 | 1 | 2 | 5 | ▇▁▁▁▁ |
| PO_Age | 0 | 1 | 43.31 | 8.86 | 22 | 36 | 43 | 51 | 59 | ▂▇▇▇▇ |
| INS_Age | 0 | 1 | 39.89 | 13.58 | 18 | 28 | 40 | 52 | 64 | ▇▆▇▇▆ |
| Premium | 0 | 1 | 2825.38 | 2489.13 | 224 | 1025 | 2021 | 3761 | 12754 | ▇▂▁▁▁ |
| AgentYearSVR | 0 | 1 | 2.11 | 1.06 | 1 | 1 | 2 | 3 | 6 | ▇▂▁▁▁ |
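For reference, a summary table like the one above can be produced with the skimr package. A minimal sketch, assuming the data set has already been read into a data frame named policy:
# Minimal sketch: reproduce the summary above with skimr
# (assumes the data is already loaded as `policy`)
library(skimr)
skim(policy)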
The EDA steps were completed in another document; check here for the EDA.
The first step of our analysis is to split the data into two separate sets: a “training” set and a “testing” set. The “training” set will be used to train the models, while the “testing” set will be used to evaluate the final model.
To prevent overfitting during training, we will resample the training data using k-fold cross-validation with vfold_cv(). With a k = 10 fold split (repeated 3 times below), we expect a more reliable estimate of each model's performance.
set.seed(123)
policy_split <- policy %>%
drop_na() %>%
initial_split(prop = 0.80, strata = Lapsed)
train <- training(policy_split)
test <- testing(policy_split)
# Cross-validation folds (default v = 10), repeated 3 times
folds <- vfold_cv(train, strata = Lapsed, repeats = 3)
Before feeding the data to the models, we need to pre-process it using recipe():
pol_rec <-
  recipe(Lapsed ~ ., data = train) %>%
  update_role(ID, new_role = "id pol") %>%        # treat the ID variable as an id, not a predictor
  step_corr(all_numeric()) %>%                    # filter out highly correlated numeric predictors
  step_dummy(all_nominal(), -all_outcomes()) %>%  # convert nominal predictors to dummy variables
  step_zv(all_numeric())                          # remove zero-variance predictors
  # step_normalize(all_numeric_predictors())      # likely not needed for tree-based methods
pol_rec
## Recipe
##
## Inputs:
##
## role #variables
## id pol 1
## outcome 1
## predictor 17
##
## Operations:
##
## Correlation filter on all_numeric()
## Dummy variables from all_nominal(), -all_outcomes()
## Zero variance filter on all_numeric()
We will build models based on tree-based methods: Decision Tree decision_tree(), Bagging bag_tree(), Random Forest rand_forest(), and Boosting boost_tree().
Other models can be found at https://www.tidymodels.org/find/parsnip/
set.seed(123)
tree_spec <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
tree_wf <- workflow() %>%
add_recipe(pol_rec) %>%
add_model(tree_spec)
set.seed(123)
#future::plan(multisession)
tree_fit <- fit_resamples(
tree_wf,
resamples = folds,
metrics = metric_set(accuracy,roc_auc,sens, spec),
control = control_resamples(save_pred = TRUE))
library(baguette)
set.seed(123)
bag_spec <- bag_tree() %>%
set_engine("rpart", times = 10) %>% # 10 boostrap resample
set_mode("classification")
bag_wf <- workflow() %>%
add_recipe(pol_rec) %>%
add_model(bag_spec)
set.seed(123)
#future::plan(multisession)
bag_fit <- fit_resamples(
bag_wf,
resamples = folds,
metrics = metric_set(accuracy,roc_auc,sens, spec),
control = control_resamples(verbose = TRUE,
save_pred = TRUE))
set.seed(123)
rf_spec <- rand_forest() %>%
set_engine("ranger", importance = "impurity",
verbose = TRUE) %>%
set_mode("classification") %>%
set_args(trees = 1000)
rf_wf <- workflow() %>%
add_recipe(pol_rec) %>%
add_model(rf_spec)
set.seed(123)
#future::plan(multisession)
rf_fit <- fit_resamples(
rf_wf,
resamples = folds,
metrics = metric_set(accuracy,roc_auc,sens, spec),
control = control_resamples(save_pred = TRUE))
set.seed(123)
xgb_spec <-
boost_tree(
# trees = tune(),
# mtry = tune(),
# min_n = tune(),
# learn_rate = 0.01
) %>%
set_engine("xgboost") %>%
set_mode("classification")
xgb_wf <- workflow() %>%
add_recipe(pol_rec) %>%
add_model(xgb_spec)
set.seed(123)
#future::plan(multisession)
xgb_fit <- fit_resamples(
xgb_wf,
resamples = folds,
metrics = metric_set(accuracy,roc_auc,sens, spec),
control = control_resamples(save_pred = TRUE))
Four models have been built. Now it's time to evaluate their performance with the collect_metrics() function. The table below compares their resampling metrics:
options(digits = 3)
tree_fit %>% collect_metrics() %>%
select(.metric,mean) %>%
rename("tree" = "mean") %>%
bind_cols(collect_metrics(bag_fit) %>%
select(mean) %>%
rename("bag" = mean)) %>%
bind_cols(collect_metrics(rf_fit) %>%
select(mean) %>%
rename("rf" = "mean")) %>%
bind_cols(collect_metrics(xgb_fit) %>%
select(mean) %>%
rename("xgb" = mean)
) %>%
knitr::kable(caption = "Metric evaluation")
| .metric | tree | bag | rf | xgb |
|---|---|---|---|---|
| accuracy | 0.809 | 0.788 | 0.824 | 0.820 |
| roc_auc | 0.810 | 0.812 | 0.859 | 0.854 |
| sens | 0.658 | 0.655 | 0.595 | 0.647 |
| spec | 0.882 | 0.851 | 0.934 | 0.904 |
It looks like:
Random forest did best on roc_auc and overall accuracy; it also has the highest specificity, but the lowest sensitivity.
The decision tree also did well on overall accuracy and has the highest sensitivity.
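As a complementary view of the roc_auc comparison, the resampled predictions saved with save_pred = TRUE can be used to draw ROC curves for all four models. A minimal sketch (predictions are pooled across the repeated folds, so this is a rough visual only):
# Sketch: ROC curves from the resampled predictions of each model
bind_rows(
  collect_predictions(tree_fit) %>% mutate(model = "tree"),
  collect_predictions(bag_fit) %>% mutate(model = "bag"),
  collect_predictions(rf_fit) %>% mutate(model = "rf"),
  collect_predictions(xgb_fit) %>% mutate(model = "xgb")
) %>%
  group_by(model) %>%
  roc_curve(Lapsed, .pred_Lapsed) %>%
  autoplot()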
The final step is to fit each workflow to the full training data and evaluate it on the testing data using the last_fit() function.
fn_metrics <- metric_set(roc_auc, accuracy, sens, spec)
# Decision Tree
tree_final <- last_fit(
tree_wf,
split = policy_split,
metrics = fn_metrics
)
# Bagging Trees
bag_final <- last_fit(
bag_wf,
split = policy_split,
metrics = fn_metrics
)
# Random forest
rf_final <- last_fit(
rf_wf,
split = policy_split,
metrics = fn_metrics
)
# boosted tree
xgb_final <- last_fit(
xgb_wf,
split = policy_split,
metrics = fn_metrics
)
The table below shows the actual out-of-sample performance for each of our four models.
tree_final %>% collect_metrics() %>%
select(.metric,.estimate) %>%
rename("tree" = .estimate) %>%
bind_cols(collect_metrics(bag_final) %>%
select(.estimate) %>%
rename("bag" = .estimate)) %>%
bind_cols(collect_metrics(rf_final) %>%
select(.estimate) %>%
rename("rf" = .estimate)) %>%
bind_cols(collect_metrics(xgb_final) %>%
select(.estimate) %>%
rename("xgb" = .estimate)
) %>%
knitr::kable(caption = "Metric evaluation")
| .metric | tree | bag | rf | xgb |
|---|---|---|---|---|
| accuracy | 0.823 | 0.804 | 0.830 | 0.819 |
| sens | 0.767 | 0.709 | 0.686 | 0.698 |
| spec | 0.849 | 0.849 | 0.899 | 0.877 |
| roc_auc | 0.840 | 0.840 | 0.879 | 0.877 |
After applying the four trained models to the unseen test data, the picture is similar to the performance metrics from resampling the training data:
Random Forest did best on overall accuracy and roc_auc, but it has the lowest sensitivity. If we take accuracy/roc_auc as the selection criterion, then Random Forest is our final model.
If the objective is to identify the likelihood of lapse, we should instead choose the model with the highest sensitivity. In this case, the Decision Tree is the selected model:
The selected model, the Decision Tree, reaches an accuracy of 0.823. This means the model classifies a policy correctly as Lapsed or In-force roughly 8 times out of 10.
Its sensitivity of 0.767, a bit higher than the estimate on the training data, tells us that roughly 77% of lapsed policies are correctly identified by the model.
A closer look at the predictions of the Decision Tree:
Confusion matrix for the Decision Tree:
tree_final %>% collect_predictions() %>%
conf_mat(Lapsed,.pred_class)
## Truth
## Prediction Lapsed Inforce
## Lapsed 66 27
## Inforce 20 152
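These counts reproduce the test-set metrics reported above: accuracy = (66 + 152) / 265 ≈ 0.823, sensitivity = 66 / (66 + 20) ≈ 0.767, and specificity = 152 / (152 + 27) ≈ 0.849. The same metrics (and several others) can also be computed directly from the confusion matrix with summary():
# Derive accuracy, sens, spec and other metrics from the confusion matrix
tree_final %>%
  collect_predictions() %>%
  conf_mat(Lapsed, .pred_class) %>%
  summary()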
Plot the variable importance scores for the predictors in the model:
library(vip)
tree_final %>%
extract_fit_engine() %>% # extract engine from the final model
vip() + ggtitle("Tree")
The following table shows the predicted probabilities of Lapsed/In-force for each test policy, compared to its true (Truth) status:
# predicting with testing data
library(DT)
tree_final %>% collect_predictions() %>%
select(.pred_Lapsed,.pred_Inforce,.pred_class,Lapsed) %>%
rename("Truth" = Lapsed,
"Predicted" = .pred_class) %>%
mutate(PolicyID = test$ID, .before=1) %>%
arrange(desc(.pred_Lapsed)) %>%
datatable(caption = "Prediction rate of Lapsed/Inforce", filter = "top") %>%
formatRound(2:3,digits = 3)
From this table, we can filter for policies that are currently in force (Truth = "Inforce") but that the model predicts will lapse (Predicted = "Lapsed"), and view the lapse probability for each of those policies.
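A minimal sketch of that filter, reusing the prediction columns above (matching PolicyID by test-set row order, as in the table code):
# Sketch: in-force policies that the model predicts will lapse,
# ordered by their predicted lapse probability
tree_final %>%
  collect_predictions() %>%
  mutate(PolicyID = test$ID, .before = 1) %>%
  filter(Lapsed == "Inforce", .pred_class == "Lapsed") %>%
  arrange(desc(.pred_Lapsed)) %>%
  select(PolicyID, .pred_Lapsed, Lapsed, .pred_class)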