Our modeling goal is to predict whether a policy is Lapsed or In-force, based on policy information, customer demographics, and the frequency of interactions with policy events.
Summary policy data:
| Name | policy |
|---|---|
| Number of rows | 1341 |
| Number of columns | 19 |
| Column type frequency: factor | 10 |
| Column type frequency: numeric | 9 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts | pct |
|---|---|---|---|---|---|---|
| Lapsed | 0 | 1.00 | FALSE | 2 | Inf: 907, Lap: 434 | Inf: 0.68, Lap: 0.32 |
| PO_Sex | 0 | 1.00 | FALSE | 2 | mal: 709, fem: 632 | mal: 0.53, fem: 0.47 |
| PO_Married | 0 | 1.00 | FALSE | 2 | Mar: 802, Sin: 539 | Mar: 0.60, Sin: 0.40 |
| Occupation | 0 | 1.00 | FALSE | 4 | Grp: 486, Grp: 425, Grp: 263, Grp: 167 | Grp: 0.36, Grp: 0.32, Grp: 0.20, Grp: 0.12 |
| Phone_registered | 12 | 0.99 | FALSE | 2 | Yes: 935, No: 394 | Yes: 0.70, No: 0.30 |
| PO_is_INS | 0 | 1.00 | FALSE | 2 | No: 1158, Yes: 183 | No: 0.86, Yes: 0.14 |
| INS_Sex | 0 | 1.00 | FALSE | 2 | mal: 693, fem: 648 | mal: 0.52, fem: 0.48 |
| CoveragePeriod | 0 | 1.00 | FALSE | 3 | 5-1: 617, >10: 412, 1-5: 312 | 5-1: 0.46, >10: 0.31, 1-5: 0.23 |
| PaymentTerm | 0 | 1.00 | FALSE | 4 | Qua: 442, Ann: 421, Sem: 258, Mon: 220 | Qua: 0.33, Ann: 0.31, Sem: 0.19, Mon: 0.16 |
| DistributionChannel | 0 | 1.00 | FALSE | 5 | Com: 550, Ban: 401, Cor: 199, Gen: 126 | Com: 0.41, Ban: 0.30, Cor: 0.15, Gen: 0.09, Oth: 0.05 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1 | 671.00 | 387.26 | 1 | 336 | 671 | 1006 | 1341 | ▇▇▇▇▇ |
| NumOfReinstated | 0 | 1 | 0.68 | 1.08 | 0 | 0 | 0 | 1 | 5 | ▇▁▁▁▁ |
| NumOfClaims | 2 | 1 | 0.56 | 1.06 | 0 | 0 | 0 | 1 | 5 | ▇▁▁▁▁ |
| NumOfEmails | 1 | 1 | 1.29 | 1.04 | 0 | 1 | 1 | 2 | 5 | ▇▁▁▁▁ |
| NumOfCalls | 4 | 1 | 1.13 | 1.13 | 0 | 0 | 1 | 2 | 5 | ▇▁▁▁▁ |
| PO_Age | 0 | 1 | 43.31 | 8.86 | 22 | 36 | 43 | 51 | 59 | ▂▇▇▇▇ |
| INS_Age | 0 | 1 | 39.89 | 13.58 | 18 | 28 | 40 | 52 | 64 | ▇▆▇▇▆ |
| Premium | 0 | 1 | 2825.38 | 2489.13 | 224 | 1025 | 2021 | 3761 | 12754 | ▇▂▁▁▁ |
| AgentYearSVR | 0 | 1 | 2.11 | 1.06 | 1 | 1 | 2 | 3 | 6 | ▇▂▁▁▁ |
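For reference, a summary table like the one above can be produced with the skimr package. A minimal sketch, assuming the data set has already been read into a data frame named policy:
# Minimal sketch: reproduce the summary above with skimr
# (assumes the data is already loaded as `policy`)
library(skimr)
skim(policy)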
The EDA steps were completed in another document; check here for the EDA.
The first step of our analysis is to split the data into two separate sets: a “training” set and a “testing” set. The “training” set will be used to train the models, while the “testing” set will be used to evaluate the final model.
To prevent overfitting during training, we will resample the training data using k-fold cross-validation with vfold_cv(). With a k = 10 fold split (repeated 3 times below), we expect a more reliable estimate of each model's performance.
set.seed(123)
policy_split <- policy %>%
drop_na() %>%
initial_split(prop = 0.80, strata = Lapsed)
train <- training(policy_split)
test <- testing(policy_split)
# Cross-validation folds (default v = 10), repeated 3 times
folds <- vfold_cv(train, strata = Lapsed, repeats = 3)
Before feeding the data to the models, we need to pre-process it using recipe():
pol_rec <-
  recipe(Lapsed ~ ., data = train) %>%
  update_role(ID, new_role = "id pol") %>%        # treat the ID variable as an id, not a predictor
  step_corr(all_numeric()) %>%                    # filter out highly correlated numeric predictors
  step_dummy(all_nominal(), -all_outcomes()) %>%  # convert nominal predictors to dummy variables
  step_zv(all_numeric())                          # remove zero-variance predictors
  # step_normalize(all_numeric_predictors())      # likely not needed for tree-based methods
pol_rec
## Recipe
##
## Inputs:
##
## role #variables
## id pol 1
## outcome 1
## predictor 17
##
## Operations:
##
## Correlation filter on all_numeric()
## Dummy variables from all_nominal(), -all_outcomes()
## Zero variance filter on all_numeric()
We will build models based on tree-based methods: Decision Tree decision_tree(), Bagging bag_tree(), Random Forest rand_forest(), and Boosting boost_tree().
Other models can be found at https://www.tidymodels.org/find/parsnip/
set.seed(123)
tree_spec <- decision_tree() %>%
set_engine("rpart") %>%
set_mode("classification")
tree_wf <- workflow() %>%
add_recipe(pol_rec) %>%
add_model(tree_spec)
set.seed(123)
#future::plan(multisession)
tree_fit <- fit_resamples(
tree_wf,
resamples = folds,
metrics = metric_set(accuracy,roc_auc,sens, spec),
control = control_resamples(save_pred = TRUE))
library(baguette)
set.seed(123)
bag_spec <- bag_tree() %>%
set_engine("rpart", times = 10) %>% # 10 boostrap resample
set_mode("classification")
bag_wf <- workflow() %>%
add_recipe(pol_rec) %>%
add_model(bag_spec)
set.seed(123)
#future::plan(multisession)
bag_fit <- fit_resamples(
bag_wf,
resamples = folds,
metrics = metric_set(accuracy,roc_auc,sens, spec),
control = control_resamples(verbose = TRUE,
save_pred = TRUE))
set.seed(123)
rf_spec <- rand_forest() %>%
set_engine("ranger", importance = "impurity",
verbose = TRUE) %>%
set_mode("classification") %>%
set_args(trees = 1000)
rf_wf <- workflow() %>%
add_recipe(pol_rec) %>%
add_model(rf_spec)
set.seed(123)
#future::plan(multisession)
rf_fit <- fit_resamples(
rf_wf,
resamples = folds,
metrics = metric_set(accuracy,roc_auc,sens, spec),
control = control_resamples(save_pred = TRUE))
set.seed(123)
xgb_spec <-
boost_tree(
# trees = tune(),
# mtry = tune(),
# min_n = tune(),
# learn_rate = 0.01
) %>%
set_engine("xgboost") %>%
set_mode("classification")
xgb_wf <- workflow() %>%
add_recipe(pol_rec) %>%
add_model(xgb_spec)
set.seed(123)
#future::plan(multisession)
xgb_fit <- fit_resamples(
xgb_wf,
resamples = folds,
metrics = metric_set(accuracy,roc_auc,sens, spec),
control = control_resamples(save_pred = TRUE))
Four models have been built. Now it's time to evaluate their performance with the collect_metrics() function. The table below compares their resampling metrics:
options(digits = 3)
tree_fit %>% collect_metrics() %>%
select(.metric,mean) %>%
rename("tree" = "mean") %>%
bind_cols(collect_metrics(bag_fit) %>%
select(mean) %>%
rename("bag" = mean)) %>%
bind_cols(collect_metrics(rf_fit) %>%
select(mean) %>%
rename("rf" = "mean")) %>%
bind_cols(collect_metrics(xgb_fit) %>%
select(mean) %>%
rename("xgb" = mean)
) %>%
knitr::kable(caption = "Metric evaluation")
| .metric | tree | bag | rf | xgb |
|---|---|---|---|---|
| accuracy | 0.809 | 0.788 | 0.824 | 0.820 |
| roc_auc | 0.810 | 0.812 | 0.859 | 0.854 |
| sens | 0.658 | 0.655 | 0.595 | 0.647 |
| spec | 0.882 | 0.851 | 0.934 | 0.904 |
It looks like:
Random forest did best on roc_auc and overall accuracy; it also has the highest specificity, but the lowest sensitivity.
The decision tree also did well on overall accuracy and has the highest sensitivity.
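As a complementary view of the roc_auc comparison, the resampled predictions saved with save_pred = TRUE can be used to draw ROC curves for all four models. A minimal sketch (predictions are pooled across the repeated folds, so this is a rough visual only):
# Sketch: ROC curves from the resampled predictions of each model
bind_rows(
  collect_predictions(tree_fit) %>% mutate(model = "tree"),
  collect_predictions(bag_fit) %>% mutate(model = "bag"),
  collect_predictions(rf_fit) %>% mutate(model = "rf"),
  collect_predictions(xgb_fit) %>% mutate(model = "xgb")
) %>%
  group_by(model) %>%
  roc_curve(Lapsed, .pred_Lapsed) %>%
  autoplot()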
The final step is to fit each workflow to the full training data and evaluate it on the testing data using the last_fit() function.
fn_metrics <- metric_set(roc_auc, accuracy, sens, spec)
# Decision Tree
tree_final <- last_fit(
tree_wf,
split = policy_split,
metrics = fn_metrics
)
# Bagging Trees
bag_final <- last_fit(
bag_wf,
split = policy_split,
metrics = fn_metrics
)
# Random forest
rf_final <- last_fit(
rf_wf,
split = policy_split,
metrics = fn_metrics
)
# boosted tree
xgb_final <- last_fit(
xgb_wf,
split = policy_split,
metrics = fn_metrics
)
The table below shows the actual out-of-sample performance for each of our four models.
tree_final %>% collect_metrics() %>%
select(.metric,.estimate) %>%
rename("tree" = .estimate) %>%
bind_cols(collect_metrics(bag_final) %>%
select(.estimate) %>%
rename("bag" = .estimate)) %>%
bind_cols(collect_metrics(rf_final) %>%
select(.estimate) %>%
rename("rf" = .estimate)) %>%
bind_cols(collect_metrics(xgb_final) %>%
select(.estimate) %>%
rename("xgb" = .estimate)
) %>%
knitr::kable(caption = "Metric evaluation")
| .metric | tree | bag | rf | xgb |
|---|---|---|---|---|
| accuracy | 0.823 | 0.804 | 0.830 | 0.819 |
| sens | 0.767 | 0.709 | 0.686 | 0.698 |
| spec | 0.849 | 0.849 | 0.899 | 0.877 |
| roc_auc | 0.840 | 0.840 | 0.879 | 0.877 |
After applying the four trained models to the unseen test data, the picture is similar to the performance metrics from resampling the training data:
Random Forest did best on overall accuracy and roc_auc, but it has the lowest sensitivity. If we take accuracy/roc_auc as the selection criterion, then Random Forest is our final model.
If the objective is to identify the likelihood of lapse, we should instead choose the model with the highest sensitivity. In this case, the Decision Tree is the selected model:
The selected model, the Decision Tree, reaches an accuracy of 0.823. This means the model classifies a policy correctly as Lapsed or In-force roughly 8 times out of 10.
Its sensitivity of 0.767, a bit higher than the estimate on the training data, tells us that roughly 77% of lapsed policies are correctly identified by the model.
A closer look at the predictions of the Decision Tree:
Confusion matrix for the Decision Tree:
tree_final %>% collect_predictions() %>%
conf_mat(Lapsed,.pred_class)
## Truth
## Prediction Lapsed Inforce
## Lapsed 66 27
## Inforce 20 152
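These counts reproduce the test-set metrics reported above: accuracy = (66 + 152) / 265 ≈ 0.823, sensitivity = 66 / (66 + 20) ≈ 0.767, and specificity = 152 / (152 + 27) ≈ 0.849. The same metrics (and several others) can also be computed directly from the confusion matrix with summary():
# Derive accuracy, sens, spec and other metrics from the confusion matrix
tree_final %>%
  collect_predictions() %>%
  conf_mat(Lapsed, .pred_class) %>%
  summary()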
Plot the variable importance scores for the predictors in the model:
library(vip)
tree_final %>%
extract_fit_engine() %>% # extract engine from the final model
vip() + ggtitle("Tree")
The following table shows the predicted probabilities of Lapsed/In-force for each test policy, compared to its true (Truth) status:
# predicting with testing data
library(DT)
tree_final %>% collect_predictions() %>%
select(.pred_Lapsed,.pred_Inforce,.pred_class,Lapsed) %>%
rename("Truth" = Lapsed,
"Predicted" = .pred_class) %>%
mutate(PolicyID = test$ID, .before=1) %>%
arrange(desc(.pred_Lapsed)) %>%
datatable(caption = "Prediction rate of Lapsed/Inforce", filter = "top") %>%
formatRound(2:3,digits = 3)
From this table, we can filter for policies that are currently in force (Truth = "Inforce") but that the model predicts will lapse (Predicted = "Lapsed"), and view the lapse probability for each of those policies.
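A minimal sketch of that filter, reusing the prediction columns above (matching PolicyID by test-set row order, as in the table code):
# Sketch: in-force policies that the model predicts will lapse,
# ordered by their predicted lapse probability
tree_final %>%
  collect_predictions() %>%
  mutate(PolicyID = test$ID, .before = 1) %>%
  filter(Lapsed == "Inforce", .pred_class == "Lapsed") %>%
  arrange(desc(.pred_Lapsed)) %>%
  select(PolicyID, .pred_Lapsed, Lapsed, .pred_class)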