Our modeling goal is to predict whether a policy is Lapsed or Inforce, based on policy information, customer demographics, and interactions with policy events.
Let’s take a closer look at the policy data.
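The data summary below was produced with the skimr package; a minimal sketch of the call, assuming the data frame is named policy:
skimr::skim(policy)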
Summary of the policy data:
|  |  |
|---|---|
| Name | policy |
| Number of rows | 1341 |
| Number of columns | 19 |
| Column type frequency: |  |
| factor | 10 |
| numeric | 9 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts | pct |
|---|---|---|---|---|---|---|
| Lapsed | 0 | 1.00 | FALSE | 2 | Inf: 907, Lap: 434 | Inf: 0.68, Lap: 0.32 |
| PO_Sex | 0 | 1.00 | FALSE | 2 | mal: 709, fem: 632 | mal: 0.53, fem: 0.47 |
| PO_Married | 0 | 1.00 | FALSE | 2 | Mar: 802, Sin: 539 | Mar: 0.60, Sin: 0.40 |
| Occupation | 0 | 1.00 | FALSE | 4 | Grp: 486, Grp: 425, Grp: 263, Grp: 167 | Grp: 0.36, Grp: 0.32, Grp: 0.20, Grp: 0.12 |
| Phone_registered | 12 | 0.99 | FALSE | 2 | Yes: 935, No: 394 | Yes: 0.70, No: 0.30 |
| PO_is_INS | 0 | 1.00 | FALSE | 2 | No: 1158, Yes: 183 | No: 0.86, Yes: 0.14 |
| INS_Sex | 0 | 1.00 | FALSE | 2 | mal: 693, fem: 648 | mal: 0.52, fem: 0.48 |
| CoveragePeriod | 0 | 1.00 | FALSE | 3 | 5-1: 617, >10: 412, 1-5: 312 | 5-1: 0.46, >10: 0.31, 1-5: 0.23 |
| PaymentTerm | 0 | 1.00 | FALSE | 4 | Qua: 442, Ann: 421, Sem: 258, Mon: 220 | Qua: 0.33, Ann: 0.31, Sem: 0.19, Mon: 0.16 |
| DistributionChannel | 0 | 1.00 | FALSE | 5 | Com: 550, Ban: 401, Cor: 199, Gen: 126 | Com: 0.41, Ban: 0.30, Cor: 0.15, Gen: 0.09, Oth: 0.05 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1 | 671.00 | 387.26 | 1 | 336 | 671 | 1006 | 1341 | ▇▇▇▇▇ |
| NumOfReinstated | 0 | 1 | 0.68 | 1.08 | 0 | 0 | 0 | 1 | 5 | ▇▁▁▁▁ |
| NumOfClaims | 2 | 1 | 0.56 | 1.06 | 0 | 0 | 0 | 1 | 5 | ▇▁▁▁▁ |
| NumOfEmails | 1 | 1 | 1.29 | 1.04 | 0 | 1 | 1 | 2 | 5 | ▇▁▁▁▁ |
| NumOfCalls | 4 | 1 | 1.13 | 1.13 | 0 | 0 | 1 | 2 | 5 | ▇▁▁▁▁ |
| PO_Age | 0 | 1 | 43.31 | 8.86 | 22 | 36 | 43 | 51 | 59 | ▂▇▇▇▇ |
| INS_Age | 0 | 1 | 39.89 | 13.58 | 18 | 28 | 40 | 52 | 64 | ▇▆▇▇▆ |
| Premium | 0 | 1 | 2825.38 | 2489.13 | 224 | 1025 | 2021 | 3761 | 12754 | ▇▂▁▁▁ |
| AgentYearSVR | 0 | 1 | 2.11 | 1.06 | 1 | 1 | 2 | 3 | 6 | ▇▂▁▁▁ |
The EDA steps have been done in a separate document. Check here for the EDA.
Let’s move straight to the prediction steps.
The first step of our analysis is to split the data into two separate sets: a “training” set and a “testing” set. Then we resample the training data using vfold_cv().
library(tidymodels) # assumed to be loaded; provides rsample, recipes, parsnip, tune, etc.
set.seed(1234)
policy_split <- policy %>%
  na.omit() %>%
  initial_split(prop = 0.80, strata = Lapsed)
policy_train <- training(policy_split)
policy_test <- testing(policy_split)
set.seed(234)
# Cross-validation folds (default v = 10), with 3 repeats
folds <- vfold_cv(policy_train, strata = Lapsed,
                  repeats = 3)
Before feeding our data to the models, we need to pre-process it using recipe():
# Create recipe
pol_rec <-
  recipe(Lapsed ~ ., data = policy_train) %>%
  update_role(ID, new_role = "id pol") %>% # give ID an 'id' role so it is not used as a predictor
  step_corr(all_numeric()) %>% # filter out highly correlated numeric variables
  step_dummy(all_nominal(), -all_outcomes()) %>% # convert nominal predictors to dummy variables, leaving the outcome untouched
  # missing-value handling by median imputation (not needed here after na.omit())
  # step_impute_median(PO_Age) %>%
  step_zv(all_numeric()) %>% # filter out zero-variance variables
  # step_normalize(all_numeric(), -ID) # normalize numeric data to mean zero and SD one (center and scale)
  step_normalize(all_numeric_predictors())
# pol_rec
# The recipe is only a specification at this point; estimate it on the training data with prep():
pol_prep <- pol_rec %>%
prep()
pol_prep
## Recipe
##
## Inputs:
##
## role #variables
## id pol 1
## outcome 1
## predictor 17
##
## Training data contained 1059 data points and no missing data.
##
## Operations:
##
## Correlation filter on <none> [trained]
## Dummy variables from PO_Sex, PO_Married, Occupation, Phone_registered, PO_is_I... [trained]
## Zero variance filter removed <none> [trained]
## Centering and scaling for NumOfReinstated, NumOfClaims, NumOfEmails, NumO... [trained]
We will build models based on the following algorithms: logistic regression, k-nearest neighbors, decision tree, and random forest. Let’s create the model specifications:
set.seed(123)
# Logistic regression
glm_spec <- logistic_reg() %>%
  set_engine("glm") # declare the computational engine used to fit the model
# Nearest neighbor (knn)
knn_spec <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")
# Decision tree
tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")
# Random forest
rf_spec <- rand_forest(trees = 1000) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")
Fit the GLM model specification to the pre-processed training data:
# glm
glm_fit <- glm_spec %>%
fit(Lapsed ~., data = juice(pol_prep))
glm_fit
## parsnip model object
##
##
## Call: stats::glm(formula = Lapsed ~ ., family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) ID
## 0.929216 0.000107
## NumOfReinstated NumOfClaims
## -0.719950 -0.518235
## NumOfEmails NumOfCalls
## 0.357085 0.583129
## PO_Age INS_Age
## 0.174404 0.437994
## Premium AgentYearSVR
## -1.419263 0.217049
## PO_Sex_male PO_Married_Single
## -0.021519 0.078064
## Occupation_Grp_2 Occupation_Grp_3
## -0.139905 0.242438
## Occupation_Grp_4 Phone_registered_Yes
## 0.381843 -0.064816
## PO_is_INS_Yes INS_Sex_male
## -0.198616 -0.167991
## CoveragePeriod_X1.5.yrs CoveragePeriod_X5.10.yrs
## 0.333567 0.290028
## PaymentTerm_Monthly PaymentTerm_Quartely
## -0.220612 -0.140353
## PaymentTerm_Semi.annual DistributionChannel_Company.Agent
## -0.067266 -0.010849
## DistributionChannel_Corp DistributionChannel_General.Agency
## -0.046781 -0.064555
## DistributionChannel_Others.PD
## -0.263041
##
## Degrees of Freedom: 1058 Total (i.e. Null); 1032 Residual
## Null Deviance: 1335
## Residual Deviance: 853.7 AIC: 907.7
Fit the other model specifications (knn, tree, and rf) and set up the main workflow:
knn_fit <- knn_spec %>%
  fit(Lapsed ~ ., data = juice(pol_prep))
tree_fit <- tree_spec %>%
  fit(Lapsed ~ ., data = juice(pol_prep))
rf_fit <- rf_spec %>%
  fit(Lapsed ~ ., data = juice(pol_prep))
# Create the main workflow and add the recipe
pol_wf <- workflow() %>%
  add_recipe(pol_rec)
We now fit the models to the resamples. Fit the GLM model first and display the results across the 10 folds with 3 repeats:
doParallel::registerDoParallel() # enable parallel processing across multiple cores
set.seed(456)
glm_rs <- pol_wf %>%
  add_model(glm_spec) %>% # add the glm specification to the workflow
  fit_resamples(
    resamples = folds, # fit on the 10-fold CV folds (3 repeats)
    metrics = metric_set(accuracy, roc_auc, sens, spec),
    control = control_resamples(save_pred = TRUE))
glm_rs
glm_rs
## # Resampling results
## # 10-fold cross-validation repeated 3 times using stratification
## # A tibble: 30 × 6
## splits id id2 .metrics .notes .predict…¹
## <list> <chr> <chr> <list> <list> <list>
## 1 <split [952/107]> Repeat1 Fold01 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
## 2 <split [952/107]> Repeat1 Fold02 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
## 3 <split [952/107]> Repeat1 Fold03 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
## 4 <split [952/107]> Repeat1 Fold04 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
## 5 <split [953/106]> Repeat1 Fold05 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
## 6 <split [954/105]> Repeat1 Fold06 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
## 7 <split [954/105]> Repeat1 Fold07 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
## 8 <split [954/105]> Repeat1 Fold08 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
## 9 <split [954/105]> Repeat1 Fold09 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
## 10 <split [954/105]> Repeat1 Fold10 <tibble [4 × 4]> <tibble [0 × 3]> <tibble>
## # … with 20 more rows, and abbreviated variable name ¹.predictions
The per-fold performance of the model can be viewed via the .metrics and .predictions list-columns:
# display the metric (accuracy,roc_auc,sens, spec) for each fold
glm_rs %>%
unnest(.metrics)
## # A tibble: 120 × 9
## splits id id2 .metric .estimator .esti…¹ .config .notes
## <list> <chr> <chr> <chr> <chr> <dbl> <chr> <list>
## 1 <split [952/107]> Repeat1 Fold01 accuracy binary 0.822 Prepro… <tibble>
## 2 <split [952/107]> Repeat1 Fold01 sens binary 0.571 Prepro… <tibble>
## 3 <split [952/107]> Repeat1 Fold01 spec binary 0.944 Prepro… <tibble>
## 4 <split [952/107]> Repeat1 Fold01 roc_auc binary 0.827 Prepro… <tibble>
## 5 <split [952/107]> Repeat1 Fold02 accuracy binary 0.850 Prepro… <tibble>
## 6 <split [952/107]> Repeat1 Fold02 sens binary 0.686 Prepro… <tibble>
## 7 <split [952/107]> Repeat1 Fold02 spec binary 0.931 Prepro… <tibble>
## 8 <split [952/107]> Repeat1 Fold02 roc_auc binary 0.919 Prepro… <tibble>
## 9 <split [952/107]> Repeat1 Fold03 accuracy binary 0.813 Prepro… <tibble>
## 10 <split [952/107]> Repeat1 Fold03 sens binary 0.629 Prepro… <tibble>
## # … with 110 more rows, 1 more variable: .predictions <list>, and abbreviated
## # variable name ¹.estimate
glm_rs %>%
unnest(.predictions)
## # A tibble: 3,177 × 11
## splits id id2 .metrics .notes .pred…¹ .row .pred…² .pred…³
## <list> <chr> <chr> <list> <list> <fct> <int> <dbl> <dbl>
## 1 <split [952/107]> Repe… Fold… <tibble> <tibble> Inforce 1 0.475 0.525
## 2 <split [952/107]> Repe… Fold… <tibble> <tibble> Inforce 22 0.0988 0.901
## 3 <split [952/107]> Repe… Fold… <tibble> <tibble> Inforce 36 0.316 0.684
## 4 <split [952/107]> Repe… Fold… <tibble> <tibble> Inforce 37 0.265 0.735
## 5 <split [952/107]> Repe… Fold… <tibble> <tibble> Lapsed 42 0.773 0.227
## 6 <split [952/107]> Repe… Fold… <tibble> <tibble> Inforce 52 0.224 0.776
## 7 <split [952/107]> Repe… Fold… <tibble> <tibble> Inforce 56 0.0781 0.922
## 8 <split [952/107]> Repe… Fold… <tibble> <tibble> Inforce 62 0.00663 0.993
## 9 <split [952/107]> Repe… Fold… <tibble> <tibble> Inforce 64 0.00436 0.996
## 10 <split [952/107]> Repe… Fold… <tibble> <tibble> Inforce 66 0.194 0.806
## # … with 3,167 more rows, 2 more variables: Lapsed <fct>, .config <chr>, and
## # abbreviated variable names ¹.pred_class, ².pred_Lapsed, ³.pred_Inforce
Continue fitting the other models to the resamples:
set.seed(456)
tree_rs <- pol_wf %>%
add_model(tree_spec) %>%
fit_resamples(
resamples = folds,
metrics = metric_set(accuracy,roc_auc,sens, spec),
control = control_resamples(save_pred = TRUE))
set.seed(456)
knn_rs <- pol_wf %>%
add_model(knn_spec) %>%
fit_resamples(
resamples = folds,
metrics = metric_set(accuracy,roc_auc,sens, spec),
control = control_resamples(save_pred = TRUE))
set.seed(456)
rf_rs <- pol_wf %>%
add_model(rf_spec) %>%
fit_resamples(
resamples = folds,
metrics = metric_set(accuracy,roc_auc,sens, spec),
control = control_resamples(save_pred = TRUE))
Four models have been built. We will evaluate model performance by comparing their metrics:
options(digits = 3)
glm_rs %>% collect_metrics() %>%
select(.metric,mean) %>%
rename("glm" = "mean") %>%
bind_cols(collect_metrics(tree_rs) %>%
select(mean) %>%
rename("tree" = "mean")
) %>%
bind_cols(collect_metrics(rf_rs) %>%
select(mean) %>%
rename("rf" = "mean")
) %>%
bind_cols(collect_metrics(knn_rs) %>%
select(mean) %>%
rename("knn" = "mean")
) %>%
knitr::kable(caption = "Metric evaluation")
| .metric | glm | tree | rf | knn |
|---|---|---|---|---|
| accuracy | 0.813 | 0.823 | 0.825 | 0.696 |
| roc_auc | 0.852 | 0.822 | 0.867 | 0.737 |
| sens | 0.611 | 0.657 | 0.612 | 0.512 |
| spec | 0.910 | 0.903 | 0.927 | 0.784 |
It looks like:
Random forest did best on roc_auc and overall accuracy, and it also has the highest specificity.
The decision tree also did well on overall accuracy and has the highest sensitivity.
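The glm_roc, tree_roc, rf_roc, and knn_roc objects arranged below are ROC curve plots for each model; a minimal sketch of how one of them can be built from the collected resample predictions, assuming yardstick's roc_curve() and autoplot() (the other three follow the same pattern):
# Sketch: ROC curve plot for the GLM resamples (repeat for tree_rs, rf_rs, knn_rs)
glm_roc <- glm_rs %>%
  collect_predictions() %>%
  roc_curve(Lapsed, .pred_Lapsed) %>%
  autoplot()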
ggpubr::ggarrange(glm_roc, tree_roc, rf_roc, knn_roc,
labels = c("GLM", "Tree", "RF", "Knn"),
nrow = 2, ncol = 2)
From the comparison, we can set the knn algorithm aside. We then look at how well each model finds the policies that lapsed, by checking the resampled confusion matrices:
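The confusion matrices below are aggregated over the resamples. One way to get this kind of table is conf_mat_resampled() from the tune package (a sketch; note that conf_mat_resampled() reports the average cell counts per fold, so the scale of Freq may differ from the table shown):
# Sketch: resample-aggregated confusion matrix for the decision tree
tree_rs %>%
  conf_mat_resampled()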
## # A tibble: 4 × 3
## Prediction Truth Freq
## <fct> <fct> <dbl>
## 1 Lapsed Lapsed 226
## 2 Lapsed Inforce 69.7
## 3 Inforce Lapsed 118
## 4 Inforce Inforce 645.
Policies that did lapse and were predicted to lapse (Freq = 226) are about twice as frequent as lapsed policies that were predicted wrongly (118), so about two thirds of lapsed policies were predicted correctly; that is exactly the sensitivity of 0.657 (226 / (226 + 118)).
The specificity of 0.903 tells us how often policies still in force were predicted correctly.
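The corresponding aggregated matrix for the random forest (same sketch, same assumption):
# Sketch: resample-aggregated confusion matrix for the random forest
rf_rs %>%
  conf_mat_resampled()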
## # A tibble: 4 × 3
## Prediction Truth Freq
## <fct> <fct> <dbl>
## 1 Lapsed Lapsed 211.
## 2 Lapsed Inforce 52.3
## 3 Inforce Lapsed 133.
## 4 Inforce Inforce 663.
The lapse-detection (sensitivity) rate of the random forest is 0.612.
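And likewise for the GLM (again assuming conf_mat_resampled()):
# Sketch: resample-aggregated confusion matrix for the GLM
glm_rs %>%
  conf_mat_resampled()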
## # A tibble: 4 × 3
## Prediction Truth Freq
## <fct> <fct> <dbl>
## 1 Lapsed Lapsed 210.
## 2 Lapsed Inforce 64
## 3 Inforce Lapsed 134.
## 4 Inforce Inforce 651
The lapse-detection (sensitivity) rate of the GLM is 0.611.
Since our objective is to predict the policies that lapse, the decision tree seems to do better.
We fit the final model to the training set and evaluate it on the test set; then we can select the final model.
tree_final <- pol_wf %>% # take the existing workflow
  # add the model specification again
  add_model(tree_spec) %>%
  # fit on the training set and evaluate on the test set
  last_fit(policy_split) # using the original train/test split
last_fit() fits on the training set and evaluates on the test set, so that we obtain the final predictions.
We will compare the metrics from the testing data to the metrics from resampling on the training data:
collect_metrics(tree_final) # metrics from final tree model with testing data
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.838 Preprocessor1_Model1
## 2 roc_auc binary 0.808 Preprocessor1_Model1
collect_metrics(tree_rs) # metrics on resampling on training
## # A tibble: 4 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.823 30 0.00775 Preprocessor1_Model1
## 2 roc_auc binary 0.822 30 0.0106 Preprocessor1_Model1
## 3 sens binary 0.657 30 0.0196 Preprocessor1_Model1
## 4 spec binary 0.903 30 0.00663 Preprocessor1_Model1
The accuracy on the testing data is just a little better than on the resampled training data.
View the predictions:
collect_predictions(tree_final) %>%
conf_mat(Lapsed, .pred_class)
## Truth
## Prediction Lapsed Inforce
## Lapsed 55 12
## Inforce 31 167
This looks pretty good: the model does a reasonable job of recognizing both Lapsed and Inforce policies. We therefore take the decision tree as the selected model.
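For comparison, we also evaluate the random forest and GLM on the test set. The rf_final and glm_final objects used below are assumed to be produced with last_fit(), in the same way as tree_final; a minimal sketch:
# Sketch: finalize the RF and GLM workflows on the original split
rf_final <- pol_wf %>%
  add_model(rf_spec) %>%
  last_fit(policy_split)
glm_final <- pol_wf %>%
  add_model(glm_spec) %>%
  last_fit(policy_split)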
# Performance of RF model on testing
collect_metrics(rf_final)
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.819 Preprocessor1_Model1
## 2 roc_auc binary 0.859 Preprocessor1_Model1
# Compare to resamples on training
collect_metrics(rf_rs)
## # A tibble: 4 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.825 30 0.00588 Preprocessor1_Model1
## 2 roc_auc binary 0.867 30 0.00885 Preprocessor1_Model1
## 3 sens binary 0.612 30 0.0167 Preprocessor1_Model1
## 4 spec binary 0.927 30 0.00565 Preprocessor1_Model1
collect_predictions(rf_final) %>%
conf_mat(Lapsed, .pred_class)
## Truth
## Prediction Lapsed Inforce
## Lapsed 47 9
## Inforce 39 170
# Performance of GLM model on testing
collect_metrics(glm_final)
## # A tibble: 2 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.808 Preprocessor1_Model1
## 2 roc_auc binary 0.856 Preprocessor1_Model1
# Compare to resamples on training
collect_metrics(glm_rs)
## # A tibble: 4 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.813 30 0.00596 Preprocessor1_Model1
## 2 roc_auc binary 0.852 30 0.00667 Preprocessor1_Model1
## 3 sens binary 0.611 30 0.0171 Preprocessor1_Model1
## 4 spec binary 0.910 30 0.00641 Preprocessor1_Model1
collect_predictions(glm_final) %>%
conf_mat(Lapsed, .pred_class)
## Truth
## Prediction Lapsed Inforce
## Lapsed 49 14
## Inforce 37 165
Plot the variable importance scores for the predictors in each model:
library(vip)
vip_tree <-
tree_final %>%
extract_fit_engine() %>% # extract engine from the final model
vip() + ggtitle("Tree")
vip_rf <-
rf_final %>%
extract_fit_engine() %>% # extract engine from the final model
vip() + ggtitle("RF")
vip_glm <-
glm_final %>%
extract_fit_engine() %>% # extract engine from the final model
vip() + ggtitle("GLM")
ggpubr::ggarrange(vip_tree, vip_rf, vip_glm,
ncol = 3, nrow = 1)
Premium, NumOfReinstated, NumOfCalls, NumOfEmails, and Occupation carry the most importance in classifying Lapsed vs Inforce.
The selected model, the decision tree, reaches an accuracy of 0.838 on the test set. This means the model classifies correctly roughly 8 out of 10 times whether a policy is Lapsed or Inforce.
The sensitivity rate (~0.64) tells us that roughly 6 out of 10 lapsed policies are detected by the model.
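The test-set sensitivity and specificity can be read off the confusion matrix above (sensitivity = 55 / (55 + 31) ≈ 0.64); a minimal sketch, assuming yardstick's summary() method for conf_mat objects:
# Sketch: full metric summary of the test-set confusion matrix
collect_predictions(tree_final) %>%
  conf_mat(Lapsed, .pred_class) %>%
  summary()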
The following table shows the predicted probability of Lapsed/Inforce for each test policy, compared with its true status:
# predicting with testing data
library(DT)
tree_fit %>%
predict(new_data = bake(pol_prep, new_data = policy_test),
type = "prob") %>%
mutate(PolicyID = policy_test$ID, .before = 1) %>%
mutate(Truth = policy_test$Lapsed) %>%
# filter(Truth == "Lapsed") %>%
arrange(desc(.pred_Lapsed)) %>%
datatable(caption = "Prediction rate of Lapsed/Inforce", filter = "top") %>%
formatRound(2:3,digits = 3)