Preventing policy lapse is one of the key drivers of a successful insurance business. Proactively identifying the likelihood of lapse helps the business act before clients churn and terminate their policies.
In this tutorial, we first explore the policy data before moving on to the lapse prediction steps.
Summary of the policy data:
| Name | policy |
|---|---|
| Number of rows | 1341 |
| Number of columns | 19 |
| Column type frequency: | |
| factor | 10 |
| numeric | 9 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts | pct |
|---|---|---|---|---|---|---|
| Lapsed | 0 | 1.00 | FALSE | 2 | Inf: 907, Lap: 434 | Inf: 0.68, Lap: 0.32 |
| PO_Sex | 0 | 1.00 | FALSE | 2 | mal: 709, fem: 632 | mal: 0.53, fem: 0.47 |
| PO_Married | 0 | 1.00 | FALSE | 2 | Mar: 802, Sin: 539 | Mar: 0.60, Sin: 0.40 |
| Occupation | 0 | 1.00 | FALSE | 4 | Grp: 486, Grp: 425, Grp: 263, Grp: 167 | Grp: 0.36, Grp: 0.32, Grp: 0.20, Grp: 0.12 |
| Phone_registered | 12 | 0.99 | FALSE | 2 | Yes: 935, No: 394 | Yes: 0.70, No: 0.30 |
| PO_is_INS | 0 | 1.00 | FALSE | 2 | No: 1158, Yes: 183 | No: 0.86, Yes: 0.14 |
| INS_Sex | 0 | 1.00 | FALSE | 2 | mal: 693, fem: 648 | mal: 0.52, fem: 0.48 |
| CoveragePeriod | 0 | 1.00 | FALSE | 3 | 5-1: 617, >10: 412, 1-5: 312 | 5-1: 0.46, >10: 0.31, 1-5: 0.23 |
| PaymentTerm | 0 | 1.00 | FALSE | 4 | Qua: 442, Ann: 421, Sem: 258, Mon: 220 | Qua: 0.33, Ann: 0.31, Sem: 0.19, Mon: 0.16 |
| DistributionChannel | 0 | 1.00 | FALSE | 5 | Com: 550, Ban: 401, Cor: 199, Gen: 126 | Com: 0.41, Ban: 0.30, Cor: 0.15, Gen: 0.09, Oth: 0.05 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1 | 671.00 | 387.26 | 1 | 336 | 671 | 1006 | 1341 | ▇▇▇▇▇ |
| NumOfReinstated | 0 | 1 | 0.68 | 1.08 | 0 | 0 | 0 | 1 | 5 | ▇▁▁▁▁ |
| NumOfClaims | 2 | 1 | 0.56 | 1.06 | 0 | 0 | 0 | 1 | 5 | ▇▁▁▁▁ |
| NumOfEmails | 1 | 1 | 1.29 | 1.04 | 0 | 1 | 1 | 2 | 5 | ▇▁▁▁▁ |
| NumOfCalls | 4 | 1 | 1.13 | 1.13 | 0 | 0 | 1 | 2 | 5 | ▇▁▁▁▁ |
| PO_Age | 0 | 1 | 43.31 | 8.86 | 22 | 36 | 43 | 51 | 59 | ▂▇▇▇▇ |
| INS_Age | 0 | 1 | 39.89 | 13.58 | 18 | 28 | 40 | 52 | 64 | ▇▆▇▇▆ |
| Premium | 0 | 1 | 2825.38 | 2489.13 | 224 | 1025 | 2021 | 3761 | 12754 | ▇▂▁▁▁ |
| AgentYearSVR | 0 | 1 | 2.11 | 1.06 | 1 | 1 | 2 | 3 | 6 | ▇▂▁▁▁ |
Overall, we can see:
32% of policies have lapsed. Lapsed is the outcome variable for our study.
53% of policy owners (PO) are male, similar to the 52% of insured persons who are male.
14% of policy owners are also the insured person.
PO age ranges from 22 to 59, with an average of 43.3. Insured age ranges from 18 to 64, with an average of about 40.
There are some missing values in NumOfClaims, NumOfEmails, NumOfCalls, and Phone_registered. However, they are few (at most 12 of 1341 rows in any one column), so we can ignore them when building models.
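The missing-value check above can be reproduced with a few lines of base R. This is a minimal sketch on a small synthetic data frame (`policy_toy` and its values are made up for illustration, not taken from the tutorial's data); dropping incomplete rows with `na.omit()` is one assumed way to "ignore" them.

```r
# Synthetic stand-in for the `policy` data (hypothetical values)
policy_toy <- data.frame(
  NumOfClaims      = c(0, 1, NA, 2, 0, 0),
  NumOfEmails      = c(1, 2, 1, NA, 0, 3),
  NumOfCalls       = c(0, 1, 1, 2, 0, 1),
  Phone_registered = factor(c("Yes", NA, "No", "Yes", "Yes", "No"))
)

# Number of missing values in each column
n_missing <- colSums(is.na(policy_toy))
print(n_missing)

# Share of rows that have at least one missing value
share_incomplete <- mean(!complete.cases(policy_toy))
print(share_incomplete)

# Keep only the complete rows before modelling
policy_complete <- na.omit(policy_toy)
print(nrow(policy_complete))
```

If the share of incomplete rows is small, as it is in the policy data, dropping them loses little information.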
Does the likelihood of lapse depend on gender?
Policies with a female policy owner (PO) are more likely to lapse, with a lapse rate of 34% compared with 31% for male POs.
However, the lapse rate by the insured person's gender shows the opposite pattern.
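A lapse rate by gender like the one quoted above is a row-wise proportion of a two-way table. The sketch below uses base R on synthetic data; the level names `"Inforce"` and `"Lapsed"` are assumptions, since the summary table only shows them abbreviated.

```r
# Synthetic stand-in for the `policy` data (hypothetical values and level names)
policy_toy <- data.frame(
  PO_Sex = factor(c("female", "female", "female", "female",
                    "male", "male", "male", "male", "male", "male")),
  Lapsed = factor(c("Lapsed", "Lapsed", "Inforce", "Inforce",
                    "Lapsed", "Inforce", "Inforce", "Inforce", "Inforce", "Inforce"))
)

# Cross-tabulate gender against policy status
counts <- table(policy_toy$PO_Sex, policy_toy$Lapsed)

# Convert counts to proportions within each gender (rows sum to 1)
rates <- prop.table(counts, margin = 1)

# Lapse rate per gender
lapse_rate <- rates[, "Lapsed"]
print(round(lapse_rate, 2))
```

Replacing `PO_Sex` with `INS_Sex` would give the same comparison for the insured person.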
Could the Occupation class be correlated with the probability of lapse?
Is the higher lapse rate in Occupation Group 1 related to its gender distribution, in which male clients dominate?
Females in occupation groups 1 and 2 have a higher lapse rate than males, but a lower one in groups 3 and 4.
We can therefore consider an interaction between Occupation and PO_Sex.
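One way to examine such an interaction is to tabulate the lapse rate in every Occupation by PO_Sex cell and then include the interaction term in a logistic regression. The following is only a sketch on simulated data: the level names `Grp1` to `Grp4`, the 0/1 coding of `Lapsed`, and the choice of `glm()` are assumptions for illustration, not the tutorial's modelling approach.

```r
set.seed(1)

# Simulated stand-in for the `policy` data: 25 policies in every cell
grid <- expand.grid(
  Occupation = factor(paste0("Grp", 1:4)),   # hypothetical level names
  PO_Sex     = factor(c("female", "male"))
)
policy_toy <- grid[rep(seq_len(nrow(grid)), each = 25), ]
policy_toy$Lapsed <- rbinom(nrow(policy_toy), size = 1, prob = 0.3)  # 1 = lapsed

# Observed lapse rate in each Occupation x PO_Sex cell
cell_rates <- tapply(policy_toy$Lapsed,
                     policy_toy[c("Occupation", "PO_Sex")],
                     mean)
print(round(cell_rates, 2))

# Logistic regression with main effects and their interaction
fit <- glm(Lapsed ~ Occupation * PO_Sex, data = policy_toy, family = binomial())
print(names(coef(fit)))
```

The `Occupation * PO_Sex` formula expands to both main effects plus the `Occupation:PO_Sex` terms, which let the gender effect differ by occupation group.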
What is the lapse rate across the other categorical variables: Distribution Channel, Payment Term, Coverage Period, and PO is Insured?
Policies with the monthly payment mode appear to have the highest lapse rate among the 4 payment types.
Among distribution channels, Other has the highest lapse rate and Bancas the lowest.
Policies with a long coverage period are more likely to lapse than others.
Policies whose PO is also the insured person appear more likely to lapse than others.
For policies with fewer than 2 events (reinstatements and/or claims), the lapse rate stays roughly flat as the number of interactions (NumOfCalls/NumOfEmails) changes.
However, for policies with more than 2 events, the lapse rate appears to fall as client interaction (via calls/emails) increases.
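The comparison of events against interactions can be sketched by summing the counts and tabulating the lapse rate per group. Everything below is an assumption for illustration: the data are synthetic, events are taken as `NumOfReinstated + NumOfClaims`, interactions as `NumOfCalls + NumOfEmails`, policies with exactly 2 events are placed in the lower group, and the low/high interaction cut-off of 1 is arbitrary.

```r
# Synthetic stand-in for the `policy` data (hypothetical values; Lapsed: 1 = lapsed)
policy_toy <- data.frame(
  NumOfReinstated = c(0, 1, 0, 2, 3, 1, 0, 2),
  NumOfClaims     = c(0, 0, 1, 1, 0, 2, 0, 2),
  NumOfCalls      = c(0, 2, 1, 0, 3, 0, 1, 2),
  NumOfEmails     = c(1, 1, 0, 0, 2, 1, 2, 1),
  Lapsed          = c(1, 0, 0, 1, 0, 1, 1, 0)
)

# Total events and total client interactions per policy
policy_toy$Events       <- policy_toy$NumOfReinstated + policy_toy$NumOfClaims
policy_toy$Interactions <- policy_toy$NumOfCalls + policy_toy$NumOfEmails

# Bucket both totals (cut-offs are illustrative assumptions)
policy_toy$EventGroup <- cut(policy_toy$Events,
                             breaks = c(-Inf, 2, Inf),
                             labels = c("<=2", ">2"))
policy_toy$InteractionLevel <- cut(policy_toy$Interactions,
                                   breaks = c(-Inf, 1, Inf),
                                   labels = c("low", "high"))

# Lapse rate for each event group at each interaction level
rates <- tapply(policy_toy$Lapsed,
                policy_toy[c("EventGroup", "InteractionLevel")],
                mean)
print(rates)
```

Reading across a row shows how the lapse rate moves with interaction level within one event group.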
Let’s look at the distributions of PO_Age and the other numeric variables, and how they relate to gender.
library(dplyr)    # provides %>% and select()
library(ggplot2)
library(GGally)

# Correlation check for the continuous numeric variables
policy %>%
  select(PO_Age, INS_Age, Premium) %>%
  ggpairs()

# These variables are small integer counts (values between 0 and 6);
# it seems they should not be converted to categorical or normalized in pre-processing.
policy %>%
  select(NumOfReinstated, NumOfClaims, NumOfEmails, NumOfCalls, AgentYearSVR) %>%
  ggpairs()

# Pairwise plots of the continuous variables, coloured by PO gender
policy %>%
  select(PO_Age, INS_Age, Premium, PO_Sex) %>%
  ggpairs(columns = 1:3,
          ggplot2::aes(colour = PO_Sex))

# Same plot, with PO_Sex also shown as a row and column of the matrix
policy %>%
  select(PO_Age, INS_Age, Premium, PO_Sex) %>%
  ggpairs(ggplot2::aes(colour = PO_Sex))
After the EDA, the findings and the action plan for the next steps are as follows:
There are missing values in NumOfEmails, NumOfCalls, NumOfClaims, and Phone_registered. However, they are few, so we can ignore them when building models.
NumOfReinstated, NumOfClaims, NumOfCalls, and NumOfEmails are integer variables, but they only range from 0 to 5. We can keep them as numeric or convert them to categorical.
The other numeric variables could be normalized by centering and scaling.
Nominal variables should be converted into dummy variables in pre-processing.
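The last two pre-processing steps can be illustrated with base R. This is only a sketch on synthetic data, using `scale()` and `model.matrix()` as stand-ins for whatever pre-processing tool the prediction step actually uses; the level names of `PaymentTerm` are assumed, since the summary table abbreviates them.

```r
# Synthetic stand-in for the `policy` data (hypothetical values and level names)
policy_toy <- data.frame(
  PO_Age      = c(30, 40, 50, 60),
  Premium     = c(1000, 2000, 3000, 4000),
  PaymentTerm = factor(c("Annual", "Monthly", "Quarterly", "Monthly"))
)

# Centre and scale the continuous variables (mean 0, standard deviation 1)
num_cols <- c("PO_Age", "Premium")
scaled <- scale(policy_toy[num_cols])

# Expand the nominal variable into 0/1 dummy columns;
# the first level ("Annual") is the baseline and the intercept column is dropped
dummies <- model.matrix(~ PaymentTerm, data = policy_toy)[, -1, drop = FALSE]

# Combine into one model-ready data frame
prepared <- cbind(as.data.frame(scaled), dummies)
print(prepared)
```

Dropping one level per factor avoids perfectly collinear dummy columns in regression-type models.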
Next: Policy Lapse Prediction