Preventing policy lapse is one of the key drivers of a successful insurance business. Proactively identifying the likelihood of policy lapse helps the business take action before clients churn and terminate their policies.

In this tutorial, we will first explore the policy data before proceeding to the lapse prediction steps.

Policy Data

Summary of the policy data:

Data summary
Name policy
Number of rows 1341
Number of columns 19
_______________________
Column type frequency:
factor 10
numeric 9
________________________
Group variables None

Variable type: factor

skim_variable        n_missing  complete_rate  ordered  n_unique  top_counts                              pct
Lapsed               0          1.00           FALSE    2         Inf: 907, Lap: 434                      Inf: 0.68, Lap: 0.32
PO_Sex               0          1.00           FALSE    2         mal: 709, fem: 632                      mal: 0.53, fem: 0.47
PO_Married           0          1.00           FALSE    2         Mar: 802, Sin: 539                      Mar: 0.60, Sin: 0.40
Occupation           0          1.00           FALSE    4         Grp: 486, Grp: 425, Grp: 263, Grp: 167  Grp: 0.36, Grp: 0.32, Grp: 0.20, Grp: 0.12
Phone_registered     12         0.99           FALSE    2         Yes: 935, No: 394                       Yes: 0.70, No: 0.30
PO_is_INS            0          1.00           FALSE    2         No: 1158, Yes: 183                      No: 0.86, Yes: 0.14
INS_Sex              0          1.00           FALSE    2         mal: 693, fem: 648                      mal: 0.52, fem: 0.48
CoveragePeriod       0          1.00           FALSE    3         5-1: 617, >10: 412, 1-5: 312            5-1: 0.46, >10: 0.31, 1-5: 0.23
PaymentTerm          0          1.00           FALSE    4         Qua: 442, Ann: 421, Sem: 258, Mon: 220  Qua: 0.33, Ann: 0.31, Sem: 0.19, Mon: 0.16
DistributionChannel  0          1.00           FALSE    5         Com: 550, Ban: 401, Cor: 199, Gen: 126  Com: 0.41, Ban: 0.30, Cor: 0.15, Gen: 0.09, Oth: 0.05

Variable type: numeric

skim_variable    n_missing  complete_rate  mean     sd       p0   p25   p50   p75   p100   hist
ID               0          1              671.00   387.26   1    336   671   1006  1341   ▇▇▇▇▇
NumOfReinstated  0          1              0.68     1.08     0    0     0     1     5      ▇▁▁▁▁
NumOfClaims      2          1              0.56     1.06     0    0     0     1     5      ▇▁▁▁▁
NumOfEmails      1          1              1.29     1.04     0    1     1     2     5      ▇▁▁▁▁
NumOfCalls       4          1              1.13     1.13     0    0     1     2     5      ▇▁▁▁▁
PO_Age           0          1              43.31    8.86     22   36    43    51    59     ▂▇▇▇▇
INS_Age          0          1              39.89    13.58    18   28    40    52    64     ▇▆▇▇▆
Premium          0          1              2825.38  2489.13  224  1025  2021  3761  12754  ▇▂▁▁▁
AgentYearSVR     0          1              2.11     1.06     1    1     2     3     6      ▇▂▁▁▁

Overall, we can see:

  • 32% of policies have lapsed. Lapsed is the outcome variable for our study.

  • 53% of policy owners (PO) are male, similar to the 52% of insured persons who are male.

  • In 14% of policies, the PO is also the insured person.

  • PO age ranges from 22 to 59, with an average of 43.3. The age of the insured person ranges from 18 to 64, with an average of about 40.

  • There are some missing values in NumOfClaims (2), NumOfEmails (1), NumOfCalls (4), and Phone_registered (12). These are minor relative to 1,341 rows, so we can ignore them when building models.
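The missing-value and outcome-share checks above can be reproduced with a few lines of base R. This is only a sketch: `policy_toy` and its values are made up for illustration, and the level names `Inforce` and `Lapsed` are assumptions, since the summary only shows the abbreviations `Inf` and `Lap`.

```r
# Toy stand-in for the `policy` data frame (made-up values, assumed level names)
policy_toy <- data.frame(
  Lapsed           = factor(c("Inforce", "Lapsed", "Inforce", "Inforce", "Lapsed")),
  NumOfClaims      = c(0, NA, 1, 0, 2),
  NumOfCalls       = c(1, 0, NA, 2, 1),
  Phone_registered = factor(c("Yes", "No", NA, "Yes", "Yes"))
)

# Number of missing values per column
n_missing <- colSums(is.na(policy_toy))
print(n_missing)

# Share of each outcome class
lapse_share <- prop.table(table(policy_toy$Lapsed))
print(round(lapse_share, 2))

# Number of rows left if incomplete rows were simply dropped
n_complete <- sum(complete.cases(policy_toy))
print(n_complete)
```

On the real data, the same calls on `policy` would show how many rows are lost by dropping incomplete cases before modelling.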

Exploratory Data Analysis (EDA)

For Categorical variables

Gender

Does the likelihood of lapse depend on gender?

  • Policies with a female policy owner (PO) are more likely to lapse, with a lapse rate of 34% compared with 31% for male POs.

  • However, the pattern is reversed when the lapse rate is split by the gender of the insured person.
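A lapse rate by gender like the one quoted above is a row-wise proportion of a two-way table. The sketch below uses made-up data in `policy_toy`; the level names (`female`, `male`, `Inforce`, `Lapsed`) are assumptions based on the abbreviations in the summary.

```r
# Toy stand-in for the `policy` data frame (made-up values, assumed level names)
policy_toy <- data.frame(
  PO_Sex = factor(c("female", "female", "female", "male",
                    "male", "male", "male", "female")),
  Lapsed = factor(c("Lapsed", "Inforce", "Inforce", "Inforce",
                    "Lapsed", "Inforce", "Inforce", "Lapsed"))
)

# Counts of lapse status within each gender
counts <- table(policy_toy$PO_Sex, policy_toy$Lapsed)

# Row-wise proportions: each row sums to 1, so the "Lapsed" column is the lapse rate
rates <- prop.table(counts, margin = 1)
print(round(rates, 2))
```

Replacing `PO_Sex` with `INS_Sex` gives the same view by insured person. On the full data, `chisq.test(counts)` is one way to check whether a gap such as 34% versus 31% is more than noise.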

Occupation

Could the occupation class be correlated with the probability of lapse?

  • Clients in Occupation Group 1 are more likely to lapse than those in Occupation Groups 2, 3, and 4.

Occupation and Gender

Is the higher lapse rate in Occupation Group 1 related to its gender distribution, in which male clients dominate?

  • Females in Occupation Groups 1 and 2 have a higher lapse rate than males, but a lower one in Groups 3 and 4.

  • Since the effect of gender differs by occupation group, we can consider an interaction term between Occupation and PO_Sex.
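To make the interaction idea concrete, the sketch below computes the lapse rate for each Occupation and PO_Sex combination and shows the extra column that an interaction term adds to a model's design matrix. The data are made up, the level names are assumptions, and `Lapsed` is coded as 0/1 here purely for convenience.

```r
# Toy stand-in for the `policy` data frame (made-up values; Lapsed coded 1 = lapsed)
policy_toy <- data.frame(
  Occupation = factor(c("Group 1", "Group 1", "Group 2", "Group 2",
                        "Group 1", "Group 1", "Group 2", "Group 2")),
  PO_Sex     = factor(c("female", "female", "female", "female",
                        "male", "male", "male", "male")),
  Lapsed     = c(1, 1, 0, 1, 0, 1, 0, 0)
)

# Lapse rate for every Occupation x PO_Sex cell
rate_by_group <- tapply(policy_toy$Lapsed,
                        list(policy_toy$Occupation, policy_toy$PO_Sex),
                        mean)
print(rate_by_group)

# `Occupation * PO_Sex` expands to both main effects plus their interaction
design <- model.matrix(~ Occupation * PO_Sex, data = policy_toy)
print(colnames(design))
```

The same formula syntax works in `glm()` and most modelling functions, so a term such as `Occupation * PO_Sex` could be tried when fitting the lapse model.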

Others

What is the lapse rate across the other categorical variables: Distribution Channel, Payment Term, Coverage Period, and PO is Insured?

  • Policies with the monthly payment mode appear to have the highest lapse rate among the four payment terms.

  • Among distribution channels, Other has the highest lapse rate and Bancas the lowest.

  • Policies with a long coverage period are more likely to lapse than others.

  • Policies whose PO is also the insured person appear more likely to lapse than others.

For Numeric variables

Premium

  • The lapse rate appears to increase for policies with a Premium higher than 4,000.
  • Breaking down one more level by Occupation, the lapse rate in Groups 1 and 2 is higher than in the other groups.
  • Besides that, the shape of the Premium distribution, broken down by lapse status, does not differ between male and female.
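One simple way to check the Premium observation numerically is to bin Premium at 4,000 and compare lapse rates per bin. The sketch below uses made-up values, a 0/1 coding of `Lapsed` for convenience, and a hypothetical `PremiumBand` column that is not part of the original data.

```r
# Toy stand-in for the `policy` data frame (made-up values; Lapsed coded 1 = lapsed)
policy_toy <- data.frame(
  Premium = c(800, 1500, 2500, 3900, 4500, 6000, 9000, 12000),
  Lapsed  = c(0, 0, 1, 0, 1, 0, 1, 1)
)

# Split Premium into two bands: (0, 4000] and (4000, Inf)
policy_toy$PremiumBand <- cut(policy_toy$Premium,
                              breaks = c(0, 4000, Inf),
                              labels = c("<=4000", ">4000"))

# Lapse rate within each band
rate_by_band <- tapply(policy_toy$Lapsed, policy_toy$PremiumBand, mean)
print(rate_by_band)
```

Adding Occupation or PO_Sex as a second grouping variable in `tapply()` gives the further breakdowns described above.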

Client Interaction vs Policy Events

  • For policies with fewer than 2 events (reinstatements and/or claims), the lapse rate stays roughly flat even as the number of interactions (NumOfCalls/NumOfEmails) changes.

  • However, for policies with more than 2 events, the lapse rate appears to fall as client interaction (via calls/emails) increases.
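The comparison above can be sketched by summing the event and interaction counts and tabulating the lapse rate for each combination. Everything here is illustrative: the data are made up, `Events`, `Interactions`, and the two band columns are hypothetical helper names, and the cut-offs (more than 2 events, 3 or more interactions) are assumptions chosen only for the example.

```r
# Toy stand-in for the `policy` data frame (made-up values; Lapsed coded 1 = lapsed)
policy_toy <- data.frame(
  NumOfReinstated = c(0, 1, 0, 2, 3, 1, 2, 4),
  NumOfClaims     = c(0, 0, 1, 1, 0, 2, 2, 1),
  NumOfCalls      = c(1, 0, 2, 0, 3, 1, 0, 2),
  NumOfEmails     = c(0, 1, 2, 1, 2, 0, 0, 3),
  Lapsed          = c(0, 1, 0, 1, 0, 1, 1, 0)
)

# Policy events = reinstatements + claims; interactions = calls + emails
policy_toy$Events       <- policy_toy$NumOfReinstated + policy_toy$NumOfClaims
policy_toy$Interactions <- policy_toy$NumOfCalls + policy_toy$NumOfEmails

# Assumed cut-offs, for illustration only
policy_toy$EventBand       <- ifelse(policy_toy$Events > 2, "more than 2", "2 or fewer")
policy_toy$InteractionBand <- ifelse(policy_toy$Interactions >= 3, "high", "low")

# Lapse rate for each event band x interaction band
rates <- tapply(policy_toy$Lapsed,
                list(policy_toy$EventBand, policy_toy$InteractionBand),
                mean)
print(rates)
```

In this toy data the lapse rate drops with more interaction only in the "more than 2" row, which mirrors the pattern described above; the real data would need to be checked the same way.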

Age

Let’s look at the distribution of PO_Age and how it relates to gender.

  • The lapse rate for males is higher than for females between ages 30 and 40, but lower in the other age ranges.

Correlation check

library(dplyr)    # provides %>% and select()
library(ggplot2)
library(GGally)   # provides ggpairs()

# Correlation check for the continuous numeric variables
policy %>%
  select(PO_Age, INS_Age, Premium) %>%
  ggpairs()

# These variables are small integer counts (0 to 5; AgentYearSVR ranges from 1 to 6),
# but it seems they should not be converted to categorical or normalized in pre-processing.
policy %>%
  select(NumOfReinstated, NumOfClaims, NumOfEmails, NumOfCalls, AgentYearSVR) %>%
  ggpairs()

# Same continuous variables, coloured by policy owner gender
policy %>%
  select(PO_Age, INS_Age, Premium, PO_Sex) %>%
  ggpairs(columns = 1:3,
          mapping = ggplot2::aes(colour = PO_Sex))

# Same plot, with PO_Sex itself included as a panel
policy %>%
  select(PO_Age, INS_Age, Premium, PO_Sex) %>%
  ggpairs(mapping = ggplot2::aes(colour = PO_Sex))

Wrapping-Up

After the EDA, the findings and the action plan for the next steps are as follows:

  • There are missing values in NumOfEmails, NumOfCalls, NumOfClaims, and Phone_registered. However, they are few, so we can ignore them when building models.

  • NumOfReinstated, NumOfClaims, NumOfCalls, and NumOfEmails are integer variables that range from 0 to 5. We can keep them as numeric or convert them to categorical.

  • The other numeric variables could be normalized by centering and scaling.

  • Nominal variables should be converted into dummy variables during pre-processing.
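The last two pre-processing steps can be sketched in base R as below. This is not necessarily how the prediction step implements them; it only illustrates centering/scaling and dummy encoding. The data are made up, and the `PaymentTerm` level names are assumptions based on the abbreviations in the summary.

```r
# Toy stand-in for the `policy` data frame (made-up values, assumed level names)
policy_toy <- data.frame(
  Premium     = c(1000, 2000, 3000, 6000),
  PO_Age      = c(30, 40, 50, 40),
  PaymentTerm = factor(c("Annual", "Monthly", "Quarterly", "Monthly"))
)

# Center and scale the numeric columns (mean 0, standard deviation 1)
num_cols <- c("Premium", "PO_Age")
scaled <- scale(policy_toy[num_cols])

# Dummy-encode the nominal variable; the intercept column is dropped,
# so the first level ("Annual") acts as the reference category
dummies <- model.matrix(~ PaymentTerm, data = policy_toy)[, -1, drop = FALSE]

# Combine into one pre-processed data frame
processed <- cbind(as.data.frame(scaled), dummies)
print(processed)
```

In a real pipeline the centering and scaling parameters should be estimated on the training set only and then reused on the test set.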


Next: Policy Lapse Prediction