To Lapse, or Not To Lapse… (PART I)

Motivation

A coworker sent me a dataset that consists of 10-year term life insurance lapse experience data from 2000 to 2011. I don't have many details about it. The data was accompanied by a Word document titled “Case Study 1 - Predictive Analytics Assignments.docx”, so I think it's probably one of the toy datasets that the SOA uses in its workshops. Anyway, I figured it would be fun to explore the dataset and train some machine learning algorithms on it.

Data

There are 114,525 rows and 13 variables in the dataset. The variable names are shown below with a summary of each:

data <- read.csv("LapseData2000_2011.csv", sep = ",", header = TRUE)

dim(data)
## [1] 114525     13
summary(data)
##  LAPSE_STUDY_YEAR    RISK_CLASS    SMOKER_CLASS GENDER   
##  Min.   :2000     Non-Pref:52245   NS:79308     F:47412  
##  1st Qu.:2005     Pref    :62280   SM:35217     M:67113  
##  Median :2008                                            
##  Mean   :2007                                            
##  3rd Qu.:2010                                            
##  Max.   :2011                                            
##                                                          
##         PREMIUM_MODE         ISSUE_YEAR_GROUP      FACE_AMOUNT_BAND
##  1. Annual    :28934   1990 & Earlier: 2299   A.  < 100k   :12314  
##  2. Semiannual:19227   1991-1993     : 9597   B.  100k-249k:46064  
##  3. Quarterly :33224   1994-1996     :19547   C.  250k-999k:39845  
##  4. Monthly   :32984   1997-1999     :29041   D.  1M +     :16302  
##  5. Biweekly  :  156   2000-2002     :33737                        
##                        2003-2004     :14297                        
##                        2005-2007     : 6007                        
##  ISSUE_AGE_GROUP   DURATION           PREM_JUMP_D11_D10 PREM_JUMP_CONT 
##  0-19 : 1538     09-Jun:67005   D.  4.01 - 5.00:13719   Min.   : 1.50  
##  20-29:13932     10    :20468   E.  5.01 - 6.00:13219   1st Qu.: 3.50  
##  30-39:30607     11    : 9629   B.  2.01 - 3.00:13043   Median : 5.50  
##  40-49:28743     12    : 5855   C.  3.01 - 4.00:12383   Mean   : 6.53  
##  50-59:22463     13+   :11568   F.  6.01 - 7.00:12122   3rd Qu.: 8.50  
##  60-69:13556                    G.  7.01 - 8.00:10171   Max.   :24.50  
##  70+  : 3686                    (Other)        :39868                  
##   EXPOSURE_CNT      LAPSE_CNT     
##  Min.   :   0.0   Min.   :  0.00  
##  1st Qu.:   1.0   1st Qu.:  0.00  
##  Median :   3.0   Median :  0.00  
##  Mean   :  11.2   Mean   :  1.85  
##  3rd Qu.:   9.0   3rd Qu.:  1.00  
##  Max.   :1117.0   Max.   :257.00  
## 

Exploratory Data Analysis

First, I focus on the duration 10 lapse experience, dropping rows with zero exposure and the banded premium jump variable (PREM_JUMP_D11_D10). I also set the variable names to lower case so that they are easier to type.

data <- subset(data, DURATION == 10 & EXPOSURE_CNT != 0, select = -PREM_JUMP_D11_D10)
names(data) <- tolower(names(data))
dim(data)
## [1] 20442    12
head(data)
##    lapse_study_year risk_class smoker_class gender premium_mode
## 19             2000   Non-Pref           NS      F    1. Annual
## 23             2000   Non-Pref           NS      F    1. Annual
## 24             2000   Non-Pref           NS      F    1. Annual
## 30             2000   Non-Pref           NS      F    1. Annual
## 31             2000   Non-Pref           NS      F    1. Annual
## 32             2000   Non-Pref           NS      F    1. Annual
##    issue_year_group face_amount_band issue_age_group duration
## 19        1991-1993    B.  100k-249k           20-29       10
## 23        1991-1993    B.  100k-249k           30-39       10
## 24        1991-1993    B.  100k-249k           30-39       10
## 30        1991-1993    B.  100k-249k           40-49       10
## 31        1991-1993    B.  100k-249k           40-49       10
## 32        1991-1993    B.  100k-249k           40-49       10
##    prem_jump_cont exposure_cnt lapse_cnt
## 19            3.5            1         0
## 23            3.5            2         2
## 24            4.5            4         3
## 30            4.5            4         1
## 31            5.5            5         2
## 32            6.5            2         1

I then examine the lapse rate (total lapses divided by total exposure) against each variable, along with the average premium jump:

Study Years

a <- aggregate(lapse_cnt ~ lapse_study_year, data = data, FUN = sum)
b <- aggregate(exposure_cnt ~ lapse_study_year, data = data, FUN = sum)
ab <- a[, 2]/b[, 2]
lapse <- data.frame(year = a[, 1], D10_lapse = ab)
lapse
##    year D10_lapse
## 1  2000    0.6449
## 2  2001    0.7413
## 3  2002    0.6436
## 4  2003    0.5932
## 5  2004    0.5780
## 6  2005    0.5340
## 7  2006    0.5887
## 8  2007    0.6112
## 9  2008    0.6851
## 10 2009    0.7602
## 11 2010    0.7778
## 12 2011    0.7584
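The same four lines (sum the lapses, sum the exposure, divide, build a data frame) repeat in every section below, so a small helper could wrap them. This is only a minimal sketch and not part of the original analysis: `lapse_rate_by` is a name I made up, and the toy rows are invented purely to show the call.

```r
# Hypothetical helper: lapse rate per group = total lapses / total exposure,
# mirroring the aggregate() calls used throughout this post.
lapse_rate_by <- function(df, group) {
  sums <- aggregate(df[c("lapse_cnt", "exposure_cnt")],
                    by = df[group], FUN = sum)
  sums$D10_lapse <- sums$lapse_cnt / sums$exposure_cnt
  sums
}

# Toy rows with made-up counts (NOT taken from the dataset).
toy <- data.frame(
  gender       = c("F", "F", "M", "M"),
  lapse_cnt    = c(1, 3, 2, 6),
  exposure_cnt = c(4, 4, 5, 5)
)
res <- lapse_rate_by(toy, "gender")
print(res)
```

On the real data the call would look like `lapse_rate_by(data, "lapse_study_year")`, and `group` can also be a vector of column names such as `c("risk_class", "smoker_class")`. Keeping the exposure column in the result makes it easy to see how much data sits behind each rate.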

require(ggplot2)
## Loading required package: ggplot2
g <- ggplot(lapse, aes(year))
g + geom_line(aes(y = D10_lapse), color = "red")

[Plot: duration 10 lapse rate by study year]

An interesting pattern emerges. The duration 10 lapse rate decreases steadily from 0.74 in study year 2001 to 0.53 in 2005, then climbs back to about 0.76-0.78 by 2009-2011. I look at premium jumps for an explanation:

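# Average premium jump per study year, plotted from its own data frame
# (avoids naming a variable `c`, which shadows base::c, and using `$` inside aes())
prem_jump <- aggregate(prem_jump_cont ~ lapse_study_year, data = data, FUN = mean)
ggplot(prem_jump, aes(lapse_study_year, prem_jump_cont)) + geom_line(color = "green")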

[Plot: average premium jump by study year]

Starting from 2004, the average premium jump increased from 5.6 to 8.4. One possible explanation, therefore, is that policyholders were seeing much larger premium increases on their bills and responded by lapsing their policies. This raises the question: are the insurance companies being too conservative? Is there an efficient way to retain some of these policyholders and improve profitability while keeping the anti-selection risk in check?

Premium Jumps

a <- aggregate(lapse_cnt ~ prem_jump_cont, data = data, FUN = sum)
b <- aggregate(exposure_cnt ~ prem_jump_cont, data = data, FUN = sum)
ab <- a[, 2]/b[, 2]
lapse <- data.frame(prem_jump = a[, 1], D10_lapse = ab)

g <- ggplot(lapse, aes(prem_jump))
g + geom_smooth(aes(y = D10_lapse), method = "loess")

[Plot: smoothed duration 10 lapse rate by premium jump]

The line above is smoothed to show the trend: as the premium jump increases, the lapse rate increases too, but at a decreasing rate, roughly logarithmically (i.e., like y = log(x)).
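One way to check the logarithmic shape, rather than eyeballing the smoother, would be to regress the lapse rate on log(premium jump) and compare it with a straight-line fit. The sketch below assumes exposure-weighted least squares is an acceptable rough check, and it runs on made-up points (a log curve plus small wiggles), not on the dataset.

```r
# Made-up aggregated points for illustration only (NOT the dataset's values):
# a log curve with small deterministic wiggles added.
toy <- data.frame(
  prem_jump    = c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5),
  exposure_cnt = c(50, 120, 200, 260, 240, 180, 110, 60)
)
wiggle <- c(0.01, -0.01, 0.005, -0.005, 0.01, -0.01, 0.005, -0.005)
toy$D10_lapse <- 0.30 + 0.20 * log(toy$prem_jump) + wiggle

# Exposure-weighted fits: lapse rate against log(jump) and against jump itself.
fit_log <- lm(D10_lapse ~ log(prem_jump), data = toy, weights = exposure_cnt)
fit_lin <- lm(D10_lapse ~ prem_jump,      data = toy, weights = exposure_cnt)

# Compare how well each shape describes the points.
r2 <- c(log = summary(fit_log)$r.squared, linear = summary(fit_lin)$r.squared)
print(coef(fit_log))
print(r2)
```

On the real data, the exposure per premium jump level would have to be carried along from the `b` aggregate before fitting. If the log fit does not clearly beat the linear one, "logarithmic" is probably too strong a description.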

Risk Class and Smoker Class

a <- aggregate(lapse_cnt ~ risk_class + smoker_class, data = data, FUN = sum)
b <- aggregate(exposure_cnt ~ risk_class + smoker_class, data = data, FUN = sum)
ab <- a[, 3]/b[, 3]
lapse <- data.frame(class = c("NonPref NS", "Pref NS", "NonPref Sm", "Pref Sm"), 
    D10_lapse = ab)
lapse
##        class D10_lapse
## 1 NonPref NS    0.7369
## 2    Pref NS    0.5975
## 3 NonPref Sm    0.7177
## 4    Pref Sm    0.7528
aggregate(prem_jump_cont ~ risk_class + smoker_class, data = data, FUN = mean)
##   risk_class smoker_class prem_jump_cont
## 1   Non-Pref           NS          7.636
## 2       Pref           NS          6.325
## 3   Non-Pref           SM          4.728
## 4       Pref           SM          6.399

According to this result, preferred non-smokers have the lowest lapse rate (0.60). These are the healthiest policyholders, and their premiums are the lowest of all, so I would expect their premiums to jump the most to reach the same level as everyone else's. Looking at premium jumps by class, however, the average jump for Pref NS is 6.32, slightly lower than the 6.40 for preferred smokers. I also cannot wrap my head around why non-preferred non-smokers have the largest average jump (7.64) and non-preferred smokers the smallest (4.73).
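One caveat when reading these averages: `aggregate(..., FUN = mean)` takes a plain mean over rows, while the lapse rates are weighted by exposure. A row with 1 policy counts as much as a row with 100, which could explain part of the puzzle. Below is a minimal sketch of an exposure-weighted average on invented numbers; the column `jump_x_exposure` is a helper I introduce just for this example.

```r
# Toy rows with made-up values (NOT taken from the dataset).
toy <- data.frame(
  risk_class     = c("Pref", "Pref", "Non-Pref", "Non-Pref"),
  prem_jump_cont = c(3.5, 7.5, 4.5, 6.5),
  exposure_cnt   = c(30, 10, 5, 15)
)

# Plain row mean, which is what aggregate(..., FUN = mean) computes.
unweighted <- aggregate(prem_jump_cont ~ risk_class, data = toy, FUN = mean)

# Exposure-weighted mean: sum(jump * exposure) / sum(exposure) per group.
toy$jump_x_exposure <- toy$prem_jump_cont * toy$exposure_cnt
sums <- aggregate(cbind(jump_x_exposure, exposure_cnt) ~ risk_class,
                  data = toy, FUN = sum)
sums$weighted_jump <- sums$jump_x_exposure / sums$exposure_cnt

print(unweighted)
print(sums[c("risk_class", "weighted_jump")])
```

In this toy case both groups have the same row mean (5.5), but the weighted means differ (6.0 for Non-Pref, 4.5 for Pref), so the two views can tell different stories.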

Premium Mode

a <- aggregate(lapse_cnt ~ premium_mode, data = data, FUN = sum)
b <- aggregate(exposure_cnt ~ premium_mode, data = data, FUN = sum)
ab <- a[, 2]/b[, 2]
lapse <- data.frame(mode = a[, 1], D10_lapse = ab)
lapse
##            mode D10_lapse
## 1     1. Annual    0.8325
## 2 2. Semiannual    0.6996
## 3  3. Quarterly    0.7659
## 4    4. Monthly    0.5354
## 5   5. Biweekly    0.0000
aggregate(prem_jump_cont ~ premium_mode, data = data, FUN = mean)
##    premium_mode prem_jump_cont
## 1     1. Annual          7.274
## 2 2. Semiannual          6.079
## 3  3. Quarterly          6.789
## 4    4. Monthly          6.236
## 5   5. Biweekly          4.000

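The tables above suggest that more frequent payment modes tend to have lower lapse rates: annual payers lapse the most (0.83) and monthly payers the least (0.54), although quarterly (0.77) sits above semiannual (0.70). The 0.00 rate for biweekly should not be read into, since that mode has only 156 rows in the entire dataset and presumably very little exposure at duration 10. I think the overall pattern arises because a large dollar increase in the premium feels more “painful” to annual and quarterly payers than smaller but more frequent increases. Also, some policyholders might have set up automatic monthly payments along with their other bills. These people probably will not monitor their policies as closely as someone who needs to write a check to the insurance company every time.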
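To get a feel for how much a rate based on very little exposure can be trusted, an exact binomial confidence interval is a quick check. This is a sketch under two assumptions: that exposure can be treated as a whole number of policies, and that lapses are independent. The counts are hypothetical, since the post does not show the exposure behind each mode.

```r
# Hypothetical counts for illustration only (NOT taken from the dataset).
# A cell with 0 lapses out of 5 policies versus 535 lapses out of 1000.
small <- binom.test(x = 0,   n = 5)$conf.int
large <- binom.test(x = 535, n = 1000)$conf.int

# 95% Clopper-Pearson intervals for the underlying lapse probability.
print(small)
print(large)
```

The interval for the small cell runs from 0 to above 0.5, so an observed rate of zero there says very little, while the interval for the large cell is only a few points wide.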

Gender

a <- aggregate(lapse_cnt ~ gender, data = data, FUN = sum)
b <- aggregate(exposure_cnt ~ gender, data = data, FUN = sum)
ab <- a[, 2]/b[, 2]
lapse <- data.frame(gender = a[, 1], D10_lapse = ab)
lapse
##   gender D10_lapse
## 1      F    0.6445
## 2      M    0.7155
aggregate(prem_jump_cont ~ gender, data = data, FUN = mean)
##   gender prem_jump_cont
## 1      F          6.302
## 2      M          6.882

Male policyholders have a slightly higher lapse rate than female policyholders (0.72 vs. 0.64). This might be connected to the well-known finding that men trade stocks more actively than women, or it could be because men generally purchase more insurance than women. Men also see a somewhat larger average premium jump (6.88 vs. 6.30), and, as the next section shows, premium jumps rise with face amount and the largest face amount band has the highest lapse rate.

Face Amount Bands

a <- aggregate(lapse_cnt ~ face_amount_band, data = data, FUN = sum)
b <- aggregate(exposure_cnt ~ face_amount_band, data = data, FUN = sum)
ab <- a[, 2]/b[, 2]
lapse <- data.frame(face_amount_band = a[, 1], D10_lapse = ab)
lapse
##   face_amount_band D10_lapse
## 1       A.  < 100k    0.6918
## 2    B.  100k-249k    0.7083
## 3    C.  250k-999k    0.6748
## 4         D.  1M +    0.7311
aggregate(prem_jump_cont ~ face_amount_band, data = data, FUN = mean)
##   face_amount_band prem_jump_cont
## 1       A.  < 100k          5.692
## 2    B.  100k-249k          6.407
## 3    C.  250k-999k          6.760
## 4         D.  1M +          7.795

Policies with a face amount of $1 million or more have the highest lapse rate (0.73); the three lower bands fall between 0.67 and 0.71 with no clear ordering. The average premium jump, on the other hand, increases steadily with face amount.

Issue Age

a <- aggregate(lapse_cnt ~ issue_age_group, data = data, FUN = sum)
b <- aggregate(exposure_cnt ~ issue_age_group, data = data, FUN = sum)
ab <- a[, 2]/b[, 2]
lapse <- data.frame(issue_age_group = a[, 1], D10_lapse = ab)
lapse
##   issue_age_group D10_lapse
## 1            0-19    0.4495
## 2           20-29    0.5099
## 3           30-39    0.6020
## 4           40-49    0.7059
## 5           50-59    0.8217
## 6           60-69    0.8506
## 7             70+    0.8712
aggregate(prem_jump_cont ~ issue_age_group, data = data, FUN = mean)
##   issue_age_group prem_jump_cont
## 1            0-19          2.662
## 2           20-29          4.162
## 3           30-39          6.224
## 4           40-49          6.408
## 5           50-59          7.303
## 6           60-69          8.546
## 7             70+         10.236

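The older the policyholders are at issue, the more likely they are to lapse their policies at the end of the term period. The average premium jump rises with issue age as well, from 2.66 for ages 0-19 to 10.24 for ages 70+. I can think of two possible explanations here.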

Issue Year

a <- aggregate(lapse_cnt ~ issue_year_group, data = data, FUN = sum)
b <- aggregate(exposure_cnt ~ issue_year_group, data = data, FUN = sum)
ab <- a[, 2]/b[, 2]
lapse <- data.frame(issue_year_group = a[, 1], D10_lapse = ab)
lapse
##   issue_year_group D10_lapse
## 1        1991-1993    0.6791
## 2        1994-1996    0.5661
## 3        1997-1999    0.6396
## 4        2000-2002    0.7662
aggregate(prem_jump_cont ~ issue_year_group, data = data, FUN = mean)
##   issue_year_group prem_jump_cont
## 1        1991-1993          5.863
## 2        1994-1996          5.615
## 3        1997-1999          6.043
## 4        2000-2002          7.664

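This is consistent with what we observed in the study year results: policies issued in 1994-1996 reach duration 10 around study years 2003-2006, where lapse rates were lowest, while policies issued in 2000-2002 reach it around 2009-2011, where both lapse rates and premium jumps were highest.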
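The link between issue year and study year can be made explicit with a cross-tabulation of exposure. Below is a minimal sketch using `xtabs` on invented rows; on the real data the same formula would be applied to `data`.

```r
# Toy rows with made-up exposure (NOT taken from the dataset).
toy <- data.frame(
  issue_year_group = c("1991-1993", "1991-1993", "1994-1996", "1994-1996"),
  lapse_study_year = c(2001, 2002, 2004, 2005),
  exposure_cnt     = c(10, 20, 30, 40)
)

# Total exposure in each (issue year group, study year) cell.
tab <- xtabs(exposure_cnt ~ issue_year_group + lapse_study_year, data = toy)
print(tab)
```

If each issue year group only has exposure in a narrow range of study years, then the issue year table and the study year table are largely two views of the same effect.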

Conclusion

Some of the observations above are expected, while others are surprising. For reference, I compared the findings above to this SOA study. For the most part, I think the findings are consistent.