A coworker sent me a dataset of 10-year term life insurance lapse experience covering study years 2000 to 2011. I don't have many details about it. The data came with a Word document titled “Case Study 1 - Predictive Analytics Assignments.docx”, so I think it's probably one of the toy datasets that the SOA uses in its workshops. Anyway, I figured it would be fun to explore the dataset and train some machine learning algorithms on it.
There are 114,525 rows and 13 variables in the dataset. The variable names are shown below with a summary of each:
data <- read.csv("LapseData2000_2011.csv", sep = ",", header = TRUE)
dim(data)
## [1] 114525 13
summary(data)
## LAPSE_STUDY_YEAR RISK_CLASS SMOKER_CLASS GENDER
## Min. :2000 Non-Pref:52245 NS:79308 F:47412
## 1st Qu.:2005 Pref :62280 SM:35217 M:67113
## Median :2008
## Mean :2007
## 3rd Qu.:2010
## Max. :2011
##
## PREMIUM_MODE ISSUE_YEAR_GROUP FACE_AMOUNT_BAND
## 1. Annual :28934 1990 & Earlier: 2299 A. < 100k :12314
## 2. Semiannual:19227 1991-1993 : 9597 B. 100k-249k:46064
## 3. Quarterly :33224 1994-1996 :19547 C. 250k-999k:39845
## 4. Monthly :32984 1997-1999 :29041 D. 1M + :16302
## 5. Biweekly : 156 2000-2002 :33737
## 2003-2004 :14297
## 2005-2007 : 6007
## ISSUE_AGE_GROUP DURATION PREM_JUMP_D11_D10 PREM_JUMP_CONT
## 0-19 : 1538 09-Jun:67005 D. 4.01 - 5.00:13719 Min. : 1.50
## 20-29:13932 10 :20468 E. 5.01 - 6.00:13219 1st Qu.: 3.50
## 30-39:30607 11 : 9629 B. 2.01 - 3.00:13043 Median : 5.50
## 40-49:28743 12 : 5855 C. 3.01 - 4.00:12383 Mean : 6.53
## 50-59:22463 13+ :11568 F. 6.01 - 7.00:12122 3rd Qu.: 8.50
## 60-69:13556 G. 7.01 - 8.00:10171 Max. :24.50
## 70+ : 3686 (Other) :39868
## EXPOSURE_CNT LAPSE_CNT
## Min. : 0.0 Min. : 0.00
## 1st Qu.: 1.0 1st Qu.: 0.00
## Median : 3.0 Median : 0.00
## Mean : 11.2 Mean : 1.85
## 3rd Qu.: 9.0 3rd Qu.: 1.00
## Max. :1117.0 Max. :257.00
##
First, I focus on duration 10 lapse experience, dropping rows with zero exposure and the PREM_JUMP_D11_D10 column. I also set the names to lower case so that they're easier to type.
data <- subset(data, DURATION == 10 & EXPOSURE_CNT != 0, select = -PREM_JUMP_D11_D10)
names(data) <- tolower(names(data))
dim(data)
## [1] 20442 12
head(data)
## lapse_study_year risk_class smoker_class gender premium_mode
## 19 2000 Non-Pref NS F 1. Annual
## 23 2000 Non-Pref NS F 1. Annual
## 24 2000 Non-Pref NS F 1. Annual
## 30 2000 Non-Pref NS F 1. Annual
## 31 2000 Non-Pref NS F 1. Annual
## 32 2000 Non-Pref NS F 1. Annual
## issue_year_group face_amount_band issue_age_group duration
## 19 1991-1993 B. 100k-249k 20-29 10
## 23 1991-1993 B. 100k-249k 30-39 10
## 24 1991-1993 B. 100k-249k 30-39 10
## 30 1991-1993 B. 100k-249k 40-49 10
## 31 1991-1993 B. 100k-249k 40-49 10
## 32 1991-1993 B. 100k-249k 40-49 10
## prem_jump_cont exposure_cnt lapse_cnt
## 19 3.5 1 0
## 23 3.5 2 2
## 24 4.5 4 3
## 30 4.5 4 1
## 31 5.5 5 2
## 32 6.5 2 1
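Each row here is an aggregated cell rather than a single policy, so a lapse rate has to be total lapses over total exposure, not the average of row-level ratios. The breakdowns that follow all repeat that calculation, so it could be wrapped in a small helper. This is only a sketch: `lapse_rate_by` is a name I made up, and the `toy` rows are invented for illustration.

```r
# Hypothetical helper: exposure-weighted lapse rate by one or more grouping columns.
lapse_rate_by <- function(df, by) {
  # Sum lapses and exposure within each group in a single aggregate() call.
  totals <- aggregate(df[, c("lapse_cnt", "exposure_cnt")],
                      by = df[, by, drop = FALSE], FUN = sum)
  totals$lapse_rate <- totals$lapse_cnt / totals$exposure_cnt
  totals
}

# Made-up rows in the same shape as the filtered data, for illustration only.
toy <- data.frame(
  gender       = c("F", "F", "M", "M"),
  exposure_cnt = c(1, 9, 4, 6),
  lapse_cnt    = c(1, 0, 2, 3)
)

lapse_rate_by(toy, "gender")
```

For the toy "F" rows, the pooled rate is 1/10, while averaging the two row-level ratios (1/1 and 0/9) would give 0.5, which is why the sums are taken first.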
I then examine the lapse rate against each variable:
a <- aggregate(lapse_cnt ~ lapse_study_year, data = data, FUN = sum)
b <- aggregate(exposure_cnt ~ lapse_study_year, data = data, FUN = sum)
ab <- a[, 2]/b[, 2]
lapse <- data.frame(year = a[, 1], D10_lapse = ab)
lapse
## year D10_lapse
## 1 2000 0.6449
## 2 2001 0.7413
## 3 2002 0.6436
## 4 2003 0.5932
## 5 2004 0.5780
## 6 2005 0.5340
## 7 2006 0.5887
## 8 2007 0.6112
## 9 2008 0.6851
## 10 2009 0.7602
## 11 2010 0.7778
## 12 2011 0.7584
require(ggplot2)
## Loading required package: ggplot2
g <- ggplot(lapse, aes(year))
g + geom_line(aes(y = D10_lapse), color = "red")
An interesting pattern emerges. The duration 10 lapse rate decreases steadily from 0.74 in study year 2001 to 0.53 in 2005, and then bounces back to between 0.76 and 0.78 in 2009 through 2011. I look at premium jumps for an explanation:
# Average premium jump by study year (unweighted mean over rows)
jump <- aggregate(prem_jump_cont ~ lapse_study_year, data = data, FUN = mean)
ggplot(jump, aes(lapse_study_year, prem_jump_cont)) + geom_line(color = "green")
Starting from 2004, the average premium jump increased from 5.6 to 8.4. One possible explanation, then, is that policyholders are seeing much larger premium increases on their bills and are responding by lapsing their policies. This raises the question: are the insurance companies being too conservative? Is there an efficient way to retain some of these policyholders and improve profitability while keeping the anti-selection risk in check?
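One caveat on the averages: `aggregate(..., FUN = mean)` gives every row the same weight, whether the cell holds one policy or a thousand. An exposure-weighted mean is a possible alternative. The sketch below uses invented numbers only to show how far the two can differ; it is not a result from the real data.

```r
# Made-up cells for illustration: 2004 has one thinly exposed cell with a large jump.
toy <- data.frame(
  lapse_study_year = c(2004, 2004, 2005, 2005),
  prem_jump_cont   = c(3.5, 9.5, 4.5, 8.5),
  exposure_cnt     = c(9, 1, 5, 5)
)

# Unweighted: every cell counts equally (what aggregate(..., FUN = mean) does).
unweighted <- aggregate(prem_jump_cont ~ lapse_study_year, data = toy, FUN = mean)

# Exposure-weighted: split by year, then weight each cell by its exposure.
weighted <- sapply(split(toy, toy$lapse_study_year),
                   function(d) weighted.mean(d$prem_jump_cont, d$exposure_cnt))

unweighted
weighted
```

In the toy data both years have an unweighted mean of 6.5, but the weighted mean for 2004 drops to 4.1 because most of the exposure sits in the low-jump cell.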
a <- aggregate(lapse_cnt ~ prem_jump_cont, data = data, FUN = sum)
b <- aggregate(exposure_cnt ~ prem_jump_cont, data = data, FUN = sum)
ab <- a[, 2]/b[, 2]
lapse <- data.frame(prem_jump = a[, 1], D10_lapse = ab)
g <- ggplot(lapse, aes(prem_jump))
g + geom_smooth(aes(y = D10_lapse), method = "loess")
The line above is smoothed to show the trend: as the premium jump increases, the lapse rate rises roughly logarithmically (i.e., y = log(x)), climbing quickly at first and then flattening.
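The logarithmic shape is read off the smoothed plot by eye. One way to check it would be to fit a line in log(prem_jump) and compare it with a straight line in prem_jump. The sketch below does that on invented points, since I am only illustrating the mechanics; the `exposure_cnt` weights are an assumption about how one might down-weight thin jump levels.

```r
# Made-up aggregated points that follow a log curve plus small noise, for illustration only.
toy <- data.frame(prem_jump = c(1.5, 2.5, 4.5, 8.5, 16.5))
toy$D10_lapse    <- 0.40 + 0.15 * log(toy$prem_jump) +
                    c(0.01, -0.01, 0.005, -0.005, 0.01)
toy$exposure_cnt <- c(50, 200, 400, 150, 20)

# Linear in log(prem_jump); weights give thinly exposed jump levels less influence.
fit_log <- lm(D10_lapse ~ log(prem_jump), data = toy, weights = exposure_cnt)
# Straight-line alternative on the original scale, for comparison.
fit_lin <- lm(D10_lapse ~ prem_jump, data = toy, weights = exposure_cnt)

coef(fit_log)
# Lower AIC suggests the better description of the curve.
AIC(fit_log, fit_lin)
```

On the real aggregated table the same two calls would take `lapse` as the data, with the summed exposure per jump level added as a column.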
a <- aggregate(lapse_cnt ~ risk_class + smoker_class, data = data, FUN = sum)
b <- aggregate(exposure_cnt ~ risk_class + smoker_class, data = data, FUN = sum)
ab <- a[, 3]/b[, 3]
lapse <- data.frame(class = c("NonPref NS", "Pref NS", "NonPref Sm", "Pref Sm"),
D10_lapse = ab)
lapse
## class D10_lapse
## 1 NonPref NS 0.7369
## 2 Pref NS 0.5975
## 3 NonPref Sm 0.7177
## 4 Pref Sm 0.7528
aggregate(prem_jump_cont ~ risk_class + smoker_class, data = data, FUN = mean)
## risk_class smoker_class prem_jump_cont
## 1 Non-Pref NS 7.636
## 2 Pref NS 6.325
## 3 Non-Pref SM 4.728
## 4 Pref SM 6.399
According to this result, preferred non-smokers have the lowest lapse rate. These are the healthy non-smokers, and their premiums are the lowest among all policyholders, so I would expect their premiums to jump the most to get to the same level as the others. Looking at premium jumps by class, however, the average jump for Pref NS is 6.32, slightly lower than for Pref SM (6.40). I also cannot wrap my head around why non-preferred non-smoker premiums jump the most (7.64) and non-preferred smoker premiums jump the least (4.73).
a <- aggregate(lapse_cnt ~ premium_mode, data = data, FUN = sum)
b <- aggregate(exposure_cnt ~ premium_mode, data = data, FUN = sum)
ab <- a[, 2]/b[, 2]
lapse <- data.frame(mode = a[, 1], D10_lapse = ab)
lapse
## mode D10_lapse
## 1 1. Annual 0.8325
## 2 2. Semiannual 0.6996
## 3 3. Quarterly 0.7659
## 4 4. Monthly 0.5354
## 5 5. Biweekly 0.0000
aggregate(prem_jump_cont ~ premium_mode, data = data, FUN = mean)
## premium_mode prem_jump_cont
## 1 1. Annual 7.274
## 2 2. Semiannual 6.079
## 3 3. Quarterly 6.789
## 4 4. Monthly 6.236
## 5 5. Biweekly 4.000
The tables above show that monthly payers lapse much less often (0.54) than annual payers (0.83), although the pattern is not strictly monotonic: quarterly payers (0.77) lapse more than semiannual payers (0.70), and the 0.0000 rate for biweekly payers rests on very little data (only 156 biweekly rows in the full dataset). I think the overall pattern arises because a large dollar increase in an annual or quarterly premium feels more “painful” than smaller but more frequent increases. Also, some policyholders might have set up automatic monthly payments along with their other bills. These people probably do not monitor their policies as closely as someone who needs to write a check to the insurance company every time.
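A lapse rate of exactly zero, like the one in the Biweekly row, is a hint to look at how much exposure sits behind each rate. A possible sketch is to print exposure next to the rate and flag thin groups. The rows below are invented, and the `min_exposure` threshold of 30 is an arbitrary assumption, not an actuarial credibility standard.

```r
# Made-up rows for illustration only.
toy <- data.frame(
  premium_mode = c("1. Annual", "1. Annual", "4. Monthly", "4. Monthly", "5. Biweekly"),
  exposure_cnt = c(40, 60, 120, 80, 2),
  lapse_cnt    = c(35, 48, 70, 40, 0)
)

# Sum both counts per mode in one call, then derive the rate.
by_mode <- aggregate(cbind(lapse_cnt, exposure_cnt) ~ premium_mode, data = toy, FUN = sum)
by_mode$lapse_rate <- by_mode$lapse_cnt / by_mode$exposure_cnt

# Flag groups whose total exposure falls below the assumed threshold.
min_exposure <- 30
by_mode$thin <- by_mode$exposure_cnt < min_exposure

by_mode
```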
a <- aggregate(lapse_cnt ~ gender, data = data, FUN = sum)
b <- aggregate(exposure_cnt ~ gender, data = data, FUN = sum)
ab <- a[, 2]/b[, 2]
lapse <- data.frame(gender = a[, 1], D10_lapse = ab)
lapse
## gender D10_lapse
## 1 F 0.6445
## 2 M 0.7155
aggregate(prem_jump_cont ~ gender, data = data, FUN = mean)
## gender prem_jump_cont
## 1 F 6.302
## 2 M 6.882
Male policyholders have a slightly higher lapse rate than female policyholders. This might be connected to the well-known phenomenon that men are more active stock traders than women, or it could be because men generally purchase more insurance than women. As we will see below, a higher face amount is generally associated with higher premium jumps and higher lapses.
a <- aggregate(lapse_cnt ~ face_amount_band, data = data, FUN = sum)
b <- aggregate(exposure_cnt ~ face_amount_band, data = data, FUN = sum)
ab <- a[, 2]/b[, 2]
lapse <- data.frame(face_amount_band = a[, 1], D10_lapse = ab)
lapse
## face_amount_band D10_lapse
## 1 A. < 100k 0.6918
## 2 B. 100k-249k 0.7083
## 3 C. 250k-999k 0.6748
## 4 D. 1M + 0.7311
aggregate(prem_jump_cont ~ face_amount_band, data = data, FUN = mean)
## face_amount_band prem_jump_cont
## 1 A. < 100k 5.692
## 2 B. 100k-249k 6.407
## 3 C. 250k-999k 6.760
## 4 D. 1M + 7.795
Policies with a face amount of $1 million or more have the highest lapse rate (0.73), although lapse rates do not rise uniformly across the lower bands. The premium jumps increase steadily as the face amount increases.
a <- aggregate(lapse_cnt ~ issue_age_group, data = data, FUN = sum)
b <- aggregate(exposure_cnt ~ issue_age_group, data = data, FUN = sum)
ab <- a[, 2]/b[, 2]
lapse <- data.frame(issue_age_group = a[, 1], D10_lapse = ab)
lapse
## issue_age_group D10_lapse
## 1 0-19 0.4495
## 2 20-29 0.5099
## 3 30-39 0.6020
## 4 40-49 0.7059
## 5 50-59 0.8217
## 6 60-69 0.8506
## 7 70+ 0.8712
aggregate(prem_jump_cont ~ issue_age_group, data = data, FUN = mean)
## issue_age_group prem_jump_cont
## 1 0-19 2.662
## 2 20-29 4.162
## 3 30-39 6.224
## 4 40-49 6.408
## 5 50-59 7.303
## 6 60-69 8.546
## 7 70+ 10.236
The older the policyholders are at issue, the more likely they are to lapse their policies at the end of their term period. I can think of two possible explanations here.
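Issue age and premium jump rise together in the two tables above, so the one-way views cannot tell which of them drives the lapses. One possible way to separate them is a binomial GLM on the grouped counts. This is a sketch on invented cells, not a fitted result, and it assumes `exposure_cnt` is a whole number that is never smaller than `lapse_cnt`, which I have not verified for the real data.

```r
# Made-up cells for illustration only: 3 issue-age groups x 3 premium-jump levels.
toy <- expand.grid(
  issue_age_group = c("20-29", "40-49", "60-69"),
  prem_jump_cont  = c(2.5, 5.5, 8.5)
)
toy$exposure_cnt <- c(40, 50, 30, 60, 80, 40, 20, 50, 60)
toy$lapse_cnt    <- c(14, 24, 18, 30, 48, 29, 13, 37, 50)

# Grouped logistic regression: the response is (lapses, non-lapses) per cell.
fit <- glm(cbind(lapse_cnt, exposure_cnt - lapse_cnt) ~ issue_age_group + prem_jump_cont,
           family = binomial, data = toy)

# Each coefficient is the effect on the log-odds of lapsing, holding the other variable fixed.
summary(fit)$coefficients
```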
a <- aggregate(lapse_cnt ~ issue_year_group, data = data, FUN = sum)
b <- aggregate(exposure_cnt ~ issue_year_group, data = data, FUN = sum)
ab <- a[, 2]/b[, 2]
lapse <- data.frame(issue_year_group = a[, 1], D10_lapse = ab)
lapse
## issue_year_group D10_lapse
## 1 1991-1993 0.6791
## 2 1994-1996 0.5661
## 3 1997-1999 0.6396
## 4 2000-2002 0.7662
aggregate(prem_jump_cont ~ issue_year_group, data = data, FUN = mean)
## issue_year_group prem_jump_cont
## 1 1991-1993 5.863
## 2 1994-1996 5.615
## 3 1997-1999 6.043
## 4 2000-2002 7.664
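This is consistent with what we observed in the study year results: the 2000-2002 issue group, which reaches duration 10 in the latest study years, has both the highest average premium jump (7.66) and the highest lapse rate (0.77).
Some of the observations above are expected, while others are surprising. For reference, I compared the findings above to this SOA study. For the most part, I think the findings are consistent.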