This post will talk about how to apply linear regression model and tree based models to predict the duration 10 lapse rates.
First I split the data into training and test datasets:
data <- read.csv("LapseData2000_2011.csv", sep = ",", header = TRUE)
data <- subset(data, DURATION == 10 & EXPOSURE_CNT != 0, select = -PREM_JUMP_D11_D10)
names(data) <- tolower(names(data))
split <- 0.8
set.seed(2)
train.ind <- sample(1:nrow(data), split * nrow(data))
train <- data[train.ind, ]
test <- data[-train.ind, ]
Here I do a random 80/20 split, training the model on 80% of the data, and making the prediction on the remaining 20%. The goal is to match the predicted lapse to the actuals as closely as possible in the test set.
The set.seed function ensures the reproducibility of the results.
Below I fit a linear regression model to the data.
formula <- lapse_cnt/exposure_cnt ~ lapse_study_year + risk_class + smoker_class +
gender + premium_mode + face_amount_band + issue_age_group + prem_jump_cont
lm.fit <- lm(formula, data = train)
summary(lm.fit)
##
## Call:
## lm(formula = formula, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2746 -0.1609 0.0697 0.2153 0.7184
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.65e+01 1.93e+00 -8.54 < 2e-16 ***
## lapse_study_year 8.43e-03 9.62e-04 8.76 < 2e-16 ***
## risk_classPref -8.86e-03 5.77e-03 -1.54 0.1248
## smoker_classSM 4.11e-02 5.75e-03 7.15 9.2e-13 ***
## genderM 1.07e-02 4.98e-03 2.16 0.0310 *
## premium_mode2. Semiannual -1.70e-02 7.58e-03 -2.24 0.0248 *
## premium_mode3. Quarterly -5.96e-02 6.42e-03 -9.28 < 2e-16 ***
## premium_mode4. Monthly -1.73e-01 6.66e-03 -26.04 < 2e-16 ***
## premium_mode5. Biweekly -7.75e-01 1.25e-01 -6.18 6.7e-10 ***
## face_amount_bandB. 100k-249k 2.44e-02 8.36e-03 2.92 0.0036 **
## face_amount_bandC. 250k-999k 4.83e-02 8.61e-03 5.60 2.1e-08 ***
## face_amount_bandD. 1M + 4.87e-02 1.02e-02 4.78 1.8e-06 ***
## issue_age_group20-29 5.29e-02 2.50e-02 2.11 0.0346 *
## issue_age_group30-39 9.67e-02 2.46e-02 3.94 8.2e-05 ***
## issue_age_group40-49 1.85e-01 2.46e-02 7.52 5.7e-14 ***
## issue_age_group50-59 2.34e-01 2.47e-02 9.47 < 2e-16 ***
## issue_age_group60-69 2.44e-01 2.52e-02 9.68 < 2e-16 ***
## issue_age_group70+ 2.13e-01 2.78e-02 7.67 1.9e-14 ***
## prem_jump_cont 2.55e-02 6.94e-04 36.72 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.307 on 16334 degrees of freedom
## Multiple R-squared: 0.223, Adjusted R-squared: 0.222
## F-statistic: 261 on 18 and 16334 DF, p-value: <2e-16
This link provides a very good explanation of how to read the summary. The key takeaway here is that the residuals are reletively small, and most predictors are significant. Although the goodness of fit measure, R square, is only 0.2236.
I can also play around with the formula by removing some variables, or by adding an interaction term, which I did not show here. Instead, I perform a best subset selection on the predictors. It helps decide what are the most important variables we should include in the final model.
require(leaps)
## Loading required package: leaps
regfit.full <- regsubsets(formula, data = train, nvmax = 18)
reg.summary <- summary(regfit.full)
names(reg.summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
This process will go over each possible combinations of the p predictors and pick out the one that has the lowest prediction error. The problem with this process is that there are 2p possibilities need to be considered. So if p is large, we should consider using forward or backward stepwise model selection. This model selection or feature selection step is an entire topic onto itself.
After generating the summary for best subset, I examine different pieces graphcially:
par(mfrow = c(2, 2))
plot(reg.summary$rss, xlab = " Number of Variables ", ylab = "RSS", type = "l")
plot(reg.summary$cp, xlab = " Number of Variables ", ylab = "C(p)", type = "l")
# plot adjusted r square and find the maximum
plot(reg.summary$adjr2, xlab = " Number of Variables ", ylab = " Adjusted RSq",
type = "l")
which.max(reg.summary$adjr2)
## [1] 18
points(18, reg.summary$adjr2[18], col = "red", cex = 2, pch = 20)
# plot Bayesian information criterion (BIC) and find the minimum
plot(reg.summary$bic, xlab = " Number of Variables ", ylab = " BIC", type = "l")
which.min(reg.summary$bic)
## [1] 13
points(13, reg.summary$bic[13], col = "red", cex = 2, pch = 20)
# plot the Best Subset Selection scaled by BIC
plot(regfit.full, scale = "r2")
plot(regfit.full, scale = "Cp")
plot(regfit.full, scale = "adjr2")
plot(regfit.full, scale = "bic")
The top row of each plot contains a black square for each variable selected according to best subset associated with the corresponding measure. The first three measures suggest that we should use all 18 variables, the last one suggest we can safely ignore risk class and gender. After comparing both results, the results are very close, so I choose to go with the variables suggested by BIC.
I created a new formula, and predict the results using the 'predict' function. After that the A/E ratio is calculated. The result of 1.009 means that actual and expected are almost the same!
For visualization, I also plot the differeneces in the form of a histogram.
#### Adjust the formula with the findings from the model selection step ###
formulaBSS <- lapse_cnt/exposure_cnt ~ lapse_study_year + smoker_class + premium_mode +
face_amount_band + issue_age_group + prem_jump_cont
BSSmodel <- lm(formulaBSS, train)
summary(BSSmodel)
##
## Call:
## lm(formula = formulaBSS, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.2874 -0.1592 0.0704 0.2154 0.7269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.78e+01 1.66e+00 -10.75 < 2e-16 ***
## lapse_study_year 9.09e-03 8.26e-04 11.00 < 2e-16 ***
## smoker_classSM 3.97e-02 5.53e-03 7.18 7.5e-13 ***
## premium_mode2. Semiannual -1.67e-02 7.58e-03 -2.21 0.0272 *
## premium_mode3. Quarterly -5.96e-02 6.42e-03 -9.28 < 2e-16 ***
## premium_mode4. Monthly -1.74e-01 6.66e-03 -26.06 < 2e-16 ***
## premium_mode5. Biweekly -7.71e-01 1.25e-01 -6.14 8.3e-10 ***
## face_amount_bandB. 100k-249k 2.37e-02 8.35e-03 2.84 0.0045 **
## face_amount_bandC. 250k-999k 4.83e-02 8.59e-03 5.63 1.9e-08 ***
## face_amount_bandD. 1M + 5.10e-02 1.01e-02 5.03 4.8e-07 ***
## issue_age_group20-29 5.17e-02 2.50e-02 2.07 0.0386 *
## issue_age_group30-39 9.55e-02 2.46e-02 3.89 0.0001 ***
## issue_age_group40-49 1.84e-01 2.46e-02 7.48 7.6e-14 ***
## issue_age_group50-59 2.34e-01 2.47e-02 9.47 < 2e-16 ***
## issue_age_group60-69 2.45e-01 2.52e-02 9.74 < 2e-16 ***
## issue_age_group70+ 2.15e-01 2.78e-02 7.73 1.1e-14 ***
## prem_jump_cont 2.55e-02 6.90e-04 36.92 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.307 on 16336 degrees of freedom
## Multiple R-squared: 0.223, Adjusted R-squared: 0.222
## F-statistic: 293 on 16 and 16336 DF, p-value: <2e-16
## read in the test data
pred.lm <- predict(BSSmodel, test)
predLap.lm <- test$exposure_cnt * pred.lm
aeLM <- sum(test$lapse_cnt)/sum(predLap.lm) #find the actual and expected lapse
# plot the errors in histogram
par(mfrow = c(1, 1))
require(MASS)
## Loading required package: MASS
truehist(predLap.lm - test$lapse_cnt, xlim = c(-10, 10))
Next I fit the data into a decision tree, and then apply different aggregation methods like bagging, random forests and boosting to improve the model performance.
The most popular R package for classification and regression trees is probably 'rpart' package. An alternative is the 'tree' package. Here I use the later, for no particular reasons.
require(tree)
## Loading required package: tree
tree.train <- tree(formulaBSS, data = train)
summary(tree.train)
##
## Regression tree:
## tree(formula = formulaBSS, data = train)
## Variables actually used in tree construction:
## [1] "prem_jump_cont" "premium_mode" "lapse_study_year"
## Number of terminal nodes: 7
## Residual mean deviance: 0.0936 = 1530 / 16300
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.884 -0.137 0.116 0.000 0.204 0.659
plot(tree.train)
text(tree.train, pretty = 0)
By simply plugging in the formula and data, I get a nice visualization of a decision tree. The decision tree shows that the most relevant variable (i.e. the variable that decrease the RSS the most) is premium jump rate. Premium mode and lapse study year are the only two other variables used to construct the tree. The residual mean deviance is 0.09, which indicates a fairly good fit to the training data.
I also use cross-validation to determine the optimal level of tree complexity.
cv.tree.train = cv.tree(tree.train)
plot(cv.tree.train$size, cv.tree.train$dev, type = "b")
As the number of terminal tree nodes increases, the cross validation error rate decreases. So a tree with 7 nodes is optimal, which is what we get from the model earlier.
I then make the prediction with the test data.
pred.tree <- predict(tree.train, test)
predLap.tree <- test$exposure_cnt * pred.tree
aeTree <- sum(test$lapse_cnt)/sum(predLap.tree)
aeTree
## [1] 0.9759
The Actual to Expected ratio is 0.98. Not as good as the linear model, but still decent. Next, I want to see if I can improve the result by applying different aggregation methods to the decision tree. Bagging and boosting are general ensemble methods (not just for trees) that are known to improve model performance, you can read more about it on Wikipedia.
Random forest, in simple term, it's a forest of many decision trees. These trees are built using bootstapped training samples. The algorithm randomly selects m predictors out of p total variables as split candidates. The default m in the randomForest function in R is p/3. Therefore, since I have 6 variables in the new formula, the model will choose 2 random predictors and build the tree using only these predictors. The main reason this works is that this decorrelates the trees. We avoid the situation where a few dominating factors will show up in every tree, making the trees highly correlated.
The most efficient way to apply random forest is to input the data into matrices. I can then apply the randomForest function from the 'randomForest' package. I also plot the importance of each variable in the descenting order. Premium jump rate is again the most important predictor.
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
library(useful)
## Loading required package: ggplot2
X <- build.x(formulaBSS, data = train)
Y <- build.y(formulaBSS, data = train)
X1 <- build.x(formulaBSS, data = test)
forest.train <- randomForest(x = X, y = Y, ntree = 100, importance = TRUE)
pred.rf <- predict(forest.train, X1)
predLap.rf <- test$exposure_cnt * pred.rf
aeRF <- sum(test$lapse_cnt)/sum(predLap.rf)
aeRF
## [1] 1.002
varImpPlot(forest.train)
The A/E ratio improves to 1.002, which is saying the predicted values are almost the same as the actuals.
Next up, bagging. The only difference between bagging and random forest is that bagging uses all p variables, instead of being limited to m. Therefore, the formulas are very similar.
bag.train <- randomForest(x = X, y = Y, mtry = 6, ntree = 100, importance = TRUE)
pred.bag <- predict(bag.train, X1)
predLap.bag <- test$exposure_cnt * pred.bag
aeBag <- sum(test$lapse_cnt)/sum(predLap.bag)
aeBag
## [1] 1.006
varImpPlot(bag.train)
The A/E ratio is 1.006, very close to the random forest results.
Lastly, I apply another aggregation method, boosting.
library(gbm)
## Loading required package: survival
## Loading required package: splines
## Loading required package: lattice
## Loading required package: parallel
## Loaded gbm 2.1.1
boost.train = gbm.fit(x = X, y = Y, distribution = "gaussian", interaction.depth = 4,
n.trees = 5000, verbose = FALSE)
## Warning: variable 1: (Intercept) has no variation.
pred.boost <- predict(boost.train, X1, n.trees = 5000)
predLap.boost <- test$exposure_cnt * pred.boost
aeBoost <- sum(test$lapse_cnt)/sum(predLap.boost)
aeBoost
## [1] 0.9999
The A/E ratio againis very close to 1.
Here is all the A/E ratio in a table:
rbind(aeLM, aeTree, aeRF, aeBag, aeBoost)
## [,1]
## aeLM 1.0091
## aeTree 0.9759
## aeRF 1.0018
## aeBag 1.0060
## aeBoost 0.9999
For good measures, I also run the same model on Duration 11 data, and here is what I get:
To my surprise, linear model and Tree based models seem to work equally work on this dataset!