The primary purpose of this assignment is to predict the likelihood of an auto insurance customer being involved in an accident. First, missing data is accounted for and, where possible, null values and outliers are imputed using variable-specific regression tree models. Then three models (two logistic regression models and a probit regression model) are fit and evaluated. Additionally, a boosted classification tree model is developed in an attempt to improve predictive accuracy. Finally, R
code is provided to score a fresh data file and make predictions for customers not included in the original sample. All code used in the development of this project can be found here.
The dataset includes information on 8,161 insurance customers. The variables included are listed below, along with their theoretical effects.
VARIABLE NAME | DEFINITION | THEORETICAL EFFECT |
---|---|---|
INDEX | Identification Variable (do not use) | None |
TARGET_FLAG | Was Car in a crash? 1=YES 0=NO | None |
TARGET_AMT | If car was in a crash, what was the cost | None |
AGE | Age of Driver | Very young people tend to be risky. Maybe very old people also. |
BLUEBOOK | Value of Vehicle | Unknown effect on probability of collision, but probably affects the payout if there is a crash |
CAR_AGE | Vehicle Age | Unknown effect on probability of collision, but probably affects the payout if there is a crash |
CAR_TYPE | Type of Car | Unknown effect on probability of collision, but probably affects the payout if there is a crash |
CAR_USE | Vehicle Use | Commercial vehicles are driven more, so might increase probability of collision |
CLM_FREQ | #Claims(Past 5 Years) | The more claims you filed in the past, the more you are likely to file in the future |
EDUCATION | Max Education Level | Unknown effect, but in theory more educated people tend to drive more safely |
HOMEKIDS | #Children @Home | Unknown effect |
HOME_VAL | Home Value | In theory, home owners tend to drive more responsibly |
INCOME | Income | In theory, rich people tend to get into fewer crashes |
JOB | Job Category | In theory, white collar jobs tend to be safer |
KIDSDRIV | #Driving Children | When teenagers drive your car, you are more likely to get into crashes |
MSTATUS | Marital Status | In theory, married people drive more safely |
MVR_PTS | Motor Vehicle Record Points | If you get lots of traffic tickets, you tend to get into more crashes |
OLDCLAIM | Total Claims(Past 5 Years) | If your total payout over the past five years was high, this suggests future payouts will be high |
PARENT1 | Single Parent | Unknown effect |
RED_CAR | A Red Car | Urban legend says that red cars (especially red sports cars) are more risky. Is that true? |
REVOKED | License Revoked (Past 7 Years) | If your license was revoked in the past 7 years, you probably are a more risky driver. |
SEX | Gender | Urban legend says that women have fewer crashes than men. Is that true? |
TIF | Time in Force | People who have been customers for a long time are usually safer. |
TRAVTIME | Distance to Work | Long drives to work usually suggest greater risk |
URBANICITY | Home/Work Area | Unknown |
YOJ | Years on Job | People who stay at a job for a long time are usually safer |
Some basic exploratory plots, summary statistics, and correlations are presented below. 26% of customers have been involved in an accident. Many of the variables, like age and travel time, follow a roughly normal distribution, but others, like home value and income, are positively skewed. The data will need to be appropriately transformed to be useful for predictive purposes; a quick skewness check is sketched after the summary table below. No single numeric variable is strongly correlated with claim likelihood, but the number of past claims and moving-violation points come closest, each correlated with the crash indicator at roughly 0.2.
Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA's |
---|---|---|---|---|---|---|---|
target_flag | 0 | 0 | 0 | 0.2638 | 1 | 1 | 0 |
kidsdriv | 0 | 0 | 0 | 0.1711 | 0 | 4 | 0 |
age | 16 | 39 | 45 | 44.79 | 51 | 81 | 6 |
homekids | 0 | 0 | 0 | 0.7212 | 1 | 5 | 0 |
yoj | 0 | 9 | 11 | 10.5 | 13 | 23 | 454 |
income | 0 | 28097 | 54028 | 61898 | 85986 | 367030 | 445 |
home_val | 0 | 0 | 161160 | 154867 | 238724 | 885282 | 464 |
travtime | 5 | 22 | 33 | 33.49 | 44 | 142 | 0 |
bluebook | 1500 | 9280 | 14440 | 15710 | 20850 | 69740 | 0 |
tif | 1 | 1 | 4 | 5.351 | 7 | 25 | 0 |
oldclaim | 0 | 0 | 0 | 4037 | 4636 | 57037 | 0 |
clm_freq | 0 | 0 | 0 | 0.7986 | 2 | 5 | 0 |
mvr_pts | 0 | 0 | 1 | 1.696 | 3 | 13 | 0 |
car_age | -3 | 1 | 8 | 8.328 | 12 | 28 | 510 |
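As a quick check on that skew, something like the following can be used (a minimal sketch; it assumes the lower-cased training data frame `train` used later in the scoring code and the moments package loaded there, and the log-transformed columns are illustrative rather than the exact transformations applied):

```r
library(dplyr)
library(moments)

# Skewness of the positively skewed dollar-valued fields (illustrative subset).
train %>%
  dplyr::select(income, home_val, bluebook, oldclaim) %>%
  summarise_all(~ skewness(., na.rm = TRUE))

# A log1p transform pulls in the long right tails while handling the zeros.
train.transformed <- train %>%
  mutate(income_log = log1p(income),
         home_val_log = log1p(home_val))
```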
The data is largely complete. The variables with the most null values are only 5-6% incomplete.
Regression and classification trees are used to impute missing values. Additionally, outlier values are removed and then imputed using the same method. Binary flags are generated to record which values were imputed and for which reason (missing versus outlier). Finally, all values are scaled and centered to stabilize the model estimates.
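A condensed sketch of that preprocessing is shown below; the full version lives inside the scoring function at the end of this report. Income is used as the example column, and the IQR multiplier mirrors the one in that function.

```r
library(dplyr)
library(mice)

num <- train %>% dplyr::select_if(is.numeric)   # assumed: cleaned training frame

# 1. Record which values were missing before anything is changed.
num <- num %>% mutate_all(funs(na.flag = ifelse(is.na(.), 1, 0)))

# 2. Blank out extreme values (here, income beyond Q1/Q3 +/- 2.5 * IQR) so they
#    are imputed along with the genuinely missing values.
q <- quantile(num$income, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1]
num <- num %>%
  mutate(income = ifelse(between(income, q[1] - 2.5 * iqr, q[2] + 2.5 * iqr),
                         income, NA))

# 3. Impute with regression/classification trees, then scale and center the
#    non-flag columns to stabilize the model estimates.
imputed <- complete(mice(num, method = 'cart', maxit = 1, printFlag = FALSE))
imputed <- imputed %>%
  mutate_at(vars(-ends_with('na.flag')), ~ as.numeric(scale(.)))
```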
Three models are developed as part of this project - a logistic regression model using heuristic variable selection, a logistic regression model with variables selected through a bi-directional stepwise process meant to minimize AIC, and a probit regression model using the same variable selection technique.
The first model developed is a logistic regression model using every input for which the data dictionary suggests some theoretical impact on accident likelihood. Logistic regression models work similarly to OLS regression models, but the linear combination of predictors is passed through the inverse logit function to produce probabilities of a binary outcome rather than point estimates of a continuous variable [1].
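In R that amounts to a single glm() call with a binomial family and logit link. The sketch below is illustrative only: `train_clean` stands for the imputed, scaled training data, and the predictor list stands in for the full heuristic variable set rather than reproducing it exactly.

```r
# Heuristic logit model; the predictor list here is illustrative, not exhaustive.
model.heuristic <- glm(
  target_flag ~ revoked + kidsdriv + mvr_pts + clm_freq + car_use + mstatus +
    urbanicity + travtime + income,
  data = train_clean,
  family = binomial(link = 'logit')
)

summary(model.heuristic)                                    # coefficients and significance
p.heuristic <- predict(model.heuristic, type = 'response')  # predicted accident probabilities
```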
Figures 2.1 and 2.2 illustrate a troubling tendency to return inaccurately low accident probabilities. Too many inaccurately low predictions could lead to the company taking on customers whose claims exceed the premiums they pay. Additionally, the records with high probability scores (in the 90s) do not, in aggregate, have more accidents than some lower-scoring groups (in the 70s and 80s).
The strongest variable in the model is whether or not a customer’s driver’s license has been revoked in the past seven years. Kid drivers and a high number of moving-violation points also contribute to claim likelihood, which makes sense. Almost all of the variable coefficients are significant and behave according to the assumptions made in the data dictionary.
The second model tested is a logistic regression model developed with a bi-directional stepwise variable selection process. Starting with a logit model containing all possible variables, the procedure iteratively works through candidate sets of predictors, adding and dropping variables to minimize AIC [2][3]. It is a computationally expensive process, which could become a factor in production decisions, but the gains in predictive accuracy may be worthwhile.
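In R, MASS::stepAIC() handles this selection. The sketch below assumes a data frame `data.for.stepaic` built the same way as in the scoring function at the end of this report, with target_flag attached to the cleaned, dummy-coded predictors.

```r
library(MASS)

# Start from a logit model containing every candidate predictor...
full.logit <- glm(target_flag ~ ., data = data.for.stepaic,
                  family = binomial(link = 'logit'))

# ...then add and drop terms in both directions until AIC stops improving.
model.logit.step <- stepAIC(full.logit, direction = 'both', trace = FALSE)
model.logit.step$anova   # record of the terms added and dropped
```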
Figure 3.1 looks nearly identical to 2.1, especially the distributions of scores for claimants, which skew too low. The cumulative density plot (Figure 3.3) looks slightly more skewed than the one in 2.3. But the predictive accuracy illustrated in Figure 3.2 demonstrates a significant improvement over the heuristic model.
Many of the variables selected through the stepwise process behave in line with the assumptions laid out in the data dictionary. Some of the coefficients (such as those for missing age data, SUV ownership, and the presence of kids in the household) are not significant, but they are left in the model because doing so minimizes AIC.
The probit model uses the probit link function to adapt OLS regression for classification (similar to the logit approach, but with a different transformation). In this case the two link functions performed similarly. A bi-directional stepwise variable selection technique is again used to select the subset of variables that minimizes AIC. The winning model is then iteratively refit on subsamples of the data, and the average predicted value across those fits is calculated for each record.
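A sketch of that process, under the same assumptions as the stepwise logit sketch above (the 100 subsamples are illustrative, not the exact resampling scheme; the fitted object is named `model.prob` because the scoring function at the end of this report expects an object of that name):

```r
library(MASS)

# Same stepwise AIC selection, but with a probit link.
full.probit <- glm(target_flag ~ ., data = data.for.stepaic,   # assumed frame, as above
                   family = binomial(link = 'probit'))
model.prob <- stepAIC(full.probit, direction = 'both', trace = FALSE)

# Refit the winning formula on repeated subsamples and average the predictions.
set.seed(20)
preds <- replicate(100, {
  idx <- sample(nrow(data.for.stepaic), size = floor(0.8 * nrow(data.for.stepaic)))
  fit <- glm(formula(model.prob), data = data.for.stepaic[idx, ],
             family = binomial(link = 'probit'))
  predict(fit, newdata = data.for.stepaic, type = 'response')
})
avg.pred <- rowMeans(preds)   # one averaged probability per record
```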
As with the bi-directional logistic regression model, too many non-accident predictions are clearly a problem in this data set. Figures 4.1 and 4.3 illustrate the model's relative success at correctly identifying non-accident cases, but also that a significant number of cases where accidents did occur received low scores. One bright spot compared to the previous models is Figure 4.2, which shows that as scores go up, aggregate accident likelihood tends to go up as well.
The probit model coefficients behave similarly to the stepwise logistic regression model. Again, regardless of statistical significance, the variables selected through the AIC minimization process are retained in the model.
The three models are evaluated on the McFadden pseudo \(R^2\) statistic, the Kolmogorov-Smirnov (KS) statistic, area under the ROC curve (AUC), and root mean squared error (RMSE). All calculations are cross-validated using 1,000 iterations of sampled model training and out-of-sample model testing.
McFadden’s pseudo \(R^2\) offers a stand-in for the kind of \(R^2\) used to evaluate OLS regression models [4]. The probit model explains the greatest amount of variance, followed closely by the bi-directional logit model; the heuristic logit model trails significantly.
The KS statistic compares the cumulative distribution of the model predictions to the ideal distribution for the data. On this measure the heuristic model performed best of the three. While the KS statistic is a more formal test of the shape of the cumulative distribution curve, Figures 2.3, 3.3, and 4.3 all indicated that scores are depressed toward lower probabilities than is appropriate. The heuristic model appeared closest to the ideal distribution, and the KS statistic confirms that impression.
AUC “relates the hit rate to the false alarm rate” in classification modeling [5]. The heuristic model was far behind on this measure and produced highly unstable scores. For production purposes, consistent scores will ease adoption of the model, so to the extent that political concerns factor into model selection, the inconsistency in the AUC measures for the heuristic model is problematic.
Finally, cross-validated RMSE tests hold-out-sample predictions against known values. The bi-directional logit model slightly outperforms the bi-directional probit model, but not by a significant margin. The heuristic model again lags significantly.
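For reference, a single cross-validation iteration of those four measures can be computed roughly as follows (a sketch under the same assumptions as the probit sketch above; pROC supplies the AUC, the two-sample KS statistic here compares claimant and non-claimant score distributions, which is one common formulation, and the full evaluation repeats this over 1,000 fresh train/test splits):

```r
library(pROC)

set.seed(42)
idx <- sample(nrow(data.for.stepaic), size = floor(0.8 * nrow(data.for.stepaic)))
train.i <- data.for.stepaic[idx, ]
test.i <- data.for.stepaic[-idx, ]

fit <- glm(formula(model.prob), data = train.i, family = binomial(link = 'probit'))
null.fit <- glm(target_flag ~ 1, data = train.i, family = binomial(link = 'probit'))
p <- predict(fit, newdata = test.i, type = 'response')

mcfadden <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null.fit))  # pseudo R^2
ks <- ks.test(p[test.i$target_flag == 1], p[test.i$target_flag == 0])$statistic
auc.val <- as.numeric(auc(roc(test.i$target_flag, p)))
rmse <- sqrt(mean((test.i$target_flag - p)^2))
```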
Stepwise variable selection is a computationally expensive process, but the gains in predictive accuracy more than justify the added time required to train these models. Both logistic regression and probit regression are easy enough to explain to management, provided the predictions hold up. The winning model is therefore the bi-directional stepwise probit model.
The function below (along with its package dependencies) takes a file of new data and returns the index, the predicted likelihood of a claim, and a rough estimate of the likely claim cost for each record. The cost estimate is the median claim value from the training data.
library(tidyverse)
library(yaztheme)
library(reshape2)
library(ggridges)
library(randomForest)
library(moments)
library(mice)
library(MASS)
library(dummies)
library(dplyr)
# Score a new data file: apply the same cleaning/imputation pipeline used in training,
# then predict with the fitted stepwise probit model. Note: this relies on a fitted
# model object named `model.prob` existing in the calling environment.
score_creator <- function(infile){
# Read the raw file and strip currency formatting and the z_/< prefixes from text fields.
train <- read_csv(infile)%>%
mutate(INCOME = as.numeric(gsub('[$,]','',INCOME)),
HOME_VAL = as.numeric(gsub('[$,]','',HOME_VAL)),
BLUEBOOK = as.numeric(gsub('[$,]','',BLUEBOOK)),
OLDCLAIM = as.numeric(gsub('[$,]','',OLDCLAIM)),
PARENT1 = tolower(gsub('[z_<]','', PARENT1)),
MSTATUS = tolower(gsub('[z_<]','', MSTATUS)),
SEX = tolower(gsub('[z_<]','', SEX)),
EDUCATION = tolower(gsub('[z_<]','', EDUCATION)),
JOB = tolower(gsub('[z_<]','', JOB)),
CAR_USE = tolower(gsub('[z_<]','', CAR_USE)),
CAR_TYPE = tolower(gsub('[z_<]','', CAR_TYPE)),
RED_CAR = tolower(gsub('[z_<]','', RED_CAR)),
REVOKED = tolower(gsub('[z_<]','', REVOKED)),
URBANICITY = tolower(gsub('[z_<]','', URBANICITY)))
colnames(train) <- tolower(colnames(train))
# Add a binary *_na.flag column for every field that arrives missing.
train.flagged <- train%>%
mutate_all(funs(na.flag = ifelse(is.na(.),1,0)))
int_df <- train.flagged%>%
dplyr::select(-index, -target_flag, -target_amt)%>%
dplyr::select_if(is.numeric)
# md.pattern(int_df)
# Blank out outliers in each numeric predictor (kidsdriv excluded) so they can be re-imputed.
cleaned_cols <- list()
for(c in colnames(train%>%
dplyr::select(-index, -target_flag, -target_amt, -kidsdriv)%>%
dplyr::select_if(is.numeric))){
column <- train.flagged%>%select_(col = c)
iqr <- quantile(column$col, na.rm = T)[4] - quantile(column$col, na.rm = T)[2]
low <- quantile(column$col, na.rm = T)[2] - iqr
high <- quantile(column$col, na.rm = T)[4] + iqr
vals <- c()
for(i in seq(1:nrow(int_df))){
# Keep values within Q1/Q3 +/- 2.5*IQR; blank outliers (and NAs) for imputation.
vals[i] <- ifelse(between(column$col[i], low - (1.5*iqr), high + (1.5*iqr)),
column$col[i], NA)
}
ifelse(length(vals) == nrow(int_df),
cleaned_cols[[c]] <- vals,
cleaned_cols[[c]] <- c(vals,NA))
}
# Recombine the scaled numeric columns with the NA flags and the character fields.
df2 <- bind_cols(
bind_cols(cleaned_cols)%>%
# select(-kidsdriv)%>%
scale(center = TRUE)%>%
data.frame(),
train.flagged%>%
dplyr::select(ends_with('na.flag'), kidsdriv),
train%>%dplyr::select_if(is.character)
)
# Values that are NA now but were not missing originally were blanked as outliers.
df3 <- df2%>%
mutate(
kidsdriv_out.flag = ifelse(is.na(kidsdriv) & kidsdriv_na.flag ==0,1,0),
age_out.flag = ifelse(is.na(age) & age_na.flag ==0,1,0),
homekids_out.flag = ifelse(is.na(homekids) & homekids_na.flag ==0,1,0),
yoj_out.flag = ifelse(is.na(yoj) & yoj_na.flag ==0,1,0),
income_out.flag = ifelse(is.na(income) & income_na.flag ==0,1,0),
parent1_out.flag = ifelse(is.na(parent1) & parent1_na.flag ==0,1,0),
home_val_out.flag = ifelse(is.na(home_val) & home_val_na.flag ==0,1,0),
mstatus_out.flag = ifelse(is.na(mstatus) & mstatus_na.flag ==0,1,0),
sex_out.flag = ifelse(is.na(sex) & sex_na.flag ==0,1,0),
education_out.flag = ifelse(is.na(education) & education_na.flag ==0,1,0),
job_out.flag = ifelse(is.na(job) & job_na.flag ==0,1,0),
travtime_out.flag = ifelse(is.na(travtime) & travtime_na.flag ==0,1,0),
car_use_out.flag = ifelse(is.na(car_use) & car_use_na.flag ==0,1,0),
bluebook_out.flag = ifelse(is.na(bluebook) & bluebook_na.flag ==0,1,0),
tif_out.flag = ifelse(is.na(tif) & tif_na.flag ==0,1,0),
car_type_out.flag = ifelse(is.na(car_type) & car_type_na.flag ==0,1,0),
red_car_out.flag = ifelse(is.na(red_car) & red_car_na.flag ==0,1,0),
oldclaim_out.flag = ifelse(is.na(oldclaim) & oldclaim_na.flag ==0,1,0),
clm_freq_out.flag = ifelse(is.na(clm_freq) & clm_freq_na.flag ==0,1,0),
revoked_out.flag = ifelse(is.na(revoked) & revoked_na.flag ==0,1,0),
mvr_pts_out.flag = ifelse(is.na(mvr_pts) & mvr_pts_na.flag ==0,1,0),
car_age_out.flag = ifelse(is.na(car_age) & car_age_na.flag ==0,1,0),
urbanicity_out.flag = ifelse(is.na(urbanicity) & urbanicity_na.flag ==0,1,0)
)
# Impute the missing and blanked-out values with classification/regression trees.
temp_df <- mice(df3, method = 'cart', maxit = 1)
train_clean <- complete(temp_df)%>%
bind_cols(train%>%dplyr::select(index, target_flag, target_amt))
# Rebuild the model matrix used for stepwise selection: numeric columns plus dummy-coded text fields.
data.for.stepaic <- train_clean%>%
dplyr::select(-index_na.flag, -target_flag_na.flag, -target_amt_na.flag,
-target_amt)%>%
dplyr::select_if(is.numeric)%>%
bind_cols(
dummy.data.frame(train_clean%>%select_if(is.character))
)
# Rough cost estimate: the median claim amount observed in the training data.
med.claim <- 4104.00
prob.preds <- data.frame(
P_TARGET_FLAG = predict(model.prob, data.for.stepaic, type="response")
)%>%
bind_cols(data.for.stepaic)%>%
mutate(P_TARGET_AMT = med.claim)%>%
dplyr::select(INDEX = index, P_TARGET_FLAG, P_TARGET_AMT)
return(prob.preds)
}
test_data <- score_creator('logit_insurance_test.csv')
write.csv(x = test_data, file = 'yazman_insurance_test.csv')
Hoffmann, John P. Generalized Linear Models: An Applied Approach. Pearson/Allyn & Bacon, 2004.
James, Gareth, et al. An Introduction to Statistical Learning with Applications in R. Springer, 2017.
https://mathewanalytics.com/2015/09/02/logistic-regression-in-r-part-two/
https://www.kdnuggets.com/2010/09/pub-is-auc-the-best-measure.html