In this homework assignment, you will explore, analyze and model a data set containing approximately 8000 records representing a customer at an auto insurance company… your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car.
The provided dataset contains two response variables:
target_flag: a categorical variable where 1 indicates a crash event.
target_amt: a numeric variable indicating the insurance payout amount for crash events.
We’ll build three multivariate logistic regression models using target_flag to predict the probability of crash events, and two multivariate linear regression models using target_amt to predict the cost of crash events.
1. Data Exploration
In order to explore summary stats and distribution characteristics of our dataset, we’ll need to first conduct some basic cleanup:
Code
# load datadf_train <-as.data.frame(read.csv('data/insurance_training_data.csv'))df_test <-as.data.frame(read.csv('data/insurance_evaluation_data.csv'))# union for exploration and prepdf_train <- df_train %>%mutate(source='train')df_test <- df_test %>%mutate(source='test')df_all <-bind_rows(df_train, df_test)
Apart from our two response variables (target_flag and target_amt), we have 24 individual predictors describing characteristics of our 10,000 total observations. These include demographic measures such as age and gender, socioeconomic measures such as education and household income, and vehicle-specific metrics such as car model, age and assessed value.
Observations
First, none of our predictors appear to have linear relationships to our target_amt response variable, which is a primary assumption of linear regression. This suggests that alternative methods (such as polynomial regression) might be more successful in modeling the relationships.
Code
# scatterplot matrix of numeric cols vs target_amtdf_all[,numeric_cols] %>%gather(-target_amt, key="key", value="value") %>%ggplot(aes(x=value, y=target_amt)) +geom_point(alpha=0.25) +facet_wrap(~ key, scales='free')
Second, many of these distributions seem highly skewed and non-normal. As part of our data preparation we’ll use a power transformation (Box-Cox) to find whether transforming variables to more normal distributions improves our models’ efficacy.
2. Data Preparation
Before we go further, we need to identify and handle any missing, NA or negative data values so we can perform log transformations and regression.
We have only a single predictor with empty data: job, a categorical describing an individual’s job title, and we’ll impute a label ‘Unknown’ for these instances.
Code
# working copydf_prep <- df_all# find columns with empty valuesx <- df_prep %>%apply(2, function(col) sum(nchar(col)==0)) x[!is.na(x) & x>0] %>%as.data.frame() %>%setNames('empty')
Convert our target_flag response variable from a numeric to factor (to enable randomforest as part of the logistic regression steps.)
Code
# coerce some columns to factordf_prep <- df_prep %>%mutate(target_flag =as.factor(target_flag))
Next we’ll want to consider any power transformations. Our numeric response variable target_amt is a good candidate for transformation as its distribution is very highly skewed, and the assumption of normality is required in order to apply linear regression.
We’ll use the Box-Cox method to identify the ideal transformation parameter for this variable, but only after we split the datasets so that non-crash observations (where target_amt is always zero) don’t skew the result.
To simplify this step, we’ve created a custom function that takes a single dataframe column as a vector, adjusts it slightly to correct for any zero values, and returns a power transformation using the best-possible value for the Box-Cox lambda:
Code
# function to retrieve box-cox-transformed vector and the lambda valueget_boxcox_dist <-function(vec, plot=FALSE){ vec <- vec +0.1# avoid -Inf errors bc_formula <<-as.formula('vec[vec>0] ~ 1') # <<- pushed to R_GlobalEnv bc <-boxcox(lm(formula = bc_formula), plotit=plot) lambda <- bc$x[which.max(bc$y)] vec_boxcox <- (vec ^ lambda -1) / lambda out <-list(vec=vec_boxcox, lambda=lambda)return(out)}# reverse the boxcox transformation to interpret resultsget_boxcox_exp <-function(vec, lambda){ out <- ((vec * lambda) +1) ^ (1/lambda)return(out)}
Finally, we split the data back into training and test sets and remove the source_flag.
Code
# split back to train and test, remove source flagdf_train <- df_prep %>%filter(source_train ==1) %>% dplyr::select(!source_train)df_test <- df_prep %>%filter(source_train ==0) %>% dplyr::select(!source_train)
3. Build Models
We’ll build datasets to train Linear and Logistic Regression with validation hold-outs, and create one variant with a Box-Cox transformation of the target_amt response variable for the Linear model.
Code
# response variable for Linear Regression is target_amt. Filter out non-crash data.df_lm <- df_train %>%filter(target_flag==1) %>% dplyr::select(!target_flag)df_lm_bc <- df_lm# box-cox transform the response variable only (see Step 2) bc_obj <-get_boxcox_dist(df_lm_bc$target_amt, TRUE)
Code
df_lm_bc$target_amt <- bc_obj[['vec']] lm_bc_lambda <- bc_obj[['lambda']]# response variable for Logistic Regression is target_flag. Filter out payout data.df_lr <- df_train %>% dplyr::select(!target_amt)
The Box-Cox transformation on target_amt (lambda = 0.02) makes the distribution of our response variable much more normal.
Code
hist(df_lm$target_amt, main='target_amt')hist(df_lm_bc$target_amt, main='target_amt with BC trans')
We’ll create validation holdouts to assess model performance.
This regression produces statistically significant results for three predictors, bluebook, sex_m and revoked_yes. The coefficients suggest a positive linear relationship between the amount of an insurance payout and the value of the vehicle, the male gender of the driver, and a negative relationship with a driver’s license having been previously revoked.
The negative relationship with license revocation seems counter-intuitive at first; but it is possible there could be some correlation or collinearity with bluebook value. Perhaps accident-prone drivers are operating less expensive vehicles! It could merit a separate analysis, but for now we’ll just leave it in, as it is adding predictive power to the model.
The Adjusted R-Squared is quite low (0.014) meaning this model only explains 1.4% of the total variance in the response variable target_amt. Additionally, an examination of the residuals indicates some of the key assumptions for linear regression are not met. The Residuals vs Fitted plot does not show a constant variability of the residuals, and the Q-Q plot indicates a great deal of skew and non-normality.
3b. Multiple Linear Regression II
For our second linear regression, we’ve used the Box-Cox method to transform the response variable target_amt (with a lambda parameter value of 0.02) along with stepwise feature selection to try and improve on the prior results.
Code
# build model on box-cox transformed datalm2_all <-lm(target_amt ~ ., data = df_lm_bc_train)# select features and refit by backwards stepwise selection (AIC)lm2 <-stepAIC(lm2_all, trace=FALSE, direction='backward')# residual plotsplot(lm2)
The resulting model is much more parsimonious than the first, with statistically significant results for three predictors, bluebook, mvr_pts and mstatus_yes.
The Adjusted R-Squared is better than Model 1 but still very low (0.0167) meaning this model only explains about 1.7% of the total variance in the response variable target_amt. However, an examination of the residuals indicates most of the key assumptions for linear regression are met - the Residuals vs Fitted plot shows a more constant variability of the residuals, and the Q-Q plot indicates a greater level of normality.
The summary table includes the estimate transformed to original scale for easier interpretation.In this case, the ‘base’ target_amt would be estimated at $3,434.76 with an increase in 1% per each dollar of bluebook value, a 1.07% increase if the driver were male, and 0.9% decrease if the driver were married.
3d. Binary Logistic Regression I
We’ll train our first Binary Logistic Regression model on all 37 predictors.
Code
# train model on all predictorslr1 <-glm(target_flag ~ . , data = df_lr_train, family =binomial(link = logit))
score
Sensitivity 0.816
Specificity 0.659
Pos Pred Value 0.920
Neg Pred Value 0.426
Precision 0.920
Recall 0.816
F1 0.865
Prevalence 0.828
Detection Rate 0.676
Detection Prevalence 0.735
Balanced Accuracy 0.738
term
estimate
p.value
urbanicity_highly_urban_urban
2.38
0.00
revoked_yes
0.86
0.00
car_use_private
-0.79
0.00
(Intercept)
-2.44
0.00
car_type_sports_car
1.08
0.00
travtime
0.01
0.00
mvr_pts
0.10
0.00
clm_freq
0.22
0.00
tif
-0.05
0.00
car_type_suv
0.75
0.00
kidsdriv
0.40
0.00
mstatus_yes
-0.49
0.00
job_manager
-0.83
0.00
car_type_pickup
0.52
0.00
car_type_van
0.60
0.00
bluebook
0.00
0.00
education_bachelors
-0.45
0.00
car_type_panel_truck
0.62
0.00
home_val
0.00
0.00
parent1_yes
0.40
0.00
oldclaim
0.00
0.00
income
0.00
0.00
job_doctor
-0.67
0.03
job_unknown
-0.37
0.07
education_masters
-0.32
0.09
job_lawyer
-0.19
0.35
job_clerical
0.11
0.36
job_professional
-0.11
0.39
sex_m
0.10
0.40
age
0.00
0.41
education_phd
-0.17
0.46
car_age
-0.01
0.53
job_student
-0.08
0.55
red_car_yes
-0.03
0.74
education_high_school
-0.03
0.78
yoj
0.00
0.78
job_home_maker
-0.02
0.90
homekids
0.00
0.93
This regression produces statistically significant results for twenty-two predictors, with a variety of positive and negative effects on target_flag.
The Accuracy is 0.738, Sensitivity is 0.816 and Specificity is 0.659. The AUC is 0.812.
3e. Binary Logistic Regression II
Binary Logistic Regression - predictors from stepwise feature selection.
Code
# build model on all predictorslr2_all <-glm(target_flag ~ ., data = df_lr_train, family =binomial(link = logit))# select features and refit by backwards stepwise selection (AIC)lr2_step <-blr_step_aic_backward(lr2_all)# fit model to get metricslr2 <-glm(lr2_step$model$formula, data = df_lr_train, family =binomial(link = logit))
score
Sensitivity 0.814
Specificity 0.657
Pos Pred Value 0.921
Neg Pred Value 0.419
Precision 0.921
Recall 0.814
F1 0.864
Prevalence 0.831
Detection Rate 0.677
Detection Prevalence 0.735
Balanced Accuracy 0.736
term
estimate
p.value
urbanicity_highly_urban_urban
2.39
0.00
(Intercept)
-2.60
0.00
car_use_private
-0.82
0.00
car_type_sports_car
1.02
0.00
revoked_yes
0.86
0.00
car_type_suv
0.69
0.00
travtime
0.01
0.00
mvr_pts
0.10
0.00
clm_freq
0.21
0.00
kidsdriv
0.40
0.00
tif
-0.05
0.00
job_manager
-0.75
0.00
mstatus_yes
-0.49
0.00
education_bachelors
-0.47
0.00
bluebook
0.00
0.00
car_type_van
0.62
0.00
car_type_pickup
0.51
0.00
parent1_yes
0.43
0.00
education_masters
-0.46
0.00
car_type_panel_truck
0.66
0.00
income
0.00
0.00
home_val
0.00
0.00
oldclaim
0.00
0.00
job_doctor
-0.54
0.05
job_clerical
0.16
0.09
education_phd
-0.28
0.12
job_unknown
-0.24
0.14
This regression produces statistically significant results for twenty-three predictors, with a variety of positive and negative effects on target_flag.
The Accuracy is 0.736, Sensitivity is 0.814 and Specificity is 0.657. The AUC is also 0.812.
3f. Binary Logistic Regression III
Binary Logistic Regression - predictors from random forest feature selection (RFE).
Code
# train randomforest and examine importance of predictorslr3_rf <-randomForest(target_flag ~ ., data = df_lr_train)# select top N predictorspredictors_rf <-as.data.frame(importance(lr3_rf)) %>%top_n(20, MeanDecreaseGini) %>%rownames() %>%append('target_flag')# make new training datasetdf_lr3 <- df_lr_train %>% dplyr::select(all_of(predictors_rf))# train modellr3 <-glm(target_flag ~ . , data = df_lr3, family =binomial(link = logit))
score
Sensitivity 0.808
Specificity 0.651
Pos Pred Value 0.924
Neg Pred Value 0.392
Precision 0.924
Recall 0.808
F1 0.862
Prevalence 0.840
Detection Rate 0.679
Detection Prevalence 0.735
Balanced Accuracy 0.730
term
estimate
p.value
urbanicity_highly_urban_urban
2.30
0.00
car_use_private
-0.78
0.00
revoked_yes
0.86
0.00
(Intercept)
-1.93
0.00
clm_freq
0.22
0.00
mvr_pts
0.10
0.00
travtime
0.01
0.00
tif
-0.05
0.00
job_manager
-0.73
0.00
bluebook
0.00
0.00
kidsdriv
0.38
0.00
mstatus_yes
-0.47
0.00
income
0.00
0.00
home_val
0.00
0.00
parent1_yes
0.39
0.00
oldclaim
0.00
0.00
education_high_school
0.23
0.00
car_age
-0.02
0.00
homekids
0.02
0.55
age
0.00
0.57
yoj
0.00
0.71
Random Forest-based feature selection results in about 17 predictors with statistical significance, more parsimonious than the prior models.
The Accuracy is 0.73, Sensitivity is 0.808 and Specificity is 0.651. The AUC is 0.802.
4. Model Selection
Linear Regression Models
To compare the Linear Regression models, we’ll examine Adjusted R-Squared, Root Mean Squared Error and evaluate whether the models meet the general assumptions for linear regression.
Model 1
Model 2
Model 1 includes all 37 predictors (3 with statistical significance) for an Adjusted R-Squared of 0.014 and RMSE of 7233.
Model 2 uses a Box-Cox transformed response variable and includes 3 predictors (3 with statistical significance) for an Adjusted R-Squared of 0.017 and RMSE of 8900.
While the first model produces a lower RMSE, the residual plots demonstrate that several linear regression assumptions are not met. The second model, while producing a higher RMSE, does a much better job of meeting our linear regression assumptions around normality and constant variability of residuals.
The Adjusted R-Squared, or amount of variability explained by the model, is higher for the second model than the first - although the results are relatively low in both cases.
With this understanding, we’ll select Linear Regression Model 2 for our predictions.
Logistic Regression Models
To compare the Logistic Regression models, we’ll build a Confusion Matrix and calculate classification metrics for each, and inspect their ROC curves and the associated AUC metrics.
Model 1
score
Sensitivity 0.816
Specificity 0.659
Pos Pred Value 0.920
Neg Pred Value 0.426
Precision 0.920
Recall 0.816
F1 0.865
Prevalence 0.828
Detection Rate 0.676
Detection Prevalence 0.735
Balanced Accuracy 0.738
Model 2
score
Sensitivity 0.814
Specificity 0.657
Pos Pred Value 0.921
Neg Pred Value 0.419
Precision 0.921
Recall 0.814
F1 0.864
Prevalence 0.831
Detection Rate 0.677
Detection Prevalence 0.735
Balanced Accuracy 0.736
Model 3
score
Sensitivity 0.808
Specificity 0.651
Pos Pred Value 0.924
Neg Pred Value 0.392
Precision 0.924
Recall 0.808
F1 0.862
Prevalence 0.840
Detection Rate 0.679
Detection Prevalence 0.735
Balanced Accuracy 0.730
In general, the Logistic Regression models appear to be more successful than Linear Regression in working with this dataset; AUC scores are around 80%, while our LR models were only able to account for under 2% of variability in the target.
Model 1 and Model 2 both produce AUC scores of 0.812, but Model 1 has a slight edge in prediction accuracy and F1 score. The error rate for Model 2 is 0.264 whereas for Model 1 it is 0.262.
Model 3 produces a lower AUC score of 0.802, and has a lower Accuracy and F1 score than the first two models as well.
With this understanding, we’ll select Logistic Regression Model 1 for our predictions of the target_flag response variable.
5. Predictions
Finally, let’s made some predictions on our test split of the dataset. We’ll use the Logistic Regression model to predict whether a crash occurs; and if so, we’ll use the Linear Regression model to predict the cost.
Code
df_pred <- df_test# predict crashdf_pred$target_flag <-round(predict(lr1, df_test, type='response'),0)# predict costdf_pred$target_amt <-predict(lm2, df_test)# convert target_amt to original scale; 0 if no crash predicted.df_pred <- df_pred %>%mutate(target_amt =get_boxcox_exp(target_amt, lm_bc_lambda)) %>%mutate(target_amt =round(ifelse(target_flag ==0, 0, target_amt),2))
Since our Linear Regression model used a transformed response variable (using Box-Cox to determine a lambda parameter of 0.0202), we add one additional step to transform the response back to the original scale (i.e. the actual dollar amount.)
For instances where a crash event is predicted, the ‘base’ target_amt is estimated at $3,434.76 with an increase in 1% per each dollar of bluebook value, a 1.07% increase if the driver is male, and 0.9% decrease if the driver is married. Let’s look at a random sample, and graph the relationships: