Homework 4

CUNY Data 621 - Spring 2022

Author

Jeff Parks

Assignment

In this homework assignment, you will explore, analyze and model a data set containing approximately 8000 records representing a customer at an auto insurance company… your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car.

The provided dataset contains two response variables:

  • target_flag: a categorical variable where 1 indicates a crash event.
  • target_amt: a numeric variable indicating the insurance payout amount for crash events.

We’ll build three multivariate logistic regression models using target_flag to predict the probability of crash events, and two multivariate linear regression models using target_amt to predict the cost of crash events.

1. Data Exploration

In order to explore summary stats and distribution characteristics of our dataset, we’ll need to first conduct some basic cleanup:

Code
# load data
df_train <- as.data.frame(read.csv('data/insurance_training_data.csv'))
df_test <- as.data.frame(read.csv('data/insurance_evaluation_data.csv'))

# union for exploration and prep
df_train <- df_train %>% mutate(source='train')
df_test <- df_test %>% mutate(source='test')
df_all <- bind_rows(df_train, df_test)
Code
# fix column labels
names(df_all) <- str_to_lower(str_replace_all(names(df_all), c(" " = "_" , "," = "", "\\*" = "", "\\(" = "", "\\)" = "", "`" = "", "\\/" = "_")))

# drop un-necessary columns
df_all <- df_all %>% dplyr::select(!index)

# coerce some columns to numeric
df_all <- df_all %>% 
  mutate(income=as.numeric(str_replace_all(income,'\\$|,',''))) %>%
  mutate(home_val=as.numeric(str_replace_all(home_val,'\\$|,',''))) %>%
  mutate(bluebook=as.numeric(str_replace_all(bluebook,'\\$|,',''))) %>%
  mutate(oldclaim=as.numeric(str_replace_all(oldclaim,'\\$|,','')))

Data structure:

'data.frame':   10302 obs. of  26 variables:
 $ target_flag: int  0 0 0 0 0 1 0 1 1 0 ...
 $ target_amt : num  0 0 0 0 0 ...
 $ kidsdriv   : int  0 0 0 0 0 0 0 1 0 0 ...
 $ age        : int  60 43 35 51 50 34 54 37 34 50 ...
 $ homekids   : int  0 0 1 0 0 1 0 2 0 0 ...
 $ yoj        : int  11 11 10 14 NA 12 NA NA 10 7 ...
 $ income     : num  67349 91449 16039 NA 114986 ...
 $ parent1    : chr  "No" "No" "No" "No" ...
 $ home_val   : num  0 257252 124191 306251 243925 ...
 $ mstatus    : chr  "z_No" "z_No" "Yes" "Yes" ...
 $ sex        : chr  "M" "M" "z_F" "M" ...
 $ education  : chr  "PhD" "z_High School" "z_High School" "<High School" ...
 $ job        : chr  "Professional" "z_Blue Collar" "Clerical" "z_Blue Collar" ...
 $ travtime   : int  14 22 5 32 36 46 33 44 34 48 ...
 $ car_use    : chr  "Private" "Commercial" "Private" "Private" ...
 $ bluebook   : num  14230 14940 4010 15440 18000 ...
 $ tif        : int  11 1 4 7 1 1 1 1 1 7 ...
 $ car_type   : chr  "Minivan" "Minivan" "z_SUV" "Minivan" ...
 $ red_car    : chr  "yes" "yes" "no" "yes" ...
 $ oldclaim   : num  4461 0 38690 0 19217 ...
 $ clm_freq   : int  2 0 2 0 2 0 0 1 0 0 ...
 $ revoked    : chr  "No" "No" "No" "No" ...
 $ mvr_pts    : int  3 0 3 0 3 0 0 10 0 1 ...
 $ car_age    : int  18 1 10 6 17 7 1 7 1 17 ...
 $ urbanicity : chr  "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" ...
 $ source     : chr  "train" "train" "train" "train" ...

Summary statistics (non-dummy variables)

                 mean        sd median       mad  min      max  skew
target_flag      0.26      0.44      0      0.00    0      1.0  1.07
target_amt    1504.32   4704.03      0      0.00    0 107586.1  8.71
kidsdriv         0.17      0.51      0      0.00    0      4.0  3.34
age             44.84      8.61     45      8.90   16     81.0 -0.03
homekids         0.72      1.12      0      0.00    0      5.0  1.34
yoj             10.47      4.11     11      2.97    0     23.0 -1.20
income       61572.07  47457.20  53529  42340.83    0 367030.0  1.16
home_val    154523.02 129188.44 160661 148308.93    0 885282.0  0.49
travtime        33.42     15.87     33     16.31    5    142.0  0.44
bluebook     15659.92   8428.77  14400   8532.36 1500  69740.0  0.77
tif              5.33      4.11      4      4.45    1     25.0  0.90
oldclaim      4033.98   8733.14      0      0.00    0  57037.0  3.12
clm_freq         0.80      1.15      0      0.00    0      5.0  1.19
mvr_pts          1.71      2.16      1      1.48    0     13.0  1.34
car_age          8.30      5.71      8      7.41   -3     28.0  0.28

Apart from our two response variables (target_flag and target_amt), we have 24 individual predictors describing characteristics of our 10,000 total observations. These include demographic measures such as age and gender, socioeconomic measures such as education and household income, and vehicle-specific metrics such as car model, age and assessed value.

Observations

First, none of our predictors appear to have linear relationships to our target_amt response variable, which is a primary assumption of linear regression. This suggests that alternative methods (such as polynomial regression) might be more successful in modeling the relationships.

Code
# scatterplot matrix of numeric cols vs target_amt
df_all[,numeric_cols] %>%
  gather(-target_amt, key="key", value="value") %>%
  ggplot(aes(x=value, y=target_amt)) +
  geom_point(alpha=0.25) + 
  facet_wrap(~ key, scales='free')

Second, many of these distributions seem highly skewed and non-normal. As part of our data preparation we’ll use a power transformation (Box-Cox) to find whether transforming variables to more normal distributions improves our models’ efficacy.

2. Data Preparation

Before we go further, we need to identify and handle any missing, NA or negative data values so we can perform log transformations and regression.

We have only a single predictor with empty data: job, a categorical describing an individual’s job title, and we’ll impute a label ‘Unknown’ for these instances.

Code
# working copy
df_prep <- df_all

# find columns with empty values
x <- df_prep %>% apply(2, function(col) sum(nchar(col)==0)) 
x[!is.na(x) & x>0] %>% as.data.frame() %>% setNames('empty')
    empty
job   665
Code
## job 665 empty rows
df_prep <- df_prep %>% mutate(job = ifelse(nchar(job)==0, 'Unknown', job))

Set any negative values such as car_age to NA, attributing them to data errors.

Code
# handle any negative values for log transformations
numeric_cols <- sapply(df_prep, is.numeric)

for (c in colnames(df_prep[, numeric_cols])){
  df_prep <- df_prep %>%
    mutate(!!c := ifelse(.data[[c]] < 0, NA, .data[[c]]))
}

We have five predictors remaining that contain NA values - Age, YOJ (years on job), Income, Home Value and Car Age:

Code
# find columns with NAs
x <- df_prep %>% apply(2, function(col) sum(is.na(col)))
x[x>0] %>% as.data.frame() %>% setNames('is.na')
            is.na
target_flag  2141
target_amt   2141
age             7
yoj           548
income        570
home_val      575
car_age       640

We’ll impute a simple median for both Age and Car Age:

Code
# AGE - impute median
df_prep <- df_prep %>% 
  mutate(age = as.integer(ifelse(is.na(age), 
  median(age, na.rm=TRUE), age))) # 45.0

# CAR_AGE - impute median
df_prep <- df_prep %>% 
  mutate(car_age = as.integer(ifelse(is.na(car_age), 
  median(car_age, na.rm=TRUE), car_age))) # 8.0

However imputing a single median for YOJ and Income might be imprecise; we can see they are correlated with Job codes:

Code
# view medians by job code
df_prep %>% 
  group_by(job) %>%
  summarize(median_yoj = median(yoj, na.rm=TRUE), 
            median_income = median(income, na.rm=TRUE)) %>%
  arrange(median_income) %>%
  kable()
job median_yoj median_income
Student 6 360.0
Home Maker 5 776.0
Clerical 12 30799.5
z_Blue Collar 12 53694.0
Professional 12 71230.5
Manager 12 78589.0
Lawyer 12 83230.5
Unknown 12 109953.0
Doctor 12 121398.0

Instead, we impute individual medians for YOJ, INCOME based on the JOB code.

Code
# impute median based on job code
df_prep <- df_prep %>% 
  group_by(job) %>% 
  mutate(yoj = ifelse(is.na(yoj), median(yoj, na.rm=TRUE), yoj)) %>%
  mutate(income = ifelse(is.na(income), median(income, na.rm=TRUE), income)) %>%
  ungroup()

We’ll assume HOME_VAL = NA actually means non-home ownership; we’ll set this to zero.

Code
# Assume 0 HOME_VAL means rental
df_prep <- df_prep %>%
  mutate(home_val = ifelse(is.na(home_val), 0, home_val))

Correct any remaining data errors and inconsistencies:

Code
## lower-case all character variables for consistency
df_prep <- df_prep %>% mutate_if(is.character, function(c) str_to_lower(c))

## correct instances of 'z_' in front of some entries.
df_prep <- df_prep %>% mutate_if(is.character, function(c) str_replace(c,'z_',''))
## education, job, car_use, car_type

## remove problematic special characters
df_prep <- df_prep %>% mutate_if(is.character, function(c) str_replace_all(c, c("/ " = "_", " " = "_", "," = "", "\\*" = "", "\\(" = "", "\\)" = "", "`" = "")))

Create dummy variables to represent the levels of variables such as parent1, mstatus, sec, education, job, car_type and urbanicity.

Code
# create dummy variables
df_prep <- df_prep %>% dummy_cols(remove_first_dummy = TRUE, remove_selected_columns = TRUE)

Convert our target_flag response variable from a numeric to factor (to enable randomforest as part of the logistic regression steps.)

Code
# coerce some columns to factor
df_prep <- df_prep %>%
  mutate(target_flag = as.factor(target_flag))

Next we’ll want to consider any power transformations. Our numeric response variable target_amt is a good candidate for transformation as its distribution is very highly skewed, and the assumption of normality is required in order to apply linear regression.

We’ll use the Box-Cox method to identify the ideal transformation parameter for this variable, but only after we split the datasets so that non-crash observations (where target_amt is always zero) don’t skew the result.

To simplify this step, we’ve created a custom function that takes a single dataframe column as a vector, adjusts it slightly to correct for any zero values, and returns a power transformation using the best-possible value for the Box-Cox lambda:

Code
# function to retrieve box-cox-transformed vector and the lambda value
get_boxcox_dist <- function(vec, plot=FALSE){
  vec <- vec + 0.1 # avoid -Inf errors
  bc_formula <<- as.formula('vec[vec>0] ~ 1') # <<- pushed to R_GlobalEnv
  bc <- boxcox(lm(formula = bc_formula), plotit=plot)
  lambda <- bc$x[which.max(bc$y)]
  vec_boxcox <- (vec ^ lambda - 1) / lambda
  out <- list(vec=vec_boxcox, lambda=lambda)
  return(out)
}

# reverse the boxcox transformation to interpret results
get_boxcox_exp <- function(vec, lambda){
  out <- ((vec * lambda) + 1) ^ (1/lambda)
  return(out)
}

Finally, we split the data back into training and test sets and remove the source_flag.

Code
# split back to train and test, remove source flag
df_train <- df_prep %>% filter(source_train == 1) %>% dplyr::select(!source_train)
df_test <- df_prep %>% filter(source_train == 0) %>% dplyr::select(!source_train)

3. Build Models

We’ll build datasets to train Linear and Logistic Regression with validation hold-outs, and create one variant with a Box-Cox transformation of the target_amt response variable for the Linear model.

Code
# response variable for Linear Regression is target_amt. Filter out non-crash data.
df_lm <- df_train %>%
  filter(target_flag==1) %>%
  dplyr::select(!target_flag)

df_lm_bc <- df_lm

  # box-cox transform the response variable only (see Step 2)
   bc_obj <- get_boxcox_dist(df_lm_bc$target_amt, TRUE) 
Code
   df_lm_bc$target_amt <- bc_obj[['vec']]
   lm_bc_lambda <- bc_obj[['lambda']]
  
# response variable for Logistic Regression is target_flag. Filter out payout data.
df_lr <- df_train %>%
  dplyr::select(!target_amt)

The Box-Cox transformation on target_amt (lambda = 0.02) makes the distribution of our response variable much more normal.

Code
hist(df_lm$target_amt, main='target_amt')
hist(df_lm_bc$target_amt, main='target_amt with BC trans')

We’ll create validation holdouts to assess model performance.

Code
# create validation holdouts
set.seed(123)

sample_lm <- sample(c(TRUE,FALSE), nrow(df_lm), replace=TRUE, prob=c(0.85,0.15))

df_lm_train <- df_lm[sample_lm,]
df_lm_valid <- df_lm[!sample_lm,]

df_lm_bc_train <- df_lm_bc[sample_lm,]
df_lm_bc_valid <- df_lm_bc[!sample_lm,]

sample_lr <- sample(c(TRUE,FALSE), nrow(df_lr), replace=TRUE, prob=c(0.85,0.15))

df_lr_train <- df_lr[sample_lr,]
df_lr_valid <- df_lr[!sample_lr,]

3a. Multiple Linear Regression I

For our first linear regression, we’ll use all predictors. The summary table below is sorted by P-Value and Estimate absolute value.

Code
# build model on all predictors
lm1 <- lm(target_amt ~ ., data=df_lm_train)

# residual plots
plot(lm1)

Code
# summary table
cap <- paste('Adjusted R2:',round(summary(lm1)$adj.r.squared,3))
lm1 %>% tidy() %>% dplyr::select(term, estimate, `p.value`) %>% arrange(p.value, desc(abs(estimate))) %>% kable(digits = 2, caption=cap) 
# validate and calculate RMSE
lm1_valid <- predict(lm1, newdata = df_lm_valid)
lm1_eval <- bind_cols(target = df_lm_valid$target_amt, predicted=lm1_valid)
lm1_rmse <- sqrt(mean((lm1_eval$target - lm1_eval$predicted)^2)) 

# plot targets vs predicted
lm1_eval %>%
  ggplot(aes(x = target, y = predicted)) +
  geom_point(alpha = .3) +
  geom_smooth(method="lm", color='grey', alpha=.3, se=FALSE) +
  labs(title=paste('RMSE:',round(lm1_rmse,1)))
Adjusted R2: 0.014
term estimate p.value
bluebook 0.14 0.00
revoked_yes -1256.82 0.03
sex_m 1560.10 0.03
(Intercept) 3750.22 0.03
car_age -88.35 0.07
education_phd 2539.77 0.08
mstatus_yes -903.55 0.09
income -0.01 0.11
job_manager -1345.81 0.16
oldclaim 0.03 0.21
job_doctor -2563.16 0.21
car_type_sports_car 971.35 0.23
car_type_suv 848.49 0.24
education_masters 1337.89 0.25
home_val 0.00 0.35
homekids 206.24 0.36
red_car_yes -474.21 0.38
clm_freq -146.00 0.39
mvr_pts 62.57 0.40
job_home_maker -728.38 0.46
education_high_school -385.45 0.50
yoj 37.21 0.50
car_use_private -370.33 0.52
job_student -490.80 0.54
car_type_panel_truck -643.58 0.54
car_type_van -517.33 0.54
parent1_yes 303.33 0.63
job_clerical -262.99 0.68
education_bachelors 277.80 0.69
kidsdriv -135.45 0.70
age 6.90 0.76
job_lawyer -362.80 0.77
tif -11.68 0.80
job_professional 168.83 0.82
urbanicity_highly_urban_urban 146.58 0.86
travtime -1.48 0.90
job_unknown -151.48 0.90
car_type_pickup -73.77 0.91

This regression produces statistically significant results for three predictors, bluebook, sex_m and revoked_yes. The coefficients suggest a positive linear relationship between the amount of an insurance payout and the value of the vehicle, the male gender of the driver, and a negative relationship with a driver’s license having been previously revoked.

The negative relationship with license revocation seems counter-intuitive at first; but it is possible there could be some correlation or collinearity with bluebook value. Perhaps accident-prone drivers are operating less expensive vehicles! It could merit a separate analysis, but for now we’ll just leave it in, as it is adding predictive power to the model.

The Adjusted R-Squared is quite low (0.014) meaning this model only explains 1.4% of the total variance in the response variable target_amt. Additionally, an examination of the residuals indicates some of the key assumptions for linear regression are not met. The Residuals vs Fitted plot does not show a constant variability of the residuals, and the Q-Q plot indicates a great deal of skew and non-normality.

3b. Multiple Linear Regression II

For our second linear regression, we’ve used the Box-Cox method to transform the response variable target_amt (with a lambda parameter value of 0.02) along with stepwise feature selection to try and improve on the prior results.

Code
# build model on box-cox transformed data
lm2_all <- lm(target_amt ~ ., data = df_lm_bc_train)

# select features and refit by backwards stepwise selection (AIC)
lm2 <- stepAIC(lm2_all, trace=FALSE, direction='backward')

# residual plots
plot(lm2)

Code
# summary table with orig scale coefficients
cap <- paste('Adjusted R2:',round(summary(lm2)$adj.r.squared,5))
lm2 %>% tidy() %>% 
  dplyr::select(term, estimate, `p.value`) %>%
  arrange(p.value, desc(abs(estimate))) %>% 
  mutate(est_orig = get_boxcox_exp(estimate, lm_bc_lambda)) %>%
  relocate(est_orig, .after=estimate) %>%
  kable(digits = 3, caption=cap)
# validate and calculate RMSE
lm2_valid <- predict(lm2, newdata = df_lm_bc_valid)
lm2_eval <- bind_cols(target = df_lm_valid$target_amt, predicted=lm2_valid)
lm2_rmse <- sqrt(mean((lm2_eval$target - lm2_eval$predicted)^2)) 

# plot targets vs predicted
lm2_eval %>%
  ggplot(aes(x = target, y = predicted)) +
  geom_point(alpha = .3) +
  geom_smooth(method="lm", color='grey', alpha=.3, se=FALSE) +
  labs(title=paste('RMSE:',round(lm2_rmse,1)))
Adjusted R2: 0.01656
term estimate est_orig p.value
(Intercept) 8.850 3434.765 0.000
bluebook 0.000 1.000 0.000
mstatus_yes -0.095 0.909 0.033
sex_m 0.064 1.066 0.157

The resulting model is much more parsimonious than the first, with statistically significant results for three predictors, bluebook, mvr_pts and mstatus_yes.

The Adjusted R-Squared is better than Model 1 but still very low (0.0167) meaning this model only explains about 1.7% of the total variance in the response variable target_amt. However, an examination of the residuals indicates most of the key assumptions for linear regression are met - the Residuals vs Fitted plot shows a more constant variability of the residuals, and the Q-Q plot indicates a greater level of normality.

The summary table includes the estimate transformed to original scale for easier interpretation.In this case, the ‘base’ target_amt would be estimated at $3,434.76 with an increase in 1% per each dollar of bluebook value, a 1.07% increase if the driver were male, and 0.9% decrease if the driver were married.

3d. Binary Logistic Regression I

We’ll train our first Binary Logistic Regression model on all 37 predictors.

Code
# train model on all predictors
lr1 <- glm(target_flag ~ . , data = df_lr_train, family = binomial(link = logit))
Code
# get coefficients
df_lr1_coeff <- setNames(as.data.frame(exp(lr1$coeff)), c('coeff')) %>% arrange(desc(coeff))

# classification table
df_lr1_compare <- df_lr_train %>%
  dplyr::select(target=target_flag) %>%
  bind_cols(pred = round(fitted(lr1))) %>%
  bind_cols(prob = fitted(lr1)) %>%
  mutate(target = as_factor(target), pred = as_factor(pred)) 

# roc curve and auc
lr1_roc <- roc(df_lr1_compare$target, df_lr1_compare$prob, plot=TRUE, ci=TRUE, smoothed=TRUE, print.auc=TRUE)

# confusion matrix
lr1_cmatrix <- confusionMatrix(df_lr1_compare$target, df_lr1_compare$pred)
lr1_cmatrix$table
          Reference
Prediction    0    1
         0 4720  409
         1 1063  790

Code
# classification metrics
lr1_cmetrics <- as.data.frame(round(lr1_cmatrix$byClass,3)) %>% setNames('score')
lr1_cmetrics 
#
# summary table
cap <- ''
lr1 %>% tidy() %>% dplyr::select(term, estimate, `p.value`) %>% arrange(p.value, desc(abs(estimate))) %>% kable(digits = 2, caption=cap) 
                     score
Sensitivity          0.816
Specificity          0.659
Pos Pred Value       0.920
Neg Pred Value       0.426
Precision            0.920
Recall               0.816
F1                   0.865
Prevalence           0.828
Detection Rate       0.676
Detection Prevalence 0.735
Balanced Accuracy    0.738
term estimate p.value
urbanicity_highly_urban_urban 2.38 0.00
revoked_yes 0.86 0.00
car_use_private -0.79 0.00
(Intercept) -2.44 0.00
car_type_sports_car 1.08 0.00
travtime 0.01 0.00
mvr_pts 0.10 0.00
clm_freq 0.22 0.00
tif -0.05 0.00
car_type_suv 0.75 0.00
kidsdriv 0.40 0.00
mstatus_yes -0.49 0.00
job_manager -0.83 0.00
car_type_pickup 0.52 0.00
car_type_van 0.60 0.00
bluebook 0.00 0.00
education_bachelors -0.45 0.00
car_type_panel_truck 0.62 0.00
home_val 0.00 0.00
parent1_yes 0.40 0.00
oldclaim 0.00 0.00
income 0.00 0.00
job_doctor -0.67 0.03
job_unknown -0.37 0.07
education_masters -0.32 0.09
job_lawyer -0.19 0.35
job_clerical 0.11 0.36
job_professional -0.11 0.39
sex_m 0.10 0.40
age 0.00 0.41
education_phd -0.17 0.46
car_age -0.01 0.53
job_student -0.08 0.55
red_car_yes -0.03 0.74
education_high_school -0.03 0.78
yoj 0.00 0.78
job_home_maker -0.02 0.90
homekids 0.00 0.93

This regression produces statistically significant results for twenty-two predictors, with a variety of positive and negative effects on target_flag.

The Accuracy is 0.738, Sensitivity is 0.816 and Specificity is 0.659. The AUC is 0.812.

3e. Binary Logistic Regression II

Binary Logistic Regression - predictors from stepwise feature selection.

Code
# build model on all predictors
lr2_all <- glm(target_flag ~ ., data = df_lr_train, family = binomial(link = logit))

# select features and refit by backwards stepwise selection (AIC)
lr2_step <- blr_step_aic_backward(lr2_all)

# fit model to get metrics
lr2 <- glm(lr2_step$model$formula, data = df_lr_train, family = binomial(link = logit))
Code
# get coefficients
df_lr2_coeff <- setNames(as.data.frame(exp(lr2$coeff)), c('coeff')) %>% arrange(desc(coeff))

# classification table
df_lr2_compare <- df_lr_train %>%
  dplyr::select(target=target_flag) %>%
  bind_cols(pred = round(fitted(lr2))) %>%
  bind_cols(prob = fitted(lr2)) %>%
  mutate(target = as_factor(target), pred = as_factor(pred)) 

# roc curve and auc
lr2_roc <- roc(df_lr2_compare$target, df_lr2_compare$prob, plot=TRUE, ci=TRUE, smoothed=TRUE, print.auc=TRUE)

# confusion matrix
lr2_cmatrix <- confusionMatrix(df_lr2_compare$target, df_lr2_compare$pred)
lr2_cmatrix$table
          Reference
Prediction    0    1
         0 4724  405
         1 1076  777

Code
# classification metrics
lr2_cmetrics <- as.data.frame(round(lr2_cmatrix$byClass,3)) %>% setNames('score')
lr2_cmetrics
#
# summary table
cap = ''
lr2 %>% tidy() %>% dplyr::select(term, estimate, `p.value`) %>% arrange(p.value, desc(abs(estimate))) %>% kable(digits = 2, caption=cap)
                     score
Sensitivity          0.814
Specificity          0.657
Pos Pred Value       0.921
Neg Pred Value       0.419
Precision            0.921
Recall               0.814
F1                   0.864
Prevalence           0.831
Detection Rate       0.677
Detection Prevalence 0.735
Balanced Accuracy    0.736
term estimate p.value
urbanicity_highly_urban_urban 2.39 0.00
(Intercept) -2.60 0.00
car_use_private -0.82 0.00
car_type_sports_car 1.02 0.00
revoked_yes 0.86 0.00
car_type_suv 0.69 0.00
travtime 0.01 0.00
mvr_pts 0.10 0.00
clm_freq 0.21 0.00
kidsdriv 0.40 0.00
tif -0.05 0.00
job_manager -0.75 0.00
mstatus_yes -0.49 0.00
education_bachelors -0.47 0.00
bluebook 0.00 0.00
car_type_van 0.62 0.00
car_type_pickup 0.51 0.00
parent1_yes 0.43 0.00
education_masters -0.46 0.00
car_type_panel_truck 0.66 0.00
income 0.00 0.00
home_val 0.00 0.00
oldclaim 0.00 0.00
job_doctor -0.54 0.05
job_clerical 0.16 0.09
education_phd -0.28 0.12
job_unknown -0.24 0.14

This regression produces statistically significant results for twenty-three predictors, with a variety of positive and negative effects on target_flag.

The Accuracy is 0.736, Sensitivity is 0.814 and Specificity is 0.657. The AUC is also 0.812.

3f. Binary Logistic Regression III

Binary Logistic Regression - predictors from random forest feature selection (RFE).

Code
# train randomforest and examine importance of predictors
lr3_rf <- randomForest(target_flag ~ ., data = df_lr_train)

# select top N predictors
predictors_rf <- as.data.frame(importance(lr3_rf)) %>% top_n(20, MeanDecreaseGini) %>% rownames() %>% append('target_flag')

# make new training dataset
df_lr3 <- df_lr_train %>% dplyr::select(all_of(predictors_rf))

# train model
lr3 <- glm(target_flag ~ . , data = df_lr3, family = binomial(link = logit))
Code
# get coefficients
df_lr3_coeff <- setNames(as.data.frame(exp(lr3$coeff)), c('coeff')) %>% arrange(desc(coeff))

# classification table
df_lr3_compare <- df_lr_train %>%
  dplyr::select(target=target_flag) %>%
  bind_cols(pred = round(fitted(lr3))) %>%
  bind_cols(prob = fitted(lr3)) %>%
  mutate(target = as_factor(target), pred = as_factor(pred)) 

# roc curve and auc
lr3_roc <- roc(df_lr3_compare$target, df_lr3_compare$prob, plot=TRUE, ci=TRUE, smoothed=TRUE, print.auc=TRUE)

# confusion matrix
lr3_cmatrix <- confusionMatrix(df_lr3_compare$target, df_lr3_compare$pred)
lr3_cmatrix$table
          Reference
Prediction    0    1
         0 4740  389
         1 1127  726

Code
# classification metrics
lr3_cmetrics <- as.data.frame(round(lr3_cmatrix$byClass,3)) %>% setNames('score')
lr3_cmetrics
#
# summary table
cap = ''
lr3 %>% tidy() %>% dplyr::select(term, estimate, `p.value`) %>% arrange(p.value, desc(abs(estimate))) %>% kable(digits = 2, caption=cap) 
                     score
Sensitivity          0.808
Specificity          0.651
Pos Pred Value       0.924
Neg Pred Value       0.392
Precision            0.924
Recall               0.808
F1                   0.862
Prevalence           0.840
Detection Rate       0.679
Detection Prevalence 0.735
Balanced Accuracy    0.730
term estimate p.value
urbanicity_highly_urban_urban 2.30 0.00
car_use_private -0.78 0.00
revoked_yes 0.86 0.00
(Intercept) -1.93 0.00
clm_freq 0.22 0.00
mvr_pts 0.10 0.00
travtime 0.01 0.00
tif -0.05 0.00
job_manager -0.73 0.00
bluebook 0.00 0.00
kidsdriv 0.38 0.00
mstatus_yes -0.47 0.00
income 0.00 0.00
home_val 0.00 0.00
parent1_yes 0.39 0.00
oldclaim 0.00 0.00
education_high_school 0.23 0.00
car_age -0.02 0.00
homekids 0.02 0.55
age 0.00 0.57
yoj 0.00 0.71

Random Forest-based feature selection results in about 17 predictors with statistical significance, more parsimonious than the prior models.

The Accuracy is 0.73, Sensitivity is 0.808 and Specificity is 0.651. The AUC is 0.802.

4. Model Selection

Linear Regression Models

To compare the Linear Regression models, we’ll examine Adjusted R-Squared, Root Mean Squared Error and evaluate whether the models meet the general assumptions for linear regression.

Model 1

Model 2

  • Model 1 includes all 37 predictors (3 with statistical significance) for an Adjusted R-Squared of 0.014 and RMSE of 7233.

  • Model 2 uses a Box-Cox transformed response variable and includes 3 predictors (3 with statistical significance) for an Adjusted R-Squared of 0.017 and RMSE of 8900.

While the first model produces a lower RMSE, the residual plots demonstrate that several linear regression assumptions are not met. The second model, while producing a higher RMSE, does a much better job of meeting our linear regression assumptions around normality and constant variability of residuals.

The Adjusted R-Squared, or amount of variability explained by the model, is higher for the second model than the first - although the results are relatively low in both cases.

With this understanding, we’ll select Linear Regression Model 2 for our predictions.

Logistic Regression Models

To compare the Logistic Regression models, we’ll build a Confusion Matrix and calculate classification metrics for each, and inspect their ROC curves and the associated AUC metrics.

Model 1

                     score
Sensitivity          0.816
Specificity          0.659
Pos Pred Value       0.920
Neg Pred Value       0.426
Precision            0.920
Recall               0.816
F1                   0.865
Prevalence           0.828
Detection Rate       0.676
Detection Prevalence 0.735
Balanced Accuracy    0.738

Model 2

                     score
Sensitivity          0.814
Specificity          0.657
Pos Pred Value       0.921
Neg Pred Value       0.419
Precision            0.921
Recall               0.814
F1                   0.864
Prevalence           0.831
Detection Rate       0.677
Detection Prevalence 0.735
Balanced Accuracy    0.736

Model 3

                     score
Sensitivity          0.808
Specificity          0.651
Pos Pred Value       0.924
Neg Pred Value       0.392
Precision            0.924
Recall               0.808
F1                   0.862
Prevalence           0.840
Detection Rate       0.679
Detection Prevalence 0.735
Balanced Accuracy    0.730

In general, the Logistic Regression models appear to be more successful than Linear Regression in working with this dataset; AUC scores are around 80%, while our LR models were only able to account for under 2% of variability in the target.

  • Model 1 and Model 2 both produce AUC scores of 0.812, but Model 1 has a slight edge in prediction accuracy and F1 score. The error rate for Model 2 is 0.264 whereas for Model 1 it is 0.262.

  • Model 3 produces a lower AUC score of 0.802, and has a lower Accuracy and F1 score than the first two models as well.

With this understanding, we’ll select Logistic Regression Model 1 for our predictions of the target_flag response variable.

5. Predictions

Finally, let’s made some predictions on our test split of the dataset. We’ll use the Logistic Regression model to predict whether a crash occurs; and if so, we’ll use the Linear Regression model to predict the cost.

Code
df_pred <- df_test

# predict crash
df_pred$target_flag <- round(predict(lr1, df_test, type='response'),0)

# predict cost
df_pred$target_amt <- predict(lm2, df_test)

# convert target_amt to original scale; 0 if no crash predicted.
df_pred <- df_pred %>%
  mutate(target_amt = get_boxcox_exp(target_amt, lm_bc_lambda)) %>%
  mutate(target_amt = round(ifelse(target_flag == 0, 0, target_amt),2))

Since our Linear Regression model used a transformed response variable (using Box-Cox to determine a lambda parameter of 0.0202), we add one additional step to transform the response back to the original scale (i.e. the actual dollar amount.)

For instances where a crash event is predicted, the ‘base’ target_amt is estimated at $3,434.76 with an increase in 1% per each dollar of bluebook value, a 1.07% increase if the driver is male, and 0.9% decrease if the driver is married. Let’s look at a random sample, and graph the relationships:

Code
df_pred <- df_pred %>% 
  mutate(sex_m = as.factor(sex_m), mstatus_yes = as.factor(mstatus_yes))
Code
df_pred %>% 
  filter(target_flag == 1) %>% 
  dplyr::select(target_flag, target_amt, bluebook, sex_m, mstatus_yes) %>%
  sample_n(15) %>% 
  arrange(desc(target_amt)) %>%
  kable() %>% kable_styling(full_width = F)
target_flag target_amt bluebook sex_m mstatus_yes
1 4330.13 22690 1 1
1 4314.72 20030 0 0
1 4131.86 16220 0 0
1 3932.33 18950 0 1
1 3907.37 11310 0 0
1 3816.55 4510 1 0
1 3777.40 15420 0 1
1 3758.52 14980 0 1
1 3694.01 6380 0 0
1 3670.92 5830 0 0
1 3613.78 6800 1 1
1 3494.06 1500 0 0
1 3447.70 7410 0 1
1 3377.20 5600 0 1
1 3361.81 5200 0 1
Code
df_pred %>% 
  filter(target_flag == 1) %>% 
  ggplot(aes(x=bluebook, y=target_amt, color=sex_m, shape=mstatus_yes)) +
  geom_point(alpha=0.3, size=1.8) + 
  labs(title='Cost x Bluebook Value')

df_pred %>% 
  filter(target_flag == 1) %>% 
  ggplot(aes(x=sex_m, y=target_amt)) +
  geom_boxplot() +
  labs(title='Cost x Male Drivers')

df_pred %>% 
  filter(target_flag == 1) %>% 
  ggplot(aes(x=mstatus_yes, y=target_amt)) +
  geom_boxplot() + 
  labs(title='Cost x Married Drivers')