library(tidyverse)
library(forcats)
library(modelr)
library(skimr)
library(knitr)
library(kableExtra)
library(broom)
library(caTools)
library(pscl)
library(ROCR)
library(MASS)
library(caret)
library(car)
library(DT)
library(GGally)
library(psych)
library(corrplot)
library(reshape)
library(ggthemes)
library(moments)
library(qqplotr)
library(gridExtra)
library(geoR)
library(alr3)
library(pROC)

Analysis

The objective of this analysis is to build multiple linear regression and binary logistic regression models on the training data to predict (1) the probability that a person will crash their car and (2) the amount it will cost if they do. A summary of the data set provided for this analysis is set forth below:
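Conceptually, the two models combine into an expected cost per policy: the crash probability from the logistic model times the predicted cost from the linear model. A minimal sketch of that combination, assuming hypothetical fitted models mod_class and mod_amt and a scored data set new_data:

# Sketch only: combining the two models into an expected payout per policy.
# mod_class (logistic) and mod_amt (linear) are hypothetical placeholders
# for the models fit later in this analysis.
p_crash <- predict(mod_class, newdata = new_data, type = "response")
cost_given_crash <- predict(mod_amt, newdata = new_data)
expected_cost <- p_crash * cost_given_crash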

VARIABLE_NAME DEFINITION THEORETICAL EFFECT
INDEX Identification Variable (do not use) None
TARGET_FLAG Was Car in a crash? 1=YES 0=NO None
TARGET_AMT If car was in a crash, what was the cost None
AGE Age of Driver Very young people tend to be risky. Maybe very old people also.
BLUEBOOK Value of Vehicle Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_AGE Vehicle Age Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_TYPE Type of Car Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_USE Vehicle Use Commercial vehicles are driven more, so might increase probability of collision
CLM_FREQ # Claims (Past 5 Years) The more claims you filed in the past, the more you are likely to file in the future
EDUCATION Max Education Level Unknown effect, but in theory more educated people tend to drive more safely
HOMEKIDS # Children at Home Unknown effect
HOME_VAL Home Value In theory, home owners tend to drive more responsibly
INCOME Income In theory, rich people tend to get into fewer crashes
JOB Job Category In theory, white collar jobs tend to be safer
KIDSDRIV # Driving Children When teenagers drive your car, you are more likely to get into crashes
MSTATUS Marital Status In theory, married people drive more safely
MVR_PTS Motor Vehicle Record Points If you get lots of traffic tickets, you tend to get into more crashes
OLDCLAIM Total Claims (Past 5 Years) If your total payout over the past five years was high, this suggests future payouts will be high
PARENT1 Single Parent Unknown effect
RED_CAR A Red Car Urban legend says that red cars (especially red sports cars) are more risky. Is that true?
REVOKED License Revoked (Past 7 Years) If your license was revoked in the past 7 years, you probably are a more risky driver.
SEX Gender Urban legend says that women have fewer crashes than men. Is that true?
TIF Time in Force People who have been customers for a long time are usually safer.
TRAVTIME Distance to Work Long drives to work usually suggest greater risk
URBANICITY Home/Work Area Unknown
YOJ Years on Job People who stay at a job for a long time are usually safer

Data

In the Data phase of this analysis we will review the data and take any necessary clean-up measures before moving into the EDA phase of the analysis.


First let’s utilize the str and summary functions to familiarize ourselves with the data and to determine what data clean-up may be required.

Structure

str(df)


Summary

summary(df)

Target Variables


Next we’ll plot our two target variables to get a better understanding of the data set. Moving left to right, we plot a bar plot of TARGET_FLAG and then box plots of TARGET_AMT with and without outliers.


df %>% 
  mutate(TARGET_FLAG = as.factor(TARGET_FLAG)) %>%
  ggplot(aes(x=TARGET_FLAG,fill=TARGET_FLAG)) +
  geom_bar() + scale_y_continuous() +
  theme_fivethirtyeight() +
  theme(legend.position = "none") +
  labs(x="TARGET_FLAG", y="n()") +
  labs(title="TARGET_FLAG") -> p1



df %>% filter(TARGET_FLAG == 1) %>%
  ggplot(aes(x=1, y=TARGET_AMT)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape=23, size = 3, fill="#FF2700") +
  scale_x_continuous(breaks=NULL) +
  theme_fivethirtyeight() +
  labs(title="TARGET_AMT") -> p2



df %>% filter(TARGET_FLAG == 1) %>%
  ggplot(aes(x=1, y=TARGET_AMT)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape=23, size = 3, fill="#008FD5") +
  #geom_boxplot(outlier.shape = NA) +
  scale_y_continuous(limits = quantile(df$TARGET_AMT[df$TARGET_FLAG == 1], c(0.1, 0.9))) +
  scale_x_continuous(breaks=NULL) +
  theme_fivethirtyeight() +
  labs(title="TARGET_AMT - No outliers") -> p3

grid.arrange(p1, p2, p3, ncol=3)

Data Observations

Here are some observations from this initial data review that will need to be addressed:

  • TARGET_FLAG appears to be imbalanced, with almost 3X as many no-accident rows as accident rows.
  • TARGET_AMT has some outliers that we will have to address in data prep, model selection or both.
  • NAs in explanatory variables - several variables have NAs (YOJ, CAR_AGE, AGE).
  • Dollar signs ($) - several variables contain dollar signs and commas that cause them to be treated as factors.

We’ll use the tidyverse’s dplyr package to address some of the data clean-up items so we can continue our EDA and extend it to the explanatory variables. We’ll make the following changes to the data set and then have another look at the data summaries.

  • Remove characters from numeric variables and cast them as numeric (INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM)
  • INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM have also been rescaled by dividing by 1000
  • Remove random characters from variables (z_, <, etc.)
  • Replace NAs with median values
  • Replace negative CAR_AGE value(s) with the median CAR_AGE

Finally, we will correct the imbalance in the TARGET_FLAG variable, filter out no-accident rows for the TARGET_AMT models and create training and test data sets for our analysis.


df <- df %>% 
  # Strip $ and , from the dollar fields, cast to numeric and rescale to $000s
  mutate(INCOME = as.numeric(str_replace_all(INCOME, '([\\$,])', '')) / 1000) %>%
  mutate(HOME_VAL = as.numeric(str_replace_all(HOME_VAL, '([\\$,])', '')) / 1000) %>%
  mutate(BLUEBOOK = as.numeric(str_replace_all(BLUEBOOK, '([\\$,])', '')) / 1000) %>%
  mutate(OLDCLAIM = as.numeric(str_replace_all(OLDCLAIM, '([\\$,])', '')) / 1000) %>%
  # Remove the stray characters (z_, <) from the factor variables
  mutate(MSTATUS = as.factor(str_replace_all(MSTATUS, '([z_<])', ''))) %>%
  mutate(SEX = as.factor(str_replace_all(SEX, '([z_<])', ''))) %>%
  mutate(EDUCATION = as.factor(str_replace_all(EDUCATION, '([z_<])', ''))) %>%
  mutate(JOB = as.factor(str_replace_all(JOB, '([z_<])', ''))) %>%
  mutate(CAR_TYPE = as.factor(str_replace_all(CAR_TYPE, '([z_<])', ''))) %>%
  mutate(URBANICITY = as.factor(str_replace_all(URBANICITY, '([z_<])', ''))) %>%
  # Impute missing (and negative CAR_AGE) values with the median
  mutate(CAR_AGE = if_else(is.na(CAR_AGE), median(CAR_AGE, na.rm = TRUE), CAR_AGE)) %>%
  mutate(CAR_AGE = if_else(CAR_AGE < 0, median(CAR_AGE, na.rm = TRUE), CAR_AGE)) %>%
  mutate(YOJ = if_else(is.na(YOJ), median(YOJ, na.rm = TRUE), YOJ)) %>%
  mutate(AGE = if_else(is.na(AGE), median(AGE, na.rm = TRUE), AGE)) %>%
  mutate(INCOME = if_else(is.na(INCOME), median(INCOME, na.rm = TRUE), INCOME)) %>%
  mutate(HOME_VAL = if_else(is.na(HOME_VAL), median(HOME_VAL, na.rm = TRUE), HOME_VAL))
  

# Order the EDUCATION factor levels
df$EDUCATION <- ordered(df$EDUCATION, levels = c("High School", "Bachelors", "Masters", "PhD"))


Revised Data

A post-data-wrangling summary of the data is set forth below. This data reflects the changes proposed in the Data Observations section. The data wrangling script is included in the Appendix.

summary(df)
# Create a 75/25 train/test split for the classification models
set.seed(10)
train_index <- createDataPartition(df$TARGET_FLAG, p = .75, list = FALSE, times = 1)
train <- df[train_index,]
test <- df[-train_index,]

# Correct the class imbalance by oversampling the minority (accident) class
set.seed(10)
minority <- nrow(train[train$TARGET_FLAG == 1,])
majority <- nrow(train[train$TARGET_FLAG == 0,])
diff <- majority - minority

minority_index <- train[train$TARGET_FLAG == 1,]$INDEX

over_sample_train <- data.frame(INDEX = sample(minority_index, diff, TRUE)) %>% 
  merge(train, .) %>% 
  bind_rows(train)

# For the cost models, keep only rows with an accident and split 75/25
set.seed(42)
accidents <- df %>%
  filter(TARGET_FLAG == 1)
amt_train_index <- createDataPartition(accidents$TARGET_AMT, p = .75, list = FALSE, times = 1)
amt_train <- accidents[amt_train_index,]
amt_test <- accidents[-amt_train_index,]
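A quick tabulation of the rebalanced flag confirms the two classes are now roughly equal:

# After oversampling, the two classes should have roughly equal counts.
table(over_sample_train$TARGET_FLAG)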

Prepared Data

The three charts below (moving left to right) display the rebalanced TARGET_FLAG variable as well as box plots of TARGET_AMT (no-accident rows removed) with and without outliers.


over_sample_train %>% 
  mutate(TARGET_FLAG = as.factor(TARGET_FLAG)) %>%
  ggplot(aes(x=TARGET_FLAG,fill=TARGET_FLAG)) +
  geom_bar() + scale_y_continuous() +
  theme_fivethirtyeight() +
  theme(legend.position = "none") +
  labs(x="TARGET_FLAG", y="n()") +
  labs(title="TARGET_FLAG") -> p10



amt_train %>%
  ggplot(aes(x=1, y=TARGET_AMT)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape=23, size = 3, fill="#FF2700") +
  scale_x_continuous(breaks=NULL) +
  theme_fivethirtyeight() +
  labs(title="TARGET_AMT") -> p20



amt_train %>%
  ggplot(aes(x=1, y=TARGET_AMT)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape=23, size = 3, fill="#008FD5") +
  #geom_boxplot(outlier.shape = NA) +
  scale_y_continuous(limits = quantile(amt_train$TARGET_AMT, c(0.1, 0.9))) +
  scale_x_continuous(breaks=NULL) +
  theme_fivethirtyeight() +
  labs(title="TARGET_AMT - No outliers") -> p30

grid.arrange(p10, p20, p30, ncol=3)

EDA

We are now ready to proceed to the EDA phase of our analysis. We will employ a visual EDA approach to better understand the data. Charts for each explanatory variable are set forth below.



# Display a tailored panel of plots for each explanatory variable,
# depending on the variable's type.
for (var in names(over_sample_train)) {

  plot_df <- over_sample_train
  plot_df$x <- plot_df[[var]]

  # Density of the variable, colored by TARGET_FLAG
  p <- ggplot(plot_df, aes(x, color = factor(TARGET_FLAG))) +
    geom_density() +
    theme_fivethirtyeight() +
    labs(title = str_c(var, " by target"))

  # Box plot of log(TARGET_AMT) across the variable's levels
  b <- ggplot(plot_df, aes(factor(x), log(TARGET_AMT))) + 
    geom_boxplot() +
    theme_fivethirtyeight() +
    labs(title = str_c(var, ' by Log(TARGET_AMT)'), x = var) + 
    stat_summary(fun = mean, colour = "darkred", geom = "point", shape = 16, size = 2)

  # Proportion bar chart of the variable, split by TARGET_FLAG
  bc <- ggplot(plot_df, aes(x, fill = factor(TARGET_FLAG))) + 
    geom_bar(aes(y = after_stat(prop)), position = "dodge", alpha = 0.5) +
    theme_fivethirtyeight() +
    labs(title = str_c(var, ' by TARGET_FLAG')) + 
    theme(plot.title = element_text(hjust = 0.5), axis.title.y = element_text(size = 10)) + 
    theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
    theme(legend.title = element_text(size = 8), legend.text = element_text(size = 7))

  # Box plot of the variable by TARGET_FLAG
  b2 <- ggplot(plot_df, aes(x = factor(TARGET_FLAG), y = x)) + 
    geom_boxplot(outlier.size = 0, alpha = 1.0) +
    labs(title = str_c(var, " by TARGET_FLAG"), x = "TARGET_FLAG", y = var) + 
    theme_fivethirtyeight() + 
    stat_summary(fun = mean, colour = "darkred", geom = "point", shape = 16, size = 2) +
    theme(plot.title = element_text(hjust = 0.5))

  # Proportion bar chart of the variable's levels, ordered by frequency
  bc2 <- ggplot(plot_df, aes(x = reorder(x, x, function(x) length(x)))) + 
    geom_bar(aes(y = after_stat(prop), group = 1)) + 
    theme_fivethirtyeight() +
    labs(title = str_c(var, ' Prop'), x = var) +
    theme(plot.title = element_text(hjust = 0.5), axis.title.y = element_text(size = 10)) +
    theme(legend.title = element_text(size = 8), legend.text = element_text(size = 7)) + 
    scale_x_discrete() + 
    coord_flip()

  # Hex bin plot of the variable against log(TARGET_AMT)
  h <- ggplot(plot_df, aes(x, log(TARGET_AMT))) +
    geom_hex(bins = 50) +
    geom_smooth(method = "loess", size = 1, color = "red") +
    labs(title = str_c(var, " by Log(TARGET_AMT)"), x = var) +
    theme_fivethirtyeight()

  if (var != 'INDEX' & var != 'TARGET_FLAG' & var != 'TARGET_AMT') {

    # Group 1: discrete count variables
    if (var %in% c('KIDSDRIV', 'HOMEKIDS', 'CLM_FREQ', 'MVR_PTS')) {
      grid.arrange(bc, b2, b, ncol = 3)
    }

    # Group 2: continuous variables
    if (var %in% c('AGE', 'YOJ', 'TRAVTIME', 'TIF', 'CAR_AGE')) {
      grid.arrange(bc, b2, h, ncol = 3)
    }

    # Group 3: binary factors
    if (var %in% c('PARENT1', 'MSTATUS', 'SEX', 'CAR_USE', 'RED_CAR', 'REVOKED', 'URBANICITY')) {
      grid.arrange(b, ncol = 1)
    }

    # Group 4: multi-level factors
    if (var %in% c('JOB', 'CAR_TYPE', 'EDUCATION')) {
      grid.arrange(bc2, b, ncol = 2)
    }

    # Group 5: dollar-valued variables
    if (var %in% c('INCOME', 'HOME_VAL', 'BLUEBOOK', 'OLDCLAIM')) {
      grid.arrange(p, b2, h, ncol = 3)
    }
  }
}

EDA Observations:

A cursory review of the EDA visualizations leaves one with the impression that this is a very difficult data set. The following are some selected EDA observations:

KIDSDRIV - the number of teenagers driving in the household

The KIDSDRIV variable has significantly more households with no kids driving - this creates a strong right skew in the variable. The data for households with teenage drivers support the premise that teenage drivers increase the chance of an accident. We might consider changing the levels of this variable by combining all homes with teenage drivers versus those without (we do exactly that with KIDSDRIV2 in Feature Engineering). It’s less clear, given an accident, that teenage drivers increase the cost of the accident. Our box plot shows what appears to be a higher cost in two of the categories (1 driver and 4 drivers), but 2 and 3 seem on par with or below the overall average.

AGE - age of driver

The AGE variable is near-normal for both crash and no-crash groups. Our bar chart and box plot support the idea that younger people are involved in more accidents. The hex bin plot offers some evidence that the cost of accidents is higher for young people and old people (over 60), but the preponderance of accidents occur for individuals in the middle (light blue in the hex bins).

HOMEKIDS - number of kids at home

HOMEKIDS, like KIDSDRIV, shows that kids increase accidents. This is evident in the bar and box plots for HOMEKIDS. The cost of accidents for HOMEKIDS is consistent with our findings for KIDSDRIV.

YOJ - years on job

Our box plot supports the theory that insureds with more years on the job are less likely to be in an accident. The bar chart reveals a large percentage of insureds with less than a year on the job. This group showed a higher propensity to be in an accident. The hex bin plot for TARGET_AMT is difficult to interpret.

INCOME

Income’s two density plots (accident, no accident) have significant overlap. This may undermine the variable’s usefulness in the classification analysis. There does appear to be a slight negative relationship between income and accidents. The INCOME hex bin plot has a significant right skew owing to a handful of outlier incomes.

HOME_VAL

The HOME_VAL box plot shows that insureds with lower home values have more accidents. This variable might be useful in the classification model. There are a large number of insureds with zero HOME_VAL values. This could reflect renters and might be something that should be addressed in the Data Prep phase; a sketch of one approach follows.
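One option would be an explicit renter/owner indicator, so the zeros carry signal instead of skewing the distribution. A minimal sketch - the HOME_OWNER name is hypothetical and the flag is not used elsewhere in this analysis:

# Hypothetical renter/owner flag: a zero HOME_VAL is treated as a renter.
df <- df %>% 
  mutate(HOME_OWNER = if_else(HOME_VAL > 0, 1, 0))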

MSTATUS AND SEX

The MSTATUS and SEX box plots do not provide a lot of insight. The box plots for accidents and no accidents are very similar for both variables.

TRAVTIME

The TRAVTIME box plot indicates that insureds with longer travel times have more accidents. The hex bin plot indicates that there is a concentration of travel times between 29 and 50 minutes.

BLUEBOOK

The BLUEBOOK box plot supports the premise that cars with lower bluebook values are involved in more accidents. There is also significant overlap between the accident vs no-accident BLUEBOOK density plots. It’s difficult to derive much insight from the BLUEBOOK hex bin plot.

MVR_PTS

The MVR_PTS box plot clearly shows that insureds with more MVR points are in more accidents. Claim amounts seemed to be concentrated in a fairly narrow range across point levels. There is a concentration of insureds without any points.


Feature Engineering

In our data prep phase we addressed missing values, rescaled values and cleaned up data inconsistencies ($ signs and random characters). We also corrected the imbalance in TARGET_FLAG and, since our regression model will seek to determine the cost of an accident given that one occurred, we created training and evaluation data sets with the no-accident observations removed.

Box Cox Analysis

We use the Box-Cox analysis below to inform our transformation decisions. The fitted lambda suggests a power transform: a lambda near 0.5 points to a square root, while a lambda near 0 points to a log.

INCOME


#Income:
boxcoxfit(over_sample_train$INCOME[over_sample_train$INCOME>0])

HOME_VAL

#HOME_VAL
boxcoxfit(over_sample_train$HOME_VAL[over_sample_train$HOME_VAL>0])

BLUEBOOK

#BLUEBOOK
boxcoxfit(over_sample_train$BLUEBOOK[over_sample_train$BLUEBOOK>0])

OLDCLAIM

#OLDCLAIM
boxcoxfit(over_sample_train$OLDCLAIM[over_sample_train$OLDCLAIM>0])
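For reference, the helper below (a hypothetical illustration, not used elsewhere in this analysis) shows how a fitted lambda maps to a transform; lambdas near 0.5 are what motivate the square-root transforms adopted below.

# Hypothetical helper: apply the transform implied by a fitted Box-Cox lambda.
# A lambda near zero implies log(x); otherwise use the power transform.
bc_transform <- function(x, lambda, eps = 0.05) {
  if (abs(lambda) < eps) log(x) else (x^lambda - 1) / lambda
}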


Based upon our EDA and the Box-Cox analysis above, we will employ some additional data transformations in our analysis. In some cases we create two different transforms - one for the classification analysis and one for the linear analysis. The transformations follow:

  • TARGET_AMT - we will use log(TARGET_AMT) in our regression analysis.
  • KIDSDRIV - create KIDSDRIV2, which collapses observations with kids driving into one bucket.
  • KIDSDRIV - create KIDSDRIV_T, a centered version, to address the potential curvature in relation to TARGET_AMT.
  • HOMEKIDS - create HOMEKIDS2, which collapses observations with kids at home into one bucket.
  • HOMEKIDS - similar to KIDSDRIV, we’ll create HOMEKIDS_T, a centered version.
  • YOJ - create YOJ_T, centered with the scale function, to address the large percentage of zero YOJ observations.
  • AGE - create AGE2, which breaks AGE into three buckets: young, middle and old.
  • AGE - create AGE_T, which centers the distribution with the scale function.
  • INCOME - create INCOME2, which will be 1 if income is below 50K and 0 otherwise.
  • INCOME - create INCOME_T, which uses a square root transform based upon Box-Cox.
  • HOME_VAL - create HOME_VAL_T, which uses a log transform (log(HOME_VAL + 1)).
  • BLUEBOOK - create BLUEBOOK_T, which uses a square root transform based upon Box-Cox.
  • OLDCLAIM - create OLDCLAIM_T, which applies a log transform to address significant right skew.
  • OLDCLAIM - create OLDCLAIM2, which collapses OLDCLAIM into a 0/1 variable where 1 means there’s been a claim.

#This block of code applies the transforms discussed in the feature engineering
#section. The same recipe is applied to each data set, so we wrap it in a
#function rather than repeating the pipeline four times.

apply_transforms <- function(d) {
  d %>% 
    mutate(KIDSDRIV_T = scale(KIDSDRIV, scale = FALSE)) %>% 
    mutate(HOMEKIDS_T = scale(HOMEKIDS, scale = FALSE)) %>%
    mutate(KIDSDRIV2 = if_else(KIDSDRIV <= 0, 0, 1)) %>%
    mutate(HOMEKIDS2 = if_else(HOMEKIDS <= 0, 0, 1)) %>%
    mutate(TRAVTIME_T = scale(TRAVTIME, scale = FALSE)) %>%
    mutate(TIF = scale(TIF, scale = FALSE)) %>%
    mutate(YOJ_T = scale(YOJ, scale = FALSE)) %>% 
    mutate(AGE_T = scale(AGE, scale = FALSE)) %>%
    mutate(AGE2 = if_else(AGE < 20, 1, if_else(AGE > 59, 2, 0))) %>%
    mutate(INCOME2 = if_else(INCOME < 50, 1, 0)) %>% 
    mutate(INCOME_T = INCOME^0.5) %>% 
    mutate(BLUEBOOK_T = BLUEBOOK^0.5) %>% 
    mutate(HOME_VAL_T = log(HOME_VAL + 1)) %>%
    mutate(OLDCLAIM_T = log(OLDCLAIM + 1)) %>%
    mutate(CLM_FREQ = scale(CLM_FREQ, scale = FALSE)) %>% 
    mutate(OLDCLAIM2 = if_else(OLDCLAIM <= 0, 0, 1))
}

over_sample_train <- apply_transforms(over_sample_train)
test <- apply_transforms(test)
amt_train <- apply_transforms(amt_train)
amt_test <- apply_transforms(amt_test)


Transformed Data Set

The summary table below shows our transformed data set, which includes both the original and transformed variables:

summary(over_sample_train)

Note - VIF scores were calculated but no significant issues were identified.
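For completeness, a sketch of that VIF check (car is already loaded; the predictor subset here is illustrative):

# Illustrative VIF check on a subset of the numeric predictors; values above
# roughly 5-10 would flag problematic collinearity.
vif_fit <- lm(TARGET_AMT ~ AGE + INCOME + HOME_VAL + BLUEBOOK + TRAVTIME + MVR_PTS + TIF, data = over_sample_train)
car::vif(vif_fit)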

Model Building

We will now commence our model building.

Classification Model - Odds of Accident

Mod1 - this model will simply establish a baseline. Variable selection was informed by common sense and intuition.

mod1 <- glm(TARGET_FLAG ~ KIDSDRIV2 + AGE2 + INCOME2 + HOMEKIDS2 + OLDCLAIM2 + REVOKED + MVR_PTS + EDUCATION + CAR_USE + URBANICITY, data =over_sample_train, family="binomial" )

 



tidym1 <- tidy(mod1)

tidym1 %>% 
  kable() %>% 
  kable_styling()

tdf1 <- glance(mod1)

Model 1 Evaluation

tdf1 %>% 
  kable() %>% 
  kable_styling()


Mod2 - this model takes the significant variables from mod1 and adds some quadratic transforms.

mod2 <- glm(TARGET_FLAG ~ KIDSDRIV2 + AGE2 + INCOME2 + HOMEKIDS2 + OLDCLAIM2 + REVOKED + MVR_PTS + EDUCATION + CAR_USE + URBANICITY + INCOME_T + HOME_VAL + PARENT1 + MSTATUS + JOB + BLUEBOOK_T + BLUEBOOK + TIF + CAR_TYPE + TRAVTIME + I(AGE_T^2), data =over_sample_train, family="binomial" )

 

tidym2 <- tidy(mod2)

tidym2 %>% 
  kable() %>% 
  kable_styling()

Model 2 Evaluation


tdf2 <- glance(mod2)

tdf2 %>% 
  kable() %>% 
  kable_styling()

Mod3 - removes the non-significant variables from mod2 and adds I(CLM_FREQ^0.5).

mod3 <- glm(TARGET_FLAG ~ KIDSDRIV2 + OLDCLAIM2 + REVOKED + MVR_PTS + EDUCATION + CAR_USE + URBANICITY + INCOME_T + HOME_VAL + PARENT1 + MSTATUS + JOB + BLUEBOOK_T + BLUEBOOK + TIF + CAR_TYPE + TRAVTIME + I(AGE_T^2) + I(CLM_FREQ^0.5), data =over_sample_train, family="binomial" )

 


tidym3 <- tidy(mod3)

tidym3 %>% 
  kable() %>% 
  kable_styling()

Model 3 Evaluation


tdf3 <- glance(mod3)

tdf3 %>% 
  kable() %>% 
  kable_styling()

Mod4 - this will be the stepwise regression.


# Stepwise selection in both directions between the null model and the
# kitchen-sink model (TARGET_AMT and INDEX are excluded from the predictors)
stepdat <- over_sample_train %>% 
  dplyr::select(-TARGET_AMT, -INDEX)

mod4.upper <- glm(TARGET_FLAG ~ ., family = "binomial", data = stepdat)
mod4.null <- glm(TARGET_FLAG ~ 1, family = "binomial", data = stepdat)

mod4 <- step(mod4.null, scope = list(upper = mod4.upper, lower = mod4.null), trace = 0, direction = 'both')

tidym4 <- tidy(mod4)

tidym4 %>% 
  kable() %>% 
  kable_styling()

Model 4 Evaluation


tdf4 <- glance(mod4)

tdf4 %>% 
  kable() %>% 
  kable_styling()

Linear Model - Cost of Claim

Now we will utilize the amt_train data set to model the cost of the claim. Note, the amt_train data set only includes rows that have actual claims (i.e. zero-claim rows have been removed).

Our first model will use some logical variables like BLUEBOOK, CAR_TYPE and URBANICITY, amongst others. This model will establish our baseline.

Mod 5 - Linear Regression - Base Line (logical variables)

mod5 <- lm(TARGET_AMT ~ I(BLUEBOOK_T^2) + CAR_TYPE + MSTATUS + URBANICITY + MVR_PTS + AGE + AGE2, data = amt_train)


tidym5 <- tidy(mod5)

tidym5 %>% 
  kable() %>% 
  kable_styling()

Model 5 Evaluation


tdf5 <- glance(mod5)

tdf5 %>% 
  kable() %>% 
  kable_styling()

Mod 6 Linear Regression - Kitchen Sink

mod6 <- lm(TARGET_AMT ~ ., data =amt_train)


tidym6 <- tidy(mod6)

tidym6 %>% 
  kable() %>% 
  kable_styling()

Model 6 Evaluation

tdf6 <- glance(mod6)

tdf6 %>% 
  kable() %>% 
  kable_styling()

We are getting some bad results with the linear models - mod5 and mod6 had a combined R-squared of 0.30. Not good! I suspect the problem is outliers in the data set. We identified this as an issue in the data review and now it’s time to address it. mod7 and mod8 will employ a robust regression model (rlm), which iteratively downweights observations with large residuals so that extreme claim amounts have less influence on the fit.


Mod 7: Robust 1

mod7 <- rlm(TARGET_AMT ~ I(BLUEBOOK_T^2) + CAR_TYPE + MSTATUS + URBANICITY + MVR_PTS + AGE + AGE2, amt_train, psi = psi.hampel)


tidym7 <- tidy(mod7)

tidym7 %>% 
  kable() %>% 
  kable_styling()

Model 7 Evaluation

tdf7 <- glance(mod7)

tdf7 %>% 
  kable() %>% 
  kable_styling()


Mod 8: Robust 2

mod8 <- rlm(TARGET_AMT ~ KIDSDRIV2 + OLDCLAIM2 + REVOKED + MVR_PTS + EDUCATION + CAR_USE + URBANICITY + INCOME_T + HOME_VAL + PARENT1 + MSTATUS + JOB + BLUEBOOK_T + BLUEBOOK + TIF + CAR_TYPE, amt_train, psi = psi.hampel)


tidym8 <- tidy(mod8)

tidym8 %>% 
  kable() %>% 
  kable_styling()

Model 8 Evaluation

tdf8 <- glance(mod8)

tdf8 %>% 
  kable() %>% 
  kable_styling()

Comparing Models - Classification

The choice of the best classification model is fairly straightforward. Model 3 has the best logLik, AIC, BIC and deviance. A review of the coefficients indicates that the model results make intuitive sense.


comp_class <- rbind(tdf1,tdf2,tdf3,tdf4)

comp_class %>% 
  kable() %>% 
  kable_styling()
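The in-sample fit statistics above can be complemented with an out-of-sample check. A sketch using the held-out test set (the 0.5 threshold is an assumption):

# Sketch: out-of-sample AUC and confusion matrix for the chosen model (mod3).
test_prob <- predict(mod3, newdata = test, type = "response")
pROC::auc(pROC::roc(test$TARGET_FLAG, test_prob))

pred_class <- factor(if_else(test_prob > 0.5, 1, 0), levels = c(0, 1))
caret::confusionMatrix(pred_class, factor(test$TARGET_FLAG, levels = c(0, 1)), positive = "1")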


Comparing Models - Linear

The choice of the best linear model is more difficult. Outliers in the data set make the use of an lm model difficult. This is particularly true because the outliers in the data set were not necessarily errors - the data may just naturally include outliers. Given this fact I would pick model 7 or 8, basically picking the robust regression approach for this data set.


# Compare in-sample error metrics for the four cost models
DMwR::regr.eval(amt_train$TARGET_AMT, mod5$fitted.values)
DMwR::regr.eval(amt_train$TARGET_AMT, mod6$fitted.values)
DMwR::regr.eval(amt_train$TARGET_AMT, mod7$fitted.values)
DMwR::regr.eval(amt_train$TARGET_AMT, mod8$fitted.values)

Evaluation Data Set

Each of the selected models was successfully applied to the evaluation data sets.
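A sketch of that scoring step, assuming the evaluation file has been read into eval_df (a hypothetical name) and run through the same wrangling and transform recipe as the training data:

# Hypothetical scoring of the evaluation data with the selected models.
eval_df$P_TARGET_FLAG <- predict(mod3, newdata = eval_df, type = "response")
eval_df$P_TARGET_AMT <- predict(mod8, newdata = eval_df)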

Appendix

Please click the following link to access the R Markdown for this analysis.