library(tidyverse)
library(forcats)
library(modelr)
library(skimr)
library(knitr)
library(kableExtra)
library(broom)
library(caTools)
library(pscl)
library(ROCR)
library(MASS)
library(caret)
library(car)
library(DT)
library(GGally)
library(psych)
library(corrplot)
library(reshape)
library(ggthemes)
library(moments)
library(qqplotr)
library(gridExtra)
library(geoR)
library(alr3)
library(pROC)

Analysis

The objective of this analysis is to build multiple linear regression and binary logistic regression models on the training data to predict (1) the probability that a person will crash their car and (2) the amount it will cost if they do. A summary of the data set provided for this analysis is set forth below:
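Conceptually, the two models combine into an expected cost per policy: the crash probability from the logistic model times the predicted cost from the linear model. A minimal sketch of that combination, assuming hypothetical fitted models mod_class and mod_amt and a scored data set new_data:

# Sketch only: combining the two models into an expected payout per policy.
# mod_class (logistic) and mod_amt (linear) are hypothetical placeholders
# for the models fit later in this analysis.
p_crash <- predict(mod_class, newdata = new_data, type = "response")
cost_given_crash <- predict(mod_amt, newdata = new_data)
expected_cost <- p_crash * cost_given_crash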

VARIABLE_NAME DEFINITION THEORETICAL EFFECT
INDEX Identification Variable (do not use) None
TARGET_FLAG Was Car in a crash? 1=YES 0=NO None
TARGET_AMT If car was in a crash, what was the cost None
AGE Age of Driver Very young people tend to be risky. Maybe very old people also.
BLUEBOOK Value of Vehicle Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_AGE Vehicle Age Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_TYPE Type of Car Unknown effect on probability of collision, but probably affects the payout if there is a crash
CAR_USE Vehicle Use Commercial vehicles are driven more, so might increase probability of collision
CLM_FREQ # Claims (Past 5 Years) The more claims you filed in the past, the more you are likely to file in the future
EDUCATION Max Education Level Unknown effect, but in theory more educated people tend to drive more safely
HOMEKIDS # Children at Home Unknown effect
HOME_VAL Home Value In theory, home owners tend to drive more responsibly
INCOME Income In theory, rich people tend to get into fewer crashes
JOB Job Category In theory, white collar jobs tend to be safer
KIDSDRIV # Driving Children When teenagers drive your car, you are more likely to get into crashes
MSTATUS Marital Status In theory, married people drive more safely
MVR_PTS Motor Vehicle Record Points If you get lots of traffic tickets, you tend to get into more crashes
OLDCLAIM Total Claims (Past 5 Years) If your total payout over the past five years was high, this suggests future payouts will be high
PARENT1 Single Parent Unknown effect
RED_CAR A Red Car Urban legend says that red cars (especially red sports cars) are more risky. Is that true?
REVOKED License Revoked (Past 7 Years) If your license was revoked in the past 7 years, you probably are a more risky driver.
SEX Gender Urban legend says that women have fewer crashes than men. Is that true?
TIF Time in Force People who have been customers for a long time are usually safer.
TRAVTIME Distance to Work Long drives to work usually suggest greater risk
URBANICITY Home/Work Area Unknown
YOJ Years on Job People who stay at a job for a long time are usually safer

Data

In the Data phase of this analysis we will review the data and take any necessary clean-up measures before moving into the EDA phase of the analysis.


First let’s utilize the str and summary functions to familiarize ourselves with the data and to determine what data clean-up may be required.

Structure

str(df)


Summary

summary(df)

Target Variables


Next we’ll plot our two target variables to get a better understanding of the data set. Moving left to right, we plot a bar plot of TARGET_FLAG and then box plots of TARGET_AMT with and without outliers.


df %>% 
  mutate(TARGET_FLAG = as.factor(TARGET_FLAG)) %>%
  ggplot(aes(x=TARGET_FLAG,fill=TARGET_FLAG)) +
  geom_bar() + scale_y_continuous() +
  theme_fivethirtyeight() +
  theme(legend.position = "none") +
  labs(x="TARGET_FLAG", y="n()") +
  labs(title="TARGET_FLAG") -> p1



df %>% filter(TARGET_FLAG == 1) %>%
  ggplot(aes(x=1, y=TARGET_AMT)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape=23, size = 3, fill="#FF2700") +
  scale_x_continuous(breaks=NULL) +
  theme_fivethirtyeight() +
  labs(title="TARGET_AMT") -> p2



df %>% filter(TARGET_FLAG == 1) %>%
  ggplot(aes(x=1, y=TARGET_AMT)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape=23, size = 3, fill="#008FD5") +
  #geom_boxplot(outlier.shape = NA) +
  scale_y_continuous(limits = quantile(df$TARGET_AMT[df$TARGET_FLAG == 1], c(0.1, 0.9))) +
  scale_x_continuous(breaks=NULL) +
  theme_fivethirtyeight() +
  labs(title="TARGET_AMT - No outliers") -> p3

grid.arrange(p1, p2, p3, ncol=3)

Data Observations

Here are some observations from this initial data review that will need to be addressed:

  • TARGET_FLAG appears to be imbalanced, with almost 3X as many no-accident rows as accident rows.
  • TARGET_AMT has some outliers that we will have to address in data prep, model selection or both.
  • NAs in explanatory variables - several variables have NAs (YOJ, CAR_AGE, AGE).
  • Dollar signs ($) - several variables contain dollar signs and commas that cause them to be treated as factors.

We’ll use the tidyverse’s dplyr package to address some of the data clean-up items so we can continue our EDA and extend it to the explanatory variables. We’ll make the following changes to the data set and then have another look at the data summaries.

  • Remove characters from numeric variables and cast them as numeric (INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM)
  • INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM have also been rescaled by dividing by 1000
  • Remove random characters from variables (z_, <, etc.)
  • Replace NAs with median values
  • Replace negative CAR_AGE value(s) with the median CAR_AGE

Finally, we will correct the imbalance in the TARGET_FLAG variable, filter out no-accident rows for the TARGET_AMT models and create training and test data sets for our analysis.


df <- df %>% 
  # Strip $ and , from the dollar fields, cast to numeric and rescale to $000s
  mutate(INCOME = as.numeric(str_replace_all(INCOME, '([\\$,])', '')) / 1000) %>%
  mutate(HOME_VAL = as.numeric(str_replace_all(HOME_VAL, '([\\$,])', '')) / 1000) %>%
  mutate(BLUEBOOK = as.numeric(str_replace_all(BLUEBOOK, '([\\$,])', '')) / 1000) %>%
  mutate(OLDCLAIM = as.numeric(str_replace_all(OLDCLAIM, '([\\$,])', '')) / 1000) %>%
  # Remove the stray characters (z_, <) from the factor variables
  mutate(MSTATUS = as.factor(str_replace_all(MSTATUS, '([z_<])', ''))) %>%
  mutate(SEX = as.factor(str_replace_all(SEX, '([z_<])', ''))) %>%
  mutate(EDUCATION = as.factor(str_replace_all(EDUCATION, '([z_<])', ''))) %>%
  mutate(JOB = as.factor(str_replace_all(JOB, '([z_<])', ''))) %>%
  mutate(CAR_TYPE = as.factor(str_replace_all(CAR_TYPE, '([z_<])', ''))) %>%
  mutate(URBANICITY = as.factor(str_replace_all(URBANICITY, '([z_<])', ''))) %>%
  # Impute missing (and negative CAR_AGE) values with the median
  mutate(CAR_AGE = if_else(is.na(CAR_AGE), median(CAR_AGE, na.rm = TRUE), CAR_AGE)) %>%
  mutate(CAR_AGE = if_else(CAR_AGE < 0, median(CAR_AGE, na.rm = TRUE), CAR_AGE)) %>%
  mutate(YOJ = if_else(is.na(YOJ), median(YOJ, na.rm = TRUE), YOJ)) %>%
  mutate(AGE = if_else(is.na(AGE), median(AGE, na.rm = TRUE), AGE)) %>%
  mutate(INCOME = if_else(is.na(INCOME), median(INCOME, na.rm = TRUE), INCOME)) %>%
  mutate(HOME_VAL = if_else(is.na(HOME_VAL), median(HOME_VAL, na.rm = TRUE), HOME_VAL))
  

# Order the EDUCATION factor levels
df$EDUCATION <- ordered(df$EDUCATION, levels = c("High School", "Bachelors", "Masters", "PhD"))


Revised Data

A post-data-wrangling summary of the data is set forth below. This data reflects the changes proposed in the Data Observations section. The data wrangling script is included in the Appendix.

summary(df)
# Create a 75/25 train/test split for the classification models
set.seed(10)
train_index <- createDataPartition(df$TARGET_FLAG, p = .75, list = FALSE, times = 1)
train <- df[train_index,]
test <- df[-train_index,]

# Correct the class imbalance by oversampling the minority (accident) class
set.seed(10)
minority <- nrow(train[train$TARGET_FLAG == 1,])
majority <- nrow(train[train$TARGET_FLAG == 0,])
diff <- majority - minority

minority_index <- train[train$TARGET_FLAG == 1,]$INDEX

over_sample_train <- data.frame(INDEX = sample(minority_index, diff, TRUE)) %>% 
  merge(train, .) %>% 
  bind_rows(train)

# For the cost models, keep only rows with an accident and split 75/25
set.seed(42)
accidents <- df %>%
  filter(TARGET_FLAG == 1)
amt_train_index <- createDataPartition(accidents$TARGET_AMT, p = .75, list = FALSE, times = 1)
amt_train <- accidents[amt_train_index,]
amt_test <- accidents[-amt_train_index,]
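A quick tabulation of the rebalanced flag confirms the two classes are now roughly equal:

# After oversampling, the two classes should have roughly equal counts.
table(over_sample_train$TARGET_FLAG)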

Prepared Data

The three charts below (moving left to right) display the rebalanced TARGET_FLAG variable as well as box plots of TARGET_AMT (no-accident rows removed) with and without outliers.


over_sample_train %>% 
  mutate(TARGET_FLAG = as.factor(TARGET_FLAG)) %>%
  ggplot(aes(x=TARGET_FLAG,fill=TARGET_FLAG)) +
  geom_bar() + scale_y_continuous() +
  theme_fivethirtyeight() +
  theme(legend.position = "none") +
  labs(x="TARGET_FLAG", y="n()") +
  labs(title="TARGET_FLAG") -> p10



amt_train %>%
  ggplot(aes(x=1, y=TARGET_AMT)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape=23, size = 3, fill="#FF2700") +
  scale_x_continuous(breaks=NULL) +
  theme_fivethirtyeight() +
  labs(title="TARGET_AMT") -> p20



amt_train %>%
  ggplot(aes(x=1, y=TARGET_AMT)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape=23, size = 3, fill="#008FD5") +
  #geom_boxplot(outlier.shape = NA) +
  scale_y_continuous(limits = quantile(amt_train$TARGET_AMT, c(0.1, 0.9))) +
  scale_x_continuous(breaks=NULL) +
  theme_fivethirtyeight() +
  labs(title="TARGET_AMT - No outliers") -> p30

grid.arrange(p10, p20, p30, ncol=3)

EDA

We are now ready to proceed to the EDA phase of our analysis. We will employ a visual EDA approach to better understand the data. Charts for each explanatory variable are set forth below.



# Display a tailored panel of plots for each explanatory variable,
# depending on the variable's type.
for (var in names(over_sample_train)) {

  plot_df <- over_sample_train
  plot_df$x <- plot_df[[var]]

  # Density of the variable, colored by TARGET_FLAG
  p <- ggplot(plot_df, aes(x, color = factor(TARGET_FLAG))) +
    geom_density() +
    theme_fivethirtyeight() +
    labs(title = str_c(var, " by target"))

  # Box plot of log(TARGET_AMT) across the variable's levels
  b <- ggplot(plot_df, aes(factor(x), log(TARGET_AMT))) + 
    geom_boxplot() +
    theme_fivethirtyeight() +
    labs(title = str_c(var, ' by Log(TARGET_AMT)'), x = var) + 
    stat_summary(fun = mean, colour = "darkred", geom = "point", shape = 16, size = 2)

  # Proportion bar chart of the variable, split by TARGET_FLAG
  bc <- ggplot(plot_df, aes(x, fill = factor(TARGET_FLAG))) + 
    geom_bar(aes(y = after_stat(prop)), position = "dodge", alpha = 0.5) +
    theme_fivethirtyeight() +
    labs(title = str_c(var, ' by TARGET_FLAG')) + 
    theme(plot.title = element_text(hjust = 0.5), axis.title.y = element_text(size = 10)) + 
    theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
    theme(legend.title = element_text(size = 8), legend.text = element_text(size = 7))

  # Box plot of the variable by TARGET_FLAG
  b2 <- ggplot(plot_df, aes(x = factor(TARGET_FLAG), y = x)) + 
    geom_boxplot(outlier.size = 0, alpha = 1.0) +
    labs(title = str_c(var, " by TARGET_FLAG"), x = "TARGET_FLAG", y = var) + 
    theme_fivethirtyeight() + 
    stat_summary(fun = mean, colour = "darkred", geom = "point", shape = 16, size = 2) +
    theme(plot.title = element_text(hjust = 0.5))

  # Proportion bar chart of the variable's levels, ordered by frequency
  bc2 <- ggplot(plot_df, aes(x = reorder(x, x, function(x) length(x)))) + 
    geom_bar(aes(y = after_stat(prop), group = 1)) + 
    theme_fivethirtyeight() +
    labs(title = str_c(var, ' Prop'), x = var) +
    theme(plot.title = element_text(hjust = 0.5), axis.title.y = element_text(size = 10)) +
    theme(legend.title = element_text(size = 8), legend.text = element_text(size = 7)) + 
    scale_x_discrete() + 
    coord_flip()

  # Hex bin plot of the variable against log(TARGET_AMT)
  h <- ggplot(plot_df, aes(x, log(TARGET_AMT))) +
    geom_hex(bins = 50) +
    geom_smooth(method = "loess", size = 1, color = "red") +
    labs(title = str_c(var, " by Log(TARGET_AMT)"), x = var) +
    theme_fivethirtyeight()

  if (var != 'INDEX' & var != 'TARGET_FLAG' & var != 'TARGET_AMT') {

    # Group 1: discrete count variables
    if (var %in% c('KIDSDRIV', 'HOMEKIDS', 'CLM_FREQ', 'MVR_PTS')) {
      grid.arrange(bc, b2, b, ncol = 3)
    }

    # Group 2: continuous variables
    if (var %in% c('AGE', 'YOJ', 'TRAVTIME', 'TIF', 'CAR_AGE')) {
      grid.arrange(bc, b2, h, ncol = 3)
    }

    # Group 3: binary factors
    if (var %in% c('PARENT1', 'MSTATUS', 'SEX', 'CAR_USE', 'RED_CAR', 'REVOKED', 'URBANICITY')) {
      grid.arrange(b, ncol = 1)
    }

    # Group 4: multi-level factors
    if (var %in% c('JOB', 'CAR_TYPE', 'EDUCATION')) {
      grid.arrange(bc2, b, ncol = 2)
    }

    # Group 5: dollar-valued variables
    if (var %in% c('INCOME', 'HOME_VAL', 'BLUEBOOK', 'OLDCLAIM')) {
      grid.arrange(p, b2, h, ncol = 3)
    }
  }
}

EDA Observations:

A cursory review of the EDA visualizations leaves one with the impression that this is a very difficult data set. The following are some selected EDA observations:

KIDSDRIV - the number of teenagers driving in the household

The KIDSDRIV variable has significantly more households with no kids driving - this creates a strong right skew in the variable. The data for households with teenage drivers support the premise that teenage drivers increase the chance of an accident. We might consider changing the levels of this variable by combining all homes with teenage drivers versus those without (we do exactly that with KIDSDRIV2 in Feature Engineering). It’s less clear, given an accident, that teenage drivers increase the cost of the accident. Our box plot shows what appears to be a higher cost in two of the categories (1 driver and 4 drivers), but 2 and 3 seem on par with or below the overall average.

AGE - age of driver

The AGE variable is near-normal for both crash and no-crash groups. Our bar chart and box plot support the idea that younger people are involved in more accidents. The hex bin plot offers some evidence that the cost of accidents is higher for young people and old people (over 60), but the preponderance of accidents occur for individuals in the middle (light blue in the hex bins).

HOMEKIDS - number of kids at home

HOMEKIDS, like KIDSDRIV, shows that kids increase accidents. This is evident in the bar and box plots for HOMEKIDS. The cost of accidents for HOMEKIDS is consistent with our findings for KIDSDRIV.

YOJ - years on job

Our box plot supports the theory that insureds with more years on the job are less likely to be in an accident. The bar chart reveals a large percentage of insureds with less than a year on the job. This group showed a higher propensity to be in an accident. The hex bin plot for TARGET_AMT is difficult to interpret.

INCOME

Income’s two density plots (accident, no accident) have significant overlap. This may undermine the variable’s usefulness in the classification analysis. There does appear to be a slight negative relationship between income and accidents. The INCOME hex bin plot has a significant right skew owing to a handful of outlier incomes.

HOME_VAL

The HOME_VAL box plot shows that insureds with lower home values have more accidents. This variable might be useful in the classification model. There are a large number of insureds with zero HOME_VAL values. This could reflect renters and might be something that should be addressed in the Data Prep phase; a sketch of one approach follows.
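One option would be an explicit renter/owner indicator, so the zeros carry signal instead of skewing the distribution. A minimal sketch - the HOME_OWNER name is hypothetical and the flag is not used elsewhere in this analysis:

# Hypothetical renter/owner flag: a zero HOME_VAL is treated as a renter.
df <- df %>% 
  mutate(HOME_OWNER = if_else(HOME_VAL > 0, 1, 0))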

MSTATUS AND SEX

The MSTATUS and SEX box plots do not provide a lot of insight. The box plots for accidents and no accidents are very similar for both variables.

TRAVTIME

The TRAVTIME box plot indicates that insureds with longer travel times have more accidents. The hex bin plot indicates that there is a concentration of travel times between 29 and 50 minutes.

BLUEBOOK

The BLUEBOOK box plot supports the premise that cars with lower bluebook values are involved in more accidents. There is also significant overlap between the accident vs no-accident BLUEBOOK density plots. It’s difficult to derive much insight from the BLUEBOOK hex bin plot.

MVR_PTS

The MVR_PTS box plot clearly shows that insureds with more MVR points are in more accidents. Claim amounts seemed to be concentrated in a fairly narrow range across point levels. There is a concentration of insureds without any points.


Feature Engineering

In our data prep phase we addressed missing values, rescaled values and cleaned up data inconsistencies ($ signs and random characters). We also corrected the imbalance in TARGET_FLAG and, since our regression model will seek to determine the cost of an accident given that one occurred, we created training and evaluation data sets with the no-accident observations removed.

Box Cox Analysis

We use the Box-Cox analysis below to inform our transformation decisions. The fitted lambda suggests a power transform: a lambda near 0.5 points to a square root, while a lambda near 0 points to a log.

INCOME


#Income:
boxcoxfit(over_sample_train$INCOME[over_sample_train$INCOME>0])

HOME_VAL

#HOME_VAL
boxcoxfit(over_sample_train$HOME_VAL[over_sample_train$HOME_VAL>0])

BLUEBOOK

#BLUEBOOK
boxcoxfit(over_sample_train$BLUEBOOK[over_sample_train$BLUEBOOK>0])

OLDCLAIM

#OLDCLAIM
boxcoxfit(over_sample_train$OLDCLAIM[over_sample_train$OLDCLAIM>0])
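For reference, the helper below (a hypothetical illustration, not used elsewhere in this analysis) shows how a fitted lambda maps to a transform; lambdas near 0.5 are what motivate the square-root transforms adopted below.

# Hypothetical helper: apply the transform implied by a fitted Box-Cox lambda.
# A lambda near zero implies log(x); otherwise use the power transform.
bc_transform <- function(x, lambda, eps = 0.05) {
  if (abs(lambda) < eps) log(x) else (x^lambda - 1) / lambda
}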


Based upon our EDA and the Box-Cox analysis above, we will employ some additional data transformations in our analysis. In some cases we create two different transforms - one for the classification analysis and one for the linear analysis. The transformations follow:

  • TARGET_AMT - we will use log(TARGET_AMT) in our regression analysis.
  • KIDSDRIV - create KIDSDRIV2, which collapses observations with kids driving into one bucket.
  • KIDSDRIV - create KIDSDRIV_T, a centered version, to address the potential curvature in relation to TARGET_AMT.
  • HOMEKIDS - create HOMEKIDS2, which collapses observations with kids at home into one bucket.
  • HOMEKIDS - similar to KIDSDRIV, we’ll create HOMEKIDS_T, a centered version.
  • YOJ - create YOJ_T, centered with the scale function, to address the large percentage of zero YOJ observations.
  • AGE - create AGE2, which breaks AGE into three buckets: young, middle and old.
  • AGE - create AGE_T, which centers the distribution with the scale function.
  • INCOME - create INCOME2, which will be 1 if income is below 50K and 0 otherwise.
  • INCOME - create INCOME_T, which uses a square root transform based upon Box-Cox.
  • HOME_VAL - create HOME_VAL_T, which uses a log transform (log(HOME_VAL + 1)).
  • BLUEBOOK - create BLUEBOOK_T, which uses a square root transform based upon Box-Cox.
  • OLDCLAIM - create OLDCLAIM_T, which applies a log transform to address significant right skew.
  • OLDCLAIM - create OLDCLAIM2, which collapses OLDCLAIM into a 0/1 variable where 1 means there’s been a claim.

#This block of code applies the transforms discussed in the feature engineering
#section. The same recipe is applied to each data set, so we wrap it in a
#function rather than repeating the pipeline four times.

apply_transforms <- function(d) {
  d %>% 
    mutate(KIDSDRIV_T = scale(KIDSDRIV, scale = FALSE)) %>% 
    mutate(HOMEKIDS_T = scale(HOMEKIDS, scale = FALSE)) %>%
    mutate(KIDSDRIV2 = if_else(KIDSDRIV <= 0, 0, 1)) %>%
    mutate(HOMEKIDS2 = if_else(HOMEKIDS <= 0, 0, 1)) %>%
    mutate(TRAVTIME_T = scale(TRAVTIME, scale = FALSE)) %>%
    mutate(TIF = scale(TIF, scale = FALSE)) %>%
    mutate(YOJ_T = scale(YOJ, scale = FALSE)) %>% 
    mutate(AGE_T = scale(AGE, scale = FALSE)) %>%
    mutate(AGE2 = if_else(AGE < 20, 1, if_else(AGE > 59, 2, 0))) %>%
    mutate(INCOME2 = if_else(INCOME < 50, 1, 0)) %>% 
    mutate(INCOME_T = INCOME^0.5) %>% 
    mutate(BLUEBOOK_T = BLUEBOOK^0.5) %>% 
    mutate(HOME_VAL_T = log(HOME_VAL + 1)) %>%
    mutate(OLDCLAIM_T = log(OLDCLAIM + 1)) %>%
    mutate(CLM_FREQ = scale(CLM_FREQ, scale = FALSE)) %>% 
    mutate(OLDCLAIM2 = if_else(OLDCLAIM <= 0, 0, 1))
}

over_sample_train <- apply_transforms(over_sample_train)
test <- apply_transforms(test)
amt_train <- apply_transforms(amt_train)
amt_test <- apply_transforms(amt_test)


Transformed Data Set

The summary table below shows our transformed data set, which includes both the original and transformed variables:

summary(over_sample_train)

Note - VIF scores were calculated but no significant issues were identified.
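For completeness, a sketch of that VIF check (car is already loaded; the predictor subset here is illustrative):

# Illustrative VIF check on a subset of the numeric predictors; values above
# roughly 5-10 would flag problematic collinearity.
vif_fit <- lm(TARGET_AMT ~ AGE + INCOME + HOME_VAL + BLUEBOOK + TRAVTIME + MVR_PTS + TIF, data = over_sample_train)
car::vif(vif_fit)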

Model Building

We will now commence our model building.

Classification Model - Odds of Accident

Mod1 - this model will simply establish a baseline. Variable selection was informed by common sense and intuition.

mod1 <- glm(TARGET_FLAG ~ KIDSDRIV2 + AGE2 + INCOME2 + HOMEKIDS2 + OLDCLAIM2 + REVOKED + MVR_PTS + EDUCATION + CAR_USE + URBANICITY, data =over_sample_train, family="binomial" )

 



tidym1 <- tidy(mod1)

tidym1 %>% 
  kable() %>% 
  kable_styling()

tdf1 <- glance(mod1)

Model 1 Evaluation

tdf1 %>% 
  kable() %>% 
  kable_styling()


Mod2 - this model takes the significant variables from mod1 and adds some quadratic transforms.

mod2 <- glm(TARGET_FLAG ~ KIDSDRIV2 + AGE2 + INCOME2 + HOMEKIDS2 + OLDCLAIM2 + REVOKED + MVR_PTS + EDUCATION + CAR_USE + URBANICITY + INCOME_T + HOME_VAL + PARENT1 + MSTATUS + JOB + BLUEBOOK_T + BLUEBOOK + TIF + CAR_TYPE + TRAVTIME + I(AGE_T^2), data =over_sample_train, family="binomial" )

 

tidym2 <- tidy(mod2)

tidym2 %>% 
  kable() %>% 
  kable_styling()

Model 2 Evaluation


tdf2 <- glance(mod2)

tdf2 %>% 
  kable() %>% 
  kable_styling()

Mod3 - removes the non-significant variables from mod2 and adds I(CLM_FREQ^0.5).

mod3 <- glm(TARGET_FLAG ~ KIDSDRIV2 + OLDCLAIM2 + REVOKED + MVR_PTS + EDUCATION + CAR_USE + URBANICITY + INCOME_T + HOME_VAL + PARENT1 + MSTATUS + JOB + BLUEBOOK_T + BLUEBOOK + TIF + CAR_TYPE + TRAVTIME + I(AGE_T^2) + I(CLM_FREQ^0.5), data =over_sample_train, family="binomial" )

 


tidym3 <- tidy(mod3)

tidym3 %>% 
  kable() %>% 
  kable_styling()

Model 3 Evaluation


tdf3 <- glance(mod3)

tdf3 %>% 
  kable() %>% 
  kable_styling()

Mod4 - this will be the stepwise regression.


# Stepwise selection in both directions between the null model and the
# kitchen-sink model (TARGET_AMT and INDEX are excluded from the predictors)
stepdat <- over_sample_train %>% 
  dplyr::select(-TARGET_AMT, -INDEX)

mod4.upper <- glm(TARGET_FLAG ~ ., family = "binomial", data = stepdat)
mod4.null <- glm(TARGET_FLAG ~ 1, family = "binomial", data = stepdat)

mod4 <- step(mod4.null, scope = list(upper = mod4.upper, lower = mod4.null), trace = 0, direction = 'both')

tidym4 <- tidy(mod4)

tidym4 %>% 
  kable() %>% 
  kable_styling()

Model 4 Evaluation


tdf4 <- glance(mod4)

tdf4 %>% 
  kable() %>% 
  kable_styling()

Linear Model - Cost of Claim

Now we will utilize the amt_train data set to model the cost of the claim. Note, the amt_train data set only includes rows that have actual claims (i.e. zero-claim rows have been removed).

Our first model will use some logical variables like BLUEBOOK, CAR_TYPE and URBANICITY, amongst others. This model will establish our baseline.

Mod 5 - Linear Regression - Base Line (logical variables)

mod5 <- lm(TARGET_AMT ~ I(BLUEBOOK_T^2) + CAR_TYPE + MSTATUS + URBANICITY + MVR_PTS + AGE + AGE2, data = amt_train)


tidym5 <- tidy(mod5)

tidym5 %>% 
  kable() %>% 
  kable_styling()

Model 5 Evaluation


tdf5 <- glance(mod5)

tdf5 %>% 
  kable() %>% 
  kable_styling()

Mod 6 Linear Regression - Kitchen Sink

mod6 <- lm(TARGET_AMT ~ ., data =amt_train)


tidym6 <- tidy(mod6)

tidym6 %>% 
  kable() %>% 
  kable_styling()

Model 6 Evaluation

tdf6 <- glance(mod6)

tdf6 %>% 
  kable() %>% 
  kable_styling()

We are getting some bad results with the linear models - mod5 and mod6 had a combined R-squared of 0.30. Not good! I suspect the problem is outliers in the data set. We identified this as an issue in the data review and now it’s time to address it. mod7 and mod8 will employ a robust regression model (rlm), which iteratively downweights observations with large residuals so that extreme claim amounts have less influence on the fit.


Mod 7: Robust 1

mod7 <- rlm(TARGET_AMT ~ I(BLUEBOOK_T^2) + CAR_TYPE + MSTATUS + URBANICITY + MVR_PTS + AGE + AGE2, amt_train, psi = psi.hampel)


tidym7 <- tidy(mod7)

tidym7 %>% 
  kable() %>% 
  kable_styling()

Model 7 Evaluation

tdf7 <- glance(mod7)

tdf7 %>% 
  kable() %>% 
  kable_styling()


Mod 8: Robust 2

mod8 <- rlm(TARGET_AMT ~ KIDSDRIV2 + OLDCLAIM2 + REVOKED + MVR_PTS + EDUCATION + CAR_USE + URBANICITY + INCOME_T + HOME_VAL + PARENT1 + MSTATUS + JOB + BLUEBOOK_T + BLUEBOOK + TIF + CAR_TYPE, amt_train, psi = psi.hampel)


tidym8 <- tidy(mod8)

tidym8 %>% 
  kable() %>% 
  kable_styling()

Model 8 Evaluation

tdf8 <- glance(mod8)

tdf8 %>% 
  kable() %>% 
  kable_styling()

Comparing Models - Classification

The choice of the best classification model is fairly straightforward. Model 3 has the best logLik, AIC, BIC and deviance. A review of the coefficients indicates that the model results make intuitive sense.


comp_class <- rbind(tdf1,tdf2,tdf3,tdf4)

comp_class %>% 
  kable() %>% 
  kable_styling()
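The in-sample fit statistics above can be complemented with an out-of-sample check. A sketch using the held-out test set (the 0.5 threshold is an assumption):

# Sketch: out-of-sample AUC and confusion matrix for the chosen model (mod3).
test_prob <- predict(mod3, newdata = test, type = "response")
pROC::auc(pROC::roc(test$TARGET_FLAG, test_prob))

pred_class <- factor(if_else(test_prob > 0.5, 1, 0), levels = c(0, 1))
caret::confusionMatrix(pred_class, factor(test$TARGET_FLAG, levels = c(0, 1)), positive = "1")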


Comparing Models - Linear

The choice of the best linear model is more difficult. Outliers in the data set make the use of an lm model difficult. This is particularly true because the outliers in the data set were not necessarily errors - the data may just naturally include outliers. Given this fact I would pick model 7 or 8, basically picking the robust regression approach for this data set.


# Compare in-sample error metrics for the four cost models
DMwR::regr.eval(amt_train$TARGET_AMT, mod5$fitted.values)
DMwR::regr.eval(amt_train$TARGET_AMT, mod6$fitted.values)
DMwR::regr.eval(amt_train$TARGET_AMT, mod7$fitted.values)
DMwR::regr.eval(amt_train$TARGET_AMT, mod8$fitted.values)

Evaluation Data Set

Each of the selected models was successfully applied to the evaluation data sets.
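A sketch of that scoring step, assuming the evaluation file has been read into eval_df (a hypothetical name) and run through the same wrangling and transform recipe as the training data:

# Hypothetical scoring of the evaluation data with the selected models.
eval_df$P_TARGET_FLAG <- predict(mod3, newdata = eval_df, type = "response")
eval_df$P_TARGET_AMT <- predict(mod8, newdata = eval_df)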

Appendix

Please click the following link to access the R Markdown for this analysis.