library(tidyverse)
library(forcats)
library(modelr)
library(skimr)
library(knitr)
library(kableExtra)
library(broom)
library(caTools)
library(pscl)
library(ROCR)
library(MASS)
library(caret)
library(car)
library(DT)
library(GGally)
library(psych)
library(corrplot)
library(reshape)
library(ggthemes)
library(moments)
library(qqplotr)
library(gridExtra)
library(geoR)
library(alr3)
library(pROC)
The objective of this analysis is to build multiple linear regression and binary logistic regression models on the training data to predict (1) the probability that a person will crash their car and (2) the amount it will cost if they do. A summary of the data set provided for this analysis is set forth below:
| VARIABLE_NAME | DEFINITION | THEORETICAL EFFECT |
|---|---|---|
| INDEX | Identification Variable (do not use) | None |
| TARGET_FLAG | Was Car in a crash? 1=YES 0=NO | None |
| TARGET_AMT | If car was in a crash, what was the cost | None |
| AGE | Age of Driver | Very young people tend to be risky. Maybe very old people also. |
| BLUEBOOK | Value of Vehicle | Unknown effect on probability of collision, but probably affects the payout if there is a crash |
| CAR_AGE | Vehicle Age | Unknown effect on probability of collision, but probably affects the payout if there is a crash |
| CAR_TYPE | Type of Car | Unknown effect on probability of collision, but probably affects the payout if there is a crash |
| CAR_USE | Vehicle Use | Commercial vehicles are driven more, so might increase probability of collision |
| CLM_FREQ | # Claims (Past 5 Years) | The more claims you filed in the past, the more you are likely to file in the future |
| EDUCATION | Max Education Level | Unknown effect, but in theory more educated people tend to drive more safely |
| HOMEKIDS | # Children at Home | Unknown effect |
| HOME_VAL | Home Value | In theory, home owners tend to drive more responsibly |
| INCOME | Income | In theory, rich people tend to get into fewer crashes |
| JOB | Job Category | In theory, white collar jobs tend to be safer |
| KIDSDRIV | # Driving Children | When teenagers drive your car, you are more likely to get into crashes |
| MSTATUS | Marital Status | In theory, married people drive more safely |
| MVR_PTS | Motor Vehicle Record Points | If you get lots of traffic tickets, you tend to get into more crashes |
| OLDCLAIM | Total Claims (Past 5 Years) | If your total payout over the past five years was high, this suggests future payouts will be high |
| PARENT1 | Single Parent | Unknown effect |
| RED_CAR | A Red Car | Urban legend says that red cars (especially red sports cars) are riskier. Is that true? |
| REVOKED | License Revoked (Past 7 Years) | If your license was revoked in the past 7 years, you probably are a riskier driver. |
| SEX | Gender | Urban legend says that women have fewer crashes than men. Is that true? |
| TIF | Time in Force | People who have been customers for a long time are usually safer. |
| TRAVTIME | Distance to Work | Long drives to work usually suggest greater risk |
| URBANICITY | Home/Work Area | Unknown |
| YOJ | Years on Job | People who stay at a job for a long time are usually safer |
In the Data phase of this analysis we will review the data and take any necessary clean-up measures before moving into the EDA phase of the analysis.
First, let's use the str and summary functions to familiarize ourselves with the data and to determine what data clean-up may be required.
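Throughout this analysis, df holds the provided training data. A minimal loading sketch follows; the file name is an assumption:
# Load the provided training data (file name assumed for illustration)
df <- read_csv("insurance_training_data.csv")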
str(df)
summary(df)
Next we'll plot our two target variables to get a better understanding of the data set. Moving left to right, we plot a bar plot of TARGET_FLAG and then box plots of TARGET_AMT with and without outliers.
# Bar plot of the TARGET_FLAG distribution
df %>%
  mutate(TARGET_FLAG = as.factor(TARGET_FLAG)) %>%
  ggplot(aes(x = TARGET_FLAG, fill = TARGET_FLAG)) +
  geom_bar() + scale_y_continuous() +
  theme_fivethirtyeight() +
  theme(legend.position = "none") +
  labs(x = "TARGET_FLAG", y = "n()") +
  labs(title = "TARGET_FLAG") -> p1
# Box plot of TARGET_AMT for crash rows, with outliers
df %>% filter(TARGET_FLAG == 1) %>%
  ggplot(aes(x = 1, y = TARGET_AMT)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape = 23, size = 3, fill = "#FF2700") +
  scale_x_continuous(breaks = NULL) +
  theme_fivethirtyeight() +
  labs(title = "TARGET_AMT") -> p2
# Same box plot with the y-axis trimmed to the 10th-90th percentiles
df %>% filter(TARGET_FLAG == 1) %>%
  ggplot(aes(x = 1, y = TARGET_AMT)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape = 23, size = 3, fill = "#008FD5") +
  scale_y_continuous(limits = quantile(df$TARGET_AMT, c(0.1, 0.9))) +
  scale_x_continuous(breaks = NULL) +
  theme_fivethirtyeight() +
  labs(title = "TARGET_AMT - No outliers") -> p3
grid.arrange(p1, p2, p3, ncol = 3)
Here are some data observations from the EDA that will need to be addressed:

- INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM are stored as text with $ signs and commas and must be converted to numeric.
- Several factor variables (MSTATUS, SEX, EDUCATION, JOB, CAR_TYPE, URBANICITY) contain stray "z_" and "<" characters in their levels.
- AGE, YOJ, INCOME, HOME_VAL and CAR_AGE contain missing values, and CAR_AGE contains at least one negative value.
- TARGET_FLAG is imbalanced, with far more no-crash than crash observations.
We'll use dplyr from the tidyverse to address these clean-up items so we can continue our EDA and extend it to the explanatory variables. We'll make the following changes to the data set and then have another look at the data summaries.
Finally, we will correct the imbalance in the TARGET_FLAG variable, filter no-accident rows out of the TARGET_AMT data and create training and test data sets for our analysis.
df <- df %>%
  # Strip $ signs and commas from dollar fields and rescale to thousands
  mutate(INCOME = as.numeric(str_replace_all(INCOME, '([\\$,])', '')) / 1000) %>%
  mutate(HOME_VAL = as.numeric(str_replace_all(HOME_VAL, '([\\$,])', '')) / 1000) %>%
  mutate(BLUEBOOK = as.numeric(str_replace_all(BLUEBOOK, '([\\$,])', '')) / 1000) %>%
  mutate(OLDCLAIM = as.numeric(str_replace_all(OLDCLAIM, '([\\$,])', '')) / 1000) %>%
  # Remove stray "z_" and "<" characters from factor levels
  mutate(MSTATUS = as.factor(str_replace_all(MSTATUS, '([z_<])', ''))) %>%
  mutate(SEX = as.factor(str_replace_all(SEX, '([z_<])', ''))) %>%
  mutate(EDUCATION = as.factor(str_replace_all(EDUCATION, '([z_<])', ''))) %>%
  mutate(JOB = as.factor(str_replace_all(JOB, '([z_<])', ''))) %>%
  mutate(CAR_TYPE = as.factor(str_replace_all(CAR_TYPE, '([z_<])', ''))) %>%
  mutate(URBANICITY = as.factor(str_replace_all(URBANICITY, '([z_<])', ''))) %>%
  # Impute missing (and impossible negative CAR_AGE) values with the median
  mutate(CAR_AGE = if_else(is.na(CAR_AGE), median(CAR_AGE, na.rm = TRUE), CAR_AGE)) %>%
  mutate(CAR_AGE = if_else(CAR_AGE < 0, median(CAR_AGE, na.rm = TRUE), CAR_AGE)) %>%
  mutate(YOJ = if_else(is.na(YOJ), median(YOJ, na.rm = TRUE), YOJ)) %>%
  mutate(AGE = if_else(is.na(AGE), median(AGE, na.rm = TRUE), AGE)) %>%
  mutate(INCOME = if_else(is.na(INCOME), median(INCOME, na.rm = TRUE), INCOME)) %>%
  mutate(HOME_VAL = if_else(is.na(HOME_VAL), median(HOME_VAL, na.rm = TRUE), HOME_VAL))
# Order the EDUCATION factor from lowest to highest level
df$EDUCATION <- ordered(df$EDUCATION, levels = c("High School", "Bachelors", "Masters", "PhD"))
A post-data wrangling summary of the data is set forth below. This data reflects the changes proposed in the Data Observation section. The data wrangling script is included in the Appendix.
summary(df)
# Hold out 25% of the data as a test set
set.seed(10)
train_index <- createDataPartition(df$TARGET_FLAG, p = .75, list = FALSE, times = 1)
train <- df[train_index, ]
test <- df[-train_index, ]
# Oversample the minority (crash) class, sampling rows with replacement
# until the classes are balanced
set.seed(10)
minority <- nrow(train[train$TARGET_FLAG == 1, ])
majority <- nrow(train[train$TARGET_FLAG == 0, ])
diff <- majority - minority
minority_index <- train[train$TARGET_FLAG == 1, ]$INDEX
over_sample_train <- data.frame(INDEX = sample(minority_index, diff, TRUE)) %>%
  merge(train, .) %>%
  bind_rows(train)
# For the linear (claim amount) models, keep only rows with an accident
set.seed(42)
accidents <- df %>%
  filter(TARGET_FLAG == 1)
amt_train_index <- createDataPartition(accidents$TARGET_AMT, p = .75, list = FALSE, times = 1)
amt_train <- accidents[amt_train_index, ]
amt_test <- accidents[-amt_train_index, ]
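A quick way to verify the rebalancing is to inspect the class proportions of the oversampled training set, which should now be close to 50/50:
# Check the rebalanced class proportions
prop.table(table(over_sample_train$TARGET_FLAG))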
The three charts below (moving left to right) display the rebalanced TARGET_FLAG variable as well as box plots of TARGET_AMT, with no-accident rows removed, with and without outliers.
# Bar plot of the rebalanced TARGET_FLAG distribution
over_sample_train %>%
  mutate(TARGET_FLAG = as.factor(TARGET_FLAG)) %>%
  ggplot(aes(x = TARGET_FLAG, fill = TARGET_FLAG)) +
  geom_bar() + scale_y_continuous() +
  theme_fivethirtyeight() +
  theme(legend.position = "none") +
  labs(x = "TARGET_FLAG", y = "n()") +
  labs(title = "TARGET_FLAG") -> p10
# Box plot of TARGET_AMT for the accident-only training set, with outliers
amt_train %>%
  ggplot(aes(x = 1, y = TARGET_AMT)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape = 23, size = 3, fill = "#FF2700") +
  scale_x_continuous(breaks = NULL) +
  theme_fivethirtyeight() +
  labs(title = "TARGET_AMT") -> p20
# Same box plot with the y-axis trimmed to the 10th-90th percentiles
amt_train %>%
  ggplot(aes(x = 1, y = TARGET_AMT)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape = 23, size = 3, fill = "#008FD5") +
  scale_y_continuous(limits = quantile(amt_train$TARGET_AMT, c(0.1, 0.9))) +
  scale_x_continuous(breaks = NULL) +
  theme_fivethirtyeight() +
  labs(title = "TARGET_AMT - No outliers") -> p30
grid.arrange(p10, p20, p30, ncol = 3)
We are now ready to proceed to the EDA phase of our analysis. We will employ a visual EDA approach to better understand the data. Charts for each explanatory variable are set forth below.
# Loop over the variables and display the chart group suited to each type
for (var in names(over_sample_train)) {
  plot_df <- over_sample_train
  plot_df$x <- plot_df[[var]]
  # Density of the variable by crash / no-crash
  p <- ggplot(plot_df, aes(x, color = factor(TARGET_FLAG))) +
    geom_density() +
    theme_fivethirtyeight() +
    labs(title = str_c(var, " by target"))
  # Box plot of log(TARGET_AMT) by level of the variable
  b <- ggplot(plot_df, aes(factor(x), log(TARGET_AMT))) +
    geom_boxplot() +
    theme_fivethirtyeight() + labs(title = str_c(var, " by Log(TARGET_AMT)"), x = var) +
    stat_summary(fun = mean, colour = "darkred", geom = "point", shape = 16, size = 2)
  # Proportion bar chart split by TARGET_FLAG
  bc <- ggplot(plot_df, aes(x, fill = factor(TARGET_FLAG))) +
    geom_bar(aes(y = after_stat(prop)), position = "dodge", alpha = 0.5) +
    theme_fivethirtyeight() + labs(title = str_c(var, " by TARGET_FLAG")) +
    theme(plot.title = element_text(hjust = 0.5), axis.title.y = element_text(size = 10)) +
    theme(legend.position = c(1, 1), legend.justification = c(1, 1)) +
    theme(legend.title = element_text(size = 8), legend.text = element_text(size = 7))
  # Box plot of the variable by TARGET_FLAG
  b2 <- ggplot(plot_df, aes(x = factor(TARGET_FLAG), y = x)) +
    geom_boxplot(outlier.size = 0, alpha = 1.0) +
    labs(title = str_c(var, " by TARGET_FLAG"), x = "TARGET_FLAG", y = var) +
    theme_fivethirtyeight() +
    stat_summary(fun = mean, colour = "darkred", geom = "point", shape = 16, size = 2) +
    theme(plot.title = element_text(hjust = 0.5))
  # Proportion bar chart ordered by level frequency (categorical variables)
  bc2 <- ggplot(plot_df, aes(x = reorder(x, x, function(x) length(x)))) +
    geom_bar(aes(y = after_stat(prop), group = 1)) +
    theme_fivethirtyeight() + labs(title = str_c(var, " Prop")) +
    theme(plot.title = element_text(hjust = 0.5), axis.title.y = element_text(size = 10)) +
    theme(legend.title = element_text(size = 8), legend.text = element_text(size = 7)) +
    scale_x_discrete() +
    labs(x = var) + coord_flip()
  # Hex bin plot of the variable against log(TARGET_AMT)
  h <- ggplot(plot_df, aes(x, log(TARGET_AMT))) +
    geom_hex(bins = 50) +
    geom_smooth(method = "loess", size = 1, color = "red") +
    labs(title = str_c(var, " by TARGET_AMT")) +
    theme_fivethirtyeight()
  if (!var %in% c("INDEX", "TARGET_FLAG", "TARGET_AMT")) {
    # Group 1: discrete counts
    if (var %in% c("KIDSDRIV", "HOMEKIDS", "CLM_FREQ", "MVR_PTS")) {
      grid.arrange(bc, b2, b, ncol = 3)
    }
    # Group 2: continuous variables
    if (var %in% c("AGE", "YOJ", "TRAVTIME", "TIF", "CAR_AGE")) {
      grid.arrange(bc, b2, h, ncol = 3)
    }
    # Group 3: binary factors
    if (var %in% c("PARENT1", "MSTATUS", "SEX", "CAR_USE", "RED_CAR", "REVOKED", "URBANICITY")) {
      grid.arrange(b, ncol = 1)
    }
    # Group 4: multi-level factors
    if (var %in% c("JOB", "CAR_TYPE", "EDUCATION")) {
      grid.arrange(bc2, b, ncol = 2)
    }
    # Group 5: dollar-valued variables
    if (var %in% c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM")) {
      grid.arrange(p, b2, h, ncol = 3)
    }
  }
}
A cursory review of the EDA visualizations would leave one with the feeling that this is a very difficult data set. The following are some selected EDA observations:
The KIDSDRIV variable has significantly more households with no kids driving; this creates a strong right skew in the variable. The data for households with teenage drivers support the premise that teenage drivers increase the chance of an accident. We might consider changing the levels of this variable by combining all homes with teenage drivers versus those without. It is less clear, given an accident, that teenage drivers increase the cost of the accident. Our box plot shows what appears to be a higher cost in two of the categories (1 driver and 4 drivers), but 2 and 3 seem on par with or below the overall average.
The AGE variable is near-normal for both crash and no-crash groups. Our bar chart and box plot support the idea that younger people are involved in more accidents. The hex bin plot offers some evidence that the cost of accidents is higher for young people and old people (over 60), but the preponderance of accidents occur for individuals in the middle (light blue in the hex bins).
HOMEKIDS, like KIDSDRIV, shows that kids increase accidents. This is evident in the bar and box plots for HOMEKIDS. The cost of accidents for HOMEKIDS is consistent with our finding for KIDSDRIV.
Our YOJ box plot supports the theory that insureds with more years on the job are less likely to be in an accident. The bar chart reveals a large percentage of insureds with less than a year on the job. This group showed a higher propensity to be in an accident. The hex bin plot for TARGET_AMT is difficult to interpret.
Income’s two density plots (accident, no accident) have significant overlap. This may undermine the variable’s use in the classification analysis. There does appear to be a slight negative relationship between income and accidents. The INCOME hex bin plot has a significant right skew owing to a handful of outlier incomes.
The HOME_VAL box plot shows that insureds with lower home values have more accidents. This variable might be useful in the classification model. There is a large number of insureds with a zero HOME_VAL. This could reflect renters and might be something to address in the Data Prep phase, as sketched below.
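As an illustration of that clean-up idea, a renter indicator could be derived directly from HOME_VAL; the RENTER flag below is hypothetical and is not used in the models that follow:
# Hypothetical renter flag: a zero home value is assumed to indicate a renter
df <- df %>%
  mutate(RENTER = if_else(HOME_VAL == 0, 1, 0))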
The MSTATUS and SEX box plots do not provide a lot of insight. The box plots for accidents and no accidents for both variables are very similar.
The TRAVTIME box plot indicates that insured with longer travel times have more accidents. The hex bin plot indicates that there is a concentration of travel times between 29 and 50 minutes.
The BLUEBOOK box plot supports the premise that cars with lower bluebook values are involved in more accidents. There is also significant overlap between the accident and no-accident BLUEBOOK density plots. It is difficult to derive much insight from the BLUEBOOK hex bin plot.
The MVR_PTS box plot clearly shows that insureds with more MVR points are in more accidents. MVR points to claim amounts seem to be concentrated in a fairly narrow range across point levels. There is a concentration of insureds without any points.
In our data prep phase we addressed missing values, rescaled values and cleaned up data inconsistencies ($ signs and stray characters). We also corrected the imbalance in TARGET_FLAG and, since our regression model will seek to determine the cost of an accident given that one occurred, we created training and evaluation data sets that removed observations with no accident.
We use the Box Cox analysis below to inform our transform decisions.
#Income:
boxcoxfit(over_sample_train$INCOME[over_sample_train$INCOME>0])
#HOME_VAL
boxcoxfit(over_sample_train$HOME_VAL[over_sample_train$HOME_VAL>0])
#BLUEBOOK
boxcoxfit(over_sample_train$BLUEBOOK[over_sample_train$BLUEBOOK>0])
#OLDCLAIM
boxcoxfit(over_sample_train$OLDCLAIM[over_sample_train$OLDCLAIM>0])
Based upon our EDA and the Box Cox analysis above, we will employ some additional data transformations in our analysis. As a rule of thumb, a fitted lambda near 0.5 suggests a square-root transform and a lambda near 0 suggests a log transform; these map to the square-root transforms of INCOME and BLUEBOOK and the log transforms of HOME_VAL and OLDCLAIM applied below. In some cases we create two different transforms - one for the classification analysis and one for the linear analysis. The transformations follow:
# This block of code applies the transforms discussed in the feature
# engineering section. A single helper is applied to each of the four data
# sets so the training and test sets carry identical feature definitions.
apply_transforms <- function(data) {
  data %>%
    # Centered copies of the count-like and continuous variables
    mutate(KIDSDRIV_T = scale(KIDSDRIV, scale = FALSE),
           HOMEKIDS_T = scale(HOMEKIDS, scale = FALSE),
           TRAVTIME_T = scale(TRAVTIME, scale = FALSE),
           TIF        = scale(TIF, scale = FALSE),
           YOJ_T      = scale(YOJ, scale = FALSE),
           AGE_T      = scale(AGE, scale = FALSE),
           # CLM_FREQ is centered into a copy; the raw variable is kept
           # non-negative for the square-root term used in Model 3
           CLM_FREQ_T = scale(CLM_FREQ, scale = FALSE),
           # Indicator re-codings suggested by the EDA
           KIDSDRIV2 = if_else(KIDSDRIV <= 0, 0, 1),
           HOMEKIDS2 = if_else(HOMEKIDS <= 0, 0, 1),
           AGE2      = if_else(AGE < 20, 1, if_else(AGE > 59, 2, 0)),
           INCOME2   = if_else(INCOME < 50, 1, 0),
           OLDCLAIM2 = if_else(OLDCLAIM <= 0, 0, 1),
           # Square-root and log transforms per the Box Cox results
           INCOME_T   = INCOME^0.5,
           BLUEBOOK_T = BLUEBOOK^0.5,
           HOME_VAL_T = log(HOME_VAL + 1),
           OLDCLAIM_T = log(OLDCLAIM + 1))
}
over_sample_train <- apply_transforms(over_sample_train)
test <- apply_transforms(test)
amt_train <- apply_transforms(amt_train)
amt_test <- apply_transforms(amt_test)
The summary table below shows our transformed data set that includes original and transformed variables:
summary(over_sample_train)
Note - VIF scores were calculated and no significant multicollinearity issues were identified.
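A minimal sketch of that check follows, using car::vif on an illustrative (not exhaustive) predictor set; as a rule of thumb, values above roughly 5-10 would flag a problem:
# Fit a simple model on the main continuous predictors and inspect VIF scores
vif_check <- glm(TARGET_FLAG ~ AGE + INCOME + HOME_VAL + BLUEBOOK + TRAVTIME +
                   MVR_PTS + CLM_FREQ + OLDCLAIM + TIF + YOJ,
                 data = over_sample_train, family = "binomial")
car::vif(vif_check)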
We will now commence our model building.
Model 1 is a baseline logistic regression built from the engineered indicator variables plus a few intuitive predictors.
mod1 <- glm(TARGET_FLAG ~ KIDSDRIV2 + AGE2 + INCOME2 + HOMEKIDS2 + OLDCLAIM2 + REVOKED + MVR_PTS + EDUCATION + CAR_USE + URBANICITY, data = over_sample_train, family = "binomial")
tidym1 <- tidy(mod1)
tidym1 %>%
kable() %>%
kable_styling()
tdf1 <- glance(mod1)
tdf1 %>%
kable() %>%
kable_styling()
Model 2 expands Model 1 with the transformed continuous variables, additional demographic and vehicle predictors, and a squared (centered) age term.
mod2 <- glm(TARGET_FLAG ~ KIDSDRIV2 + AGE2 + INCOME2 + HOMEKIDS2 + OLDCLAIM2 + REVOKED + MVR_PTS + EDUCATION + CAR_USE + URBANICITY + INCOME_T + HOME_VAL + PARENT1 + MSTATUS + JOB + BLUEBOOK_T + BLUEBOOK + TIF + CAR_TYPE + TRAVTIME + I(AGE_T^2), data = over_sample_train, family = "binomial")
tidym2 <- tidy(mod2)
tidym2 %>%
kable() %>%
kable_styling()
tdf2 <- glance(mod2)
tdf2 %>%
kable() %>%
kable_styling()
Model 3 drops the AGE2, INCOME2 and HOMEKIDS2 indicators from Model 2 and adds a square-root claim-frequency term.
mod3 <- glm(TARGET_FLAG ~ KIDSDRIV2 + OLDCLAIM2 + REVOKED + MVR_PTS + EDUCATION + CAR_USE + URBANICITY + INCOME_T + HOME_VAL + PARENT1 + MSTATUS + JOB + BLUEBOOK_T + BLUEBOOK + TIF + CAR_TYPE + TRAVTIME + I(AGE_T^2) + I(CLM_FREQ^0.5), data = over_sample_train, family = "binomial")
tidym3 <- tidy(mod3)
tidym3 %>%
kable() %>%
kable_styling()
tdf3 <- glance(mod3)
tdf3 %>%
kable() %>%
kable_styling()
Model 4 uses stepwise selection in both directions, moving between a null model and a full model that allows every predictor other than INDEX and TARGET_AMT into scope.
stepdat <- over_sample_train %>%
dplyr::select(-TARGET_AMT, -INDEX)
mod4.upper <- glm(TARGET_FLAG ~ .,family="binomial", data=stepdat)
mod4.null <- glm(TARGET_FLAG ~ 1, family = "binomial", data=stepdat)
mod4 <- step(mod4.null,scope=list(upper=mod4.upper, lower=mod4.null), trace= 0, direction='both')
tidym4 <- tidy(mod4)
tidym4 %>%
kable() %>%
kable_styling()
tdf4 <- glance(mod4)
tdf4 %>%
kable() %>%
kable_styling()
Now we will utilize the amt_train data set to model the cost of the claim. Note, the amt_train data set only includes rows with actual claims (i.e. zero-claim rows have been removed).
Our first model will use logical variables such as BLUEBOOK, CAR_TYPE and URBANICITY, amongst others. This model will establish our baseline.
mod5 <- lm(TARGET_AMT ~ I(BLUEBOOK_T^2) + CAR_TYPE + MSTATUS + URBANICITY + MVR_PTS + AGE + AGE2, data = amt_train)
tidym5 <- tidy(mod5)
tidym5 %>%
kable() %>%
kable_styling()
tdf5 <- glance(mod5)
tdf5 %>%
kable() %>%
kable_styling()
Model 6 is a kitchen-sink regression of TARGET_AMT on every available predictor, excluding the identifier and the (constant) crash flag.
mod6 <- lm(TARGET_AMT ~ . - INDEX - TARGET_FLAG, data = amt_train)
tidym6 <- tidy(mod6)
tidym6 %>%
kable() %>%
kable_styling()
tdf6 <- glance(mod6)
tdf6 %>%
kable() %>%
kable_styling()
Model 7 refits the baseline (Model 5) specification with robust regression, using the Hampel psi function to damp the influence of outliers.
mod7 <- rlm(TARGET_AMT ~ I(BLUEBOOK_T^2) + CAR_TYPE + MSTATUS + URBANICITY + MVR_PTS + AGE + AGE2, amt_train, psi = psi.hampel)
tidym7 <- tidy(mod7)
tidym7 %>%
kable() %>%
kable_styling()
tdf7 <- glance(mod7)
tdf7 %>%
kable() %>%
kable_styling()
Model 8 applies the same robust approach to the larger predictor set carried over from the classification models.
mod8 <- rlm(TARGET_AMT ~ KIDSDRIV2 + OLDCLAIM2 + REVOKED + MVR_PTS + EDUCATION + CAR_USE + URBANICITY + INCOME_T + HOME_VAL + PARENT1 + MSTATUS + JOB + BLUEBOOK_T + BLUEBOOK + TIF + CAR_TYPE, amt_train, psi = psi.hampel)
tidym8 <- tidy(mod8)
tidym8 %>%
kable() %>%
kable_styling()
tdf8 <- glance(mod8)
tdf8 %>%
kable() %>%
kable_styling()
The choice of the best classification model is fairly straightforward. Model 3 has the best logLik, AIC, BIC and deviance. A review of the coefficients indicates that the model results make intuitive sense.
comp_class <- rbind(tdf1,tdf2,tdf3,tdf4)
comp_class %>%
kable() %>%
kable_styling()
The choice of the best linear model is more difficult. Outliers make the use of an ordinary lm model problematic, particularly because the outliers in the data set were not necessarily errors - the data may simply include natural outliers. Given this, I would pick Model 7 or Model 8, basically picking the robust regression approach for this data set.
DMwR::regr.eval(amt_train$TARGET_AMT, mod5$fitted.values)
DMwR::regr.eval(amt_train$TARGET_AMT, mod6$fitted.values)
DMwR::regr.eval(amt_train$TARGET_AMT, mod7$fitted.values)
DMwR::regr.eval(amt_train$TARGET_AMT, mod8$fitted.values)
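If the DMwR package is unavailable, the same error metrics can be computed directly; eval_metrics below is a hypothetical helper, not part of the original script:
# Hypothetical helper reproducing regr.eval's default mae/mse/rmse/mape
eval_metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(mae  = mean(abs(err)),
    mse  = mean(err^2),
    rmse = sqrt(mean(err^2)),
    mape = mean(abs(err / actual)))
}
eval_metrics(amt_train$TARGET_AMT, mod7$fitted.values)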
Each of the selected models was successfully applied to the evaluation data sets; a sketch of the classification scoring step follows.
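A minimal sketch of scoring the selected classification model (Model 3) on the hold-out set; the 0.5 probability threshold is an assumption:
# Score Model 3 on the hold-out set and summarize accuracy and AUC
test_prob <- predict(mod3, newdata = test, type = "response")
test_class <- if_else(test_prob > 0.5, 1, 0)
confusionMatrix(factor(test_class), factor(test$TARGET_FLAG), positive = "1")
roc(test$TARGET_FLAG, test_prob)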
Please click the following link to access the rmarkdown for this analysis.