DATA 621 Homework #4

Introduction
Data Preparation & Exploration
Model Creation & Evaluation
- Classification Model
- Linear Claim Model
Predictions

df <- read.csv("./data/insurance_training_data.csv")
evaluation <- read.csv("./data/insurance-evaluation-data.csv")

strip_dollars <- function(x){
  x <- as.character(x)
  x <- gsub(",", "", x)
  x <- gsub("\\$", "", x)
  as.numeric(x)
}

Introduction

We have been given a dataset with 8161 records representing customers of an auto insurance company. Each record has two response variables. The first objective is to train a logistic regression classifier to predict if a person was in a car crash. The response variable here is TARGET_FLAG respresenting whether a person had an accident (1) or did not have one (0).

The second objective will be to train a regression model to predict the cost of a crash, if one occurred. The second reponse variable, TARGET_VALUE, is the amount it will cost if the person crashes their car. The value is zero if the person did not crash their car.

Looking at the distribution of the TARGET_AMT variable, we can see that the variable is considerably right-skewed. Thus, a LOG transform might be best here.

Data Preparation & Exploration

We will first look at the summary statistics for the data

df %>%
  summary() %>%
  kable() %>%
  kable_styling()

INDEX	TARGET_FLAG	TARGET_AMT	KIDSDRIV	AGE	HOMEKIDS	YOJ	INCOME	PARENT1	HOME_VAL	MSTATUS	SEX	EDUCATION	JOB	TRAVTIME	CAR_USE	BLUEBOOK	TIF	CAR_TYPE	RED_CAR	OLDCLAIM	CLM_FREQ	REVOKED	MVR_PTS	CAR_AGE	URBANICITY
Min. : 1	Min. :0.0000	Min. : 0	Min. :0.0000	Min. :16.00	Min. :0.0000	Min. : 0.0	$0 : 615	No :7084	$0 :2294	Yes :4894	M :3786	<High School :1203	z_Blue Collar:1825	Min. : 5.00	Commercial:3029	$1,500 : 157	Min. : 1.000	Minivan :2145	no :5783	$0 :5009	Min. :0.0000	No :7161	Min. : 0.000	Min. :-3.000	Highly Urban/ Urban :6492
1st Qu.: 2559	1st Qu.:0.0000	1st Qu.: 0	1st Qu.:0.0000	1st Qu.:39.00	1st Qu.:0.0000	1st Qu.: 9.0	: 445	Yes:1077	: 464	z_No:3267	z_F:4375	Bachelors :2242	Clerical :1271	1st Qu.: 22.00	Private :5132	$6,000 : 34	1st Qu.: 1.000	Panel Truck: 676	yes:2378	$1,310 : 4	1st Qu.:0.0000	Yes:1000	1st Qu.: 0.000	1st Qu.: 1.000	z_Highly Rural/ Rural:1669
Median : 5133	Median :0.0000	Median : 0	Median :0.0000	Median :45.00	Median :0.0000	Median :11.0	$26,840 : 4	NA	$111,129: 3	NA	NA	Masters :1658	Professional :1117	Median : 33.00	NA	$5,800 : 33	Median : 4.000	Pickup :1389	NA	$1,391 : 4	Median :0.0000	NA	Median : 1.000	Median : 8.000	NA
Mean : 5152	Mean :0.2638	Mean : 1504	Mean :0.1711	Mean :44.79	Mean :0.7212	Mean :10.5	$48,509 : 4	NA	$115,249: 3	NA	NA	PhD : 728	Manager : 988	Mean : 33.49	NA	$6,200 : 33	Mean : 5.351	Sports Car : 907	NA	$4,263 : 4	Mean :0.7986	NA	Mean : 1.696	Mean : 8.328	NA
3rd Qu.: 7745	3rd Qu.:1.0000	3rd Qu.: 1036	3rd Qu.:0.0000	3rd Qu.:51.00	3rd Qu.:1.0000	3rd Qu.:13.0	$61,790 : 4	NA	$123,109: 3	NA	NA	z_High School:2330	Lawyer : 835	3rd Qu.: 44.00	NA	$6,400 : 31	3rd Qu.: 7.000	Van : 750	NA	$1,105 : 3	3rd Qu.:2.0000	NA	3rd Qu.: 3.000	3rd Qu.:12.000	NA
Max. :10302	Max. :1.0000	Max. :107586	Max. :4.0000	Max. :81.00	Max. :5.0000	Max. :23.0	$107,375: 3	NA	$153,061: 3	NA	NA	NA	Student : 712	Max. :142.00	NA	$5,900 : 30	Max. :25.000	z_SUV :2294	NA	$1,332 : 3	Max. :5.0000	NA	Max. :13.000	Max. :28.000	NA
NA	NA	NA	NA	NA’s :6	NA	NA’s :454	(Other) :7086	NA	(Other) :5391	NA	NA	NA	(Other) :1413	NA	NA	(Other):7843	NA	NA	NA	(Other):3134	NA	NA	NA	NA’s :510	NA

There are some missing values that we will need to deal with. There are also some values that seem invalid (i.e. -3 CAR_AGE).

Fix Data Types

There are some variables that are currently factors that are dollar values that need to be transformed into numeric variables. We will do this to both the df and evaluation data frames. There are also some invalid data that will be changed to NAs.

strip_dollars <- function(x){
  x <- as.character(x)
  x <- gsub(",", "", x)
  x <- gsub("\\$", "", x)
  as.numeric(x)
}

fix_data_types <- function(messy_df){
  messy_df %>%
    rowwise() %>%
    mutate(INCOME = strip_dollars(INCOME),
           HOME_VAL = strip_dollars(HOME_VAL),
           BLUEBOOK = strip_dollars(BLUEBOOK),
           OLDCLAIM = strip_dollars(OLDCLAIM)) %>%
    ungroup()
}

na_bad_values <- function(messy_df){
  messy_df %>%
    rowwise() %>%
    mutate(CAR_AGE = ifelse(CAR_AGE < 0, NA, CAR_AGE))%>%
    ungroup()
}

df$TARGET_FLAG <- factor(df$TARGET_FLAG)


df <- df %>%
  fix_data_types() %>%
  na_bad_values()
evaluation <- evaluation %>%
  fix_data_types() %>%
  na_bad_values()

Now that we have corrected the invalid variables, we can look at a sumamry of the data:

df %>%
  summary() %>%
  kable() %>%
  kable_styling()

INDEX	TARGET_FLAG	TARGET_AMT	KIDSDRIV	AGE	HOMEKIDS	YOJ	INCOME	PARENT1	HOME_VAL	MSTATUS	SEX	EDUCATION	JOB	TRAVTIME	CAR_USE	BLUEBOOK	TIF	CAR_TYPE	RED_CAR	OLDCLAIM	CLM_FREQ	REVOKED	MVR_PTS	CAR_AGE	URBANICITY
Min. : 1	0:6008	Min. : 0	Min. :0.0000	Min. :16.00	Min. :0.0000	Min. : 0.0	Min. : 0	No :7084	Min. : 0	Yes :4894	M :3786	<High School :1203	z_Blue Collar:1825	Min. : 5.00	Commercial:3029	Min. : 1500	Min. : 1.000	Minivan :2145	no :5783	Min. : 0	Min. :0.0000	No :7161	Min. : 0.000	Min. : 0.00	Highly Urban/ Urban :6492
1st Qu.: 2559	1:2153	1st Qu.: 0	1st Qu.:0.0000	1st Qu.:39.00	1st Qu.:0.0000	1st Qu.: 9.0	1st Qu.: 28097	Yes:1077	1st Qu.: 0	z_No:3267	z_F:4375	Bachelors :2242	Clerical :1271	1st Qu.: 22.00	Private :5132	1st Qu.: 9280	1st Qu.: 1.000	Panel Truck: 676	yes:2378	1st Qu.: 0	1st Qu.:0.0000	Yes:1000	1st Qu.: 0.000	1st Qu.: 1.00	z_Highly Rural/ Rural:1669
Median : 5133	NA	Median : 0	Median :0.0000	Median :45.00	Median :0.0000	Median :11.0	Median : 54028	NA	Median :161160	NA	NA	Masters :1658	Professional :1117	Median : 33.00	NA	Median :14440	Median : 4.000	Pickup :1389	NA	Median : 0	Median :0.0000	NA	Median : 1.000	Median : 8.00	NA
Mean : 5152	NA	Mean : 1504	Mean :0.1711	Mean :44.79	Mean :0.7212	Mean :10.5	Mean : 61898	NA	Mean :154867	NA	NA	PhD : 728	Manager : 988	Mean : 33.49	NA	Mean :15710	Mean : 5.351	Sports Car : 907	NA	Mean : 4037	Mean :0.7986	NA	Mean : 1.696	Mean : 8.33	NA
3rd Qu.: 7745	NA	3rd Qu.: 1036	3rd Qu.:0.0000	3rd Qu.:51.00	3rd Qu.:1.0000	3rd Qu.:13.0	3rd Qu.: 85986	NA	3rd Qu.:238724	NA	NA	z_High School:2330	Lawyer : 835	3rd Qu.: 44.00	NA	3rd Qu.:20850	3rd Qu.: 7.000	Van : 750	NA	3rd Qu.: 4636	3rd Qu.:2.0000	NA	3rd Qu.: 3.000	3rd Qu.:12.00	NA
Max. :10302	NA	Max. :107586	Max. :4.0000	Max. :81.00	Max. :5.0000	Max. :23.0	Max. :367030	NA	Max. :885282	NA	NA	NA	Student : 712	Max. :142.00	NA	Max. :69740	Max. :25.000	z_SUV :2294	NA	Max. :57037	Max. :5.0000	NA	Max. :13.000	Max. :28.00	NA
NA	NA	NA	NA	NA’s :6	NA	NA’s :454	NA’s :445	NA	NA’s :464	NA	NA	NA	(Other) :1413	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA’s :511	NA

Fix Missing Values

There are 511 observations where the CAR_AGE variable is missing, 454 observations where variable YOJ is missing, 6 observations where the variable AGE is missing, 445 observations where the variable INCOME is missing, and 464 observations where the variable HOME_VAL is missing. There are 1714, or 21% of the observations missing varables. We will fill in the missing data with the median value.

fix_missing <- function(df) {
  df %>% 
    mutate_at(vars(c("CAR_AGE", "YOJ", "AGE", "INCOME", "HOME_VAL")), ~ifelse(is.na(.), median(., na.rm = TRUE), .))
}

df <- fix_missing(df)
evaluation <- fix_missing(evaluation)

Feature Creation

We will create a log transformed income and home value feature. We will also create an average claim amount that will hopefully track better with the TARGET_AMT variable. We will also flag the TARGET_AMT observations that are outliers. These would be cases where the vehicle is totaled and are qualitatively different than the minor accidents.

# Function to add features

outlier <- min(boxplot(df[df$TARGET_FLAG==1,]$TARGET_AMT, plot=FALSE)$out)

create_features <- function(d){
  d %>%
    mutate(LOG_INCOME = log(INCOME + 1),
           LOG_HOME_VAL = log(HOME_VAL + 1),
           AVG_CLAIM = ifelse(CLM_FREQ > 0, OLDCLAIM / CLM_FREQ, 0),
           PRIOR_ACCIDENT = factor(ifelse(OLDCLAIM == 0 & AVG_CLAIM == 0, 0, 1)),
           COLLEGE_EDUCATED = factor(ifelse(EDUCATION %in% c("Bachelors", "Masters", "PhD"), 1, 0)),
           URBAN_DRIVER = factor(ifelse(URBANICITY == "Highly Urban/ Urban", 1, 0)),
           YOUNG_MALE = factor(ifelse(SEX == "M" & AGE < 25, 1, 0)),
           YOUNG = factor(ifelse(AGE < 25, 1, 0)),
           STUDENT = factor(ifelse(JOB == "Student", 1, 0)),
           MVR_PTS_GT_0 = factor(ifelse(MVR_PTS > 0, 1, 0)),
           RED_SPORTS_CAR = factor(ifelse(CAR_TYPE == "Sports Car" & RED_CAR == "yes", 1, 0)),
           HAS_KIDS = factor(ifelse(HOMEKIDS == 0, 0, 1)),
           KID_DRIVERS = factor(ifelse(KIDSDRIV == 0, 0, 1)),
           TARGET_AMT_OUTLIER = ifelse(TARGET_AMT < outlier, 0, 1)) %>%
    select(-URBANICITY)
}

df <- create_features(df)
evaluation <- create_features(evaluation)

Creating Training/Test Data Sets

For Classifier Model

Now that we have a complete data set we will split the data into a training (train) and test set (test) for the classifier. We’ll use a 70-30 split between train and test, respectively.

set.seed(42)
train_index <- createDataPartition(df$TARGET_FLAG, p = .7, list = FALSE, times = 1)
train <- df[train_index,]
test <- df[-train_index,]

There are 1508 out of 5714 records in the training data set that have been in an accident. We want to correct the imbalance in the dataset by over sampling this group so our classifier will do a better job identifying this minority group.

set.seed(42)
minority <- nrow(train[train$TARGET_FLAG == 1,])
majority <- nrow(train[train$TARGET_FLAG == 0,])
diff <- majority - minority
minority_index <- train[train$TARGET_FLAG == 1,]$INDEX
over_sample_train <- data.frame(INDEX = sample(minority_index, diff, TRUE)) %>%
  merge(train, .) %>%
  bind_rows(train)

The over sampled data frame has 8412 records, and as you can see in the following figure is now balanced:

For Linear Regression Model

Since the linear regression model’s goal is to predict the cost of the claim we will subset the 8161 records for those involved in an accident. We will split these 2153 records into training and test sets using the same 70-30 split.

set.seed(42)
accidents <- df %>%
  filter(TARGET_FLAG == 1)
amt_train_index <- createDataPartition(accidents$TARGET_AMT, p = .7, list = FALSE, times = 1)
amt_train <- accidents[amt_train_index,]
amt_test <- accidents[-amt_train_index,]

There are 1509 out of 2153 records in the training data set.

Exploratory Data Analysis

In exploring the data we are looking for two things. We are looking for variables that will help use divide the data into those who have been in an accident and those who have not. We are also looking for variables that are linearly correlated with the claim amount so it may be used as a predictor for the linear regression model. We will be looking at both training set for the two model types.

First we will examine the numeric variables found in the oversampled classification data set:

plot_vars <- c("TARGET_FLAG", names(keep(over_sample_train, is.numeric)))

over_sample_train[plot_vars] %>%
  select(-INDEX, -TARGET_AMT) %>%
  gather(variable, value, -TARGET_FLAG) %>%
  ggplot(., aes(TARGET_FLAG, value, color=TARGET_FLAG)) + 
  geom_boxplot() +
  scale_color_brewer(palette="Set1") +
  theme_light() +
  theme(legend.position = "none") +
  facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = element_blank(), y = element_blank())

There are a few variables that seem to have a difference between the two groups. For example CLM_FREQ might be useful in differentiating. There isn’t much difference between a lot of groups which is somewhat understandable.

Next we will look at the catagorical variables in the oversampled classifcation data set. These charts look to see if a variable can be uses to distinguish those that have had an accident (blue) from those who haven’t (red).

plot_vars <- names(keep(over_sample_train, is.factor))

temp <- over_sample_train[plot_vars] %>%
  gather(variable, value, -TARGET_FLAG) %>%
  group_by(TARGET_FLAG, variable, value) %>%
  tally()

temp %>%
  group_by(variable, value) %>%
  summarise(total = sum(n)) %>%
  merge(temp,.) %>%
  mutate(percent = n / total) %>%
  ggplot(., aes(value, percent, fill=TARGET_FLAG)) + 
  geom_col() +
  scale_fill_brewer(palette="Set1") +
  theme_light() +
  theme(legend.position = "none") +
  facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = element_blank(), y = element_blank())

The URBAN_DRIVER is much more likely to be in an accident than the rural driver. The YOUNG and those with a PRIOR_ACCIDENT are also more likely to be in an accident. There doesn’t seem to be much evidence for a difference between the sexes.

Now let’s focus on the relationship between these variables and the size of the claim. As a reminder this data only includes those who have been in an accident. For starters we will look at the distribution of the claims for those in an accident.

ggplot(amt_train, aes(x=TARGET_AMT)) + 
  geom_density() +
  theme_light() +
  geom_vline(aes(xintercept = mean(TARGET_AMT)), lty=2, col="red") +
  geom_label(aes(x=25000, y=0.00015, label=paste("mean =", round(mean(TARGET_AMT),0)))) +
  geom_vline(aes(xintercept = median(TARGET_AMT)), lty=2, col="darkgreen") +
  geom_label(aes(x=25000, y=0.00010, label=paste("median = ", round(median(TARGET_AMT), 0)))) +
  labs(title="TARGET_AMT Density Plot", y="Density", x="TARGET_AMT")

As was previously noted this distribution has a long tail. The mean payout is $5689 and the median is $4102. The median and mean are higher, of course for those observations we classified as outliers. The outlier cutoff point is $10594.

amt_train %>%
  mutate(TARGET_AMT_OUTLIER = ifelse(TARGET_AMT_OUTLIER == 1, "Yes", "No")) %>%
  group_by(TARGET_AMT_OUTLIER) %>%
  summarise(Mean = mean(TARGET_AMT),
            Median = median(TARGET_AMT)) %>%
  kable() %>%
  kable_styling()

TARGET_AMT_OUTLIER	Mean	Median
No	4032.048	3893.00
Yes	27402.149	22951.25

Now to look at the data for the linear regression model. We are looking for good predictors of the claim amount. We will look at the numeric variables scatter and correlation plots:

amt_train %>%
  keep(is.numeric) %>%
  gather(variable, value, -TARGET_AMT) %>%
  ggplot(., aes(value, TARGET_AMT)) + 
  geom_point() +
  scale_color_brewer(palette="Set1") +
  theme_light() +
  theme(legend.position = "none") +
  facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = element_blank(), y = element_blank())

M <- amt_train %>%
  select(-INDEX) %>%
  keep(is.numeric) %>%
  cor(.)
corrplot(M, type = "upper")

amt_train %>%
  keep(is.numeric)%>%
  select(-INDEX, -TARGET_AMT) %>%
  cor(., amt_train$TARGET_AMT) %>%
  kable() %>%
  kable_styling()

KIDSDRIV	0.0061500
AGE	0.0150957
HOMEKIDS	0.0171601
YOJ	0.0300531
INCOME	0.0404098
HOME_VAL	0.0442609
TRAVTIME	0.0120292
BLUEBOOK	0.1257628
TIF	-0.0180912
OLDCLAIM	0.0124373
CLM_FREQ	-0.0026814
MVR_PTS	0.0425935
CAR_AGE	-0.0040321
LOG_INCOME	0.0384403
LOG_HOME_VAL	0.0289741
AVG_CLAIM	0.0310000
TARGET_AMT_OUTLIER	0.7901975

These two points highlight that there isn’t a really strong correlation between most of the predictors and the claim amount. The only one that is strong is the outlier flag that we created. Let’s finish our exploration by looking at the categoricial variables. We are removing the TARGET_AMT_OUTLIERS from the plot to increase their readability.

plot_vars <- c("TARGET_AMT", "TARGET_AMT_OUTLIER", names(keep(amt_train, is.factor)))

amt_train[plot_vars] %>% 
  filter(TARGET_AMT_OUTLIER == 0) %>%
  gather(variable, value, -TARGET_AMT) %>%
  ggplot(., aes(value, TARGET_AMT)) + 
  geom_boxplot() +
  theme_light() +
  theme(legend.position = "none") +
  facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = element_blank(), y = element_blank())

These boxplots show that there isn’t really much difference in the claim amounts between these groups.

Summary of Findings

Through this analysis we have found that there is a lot of noise associated with this signal. When it comes the the classification task, most variables offer a similar picture. This is probably due to the random nature of vehicle accidents. There are a few variables that may be helpful in building our models.

The claim prediciton task will definately be more of a challenge. Little is correlated with the claim amounts. The model fit will not be as strong as the classification models.

Model Creation & Evaluation

Now that we have created datasets and explored them, it is time to build some predictive models and evaluate them.

For the classification models we will look at the prediction accuracy on the test set. Especially with the accuracy at discovering those that are in an accident. Accurate prediction of this minority class indicates a good model.

evaluate_model <- function(model, test_df, target = "TARGET_FLAG", threshold = 0.5){
  test_df$yhat <- ifelse(predict.glm(model, test_df, "response") >= threshold, 1, 0)
  cm <- confusionMatrix(factor(test_df$yhat), factor(test_df[[target]]), "1")
  deviance <- model$deviance
  r2 <- 1 - model$deviance / model$null.deviance
  
  cat("F1 =", cm$byClass[7],"\nR2 =", r2, "\n\n")
  print(cm)
  
  eval <- data.frame(actual = test_df$TARGET_FLAG, 
                     predicted = test_df$yhat, 
                     prob = predict(model, test_df))


  pred <- prediction(eval$prob, eval$actual)
  auc <- performance(pred, measure = "auc")@y.values[[1]]
  perf <- performance(pred, "tpr", "fpr")
  plot(perf,main="ROC Curve", sub = paste0("AUC: ", round(auc, 3)))
  
  return(cm)
}

Classification Model

Baseline Model

We will create a simple model to serve as the baseline. This model posits that drivers history is a good representation of their future. So drivers that have been in an accident (OLDCLAIM > 0 or AVG_CLAIM > 0) are more likely to be in an accident. Conversely those who haven’t aren’t likely to be in one in the future.

baseline_model <- glm(TARGET_FLAG ~ PRIOR_ACCIDENT, family = binomial(link = "logit"), over_sample_train)
summary(baseline_model)


Call:
glm(formula = TARGET_FLAG ~ PRIOR_ACCIDENT, family = binomial(link = "logit"), 
    data = over_sample_train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.45262  -0.97194  -0.02338   0.92517   1.39782  

Coefficients:
                Estimate Std. Error z value            Pr(>|z|)    
(Intercept)     -0.50462    0.03031  -16.65 <0.0000000000000002 ***
PRIOR_ACCIDENT1  1.13171    0.04567   24.78 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 11662  on 8411  degrees of freedom
Residual deviance: 11022  on 8410  degrees of freedom
AIC: 11026

Number of Fisher Scoring iterations: 4

results <- evaluate_model(baseline_model, test)

F1 = 0.4659588 
R2 = 0.05485262 

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1219  272
         1  583  373
                                             
               Accuracy : 0.6506             
                 95% CI : (0.6313, 0.6695)   
    No Information Rate : 0.7364             
    P-Value [Acc > NIR] : 1                  
                                             
                  Kappa : 0.2206             
                                             
 Mcnemar's Test P-Value : <0.0000000000000002
                                             
            Sensitivity : 0.5783             
            Specificity : 0.6765             
         Pos Pred Value : 0.3902             
         Neg Pred Value : 0.8176             
             Prevalence : 0.2636             
         Detection Rate : 0.1524             
   Detection Prevalence : 0.3907             
      Balanced Accuracy : 0.6274             
                                             
       'Positive' Class : 1

This simple model has statistically significant predictors. Applying this model to the test data set indicates this simple model has a 65.1% accuracy rate. It correcly identified 57.8% of the people with accidents and 67.6% of those without. It is better than flipping a coin but not by much. All future models must outpreform this baseline.

Risk Taker Model

For this model we are going assume that risk takers are more likely to be in an accident. To identify those who tend to be risk takers, we are going with the urban legend that those who drive red sports cars are risk takers. Also young men (16-24) also tend to be risk takers.

risk_taker_model <- glm(TARGET_FLAG ~ RED_SPORTS_CAR + YOUNG_MALE, family = binomial(link = "logit"), over_sample_train)
summary(risk_taker_model)


Call:
glm(formula = TARGET_FLAG ~ RED_SPORTS_CAR + YOUNG_MALE, family = binomial(link = "logit"), 
    data = over_sample_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3791  -1.1759  -0.0938   1.1789   1.1789  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)     -0.003599   0.021907  -0.164    0.869
RED_SPORTS_CAR1  0.308981   0.352893   0.876    0.381
YOUNG_MALE1      0.466223   0.310351   1.502    0.133

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 11662  on 8411  degrees of freedom
Residual deviance: 11658  on 8409  degrees of freedom
AIC: 11664

Number of Fisher Scoring iterations: 3

risk_taker_results <- evaluate_model(risk_taker_model, test)

F1 = 0.03927492 
R2 = 0.0002641343 

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1798  632
         1    4   13
                                             
               Accuracy : 0.7401             
                 95% CI : (0.7222, 0.7574)   
    No Information Rate : 0.7364             
    P-Value [Acc > NIR] : 0.3494             
                                             
                  Kappa : 0.0261             
                                             
 Mcnemar's Test P-Value : <0.0000000000000002
                                             
            Sensitivity : 0.020155           
            Specificity : 0.997780           
         Pos Pred Value : 0.764706           
         Neg Pred Value : 0.739918           
             Prevalence : 0.263588           
         Detection Rate : 0.005313           
   Detection Prevalence : 0.006947           
      Balanced Accuracy : 0.508968           
                                             
       'Positive' Class : 1

This model only has one statistically significant predictor, however it has a 74% accuracy rate. This is because it identified 99.8% of those who didn’t have an accident (and they are more numerous). When looking at the model’s performance on identifying the minority group of interest we see it correcly identified 2% of the people with accidents. So while the accuracy rate out preformed the baseline model, this one will not be the model we will use.

Traditional Model

Traditional wisdom holds that there are a few tried and true predictors of someone’s risk of an automobile accident. Namely: age (under 25 vs. 25+), marital status, driving record, distance to work, and the driver’s sex (which may or may not be proper to use, but seems to have held throughout the years).

traditional_model <- glm(TARGET_FLAG ~ YOUNG + MSTATUS + PRIOR_ACCIDENT + SEX + REVOKED + MVR_PTS + TRAVTIME + CAR_USE, family = binomial(link = "logit"), over_sample_train)
summary(traditional_model)


Call:
glm(formula = TARGET_FLAG ~ YOUNG + MSTATUS + PRIOR_ACCIDENT + 
    SEX + REVOKED + MVR_PTS + TRAVTIME + CAR_USE, family = binomial(link = "logit"), 
    data = over_sample_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2918  -0.9992  -0.1634   1.0394   1.9197  

Coefficients:
                 Estimate Std. Error z value             Pr(>|z|)    
(Intercept)     -1.127120   0.075485 -14.932 < 0.0000000000000002 ***
YOUNG1           0.803997   0.244210   3.292             0.000994 ***
MSTATUSz_No      0.585304   0.047681  12.275 < 0.0000000000000002 ***
PRIOR_ACCIDENT1  0.813243   0.051742  15.717 < 0.0000000000000002 ***
SEXz_F           0.298638   0.050099   5.961        0.00000000251 ***
REVOKEDYes       0.805760   0.068452  11.771 < 0.0000000000000002 ***
MVR_PTS          0.142383   0.011822  12.044 < 0.0000000000000002 ***
TRAVTIME         0.008707   0.001515   5.746        0.00000000912 ***
CAR_USEPrivate  -0.586681   0.051014 -11.500 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 11662  on 8411  degrees of freedom
Residual deviance: 10350  on 8403  degrees of freedom
AIC: 10368

Number of Fisher Scoring iterations: 4

model_results <- evaluate_model(traditional_model, test)

F1 = 0.510559 
R2 = 0.1124805 

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1248  234
         1  554  411
                                             
               Accuracy : 0.678              
                 95% CI : (0.6591, 0.6965)   
    No Information Rate : 0.7364             
    P-Value [Acc > NIR] : 1                  
                                             
                  Kappa : 0.2845             
                                             
 Mcnemar's Test P-Value : <0.0000000000000002
                                             
            Sensitivity : 0.6372             
            Specificity : 0.6926             
         Pos Pred Value : 0.4259             
         Neg Pred Value : 0.8421             
             Prevalence : 0.2636             
         Detection Rate : 0.1680             
   Detection Prevalence : 0.3944             
      Balanced Accuracy : 0.6649             
                                             
       'Positive' Class : 1

This model has statistically significant predictors. Applying this model to the test data set indicates this model has a 67.8% accuracy rate. It correcly identified 63.7% of the people with accidents and 69.3% of those without. This model out preforms the baseline model.

Traditional Model with Cross-Validation

Here we try to use cross-validation techniques to improve the traditional model. We go back to the raw training data, because it is typically not-advised to use data that is oversampled prior to cross-validation.

# Get complete cases
cases <- train %>%
  select(YOUNG,MSTATUS,PRIOR_ACCIDENT,SEX,REVOKED,MVR_PTS, TRAVTIME, CAR_USE) %>%
  complete.cases()
temp <- train[cases,]

# Use 5-fold cross-validation
train_control <- trainControl(method = "cv", number = 5, sampling = "up")
traditional_cv <- train(form = TARGET_FLAG ~ YOUNG + MSTATUS + PRIOR_ACCIDENT + SEX + REVOKED + MVR_PTS + TRAVTIME + CAR_USE, method = "glm", family = "binomial", data = temp, trControl = train_control)

traditional_cv

Generalized Linear Model 

5714 samples
   8 predictor
   2 classes: '0', '1' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 4572, 4571, 4571, 4571, 4571 
Addtional sampling using up-sampling

Resampling results:

  Accuracy   Kappa    
  0.6757072  0.2826379

# Evaluating the model
eval <- data.frame(actual = test$TARGET_FLAG, predicted = predict(traditional_cv,newdata=test,type="raw"), prob = predict(traditional_cv,newdata=test,type="prob"))

confusionMatrix(eval$predicted, eval$actual, positive = "1")

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1249  236
         1  553  409
                                             
               Accuracy : 0.6776             
                 95% CI : (0.6586, 0.6961)   
    No Information Rate : 0.7364             
    P-Value [Acc > NIR] : 1                  
                                             
                  Kappa : 0.2826             
                                             
 Mcnemar's Test P-Value : <0.0000000000000002
                                             
            Sensitivity : 0.6341             
            Specificity : 0.6931             
         Pos Pred Value : 0.4252             
         Neg Pred Value : 0.8411             
             Prevalence : 0.2636             
         Detection Rate : 0.1671             
   Detection Prevalence : 0.3931             
      Balanced Accuracy : 0.6636             
                                             
       'Positive' Class : 1

pred <- prediction(eval$prob.1, eval$actual)
auc <- performance(pred, measure = "auc")@y.values[[1]]
perf <- performance(pred,"tpr","fpr")
plot(perf,main="ROC Curve", sub=paste0("AUC: ",round(auc,3)))

Alternate Traditional Model

This model is an alternate to the traditional model.

model <- glm(TARGET_FLAG ~ PRIOR_ACCIDENT + KID_DRIVERS + MSTATUS + INCOME + SEX + CAR_USE + COLLEGE_EDUCATED + REVOKED + URBAN_DRIVER, family = binomial(link = "logit"), over_sample_train)
summary(model)


Call:
glm(formula = TARGET_FLAG ~ PRIOR_ACCIDENT + KID_DRIVERS + MSTATUS + 
    INCOME + SEX + CAR_USE + COLLEGE_EDUCATED + REVOKED + URBAN_DRIVER, 
    family = binomial(link = "logit"), data = over_sample_train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.47843  -0.95017   0.04407   0.92959   2.57995  

Coefficients:
                       Estimate    Std. Error z value             Pr(>|z|)
(Intercept)       -1.5346466203  0.0927535704 -16.545 < 0.0000000000000002
PRIOR_ACCIDENT1    0.7921271578  0.0506657588  15.634 < 0.0000000000000002
KID_DRIVERS1       0.8289530622  0.0742694029  11.161 < 0.0000000000000002
MSTATUSz_No        0.7589906840  0.0508503649  14.926 < 0.0000000000000002
INCOME            -0.0000083140  0.0000006582 -12.632 < 0.0000000000000002
SEXz_F             0.2759209075  0.0528561497   5.220          0.000000179
CAR_USEPrivate    -0.7343462515  0.0546026351 -13.449 < 0.0000000000000002
COLLEGE_EDUCATED1 -0.5201355594  0.0584399941  -8.900 < 0.0000000000000002
REVOKEDYes         0.6851814650  0.0720376503   9.511 < 0.0000000000000002
URBAN_DRIVER1      1.9516574434  0.0842859474  23.155 < 0.0000000000000002
                     
(Intercept)       ***
PRIOR_ACCIDENT1   ***
KID_DRIVERS1      ***
MSTATUSz_No       ***
INCOME            ***
SEXz_F            ***
CAR_USEPrivate    ***
COLLEGE_EDUCATED1 ***
REVOKEDYes        ***
URBAN_DRIVER1     ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 11661.5  on 8411  degrees of freedom
Residual deviance:  9501.3  on 8402  degrees of freedom
AIC: 9521.3

Number of Fisher Scoring iterations: 4

model_results <- evaluate_model(model, test)

F1 = 0.5638611 
R2 = 0.1852393 

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1227  166
         1  575  479
                                             
               Accuracy : 0.6972             
                 95% CI : (0.6785, 0.7153)   
    No Information Rate : 0.7364             
    P-Value [Acc > NIR] : 1                  
                                             
                  Kappa : 0.3519             
                                             
 Mcnemar's Test P-Value : <0.0000000000000002
                                             
            Sensitivity : 0.7426             
            Specificity : 0.6809             
         Pos Pred Value : 0.4545             
         Neg Pred Value : 0.8808             
             Prevalence : 0.2636             
         Detection Rate : 0.1957             
   Detection Prevalence : 0.4307             
      Balanced Accuracy : 0.7118             
                                             
       'Positive' Class : 1

This model’s predictors are all statistically significant. Applying this model to the test data set indicates this model has a 69.7% accuracy rate. It correcly identified 74.3% of the people with accidents and 68.1% of those without. This model also out preforms the baseline model. Now that we have a few viable classification models, we can turn our attention to the linear claim model.

Linear Claim Model

Baseline Model

Given the unusual distribution one posible solution would be to predict the claim is the median value. This is a gross simplification but will serve as the baseline for the linear models to compete against. The way we are looking at how well these models preform is calculating the prediction error as a percentage of the actual and then averaging it across the data set.

the_median <- median(amt_train$TARGET_AMT)

evaluate_lm <- function(yhat, actual){
  data.frame(yhat = yhat, actual = actual) %>%
    mutate(error = yhat - actual) %>%
    mutate(error_percent = error / actual) %>%
    summarise(`Average Error` = mean(error), `Average Percent` = mean(error_percent)) %>%
    kable() %>%
    kable_styling()
}

evaluate_lm(c(the_median), amt_test$TARGET_AMT)

Average Error	Average Percent
-1630.658	0.504024

By simply using the median to predict the values of the claims we would underestimate the actual value on average by $1,600 dollars. On a percentage basis the error would be 50% of the actual.

Simple Linear Model

This simple linerar model is that the amount of the claim is based off of the value of the vehicle. More expensive vehicles should be more costly to repair than less expensive vehicles. It is overly simplistic.

baseline_lm <- lm(TARGET_AMT ~ BLUEBOOK, amt_train)
summary(baseline_lm)


Call:
lm(formula = TARGET_AMT ~ BLUEBOOK, data = amt_train)

Residuals:
   Min     1Q Median     3Q    Max 
 -7818  -3153  -1548    326  78883 

Coefficients:
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept) 4038.05174  387.55794  10.419 < 0.0000000000000002 ***
BLUEBOOK       0.11399    0.02316   4.921          0.000000955 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7536 on 1507 degrees of freedom
Multiple R-squared:  0.01582,   Adjusted R-squared:  0.01516 
F-statistic: 24.22 on 1 and 1507 DF,  p-value: 0.0000009546

While this predictor is statistically significant and positive, the model’s overall explanitory power is weak. Let’s see how it preformed on the test set:

evaluate_lm(predict(baseline_lm, amt_test), amt_test$TARGET_AMT)

Average Error	Average Percent
-130.8995	1.001244

Well the average error has decreased which is good but the average error share of the actual has doubled. This is a garbage model.

Outlier Model

We previously flagged the outliers in the data set. IF we can build another classification model, how well will it preform?

outlier_lm <- lm(TARGET_AMT ~ TARGET_AMT_OUTLIER, amt_train)
summary(outlier_lm)


Call:
lm(formula = TARGET_AMT ~ TARGET_AMT_OUTLIER, data = amt_train)

Residuals:
   Min     1Q Median     3Q    Max 
-16808  -1661   -186   1345  58122 

Coefficients:
                   Estimate Std. Error t value            Pr(>|t|)    
(Intercept)          4032.0      124.3   32.43 <0.0000000000000002 ***
TARGET_AMT_OUTLIER  23370.1      466.9   50.05 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4655 on 1507 degrees of freedom
Multiple R-squared:  0.6244,    Adjusted R-squared:  0.6242 
F-statistic:  2505 on 1 and 1507 DF,  p-value: < 0.00000000000000022

This model is unfair as it is predicting the outcomes based on a predictor derived from the outcome. Unsuprisingly this model does a really good job. It has an adjusted $R^2$ of 0.6241628.

The reason we explored this model is to see if we has a reason to build another classifier. If we build a second classifier that can predict this variable would it be worth it. Let’s see how the model preforms on the test set:

evaluate_lm(predict(outlier_lm, amt_test), amt_test$TARGET_AMT)

Average Error	Average Percent
-140.1835	0.5530068

The predictions on average understate the actual value by $140. The error as a percent of the actual is close to what we get with just using the median. The improvement really comes from using a higher value to estimate the claim for the outliers.

The challenge will be to create a classifier. There are 107 out of 1509 that are outliers. Let’s create another balanced data set and see if we can build another classifier.

set.seed(42)
minority <- nrow(amt_train[amt_train$TARGET_AMT_OUTLIER == 1,])
majority <- nrow(amt_train[amt_train$TARGET_AMT_OUTLIER == 0,])
diff <- majority - minority
minority_index <- amt_train[amt_train$TARGET_AMT_OUTLIER == 1,]$INDEX
over_sample_train_2 <- data.frame(INDEX = sample(minority_index, diff, TRUE)) %>%
  merge(amt_train, .) %>%
  bind_rows(amt_train)

Let’s explore the correlations to see what is correlated:

M <- over_sample_train_2 %>%
  select(-INDEX, -TARGET_AMT) %>%
  keep(is.numeric) %>%
  cor(.)
corrplot(M, type = "upper")

It looks like the BLUEBOOK and PARENT1 are the only variables that are correlated with this outcome. So let’s build the classifier.

outlier_model <- glm(TARGET_AMT_OUTLIER ~ BLUEBOOK + PARENT1, family = binomial(link = "logit"), over_sample_train_2)
summary(outlier_model)


Call:
glm(formula = TARGET_AMT_OUTLIER ~ BLUEBOOK + PARENT1, family = binomial(link = "logit"), 
    data = over_sample_train_2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0215  -1.1039  -0.1248   1.1664   1.4622  

Coefficients:
                Estimate   Std. Error z value             Pr(>|z|)    
(Intercept) -0.889430410  0.084530469 -10.522 < 0.0000000000000002 ***
BLUEBOOK     0.000048194  0.000004618  10.435 < 0.0000000000000002 ***
PARENT1Yes   0.441269006  0.087932394   5.018          0.000000521 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3887.2  on 2803  degrees of freedom
Residual deviance: 3732.8  on 2801  degrees of freedom
AIC: 3738.8

Number of Fisher Scoring iterations: 4

outlier_model_results <- evaluate_model(outlier_model, test, "TARGET_AMT_OUTLIER")

F1 = 0.04214963 
R2 = 0.03970578 

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1518   24
         1  885   20
                                             
               Accuracy : 0.6285             
                 95% CI : (0.609, 0.6477)    
    No Information Rate : 0.982              
    P-Value [Acc > NIR] : 1                  
                                             
                  Kappa : 0.0081             
                                             
 Mcnemar's Test P-Value : <0.0000000000000002
                                             
            Sensitivity : 0.454545           
            Specificity : 0.631710           
         Pos Pred Value : 0.022099           
         Neg Pred Value : 0.984436           
             Prevalence : 0.017981           
         Detection Rate : 0.008173           
   Detection Prevalence : 0.369841           
      Balanced Accuracy : 0.543128           
                                             
       'Positive' Class : 1

Now to test this out. Based on our linear regression test set we will predict if the observation is likely to be an outlier or not. We will then estimate the TARGET_AMT using the linear model. We will then evaluate the results of the “model”.

temp <- amt_test
temp$TARGET_AMT_OUTLIER <- ifelse(predict.glm(outlier_model, amt_test, "response") >= 0.5, 1, 0)
temp$TARGET_AMT_yhat <- predict(outlier_lm, amt_test)
evaluate_lm(temp$TARGET_AMT_yhat, temp$TARGET_AMT)

Average Error	Average Percent
-140.1835	0.5530068

So through this somewhat convoluted approach we have done slightly better than using the median.

Predictions

The final predictions will use the alternative traditional model to predict the TARGET_FLAG variable, then pass it through the second classifier which predicts the TARGET_AMT_OUTLIER flag. We then use the linear model to make the final amount prediction.

predictions <- function(df, classifier_1, linear_model, classifier_2){
  df$TARGET_FLAG <- ifelse(predict.glm(classifier_1, df, "response") >= 0.5, 1, 0)
  # We assume that everyone with a TARGET_FLAG = 0 has a TARGET_AMT as zero.  
  # We then refine it with the outlier model.
  df$TARGET_AMT_OUTLIER <- ifelse(predict.glm(classifier_2, df, "response") >= 0.5, 1, 0)
  df$yhat <- predict(linear_model, df)
  df <- df %>%
    mutate(TARGET_AMT = ifelse(TARGET_AMT_OUTLIER == 1, yhat, 0)) %>%
    mutate(TARGET_AMT = ifelse(TARGET_FLAG == 0, 0, TARGET_AMT)) %>%
    select(-yhat, -TARGET_AMT_OUTLIER)
  return(df)
}

evaluation <- predictions(evaluation, model, outlier_lm, outlier_model)

DATA 621 Homework #4

From Work by Critical Thinking Group 3