DATA 621 Homework #4
From Work by Critical Thinking Group 3
DATA 621 Homework #4
df <- read.csv("./data/insurance_training_data.csv")
evaluation <- read.csv("./data/insurance-evaluation-data.csv")
strip_dollars <- function(x){
x <- as.character(x)
x <- gsub(",", "", x)
x <- gsub("\\$", "", x)
as.numeric(x)
}
Introduction
We have been given a dataset with 8161 records representing customers of an auto insurance company. Each record has two response variables. The first objective is to train a logistic regression classifier to predict if a person was in a car crash. The response variable here is TARGET_FLAG
respresenting whether a person had an accident (1) or did not have one (0).
The second objective will be to train a regression model to predict the cost of a crash, if one occurred. The second reponse variable, TARGET_VALUE
, is the amount it will cost if the person crashes their car. The value is zero if the person did not crash their car.
Looking at the distribution of the TARGET_AMT
variable, we can see that the variable is considerably right-skewed. Thus, a LOG transform might be best here.
Data Preparation & Exploration
We will first look at the summary statistics for the data
INDEX | TARGET_FLAG | TARGET_AMT | KIDSDRIV | AGE | HOMEKIDS | YOJ | INCOME | PARENT1 | HOME_VAL | MSTATUS | SEX | EDUCATION | JOB | TRAVTIME | CAR_USE | BLUEBOOK | TIF | CAR_TYPE | RED_CAR | OLDCLAIM | CLM_FREQ | REVOKED | MVR_PTS | CAR_AGE | URBANICITY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Min. : 1 | Min. :0.0000 | Min. : 0 | Min. :0.0000 | Min. :16.00 | Min. :0.0000 | Min. : 0.0 | $0 : 615 | No :7084 | $0 :2294 | Yes :4894 | M :3786 | <High School :1203 | z_Blue Collar:1825 | Min. : 5.00 | Commercial:3029 | $1,500 : 157 | Min. : 1.000 | Minivan :2145 | no :5783 | $0 :5009 | Min. :0.0000 | No :7161 | Min. : 0.000 | Min. :-3.000 | Highly Urban/ Urban :6492 | |
1st Qu.: 2559 | 1st Qu.:0.0000 | 1st Qu.: 0 | 1st Qu.:0.0000 | 1st Qu.:39.00 | 1st Qu.:0.0000 | 1st Qu.: 9.0 | : 445 | Yes:1077 | : 464 | z_No:3267 | z_F:4375 | Bachelors :2242 | Clerical :1271 | 1st Qu.: 22.00 | Private :5132 | $6,000 : 34 | 1st Qu.: 1.000 | Panel Truck: 676 | yes:2378 | $1,310 : 4 | 1st Qu.:0.0000 | Yes:1000 | 1st Qu.: 0.000 | 1st Qu.: 1.000 | z_Highly Rural/ Rural:1669 | |
Median : 5133 | Median :0.0000 | Median : 0 | Median :0.0000 | Median :45.00 | Median :0.0000 | Median :11.0 | $26,840 : 4 | NA | $111,129: 3 | NA | NA | Masters :1658 | Professional :1117 | Median : 33.00 | NA | $5,800 : 33 | Median : 4.000 | Pickup :1389 | NA | $1,391 : 4 | Median :0.0000 | NA | Median : 1.000 | Median : 8.000 | NA | |
Mean : 5152 | Mean :0.2638 | Mean : 1504 | Mean :0.1711 | Mean :44.79 | Mean :0.7212 | Mean :10.5 | $48,509 : 4 | NA | $115,249: 3 | NA | NA | PhD : 728 | Manager : 988 | Mean : 33.49 | NA | $6,200 : 33 | Mean : 5.351 | Sports Car : 907 | NA | $4,263 : 4 | Mean :0.7986 | NA | Mean : 1.696 | Mean : 8.328 | NA | |
3rd Qu.: 7745 | 3rd Qu.:1.0000 | 3rd Qu.: 1036 | 3rd Qu.:0.0000 | 3rd Qu.:51.00 | 3rd Qu.:1.0000 | 3rd Qu.:13.0 | $61,790 : 4 | NA | $123,109: 3 | NA | NA | z_High School:2330 | Lawyer : 835 | 3rd Qu.: 44.00 | NA | $6,400 : 31 | 3rd Qu.: 7.000 | Van : 750 | NA | $1,105 : 3 | 3rd Qu.:2.0000 | NA | 3rd Qu.: 3.000 | 3rd Qu.:12.000 | NA | |
Max. :10302 | Max. :1.0000 | Max. :107586 | Max. :4.0000 | Max. :81.00 | Max. :5.0000 | Max. :23.0 | $107,375: 3 | NA | $153,061: 3 | NA | NA | NA | Student : 712 | Max. :142.00 | NA | $5,900 : 30 | Max. :25.000 | z_SUV :2294 | NA | $1,332 : 3 | Max. :5.0000 | NA | Max. :13.000 | Max. :28.000 | NA | |
NA | NA | NA | NA | NA’s :6 | NA | NA’s :454 | (Other) :7086 | NA | (Other) :5391 | NA | NA | NA | (Other) :1413 | NA | NA | (Other):7843 | NA | NA | NA | (Other):3134 | NA | NA | NA | NA’s :510 | NA |
There are some missing values that we will need to deal with. There are also some values that seem invalid (i.e. -3 CAR_AGE).
Fix Data Types
There are some variables that are currently factors that are dollar values that need to be transformed into numeric variables. We will do this to both the df
and evaluation
data frames. There are also some invalid data that will be changed to NAs.
strip_dollars <- function(x){
x <- as.character(x)
x <- gsub(",", "", x)
x <- gsub("\\$", "", x)
as.numeric(x)
}
fix_data_types <- function(messy_df){
messy_df %>%
rowwise() %>%
mutate(INCOME = strip_dollars(INCOME),
HOME_VAL = strip_dollars(HOME_VAL),
BLUEBOOK = strip_dollars(BLUEBOOK),
OLDCLAIM = strip_dollars(OLDCLAIM)) %>%
ungroup()
}
na_bad_values <- function(messy_df){
messy_df %>%
rowwise() %>%
mutate(CAR_AGE = ifelse(CAR_AGE < 0, NA, CAR_AGE))%>%
ungroup()
}
df$TARGET_FLAG <- factor(df$TARGET_FLAG)
df <- df %>%
fix_data_types() %>%
na_bad_values()
evaluation <- evaluation %>%
fix_data_types() %>%
na_bad_values()
Now that we have corrected the invalid variables, we can look at a sumamry of the data:
INDEX | TARGET_FLAG | TARGET_AMT | KIDSDRIV | AGE | HOMEKIDS | YOJ | INCOME | PARENT1 | HOME_VAL | MSTATUS | SEX | EDUCATION | JOB | TRAVTIME | CAR_USE | BLUEBOOK | TIF | CAR_TYPE | RED_CAR | OLDCLAIM | CLM_FREQ | REVOKED | MVR_PTS | CAR_AGE | URBANICITY | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Min. : 1 | 0:6008 | Min. : 0 | Min. :0.0000 | Min. :16.00 | Min. :0.0000 | Min. : 0.0 | Min. : 0 | No :7084 | Min. : 0 | Yes :4894 | M :3786 | <High School :1203 | z_Blue Collar:1825 | Min. : 5.00 | Commercial:3029 | Min. : 1500 | Min. : 1.000 | Minivan :2145 | no :5783 | Min. : 0 | Min. :0.0000 | No :7161 | Min. : 0.000 | Min. : 0.00 | Highly Urban/ Urban :6492 | |
1st Qu.: 2559 | 1:2153 | 1st Qu.: 0 | 1st Qu.:0.0000 | 1st Qu.:39.00 | 1st Qu.:0.0000 | 1st Qu.: 9.0 | 1st Qu.: 28097 | Yes:1077 | 1st Qu.: 0 | z_No:3267 | z_F:4375 | Bachelors :2242 | Clerical :1271 | 1st Qu.: 22.00 | Private :5132 | 1st Qu.: 9280 | 1st Qu.: 1.000 | Panel Truck: 676 | yes:2378 | 1st Qu.: 0 | 1st Qu.:0.0000 | Yes:1000 | 1st Qu.: 0.000 | 1st Qu.: 1.00 | z_Highly Rural/ Rural:1669 | |
Median : 5133 | NA | Median : 0 | Median :0.0000 | Median :45.00 | Median :0.0000 | Median :11.0 | Median : 54028 | NA | Median :161160 | NA | NA | Masters :1658 | Professional :1117 | Median : 33.00 | NA | Median :14440 | Median : 4.000 | Pickup :1389 | NA | Median : 0 | Median :0.0000 | NA | Median : 1.000 | Median : 8.00 | NA | |
Mean : 5152 | NA | Mean : 1504 | Mean :0.1711 | Mean :44.79 | Mean :0.7212 | Mean :10.5 | Mean : 61898 | NA | Mean :154867 | NA | NA | PhD : 728 | Manager : 988 | Mean : 33.49 | NA | Mean :15710 | Mean : 5.351 | Sports Car : 907 | NA | Mean : 4037 | Mean :0.7986 | NA | Mean : 1.696 | Mean : 8.33 | NA | |
3rd Qu.: 7745 | NA | 3rd Qu.: 1036 | 3rd Qu.:0.0000 | 3rd Qu.:51.00 | 3rd Qu.:1.0000 | 3rd Qu.:13.0 | 3rd Qu.: 85986 | NA | 3rd Qu.:238724 | NA | NA | z_High School:2330 | Lawyer : 835 | 3rd Qu.: 44.00 | NA | 3rd Qu.:20850 | 3rd Qu.: 7.000 | Van : 750 | NA | 3rd Qu.: 4636 | 3rd Qu.:2.0000 | NA | 3rd Qu.: 3.000 | 3rd Qu.:12.00 | NA | |
Max. :10302 | NA | Max. :107586 | Max. :4.0000 | Max. :81.00 | Max. :5.0000 | Max. :23.0 | Max. :367030 | NA | Max. :885282 | NA | NA | NA | Student : 712 | Max. :142.00 | NA | Max. :69740 | Max. :25.000 | z_SUV :2294 | NA | Max. :57037 | Max. :5.0000 | NA | Max. :13.000 | Max. :28.00 | NA | |
NA | NA | NA | NA | NA’s :6 | NA | NA’s :454 | NA’s :445 | NA | NA’s :464 | NA | NA | NA | (Other) :1413 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA’s :511 | NA |
Fix Missing Values
There are 511 observations where the CAR_AGE
variable is missing, 454 observations where variable YOJ
is missing, 6 observations where the variable AGE
is missing, 445 observations where the variable INCOME
is missing, and 464 observations where the variable HOME_VAL
is missing. There are 1714, or 21% of the observations missing varables. We will fill in the missing data with the median value.
Feature Creation
We will create a log transformed income and home value feature. We will also create an average claim amount that will hopefully track better with the TARGET_AMT
variable. We will also flag the TARGET_AMT
observations that are outliers. These would be cases where the vehicle is totaled and are qualitatively different than the minor accidents.
# Function to add features
outlier <- min(boxplot(df[df$TARGET_FLAG==1,]$TARGET_AMT, plot=FALSE)$out)
create_features <- function(d){
d %>%
mutate(LOG_INCOME = log(INCOME + 1),
LOG_HOME_VAL = log(HOME_VAL + 1),
AVG_CLAIM = ifelse(CLM_FREQ > 0, OLDCLAIM / CLM_FREQ, 0),
PRIOR_ACCIDENT = factor(ifelse(OLDCLAIM == 0 & AVG_CLAIM == 0, 0, 1)),
COLLEGE_EDUCATED = factor(ifelse(EDUCATION %in% c("Bachelors", "Masters", "PhD"), 1, 0)),
URBAN_DRIVER = factor(ifelse(URBANICITY == "Highly Urban/ Urban", 1, 0)),
YOUNG_MALE = factor(ifelse(SEX == "M" & AGE < 25, 1, 0)),
YOUNG = factor(ifelse(AGE < 25, 1, 0)),
STUDENT = factor(ifelse(JOB == "Student", 1, 0)),
MVR_PTS_GT_0 = factor(ifelse(MVR_PTS > 0, 1, 0)),
RED_SPORTS_CAR = factor(ifelse(CAR_TYPE == "Sports Car" & RED_CAR == "yes", 1, 0)),
HAS_KIDS = factor(ifelse(HOMEKIDS == 0, 0, 1)),
KID_DRIVERS = factor(ifelse(KIDSDRIV == 0, 0, 1)),
TARGET_AMT_OUTLIER = ifelse(TARGET_AMT < outlier, 0, 1)) %>%
select(-URBANICITY)
}
df <- create_features(df)
evaluation <- create_features(evaluation)
Creating Training/Test Data Sets
For Classifier Model
Now that we have a complete data set we will split the data into a training (train
) and test set (test
) for the classifier. We’ll use a 70-30 split between train and test, respectively.
set.seed(42)
train_index <- createDataPartition(df$TARGET_FLAG, p = .7, list = FALSE, times = 1)
train <- df[train_index,]
test <- df[-train_index,]
There are 1508 out of 5714 records in the training data set that have been in an accident. We want to correct the imbalance in the dataset by over sampling this group so our classifier will do a better job identifying this minority group.
set.seed(42)
minority <- nrow(train[train$TARGET_FLAG == 1,])
majority <- nrow(train[train$TARGET_FLAG == 0,])
diff <- majority - minority
minority_index <- train[train$TARGET_FLAG == 1,]$INDEX
over_sample_train <- data.frame(INDEX = sample(minority_index, diff, TRUE)) %>%
merge(train, .) %>%
bind_rows(train)
The over sampled data frame has 8412 records, and as you can see in the following figure is now balanced:
For Linear Regression Model
Since the linear regression model’s goal is to predict the cost of the claim we will subset the 8161 records for those involved in an accident. We will split these 2153 records into training and test sets using the same 70-30 split.
set.seed(42)
accidents <- df %>%
filter(TARGET_FLAG == 1)
amt_train_index <- createDataPartition(accidents$TARGET_AMT, p = .7, list = FALSE, times = 1)
amt_train <- accidents[amt_train_index,]
amt_test <- accidents[-amt_train_index,]
There are 1509 out of 2153 records in the training data set.
Exploratory Data Analysis
In exploring the data we are looking for two things. We are looking for variables that will help use divide the data into those who have been in an accident and those who have not. We are also looking for variables that are linearly correlated with the claim amount so it may be used as a predictor for the linear regression model. We will be looking at both training set for the two model types.
First we will examine the numeric variables found in the oversampled classification data set:
plot_vars <- c("TARGET_FLAG", names(keep(over_sample_train, is.numeric)))
over_sample_train[plot_vars] %>%
select(-INDEX, -TARGET_AMT) %>%
gather(variable, value, -TARGET_FLAG) %>%
ggplot(., aes(TARGET_FLAG, value, color=TARGET_FLAG)) +
geom_boxplot() +
scale_color_brewer(palette="Set1") +
theme_light() +
theme(legend.position = "none") +
facet_wrap(~variable, scales ="free", ncol = 4) +
labs(x = element_blank(), y = element_blank())
There are a few variables that seem to have a difference between the two groups. For example CLM_FREQ
might be useful in differentiating. There isn’t much difference between a lot of groups which is somewhat understandable.
Next we will look at the catagorical variables in the oversampled classifcation data set. These charts look to see if a variable can be uses to distinguish those that have had an accident (blue) from those who haven’t (red).
plot_vars <- names(keep(over_sample_train, is.factor))
temp <- over_sample_train[plot_vars] %>%
gather(variable, value, -TARGET_FLAG) %>%
group_by(TARGET_FLAG, variable, value) %>%
tally()
temp %>%
group_by(variable, value) %>%
summarise(total = sum(n)) %>%
merge(temp,.) %>%
mutate(percent = n / total) %>%
ggplot(., aes(value, percent, fill=TARGET_FLAG)) +
geom_col() +
scale_fill_brewer(palette="Set1") +
theme_light() +
theme(legend.position = "none") +
facet_wrap(~variable, scales ="free", ncol = 4) +
labs(x = element_blank(), y = element_blank())
The URBAN_DRIVER
is much more likely to be in an accident than the rural driver. The YOUNG
and those with a PRIOR_ACCIDENT
are also more likely to be in an accident. There doesn’t seem to be much evidence for a difference between the sexes.
Now let’s focus on the relationship between these variables and the size of the claim. As a reminder this data only includes those who have been in an accident. For starters we will look at the distribution of the claims for those in an accident.
ggplot(amt_train, aes(x=TARGET_AMT)) +
geom_density() +
theme_light() +
geom_vline(aes(xintercept = mean(TARGET_AMT)), lty=2, col="red") +
geom_label(aes(x=25000, y=0.00015, label=paste("mean =", round(mean(TARGET_AMT),0)))) +
geom_vline(aes(xintercept = median(TARGET_AMT)), lty=2, col="darkgreen") +
geom_label(aes(x=25000, y=0.00010, label=paste("median = ", round(median(TARGET_AMT), 0)))) +
labs(title="TARGET_AMT Density Plot", y="Density", x="TARGET_AMT")
As was previously noted this distribution has a long tail. The mean payout is $5689 and the median is $4102. The median and mean are higher, of course for those observations we classified as outliers. The outlier cutoff point is $10594.
amt_train %>%
mutate(TARGET_AMT_OUTLIER = ifelse(TARGET_AMT_OUTLIER == 1, "Yes", "No")) %>%
group_by(TARGET_AMT_OUTLIER) %>%
summarise(Mean = mean(TARGET_AMT),
Median = median(TARGET_AMT)) %>%
kable() %>%
kable_styling()
TARGET_AMT_OUTLIER | Mean | Median |
---|---|---|
No | 4032.048 | 3893.00 |
Yes | 27402.149 | 22951.25 |
Now to look at the data for the linear regression model. We are looking for good predictors of the claim amount. We will look at the numeric variables scatter and correlation plots:
amt_train %>%
keep(is.numeric) %>%
gather(variable, value, -TARGET_AMT) %>%
ggplot(., aes(value, TARGET_AMT)) +
geom_point() +
scale_color_brewer(palette="Set1") +
theme_light() +
theme(legend.position = "none") +
facet_wrap(~variable, scales ="free", ncol = 4) +
labs(x = element_blank(), y = element_blank())
amt_train %>%
keep(is.numeric)%>%
select(-INDEX, -TARGET_AMT) %>%
cor(., amt_train$TARGET_AMT) %>%
kable() %>%
kable_styling()
KIDSDRIV | 0.0061500 |
AGE | 0.0150957 |
HOMEKIDS | 0.0171601 |
YOJ | 0.0300531 |
INCOME | 0.0404098 |
HOME_VAL | 0.0442609 |
TRAVTIME | 0.0120292 |
BLUEBOOK | 0.1257628 |
TIF | -0.0180912 |
OLDCLAIM | 0.0124373 |
CLM_FREQ | -0.0026814 |
MVR_PTS | 0.0425935 |
CAR_AGE | -0.0040321 |
LOG_INCOME | 0.0384403 |
LOG_HOME_VAL | 0.0289741 |
AVG_CLAIM | 0.0310000 |
TARGET_AMT_OUTLIER | 0.7901975 |
These two points highlight that there isn’t a really strong correlation between most of the predictors and the claim amount. The only one that is strong is the outlier flag that we created. Let’s finish our exploration by looking at the categoricial variables. We are removing the TARGET_AMT_OUTLIERS
from the plot to increase their readability.
plot_vars <- c("TARGET_AMT", "TARGET_AMT_OUTLIER", names(keep(amt_train, is.factor)))
amt_train[plot_vars] %>%
filter(TARGET_AMT_OUTLIER == 0) %>%
gather(variable, value, -TARGET_AMT) %>%
ggplot(., aes(value, TARGET_AMT)) +
geom_boxplot() +
theme_light() +
theme(legend.position = "none") +
facet_wrap(~variable, scales ="free", ncol = 4) +
labs(x = element_blank(), y = element_blank())
These boxplots show that there isn’t really much difference in the claim amounts between these groups.
Summary of Findings
Through this analysis we have found that there is a lot of noise associated with this signal. When it comes the the classification task, most variables offer a similar picture. This is probably due to the random nature of vehicle accidents. There are a few variables that may be helpful in building our models.
The claim prediciton task will definately be more of a challenge. Little is correlated with the claim amounts. The model fit will not be as strong as the classification models.
Model Creation & Evaluation
Now that we have created datasets and explored them, it is time to build some predictive models and evaluate them.
For the classification models we will look at the prediction accuracy on the test set. Especially with the accuracy at discovering those that are in an accident. Accurate prediction of this minority class indicates a good model.
evaluate_model <- function(model, test_df, target = "TARGET_FLAG", threshold = 0.5){
test_df$yhat <- ifelse(predict.glm(model, test_df, "response") >= threshold, 1, 0)
cm <- confusionMatrix(factor(test_df$yhat), factor(test_df[[target]]), "1")
deviance <- model$deviance
r2 <- 1 - model$deviance / model$null.deviance
cat("F1 =", cm$byClass[7],"\nR2 =", r2, "\n\n")
print(cm)
eval <- data.frame(actual = test_df$TARGET_FLAG,
predicted = test_df$yhat,
prob = predict(model, test_df))
pred <- prediction(eval$prob, eval$actual)
auc <- performance(pred, measure = "auc")@y.values[[1]]
perf <- performance(pred, "tpr", "fpr")
plot(perf,main="ROC Curve", sub = paste0("AUC: ", round(auc, 3)))
return(cm)
}
Classification Model
Baseline Model
We will create a simple model to serve as the baseline. This model posits that drivers history is a good representation of their future. So drivers that have been in an accident (OLDCLAIM
> 0 or AVG_CLAIM
> 0) are more likely to be in an accident. Conversely those who haven’t aren’t likely to be in one in the future.
baseline_model <- glm(TARGET_FLAG ~ PRIOR_ACCIDENT, family = binomial(link = "logit"), over_sample_train)
summary(baseline_model)
Call:
glm(formula = TARGET_FLAG ~ PRIOR_ACCIDENT, family = binomial(link = "logit"),
data = over_sample_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.45262 -0.97194 -0.02338 0.92517 1.39782
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.50462 0.03031 -16.65 <0.0000000000000002 ***
PRIOR_ACCIDENT1 1.13171 0.04567 24.78 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 11662 on 8411 degrees of freedom
Residual deviance: 11022 on 8410 degrees of freedom
AIC: 11026
Number of Fisher Scoring iterations: 4
F1 = 0.4659588
R2 = 0.05485262
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1219 272
1 583 373
Accuracy : 0.6506
95% CI : (0.6313, 0.6695)
No Information Rate : 0.7364
P-Value [Acc > NIR] : 1
Kappa : 0.2206
Mcnemar's Test P-Value : <0.0000000000000002
Sensitivity : 0.5783
Specificity : 0.6765
Pos Pred Value : 0.3902
Neg Pred Value : 0.8176
Prevalence : 0.2636
Detection Rate : 0.1524
Detection Prevalence : 0.3907
Balanced Accuracy : 0.6274
'Positive' Class : 1
This simple model has statistically significant predictors. Applying this model to the test data set indicates this simple model has a 65.1% accuracy rate. It correcly identified 57.8% of the people with accidents and 67.6% of those without. It is better than flipping a coin but not by much. All future models must outpreform this baseline.
Risk Taker Model
For this model we are going assume that risk takers are more likely to be in an accident. To identify those who tend to be risk takers, we are going with the urban legend that those who drive red sports cars are risk takers. Also young men (16-24) also tend to be risk takers.
risk_taker_model <- glm(TARGET_FLAG ~ RED_SPORTS_CAR + YOUNG_MALE, family = binomial(link = "logit"), over_sample_train)
summary(risk_taker_model)
Call:
glm(formula = TARGET_FLAG ~ RED_SPORTS_CAR + YOUNG_MALE, family = binomial(link = "logit"),
data = over_sample_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3791 -1.1759 -0.0938 1.1789 1.1789
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.003599 0.021907 -0.164 0.869
RED_SPORTS_CAR1 0.308981 0.352893 0.876 0.381
YOUNG_MALE1 0.466223 0.310351 1.502 0.133
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 11662 on 8411 degrees of freedom
Residual deviance: 11658 on 8409 degrees of freedom
AIC: 11664
Number of Fisher Scoring iterations: 3
F1 = 0.03927492
R2 = 0.0002641343
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1798 632
1 4 13
Accuracy : 0.7401
95% CI : (0.7222, 0.7574)
No Information Rate : 0.7364
P-Value [Acc > NIR] : 0.3494
Kappa : 0.0261
Mcnemar's Test P-Value : <0.0000000000000002
Sensitivity : 0.020155
Specificity : 0.997780
Pos Pred Value : 0.764706
Neg Pred Value : 0.739918
Prevalence : 0.263588
Detection Rate : 0.005313
Detection Prevalence : 0.006947
Balanced Accuracy : 0.508968
'Positive' Class : 1
This model only has one statistically significant predictor, however it has a 74% accuracy rate. This is because it identified 99.8% of those who didn’t have an accident (and they are more numerous). When looking at the model’s performance on identifying the minority group of interest we see it correcly identified 2% of the people with accidents. So while the accuracy rate out preformed the baseline model, this one will not be the model we will use.
Traditional Model
Traditional wisdom holds that there are a few tried and true predictors of someone’s risk of an automobile accident. Namely: age (under 25 vs. 25+), marital status, driving record, distance to work, and the driver’s sex (which may or may not be proper to use, but seems to have held throughout the years).
traditional_model <- glm(TARGET_FLAG ~ YOUNG + MSTATUS + PRIOR_ACCIDENT + SEX + REVOKED + MVR_PTS + TRAVTIME + CAR_USE, family = binomial(link = "logit"), over_sample_train)
summary(traditional_model)
Call:
glm(formula = TARGET_FLAG ~ YOUNG + MSTATUS + PRIOR_ACCIDENT +
SEX + REVOKED + MVR_PTS + TRAVTIME + CAR_USE, family = binomial(link = "logit"),
data = over_sample_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2918 -0.9992 -0.1634 1.0394 1.9197
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.127120 0.075485 -14.932 < 0.0000000000000002 ***
YOUNG1 0.803997 0.244210 3.292 0.000994 ***
MSTATUSz_No 0.585304 0.047681 12.275 < 0.0000000000000002 ***
PRIOR_ACCIDENT1 0.813243 0.051742 15.717 < 0.0000000000000002 ***
SEXz_F 0.298638 0.050099 5.961 0.00000000251 ***
REVOKEDYes 0.805760 0.068452 11.771 < 0.0000000000000002 ***
MVR_PTS 0.142383 0.011822 12.044 < 0.0000000000000002 ***
TRAVTIME 0.008707 0.001515 5.746 0.00000000912 ***
CAR_USEPrivate -0.586681 0.051014 -11.500 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 11662 on 8411 degrees of freedom
Residual deviance: 10350 on 8403 degrees of freedom
AIC: 10368
Number of Fisher Scoring iterations: 4
F1 = 0.510559
R2 = 0.1124805
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1248 234
1 554 411
Accuracy : 0.678
95% CI : (0.6591, 0.6965)
No Information Rate : 0.7364
P-Value [Acc > NIR] : 1
Kappa : 0.2845
Mcnemar's Test P-Value : <0.0000000000000002
Sensitivity : 0.6372
Specificity : 0.6926
Pos Pred Value : 0.4259
Neg Pred Value : 0.8421
Prevalence : 0.2636
Detection Rate : 0.1680
Detection Prevalence : 0.3944
Balanced Accuracy : 0.6649
'Positive' Class : 1
This model has statistically significant predictors. Applying this model to the test data set indicates this model has a 67.8% accuracy rate. It correcly identified 63.7% of the people with accidents and 69.3% of those without. This model out preforms the baseline model.
Traditional Model with Cross-Validation
Here we try to use cross-validation techniques to improve the traditional model. We go back to the raw training data, because it is typically not-advised to use data that is oversampled prior to cross-validation.
# Get complete cases
cases <- train %>%
select(YOUNG,MSTATUS,PRIOR_ACCIDENT,SEX,REVOKED,MVR_PTS, TRAVTIME, CAR_USE) %>%
complete.cases()
temp <- train[cases,]
# Use 5-fold cross-validation
train_control <- trainControl(method = "cv", number = 5, sampling = "up")
traditional_cv <- train(form = TARGET_FLAG ~ YOUNG + MSTATUS + PRIOR_ACCIDENT + SEX + REVOKED + MVR_PTS + TRAVTIME + CAR_USE, method = "glm", family = "binomial", data = temp, trControl = train_control)
traditional_cv
Generalized Linear Model
5714 samples
8 predictor
2 classes: '0', '1'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 4572, 4571, 4571, 4571, 4571
Addtional sampling using up-sampling
Resampling results:
Accuracy Kappa
0.6757072 0.2826379
# Evaluating the model
eval <- data.frame(actual = test$TARGET_FLAG, predicted = predict(traditional_cv,newdata=test,type="raw"), prob = predict(traditional_cv,newdata=test,type="prob"))
confusionMatrix(eval$predicted, eval$actual, positive = "1")
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1249 236
1 553 409
Accuracy : 0.6776
95% CI : (0.6586, 0.6961)
No Information Rate : 0.7364
P-Value [Acc > NIR] : 1
Kappa : 0.2826
Mcnemar's Test P-Value : <0.0000000000000002
Sensitivity : 0.6341
Specificity : 0.6931
Pos Pred Value : 0.4252
Neg Pred Value : 0.8411
Prevalence : 0.2636
Detection Rate : 0.1671
Detection Prevalence : 0.3931
Balanced Accuracy : 0.6636
'Positive' Class : 1
pred <- prediction(eval$prob.1, eval$actual)
auc <- performance(pred, measure = "auc")@y.values[[1]]
perf <- performance(pred,"tpr","fpr")
plot(perf,main="ROC Curve", sub=paste0("AUC: ",round(auc,3)))
Alternate Traditional Model
This model is an alternate to the traditional model.
model <- glm(TARGET_FLAG ~ PRIOR_ACCIDENT + KID_DRIVERS + MSTATUS + INCOME + SEX + CAR_USE + COLLEGE_EDUCATED + REVOKED + URBAN_DRIVER, family = binomial(link = "logit"), over_sample_train)
summary(model)
Call:
glm(formula = TARGET_FLAG ~ PRIOR_ACCIDENT + KID_DRIVERS + MSTATUS +
INCOME + SEX + CAR_USE + COLLEGE_EDUCATED + REVOKED + URBAN_DRIVER,
family = binomial(link = "logit"), data = over_sample_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.47843 -0.95017 0.04407 0.92959 2.57995
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.5346466203 0.0927535704 -16.545 < 0.0000000000000002
PRIOR_ACCIDENT1 0.7921271578 0.0506657588 15.634 < 0.0000000000000002
KID_DRIVERS1 0.8289530622 0.0742694029 11.161 < 0.0000000000000002
MSTATUSz_No 0.7589906840 0.0508503649 14.926 < 0.0000000000000002
INCOME -0.0000083140 0.0000006582 -12.632 < 0.0000000000000002
SEXz_F 0.2759209075 0.0528561497 5.220 0.000000179
CAR_USEPrivate -0.7343462515 0.0546026351 -13.449 < 0.0000000000000002
COLLEGE_EDUCATED1 -0.5201355594 0.0584399941 -8.900 < 0.0000000000000002
REVOKEDYes 0.6851814650 0.0720376503 9.511 < 0.0000000000000002
URBAN_DRIVER1 1.9516574434 0.0842859474 23.155 < 0.0000000000000002
(Intercept) ***
PRIOR_ACCIDENT1 ***
KID_DRIVERS1 ***
MSTATUSz_No ***
INCOME ***
SEXz_F ***
CAR_USEPrivate ***
COLLEGE_EDUCATED1 ***
REVOKEDYes ***
URBAN_DRIVER1 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 11661.5 on 8411 degrees of freedom
Residual deviance: 9501.3 on 8402 degrees of freedom
AIC: 9521.3
Number of Fisher Scoring iterations: 4
F1 = 0.5638611
R2 = 0.1852393
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1227 166
1 575 479
Accuracy : 0.6972
95% CI : (0.6785, 0.7153)
No Information Rate : 0.7364
P-Value [Acc > NIR] : 1
Kappa : 0.3519
Mcnemar's Test P-Value : <0.0000000000000002
Sensitivity : 0.7426
Specificity : 0.6809
Pos Pred Value : 0.4545
Neg Pred Value : 0.8808
Prevalence : 0.2636
Detection Rate : 0.1957
Detection Prevalence : 0.4307
Balanced Accuracy : 0.7118
'Positive' Class : 1
This model’s predictors are all statistically significant. Applying this model to the test data set indicates this model has a 69.7% accuracy rate. It correcly identified 74.3% of the people with accidents and 68.1% of those without. This model also out preforms the baseline model. Now that we have a few viable classification models, we can turn our attention to the linear claim model.
Linear Claim Model
Baseline Model
Given the unusual distribution one posible solution would be to predict the claim is the median value. This is a gross simplification but will serve as the baseline for the linear models to compete against. The way we are looking at how well these models preform is calculating the prediction error as a percentage of the actual and then averaging it across the data set.
the_median <- median(amt_train$TARGET_AMT)
evaluate_lm <- function(yhat, actual){
data.frame(yhat = yhat, actual = actual) %>%
mutate(error = yhat - actual) %>%
mutate(error_percent = error / actual) %>%
summarise(`Average Error` = mean(error), `Average Percent` = mean(error_percent)) %>%
kable() %>%
kable_styling()
}
evaluate_lm(c(the_median), amt_test$TARGET_AMT)
Average Error | Average Percent |
---|---|
-1630.658 | 0.504024 |
By simply using the median to predict the values of the claims we would underestimate the actual value on average by $1,600 dollars. On a percentage basis the error would be 50% of the actual.
Simple Linear Model
This simple linerar model is that the amount of the claim is based off of the value of the vehicle. More expensive vehicles should be more costly to repair than less expensive vehicles. It is overly simplistic.
Call:
lm(formula = TARGET_AMT ~ BLUEBOOK, data = amt_train)
Residuals:
Min 1Q Median 3Q Max
-7818 -3153 -1548 326 78883
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4038.05174 387.55794 10.419 < 0.0000000000000002 ***
BLUEBOOK 0.11399 0.02316 4.921 0.000000955 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7536 on 1507 degrees of freedom
Multiple R-squared: 0.01582, Adjusted R-squared: 0.01516
F-statistic: 24.22 on 1 and 1507 DF, p-value: 0.0000009546
While this predictor is statistically significant and positive, the model’s overall explanitory power is weak. Let’s see how it preformed on the test set:
Average Error | Average Percent |
---|---|
-130.8995 | 1.001244 |
Well the average error has decreased which is good but the average error share of the actual has doubled. This is a garbage model.
Outlier Model
We previously flagged the outliers in the data set. IF we can build another classification model, how well will it preform?
Call:
lm(formula = TARGET_AMT ~ TARGET_AMT_OUTLIER, data = amt_train)
Residuals:
Min 1Q Median 3Q Max
-16808 -1661 -186 1345 58122
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4032.0 124.3 32.43 <0.0000000000000002 ***
TARGET_AMT_OUTLIER 23370.1 466.9 50.05 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4655 on 1507 degrees of freedom
Multiple R-squared: 0.6244, Adjusted R-squared: 0.6242
F-statistic: 2505 on 1 and 1507 DF, p-value: < 0.00000000000000022
This model is unfair as it is predicting the outcomes based on a predictor derived from the outcome. Unsuprisingly this model does a really good job. It has an adjusted \(R^2\) of 0.6241628.
The reason we explored this model is to see if we has a reason to build another classifier. If we build a second classifier that can predict this variable would it be worth it. Let’s see how the model preforms on the test set:
Average Error | Average Percent |
---|---|
-140.1835 | 0.5530068 |
The predictions on average understate the actual value by $140. The error as a percent of the actual is close to what we get with just using the median. The improvement really comes from using a higher value to estimate the claim for the outliers.
The challenge will be to create a classifier. There are 107 out of 1509 that are outliers. Let’s create another balanced data set and see if we can build another classifier.
set.seed(42)
minority <- nrow(amt_train[amt_train$TARGET_AMT_OUTLIER == 1,])
majority <- nrow(amt_train[amt_train$TARGET_AMT_OUTLIER == 0,])
diff <- majority - minority
minority_index <- amt_train[amt_train$TARGET_AMT_OUTLIER == 1,]$INDEX
over_sample_train_2 <- data.frame(INDEX = sample(minority_index, diff, TRUE)) %>%
merge(amt_train, .) %>%
bind_rows(amt_train)
Let’s explore the correlations to see what is correlated:
M <- over_sample_train_2 %>%
select(-INDEX, -TARGET_AMT) %>%
keep(is.numeric) %>%
cor(.)
corrplot(M, type = "upper")
It looks like the BLUEBOOK
and PARENT1
are the only variables that are correlated with this outcome. So let’s build the classifier.
outlier_model <- glm(TARGET_AMT_OUTLIER ~ BLUEBOOK + PARENT1, family = binomial(link = "logit"), over_sample_train_2)
summary(outlier_model)
Call:
glm(formula = TARGET_AMT_OUTLIER ~ BLUEBOOK + PARENT1, family = binomial(link = "logit"),
data = over_sample_train_2)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0215 -1.1039 -0.1248 1.1664 1.4622
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.889430410 0.084530469 -10.522 < 0.0000000000000002 ***
BLUEBOOK 0.000048194 0.000004618 10.435 < 0.0000000000000002 ***
PARENT1Yes 0.441269006 0.087932394 5.018 0.000000521 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3887.2 on 2803 degrees of freedom
Residual deviance: 3732.8 on 2801 degrees of freedom
AIC: 3738.8
Number of Fisher Scoring iterations: 4
F1 = 0.04214963
R2 = 0.03970578
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1518 24
1 885 20
Accuracy : 0.6285
95% CI : (0.609, 0.6477)
No Information Rate : 0.982
P-Value [Acc > NIR] : 1
Kappa : 0.0081
Mcnemar's Test P-Value : <0.0000000000000002
Sensitivity : 0.454545
Specificity : 0.631710
Pos Pred Value : 0.022099
Neg Pred Value : 0.984436
Prevalence : 0.017981
Detection Rate : 0.008173
Detection Prevalence : 0.369841
Balanced Accuracy : 0.543128
'Positive' Class : 1
Now to test this out. Based on our linear regression test set we will predict if the observation is likely to be an outlier or not. We will then estimate the TARGET_AMT
using the linear model. We will then evaluate the results of the “model”.
temp <- amt_test
temp$TARGET_AMT_OUTLIER <- ifelse(predict.glm(outlier_model, amt_test, "response") >= 0.5, 1, 0)
temp$TARGET_AMT_yhat <- predict(outlier_lm, amt_test)
evaluate_lm(temp$TARGET_AMT_yhat, temp$TARGET_AMT)
Average Error | Average Percent |
---|---|
-140.1835 | 0.5530068 |
So through this somewhat convoluted approach we have done slightly better than using the median.
Predictions
The final predictions will use the alternative traditional model to predict the TARGET_FLAG
variable, then pass it through the second classifier which predicts the TARGET_AMT_OUTLIER
flag. We then use the linear model to make the final amount prediction.
predictions <- function(df, classifier_1, linear_model, classifier_2){
df$TARGET_FLAG <- ifelse(predict.glm(classifier_1, df, "response") >= 0.5, 1, 0)
# We assume that everyone with a TARGET_FLAG = 0 has a TARGET_AMT as zero.
# We then refine it with the outlier model.
df$TARGET_AMT_OUTLIER <- ifelse(predict.glm(classifier_2, df, "response") >= 0.5, 1, 0)
df$yhat <- predict(linear_model, df)
df <- df %>%
mutate(TARGET_AMT = ifelse(TARGET_AMT_OUTLIER == 1, yhat, 0)) %>%
mutate(TARGET_AMT = ifelse(TARGET_FLAG == 0, 0, TARGET_AMT)) %>%
select(-yhat, -TARGET_AMT_OUTLIER)
return(df)
}
evaluation <- predictions(evaluation, model, outlier_lm, outlier_model)