Here is a description of the assignment and data as provided by Marcus Ellis.
In this homework assignment, you will explore, analyze and model a data set containing approximately 8000 records, each representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A zero means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero.
Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
We will use tidyverse libraries including ggplot2, tidyr, dplyr, and stringr to process this data.
We will also use gridExtra to be able to place ggplot2 plots side-by-side.
We will also use caret and pROC when evaluating models.
Start by reading in and formatting the data.
This initial formatting will include removing the INDEX column. It will also include removing dollar signs and commas from dollar amount columns in order to convert them to numeric.
We read in the training data from here. Note, the file was already manually edited to remove non-standard characters.
https://raw.githubusercontent.com/heathergeiger/Data621_hw4/master/insurance_training_data.csv
As we run each formatting step on the training data, we will also run it on the evaluation data. This way when it comes time to apply the models to the evaluation data, it will be ready to go.
Evaluation data is available here:
https://raw.githubusercontent.com/heathergeiger/Data621_hw4/master/insurance-evaluation-data.csv
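As a rough sketch (not the exact original code), the read-in and formatting step might look like the following. The helper name strip_dollars is ours, and we assume INCOME, HOME_VAL, BLUEBOOK, and OLDCLAIM are the dollar-formatted columns.
library(dplyr)
library(stringr)
# Turn strings like "$67,349" into numeric by stripping dollar signs and commas.
strip_dollars <- function(x) as.numeric(str_replace_all(x, "[$,]", ""))
train_url <- "https://raw.githubusercontent.com/heathergeiger/Data621_hw4/master/insurance_training_data.csv"
eval_url  <- "https://raw.githubusercontent.com/heathergeiger/Data621_hw4/master/insurance-evaluation-data.csv"
insurance  <- read.csv(train_url, stringsAsFactors = FALSE)
evaluation <- read.csv(eval_url, stringsAsFactors = FALSE)
dollar_columns <- c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM")
insurance <- insurance %>%
  select(-INDEX) %>%
  mutate(across(all_of(dollar_columns), strip_dollars))
evaluation <- evaluation %>%
  select(-INDEX) %>%
  mutate(across(all_of(dollar_columns), strip_dollars))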
We start by looking at the number of non-NA values per column, which lets us check for missing values.
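A sketch of one way to produce this count (the exact original code is not shown, so the formatting of the output may differ slightly):
library(tidyr)
print("Total number of records in data:")
print(nrow(insurance))
# Count non-NA values per column, then sort ascending so columns with missing data appear first.
non_na_counts <- insurance %>%
  summarise(across(everything(), ~ sum(!is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "n") %>%
  arrange(n) %>%
  as.data.frame()
print("Number of non-NA values per variable in data:")
print(non_na_counts)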
## [1] "Total number of records in data:"
## [1] 8161
## [1] "Number of non-NA values per variable in data:"
## Variable n
## 1 CAR_AGE 7651
## 2 HOME_VAL 7697
## 3 YOJ 7707
## 4 INCOME 7716
## 5 AGE 8155
## 6 BLUEBOOK 8161
## 7 CAR_TYPE 8161
## 8 CAR_USE 8161
## 9 CLM_FREQ 8161
## 10 EDUCATION 8161
## 11 HOMEKIDS 8161
## 12 JOB 8161
## 13 KIDSDRIV 8161
## 14 MSTATUS 8161
## 15 MVR_PTS 8161
## 16 OLDCLAIM 8161
## 17 PARENT1 8161
## 18 RED_CAR 8161
## 19 REVOKED 8161
## 20 SEX 8161
## 21 TARGET_AMT 8161
## 22 TARGET_FLAG 8161
## 23 TIF 8161
## 24 TRAVTIME 8161
## 25 URBANICITY 8161
We find around 445-510 NAs each for CAR_AGE, HOME_VAL, YOJ, and INCOME, and just 6 NAs for AGE.
Next, find the number of unique values per column in the training data. This will give us a sense of the variable type of each column.
## Variable Num.uniques
## 5 CAR_USE 2
## 13 MSTATUS 2
## 16 PARENT1 2
## 17 RED_CAR 2
## 18 REVOKED 2
## 19 SEX 2
## 21 TARGET_FLAG 2
## 24 URBANICITY 2
## 7 EDUCATION 5
## 12 KIDSDRIV 5
## 4 CAR_TYPE 6
## 6 CLM_FREQ 6
## 9 HOMEKIDS 6
## 11 JOB 9
## 14 MVR_PTS 13
## 25 YOJ 22
## 22 TIF 23
## 3 CAR_AGE 31
## 1 AGE 61
## 23 TRAVTIME 97
## 20 TARGET_AMT 1949
## 2 BLUEBOOK 2789
## 15 OLDCLAIM 2857
## 8 HOME_VAL 5107
## 10 INCOME 6613
## [1] "Levels of EDUCATION:"
## [1] "PhD" "z_High School" "<High School" "Bachelors"
## [5] "Masters"
## [1] "Levels of HOMEKIDS:"
## [1] 0 1 2 3 4 5
We find a number of variables are either binary (2 unique values) or have only a few unique values. For example, the 5 unique values in EDUCATION correspond to five possible levels of education.
We also find a few multi-level variables that may be best converted to binary. For example, for HOMEKIDS (# children at home), whether or not there are any children at home is likely the most relevant information, while the exact number of children may be less relevant.
Let’s start with a simple barplot to show the percent of records with each level of the binary variables.
Here, we will reformat each binary variable to be expressed as either 0 or 1. For example instead of “SEX” with male and female, we will have variable “Male” that is true (1) when the individual is a male.
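For a few of these variables, the recoding might look roughly like the sketch below; the exact raw level strings (e.g. "M", "Yes", "Commercial") are assumptions rather than copied from the original code, and the same step is applied to the evaluation data.
insurance <- insurance %>%
  mutate(Male               = as.numeric(SEX == "M"),
         Married            = as.numeric(MSTATUS == "Yes"),
         Single_parent      = as.numeric(PARENT1 == "Yes"),
         Revoked            = as.numeric(REVOKED == "Yes"),
         Commercial_vehicle = as.numeric(CAR_USE == "Commercial"),
         Urban_not_rural    = as.numeric(grepl("Urban", URBANICITY)))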
We find that a bit over 25% of records are for individuals who were in a car crash (our target for the binary logistic regression).
Some predictor binary variables are relatively evenly split (e.g. a 60/40 split for whether people are married, or a nearly 50/50 split by sex). Others are less even (e.g. it looks like around 80% of records are for individuals living or working in an urban area).
Next, let’s look at the variables that are not binary, but have relatively few unique values.
These include factor variables (like education) as well as select discrete numeric variables like HOMEKIDS.
Based on the barplots, it really seems like CLM_FREQ, HOMEKIDS, and KIDSDRIV would be best converted to a binary variable that is true when >0.
Let’s do this transformation now.
Rename these variables “Past_claim”, “Kids”, and “Driving_kids”.
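A minimal sketch of this step (the helper name make_gt_zero_flags is ours):
make_gt_zero_flags <- function(df) {
  df %>%
    mutate(Past_claim   = as.numeric(CLM_FREQ > 0),
           Kids         = as.numeric(HOMEKIDS > 0),
           Driving_kids = as.numeric(KIDSDRIV > 0))
}
insurance  <- make_gt_zero_flags(insurance)
evaluation <- make_gt_zero_flags(evaluation)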
Looking at this barplot, I think it would be reasonable to consider all records with more than 10 MVR_PTS as having 10, since there are so few.
We find one record with CAR_AGE = -3, which I assume is a simple error.
We find many cars that are one year old; once we look at cars that are 3+ years old, the distribution becomes roughly normal (slightly right-skewed, with a few very old cars).
Many records are for individuals who have only been insured with the company for 1 year, though there are also a fair number of individuals who have been insured with us for quite a while.
For years on the job, there are relatively few individuals that have been on the job for just a few years (1-4). Most individuals are either new (on the job for 0 years) or have been on the job for a while (5+ years).
Based on these barplots, it seems most sensible to convert years on the job to a binary variable.
The vast majority of individuals have been on the job for 5+ years. And I doubt that individuals who have been on the job for 10 or 15 years are inherently different than those on the job for 5 or 6.
So, let’s convert this variable to a binary called “New_on_the_job” that is true when YOJ = 0 and false otherwise.
Then, we’ll simply use the most frequent value (“New_on_the_job” = FALSE) to fill in missing values.
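A sketch of this transformation, with NAs set to the most frequent value (0, i.e. not new on the job); the helper name is ours:
add_new_on_the_job <- function(df) {
  df %>%
    mutate(New_on_the_job = ifelse(is.na(YOJ), 0, as.numeric(YOJ == 0)))
}
insurance  <- add_new_on_the_job(insurance)
evaluation <- add_new_on_the_job(evaluation)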
We will use histograms to display the frequency of numeric variables with 50+ unique values.
For home value, we will assume that NA’s are the same as 0’s, i.e. the individual does not own a home. The histogram will therefore only show values > 0; the 0’s will simply be counted.
We’ll also convert the NA’s to 0’s now.
For income, we will count the number of 0’s, but then show only values > 0 for the histogram.
## [1] "Percent of records with HOME_VAL > 0:"
## [1] 66.21
## [1] "Percent of records with INCOME not stated:"
## [1] 5.45
## [1] "Percent of records with INCOME == 0:"
## [1] 7.54
## [1] "Percent of records with INCOME > 0:"
## [1] 87.01
We find that age is roughly normally distributed, and the min/max also make sense.
Travel time is right-skewed. Most people have commutes under an hour, but then there are a few individuals with very long commutes.
As we might expect, all the variables representing dollar amounts (including TARGET_AMT) are right-skewed. We may want to log-transform these variables to bring them closer to normal.
In this section, we will explore the correlation of predictor variables with TARGET_FLAG (binary variable of whether or not the individual has had a crash).
Start by looking at the frequency of crashes for each level of binary variables.
This should include adding a binary variable “Homeowner” that says whether or not an individual is a homeowner.
We find a less than 1% difference in the proportion of individuals with vs. without red cars who crash their vehicles.
We also find a very minimal difference between genders, and it is actually females who have a slightly higher crash rate.
Other than that, we find a lot of the differences we might have predicted. Individuals driving commercial vehicles, unmarried individuals, those whose licenses have been revoked, urban drivers, those who do not own homes, those who have had a claim in the past five years, and those with driving teenagers in their family all have crashes at a higher rate.
We also find that single parents and those with one or more children at home crash at a higher rate.
First of all, how are single parents defined? All unmarried people with kids in the home, or some other additional criteria? Let’s check.
## [1] "Number of unmarried individuals with children:"
## [1] 1077
## [1] "Number of single parents:"
## [1] 1077
Looks like single parents are simply defined as any unmarried person with children.
The group of those with one or more children at home includes single parents. How do the crash rates of partnered parents compare to those without children?
## [1] "Crash rate of single parents:"
## [1] 44.2
## [1] "Crash rate of partnered parents:"
## [1] 28.08
## [1] "Crash rate of individuals without children:"
## [1] 22.18
Looks like on its own, having children is not associated with a huge increased risk of crashes. The risk associated with having children appears inflated because this group includes single parents.
We find that crash rate is substantially lower among those with at least a Bachelors degree.
There are even further reductions in crash risk among those with a Masters or Phd, but I think we can create a simpler yet still very good model by reducing to a binary of either college-educated or not.
We find that minivan drivers are substantially safer than drivers of other types of cars.
We find that students and blue collar workers appear to crash at a higher rate than individuals with other types of jobs.
Let’s add binary variables for whether or not an individual has at least a college education, whether or not they drive a minivan, and whether or not they are a student or blue collar worker.
Let’s start by looking at crash rates for each car age.
## [1] "Crash rate when CAR_AGE = 1:"
## [1] 32.37
## [1] "Crash rate when CAR_AGE >= 20:"
## [1] 17.96
Does not seem like there is a clear pattern here. It looks on average like drivers of cars that are more than 10 years old may be a bit safer, but it is probably not a strong enough correlation to include this variable in the model.
What about time in force?
We will show only TIF values under 10, since each of those exact values occurs fairly often.
There seems to be some correlation between longer time in force and being safer, but the association is not very strong. I don’t think we will want to include this variable in the model.
Finally, let’s look at motor vehicle report points, where we expect a strong correlation.
It looks like there is a nearly perfect linear relationship between motor vehicle record points and crash rate. Except, the crash rate with 7+ points is higher than a linear model would predict.
What if we transform this variable a bit by adding a constant when motor vehicle record points >= 7?
Let’s add a variable “MVR_PTS_MOD” where a constant of 3 is added when MVR_PTS >= 7.
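Combining this with the earlier cap at 10 points, the transformation might look like the following sketch (the helper name is ours):
add_mvr_pts_mod <- function(df) {
  df %>%
    mutate(MVR_PTS_MOD = pmin(MVR_PTS, 10),                                       # cap at 10
           MVR_PTS_MOD = ifelse(MVR_PTS_MOD >= 7, MVR_PTS_MOD + 3, MVR_PTS_MOD))  # add 3 for 7+ points
}
insurance  <- add_mvr_pts_mod(insurance)
evaluation <- add_mvr_pts_mod(evaluation)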
Let’s look at the crash rate among younger (under age 25) and older (65+) drivers vs. the rest of the population (age 25-64).
## [1] "Crash rate for age group Young (<25): 56.94"
## [1] "Crash rate for age group In-between (25-64): 26.12"
## [1] "Crash rate for age group Senior (65+): 21.25"
Looks like being under age 25 is definitely associated with higher crash risk, while being older (age 65+) is not.
Let’s add a binary variable Young_age that is true when age < 25.
Since the majority of individuals in the data are age 25+, I think it reasonable to assume this will be false for the few cases where age = NA.
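A sketch of this step (the helper name is ours), with NA ages treated as not young:
add_young_age <- function(df) {
  df %>%
    mutate(Young_age = ifelse(is.na(AGE), 0, as.numeric(AGE < 25)))
}
insurance  <- add_young_age(insurance)
evaluation <- add_young_age(evaluation)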
We can look at the correlation of these numeric variables with TARGET_FLAG via a simple set of boxplots.
As we might expect, it appears that individuals with longer commute times tend to crash more often, while individuals with higher incomes and more expensive cars tend to crash less often.
We see some obvious correlations, like between “Married” and “Single_parent” and between “Kids” and “Driving_kids”.
Let’s replot minus these.
Not shown, but I checked previously that the correlation between driving a commercial vehicle and being a blue-collar worker is positive. Most of the increase in crash risk we saw among blue collar workers is likely due to this, so we probably don’t need to include both variables in the model.
We find that married individuals are much more likely to own a home (correlation is positive), which makes sense.
We also see some other correlations that are less strong but also make sense, like between urban vs. rural and having at least a college education.
Let’s check the correlations within each pair of INCOME, BLUEBOOK, and TRAVTIME.
## [1] "Correlation INCOME and BLUEBOOK:"
## [1] 0.43
## [1] "Correlation INCOME and TRAVTIME:"
## [1] -0.05
## [1] "Correlation TRAVTIME and BLUEBOOK:"
## [1] -0.02
There is next to no correlation of income or car value with commute time, but income and car value are somewhat correlated with each other.
What is the correlation of our binary variables with income?
## Variable Correlation
## 9 New_on_the_job -0.37
## 13 Student -0.36
## 6 Kids -0.16
## 15 Young_age -0.09
## 12 Single_parent -0.08
## 10 Past_claim -0.07
## 4 Driving_kids -0.05
## 1 Blue_collar -0.03
## 7 Married -0.03
## 11 Revoked -0.02
## 8 Minivan 0.04
## 3 Commercial_vehicle 0.08
## 5 Homeowner 0.11
## 14 Urban_not_rural 0.21
## 2 College_education 0.51
We see a number of correlations that make sense, such as urban, college-educated, and home-owning individuals having higher incomes, while students and young people tend to have lower incomes.
What about travel time?
## Variable Correlation
## 14 Urban_not_rural -0.17
## 2 College_education -0.04
## 12 Single_parent -0.02
## 5 Homeowner -0.01
## 6 Kids -0.01
## 8 Minivan -0.01
## 11 Revoked -0.01
## 4 Driving_kids 0.00
## 10 Past_claim 0.00
## 7 Married 0.01
## 3 Commercial_vehicle 0.02
## 15 Young_age 0.02
## 9 New_on_the_job 0.03
## 13 Student 0.03
## 1 Blue_collar 0.04
Rural individuals tend to have longer commutes, which makes sense.
Other than that, none of these correlations are very strong.
In this section, we will explore the correlation of different variables with TARGET_AMT within those individuals who have had a crash.
The only correlation we find over 0.1 for TARGET_AMT (either transformed or not) is with BLUEBOOK (either transformed or not).
It appears that vans and panel trucks may tend to have higher claim amounts when they get into a crash.
I suspect this may be linked to a higher rate of use of these car types among commercial vehicles. Let’s check claim amounts for commercial vs. private vehicles.
Looks like this may explain at least some of the difference, though it is unclear if it explains it all.
Let’s get correlation coefficients of each binary variable with log10(TARGET_AMT).
## Variable Correlation
## 7 Married -0.05
## 9 New_on_the_job -0.04
## 11 Revoked -0.02
## 13 Student -0.02
## 1 Blue_collar -0.01
## 5 Homeowner -0.01
## 8 Minivan -0.01
## 15 Young_age -0.01
## 4 Driving_kids 0.00
## 6 Kids 0.00
## 10 Past_claim 0.00
## 2 College_education 0.02
## 14 Urban_not_rural 0.02
## 12 Single_parent 0.03
## 3 Commercial_vehicle 0.04
We get a small positive correlation of commercial vehicle with the claim amount, as we might expect.
Let’s summarize the transformations we have completed so far.
First, we created the following binary variables.
We also modified motor vehicle record points to max out at 10, then added 3 when an individual had 7+ points to better create a linear fit of crash rate vs. points.
We shouldn’t need to do much in the way of additional transformations.
One additional variable to add for the linear model step might be a dummy variable that is 1 when car type = van, 2 when car type = panel truck, and 0 otherwise. Call this “Car_type_for_claim_amount”.
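A sketch of this dummy variable (the exact CAR_TYPE level strings "Van" and "Panel Truck" are assumptions; the helper name is ours):
add_car_type_for_claim_amount <- function(df) {
  df %>%
    mutate(Car_type_for_claim_amount = case_when(CAR_TYPE == "Van"         ~ 1,
                                                 CAR_TYPE == "Panel Truck" ~ 2,
                                                 TRUE                      ~ 0))
}
insurance  <- add_car_type_for_claim_amount(insurance)
evaluation <- add_car_type_for_claim_amount(evaluation)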
Another thing we need to do is address the collinearity between variables kids, married, single parent, and homeowner.
For kids, it is probably simplest to just drop this variable.
For the remaining three, we find that single parent is associated with the biggest crash risk, then being a renter, then being unmarried.
Individuals who are both single parents and renters have an especially high risk.
We can combine these variables into a dummy variable with the following values.
Let’s call this variable “marriage_home_and_kids_score”.
Oh, and let’s replace NA’s in income with the median across training and evaluation data.
Let’s just check and make sure this number seems reasonable.
## [1] "Median income in training data:"
## [1] 54028
## [1] "Median income in evaluation data:"
## [1] 51778
## [1] "Median income across both:"
## [1] 53529
Seems reasonable. This should be an OK strategy.
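A sketch of the imputation step:
# Median income across training and evaluation data, ignoring NAs, then fill in the NAs.
income_median <- median(c(insurance$INCOME, evaluation$INCOME), na.rm = TRUE)
insurance$INCOME[is.na(insurance$INCOME)]   <- income_median
evaluation$INCOME[is.na(evaluation$INCOME)] <- income_median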
We start with most of the binary variables, excluding the few that did not seem correlated with TARGET_FLAG. We also leave out kids/single parent/married/homeowner, since kids is entangled with the other variables, and single parent/married/homeowner are already combined into the “marriage, home, and kids score”.
We also include commute time and motor vehicle record points (modified).
##
## Call:
## glm(formula = TARGET_FLAG ~ Driving_kids + New_on_the_job + TRAVTIME +
## Commercial_vehicle + Past_claim + Revoked + MVR_PTS_MOD +
## Urban_not_rural + College_education + Minivan + Young_age +
## marriage_home_and_kids_score, family = "binomial", data = insurance)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4727 -0.7327 -0.4461 0.6446 3.1712
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.780972 0.151023 -18.414 < 2e-16 ***
## Driving_kids 0.594295 0.082845 7.174 7.31e-13 ***
## New_on_the_job 0.670628 0.104760 6.402 1.54e-10 ***
## TRAVTIME 0.015198 0.001846 8.233 < 2e-16 ***
## Commercial_vehicle 0.565967 0.058688 9.644 < 2e-16 ***
## Past_claim 0.464455 0.063182 7.351 1.97e-13 ***
## Revoked 0.760794 0.078551 9.685 < 2e-16 ***
## MVR_PTS_MOD 0.093085 0.011561 8.052 8.18e-16 ***
## Urban_not_rural 2.217961 0.112955 19.636 < 2e-16 ***
## College_education -0.817193 0.058417 -13.989 < 2e-16 ***
## Minivan -0.692472 0.073067 -9.477 < 2e-16 ***
## Young_age 0.874548 0.277621 3.150 0.00163 **
## marriage_home_and_kids_score -0.157879 0.011583 -13.631 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7538.4 on 8148 degrees of freedom
## AIC: 7564.4
##
## Number of Fisher Scoring iterations: 5
We find that all of the variables we selected are significant.
The directions of the coefficients also all make sense. Having a teenager in the family driving, being new on the job, having a longer commute, driving commercially, having a past claim, having had a revoked license, having points on your record, being urban, and being young all have positive coefficients. Being college-educated, driving a minivan, and having a high “marriage home and kids” score all have negative coefficients, as these variables are associated with lower crash risk.
For the second model, let’s add log10-transformed income. Even though it is somewhat entangled with the other variables, maybe it will still prove useful.
##
## Call:
## glm(formula = TARGET_FLAG ~ Driving_kids + New_on_the_job + TRAVTIME +
## Commercial_vehicle + Past_claim + Revoked + MVR_PTS_MOD +
## Urban_not_rural + College_education + Minivan + Young_age +
## marriage_home_and_kids_score + log10(INCOME + 1), family = "binomial",
## data = insurance)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4839 -0.7341 -0.4411 0.6456 3.1761
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.724450 0.263332 -6.549 5.81e-11 ***
## Driving_kids 0.599389 0.082966 7.225 5.03e-13 ***
## New_on_the_job -0.382268 0.242726 -1.575 0.11528
## TRAVTIME 0.015192 0.001848 8.221 < 2e-16 ***
## Commercial_vehicle 0.593552 0.059101 10.043 < 2e-16 ***
## Past_claim 0.456967 0.063296 7.220 5.22e-13 ***
## Revoked 0.759203 0.078694 9.648 < 2e-16 ***
## MVR_PTS_MOD 0.093139 0.011581 8.042 8.82e-16 ***
## Urban_not_rural 2.245511 0.113401 19.802 < 2e-16 ***
## College_education -0.729870 0.061086 -11.948 < 2e-16 ***
## Minivan -0.689019 0.073180 -9.415 < 2e-16 ***
## Young_age 0.768947 0.278636 2.760 0.00579 **
## marriage_home_and_kids_score -0.155624 0.011618 -13.395 < 2e-16 ***
## log10(INCOME + 1) -0.246976 0.050943 -4.848 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7514.3 on 8147 degrees of freedom
## AIC: 7542.3
##
## Number of Fisher Scoring iterations: 5
We find that income is negatively associated with crash risk, which makes sense.
Most of the other coefficients stay in the same direction, except now New_on_the_job changes direction and has a much less significant p-value.
What if we keep log10(INCOME) in the model and remove New_on_the_job?
##
## Call:
## glm(formula = TARGET_FLAG ~ Driving_kids + TRAVTIME + Commercial_vehicle +
## Past_claim + Revoked + MVR_PTS_MOD + Urban_not_rural + College_education +
## Minivan + Young_age + marriage_home_and_kids_score + log10(INCOME +
## 1), family = "binomial", data = insurance)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5041 -0.7337 -0.4419 0.6411 3.1777
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.054284 0.160723 -12.782 < 2e-16 ***
## Driving_kids 0.598559 0.082963 7.215 5.40e-13 ***
## TRAVTIME 0.015222 0.001848 8.237 < 2e-16 ***
## Commercial_vehicle 0.588271 0.058989 9.973 < 2e-16 ***
## Past_claim 0.458530 0.063276 7.246 4.28e-13 ***
## Revoked 0.759670 0.078679 9.655 < 2e-16 ***
## MVR_PTS_MOD 0.092883 0.011580 8.021 1.05e-15 ***
## Urban_not_rural 2.245944 0.113449 19.797 < 2e-16 ***
## College_education -0.751540 0.059520 -12.627 < 2e-16 ***
## Minivan -0.687824 0.073163 -9.401 < 2e-16 ***
## Young_age 0.794146 0.277344 2.863 0.00419 **
## marriage_home_and_kids_score -0.155767 0.011615 -13.411 < 2e-16 ***
## log10(INCOME + 1) -0.174704 0.022048 -7.924 2.31e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7516.8 on 8148 degrees of freedom
## AIC: 7542.8
##
## Number of Fisher Scoring iterations: 5
The model makes a lot more sense now.
The “marriage home and kids score” will probably result in improved accuracy, but it is a bit complicated. Let’s also try just including the individual variables instead.
Otherwise, keep the same as model 1.
##
## Call:
## glm(formula = TARGET_FLAG ~ Driving_kids + TRAVTIME + Commercial_vehicle +
## Past_claim + Revoked + MVR_PTS_MOD + Urban_not_rural + College_education +
## Minivan + Young_age + New_on_the_job + Married + Homeowner +
## Single_parent, family = "binomial", data = insurance)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4502 -0.7351 -0.4398 0.6559 3.2009
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.415397 0.148471 -23.004 < 2e-16 ***
## Driving_kids 0.673539 0.086214 7.812 5.61e-15 ***
## TRAVTIME 0.015198 0.001849 8.218 < 2e-16 ***
## Commercial_vehicle 0.567382 0.058794 9.650 < 2e-16 ***
## Past_claim 0.456181 0.063293 7.207 5.70e-13 ***
## Revoked 0.761552 0.078629 9.685 < 2e-16 ***
## MVR_PTS_MOD 0.094359 0.011582 8.147 3.74e-16 ***
## Urban_not_rural 2.219943 0.112806 19.679 < 2e-16 ***
## College_education -0.845860 0.059053 -14.324 < 2e-16 ***
## Minivan -0.693897 0.073063 -9.497 < 2e-16 ***
## Young_age 0.923453 0.277633 3.326 0.000881 ***
## New_on_the_job 0.705639 0.106023 6.656 2.82e-11 ***
## Married -0.442006 0.077264 -5.721 1.06e-08 ***
## Homeowner -0.251675 0.070654 -3.562 0.000368 ***
## Single_parent 0.455344 0.093359 4.877 1.08e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7524.3 on 8146 degrees of freedom
## AIC: 7554.3
##
## Number of Fisher Scoring iterations: 5
Directions of the coefficients all make sense.
In this section, we will create linear models of TARGET_AMT for the records where TARGET_FLAG is true.
First, let’s try modeling both log10(BLUEBOOK) and “Car_type_for_claim_amount”.
##
## Call:
## lm(formula = log10(TARGET_AMT) ~ log10(BLUEBOOK) + Car_type_for_claim_amount,
## data = insurance[insurance$TARGET_FLAG == 1, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.07923 -0.17163 0.01576 0.17112 1.40108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.99246 0.12170 24.589 < 2e-16 ***
## log10(BLUEBOOK) 0.14699 0.03028 4.855 1.29e-06 ***
## Car_type_for_claim_amount 0.01232 0.01456 0.847 0.397
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.35 on 2150 degrees of freedom
## Multiple R-squared: 0.01725, Adjusted R-squared: 0.01634
## F-statistic: 18.87 on 2 and 2150 DF, p-value: 7.528e-09
It actually looks like Car_type_for_claim_amount is not significant. What if we treat it as a factor instead?
##
## Call:
## lm(formula = log10(TARGET_AMT) ~ log10(BLUEBOOK) + factor(Car_type_for_claim_amount),
## data = insurance[insurance$TARGET_FLAG == 1, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.07975 -0.17217 0.01617 0.17099 1.41360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.98780 0.12198 24.493 < 2e-16
## log10(BLUEBOOK) 0.14833 0.03037 4.884 1.12e-06
## factor(Car_type_for_claim_amount)1 -0.00124 0.02757 -0.045 0.964
## factor(Car_type_for_claim_amount)2 0.03037 0.03075 0.988 0.323
##
## (Intercept) ***
## log10(BLUEBOOK) ***
## factor(Car_type_for_claim_amount)1
## factor(Car_type_for_claim_amount)2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3501 on 2149 degrees of freedom
## Multiple R-squared: 0.0174, Adjusted R-squared: 0.01603
## F-statistic: 12.69 on 3 and 2149 DF, p-value: 3.214e-08
Still not significant. Let’s actually just make model 1 log10(TARGET_AMT) vs. log10(BLUEBOOK).
##
## Call:
## lm(formula = log10(TARGET_AMT) ~ log10(BLUEBOOK), data = insurance[insurance$TARGET_FLAG ==
## 1, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.07976 -0.16969 0.01844 0.16997 1.40784
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.94366 0.10718 27.465 < 2e-16 ***
## log10(BLUEBOOK) 0.15976 0.02626 6.085 1.38e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.35 on 2151 degrees of freedom
## Multiple R-squared: 0.01692, Adjusted R-squared: 0.01646
## F-statistic: 37.02 on 1 and 2151 DF, p-value: 1.377e-09
R^2 is extremely poor (less than 2%), but this is somewhat expected given that the correlation was not very good.
Let’s see what a model looks like including all of the same variables as the first model for binary logistic regression, plus log10(BLUEBOOK).
##
## Call:
## lm(formula = log10(TARGET_AMT) ~ log10(BLUEBOOK) + Driving_kids +
## New_on_the_job + TRAVTIME + Commercial_vehicle + Past_claim +
## Revoked + MVR_PTS_MOD + Urban_not_rural + College_education +
## Minivan + Young_age + marriage_home_and_kids_score, data = insurance[insurance$TARGET_FLAG ==
## 1, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.06507 -0.17124 0.01447 0.17278 1.38368
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.9734149 0.1193233 24.919 < 2e-16 ***
## log10(BLUEBOOK) 0.1590873 0.0285612 5.570 2.87e-08 ***
## Driving_kids -0.0117642 0.0203230 -0.579 0.5627
## New_on_the_job -0.0190492 0.0246131 -0.774 0.4390
## TRAVTIME -0.0001705 0.0005019 -0.340 0.7341
## Commercial_vehicle 0.0027962 0.0159351 0.175 0.8607
## Past_claim -0.0089963 0.0164609 -0.547 0.5848
## Revoked -0.0147670 0.0187696 -0.787 0.4315
## MVR_PTS_MOD 0.0047147 0.0025369 1.858 0.0632 .
## Urban_not_rural 0.0106175 0.0340155 0.312 0.7550
## College_education -0.0063219 0.0157211 -0.402 0.6876
## Minivan -0.0101147 0.0207952 -0.486 0.6267
## Young_age -0.0078310 0.0562297 -0.139 0.8893
## marriage_home_and_kids_score -0.0050570 0.0027675 -1.827 0.0678 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3502 on 2139 degrees of freedom
## Multiple R-squared: 0.02098, Adjusted R-squared: 0.01503
## F-statistic: 3.526 on 13 and 2139 DF, p-value: 1.744e-05
Remove all variables with p-value > 0.1.
##
## Call:
## lm(formula = log10(TARGET_AMT) ~ log10(BLUEBOOK) + MVR_PTS_MOD +
## marriage_home_and_kids_score, data = insurance[insurance$TARGET_FLAG ==
## 1, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.05687 -0.17393 0.01758 0.17377 1.38698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.946220 0.108028 27.273 < 2e-16 ***
## log10(BLUEBOOK) 0.162285 0.026251 6.182 7.55e-10 ***
## MVR_PTS_MOD 0.004275 0.002341 1.826 0.068 .
## marriage_home_and_kids_score -0.004399 0.002685 -1.639 0.101
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3496 on 2149 degrees of freedom
## Multiple R-squared: 0.01983, Adjusted R-squared: 0.01846
## F-statistic: 14.49 on 3 and 2149 DF, p-value: 2.39e-09
We find our adjusted R^2 is improved a tiny bit from model 1.
What if we include log10(BLUEBOOK) plus Kids, Married, Homeowner, and Single_parent instead of marriage_home_and_kids_score?
##
## Call:
## lm(formula = log10(TARGET_AMT) ~ log10(BLUEBOOK) + Kids + Married +
## Homeowner + Single_parent + MVR_PTS_MOD, data = insurance[insurance$TARGET_FLAG ==
## 1, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.05026 -0.17163 0.01709 0.17423 1.39271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.939893 0.109254 26.909 < 2e-16 ***
## log10(BLUEBOOK) 0.159614 0.026405 6.045 1.76e-09 ***
## Kids 0.005139 0.021626 0.238 0.8122
## Married -0.033523 0.022872 -1.466 0.1429
## Homeowner 0.008364 0.018292 0.457 0.6475
## Single_parent 0.006748 0.030312 0.223 0.8238
## MVR_PTS_MOD 0.004469 0.002347 1.904 0.0571 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3497 on 2146 degrees of freedom
## Multiple R-squared: 0.02086, Adjusted R-squared: 0.01813
## F-statistic: 7.622 on 6 and 2146 DF, p-value: 4.095e-08
Looks like being married is the most correlated. Let’s look just at this plus log10(BLUEBOOK) and MVR_PTS_MOD.
##
## Call:
## lm(formula = log10(TARGET_AMT) ~ log10(BLUEBOOK) + Married +
## MVR_PTS_MOD, data = insurance[insurance$TARGET_FLAG == 1,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.05141 -0.17297 0.01786 0.17287 1.39181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.949001 0.107863 27.340 < 2e-16 ***
## log10(BLUEBOOK) 0.159228 0.026234 6.070 1.51e-09 ***
## Married -0.031628 0.015073 -2.098 0.0360 *
## MVR_PTS_MOD 0.004459 0.002336 1.909 0.0564 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3495 on 2149 degrees of freedom
## Multiple R-squared: 0.02062, Adjusted R-squared: 0.01925
## F-statistic: 15.08 on 3 and 2149 DF, p-value: 1.033e-09
This seems like the best model. It improves adjusted R^2 to almost 2%, and all variables are at or very close to significant.
I am not quite sure why married individuals would have smaller claims, but at least the lower crash risk and lower claim amounts are compatible: both would support giving married individuals a discount.
Just to make things simpler later on, add columns log10.INCOME and log10.BLUEBOOK to insurance and evaluation, and redefine the models to use these. Also add log10.TARGET.
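A sketch of adding these convenience columns (the helper name is ours; log10.TARGET is only meaningful for training records where TARGET_AMT > 0):
add_log10_columns <- function(df) {
  df %>%
    mutate(log10.INCOME   = log10(INCOME + 1),
           log10.BLUEBOOK = log10(BLUEBOOK))
}
insurance  <- add_log10_columns(insurance)
evaluation <- add_log10_columns(evaluation)
insurance$log10.TARGET <- log10(insurance$TARGET_AMT)  # -Inf for non-crash rows, which the linear models do not use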
Let’s run caret::confusionMatrix on each of our three models using a p=0.5 cutoff.
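For one model, this might look like the sketch below (model1_glm is a placeholder name for the fitted glm object; the same call with a different cutoff gives the p=0.35 matrices later on):
library(caret)
cutoff <- 0.5
model1_numeric_predictions <- predict(model1_glm, type = "response")  # predicted probabilities on training data
model1_class_predictions   <- as.numeric(model1_numeric_predictions > cutoff)
confusionMatrix(factor(model1_class_predictions, levels = c(0, 1)),
                factor(insurance$TARGET_FLAG, levels = c(0, 1)),
                positive = "1")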
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5582 1369
## 1 426 784
##
## Accuracy : 0.7801
## 95% CI : (0.7709, 0.789)
## No Information Rate : 0.7362
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3412
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.36414
## Specificity : 0.92909
## Pos Pred Value : 0.64793
## Neg Pred Value : 0.80305
## Prevalence : 0.26382
## Detection Rate : 0.09607
## Detection Prevalence : 0.14827
## Balanced Accuracy : 0.64662
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5584 1359
## 1 424 794
##
## Accuracy : 0.7815
## 95% CI : (0.7724, 0.7904)
## No Information Rate : 0.7362
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3465
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.36879
## Specificity : 0.92943
## Pos Pred Value : 0.65189
## Neg Pred Value : 0.80426
## Prevalence : 0.26382
## Detection Rate : 0.09729
## Detection Prevalence : 0.14925
## Balanced Accuracy : 0.64911
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5575 1356
## 1 433 797
##
## Accuracy : 0.7808
## 95% CI : (0.7717, 0.7897)
## No Information Rate : 0.7362
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3457
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.37018
## Specificity : 0.92793
## Pos Pred Value : 0.64797
## Neg Pred Value : 0.80436
## Prevalence : 0.26382
## Detection Rate : 0.09766
## Detection Prevalence : 0.15072
## Balanced Accuracy : 0.64906
##
## 'Positive' Class : 1
##
Stats look relatively similar. Let’s also plot ROC curves.
##
## Call:
## roc.default(response = insurance$TARGET_FLAG, predictor = model1_numeric_predictions, plot = TRUE, print.thres = c(0.25, 0.3, 0.35, 0.4, 0.45, 0.5), main = "Model1")
##
## Data: model1_numeric_predictions in 6008 controls (insurance$TARGET_FLAG 0) < 2153 cases (insurance$TARGET_FLAG 1).
## Area under the curve: 0.7969
##
## Call:
## roc.default(response = insurance$TARGET_FLAG, predictor = model2_numeric_predictions, plot = TRUE, print.thres = c(0.25, 0.3, 0.35, 0.4, 0.45, 0.5), main = "Model2")
##
## Data: model2_numeric_predictions in 6008 controls (insurance$TARGET_FLAG 0) < 2153 cases (insurance$TARGET_FLAG 1).
## Area under the curve: 0.7986
##
## Call:
## roc.default(response = insurance$TARGET_FLAG, predictor = model3_numeric_predictions, plot = TRUE, print.thres = c(0.25, 0.3, 0.35, 0.4, 0.45, 0.5), main = "Model3")
##
## Data: model3_numeric_predictions in 6008 controls (insurance$TARGET_FLAG 0) < 2153 cases (insurance$TARGET_FLAG 1).
## Area under the curve: 0.7976
From the company’s perspective, we may want to slightly increase sensitivity at the expense of some specificity. We can’t go too crazy, as our customers may go to a competitor if we charge risk premiums unnecessarily all the time. At the same time, false negatives are also very risky financially here, as one claim can be thousands of dollars.
Let’s look at the confusion matrices and stats for each of our models if we set a cutoff of p=0.35 instead of p=0.5.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4913 868
## 1 1095 1285
##
## Accuracy : 0.7595
## 95% CI : (0.75, 0.7687)
## No Information Rate : 0.7362
## P-Value [Acc > NIR] : 7.743e-07
##
## Kappa : 0.401
## Mcnemar's Test P-Value : 3.380e-07
##
## Sensitivity : 0.5968
## Specificity : 0.8177
## Pos Pred Value : 0.5399
## Neg Pred Value : 0.8499
## Prevalence : 0.2638
## Detection Rate : 0.1575
## Detection Prevalence : 0.2916
## Balanced Accuracy : 0.7073
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4933 865
## 1 1075 1288
##
## Accuracy : 0.7623
## 95% CI : (0.7529, 0.7715)
## No Information Rate : 0.7362
## P-Value [Acc > NIR] : 3.425e-08
##
## Kappa : 0.4066
## Mcnemar's Test P-Value : 2.084e-06
##
## Sensitivity : 0.5982
## Specificity : 0.8211
## Pos Pred Value : 0.5451
## Neg Pred Value : 0.8508
## Prevalence : 0.2638
## Detection Rate : 0.1578
## Detection Prevalence : 0.2895
## Balanced Accuracy : 0.7097
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4914 853
## 1 1094 1300
##
## Accuracy : 0.7614
## 95% CI : (0.752, 0.7706)
## No Information Rate : 0.7362
## P-Value [Acc > NIR] : 9.177e-08
##
## Kappa : 0.4071
## Mcnemar's Test P-Value : 5.355e-08
##
## Sensitivity : 0.6038
## Specificity : 0.8179
## Pos Pred Value : 0.5430
## Neg Pred Value : 0.8521
## Prevalence : 0.2638
## Detection Rate : 0.1593
## Detection Prevalence : 0.2933
## Balanced Accuracy : 0.7109
##
## 'Positive' Class : 1
##
This seems like the way to go. The false negative rates we had before were simply unacceptable. Even though our false positive rates are now also a lot higher, it seems worth it to avoid paying out claims for many individuals from whom we never even collected a risk premium.
Using AUC, sensitivity, and specificity at the p=0.35 cutoff, it looks like both model2 and model3 are better than model1.
Model2 and model3 have very similar stats, but model3 has the benefit of being more parsimonious. So let’s go with that.
In summary, stats for model 3 using p=0.35 cutoff include:
We already printed the confusion matrix.
Let’s now just calculate precision and F1 score.
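A sketch of this calculation, taking the counts directly from the model 3 confusion matrix at the p=0.35 cutoff printed above:
TP <- 1300  # predicted crash, actual crash
FP <- 1094  # predicted crash, no actual crash
FN <- 853   # predicted no crash, actual crash
precision <- TP/(TP + FP)
recall    <- TP/(TP + FN)
F1        <- 2*precision*recall/(precision + recall)
print(paste("Precision =", round(precision, 4)))
print(paste("F1 =", round(F1, 4)))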
## [1] "Precision = 0.543"
## [1] "F1 = 0.5718"
These stats may not seem very good. However, I think the nature of the problem here means that the limited predictive value we find from a simple binary logistic regression is somewhat expected. Lots of people who don’t have 9 points on their license, are over age 25, etc. still get into car crashes just by chance.
We already printed the residual standard error (essentially the square root of the mean squared error), R^2, and F-statistic when we ran the model summaries.
So main additional step now is to run residual plots.
The residuals look normal-ish, and not too biased toward any one direction as the fitted values increase.
The variability also does not seem to change too much across the range of fitted values.
The residuals are often quite large (+/- 1 is an order of magnitude, since this is log10 scale). But that is somewhat expected given the extremely low R^2 of even the better model.
Since model2 is still quite parsimonious and has slightly better performance, let’s go with that.
Since we’ve been applying all the same formatting changes to the evaluation data all along as we have to the training, we should be good to go to apply the models.
Let’s just check and make sure everything looks OK in the evaluation data.
apply(evaluation[, c("Driving_kids", "Commercial_vehicle", "Past_claim", "Revoked",
                     "MVR_PTS_MOD", "Urban_not_rural", "College_education", "Minivan",
                     "Young_age", "New_on_the_job", "Married", "Homeowner", "Single_parent")],
      2, function(x) table(x, useNA = "ifany"))
## $Driving_kids
## x
## 0 1
## 1889 252
##
## $Commercial_vehicle
## x
## 0 1
## 1381 760
##
## $Past_claim
## x
## 0 1
## 1283 858
##
## $Revoked
## x
## 0 1
## 1880 261
##
## $MVR_PTS_MOD
## x
## 0 1 2 3 4 5 6 10 11 12 13
## 946 310 251 208 128 129 75 46 30 8 10
##
## $Urban_not_rural
## x
## 0 1
## 403 1738
##
## $College_education
## x
## 0 1
## 934 1207
##
## $Minivan
## x
## 0 1
## 1592 549
##
## $Young_age
## x
## 0 1
## 2125 16
##
## $New_on_the_job
## x
## 0 1
## 1959 182
##
## $Married
## x
## 0 1
## 847 1294
##
## $Homeowner
## x
## 0 1
## 725 1416
##
## $Single_parent
## x
## 0 1
## 1875 266
summary(evaluation$TRAVTIME)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 22.00 33.00 33.15 43.00 105.00
summary(evaluation$log10.BLUEBOOK)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.176 3.948 4.151 4.111 4.323 4.698
Looks great!
First, we will apply model3 to predict TARGET_FLAG. Then if TARGET_FLAG is predicted to be true, apply model2 from the linear models to predict TARGET_AMT. Otherwise, set TARGET_AMT = 0. Finally, convert TARGET_AMT from log10-scale to the actual values.
Remember, we are using a p=0.35 cutoff, not p=0.5.
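A sketch of this final step (model3_glm and model2_lm are placeholder names for the fitted binary logistic and linear model objects):
eval_crash_prob        <- predict(model3_glm, newdata = evaluation, type = "response")
evaluation$TARGET_FLAG <- as.numeric(eval_crash_prob > 0.35)
eval_log10_amt        <- predict(model2_lm, newdata = evaluation)
evaluation$TARGET_AMT <- ifelse(evaluation$TARGET_FLAG == 1, 10^eval_log10_amt, 0)  # back-transform from log10 scale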
Predictions from the models as applied to the evaluation data will be available here:
Code from this analysis is available here:
https://raw.githubusercontent.com/heathergeiger/Data621_hw4/master/geiger_heather_data621_hw4_cleaned.R
https://raw.githubusercontent.com/heathergeiger/Data621_hw4/master/geiger_heather_data621_hw4_cleaned.Rmd