Here is a description of the assignment and data as provided by Marcus Ellis.
In this homework assignment, you will explore, analyze and model a data set containing approximately 8000 records, each representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A zero means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero.
Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
We will use tidyverse libraries including ggplot2, tidyr, dplyr, and stringr to process this data.
We will also use gridExtra to be able to place ggplot2 plots side-by-side.
We will also use caret and pROC when evaluating models.
Start by reading in and formatting the data.
This initial formatting will include removing the INDEX column. It will also include removing dollar signs and commas from dollar amount columns in order to convert them to numeric.
We read in the training data from here. Note, the file was already manually edited to remove non-standard characters.
https://raw.githubusercontent.com/heathergeiger/Data621_hw4/master/insurance_training_data.csv
As we run each formatting step on the training data, we will also run it on the evaluation data. This way when it comes time to apply the models to the evaluation data, it will be ready to go.
Evaluation data is available here:
https://raw.githubusercontent.com/heathergeiger/Data621_hw4/master/insurance-evaluation-data.csv
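As a rough sketch (not the exact original code), the read-in and formatting step might look like the following. The helper name strip_dollars is ours, and we assume INCOME, HOME_VAL, BLUEBOOK, and OLDCLAIM are the dollar-formatted columns.
library(dplyr)
library(stringr)
# Turn strings like "$67,349" into numeric by stripping dollar signs and commas.
strip_dollars <- function(x) as.numeric(str_replace_all(x, "[$,]", ""))
train_url <- "https://raw.githubusercontent.com/heathergeiger/Data621_hw4/master/insurance_training_data.csv"
eval_url  <- "https://raw.githubusercontent.com/heathergeiger/Data621_hw4/master/insurance-evaluation-data.csv"
insurance  <- read.csv(train_url, stringsAsFactors = FALSE)
evaluation <- read.csv(eval_url, stringsAsFactors = FALSE)
dollar_columns <- c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM")
insurance <- insurance %>%
  select(-INDEX) %>%
  mutate(across(all_of(dollar_columns), strip_dollars))
evaluation <- evaluation %>%
  select(-INDEX) %>%
  mutate(across(all_of(dollar_columns), strip_dollars))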
We start by looking at the number of non-NA values per column, which lets us check for missing values.
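A sketch of one way to produce this count (the exact original code is not shown, so the formatting of the output may differ slightly):
library(tidyr)
print("Total number of records in data:")
print(nrow(insurance))
# Count non-NA values per column, then sort ascending so columns with missing data appear first.
non_na_counts <- insurance %>%
  summarise(across(everything(), ~ sum(!is.na(.x)))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "n") %>%
  arrange(n) %>%
  as.data.frame()
print("Number of non-NA values per variable in data:")
print(non_na_counts)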
## [1] "Total number of records in data:"
## [1] 8161
## [1] "Number of non-NA values per variable in data:"
## Variable n
## 1 CAR_AGE 7651
## 2 HOME_VAL 7697
## 3 YOJ 7707
## 4 INCOME 7716
## 5 AGE 8155
## 6 BLUEBOOK 8161
## 7 CAR_TYPE 8161
## 8 CAR_USE 8161
## 9 CLM_FREQ 8161
## 10 EDUCATION 8161
## 11 HOMEKIDS 8161
## 12 JOB 8161
## 13 KIDSDRIV 8161
## 14 MSTATUS 8161
## 15 MVR_PTS 8161
## 16 OLDCLAIM 8161
## 17 PARENT1 8161
## 18 RED_CAR 8161
## 19 REVOKED 8161
## 20 SEX 8161
## 21 TARGET_AMT 8161
## 22 TARGET_FLAG 8161
## 23 TIF 8161
## 24 TRAVTIME 8161
## 25 URBANICITY 8161
We find around 445-510 NAs each for CAR_AGE, HOME_VAL, YOJ, and INCOME, and just 6 NAs for AGE.
Next, find the number of unique values per column in the training data. This will give us a sense of the variable type of each column.
## Variable Num.uniques
## 5 CAR_USE 2
## 13 MSTATUS 2
## 16 PARENT1 2
## 17 RED_CAR 2
## 18 REVOKED 2
## 19 SEX 2
## 21 TARGET_FLAG 2
## 24 URBANICITY 2
## 7 EDUCATION 5
## 12 KIDSDRIV 5
## 4 CAR_TYPE 6
## 6 CLM_FREQ 6
## 9 HOMEKIDS 6
## 11 JOB 9
## 14 MVR_PTS 13
## 25 YOJ 22
## 22 TIF 23
## 3 CAR_AGE 31
## 1 AGE 61
## 23 TRAVTIME 97
## 20 TARGET_AMT 1949
## 2 BLUEBOOK 2789
## 15 OLDCLAIM 2857
## 8 HOME_VAL 5107
## 10 INCOME 6613
## [1] "Levels of EDUCATION:"
## [1] "PhD" "z_High School" "<High School" "Bachelors"
## [5] "Masters"
## [1] "Levels of HOMEKIDS:"
## [1] 0 1 2 3 4 5
We find a number of variables are either binary (2 unique values) or have only a few unique values. For example, the 5 unique values in EDUCATION correspond to five possible levels of education.
We also find a few multi-level variables that may be best converted to binary. For example, for HOMEKIDS (# children at home), whether or not there are any children at home is likely the most relevant information, while the exact number of children may be less relevant.
Let’s start with a simple barplot to show the percent of records with each level of the binary variables.
Here, we will reformat each binary variable to be expressed as either 0 or 1. For example instead of “SEX” with male and female, we will have variable “Male” that is true (1) when the individual is a male.
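For a few of these variables, the recoding might look roughly like the sketch below; the exact raw level strings (e.g. "M", "Yes", "Commercial") are assumptions rather than copied from the original code, and the same step is applied to the evaluation data.
insurance <- insurance %>%
  mutate(Male               = as.numeric(SEX == "M"),
         Married            = as.numeric(MSTATUS == "Yes"),
         Single_parent      = as.numeric(PARENT1 == "Yes"),
         Revoked            = as.numeric(REVOKED == "Yes"),
         Commercial_vehicle = as.numeric(CAR_USE == "Commercial"),
         Urban_not_rural    = as.numeric(grepl("Urban", URBANICITY)))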
We find that a bit over 25% of records are for individuals who were in a car crash (our target for the binary logistic regression).
Some predictor binary variables are relatively evenly split (e.g. a 60/40 split for whether people are married, or a nearly 50/50 split by sex). Others are less even (e.g. it looks like around 80% of records are for individuals living or working in an urban area).
Next, let’s look at the variables that are not binary, but have relatively few unique values.
These include factor variables (like education) as well as select discrete numeric variables like HOMEKIDS.
Based on the barplots, it really seems like CLM_FREQ, HOMEKIDS, and KIDSDRIV would be best converted to a binary variable that is true when >0.
Let’s do this transformation now.
Rename these variables “Past_claim”, “Kids”, and “Driving_kids”.
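A minimal sketch of this step (the helper name make_gt_zero_flags is ours):
make_gt_zero_flags <- function(df) {
  df %>%
    mutate(Past_claim   = as.numeric(CLM_FREQ > 0),
           Kids         = as.numeric(HOMEKIDS > 0),
           Driving_kids = as.numeric(KIDSDRIV > 0))
}
insurance  <- make_gt_zero_flags(insurance)
evaluation <- make_gt_zero_flags(evaluation)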
Looking at this barplot, I think it would be reasonable to consider all records with more than 10 MVR_PTS as having 10, since there are so few.
We find one record with CAR_AGE = -3, which I assume is a simple error.
We find many cars that are one year old; once we look at cars that are 3+ years old, the distribution becomes roughly normal (slightly right-skewed, with a few very old cars).
Many records are for individuals who have only been insured with the company for 1 year, though there are also a fair number of individuals who have been insured with us for quite a while.
For years on the job, there are relatively few individuals that have been on the job for just a few years (1-4). Most individuals are either new (on the job for 0 years) or have been on the job for a while (5+ years).
Based on these barplots, it seems most sensible to convert years on the job to a binary variable.
The vast majority of individuals have been on the job for 5+ years. And I doubt that individuals who have been on the job for 10 or 15 years are inherently different than those on the job for 5 or 6.
So, let’s convert this variable to a binary called “New_on_the_job” that is true when YOJ = 0 and false otherwise.
Then, we’ll simply use the most frequent value (“New_on_the_job” = FALSE) to fill in missing values.
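A sketch of this transformation, with NAs set to the most frequent value (0, i.e. not new on the job); the helper name is ours:
add_new_on_the_job <- function(df) {
  df %>%
    mutate(New_on_the_job = ifelse(is.na(YOJ), 0, as.numeric(YOJ == 0)))
}
insurance  <- add_new_on_the_job(insurance)
evaluation <- add_new_on_the_job(evaluation)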
We will use histograms to display the frequency of numeric variables with 50+ unique values.
For home value, we will assume that NA’s are the same as 0’s, i.e. the individual does not own a home. The histogram will therefore only show values > 0; the 0’s will simply be counted.
We’ll also convert the NA’s to 0’s now.
For income, we will count the number of 0’s, but then show only values > 0 for the histogram.
## [1] "Percent of records with HOME_VAL > 0:"
## [1] 66.21
## [1] "Percent of records with INCOME not stated:"
## [1] 5.45
## [1] "Percent of records with INCOME == 0:"
## [1] 7.54
## [1] "Percent of records with INCOME > 0:"
## [1] 87.01
We find that age is roughly normally distributed, and the min/max also make sense.
Travel time is right-skewed. Most people have commutes under an hour, but then there are a few individuals with very long commutes.
As we might expect, all the variables representing dollar amounts (including TARGET_AMT) are right-skewed. We may want to log-transform these variables to bring them closer to normal.
In this section, we will explore the correlation of predictor variables with TARGET_FLAG (binary variable of whether or not the individual has had a crash).
Start by looking at the frequency of crashes for each level of binary variables.
This should include adding a binary variable “Homeowner” that says whether or not an individual is a homeowner.
We find a less than 1% difference in the proportion of individuals with vs. without red cars who crash their vehicles.
We also find a very minimal difference between genders, and it is actually females who have a slightly higher crash rate.
Other than that, we find a lot of the differences we might have predicted. Individuals driving commercial vehicles, unmarried individuals, those whose licenses have been revoked, urban drivers, those who do not own homes, those who have had a claim in the past five years, and those with driving teenagers in their family all have crashes at a higher rate.
We also find that single parents and those with one or more children at home crash at a higher rate.
First of all, how are single parents defined? All unmarried people with kids in the home, or some other additional criteria? Let’s check.
## [1] "Number of unmarried individuals with children:"
## [1] 1077
## [1] "Number of single parents:"
## [1] 1077
Looks like single parents are simply defined as any unmarried person with children.
The group of those with one or more children at home includes single parents. How do the crash rates of partnered parents compare to those without children?
## [1] "Crash rate of single parents:"
## [1] 44.2
## [1] "Crash rate of partnered parents:"
## [1] 28.08
## [1] "Crash rate of individuals without children:"
## [1] 22.18
Looks like on its own, having children is not associated with a huge increased risk of crashes. The risk associated with having children appears inflated because this group includes single parents.
We find that crash rate is substantially lower among those with at least a Bachelors degree.
There are even further reductions in crash risk among those with a Masters or Phd, but I think we can create a simpler yet still very good model by reducing to a binary of either college-educated or not.
We find that minivan drivers are substantially safer than drivers of other types of cars.
We find that students and blue collar workers appear to crash at a higher rate than individuals with other types of jobs.
Let’s add binary variables for whether or not an individual has at least a college education, whether or not they drive a minivan, and whether or not they are a student or blue collar worker.
Let’s start by looking at crash rates for each car age.
## [1] "Crash rate when CAR_AGE = 1:"
## [1] 32.37
## [1] "Crash rate when CAR_AGE >= 20:"
## [1] 17.96
Does not seem like there is a clear pattern here. It looks on average like drivers of cars that are more than 10 years old may be a bit safer, but it is probably not a strong enough correlation to include this variable in the model.
What about time in force?
We will show only TIF values under 10, since each of those exact values occurs fairly often.
There seems to be some correlation between longer time in force and being safer, but the association is not very strong. I don’t think we will want to include this variable in the model.
Finally, let’s look at motor vehicle report points, where we expect a strong correlation.
It looks like there is a nearly perfect linear relationship between motor vehicle record points and crash rate. Except, the crash rate with 7+ points is higher than a linear model would predict.
What if we transform this variable a bit by adding a constant when motor vehicle record points >= 7?
Let’s add a variable “MVR_PTS_MOD” where a constant of 3 is added when MVR_PTS >= 7.
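Combining this with the earlier cap at 10 points, the transformation might look like the following sketch (the helper name is ours):
add_mvr_pts_mod <- function(df) {
  df %>%
    mutate(MVR_PTS_MOD = pmin(MVR_PTS, 10),                                       # cap at 10
           MVR_PTS_MOD = ifelse(MVR_PTS_MOD >= 7, MVR_PTS_MOD + 3, MVR_PTS_MOD))  # add 3 for 7+ points
}
insurance  <- add_mvr_pts_mod(insurance)
evaluation <- add_mvr_pts_mod(evaluation)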
Let’s look at the crash rate among younger (under age 25) and older (65+) drivers vs. the rest of the population (age 25-64).
## [1] "Crash rate for age group Young (<25): 56.94"
## [1] "Crash rate for age group In-between (25-64): 26.12"
## [1] "Crash rate for age group Senior (65+): 21.25"
Looks like being under age 25 is definitely associated with higher crash risk, while being older (age 65+) is not.
Let’s add a binary variable Young_age that is true when age < 25.
Since the majority of individuals in the data are age 25+, I think it reasonable to assume this will be false for the few cases where age = NA.
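A sketch of this step (the helper name is ours), with NA ages treated as not young:
add_young_age <- function(df) {
  df %>%
    mutate(Young_age = ifelse(is.na(AGE), 0, as.numeric(AGE < 25)))
}
insurance  <- add_young_age(insurance)
evaluation <- add_young_age(evaluation)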
We can look at the correlation of these numeric variables with TARGET_FLAG via a simple set of boxplots.
As we might expect, it appears that individuals with longer commute times tend to crash more often, while individuals with higher incomes and more expensive cars tend to crash less often.
We see some obvious correlations, like between “Married” and “Single_parent” and between “Kids” and “Driving_kids”.
Let’s replot minus these.
Not shown, but I checked previously that the correlation between driving a commercial vehicle and being a blue-collar worker is positive. Most of the increase in crash risk we saw among blue collar workers is likely due to this, so we probably don’t need to include both variables in the model.
We find that married individuals are much more likely to own a home (correlation is positive), which makes sense.
We also see some other correlations that are less strong but also make sense, like between urban vs. rural and having at least a college education.
Let’s check the correlations within each pair of INCOME, BLUEBOOK, and TRAVTIME.
## [1] "Correlation INCOME and BLUEBOOK:"
## [1] 0.43
## [1] "Correlation INCOME and TRAVTIME:"
## [1] -0.05
## [1] "Correlation TRAVTIME and BLUEBOOK:"
## [1] -0.02
There is next to no correlation of income or car value with commute time, but income and car value are somewhat correlated with each other.
What is the correlation of our binary variables with income?
## Variable Correlation
## 9 New_on_the_job -0.37
## 13 Student -0.36
## 6 Kids -0.16
## 15 Young_age -0.09
## 12 Single_parent -0.08
## 10 Past_claim -0.07
## 4 Driving_kids -0.05
## 1 Blue_collar -0.03
## 7 Married -0.03
## 11 Revoked -0.02
## 8 Minivan 0.04
## 3 Commercial_vehicle 0.08
## 5 Homeowner 0.11
## 14 Urban_not_rural 0.21
## 2 College_education 0.51
We see a number of correlations that make sense, such as urban, college-educated, and home-owning individuals having higher incomes, while students and young people tend to have lower incomes.
What about travel time?
## Variable Correlation
## 14 Urban_not_rural -0.17
## 2 College_education -0.04
## 12 Single_parent -0.02
## 5 Homeowner -0.01
## 6 Kids -0.01
## 8 Minivan -0.01
## 11 Revoked -0.01
## 4 Driving_kids 0.00
## 10 Past_claim 0.00
## 7 Married 0.01
## 3 Commercial_vehicle 0.02
## 15 Young_age 0.02
## 9 New_on_the_job 0.03
## 13 Student 0.03
## 1 Blue_collar 0.04
Rural individuals tend to have longer commutes, which makes sense.
Other than that, none of these correlations are very strong.
In this section, we will explore the correlation of different variables with TARGET_AMT within those individuals who have had a crash.
The only correlation we find over 0.1 for TARGET_AMT (either transformed or not) is with BLUEBOOK (either transformed or not).
It appears that vans and panel trucks may tend to have higher claim amounts when they get into a crash.
I suspect this may be linked to a higher rate of use of these car types among commercial vehicles. Let’s check claim amounts for commercial vs. private vehicles.
Looks like this may explain at least some of the difference, though it is unclear if it explains it all.
Let’s get correlation coefficients of each binary variable with log10(TARGET_AMT).
## Variable Correlation
## 7 Married -0.05
## 9 New_on_the_job -0.04
## 11 Revoked -0.02
## 13 Student -0.02
## 1 Blue_collar -0.01
## 5 Homeowner -0.01
## 8 Minivan -0.01
## 15 Young_age -0.01
## 4 Driving_kids 0.00
## 6 Kids 0.00
## 10 Past_claim 0.00
## 2 College_education 0.02
## 14 Urban_not_rural 0.02
## 12 Single_parent 0.03
## 3 Commercial_vehicle 0.04
We get a small positive correlation of commercial vehicle with the claim amount, as we might expect.
Let’s summarize the transformations we have completed so far.
First, we created the following binary variables.
We also modified motor vehicle record points to max out at 10, then added 3 when an individual had 7+ points to better create a linear fit of crash rate vs. points.
We shouldn’t need to do much in the way of additional transformations.
One additional variable to add for the linear model step might be a dummy variable that is 1 when car type = van, 2 when car type = panel truck, and 0 otherwise. Call this “Car_type_for_claim_amount”.
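A sketch of this dummy variable (the exact CAR_TYPE level strings "Van" and "Panel Truck" are assumptions; the helper name is ours):
add_car_type_for_claim_amount <- function(df) {
  df %>%
    mutate(Car_type_for_claim_amount = case_when(CAR_TYPE == "Van"         ~ 1,
                                                 CAR_TYPE == "Panel Truck" ~ 2,
                                                 TRUE                      ~ 0))
}
insurance  <- add_car_type_for_claim_amount(insurance)
evaluation <- add_car_type_for_claim_amount(evaluation)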
Another thing we need to do is address the collinearity between variables kids, married, single parent, and homeowner.
For kids, it is probably simplest to just drop this variable.
For the remaining three, we find that single parent is associated with the biggest crash risk, then being a renter, then being unmarried.
Individuals who are both single parents and renters have an especially high risk.
We can combine these variables into a dummy variable with the following values.
Let’s call this variable “marriage_home_and_kids_score”.
Oh, and let’s replace NA’s in income with the median across training and evaluation data.
Let’s just check and make sure this number seems reasonable.
## [1] "Median income in training data:"
## [1] 54028
## [1] "Median income in evaluation data:"
## [1] 51778
## [1] "Median income across both:"
## [1] 53529
Seems reasonable. This should be an OK strategy.
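A sketch of the imputation step:
# Median income across training and evaluation data, ignoring NAs, then fill in the NAs.
income_median <- median(c(insurance$INCOME, evaluation$INCOME), na.rm = TRUE)
insurance$INCOME[is.na(insurance$INCOME)]   <- income_median
evaluation$INCOME[is.na(evaluation$INCOME)] <- income_median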
We start with most of the binary variables, excluding the few that did not seem correlated with TARGET_FLAG. We also leave out kids/single parent/married/homeowner, since kids is entangled with the other variables, and single parent/married/homeowner are already combined into the “marriage, home, and kids score”.
We also include commute time and motor vehicle record points (modified).
##
## Call:
## glm(formula = TARGET_FLAG ~ Driving_kids + New_on_the_job + TRAVTIME +
## Commercial_vehicle + Past_claim + Revoked + MVR_PTS_MOD +
## Urban_not_rural + College_education + Minivan + Young_age +
## marriage_home_and_kids_score, family = "binomial", data = insurance)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4727 -0.7327 -0.4461 0.6446 3.1712
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.780972 0.151023 -18.414 < 2e-16 ***
## Driving_kids 0.594295 0.082845 7.174 7.31e-13 ***
## New_on_the_job 0.670628 0.104760 6.402 1.54e-10 ***
## TRAVTIME 0.015198 0.001846 8.233 < 2e-16 ***
## Commercial_vehicle 0.565967 0.058688 9.644 < 2e-16 ***
## Past_claim 0.464455 0.063182 7.351 1.97e-13 ***
## Revoked 0.760794 0.078551 9.685 < 2e-16 ***
## MVR_PTS_MOD 0.093085 0.011561 8.052 8.18e-16 ***
## Urban_not_rural 2.217961 0.112955 19.636 < 2e-16 ***
## College_education -0.817193 0.058417 -13.989 < 2e-16 ***
## Minivan -0.692472 0.073067 -9.477 < 2e-16 ***
## Young_age 0.874548 0.277621 3.150 0.00163 **
## marriage_home_and_kids_score -0.157879 0.011583 -13.631 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7538.4 on 8148 degrees of freedom
## AIC: 7564.4
##
## Number of Fisher Scoring iterations: 5
We find that all of the variables we selected are significant.
The directions of the coefficients also all make sense. Having a teenager in the family driving, being new on the job, having a longer commute, driving commercially, having a past claim, having had a revoked license, having points on your record, being urban, and being young all have positive coefficients. Being college-educated, driving a minivan, and having a high “marriage home and kids” score all have negative coefficients, as these variables are associated with lower crash risk.
For the second model, let’s add log10-transformed income. Even though it is somewhat entangled with the other variables, maybe it will still prove useful.
##
## Call:
## glm(formula = TARGET_FLAG ~ Driving_kids + New_on_the_job + TRAVTIME +
## Commercial_vehicle + Past_claim + Revoked + MVR_PTS_MOD +
## Urban_not_rural + College_education + Minivan + Young_age +
## marriage_home_and_kids_score + log10(INCOME + 1), family = "binomial",
## data = insurance)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4839 -0.7341 -0.4411 0.6456 3.1761
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.724450 0.263332 -6.549 5.81e-11 ***
## Driving_kids 0.599389 0.082966 7.225 5.03e-13 ***
## New_on_the_job -0.382268 0.242726 -1.575 0.11528
## TRAVTIME 0.015192 0.001848 8.221 < 2e-16 ***
## Commercial_vehicle 0.593552 0.059101 10.043 < 2e-16 ***
## Past_claim 0.456967 0.063296 7.220 5.22e-13 ***
## Revoked 0.759203 0.078694 9.648 < 2e-16 ***
## MVR_PTS_MOD 0.093139 0.011581 8.042 8.82e-16 ***
## Urban_not_rural 2.245511 0.113401 19.802 < 2e-16 ***
## College_education -0.729870 0.061086 -11.948 < 2e-16 ***
## Minivan -0.689019 0.073180 -9.415 < 2e-16 ***
## Young_age 0.768947 0.278636 2.760 0.00579 **
## marriage_home_and_kids_score -0.155624 0.011618 -13.395 < 2e-16 ***
## log10(INCOME + 1) -0.246976 0.050943 -4.848 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7514.3 on 8147 degrees of freedom
## AIC: 7542.3
##
## Number of Fisher Scoring iterations: 5
We find that income is negatively associated with crash risk, which makes sense.
Most of the other coefficients stay in the same direction, except now New_on_the_job changes direction and has a much less significant p-value.
What if we keep log10(INCOME) in the model and remove New_on_the_job?
##
## Call:
## glm(formula = TARGET_FLAG ~ Driving_kids + TRAVTIME + Commercial_vehicle +
## Past_claim + Revoked + MVR_PTS_MOD + Urban_not_rural + College_education +
## Minivan + Young_age + marriage_home_and_kids_score + log10(INCOME +
## 1), family = "binomial", data = insurance)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5041 -0.7337 -0.4419 0.6411 3.1777
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.054284 0.160723 -12.782 < 2e-16 ***
## Driving_kids 0.598559 0.082963 7.215 5.40e-13 ***
## TRAVTIME 0.015222 0.001848 8.237 < 2e-16 ***
## Commercial_vehicle 0.588271 0.058989 9.973 < 2e-16 ***
## Past_claim 0.458530 0.063276 7.246 4.28e-13 ***
## Revoked 0.759670 0.078679 9.655 < 2e-16 ***
## MVR_PTS_MOD 0.092883 0.011580 8.021 1.05e-15 ***
## Urban_not_rural 2.245944 0.113449 19.797 < 2e-16 ***
## College_education -0.751540 0.059520 -12.627 < 2e-16 ***
## Minivan -0.687824 0.073163 -9.401 < 2e-16 ***
## Young_age 0.794146 0.277344 2.863 0.00419 **
## marriage_home_and_kids_score -0.155767 0.011615 -13.411 < 2e-16 ***
## log10(INCOME + 1) -0.174704 0.022048 -7.924 2.31e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7516.8 on 8148 degrees of freedom
## AIC: 7542.8
##
## Number of Fisher Scoring iterations: 5
The model makes a lot more sense now.
The “marriage home and kids score” will probably result in improved accuracy, but it is a bit complicated. Let’s also try just including the individual variables instead.
Otherwise, keep the same as model 1.
##
## Call:
## glm(formula = TARGET_FLAG ~ Driving_kids + TRAVTIME + Commercial_vehicle +
## Past_claim + Revoked + MVR_PTS_MOD + Urban_not_rural + College_education +
## Minivan + Young_age + New_on_the_job + Married + Homeowner +
## Single_parent, family = "binomial", data = insurance)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4502 -0.7351 -0.4398 0.6559 3.2009
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.415397 0.148471 -23.004 < 2e-16 ***
## Driving_kids 0.673539 0.086214 7.812 5.61e-15 ***
## TRAVTIME 0.015198 0.001849 8.218 < 2e-16 ***
## Commercial_vehicle 0.567382 0.058794 9.650 < 2e-16 ***
## Past_claim 0.456181 0.063293 7.207 5.70e-13 ***
## Revoked 0.761552 0.078629 9.685 < 2e-16 ***
## MVR_PTS_MOD 0.094359 0.011582 8.147 3.74e-16 ***
## Urban_not_rural 2.219943 0.112806 19.679 < 2e-16 ***
## College_education -0.845860 0.059053 -14.324 < 2e-16 ***
## Minivan -0.693897 0.073063 -9.497 < 2e-16 ***
## Young_age 0.923453 0.277633 3.326 0.000881 ***
## New_on_the_job 0.705639 0.106023 6.656 2.82e-11 ***
## Married -0.442006 0.077264 -5.721 1.06e-08 ***
## Homeowner -0.251675 0.070654 -3.562 0.000368 ***
## Single_parent 0.455344 0.093359 4.877 1.08e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7524.3 on 8146 degrees of freedom
## AIC: 7554.3
##
## Number of Fisher Scoring iterations: 5
Directions of the coefficients all make sense.
In this section, we will create linear models of TARGET_AMT for the records where TARGET_FLAG is true.
First, let’s try modeling both log10(BLUEBOOK) and “Car_type_for_claim_amount”.
##
## Call:
## lm(formula = log10(TARGET_AMT) ~ log10(BLUEBOOK) + Car_type_for_claim_amount,
## data = insurance[insurance$TARGET_FLAG == 1, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.07923 -0.17163 0.01576 0.17112 1.40108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.99246 0.12170 24.589 < 2e-16 ***
## log10(BLUEBOOK) 0.14699 0.03028 4.855 1.29e-06 ***
## Car_type_for_claim_amount 0.01232 0.01456 0.847 0.397
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.35 on 2150 degrees of freedom
## Multiple R-squared: 0.01725, Adjusted R-squared: 0.01634
## F-statistic: 18.87 on 2 and 2150 DF, p-value: 7.528e-09
It actually looks like Car_type_for_claim_amount is not significant. What if we treat it as a factor instead?
##
## Call:
## lm(formula = log10(TARGET_AMT) ~ log10(BLUEBOOK) + factor(Car_type_for_claim_amount),
## data = insurance[insurance$TARGET_FLAG == 1, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.07975 -0.17217 0.01617 0.17099 1.41360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.98780 0.12198 24.493 < 2e-16
## log10(BLUEBOOK) 0.14833 0.03037 4.884 1.12e-06
## factor(Car_type_for_claim_amount)1 -0.00124 0.02757 -0.045 0.964
## factor(Car_type_for_claim_amount)2 0.03037 0.03075 0.988 0.323
##
## (Intercept) ***
## log10(BLUEBOOK) ***
## factor(Car_type_for_claim_amount)1
## factor(Car_type_for_claim_amount)2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3501 on 2149 degrees of freedom
## Multiple R-squared: 0.0174, Adjusted R-squared: 0.01603
## F-statistic: 12.69 on 3 and 2149 DF, p-value: 3.214e-08
Still not significant. Let’s actually just make model 1 log10(TARGET_AMT) vs. log10(BLUEBOOK).
##
## Call:
## lm(formula = log10(TARGET_AMT) ~ log10(BLUEBOOK), data = insurance[insurance$TARGET_FLAG ==
## 1, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.07976 -0.16969 0.01844 0.16997 1.40784
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.94366 0.10718 27.465 < 2e-16 ***
## log10(BLUEBOOK) 0.15976 0.02626 6.085 1.38e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.35 on 2151 degrees of freedom
## Multiple R-squared: 0.01692, Adjusted R-squared: 0.01646
## F-statistic: 37.02 on 1 and 2151 DF, p-value: 1.377e-09
R^2 is extremely poor (less than 2%), but this is somewhat expected given that the correlation was not very good.
Let’s see what a model looks like including all of the same variables as the first model for binary logistic regression, plus log10(BLUEBOOK).
##
## Call:
## lm(formula = log10(TARGET_AMT) ~ log10(BLUEBOOK) + Driving_kids +
## New_on_the_job + TRAVTIME + Commercial_vehicle + Past_claim +
## Revoked + MVR_PTS_MOD + Urban_not_rural + College_education +
## Minivan + Young_age + marriage_home_and_kids_score, data = insurance[insurance$TARGET_FLAG ==
## 1, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.06507 -0.17124 0.01447 0.17278 1.38368
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.9734149 0.1193233 24.919 < 2e-16 ***
## log10(BLUEBOOK) 0.1590873 0.0285612 5.570 2.87e-08 ***
## Driving_kids -0.0117642 0.0203230 -0.579 0.5627
## New_on_the_job -0.0190492 0.0246131 -0.774 0.4390
## TRAVTIME -0.0001705 0.0005019 -0.340 0.7341
## Commercial_vehicle 0.0027962 0.0159351 0.175 0.8607
## Past_claim -0.0089963 0.0164609 -0.547 0.5848
## Revoked -0.0147670 0.0187696 -0.787 0.4315
## MVR_PTS_MOD 0.0047147 0.0025369 1.858 0.0632 .
## Urban_not_rural 0.0106175 0.0340155 0.312 0.7550
## College_education -0.0063219 0.0157211 -0.402 0.6876
## Minivan -0.0101147 0.0207952 -0.486 0.6267
## Young_age -0.0078310 0.0562297 -0.139 0.8893
## marriage_home_and_kids_score -0.0050570 0.0027675 -1.827 0.0678 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3502 on 2139 degrees of freedom
## Multiple R-squared: 0.02098, Adjusted R-squared: 0.01503
## F-statistic: 3.526 on 13 and 2139 DF, p-value: 1.744e-05
Remove all variables with p-value > 0.1.
##
## Call:
## lm(formula = log10(TARGET_AMT) ~ log10(BLUEBOOK) + MVR_PTS_MOD +
## marriage_home_and_kids_score, data = insurance[insurance$TARGET_FLAG ==
## 1, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.05687 -0.17393 0.01758 0.17377 1.38698
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.946220 0.108028 27.273 < 2e-16 ***
## log10(BLUEBOOK) 0.162285 0.026251 6.182 7.55e-10 ***
## MVR_PTS_MOD 0.004275 0.002341 1.826 0.068 .
## marriage_home_and_kids_score -0.004399 0.002685 -1.639 0.101
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3496 on 2149 degrees of freedom
## Multiple R-squared: 0.01983, Adjusted R-squared: 0.01846
## F-statistic: 14.49 on 3 and 2149 DF, p-value: 2.39e-09
We find our adjusted R^2 is improved a tiny bit from model 1.
What if we include log10(BLUEBOOK) plus Kids, Married, Homeowner, and Single_parent instead of marriage_home_and_kids_score?
##
## Call:
## lm(formula = log10(TARGET_AMT) ~ log10(BLUEBOOK) + Kids + Married +
## Homeowner + Single_parent + MVR_PTS_MOD, data = insurance[insurance$TARGET_FLAG ==
## 1, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.05026 -0.17163 0.01709 0.17423 1.39271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.939893 0.109254 26.909 < 2e-16 ***
## log10(BLUEBOOK) 0.159614 0.026405 6.045 1.76e-09 ***
## Kids 0.005139 0.021626 0.238 0.8122
## Married -0.033523 0.022872 -1.466 0.1429
## Homeowner 0.008364 0.018292 0.457 0.6475
## Single_parent 0.006748 0.030312 0.223 0.8238
## MVR_PTS_MOD 0.004469 0.002347 1.904 0.0571 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3497 on 2146 degrees of freedom
## Multiple R-squared: 0.02086, Adjusted R-squared: 0.01813
## F-statistic: 7.622 on 6 and 2146 DF, p-value: 4.095e-08
Looks like being married is the most correlated. Let’s look just at this plus log10(BLUEBOOK) and MVR_PTS_MOD.
##
## Call:
## lm(formula = log10(TARGET_AMT) ~ log10(BLUEBOOK) + Married +
## MVR_PTS_MOD, data = insurance[insurance$TARGET_FLAG == 1,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.05141 -0.17297 0.01786 0.17287 1.39181
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.949001 0.107863 27.340 < 2e-16 ***
## log10(BLUEBOOK) 0.159228 0.026234 6.070 1.51e-09 ***
## Married -0.031628 0.015073 -2.098 0.0360 *
## MVR_PTS_MOD 0.004459 0.002336 1.909 0.0564 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3495 on 2149 degrees of freedom
## Multiple R-squared: 0.02062, Adjusted R-squared: 0.01925
## F-statistic: 15.08 on 3 and 2149 DF, p-value: 1.033e-09
This seems like the best model. It improves adjusted R^2 to almost 2%, and all variables are at or very close to significant.
I am not quite sure why married individuals would have smaller claims, but at least the lower crash risk and lower claim amounts are compatible: both would support giving married individuals a discount.
Just to make things simpler later on, add columns log10.INCOME and log10.BLUEBOOK to insurance and evaluation, and redefine the models to use these. Also add log10.TARGET.
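A sketch of adding these convenience columns (the helper name is ours; log10.TARGET is only meaningful for training records where TARGET_AMT > 0):
add_log10_columns <- function(df) {
  df %>%
    mutate(log10.INCOME   = log10(INCOME + 1),
           log10.BLUEBOOK = log10(BLUEBOOK))
}
insurance  <- add_log10_columns(insurance)
evaluation <- add_log10_columns(evaluation)
insurance$log10.TARGET <- log10(insurance$TARGET_AMT)  # -Inf for non-crash rows, which the linear models do not use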
Let’s run caret::confusionMatrix on each of our three models using a p=0.5 cutoff.
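For one model, this might look like the sketch below (model1_glm is a placeholder name for the fitted glm object; the same call with a different cutoff gives the p=0.35 matrices later on):
library(caret)
cutoff <- 0.5
model1_numeric_predictions <- predict(model1_glm, type = "response")  # predicted probabilities on training data
model1_class_predictions   <- as.numeric(model1_numeric_predictions > cutoff)
confusionMatrix(factor(model1_class_predictions, levels = c(0, 1)),
                factor(insurance$TARGET_FLAG, levels = c(0, 1)),
                positive = "1")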
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5582 1369
## 1 426 784
##
## Accuracy : 0.7801
## 95% CI : (0.7709, 0.789)
## No Information Rate : 0.7362
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3412
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.36414
## Specificity : 0.92909
## Pos Pred Value : 0.64793
## Neg Pred Value : 0.80305
## Prevalence : 0.26382
## Detection Rate : 0.09607
## Detection Prevalence : 0.14827
## Balanced Accuracy : 0.64662
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5584 1359
## 1 424 794
##
## Accuracy : 0.7815
## 95% CI : (0.7724, 0.7904)
## No Information Rate : 0.7362
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3465
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.36879
## Specificity : 0.92943
## Pos Pred Value : 0.65189
## Neg Pred Value : 0.80426
## Prevalence : 0.26382
## Detection Rate : 0.09729
## Detection Prevalence : 0.14925
## Balanced Accuracy : 0.64911
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5575 1356
## 1 433 797
##
## Accuracy : 0.7808
## 95% CI : (0.7717, 0.7897)
## No Information Rate : 0.7362
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3457
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.37018
## Specificity : 0.92793
## Pos Pred Value : 0.64797
## Neg Pred Value : 0.80436
## Prevalence : 0.26382
## Detection Rate : 0.09766
## Detection Prevalence : 0.15072
## Balanced Accuracy : 0.64906
##
## 'Positive' Class : 1
##
Stats look relatively similar. Let’s also plot ROC curves.
##
## Call:
## roc.default(response = insurance$TARGET_FLAG, predictor = model1_numeric_predictions, plot = TRUE, print.thres = c(0.25, 0.3, 0.35, 0.4, 0.45, 0.5), main = "Model1")
##
## Data: model1_numeric_predictions in 6008 controls (insurance$TARGET_FLAG 0) < 2153 cases (insurance$TARGET_FLAG 1).
## Area under the curve: 0.7969
##
## Call:
## roc.default(response = insurance$TARGET_FLAG, predictor = model2_numeric_predictions, plot = TRUE, print.thres = c(0.25, 0.3, 0.35, 0.4, 0.45, 0.5), main = "Model2")
##
## Data: model2_numeric_predictions in 6008 controls (insurance$TARGET_FLAG 0) < 2153 cases (insurance$TARGET_FLAG 1).
## Area under the curve: 0.7986
##
## Call:
## roc.default(response = insurance$TARGET_FLAG, predictor = model3_numeric_predictions, plot = TRUE, print.thres = c(0.25, 0.3, 0.35, 0.4, 0.45, 0.5), main = "Model3")
##
## Data: model3_numeric_predictions in 6008 controls (insurance$TARGET_FLAG 0) < 2153 cases (insurance$TARGET_FLAG 1).
## Area under the curve: 0.7976
From the company’s perspective, we may want to slightly increase sensitivity at the expense of some specificity. We can’t go too crazy, as our customers may go to a competitor if we charge risk premiums unnecessarily all the time. At the same time, false negatives are also very risky financially here, as one claim can be thousands of dollars.
Let’s look at the confusion matrices and stats for each of our models if we set a cutoff of p=0.35 instead of p=0.5.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4913 868
## 1 1095 1285
##
## Accuracy : 0.7595
## 95% CI : (0.75, 0.7687)
## No Information Rate : 0.7362
## P-Value [Acc > NIR] : 7.743e-07
##
## Kappa : 0.401
## Mcnemar's Test P-Value : 3.380e-07
##
## Sensitivity : 0.5968
## Specificity : 0.8177
## Pos Pred Value : 0.5399
## Neg Pred Value : 0.8499
## Prevalence : 0.2638
## Detection Rate : 0.1575
## Detection Prevalence : 0.2916
## Balanced Accuracy : 0.7073
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4933 865
## 1 1075 1288
##
## Accuracy : 0.7623
## 95% CI : (0.7529, 0.7715)
## No Information Rate : 0.7362
## P-Value [Acc > NIR] : 3.425e-08
##
## Kappa : 0.4066
## Mcnemar's Test P-Value : 2.084e-06
##
## Sensitivity : 0.5982
## Specificity : 0.8211
## Pos Pred Value : 0.5451
## Neg Pred Value : 0.8508
## Prevalence : 0.2638
## Detection Rate : 0.1578
## Detection Prevalence : 0.2895
## Balanced Accuracy : 0.7097
##
## 'Positive' Class : 1
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4914 853
## 1 1094 1300
##
## Accuracy : 0.7614
## 95% CI : (0.752, 0.7706)
## No Information Rate : 0.7362
## P-Value [Acc > NIR] : 9.177e-08
##
## Kappa : 0.4071
## Mcnemar's Test P-Value : 5.355e-08
##
## Sensitivity : 0.6038
## Specificity : 0.8179
## Pos Pred Value : 0.5430
## Neg Pred Value : 0.8521
## Prevalence : 0.2638
## Detection Rate : 0.1593
## Detection Prevalence : 0.2933
## Balanced Accuracy : 0.7109
##
## 'Positive' Class : 1
##
This seems like the way to go. The false negative rates we had before were simply unacceptable. Even though our false positive rates are now also a lot higher, it seems worth it to avoid paying out claims for many individuals from whom we never even collected a risk premium.
Using AUC, sensitivity, and specificity at the p=0.35 cutoff, it looks like both model2 and model3 are better than model1.
Model2 and model3 have very similar stats, but model3 has the benefit of being more parsimonious. So let’s go with that.
In summary, stats for model 3 using p=0.35 cutoff include:
We already printed the confusion matrix.
Let’s now just calculate precision and F1 score.
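A sketch of this calculation, taking the counts directly from the model 3 confusion matrix at the p=0.35 cutoff printed above:
TP <- 1300  # predicted crash, actual crash
FP <- 1094  # predicted crash, no actual crash
FN <- 853   # predicted no crash, actual crash
precision <- TP/(TP + FP)
recall    <- TP/(TP + FN)
F1        <- 2*precision*recall/(precision + recall)
print(paste("Precision =", round(precision, 4)))
print(paste("F1 =", round(F1, 4)))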
## [1] "Precision = 0.543"
## [1] "F1 = 0.5718"
These stats may not seem very good. However, I think the nature of the problem here means that the limited predictive value we find from a simple binary logistic regression is somewhat expected. Lots of people who don’t have 9 points on their license, are over age 25, etc. still get into car crashes just by chance.
We already printed the residual standard error (essentially the square root of the mean squared error), R^2, and F-statistic when we ran the model summaries.
So main additional step now is to run residual plots.
The residuals look normal-ish, and not too biased toward any one direction as the fitted values increase.
The variability also does not seem to change too much across the range of fitted values.
The residuals are often quite large (+/- 1 is an order of magnitude, since this is log10 scale). But that is somewhat expected given the extremely low R^2 of even the better model.
Since model2 is still quite parsimonious and has slightly better performance, let’s go with that.
Since we’ve been applying all the same formatting changes to the evaluation data all along as we have to the training, we should be good to go to apply the models.
Let’s just check and make sure everything looks OK in the evaluation data.
apply(evaluation[, c("Driving_kids", "Commercial_vehicle", "Past_claim", "Revoked",
                     "MVR_PTS_MOD", "Urban_not_rural", "College_education", "Minivan",
                     "Young_age", "New_on_the_job", "Married", "Homeowner", "Single_parent")],
      2, function(x) table(x, useNA = "ifany"))
## $Driving_kids
## x
## 0 1
## 1889 252
##
## $Commercial_vehicle
## x
## 0 1
## 1381 760
##
## $Past_claim
## x
## 0 1
## 1283 858
##
## $Revoked
## x
## 0 1
## 1880 261
##
## $MVR_PTS_MOD
## x
## 0 1 2 3 4 5 6 10 11 12 13
## 946 310 251 208 128 129 75 46 30 8 10
##
## $Urban_not_rural
## x
## 0 1
## 403 1738
##
## $College_education
## x
## 0 1
## 934 1207
##
## $Minivan
## x
## 0 1
## 1592 549
##
## $Young_age
## x
## 0 1
## 2125 16
##
## $New_on_the_job
## x
## 0 1
## 1959 182
##
## $Married
## x
## 0 1
## 847 1294
##
## $Homeowner
## x
## 0 1
## 725 1416
##
## $Single_parent
## x
## 0 1
## 1875 266
summary(evaluation$TRAVTIME)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.00 22.00 33.00 33.15 43.00 105.00
summary(evaluation$log10.BLUEBOOK)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.176 3.948 4.151 4.111 4.323 4.698
Looks great!
First, we will apply model3 to predict TARGET_FLAG. Then if TARGET_FLAG is predicted to be true, apply model2 from the linear models to predict TARGET_AMT. Otherwise, set TARGET_AMT = 0. Finally, convert TARGET_AMT from log10-scale to the actual values.
Remember, we are using a p=0.35 cutoff, not p=0.5.
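A sketch of this final step (model3_glm and model2_lm are placeholder names for the fitted binary logistic and linear model objects):
eval_crash_prob        <- predict(model3_glm, newdata = evaluation, type = "response")
evaluation$TARGET_FLAG <- as.numeric(eval_crash_prob > 0.35)
eval_log10_amt        <- predict(model2_lm, newdata = evaluation)
evaluation$TARGET_AMT <- ifelse(evaluation$TARGET_FLAG == 1, 10^eval_log10_amt, 0)  # back-transform from log10 scale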
Predictions from the models as applied to the evaluation data will be available here:
Code from this analysis is available here:
https://raw.githubusercontent.com/heathergeiger/Data621_hw4/master/geiger_heather_data621_hw4_cleaned.R
https://raw.githubusercontent.com/heathergeiger/Data621_hw4/master/geiger_heather_data621_hw4_cleaned.Rmd