We will explore, analyze, and model a data set of approximately 8,000 records, each representing a customer of an auto insurance company. Each record has two response variables. The first, TARGET_FLAG, is binary: a "1" means the person was in a car crash, a "0" means they were not. The second, TARGET_AMT, is zero if the person did not crash their car and a value greater than zero if they did.
Our objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. We will only use the variables given to us (or variables that we derive from the variables provided).
## Rows: 8,161
## Columns: 26
## $ INDEX <dbl> 1, 2, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 19, 20, 2~
## $ TARGET_FLAG <dbl> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1~
## $ TARGET_AMT <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 2946.000, 0.000, 4021.0~
## $ KIDSDRIV <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ AGE <dbl> 60, 43, 35, 51, 50, 34, 54, 37, 34, 50, 53, 43, 55, 53, 45~
## $ HOMEKIDS <dbl> 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0, 3, 2, 1~
## $ YOJ <dbl> 11, 11, 10, 14, NA, 12, NA, NA, 10, 7, 14, 5, 11, 11, 0, 1~
## $ INCOME <chr> "$67,349", "$91,449", "$16,039", NA, "$114,986", "$125,301~
## $ PARENT1 <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "No", "No~
## $ HOME_VAL <chr> "$0", "$257,252", "$124,191", "$306,251", "$243,925", "$0"~
## $ MSTATUS <chr> "z_No", "z_No", "Yes", "Yes", "Yes", "z_No", "Yes", "Yes",~
## $ SEX <chr> "M", "M", "z_F", "M", "z_F", "z_F", "z_F", "M", "z_F", "M"~
## $ EDUCATION <chr> "PhD", "z_High School", "z_High School", "<High School", "~
## $ JOB <chr> "Professional", "z_Blue Collar", "Clerical", "z_Blue Colla~
## $ TRAVTIME <dbl> 14, 22, 5, 32, 36, 46, 33, 44, 34, 48, 15, 36, 25, 64, 48,~
## $ CAR_USE <chr> "Private", "Commercial", "Private", "Private", "Private", ~
## $ BLUEBOOK <chr> "$14,230", "$14,940", "$4,010", "$15,440", "$18,000", "$17~
## $ TIF <dbl> 11, 1, 4, 7, 1, 1, 1, 1, 1, 7, 1, 7, 7, 6, 1, 6, 6, 7, 4, ~
## $ CAR_TYPE <chr> "Minivan", "Minivan", "z_SUV", "Minivan", "z_SUV", "Sports~
## $ RED_CAR <chr> "yes", "yes", "no", "yes", "no", "no", "no", "yes", "no", ~
## $ OLDCLAIM <chr> "$4,461", "$0", "$38,690", "$0", "$19,217", "$0", "$0", "$~
## $ CLM_FREQ <dbl> 2, 0, 2, 0, 2, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2~
## $ REVOKED <chr> "No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "N~
## $ MVR_PTS <dbl> 3, 0, 3, 0, 3, 0, 0, 10, 0, 1, 0, 0, 3, 3, 3, 0, 0, 0, 0, ~
## $ CAR_AGE <dbl> 18, 1, 10, 6, 17, 7, 1, 7, 1, 17, 11, 1, 9, 10, 5, 13, 16,~
## $ URBANICITY <chr> "Highly Urban/ Urban", "Highly Urban/ Urban", "Highly Urba~
Right away we can see we'll have some work to do. Many features that should be factors have been imported as characters or doubles, and some of those factors appear to have ordinal levels. We'll note this for our data cleaning section. We also note that the training set has 8,161 rows and 26 columns.
Before transformations, let's see how many of our columns are numeric and how many are character:
Numeric:
## [1] 12
Character:
## [1] 14
Next, we'll look at a summary of each feature to get a bird's-eye view of the distributions:
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Min. : 1 Min. :0.0000 Min. : 0 Min. :0.0000
## 1st Qu.: 2559 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Median : 5133 Median :0.0000 Median : 0 Median :0.0000
## Mean : 5152 Mean :0.2638 Mean : 1504 Mean :0.1711
## 3rd Qu.: 7745 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :10302 Max. :1.0000 Max. :107586 Max. :4.0000
##
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.0 Length:8161
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 Class :character
## Median :45.00 Median :0.0000 Median :11.0 Mode :character
## Mean :44.79 Mean :0.7212 Mean :10.5
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0
## Max. :81.00 Max. :5.0000 Max. :23.0
## NA's :6 NA's :454
## PARENT1 HOME_VAL MSTATUS SEX
## Length:8161 Length:8161 Length:8161 Length:8161
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## EDUCATION JOB TRAVTIME CAR_USE
## Length:8161 Length:8161 Min. : 5.00 Length:8161
## Class :character Class :character 1st Qu.: 22.00 Class :character
## Mode :character Mode :character Median : 33.00 Mode :character
## Mean : 33.49
## 3rd Qu.: 44.00
## Max. :142.00
##
## BLUEBOOK TIF CAR_TYPE RED_CAR
## Length:8161 Min. : 1.000 Length:8161 Length:8161
## Class :character 1st Qu.: 1.000 Class :character Class :character
## Mode :character Median : 4.000 Mode :character Mode :character
## Mean : 5.351
## 3rd Qu.: 7.000
## Max. :25.000
##
## OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## Length:8161 Min. :0.0000 Length:8161 Min. : 0.000
## Class :character 1st Qu.:0.0000 Class :character 1st Qu.: 0.000
## Mode :character Median :0.0000 Mode :character Median : 1.000
## Mean :0.7986 Mean : 1.696
## 3rd Qu.:2.0000 3rd Qu.: 3.000
## Max. :5.0000 Max. :13.000
##
## CAR_AGE URBANICITY
## Min. :-3.000 Length:8161
## 1st Qu.: 1.000 Class :character
## Median : 8.000 Mode :character
## Mean : 8.328
## 3rd Qu.:12.000
## Max. :28.000
## NA's :510
Some notes of interest from the summary above: roughly 26% of the training records were in a crash (the mean of TARGET_FLAG is 0.2638); TARGET_AMT is heavily right-skewed, with a median of 0 and a maximum of 107,586; CAR_AGE has an impossible minimum of -3; the monetary fields (INCOME, HOME_VAL, BLUEBOOK, OLDCLAIM) are stored as character strings; and AGE, YOJ, and CAR_AGE already show missing values.
Before going farther, let’s see how extensive missing values are in our dataset.
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS
## 0 0 0 0 6 0
## YOJ INCOME PARENT1 HOME_VAL MSTATUS SEX
## 454 445 0 464 0 0
## EDUCATION JOB TRAVTIME CAR_USE BLUEBOOK TIF
## 0 526 0 0 0 0
## CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## 0 0 0 0 0 0
## CAR_AGE URBANICITY
## 510 0
There are missing values in AGE, YOJ, INCOME, HOME_VAL, JOB, and CAR_AGE; however, the missingness does not appear to be pervasive. The feature with the most NAs, JOB, is missing only about 6% of its values. Let's see if the missing data is random in nature or if there is an underlying pattern.
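A minimal sketch of the pattern check, mirroring the appendix code; VIM::aggr() plots the proportion of missing values per column and the combinations of columns that tend to be missing together:
# visualize missingness patterns across columns
VIM::aggr(train, col = c('green', 'red'), numbers = TRUE, sortVars = TRUE,
          cex.axis = .7,
          ylab = c("Proportion of Data", "Combinations and Percentiles"))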
We note that the missingness does not follow any obvious pattern across features; we will address these NAs in the data transformation section.
The dataset has 26 variables and 8,161 observations. To summarize the issues identified so far: several monetary fields are stored as character strings, several categorical fields carry a stray "z_" prefix in their levels, CAR_AGE contains an impossible negative value, and a handful of columns have missing values.
As part of our data preparation, we will drop the INDEX feature, as it carries no predictive information. We also stack the training and evaluation data (tagged with a dataset column) so that both receive identical preprocessing; this is why the combined data below shows 10,302 rows.
Data After Exclusions
## Rows: 10,302
## Columns: 25
## $ TARGET_FLAG <dbl> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1~
## $ TARGET_AMT <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 2946.000, 0.000, 4021.0~
## $ KIDSDRIV <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ AGE <dbl> 60, 43, 35, 51, 50, 34, 54, 37, 34, 50, 53, 43, 55, 53, 45~
## $ HOMEKIDS <dbl> 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0, 3, 2, 1~
## $ YOJ <dbl> 11, 11, 10, 14, NA, 12, NA, NA, 10, 7, 14, 5, 11, 11, 0, 1~
## $ INCOME <chr> "$67,349", "$91,449", "$16,039", NA, "$114,986", "$125,301~
## $ PARENT1 <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "No", "No~
## $ HOME_VAL <chr> "$0", "$257,252", "$124,191", "$306,251", "$243,925", "$0"~
## $ MSTATUS <chr> "z_No", "z_No", "Yes", "Yes", "Yes", "z_No", "Yes", "Yes",~
## $ SEX <chr> "M", "M", "z_F", "M", "z_F", "z_F", "z_F", "M", "z_F", "M"~
## $ EDUCATION <chr> "PhD", "z_High School", "z_High School", "<High School", "~
## $ JOB <chr> "Professional", "z_Blue Collar", "Clerical", "z_Blue Colla~
## $ TRAVTIME <dbl> 14, 22, 5, 32, 36, 46, 33, 44, 34, 48, 15, 36, 25, 64, 48,~
## $ CAR_USE <chr> "Private", "Commercial", "Private", "Private", "Private", ~
## $ BLUEBOOK <chr> "$14,230", "$14,940", "$4,010", "$15,440", "$18,000", "$17~
## $ TIF <dbl> 11, 1, 4, 7, 1, 1, 1, 1, 1, 7, 1, 7, 7, 6, 1, 6, 6, 7, 4, ~
## $ CAR_TYPE <chr> "Minivan", "Minivan", "z_SUV", "Minivan", "z_SUV", "Sports~
## $ RED_CAR <chr> "yes", "yes", "no", "yes", "no", "no", "no", "yes", "no", ~
## $ OLDCLAIM <chr> "$4,461", "$0", "$38,690", "$0", "$19,217", "$0", "$0", "$~
## $ CLM_FREQ <dbl> 2, 0, 2, 0, 2, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2~
## $ REVOKED <chr> "No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "N~
## $ MVR_PTS <dbl> 3, 0, 3, 0, 3, 0, 0, 10, 0, 1, 0, 0, 3, 3, 3, 0, 0, 0, 0, ~
## $ CAR_AGE <dbl> 18, 1, 10, 6, 17, 7, 1, 7, 1, 17, 11, 1, 9, 10, 5, 13, 16,~
## $ URBANICITY <chr> "Highly Urban/ Urban", "Highly Urban/ Urban", "Highly Urba~
Additionally, we'll convert the currency-formatted character features INCOME, HOME_VAL, BLUEBOOK, and OLDCLAIM from character to numeric:
Features after Transformation from character to numeric
## Rows: 10,302
## Columns: 4
## $ INCOME <dbl> 67349, 91449, 16039, NA, 114986, 125301, 18755, 107961, 62978~
## $ HOME_VAL <dbl> 0, 257252, 124191, 306251, 243925, 0, NA, 333680, 0, 0, 0, 20~
## $ BLUEBOOK <dbl> 14230, 14940, 4010, 15440, 18000, 17430, 8780, 16970, 11200, ~
## $ OLDCLAIM <dbl> 4461, 0, 38690, 0, 19217, 0, 0, 2374, 0, 0, 0, 0, 5028, 0, 0,~
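The conversion itself just strips the dollar signs and thousands separators before coercing to numeric; a minimal sketch mirroring the appendix code:
final_df <- dplyr::mutate(final_df,
                          INCOME   = as.numeric(gsub('[$,]', '', INCOME)),
                          HOME_VAL = as.numeric(gsub('[$,]', '', HOME_VAL)),
                          BLUEBOOK = as.numeric(gsub('[$,]', '', BLUEBOOK)),
                          OLDCLAIM = as.numeric(gsub('[$,]', '', OLDCLAIM)))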
We also identified numerous columns that need to be converted from character to factor; in the process we strip the stray "z_" prefix from their levels. Their final levels after the transformation are shown below:
Features after Transformation from Character to Factor
## $MSTATUS
## [1] "No" "Yes"
##
## $SEX
## [1] "F" "M"
##
## $JOB
## [1] "Blue Collar" "Clerical" "Doctor" "Home Maker" "Lawyer"
## [6] "Manager" "Professional" "Student"
##
## $CAR_TYPE
## [1] "Minivan" "Panel Truck" "Pickup" "Sports Car" "SUV"
## [6] "Van"
##
## $URBANICITY
## [1] "Highly Rural/ Rural" "Highly Urban/ Urban"
##
## $CAR_USE
## [1] "Commercial" "Private"
##
## $REVOKED
## [1] "No" "Yes"
##
## $PARENT1
## [1] "No" "Yes"
##
## $RED_CAR
## [1] "no" "yes"
##
## $TARGET_FLAG
## [1] "0" "1"
As noted in our EDA section, there are NAs in several of our features. Since our analysis above found no distinct pattern in the missing data, we'll impute the numeric NAs using linear interpolation (na.approx from the zoo library) and fill in the blanks in JOB with a new level, 'Unknown'.
## JOB
## [1,] "Blue Collar"
## [2,] "Clerical"
## [3,] "Doctor"
## [4,] "Home Maker"
## [5,] "Lawyer"
## [6,] "Manager"
## [7,] "Professional"
## [8,] "Student"
## [9,] "Unknown"
Results Post-Imputation
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ
## 2141 2141 0 0 0 0
## INCOME PARENT1 HOME_VAL MSTATUS SEX EDUCATION
## 0 0 0 0 0 0
## JOB TRAVTIME CAR_USE BLUEBOOK TIF CAR_TYPE
## 0 0 0 0 0 0
## RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 0 0 0 0 0 0
## URBANICITY dataset
## 0 0
## Rows: 10,302
## Columns: 25
## $ TARGET_FLAG <fct> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1~
## $ TARGET_AMT <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 2946.000, 0.000, 4021.0~
## $ KIDSDRIV <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ AGE <dbl> 60, 43, 35, 51, 50, 34, 54, 37, 34, 50, 53, 43, 55, 53, 45~
## $ HOMEKIDS <dbl> 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0, 3, 2, 1~
## $ YOJ <dbl> 11.00000, 11.00000, 10.00000, 14.00000, 13.00000, 12.00000~
## $ INCOME <dbl> 67349.0, 91449.0, 16039.0, 65512.5, 114986.0, 125301.0, 18~
## $ PARENT1 <fct> No, No, No, No, No, Yes, No, No, No, No, No, No, No, No, N~
## $ HOME_VAL <dbl> 0, 257252, 124191, 306251, 243925, 0, 166840, 333680, 0, 0~
## $ MSTATUS <fct> No, No, Yes, Yes, Yes, No, Yes, Yes, No, No, No, Yes, Yes,~
## $ SEX <fct> M, M, F, M, F, F, F, M, F, M, F, F, M, M, F, F, M, F, F, F~
## $ EDUCATION <fct> PhD, High School, High School, <High School, PhD, Bachelor~
## $ JOB <fct> Professional, Blue Collar, Clerical, Blue Collar, Doctor, ~
## $ TRAVTIME <dbl> 14, 22, 5, 32, 36, 46, 33, 44, 34, 48, 15, 36, 25, 64, 48,~
## $ CAR_USE <fct> Private, Commercial, Private, Private, Private, Commercial~
## $ BLUEBOOK <dbl> 14230, 14940, 4010, 15440, 18000, 17430, 8780, 16970, 1120~
## $ TIF <dbl> 11, 1, 4, 7, 1, 1, 1, 1, 1, 7, 1, 7, 7, 6, 1, 6, 6, 7, 4, ~
## $ CAR_TYPE <fct> Minivan, Minivan, SUV, Minivan, SUV, Sports Car, SUV, Van,~
## $ RED_CAR <fct> yes, yes, no, yes, no, no, no, yes, no, no, no, no, yes, y~
## $ OLDCLAIM <dbl> 4461, 0, 38690, 0, 19217, 0, 0, 2374, 0, 0, 0, 0, 5028, 0,~
## $ CLM_FREQ <dbl> 2, 0, 2, 0, 2, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 2~
## $ REVOKED <fct> No, No, No, No, Yes, No, No, Yes, No, No, No, No, Yes, No,~
## $ MVR_PTS <dbl> 3, 0, 3, 0, 3, 0, 0, 10, 0, 1, 0, 0, 3, 3, 3, 0, 0, 0, 0, ~
## $ CAR_AGE <dbl> 18, 1, 10, 6, 17, 7, 1, 7, 1, 17, 11, 1, 9, 10, 5, 13, 16,~
## $ URBANICITY <fct> Highly Urban/ Urban, Highly Urban/ Urban, Highly Urban/ Ur~
Let’s take a look at the makeup of our factor variables. We’ll look at the most common category within these variables.
In looking at the above output, notes of interest are:
In looking at the plots above, we note:
We can see above that many of these features contain 0s, indicating that the variable simply has no value for that record. We'll need to create some flag features to capture this signal in our feature engineering section.
Now let's look at the relationship between our variables and TARGET_FLAG.
Boxplots of each numeric feature against TARGET_FLAG
# target_name <- 'your_target_name'
target_name <- 'TARGET_FLAG'
boxplot_depend_vs_independ(train, target_name)
In looking at the above boxplots, we note the following:
Having looked at our features, let's now look at a correlation plot to see the strength of the relationships between our variables.
As noted previously, multicollinearity is not a huge issue with this dataset. We note the following collinear variables:
From the observations made during data exploration and preparation above, we create several flag variables to capture the trends observed in the data: a brand-new-car flag (CAR_AGE_BRAND_NEW_FLAG) and zero-value flags for CLM_FREQ, HOME_VAL, MVR_PTS, and YOJ. They appear at the bottom of the glimpse below:
## Rows: 10,302
## Columns: 31
## $ TARGET_FLAG <fct> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0,~
## $ TARGET_AMT <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 2946.000, 0.~
## $ KIDSDRIV <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ AGE <dbl> 60, 43, 35, 51, 50, 34, 54, 37, 34, 50, 53, 43,~
## $ HOMEKIDS <dbl> 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3,~
## $ YOJ <dbl> 11.00000, 11.00000, 10.00000, 14.00000, 13.0000~
## $ INCOME <dbl> 67349.0, 91449.0, 16039.0, 65512.5, 114986.0, 1~
## $ PARENT1 <fct> No, No, No, No, No, Yes, No, No, No, No, No, No~
## $ HOME_VAL <dbl> 0, 257252, 124191, 306251, 243925, 0, 166840, 3~
## $ MSTATUS <fct> No, No, Yes, Yes, Yes, No, Yes, Yes, No, No, No~
## $ SEX <fct> M, M, F, M, F, F, F, M, F, M, F, F, M, M, F, F,~
## $ EDUCATION <fct> PhD, High School, High School, <High School, Ph~
## $ JOB <fct> Professional, Blue Collar, Clerical, Blue Colla~
## $ TRAVTIME <dbl> 14, 22, 5, 32, 36, 46, 33, 44, 34, 48, 15, 36, ~
## $ CAR_USE <fct> Private, Commercial, Private, Private, Private,~
## $ BLUEBOOK <dbl> 14230, 14940, 4010, 15440, 18000, 17430, 8780, ~
## $ TIF <dbl> 11, 1, 4, 7, 1, 1, 1, 1, 1, 7, 1, 7, 7, 6, 1, 6~
## $ CAR_TYPE <fct> Minivan, Minivan, SUV, Minivan, SUV, Sports Car~
## $ RED_CAR <fct> yes, yes, no, yes, no, no, no, yes, no, no, no,~
## $ OLDCLAIM <dbl> 4461, 0, 38690, 0, 19217, 0, 0, 2374, 0, 0, 0, ~
## $ CLM_FREQ <dbl> 2, 0, 2, 0, 2, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0,~
## $ REVOKED <fct> No, No, No, No, Yes, No, No, Yes, No, No, No, N~
## $ MVR_PTS <dbl> 3, 0, 3, 0, 3, 0, 0, 10, 0, 1, 0, 0, 3, 3, 3, 0~
## $ CAR_AGE <dbl> 18, 1, 10, 6, 17, 7, 1, 7, 1, 17, 11, 1, 9, 10,~
## $ URBANICITY <fct> Highly Urban/ Urban, Highly Urban/ Urban, Highl~
## $ dataset <chr> "train", "train", "train", "train", "train", "t~
## $ CAR_AGE_BRAND_NEW_FLAG <dbl> 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0,~
## $ CLM_FREQ_ZERO <dbl> 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1,~
## $ HOME_VAL_ZERO <dbl> 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0,~
## $ MVR_PTS_ZERO <dbl> 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1,~
## $ YOJ_ZERO <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,~
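A sketch of the flag creation, mirroring the appendix code; the impossible negative CAR_AGE is floored at 1 first so the brand-new flag is well defined:
final_df$CAR_AGE <- ifelse(final_df$CAR_AGE < 1, 1, final_df$CAR_AGE)   # floor CAR_AGE at 1
final_df$CAR_AGE_BRAND_NEW_FLAG <- ifelse(final_df$CAR_AGE == 1, 1, 0)  # brand new car
final_df$CLM_FREQ_ZERO <- ifelse(final_df$CLM_FREQ == 0, 1, 0)          # no prior claims
final_df$HOME_VAL_ZERO <- ifelse(final_df$HOME_VAL == 0, 1, 0)          # home value of zero
final_df$MVR_PTS_ZERO  <- ifelse(final_df$MVR_PTS == 0, 1, 0)           # no motor vehicle record points
final_df$YOJ_ZERO      <- ifelse(final_df$YOJ == 0, 1, 0)               # zero years on the job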
From the insights gained during EDA, we move forward with two main modeling approaches. We begin by splitting the training data into train and validation sets. The first approach uses TARGET_FLAG as a binary response variable together with the original and engineered feature variables of our processed data. The second approach uses the TARGET_AMT response variable, a continuous numeric feature, for which we deploy multiple linear regression models. We developed several models for both the binary logistic regression and multivariate linear regression approaches and evaluate each model's performance to select the best model.
Here we describe binomial modeling that uses the feature set to predict the binary response variable TARGET_FLAG, where a '1' indicates a car that was in a crash and a '0' indicates one that was not.
To establish a baseline for performance on the binary response variable, we begin with a binary logistic regression model that includes all feature variables (original and engineered).
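Fitting the baseline is a single glm() call with the binomial family (logit link by default) on the 85% training split, as in the appendix code:
# Model 1: full binary logistic regression, all original and engineered features
binary.mdl <- glm(TARGET_FLAG ~ ., family = binomial, data = partial_train)
summary(binary.mdl)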
Model 1 Summary:
##
## Call:
## glm(formula = TARGET_FLAG ~ ., family = binomial, data = partial_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6112 -0.7066 -0.3873 0.6071 3.1642
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.6353852570 0.3692938248 -7.136 9.59e-13 ***
## KIDSDRIV 0.3571878954 0.0672347452 5.313 1.08e-07 ***
## AGE -0.0044821229 0.0044437592 -1.009 0.313151
## HOMEKIDS 0.0283779232 0.0413952129 0.686 0.493006
## YOJ 0.0199859786 0.0131421674 1.521 0.128322
## INCOME -0.0000038983 0.0000012297 -3.170 0.001524 **
## PARENT1Yes 0.4345261543 0.1195415167 3.635 0.000278 ***
## HOME_VAL -0.0000008135 0.0000006038 -1.347 0.177869
## MSTATUSYes -0.5570080425 0.0945118876 -5.894 3.78e-09 ***
## SEXM 0.0574919448 0.1233997371 0.466 0.641287
## EDUCATIONBachelors -0.3237731474 0.1259104667 -2.571 0.010127 *
## EDUCATIONHigh School 0.0477183952 0.1039762676 0.459 0.646281
## EDUCATIONMasters -0.2228806421 0.1949540289 -1.143 0.252936
## EDUCATIONPhD -0.1578312765 0.2344554502 -0.673 0.500831
## JOBClerical 0.1526848730 0.1156030866 1.321 0.186579
## JOBDoctor -0.8927033480 0.3147629808 -2.836 0.004567 **
## JOBHome Maker -0.2374443280 0.1748996401 -1.358 0.174590
## JOBLawyer -0.2610547204 0.2031345102 -1.285 0.198746
## JOBManager -0.8825723265 0.1505595628 -5.862 4.57e-09 ***
## JOBProfessional -0.1821565905 0.1305048683 -1.396 0.162780
## JOBStudent -0.3533123133 0.1548680858 -2.281 0.022526 *
## JOBUnknown -0.3575077197 0.2012075210 -1.777 0.075599 .
## TRAVTIME 0.0163233345 0.0020739450 7.871 3.53e-15 ***
## CAR_USEPrivate -0.7573111596 0.1001512340 -7.562 3.98e-14 ***
## BLUEBOOK -0.0000203253 0.0000057075 -3.561 0.000369 ***
## TIF -0.0602765321 0.0080225726 -7.513 5.76e-14 ***
## CAR_TYPEPanel Truck 0.6249326429 0.1754299365 3.562 0.000368 ***
## CAR_TYPEPickup 0.5892766952 0.1114515847 5.287 1.24e-07 ***
## CAR_TYPESports Car 1.1304010611 0.1423940203 7.939 2.05e-15 ***
## CAR_TYPESUV 0.8307123573 0.1223938619 6.787 1.14e-11 ***
## CAR_TYPEVan 0.5957753869 0.1389737506 4.287 1.81e-05 ***
## RED_CARyes 0.0739752125 0.0949719070 0.779 0.436029
## OLDCLAIM -0.0000204619 0.0000045945 -4.454 8.44e-06 ***
## CLM_FREQ 0.0263147937 0.0484439230 0.543 0.586991
## REVOKEDYes 0.9095662092 0.1017607186 8.938 < 2e-16 ***
## MVR_PTS 0.1046492368 0.0209733902 4.990 6.05e-07 ***
## CAR_AGE 0.0136207721 0.0112026232 1.216 0.224040
## URBANICITYHighly Urban/ Urban 2.4130546748 0.1229448424 19.627 < 2e-16 ***
## CAR_AGE_BRAND_NEW_FLAG 0.1738136434 0.1143758529 1.520 0.128594
## CLM_FREQ_ZERO -0.5642737003 0.1324463830 -4.260 2.04e-05 ***
## HOME_VAL_ZERO 0.1108424845 0.1510345261 0.734 0.463017
## MVR_PTS_ZERO 0.0549228747 0.0932620539 0.589 0.555922
## YOJ_ZERO 0.7488766866 0.2114227249 3.542 0.000397 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8007.9 on 6937 degrees of freedom
## Residual deviance: 6130.9 on 6895 degrees of freedom
## AIC: 6216.9
##
## Number of Fisher Scoring iterations: 5
The summary above shows an AIC of 6,216.9 for the full binomial model with the logit link function. To pursue a more parsimonious model, we next derive a second model using stepwise selection, which narrows the feature set by AIC and helps limit the negative effects of multicollinearity.
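The search starts from the full Model 1; with no scope supplied, step() proceeds backward, at each iteration dropping the term whose removal lowers the AIC the most (a sketch mirroring the appendix code):
# Model 2: backward stepwise selection by AIC
binary.mdl.w.step <- step(binary.mdl)
summary(binary.mdl.w.step)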
Using the full model as a starting point, the following summarizes the output of backward stepwise selection by AIC:
##
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + YOJ + INCOME + PARENT1 +
## HOME_VAL + MSTATUS + EDUCATION + JOB + TRAVTIME + CAR_USE +
## BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + REVOKED + MVR_PTS +
## URBANICITY + CLM_FREQ_ZERO + YOJ_ZERO, family = binomial,
## data = partial_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5898 -0.7121 -0.3868 0.6059 3.1691
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4244763223 0.2662714833 -9.105 < 2e-16 ***
## KIDSDRIV 0.3732736585 0.0602428662 6.196 5.79e-10 ***
## YOJ 0.0205002704 0.0125316736 1.636 0.101865
## INCOME -0.0000034578 0.0000011281 -3.065 0.002176 **
## PARENT1Yes 0.5077903370 0.1024822079 4.955 7.24e-07 ***
## HOME_VAL -0.0000012228 0.0000003604 -3.393 0.000692 ***
## MSTATUSYes -0.5519465230 0.0878475807 -6.283 3.32e-10 ***
## EDUCATIONBachelors -0.3193276884 0.1190011519 -2.683 0.007288 **
## EDUCATIONHigh School 0.0551862180 0.1032493480 0.534 0.592999
## EDUCATIONMasters -0.1678285153 0.1749151756 -0.959 0.337314
## EDUCATIONPhD -0.1183186760 0.2184556072 -0.542 0.588084
## JOBClerical 0.1574119985 0.1150454747 1.368 0.171231
## JOBDoctor -0.8894686944 0.3140531691 -2.832 0.004623 **
## JOBHome Maker -0.2796037981 0.1727264094 -1.619 0.105497
## JOBLawyer -0.2793048471 0.2027981223 -1.377 0.168433
## JOBManager -0.8978030846 0.1501909597 -5.978 2.26e-09 ***
## JOBProfessional -0.1910254808 0.1300858481 -1.468 0.141980
## JOBStudent -0.3113797704 0.1476631446 -2.109 0.034969 *
## JOBUnknown -0.3636478361 0.2009987786 -1.809 0.070419 .
## TRAVTIME 0.0162119532 0.0020684039 7.838 4.58e-15 ***
## CAR_USEPrivate -0.7544150738 0.0999990582 -7.544 4.55e-14 ***
## BLUEBOOK -0.0000233274 0.0000051233 -4.553 5.28e-06 ***
## TIF -0.0595111446 0.0079996107 -7.439 1.01e-13 ***
## CAR_TYPEPanel Truck 0.6900331622 0.1634189761 4.222 2.42e-05 ***
## CAR_TYPEPickup 0.5863078702 0.1111672085 5.274 1.33e-07 ***
## CAR_TYPESports Car 1.0522835531 0.1171887863 8.979 < 2e-16 ***
## CAR_TYPESUV 0.7547530556 0.0945097045 7.986 1.39e-15 ***
## CAR_TYPEVan 0.6323387541 0.1341998064 4.712 2.45e-06 ***
## OLDCLAIM -0.0000204398 0.0000045839 -4.459 8.23e-06 ***
## REVOKEDYes 0.9091169573 0.1015518485 8.952 < 2e-16 ***
## MVR_PTS 0.0970773926 0.0153772677 6.313 2.74e-10 ***
## URBANICITYHighly Urban/ Urban 2.4159661906 0.1229882675 19.644 < 2e-16 ***
## CLM_FREQ_ZERO -0.6188092494 0.0854268151 -7.244 4.36e-13 ***
## YOJ_ZERO 0.7665186259 0.2061649545 3.718 0.000201 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8007.9 on 6937 degrees of freedom
## Residual deviance: 6138.1 on 6904 degrees of freedom
## AIC: 6206.1
##
## Number of Fisher Scoring iterations: 5
The stepwise selection by AIC reduces the dimensionality of the model and results in a more parsimonious fit of the data. For example, Model 1 involves 28 feature variables and fits 43 coefficients, whereas the stepwise Model 2 uses 19 feature variables and fits 34 coefficients. Additionally, Model 2's AIC is lower, indicating lower estimated prediction error. This suggests that, in addition to being simpler, the stepwise model provides an overall better fit to the data.
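For reference, \(AIC = 2k - 2\ln(\hat{L})\), where \(k\) is the number of fitted coefficients and \(\hat{L}\) the maximized likelihood; for these binomial fits this equals the residual deviance plus \(2k\). Model 1: \(6130.9 + 2(43) = 6216.9\); Model 2: \(6138.1 + 2(34) = 6206.1\). Dropping nine coefficients more than offsets the small increase in deviance.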
In looking at the coefficients, most make intuitive sense. What is interesting is how big a difference URBANICITY makes to the log odds. It does make sense that individuals who live in highly urban areas would be involved in more accidents, since there are so many more people on the roads.
There are approaches to modeling a binary response variable other than binary logistic regression. Here, we describe the fit of a random forest model that uses the same full feature variable set as Model 1.
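A sketch of the random forest fit, mirroring the appendix code; randomForest() grows 500 classification trees by default and reports an out-of-bag (OOB) error estimate, which is what appears in the output below:
library(randomForest)
# Model 3: random forest classifier on the same full feature set
rf <- randomForest(TARGET_FLAG ~ ., data = partial_train)
rf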
##
## Call:
## randomForest(formula = TARGET_FLAG ~ ., data = partial_train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 21.46%
## Confusion matrix:
## 0 1 class.error
## 0 4772 335 0.06559624
## 1 1154 677 0.63025669
Here we describe several multivariate regression models that use the feature set to predict the continuous numeric response variable TARGET_AMT, the monetary cost incurred when a car is involved in a crash. Because TARGET_AMT is zero when no crash occurred, these models are fit on the subset of records where a crash did occur.
To begin the multivariate methods, we model the response variable using the full set of feature variables.
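Model 4 is fit through caret's train() wrapper around lm(), with the predictors centered and scaled and performance estimated by 10-fold cross-validation (a sketch mirroring the appendix code):
# Model 4: full multiple linear regression on the crash-only training split
mv.mdl <- caret::train(TARGET_AMT ~ ., data = partial_train_mv, method = "lm",
                       trControl = caret::trainControl(method = "cv", number = 10,
                                                       savePredictions = TRUE),
                       tuneLength = 5, preProcess = c("center", "scale"))
summary(mv.mdl$finalModel)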
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9551 -3185 -1528 478 98590
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5754.27 184.12 31.252 < 2e-16 ***
## KIDSDRIV -67.17 222.46 -0.302 0.76274
## AGE 386.33 230.67 1.675 0.09415 .
## HOMEKIDS 484.47 281.96 1.718 0.08592 .
## YOJ -213.64 351.35 -0.608 0.54323
## INCOME -158.85 336.09 -0.473 0.63653
## PARENT1Yes -38.49 271.55 -0.142 0.88731
## HOME_VAL -171.84 426.98 -0.402 0.68739
## MSTATUSYes -566.42 281.82 -2.010 0.04460 *
## SEXM 718.12 361.55 1.986 0.04716 *
## EDUCATIONBachelors 54.12 301.45 0.180 0.85754
## `EDUCATIONHigh School` -113.91 275.04 -0.414 0.67880
## EDUCATIONMasters 595.82 434.53 1.371 0.17049
## EDUCATIONPhD 669.43 343.02 1.952 0.05115 .
## JOBClerical -10.65 245.64 -0.043 0.96544
## JOBDoctor -284.33 243.11 -1.170 0.24232
## `JOBHome Maker` -127.94 285.18 -0.449 0.65375
## JOBLawyer 47.05 338.24 0.139 0.88939
## JOBManager -260.24 246.52 -1.056 0.29127
## JOBProfessional 350.05 242.81 1.442 0.14958
## JOBStudent 53.96 284.14 0.190 0.84941
## JOBUnknown 43.09 312.10 0.138 0.89021
## TRAVTIME 37.01 187.69 0.197 0.84369
## CAR_USEPrivate -313.76 290.04 -1.082 0.27950
## BLUEBOOK 895.95 282.84 3.168 0.00156 **
## TIF -61.87 186.30 -0.332 0.73987
## `CAR_TYPEPanel Truck` -336.28 295.68 -1.137 0.25557
## CAR_TYPEPickup -51.94 269.15 -0.193 0.84699
## `CAR_TYPESports Car` 375.79 294.36 1.277 0.20189
## CAR_TYPESUV 323.29 348.25 0.928 0.35337
## CAR_TYPEVan 50.84 250.89 0.203 0.83944
## RED_CARyes -223.90 250.36 -0.894 0.37129
## OLDCLAIM 218.85 276.64 0.791 0.42899
## CLM_FREQ -317.51 330.98 -0.959 0.33753
## REVOKEDYes -560.44 238.66 -2.348 0.01897 *
## MVR_PTS 157.17 266.09 0.591 0.55482
## CAR_AGE -1167.84 377.26 -3.096 0.00199 **
## `URBANICITYHighly Urban/ Urban` 154.71 189.07 0.818 0.41331
## CAR_AGE_BRAND_NEW_FLAG -718.95 306.33 -2.347 0.01904 *
## CLM_FREQ_ZERO -73.59 371.22 -0.198 0.84288
## HOME_VAL_ZERO -505.51 419.18 -1.206 0.22799
## MVR_PTS_ZERO -23.69 253.30 -0.094 0.92551
## YOJ_ZERO -362.54 385.99 -0.939 0.34773
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7883 on 1790 degrees of freedom
## Multiple R-squared: 0.03589, Adjusted R-squared: 0.01326
## F-statistic: 1.586 on 42 and 1790 DF, p-value: 0.01014
As we can see from the summary output, this approach yields only a weakly significant fit to the data (F-test p-value of roughly 0.01) and a very low \(R^2\) value. Model 4 gives a very poor fit to the data, but perhaps this can be improved upon.
Using Model 4 as a starting point, here we describe the results of backward stepwise selection by AIC on the multiple linear regression model.
##
## Call:
## lm(formula = TARGET_AMT ~ AGE + HOMEKIDS + MSTATUS + SEX + BLUEBOOK +
## REVOKED + CAR_AGE + CAR_AGE_BRAND_NEW_FLAG + HOME_VAL_ZERO,
## data = partial_train_mv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9219 -3180 -1568 436 100185
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4106.0624 1185.9980 3.462 0.000548 ***
## AGE 44.3494 21.7557 2.039 0.041643 *
## HOMEKIDS 282.7888 172.2978 1.641 0.100912
## MSTATUSYes -1096.9312 450.5195 -2.435 0.014995 *
## SEXM 678.7742 374.2296 1.814 0.069874 .
## BLUEBOOK 0.1000 0.0229 4.367 0.0000133 ***
## REVOKEDYes -993.1943 459.4666 -2.162 0.030777 *
## CAR_AGE -120.0143 53.8415 -2.229 0.025933 *
## CAR_AGE_BRAND_NEW_FLAG -1124.6087 626.0132 -1.796 0.072587 .
## HOME_VAL_ZERO -756.9170 459.9343 -1.646 0.099997 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7863 on 1823 degrees of freedom
## Multiple R-squared: 0.0231, Adjusted R-squared: 0.01828
## F-statistic: 4.79 on 9 and 1823 DF, p-value: 0.000002459
The stepwise model selection resulted in a statistically significant p-value. However, the \(R^2\) value indicates that Model 5 does not describe much of the variability in the data. So far, the quality of the multivariate linear regression approaches has left much to be desired. Next, we apply transformations to selected feature variables in an effort to improve the fit.
Here we describe a model built to test Box-Cox transformations on the non-normally distributed variables. The transformed full model (first summary below) is then reduced by backward stepwise selection by AIC, retaining only the more useful feature variables (second summary below).
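The candidate powers below come from car::powerTransform(), which estimates a Box-Cox \(\lambda\) for each strictly positive numeric feature; \(\lambda \approx 0\) suggests a log transform and \(\lambda \approx 0.5\) a square root. A minimal sketch on an illustrative pair of columns (the appendix applies it to the full numeric set after dropping zero values):
library(car)
# Box-Cox lambda estimates for two strictly positive features
summary(powerTransform(with(train_mv_mdl, cbind(TARGET_AMT, BLUEBOOK))))
The estimated \(\lambda\) near 0 for TARGET_AMT and near 0.5 for BLUEBOOK motivate the log(TARGET_AMT) response and the BLUEBOOK^0.5 term used in the models that follow.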
## Estimated transformation parameters
## AGE BLUEBOOK CAR_AGE CLM_FREQ HOME_VAL HOMEKIDS
## -3.26296061 0.45557345 0.53194959 0.19001232 0.95680511 0.74940129
## INCOME KIDSDRIV MVR_PTS OLDCLAIM TARGET_AMT TIF
## 0.88488262 -1.57276438 0.20739359 -0.17762378 0.01363055 -0.02223101
## TRAVTIME YOJ
## 0.76666788 1.72079913
##
## Call:
## lm(formula = log(TARGET_AMT) ~ . + I(BLUEBOOK^0.5) + I(MVR_PTS^0.33) +
## I(CAR_AGE^0.5) + I(CLM_FREQ^0.33), data = partial_train_mv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0825 -0.4017 0.0406 0.4006 3.2382
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.3503267097 0.9481804576 7.752 1.51e-14 ***
## KIDSDRIV -0.0229075710 0.0363465752 -0.630 0.528609
## AGE 0.0034095676 0.0024432740 1.395 0.163041
## HOMEKIDS 0.0388800594 0.0241293874 1.611 0.107288
## YOJ -0.0143172874 0.0079776891 -1.795 0.072876 .
## INCOME -0.0000004264 0.0000008095 -0.527 0.598400
## PARENT1Yes -0.0080604512 0.0671577670 -0.120 0.904479
## HOME_VAL -0.0000002183 0.0000003696 -0.591 0.554883
## MSTATUSYes -0.1173962098 0.0572854150 -2.049 0.040577 *
## SEXM 0.0563239618 0.0744274068 0.757 0.449291
## EDUCATIONBachelors -0.0778961132 0.0725655348 -1.073 0.283210
## EDUCATIONHigh School 0.0013253906 0.0584803896 0.023 0.981921
## EDUCATIONMasters 0.2108908131 0.1237628596 1.704 0.088557 .
## EDUCATIONPhD 0.3209255664 0.1488179965 2.156 0.031178 *
## JOBClerical 0.0036365167 0.0666575976 0.055 0.956499
## JOBDoctor -0.0479935527 0.2091021794 -0.230 0.818489
## JOBHome Maker -0.0741275869 0.1035850942 -0.716 0.474319
## JOBLawyer -0.0670703273 0.1338682125 -0.501 0.616420
## JOBManager -0.0408282169 0.1041269248 -0.392 0.695031
## JOBProfessional 0.0694767589 0.0771279076 0.901 0.367817
## JOBStudent 0.0680772116 0.0876298039 0.777 0.437337
## JOBUnknown -0.0121676691 0.1307995491 -0.093 0.925894
## TRAVTIME -0.0008597352 0.0012380492 -0.694 0.487505
## CAR_USEPrivate -0.0076256346 0.0590807243 -0.129 0.897316
## BLUEBOOK -0.0000429770 0.0000142395 -3.018 0.002579 **
## TIF -0.0023724797 0.0048457051 -0.490 0.624474
## CAR_TYPEPanel Truck 0.1494173654 0.1139458070 1.311 0.189924
## CAR_TYPEPickup 0.0349327315 0.0675504290 0.517 0.605126
## CAR_TYPESports Car 0.0361919156 0.0855726402 0.423 0.672391
## CAR_TYPESUV 0.0208939265 0.0763594335 0.274 0.784404
## CAR_TYPEVan 0.0074392884 0.0874055025 0.085 0.932182
## RED_CARyes -0.0045825782 0.0562131658 -0.082 0.935036
## OLDCLAIM 0.0000023880 0.0000028221 0.846 0.397581
## CLM_FREQ -0.1045802790 0.1830292008 -0.571 0.567810
## REVOKEDYes -0.0892712012 0.0605611729 -1.474 0.140640
## MVR_PTS 0.0023772773 0.0443129481 0.054 0.957222
## CAR_AGE -0.0544132053 0.0453975557 -1.199 0.230845
## URBANICITYHighly Urban/ Urban 0.0772653177 0.0849519007 0.910 0.363199
## CAR_AGE_BRAND_NEW_FLAG -0.0359928141 0.2074810297 -0.173 0.862298
## CLM_FREQ_ZERO 0.0957630861 0.7331820340 0.131 0.896096
## HOME_VAL_ZERO -0.1273463015 0.0874803662 -1.456 0.145648
## MVR_PTS_ZERO -0.0457748403 0.3164487673 -0.145 0.885002
## YOJ_ZERO -0.2234081327 0.1221966902 -1.828 0.067676 .
## I(BLUEBOOK^0.5) 0.0119039936 0.0032168563 3.701 0.000222 ***
## I(MVR_PTS^0.33) 0.0047714435 0.3206990707 0.015 0.988131
## I(CAR_AGE^0.5) 0.2262671470 0.2857589599 0.792 0.428576
## I(CLM_FREQ^0.33) 0.2219460607 0.8880947239 0.250 0.802684
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8008 on 1786 degrees of freedom
## Multiple R-squared: 0.04176, Adjusted R-squared: 0.01708
## F-statistic: 1.692 on 46 and 1786 DF, p-value: 0.002745
##
## Call:
## lm(formula = log(TARGET_AMT) ~ YOJ + SEX + EDUCATION + BLUEBOOK +
## CLM_FREQ + CAR_AGE + MVR_PTS_ZERO + YOJ_ZERO + I(BLUEBOOK^0.5) +
## I(CAR_AGE^0.5) + I(CLM_FREQ^0.33), data = partial_train_mv)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0976 -0.4089 0.0282 0.3993 3.2463
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.53136444 0.21459431 35.096 < 2e-16 ***
## YOJ -0.01336321 0.00722113 -1.851 0.064394 .
## SEXM 0.07397870 0.03802748 1.945 0.051881 .
## EDUCATIONBachelors -0.08224430 0.06367141 -1.292 0.196626
## EDUCATIONHigh School -0.00004578 0.05396661 -0.001 0.999323
## EDUCATIONMasters 0.14096860 0.08416967 1.675 0.094143 .
## EDUCATIONPhD 0.23262536 0.10614081 2.192 0.028530 *
## BLUEBOOK -0.00003438 0.00001219 -2.820 0.004853 **
## CLM_FREQ -0.08574624 0.04135352 -2.073 0.038267 *
## CAR_AGE -0.05663411 0.01992781 -2.842 0.004534 **
## MVR_PTS_ZERO -0.06782767 0.04106551 -1.652 0.098770 .
## YOJ_ZERO -0.20477123 0.10152579 -2.017 0.043849 *
## I(BLUEBOOK^0.5) 0.01027662 0.00292447 3.514 0.000452 ***
## I(CAR_AGE^0.5) 0.25065424 0.09251588 2.709 0.006806 **
## I(CLM_FREQ^0.33) 0.13661040 0.08226154 1.661 0.096949 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7977 on 1818 degrees of freedom
## Multiple R-squared: 0.03193, Adjusted R-squared: 0.02447
## F-statistic: 4.283 on 14 and 1818 DF, p-value: 0.0000001629
As with Model 5, Model 6 results in a statistically significant p-value, but the adjusted \(R^2\) suggests that Model 6 has little predictive power over our target variable.
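As a final attempt, Model 7 fits a weighted least squares (WLS) regression, down-weighting high-variance observations by using the inverse of each observation's squared OLS residual as its weight; a sketch mirroring the appendix code:
# Model 7: weighted least squares, weights = 1 / squared OLS residuals
lm4_base <- lm(TARGET_AMT ~ ., data = partial_train_mv)
resid_sq <- lm4_base$residuals^2
lm4_wls  <- lm(TARGET_AMT ~ ., data = partial_train_mv, weights = 1 / resid_sq)
summary(lm4_wls)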
##
## Call:
## lm(formula = TARGET_AMT ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + INCOME +
## PARENT1 + HOME_VAL + MSTATUS + SEX + EDUCATION + JOB + TRAVTIME +
## CAR_USE + BLUEBOOK + TIF + CAR_TYPE + RED_CAR + OLDCLAIM +
## CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY + CAR_AGE_BRAND_NEW_FLAG +
## CLM_FREQ_ZERO + HOME_VAL_ZERO + MVR_PTS_ZERO + YOJ_ZERO,
## data = partial_train_mv, weights = 1/resid_sq)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -2.0496 -0.9926 -0.9600 0.9545 2.3695
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4873.0842551 84.3661857 57.761 < 2e-16 ***
## KIDSDRIV -111.9602346 15.7744441 -7.098 1.82e-12 ***
## AGE 40.7258174 1.1986811 33.976 < 2e-16 ***
## HOMEKIDS 392.4311076 10.6257129 36.932 < 2e-16 ***
## YOJ -39.4394157 3.4476425 -11.440 < 2e-16 ***
## INCOME -0.0041595 0.0005469 -7.605 4.56e-14 ***
## PARENT1Yes -125.3659963 35.3598233 -3.545 0.000402 ***
## HOME_VAL -0.0013641 0.0002593 -5.260 1.61e-07 ***
## MSTATUSYes -1166.9679554 30.4684593 -38.301 < 2e-16 ***
## SEXM 1430.6332253 39.5380021 36.184 < 2e-16 ***
## EDUCATIONBachelors 142.8985715 32.9140776 4.342 1.49e-05 ***
## EDUCATIONHigh School -173.4341133 31.8476691 -5.446 5.87e-08 ***
## EDUCATIONMasters 1710.9516909 58.3192487 29.338 < 2e-16 ***
## EDUCATIONPhD 2830.5017131 65.9090520 42.946 < 2e-16 ***
## JOBClerical -38.0882886 29.9556694 -1.271 0.203720
## JOBDoctor -2277.2474977 116.2062092 -19.597 < 2e-16 ***
## JOBHome Maker -576.1840898 55.0925830 -10.458 < 2e-16 ***
## JOBLawyer 74.9249751 85.7254013 0.874 0.382229
## JOBManager -1117.8182762 60.6916068 -18.418 < 2e-16 ***
## JOBProfessional 1083.1409584 39.8708612 27.166 < 2e-16 ***
## JOBStudent 164.5639266 29.9942834 5.487 4.68e-08 ***
## JOBUnknown 108.7377112 81.6214100 1.332 0.182957
## TRAVTIME 1.0390187 0.5918803 1.755 0.079353 .
## CAR_USEPrivate -592.9245135 27.3582063 -21.673 < 2e-16 ***
## BLUEBOOK 0.1032031 0.0022401 46.071 < 2e-16 ***
## TIF -11.1704211 2.1466194 -5.204 2.18e-07 ***
## CAR_TYPEPanel Truck -1143.7951785 53.1445963 -21.522 < 2e-16 ***
## CAR_TYPEPickup -46.3061317 43.8658072 -1.056 0.291279
## CAR_TYPESports Car 1114.2649673 57.9203209 19.238 < 2e-16 ***
## CAR_TYPESUV 781.1395208 52.6566308 14.835 < 2e-16 ***
## CAR_TYPEVan 254.2781843 54.3452338 4.679 3.10e-06 ***
## RED_CARyes -477.8713007 28.9438369 -16.510 < 2e-16 ***
## OLDCLAIM 0.0210646 0.0018387 11.456 < 2e-16 ***
## CLM_FREQ -254.6347971 12.8456531 -19.823 < 2e-16 ***
## REVOKEDYes -1405.7877024 32.6326465 -43.079 < 2e-16 ***
## MVR_PTS 54.0500296 5.0069287 10.795 < 2e-16 ***
## CAR_AGE -210.3342257 3.3401131 -62.972 < 2e-16 ***
## URBANICITYHighly Urban/ Urban 736.9820612 30.2400774 24.371 < 2e-16 ***
## CAR_AGE_BRAND_NEW_FLAG -1549.7110334 29.0037684 -53.431 < 2e-16 ***
## CLM_FREQ_ZERO -179.8828706 40.1730267 -4.478 8.02e-06 ***
## HOME_VAL_ZERO -1003.8803414 54.6375143 -18.373 < 2e-16 ***
## MVR_PTS_ZERO -68.5349793 27.7667595 -2.468 0.013671 *
## YOJ_ZERO -965.4437850 50.6080486 -19.077 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9939 on 1790 degrees of freedom
## Multiple R-squared: 0.9935, Adjusted R-squared: 0.9933
## F-statistic: 6493 on 42 and 1790 DF, p-value: < 2.2e-16
Our adjusted \(R^2\) value here is 99%, which at face value appears to be a miraculous improvement over our earlier modeling attempts. However, the weighted least squares \(R^2\) cannot be interpreted in the same way as the unweighted \(R^2\) from ordinary least squares regression; it measures fit in the weighted space, and our weights were derived from the OLS residuals themselves. Regardless, based on the F-statistic and the p-value, the model appears highly significant as well.
In looking at these coefficients we note some things that we wouldn’t have expected:
To evaluate the binary classifiers, we compare the full binary logistic regression model (Model 1), the stepwise binary logistic regression model (Model 2), and the random forest model (Model 3), which serves as a benchmark for accuracy.
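Each logistic model is scored on the 15% hold-out validation split with a 0.5 probability cutoff; a sketch for Model 1, mirroring the appendix code:
# predicted crash probabilities, then a hard 0/1 classification at 0.5
y_hat_glm <- predict(binary.mdl, validation, type = "response")
y_hat_glm_binar <- (y_hat_glm > 0.5) * 1
mean(validation$TARGET_FLAG == y_hat_glm_binar)                    # accuracy
table(true = validation$TARGET_FLAG, predicted = y_hat_glm_binar)  # confusion matrix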
## [1] 0.782502
## predicted
## true 0 1
## 0 823 78
## 1 188 134
We see the accuracy of this model is about 78%. With '0' (no crash) as the positive class, precision is roughly 81% (823/1,011) and sensitivity is roughly 91% (823/901).
## [1] 0.7800491
## predicted
## true 0 1
## 0 823 78
## 1 191 131
Our stepwise model produces very similar results to the previous model; the accuracy and confusion matrix differ only slightly.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 837 195
## 1 64 127
##
## Accuracy : 0.7882
## 95% CI : (0.7642, 0.8108)
## No Information Rate : 0.7367
## P-Value [Acc > NIR] : 1.687e-05
##
## Kappa : 0.372
##
## Mcnemar's Test P-Value : 6.594e-16
##
## Sensitivity : 0.9290
## Specificity : 0.3944
## Pos Pred Value : 0.8110
## Neg Pred Value : 0.6649
## Prevalence : 0.7367
## Detection Rate : 0.6844
## Detection Prevalence : 0.8438
## Balanced Accuracy : 0.6617
##
## 'Positive' Class : 0
##
Here we see that our random forest model is in line with our other models in terms of accuracy; however, there is a trade-off between precision and sensitivity. Accuracy is 78.8%, while sensitivity is now 92.9%.
For this problem we'd probably be most interested in sensitivity, since we're concerned with identifying positive outcomes and the cost of a false positive is low. Were we looking for the most accurate model, we'd move forward with the random forest; however, for this assignment we'll continue with our best binary logistic regression model.
Out of all the models, the random forest and the stepwise binomial model were practically tied, with the random forest being slightly more accurate. As mentioned previously, we'll move forward with the stepwise logistic regression model.
The multivariate linear regression approaches resulted in statistically significant fits to the data; however, the model predictions have room for improvement. The figure below shows each multivariate model's predictions plotted against the response variable. Ideally, this plot would show approximately linear agreement, but we see a lot of deviation, which suggests that the multivariate models we deployed do not make accurate predictions of the numeric target variable TARGET_AMT.
## RMSE
## Model 4: 6621.119
## Model 5: 6498.003
## Model 6: 8466.536
## Model 7: 6623.416
Comparing the RMSE of each of the models on the holdout validation dataset, it looks like Model 5 has the smallest errors.
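For reference, \(RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}\), which is on the same dollar scale as TARGET_AMT; an RMSE of roughly 6,500 means the typical prediction misses the actual claim amount by several thousand dollars.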
In our analysis, we find that the multivariate linear regression modeling approaches have a lot of room for improvement in predicting the numeric target TARGET_AMT. However, the binary logistic regression models were able to give reasonably accurate predictions of the binary response variable TARGET_FLAG. Moving forward, we will use the stepwise binary logistic model (Model 2) to predict the probability that a person will crash their car (TARGET_FLAG) and the stepwise multivariate linear regression (Model 5) to estimate the amount of money it will cost if a person crashes their car (TARGET_AMT).
Logistic regression predictions: predictions on TARGET_FLAG, the probability that a person will crash their car.
test <- final_df %>% filter(dataset == 'test') %>% dplyr::select(-dataset)
logistic_binary_final <- predict(binary.mdl.w.step, test, type = "response")
head(logistic_binary_final)
## 1 2 3 4 5 6
## 0.1323629 0.2677470 0.1328139 0.3139109 0.1323659 0.2772592
Multivariate Regression Predictions: predictions on TARGET_AMT, the amount of money it will cost if a person crashes their car.
## 1 2 3 4 5 6
## 7153.636 6733.137 5255.977 5595.512 6941.973 6853.909
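The TARGET_AMT predictions above are produced the same way with the stepwise linear model; a sketch, assuming the Model 5 fit is stored as lm_step as in the appendix (the name mlr_amt_final is illustrative):
# predicted crash cost on the evaluation set using Model 5 (stepwise linear regression)
mlr_amt_final <- predict(lm_step, newdata = test)
head(mlr_amt_final)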
R Statistical Code
# Libraries and Options
knitr::opts_chunk$set(echo = F, warning = F, message = F, eval = T,
fig.height = 5, fig.width = 10)
library(knitr)
library(skimr)
library(visdat)
library(inspectdf)
library(corrplot)
library(scales)
library(tidyverse)
library(tidyr)
library(bestglm)
library(pROC)
library(car)
library(ggcorrplot)
library(mice)
library(caret)
library(plyr)
library(dplyr)
library(MASS)
library(zoo)
options(scipen = 9)
set.seed(123)
boxplot_depend_vs_independ <- function(df_train, target_name) {
train_int_names <- df_train %>% select_if(is.numeric)
int_names <- names(train_int_names)
myGlist <- vector('list', length(int_names))
names(myGlist) <- int_names
for (i in int_names) {
myGlist[[i]] <-
ggplot(df_train, aes_string(x = target_name, y = i)) +
geom_boxplot(color = 'steelblue', outlier.color = 'firebrick',
outlier.alpha = 0.35) +
labs(title = paste0(i,' vs target'), y = i, x= 'target') +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.45),
panel.grid.major.y = element_line(color = "grey",
linetype = "dashed"),
panel.grid.major.x = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank(),
axis.ticks.x = element_line(color = "grey")
)
}
myGlist <- within(myGlist, rm(target_name))
gridExtra::grid.arrange(grobs = myGlist, ncol = 3)
}
plot_corr_matrix <- function(dataframe, significance_threshold){
title <- paste0('Correlation Matrix for significance > ',
significance_threshold)
df_cor <- dataframe %>% mutate_if(is.character, as.factor)
df_cor <- df_cor %>% mutate_if(is.factor, as.numeric)
#run a correlation and drop the insignificant ones
corr <- cor(df_cor)
#prepare to drop duplicates and correlations of 1
corr[lower.tri(corr,diag=TRUE)] <- NA
#drop perfect correlations
corr[corr == 1] <- NA
#turn into a 3-column table
corr <- as.data.frame(as.table(corr))
#remove the NA values from above
corr <- na.omit(corr)
#select significant values
corr <- subset(corr, abs(Freq) > significance_threshold)
#sort by highest correlation
corr <- corr[order(-abs(corr$Freq)),]
#print table
# print(corr)
#turn corr back into matrix in order to plot with corrplot
mtx_corr <- reshape2::acast(corr, Var1~Var2, value.var="Freq")
#plot correlations visually
corrplot(mtx_corr,
title=title,
mar=c(0,0,1,0),
method='color',
tl.col="black",
na.label= " ",
addCoef.col = 'black',
number.cex = .9)
}
test_URL <- paste0('https://raw.githubusercontent.com/AngelClaudio/',
'data-sources/master/csv/insurance-evaluation-data.csv')
train_URL <- paste0('https://raw.githubusercontent.com/AngelClaudio/',
'data-sources/master/csv/insurance_training_data.csv')
train <- readr::read_csv(train_URL)
test <- readr::read_csv(test_URL)
glimpse(train) #basic visualization
dim(train %>% select_if(is.numeric))[2] #number of numeric features
dim(train %>% select_if(is.character))[2] #number categorical features
summary(train) #basic summary
colSums(is.na(train)) #missingness by column
#visualization to find patterns in missingness
VIM::aggr(train, col=c('green','red'), numbers=T, sortVars=T,
cex.axis = .7,
ylab=c("Proportion of Data", "Combinations and Percentiles"))
train$dataset <- 'train'
test$dataset <- 'test'
final_df <- rbind(train, test) #combine test/train to preprocess the same
final_df <- dplyr::select(final_df, -c('INDEX')) #drop INDEX feature
#change the following from character to numeric
final_df <- dplyr::mutate(final_df, INCOME = as.numeric(gsub('[$,]', '', INCOME)),
HOME_VAL = as.numeric(gsub('[$,]', '', HOME_VAL)),
BLUEBOOK = as.numeric(gsub('[$,]', '', BLUEBOOK)),
OLDCLAIM = as.numeric(gsub('[$,]', '', OLDCLAIM)))
#transform from character to factor:
final_df <- dplyr::mutate(final_df,
MSTATUS = as.factor(str_remove(MSTATUS, "^z_")),
SEX = as.factor(str_remove(SEX, "^z_")),
# <HIgh School not a typo, means less than HS
EDUCATION = as.factor(str_remove(EDUCATION, "^z_")),
JOB = as.factor(str_remove(JOB, "^z_")),
CAR_TYPE = as.factor(str_remove(CAR_TYPE, "^z_")),
URBANICITY = as.factor(str_remove(URBANICITY, "^z_")),
CAR_USE = as.factor(CAR_USE),
REVOKED = as.factor(REVOKED),
PARENT1 = as.factor(PARENT1),
RED_CAR = as.factor(RED_CAR),
TARGET_FLAG = as.factor(TARGET_FLAG))
#na.approx from the zoo library to perform linear interpolation on NA values
final_df <- final_df %>% mutate_at(vars(c("CAR_AGE", "YOJ", "AGE", "INCOME",
"HOME_VAL")),
~ifelse(is.na(.), na.approx(.), .))
# impute NAs in job
final_df$JOB <- as.character(final_df$JOB)
final_df$JOB[is.na(final_df$JOB)] <- "Unknown"
final_df$JOB <- as.factor(final_df$JOB)
#visualize missingness post imputation
sapply(final_df, function(x) sum(is.na(x)))
VIM::aggr(final_df, col=c('green','red'), numbers=T, sortVars=T,
cex.axis = .7,
ylab=c("Proportion of Data", "Combinations and Percentiles"))
# unbind data
train <- dplyr::select(dplyr::filter(final_df, dataset == 'train'),
-c('dataset'))
test <- dplyr::select(dplyr::filter(final_df, dataset == 'test'),
-c('dataset'))
# factor analysis. visualize most common factors
inspectdf::inspect_imb(train) %>% show_plot()
# visualize a summary of common levels
inspectdf::inspect_num(train) %>% show_plot()
# boxplots of features x predictor
target_name <- 'TARGET_FLAG'
boxplot_depend_vs_independ(train, target_name)
#plot the correlation matrix
plot_corr_matrix(train, .2)
# feature engineering
# floor CAR_AGE at 1 (fixes the impossible negative value), then flag brand new cars
final_df$CAR_AGE <- ifelse(final_df$CAR_AGE < 1, 1, final_df$CAR_AGE)
final_df$CAR_AGE_BRAND_NEW_FLAG <- ifelse(final_df$CAR_AGE == 1, 1, 0)
# zero claims
final_df$CLM_FREQ_ZERO <- ifelse(final_df$CLM_FREQ == 0, 1, 0)
# Home Value
final_df$HOME_VAL_ZERO <- ifelse(final_df$HOME_VAL == 0, 1, 0)
# Motor Vehicle Record Points
final_df$MVR_PTS_ZERO <- ifelse(final_df$MVR_PTS == 0, 1, 0)
# no years on job
final_df$YOJ_ZERO <- ifelse(final_df$YOJ == 0, 1, 0)
# Prepare data
train_data <- final_df %>% filter(dataset == 'train') %>% dplyr::select(-dataset)
train_data$TARGET_AMT <- NULL
split <- caret::createDataPartition(train_data$TARGET_FLAG, p=0.85, list=FALSE)
partial_train <- train_data[split, ]
validation <- train_data[ -split, ]
# Model 1
binary.mdl <- glm(TARGET_FLAG~., family=binomial, data=partial_train)
summary(binary.mdl)
# Model 2
binary.mdl.w.step <- step(binary.mdl)
summary(binary.mdl.w.step)
# Model 3
rf <- randomForest::randomForest(TARGET_FLAG~., data=partial_train)
# Prepare data for the TARGET_AMT models: keep only training rows where a crash occurred
train_mv_mdl <- final_df %>% filter(dataset == 'train') %>% dplyr::select(-dataset)
train_mv_mdl <- train_mv_mdl[train_mv_mdl$TARGET_FLAG == 1, ]
train_mv_mdl$TARGET_FLAG <- NULL
split_mv <- caret::createDataPartition(train_mv_mdl$TARGET_AMT, p=0.85, list = F)
partial_train_mv <- train_mv_mdl[split_mv, ]
validation_mv <- train_mv_mdl[-split_mv, ]
# Model 4
mv.mdl <- train(TARGET_AMT ~., data = partial_train_mv, method = "lm",
trControl = trainControl(method = "cv", number = 10,
savePredictions = TRUE),
tuneLength = 5, preProcess = c("center", "scale"))
summary(mv.mdl$finalModel)
lm1 <- mv.mdl
# Model 5
lm2_base <- lm(TARGET_AMT ~ ., data = partial_train_mv)
lm_step <- stepAIC(lm2_base, trace = F)
summary(lm_step)
lm2 <- lm_step
# Model 6
train_mv_mdl %>%
select_if(is.numeric) %>%
dplyr::mutate(row = row_number()) %>%
tidyr::gather(field, val, -row) %>%
dplyr::filter(val > 0) %>%
tidyr::spread(field, val) %>%
dplyr::select(-row, -CAR_AGE_BRAND_NEW_FLAG, -CLM_FREQ_ZERO, -HOME_VAL_ZERO, -MVR_PTS_ZERO, -YOJ_ZERO) %>%
powerTransform()
lm3_base <- lm(log(TARGET_AMT) ~ . + I(BLUEBOOK^0.5) + I(MVR_PTS^.33) + I(CAR_AGE^.5) + I(CLM_FREQ^0.33)
, data = partial_train_mv)
summary(lm3_base)
lm3_step <- stepAIC(lm3_base, trace = F, direction = 'backward')
summary(lm3_step)
lm3 <- lm3_step
# Model 7
lm4_base <- lm(TARGET_AMT ~ ., data = partial_train_mv)
resid_sq <- lm4_base$residuals^2
lm4_wls <- lm(TARGET_AMT ~ ., data = partial_train_mv, weights = 1/resid_sq)
summary(lm4_wls)
lm4_wls_step <- stepAIC(lm4_wls, trace = F)
summary(lm4_wls_step)
# Model Selection
#evaluate Model 1
y_hat_glm <- predict(binary.mdl, validation, type = "response")
y_hat_glm_binar <- (y_hat_glm>0.5)*1
mean(validation$TARGET_FLAG==y_hat_glm_binar)
(CM <- table(true= validation$TARGET_FLAG, predicted = y_hat_glm_binar))
#Evaluate Model 2
y_hat_glm <- predict(binary.mdl.w.step, validation, type = "response")
y_hat_glm_binar <- (y_hat_glm>0.5)*1
mean(validation$TARGET_FLAG==y_hat_glm_binar)
(CM <- table(true= validation$TARGET_FLAG, predicted = y_hat_glm_binar))
#Evaluate Model 3
val <- validation
val$pred <- predict(rf, val)
val$pred <- as.factor(val$pred)
confusionMatrix(val$pred, val$TARGET_FLAG)
# Evaluating Models 4-7 on the hold-out validation split
validation_mv$pred1 <- predict(lm1, newdata = validation_mv)
validation_mv$pred2 <- predict(lm2, newdata = validation_mv)
validation_mv$pred3 <- predict(lm3, newdata = validation_mv)
validation_mv$pred4 <- predict(lm4_wls, newdata = validation_mv) # Model 7 (WLS fit)
validation_mv %>%
dplyr::select(TARGET_AMT, pred1, pred2, pred3) %>%
tidyr::gather(model, prediction, -TARGET_AMT) %>%
ggplot(aes(x = TARGET_AMT, y = prediction)) +
geom_point(alpha = .2) +
geom_abline(slope = 1, intercept = 0, alpha = .3) +
facet_wrap(~model)
validation_mv$resid1 <- with(validation_mv, pred1 - TARGET_AMT)
validation_mv$resid2 <- with(validation_mv, pred2 - TARGET_AMT)
validation_mv$resid3 <- with(validation_mv, pred3 - TARGET_AMT)
#validation_mv$resid4 <- with(validation_mv, pred4 - TARGET_AMT)
validation_mv %>%
dplyr::select(TARGET_AMT, resid1, resid2, resid3) %>%
tidyr::gather(model, residuals, -TARGET_AMT) %>%
ggplot(aes(x = TARGET_AMT, y = residuals)) +
geom_point(alpha = .2) +
geom_hline(yintercept = 0, alpha = .3) +
facet_wrap(~model)
# RMSE calc (Model 6 was fit on log(TARGET_AMT), so back-transform its predictions with exp())
validation_mv$resid_4 <- with(validation_mv, pred1 - TARGET_AMT)
validation_mv$resid_5 <- with(validation_mv, pred2 - TARGET_AMT)
validation_mv$resid_6 <- with(validation_mv, exp(pred3) - TARGET_AMT)
validation_mv$resid_7 <- with(validation_mv, pred4 - TARGET_AMT)
cat('RMSE\n',
'Model 4: ', sqrt(sum(validation_mv$resid_4^2) / nrow(validation_mv)), '\n',
'Model 5: ', sqrt(sum(validation_mv$resid_5^2) / nrow(validation_mv)), '\n',
'Model 6: ', sqrt(sum(validation_mv$resid_6^2) / nrow(validation_mv)), '\n',
'Model 7: ', sqrt(sum(validation_mv$resid_7^2) / nrow(validation_mv))
)