Cover Page
Data 621 - Week 4 HW
Baron Curtin
CUNY School of Professional Studies
Introduction
The purpose of this assignment is to generate multiple linear regression and binary logistic regression models based on a training dataset to predict the probability that a person will crash their car and the amount of money the crash costs
Data Exploration
Non-Visual Inspection
Variables
- Response Variable:
- TARGET_FLAG: binary indicator of whether the car was involved in a crash
- TARGET_AMT: cost of the crash
- Explanatory Variables
- Identification Variable:
- INDEX: will not be used in analysis
## Observations: 8,161
## Variables: 26
## $ INDEX <int> 1, 2, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 1...
## $ TARGET_FLAG <int> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0,...
## $ TARGET_AMT <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 2946.000, 0.000...
## $ KIDSDRIV <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ AGE <int> 60, 43, 35, 51, 50, 34, 54, 37, 34, 50, 53, 43, 55...
## $ HOMEKIDS <int> 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0,...
## $ YOJ <int> 11, 11, 10, 14, NA, 12, NA, NA, 10, 7, 14, 5, 11, ...
## $ INCOME <chr> "$67,349", "$91,449", "$16,039", "", "$114,986", "...
## $ PARENT1 <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "...
## $ HOME_VAL <chr> "$0", "$257,252", "$124,191", "$306,251", "$243,92...
## $ MSTATUS <chr> "z_No", "z_No", "Yes", "Yes", "Yes", "z_No", "Yes"...
## $ SEX <chr> "M", "M", "z_F", "M", "z_F", "z_F", "z_F", "M", "z...
## $ EDUCATION <chr> "PhD", "z_High School", "z_High School", "<High Sc...
## $ JOB <chr> "Professional", "z_Blue Collar", "Clerical", "z_Bl...
## $ TRAVTIME <int> 14, 22, 5, 32, 36, 46, 33, 44, 34, 48, 15, 36, 25,...
## $ CAR_USE <chr> "Private", "Commercial", "Private", "Private", "Pr...
## $ BLUEBOOK <chr> "$14,230", "$14,940", "$4,010", "$15,440", "$18,00...
## $ TIF <int> 11, 1, 4, 7, 1, 1, 1, 1, 1, 7, 1, 7, 7, 6, 1, 6, 6...
## $ CAR_TYPE <chr> "Minivan", "Minivan", "z_SUV", "Minivan", "z_SUV",...
## $ RED_CAR <chr> "yes", "yes", "no", "yes", "no", "no", "no", "yes"...
## $ OLDCLAIM <chr> "$4,461", "$0", "$38,690", "$0", "$19,217", "$0", ...
## $ CLM_FREQ <int> 2, 0, 2, 0, 2, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0,...
## $ REVOKED <chr> "No", "No", "No", "No", "Yes", "No", "No", "Yes", ...
## $ MVR_PTS <int> 3, 0, 3, 0, 3, 0, 0, 10, 0, 1, 0, 0, 3, 3, 3, 0, 0...
## $ CAR_AGE <int> 18, 1, 10, 6, 17, 7, 1, 7, 1, 17, 11, 1, 9, 10, 5,...
## $ URBANICITY <chr> "Highly Urban/ Urban", "Highly Urban/ Urban", "Hig...
The glimpse function of dplyr shows that there are 8161 observations and 26 variables * INCOME, HOME_VAL, OLDCLAIM are interestingly character type but contain monetary values + These will have to be converted from char to integer in the data preparation stage * MSTATUS, SEX, EDUCATION, URBANiCITY will have to be modified in the data preparation stage as there are inconsitent values
- There are 14 fields of the character data type
- 1 field of the double data type
- 11 fields of the interger data type
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Min. : 1 Min. :0.0000 Min. : 0 Min. :0.0000
## 1st Qu.: 2559 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Median : 5133 Median :0.0000 Median : 0 Median :0.0000
## Mean : 5152 Mean :0.2638 Mean : 1504 Mean :0.1711
## 3rd Qu.: 7745 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :10302 Max. :1.0000 Max. :107586 Max. :4.0000
##
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.0 Length:8161
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 Class :character
## Median :45.00 Median :0.0000 Median :11.0 Mode :character
## Mean :44.79 Mean :0.7212 Mean :10.5
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0
## Max. :81.00 Max. :5.0000 Max. :23.0
## NA's :6 NA's :454
## PARENT1 HOME_VAL MSTATUS
## Length:8161 Length:8161 Length:8161
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## SEX EDUCATION JOB TRAVTIME
## Length:8161 Length:8161 Length:8161 Min. : 5.00
## Class :character Class :character Class :character 1st Qu.: 22.00
## Mode :character Mode :character Mode :character Median : 33.00
## Mean : 33.49
## 3rd Qu.: 44.00
## Max. :142.00
##
## CAR_USE BLUEBOOK TIF CAR_TYPE
## Length:8161 Length:8161 Min. : 1.000 Length:8161
## Class :character Class :character 1st Qu.: 1.000 Class :character
## Mode :character Mode :character Median : 4.000 Mode :character
## Mean : 5.351
## 3rd Qu.: 7.000
## Max. :25.000
##
## RED_CAR OLDCLAIM CLM_FREQ REVOKED
## Length:8161 Length:8161 Min. :0.0000 Length:8161
## Class :character Class :character 1st Qu.:0.0000 Class :character
## Mode :character Mode :character Median :0.0000 Mode :character
## Mean :0.7986
## 3rd Qu.:2.0000
## Max. :5.0000
##
## MVR_PTS CAR_AGE URBANICITY
## Min. : 0.000 Min. :-3.000 Length:8161
## 1st Qu.: 0.000 1st Qu.: 1.000 Class :character
## Median : 1.000 Median : 8.000 Mode :character
## Mean : 1.696 Mean : 8.328
## 3rd Qu.: 3.000 3rd Qu.:12.000
## Max. :13.000 Max. :28.000
## NA's :510
- YOJ and AGE have missing values, they will need to be imputed
- CAR_AGE has a negative value. How is that possible?
Basic Stats
- basicStats further confirms that CAR_AGE and YOJ have missing values
- HOME_VAL has the highest variance amongst the variables
- The largest skew value is ~8.7 in the TARGET_AMT variable
Correlation
- MVR_PTS has the highest positive correlation with TARGET_FLAG
- You would typically expect someone with a lot of points on their license to get in more accidents
- HOME_VAL has the highest negative correlation with TARGET_FLAG
- I was not expecting HOME_VAL to have the highest negative correlation
- HOMEKIDS has a positive correlation
- I would think that parents would be more responsible drivers
- Rankings were added to show if some predictors are more highly correlated with TARGET_FLAG vs TARGET_AMT
- BLUEBOOK is more positively correlated with TARGET_AMT than TARGET_FLAG, which is to be expected cause the higher a car’s value, the higher the payout
- The correlation matrix above shows that TARGET_FLAG and TARGET_AMT and KIDSDRIV and HOMEKIDS are the two sets of most highly correlated variables
- This makes sense because when kids are at home, they are likely to drive the parents car because they often can’t afford one of their own
- TARGET_FLAG and TARGET_AMT make sense because if you’re in a crash, you are very likely to be paying for the cost of the crash
- Outside of the two sets of variables outlined above, MVR_PTS and CLM_FREQ are the most highly correlated variables
- People with points on their record are more likely to file claims more frequently
Visual Inspection
Density Plots
- AGE is the only normally distributed variable
- Interesting as this provides some evidence that the data is adequate for inference
- All of the other variables are multi-modal and assymetric
Histograms
- The histograms further reiniforce the conclusions made in the density plots
Box Plots
- HOME_VAL and INCOME appear to have the highest variances
- This confirms what we saw in basicStats
Removing HOME_VAL, INCOME, OLDCLAIM, BLUEBOOK, TARGET_AMT…
- The means and medians are pretty close to each other for all of the variables, displaying evidence of very slight skews
Data Preparation
Transforming Monetary Columns to Numeric
Converting the currency columns to numeric will be important when modelling
Transform Inconsitent Observations Within Columns
There was evidence of inconsistent naming across observations. We will transform those variables here. We will first create a list of all the character columns and the values they contain
## $PARENT1
## [1] "No" "Yes"
##
## $MSTATUS
## [1] "Yes" "z_No"
##
## $SEX
## [1] "z_F" "M"
##
## $EDUCATION
## [1] "z_High School" "Bachelors" "Masters" "<High School"
## [5] "PhD"
##
## $JOB
## [1] "z_Blue Collar" "Manager" "Lawyer" "Student"
## [5] "Home Maker" "Doctor" "Clerical" ""
## [9] "Professional"
##
## $CAR_USE
## [1] "Private" "Commercial"
##
## $CAR_TYPE
## [1] "z_SUV" "Minivan" "Van" "Pickup" "Panel Truck"
## [6] "Sports Car"
##
## $RED_CAR
## [1] "no" "yes"
##
## $REVOKED
## [1] "No" "Yes"
##
## $URBANICITY
## [1] "Highly Urban/ Urban" "z_Highly Rural/ Rural"
- Many of the fields have values prefixed with “z_” which can be removed
- EDUCATION field has an addition “<” that prefixes High School which also can be removed
- RED_CHAR will be transformed so that the first letter is capitalized
- URBANICITY will be trimmed to remove extra spaces and renamed to URBANCITY
Create Binary Dummy Variable Columns for Categorical Variables
First we will take CAR_USE, PARENT1, MSTATUS, SEX, RED_CAR, REVOKED, URBANCITY and convert them to binary variables
- SINGLEPARENT will have value of 1 when TRUE, else 0
- MARRIED will have value of 1 when TRUE, else 0
- RED_CAR will have value of 1 when TRUE, else 0
- MALE will have value of 1 when TRUE, else 0
- COMMERCIALUSE will have value of 1 when TRUE, else 0
- REVOKED will have value of 1 when TRUE, else 0
- URBANICITY will have value of 1 when TRUE, else 0
The rest of the variables will be dummified using the dummies package
- Converting all of the character fields to to binary columns will make missForest’s job easier as well as provide a better regression model
- The remaining variables are EDUCATION, JOB, CAR_TYPE
The fields with spaces will be renamed to include _
Missing Value Imputation
We can use the package missForest to impute values for the NAs
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
## missForest iteration 5 in progress...done!
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
Build Models
Leaps Subsetting
We can use the leaps package to subset the explanatory variables to find the best model. All of the character variables have already been converted to factors and subsequently numerical values and all of the missing values have been imputed using missForest
## Reordering variables and trying again:
## Subset selection object
## Call: regsubsets.formula(TARGET_AMT ~ . - TARGET_FLAG, data = imputed$train,
## method = "exhaustive", nvmax = NULL, nbest = 1)
## 39 Variables (and intercept)
## Forced in Forced out
## KIDSDRIV FALSE FALSE
## AGE FALSE FALSE
## HOMEKIDS FALSE FALSE
## YOJ FALSE FALSE
## INCOME FALSE FALSE
## SINGLEPARENT FALSE FALSE
## HOME_VAL FALSE FALSE
## MARRIED FALSE FALSE
## MALE FALSE FALSE
## EDUCATIONBachelors FALSE FALSE
## EDUCATIONHigh_School FALSE FALSE
## EDUCATIONMasters FALSE FALSE
## JOB FALSE FALSE
## JOBBlue_Collar FALSE FALSE
## JOBClerical FALSE FALSE
## JOBDoctor FALSE FALSE
## JOBHome_Maker FALSE FALSE
## JOBLawyer FALSE FALSE
## JOBManager FALSE FALSE
## JOBProfessional FALSE FALSE
## TRAVTIME FALSE FALSE
## COMMERCIALUSE FALSE FALSE
## BLUEBOOK FALSE FALSE
## TIF FALSE FALSE
## CAR_TYPEMinivan FALSE FALSE
## CAR_TYPEPanel_Truck FALSE FALSE
## CAR_TYPEPickup FALSE FALSE
## CAR_TYPESports_Car FALSE FALSE
## CAR_TYPESUV FALSE FALSE
## RED_CAR FALSE FALSE
## OLDCLAIM FALSE FALSE
## CLM_FREQ FALSE FALSE
## REVOKED FALSE FALSE
## MVR_PTS FALSE FALSE
## CAR_AGE FALSE FALSE
## URBANICITY FALSE FALSE
## EDUCATIONPhD FALSE FALSE
## JOBStudent FALSE FALSE
## CAR_TYPEVan FALSE FALSE
## 1 subsets of each size up to 36
## Selection Algorithm: exhaustive
## KIDSDRIV AGE HOMEKIDS YOJ INCOME SINGLEPARENT HOME_VAL MARRIED
## 1 ( 1 ) " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " "*" " "
## 4 ( 1 ) " " " " " " " " " " " " "*" " "
## 5 ( 1 ) " " " " " " " " " " " " "*" " "
## 6 ( 1 ) " " " " " " " " " " " " "*" " "
## 7 ( 1 ) " " " " " " " " " " "*" "*" " "
## 8 ( 1 ) " " " " " " " " " " "*" "*" " "
## 9 ( 1 ) " " " " " " " " " " "*" "*" " "
## 10 ( 1 ) " " " " " " " " " " "*" "*" " "
## 11 ( 1 ) " " " " " " " " " " "*" "*" " "
## 12 ( 1 ) "*" " " " " " " " " "*" "*" " "
## 13 ( 1 ) "*" " " " " " " " " "*" "*" " "
## 14 ( 1 ) "*" " " " " " " " " "*" "*" " "
## 15 ( 1 ) "*" " " " " " " " " "*" "*" " "
## 16 ( 1 ) "*" " " " " " " " " "*" "*" "*"
## 17 ( 1 ) "*" " " " " " " "*" "*" "*" "*"
## 18 ( 1 ) "*" " " " " " " "*" "*" "*" "*"
## 19 ( 1 ) "*" " " " " " " "*" "*" "*" "*"
## 20 ( 1 ) "*" " " " " " " "*" "*" "*" "*"
## 21 ( 1 ) "*" " " " " " " "*" "*" "*" "*"
## 22 ( 1 ) "*" " " " " " " "*" "*" "*" "*"
## 23 ( 1 ) "*" " " " " " " "*" "*" "*" "*"
## 24 ( 1 ) "*" " " " " " " "*" "*" "*" "*"
## 25 ( 1 ) "*" " " " " " " "*" "*" "*" "*"
## 26 ( 1 ) "*" " " " " "*" "*" "*" "*" "*"
## 27 ( 1 ) "*" " " " " " " "*" "*" "*" "*"
## 28 ( 1 ) "*" " " " " "*" "*" "*" "*" "*"
## 29 ( 1 ) "*" " " "*" "*" "*" "*" "*" "*"
## 30 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 31 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 32 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 33 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 34 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 35 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 36 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## MALE EDUCATIONBachelors EDUCATIONHigh_School EDUCATIONMasters
## 1 ( 1 ) " " " " " " " "
## 2 ( 1 ) " " " " " " " "
## 3 ( 1 ) " " " " " " " "
## 4 ( 1 ) " " " " " " " "
## 5 ( 1 ) " " " " " " " "
## 6 ( 1 ) " " " " " " " "
## 7 ( 1 ) " " " " " " " "
## 8 ( 1 ) " " " " " " " "
## 9 ( 1 ) " " " " " " " "
## 10 ( 1 ) " " " " " " " "
## 11 ( 1 ) " " " " " " " "
## 12 ( 1 ) " " " " " " " "
## 13 ( 1 ) " " " " " " " "
## 14 ( 1 ) " " " " " " " "
## 15 ( 1 ) " " " " " " " "
## 16 ( 1 ) " " " " " " " "
## 17 ( 1 ) " " " " " " " "
## 18 ( 1 ) " " " " " " " "
## 19 ( 1 ) "*" " " " " " "
## 20 ( 1 ) "*" " " " " " "
## 21 ( 1 ) "*" " " " " " "
## 22 ( 1 ) "*" " " " " " "
## 23 ( 1 ) "*" " " " " " "
## 24 ( 1 ) "*" " " " " " "
## 25 ( 1 ) "*" " " " " " "
## 26 ( 1 ) "*" " " " " " "
## 27 ( 1 ) "*" " " " " "*"
## 28 ( 1 ) "*" " " " " "*"
## 29 ( 1 ) "*" " " " " "*"
## 30 ( 1 ) "*" " " " " "*"
## 31 ( 1 ) "*" " " " " "*"
## 32 ( 1 ) "*" " " " " "*"
## 33 ( 1 ) "*" " " " " "*"
## 34 ( 1 ) "*" " " " " "*"
## 35 ( 1 ) "*" "*" "*" "*"
## 36 ( 1 ) "*" "*" "*" "*"
## EDUCATIONPhD JOB JOBBlue_Collar JOBClerical JOBDoctor
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " "
## 9 ( 1 ) " " " " " " " " " "
## 10 ( 1 ) " " " " " " " " " "
## 11 ( 1 ) " " " " " " " " " "
## 12 ( 1 ) " " " " " " " " " "
## 13 ( 1 ) " " " " " " " " " "
## 14 ( 1 ) " " " " " " " " " "
## 15 ( 1 ) " " " " "*" " " " "
## 16 ( 1 ) " " " " "*" " " " "
## 17 ( 1 ) " " " " "*" " " " "
## 18 ( 1 ) " " " " "*" " " " "
## 19 ( 1 ) " " " " " " " " " "
## 20 ( 1 ) " " " " " " " " " "
## 21 ( 1 ) " " " " " " " " " "
## 22 ( 1 ) " " " " " " " " " "
## 23 ( 1 ) " " " " " " " " "*"
## 24 ( 1 ) "*" " " " " " " "*"
## 25 ( 1 ) "*" " " " " " " "*"
## 26 ( 1 ) "*" " " " " " " "*"
## 27 ( 1 ) "*" " " "*" "*" "*"
## 28 ( 1 ) "*" " " "*" "*" "*"
## 29 ( 1 ) "*" " " "*" "*" "*"
## 30 ( 1 ) "*" " " "*" "*" "*"
## 31 ( 1 ) "*" " " "*" "*" "*"
## 32 ( 1 ) "*" " " "*" "*" "*"
## 33 ( 1 ) "*" " " "*" "*" "*"
## 34 ( 1 ) "*" "*" "*" "*" "*"
## 35 ( 1 ) " " "*" "*" "*" "*"
## 36 ( 1 ) " " "*" "*" "*" "*"
## JOBHome_Maker JOBLawyer JOBManager JOBProfessional JOBStudent
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " "
## 6 ( 1 ) " " " " "*" " " " "
## 7 ( 1 ) " " " " "*" " " " "
## 8 ( 1 ) " " " " "*" " " " "
## 9 ( 1 ) " " " " "*" " " " "
## 10 ( 1 ) " " " " "*" " " " "
## 11 ( 1 ) " " " " "*" " " " "
## 12 ( 1 ) " " " " "*" " " " "
## 13 ( 1 ) " " " " "*" " " " "
## 14 ( 1 ) " " " " "*" " " " "
## 15 ( 1 ) " " " " "*" " " " "
## 16 ( 1 ) " " " " "*" " " " "
## 17 ( 1 ) " " " " "*" " " " "
## 18 ( 1 ) " " " " "*" " " " "
## 19 ( 1 ) " " " " "*" " " " "
## 20 ( 1 ) " " " " "*" " " "*"
## 21 ( 1 ) " " " " "*" " " " "
## 22 ( 1 ) " " " " "*" " " "*"
## 23 ( 1 ) " " " " "*" " " "*"
## 24 ( 1 ) " " " " "*" " " "*"
## 25 ( 1 ) "*" " " "*" " " "*"
## 26 ( 1 ) "*" " " "*" " " "*"
## 27 ( 1 ) " " " " "*" "*" " "
## 28 ( 1 ) " " " " "*" "*" " "
## 29 ( 1 ) " " " " "*" "*" " "
## 30 ( 1 ) " " " " "*" "*" " "
## 31 ( 1 ) " " " " "*" "*" " "
## 32 ( 1 ) " " "*" "*" "*" " "
## 33 ( 1 ) " " "*" "*" "*" " "
## 34 ( 1 ) " " "*" "*" "*" " "
## 35 ( 1 ) " " "*" "*" "*" " "
## 36 ( 1 ) "*" "*" "*" "*" " "
## TRAVTIME COMMERCIALUSE BLUEBOOK TIF CAR_TYPEMinivan
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " "
## 4 ( 1 ) " " "*" " " " " " "
## 5 ( 1 ) " " "*" " " " " "*"
## 6 ( 1 ) " " "*" " " " " "*"
## 7 ( 1 ) " " "*" " " " " "*"
## 8 ( 1 ) " " "*" " " "*" "*"
## 9 ( 1 ) " " "*" " " "*" "*"
## 10 ( 1 ) "*" "*" " " "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*"
## 12 ( 1 ) "*" "*" " " "*" "*"
## 13 ( 1 ) "*" "*" " " "*" "*"
## 14 ( 1 ) "*" "*" " " "*" "*"
## 15 ( 1 ) "*" "*" " " "*" "*"
## 16 ( 1 ) "*" "*" " " "*" "*"
## 17 ( 1 ) "*" "*" " " "*" "*"
## 18 ( 1 ) "*" "*" "*" "*" "*"
## 19 ( 1 ) "*" "*" "*" "*" " "
## 20 ( 1 ) "*" "*" "*" "*" " "
## 21 ( 1 ) "*" "*" "*" "*" " "
## 22 ( 1 ) "*" "*" "*" "*" " "
## 23 ( 1 ) "*" "*" "*" "*" " "
## 24 ( 1 ) "*" "*" "*" "*" " "
## 25 ( 1 ) "*" "*" "*" "*" " "
## 26 ( 1 ) "*" "*" "*" "*" " "
## 27 ( 1 ) "*" "*" "*" "*" " "
## 28 ( 1 ) "*" "*" "*" "*" " "
## 29 ( 1 ) "*" "*" "*" "*" " "
## 30 ( 1 ) "*" "*" "*" "*" "*"
## 31 ( 1 ) "*" "*" "*" "*" "*"
## 32 ( 1 ) "*" "*" "*" "*" "*"
## 33 ( 1 ) "*" "*" "*" "*" " "
## 34 ( 1 ) "*" "*" "*" "*" "*"
## 35 ( 1 ) "*" "*" "*" "*" " "
## 36 ( 1 ) "*" "*" "*" "*" "*"
## CAR_TYPEPanel_Truck CAR_TYPEPickup CAR_TYPESports_Car
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " " " " "
## 3 ( 1 ) " " " " " "
## 4 ( 1 ) " " " " " "
## 5 ( 1 ) " " " " " "
## 6 ( 1 ) " " " " " "
## 7 ( 1 ) " " " " " "
## 8 ( 1 ) " " " " " "
## 9 ( 1 ) " " " " " "
## 10 ( 1 ) " " " " " "
## 11 ( 1 ) " " " " " "
## 12 ( 1 ) " " " " " "
## 13 ( 1 ) " " " " "*"
## 14 ( 1 ) " " " " "*"
## 15 ( 1 ) " " " " "*"
## 16 ( 1 ) " " " " "*"
## 17 ( 1 ) " " " " "*"
## 18 ( 1 ) " " " " "*"
## 19 ( 1 ) " " "*" "*"
## 20 ( 1 ) " " "*" "*"
## 21 ( 1 ) " " "*" "*"
## 22 ( 1 ) " " "*" "*"
## 23 ( 1 ) " " "*" "*"
## 24 ( 1 ) " " "*" "*"
## 25 ( 1 ) " " "*" "*"
## 26 ( 1 ) " " "*" "*"
## 27 ( 1 ) " " "*" "*"
## 28 ( 1 ) " " "*" "*"
## 29 ( 1 ) " " "*" "*"
## 30 ( 1 ) "*" "*" "*"
## 31 ( 1 ) "*" "*" "*"
## 32 ( 1 ) "*" "*" "*"
## 33 ( 1 ) "*" "*" "*"
## 34 ( 1 ) "*" "*" "*"
## 35 ( 1 ) "*" "*" "*"
## 36 ( 1 ) "*" "*" "*"
## CAR_TYPESUV CAR_TYPEVan RED_CAR OLDCLAIM CLM_FREQ REVOKED
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " " " "
## 9 ( 1 ) " " " " " " " " " " " "
## 10 ( 1 ) " " " " " " " " " " " "
## 11 ( 1 ) " " " " " " " " " " "*"
## 12 ( 1 ) " " " " " " " " " " "*"
## 13 ( 1 ) " " " " " " " " " " "*"
## 14 ( 1 ) " " "*" " " " " " " "*"
## 15 ( 1 ) " " "*" " " " " " " "*"
## 16 ( 1 ) " " "*" " " " " " " "*"
## 17 ( 1 ) " " "*" " " " " " " "*"
## 18 ( 1 ) " " "*" " " " " " " "*"
## 19 ( 1 ) "*" "*" " " " " " " "*"
## 20 ( 1 ) "*" "*" " " " " " " "*"
## 21 ( 1 ) "*" "*" " " "*" "*" "*"
## 22 ( 1 ) "*" "*" " " "*" "*" "*"
## 23 ( 1 ) "*" "*" " " "*" "*" "*"
## 24 ( 1 ) "*" "*" " " "*" "*" "*"
## 25 ( 1 ) "*" "*" " " "*" "*" "*"
## 26 ( 1 ) "*" "*" " " "*" "*" "*"
## 27 ( 1 ) "*" "*" " " "*" "*" "*"
## 28 ( 1 ) "*" "*" " " "*" "*" "*"
## 29 ( 1 ) "*" "*" " " "*" "*" "*"
## 30 ( 1 ) " " " " " " "*" "*" "*"
## 31 ( 1 ) " " " " "*" "*" "*" "*"
## 32 ( 1 ) " " " " "*" "*" "*" "*"
## 33 ( 1 ) "*" "*" "*" "*" "*" "*"
## 34 ( 1 ) " " "*" "*" "*" "*" "*"
## 35 ( 1 ) "*" "*" "*" "*" "*" "*"
## 36 ( 1 ) "*" " " "*" "*" "*" "*"
## MVR_PTS CAR_AGE URBANICITY
## 1 ( 1 ) "*" " " " "
## 2 ( 1 ) " " " " "*"
## 3 ( 1 ) " " " " "*"
## 4 ( 1 ) "*" " " "*"
## 5 ( 1 ) "*" " " "*"
## 6 ( 1 ) "*" " " "*"
## 7 ( 1 ) "*" " " "*"
## 8 ( 1 ) "*" " " "*"
## 9 ( 1 ) "*" "*" "*"
## 10 ( 1 ) "*" "*" "*"
## 11 ( 1 ) "*" "*" "*"
## 12 ( 1 ) "*" "*" "*"
## 13 ( 1 ) "*" "*" "*"
## 14 ( 1 ) "*" "*" "*"
## 15 ( 1 ) "*" "*" "*"
## 16 ( 1 ) "*" "*" "*"
## 17 ( 1 ) "*" "*" "*"
## 18 ( 1 ) "*" "*" "*"
## 19 ( 1 ) "*" "*" "*"
## 20 ( 1 ) "*" "*" "*"
## 21 ( 1 ) "*" "*" "*"
## 22 ( 1 ) "*" "*" "*"
## 23 ( 1 ) "*" "*" "*"
## 24 ( 1 ) "*" "*" "*"
## 25 ( 1 ) "*" "*" "*"
## 26 ( 1 ) "*" "*" "*"
## 27 ( 1 ) "*" "*" "*"
## 28 ( 1 ) "*" "*" "*"
## 29 ( 1 ) "*" "*" "*"
## 30 ( 1 ) "*" "*" "*"
## 31 ( 1 ) "*" "*" "*"
## 32 ( 1 ) "*" "*" "*"
## 33 ( 1 ) "*" "*" "*"
## 34 ( 1 ) "*" "*" "*"
## 35 ( 1 ) "*" "*" "*"
## 36 ( 1 ) "*" "*" "*"
We can determine the best fits by ploting the number of variables in our subsets and using the cp value in combinationation with the adjusted R^2 value
- Based on the diagnostic plot, 20 variables appear to fit the data best
Multiple Linear Regression
Model 1: All Variables
We will first create a linear model with all of the predictors. It will most likely overfit the data but provides a good reference point. We also know from leaps, that the best model will use 16 of the predictors
##
## Call:
## lm(formula = TARGET_AMT ~ . - TARGET_FLAG, data = imputed$train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5368 -1647 -754 353 103594
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.890e+02 7.412e+02 0.525 0.599712
## KIDSDRIV 2.351e+02 1.362e+02 1.726 0.084469 .
## AGE 5.029e+00 8.436e+00 0.596 0.551070
## HOMEKIDS 6.054e+01 7.747e+01 0.782 0.434538
## YOJ -2.118e+01 1.799e+01 -1.177 0.239269
## INCOME -3.802e-03 2.365e-03 -1.608 0.107969
## SINGLEPARENT 5.606e+02 2.427e+02 2.310 0.020949 *
## HOME_VAL -2.227e-03 7.582e-04 -2.938 0.003319 **
## MARRIED -2.634e+02 1.799e+02 -1.465 0.143062
## MALE 1.950e+02 2.203e+02 0.885 0.375966
## EDUCATIONBachelors -5.675e+02 3.560e+02 -1.594 0.110941
## EDUCATIONHigh_School -5.543e+02 3.971e+02 -1.396 0.162806
## EDUCATIONMasters -3.303e+02 3.129e+02 -1.056 0.291167
## EDUCATIONPhD NA NA NA NA
## JOB 8.360e+01 4.570e+02 0.183 0.854852
## JOBBlue_Collar 5.025e+02 2.820e+02 1.782 0.074805 .
## JOBClerical 3.837e+02 2.824e+02 1.359 0.174347
## JOBDoctor -4.560e+02 5.605e+02 -0.813 0.415980
## JOBHome_Maker 2.477e+00 3.213e+02 0.008 0.993850
## JOBLawyer 1.262e+02 4.232e+02 0.298 0.765569
## JOBManager -4.687e+02 3.488e+02 -1.344 0.179119
## JOBProfessional 4.108e+02 3.266e+02 1.258 0.208459
## JOBStudent NA NA NA NA
## TRAVTIME 1.068e+01 3.870e+00 2.759 0.005820 **
## COMMERCIALUSE 7.664e+02 1.894e+02 4.047 5.27e-05 ***
## BLUEBOOK 1.750e-02 1.035e-02 1.692 0.090741 .
## TIF -5.131e+01 1.456e+01 -3.525 0.000427 ***
## CAR_TYPEMinivan -7.151e+02 2.547e+02 -2.808 0.005008 **
## CAR_TYPEPanel_Truck -6.346e+02 3.173e+02 -2.000 0.045524 *
## CAR_TYPEPickup -2.696e+02 2.657e+02 -1.015 0.310188
## CAR_TYPESports_Car 4.430e+02 3.532e+02 1.254 0.209790
## CAR_TYPESUV 6.155e+01 3.195e+02 0.193 0.847251
## CAR_TYPEVan NA NA NA NA
## RED_CAR 9.042e+01 1.780e+02 0.508 0.611545
## OLDCLAIM -1.015e-02 8.764e-03 -1.158 0.246727
## CLM_FREQ 9.105e+01 6.557e+01 1.389 0.165019
## REVOKED 5.627e+02 2.100e+02 2.679 0.007401 **
## MVR_PTS 1.615e+02 3.151e+01 5.126 3.05e-07 ***
## CAR_AGE -3.514e+01 1.567e+01 -2.242 0.024995 *
## URBANICITY 1.672e+03 1.656e+02 10.097 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4528 on 5675 degrees of freedom
## Multiple R-squared: 0.06906, Adjusted R-squared: 0.06316
## F-statistic: 11.69 on 36 and 5675 DF, p-value: < 2.2e-16
- There are 16 significant variables in Model 1, 6 of which are highly significant at a 5% significance level
- The adjusted R^2 is .06737
- The 3 variables CAR_TYPEVan, JOBStudent, and EDUCATIONPhD are NA
- Interesting Positive Predictors
- SINGLEPARENT has a positive impact on TARGET_AMT. One would expect single parents to be more responsible because they are the only care providers for their children
- Home makers have the highest positive impact on TARGET_AMT
- COMMERCIALUSE vehicles have a higher positive impact
- Urban areas as expected are more prone to crashes due to population
- Interesting Negative Predictors
- Bachelor degrees have a higher negative impact on TARGET_AMT. This may be due to a lack of people pursuing education past a Bachelors
- Doctors have one of the highest negative impacts on TARGET_AMT. They see victims of car crashes
- RED_CAR has a negative impact. It is thought that red cars make people speed more
Model 2: 20 Variables Chosen by leaps
##
## Call:
## lm(formula = TARGET_AMT ~ KIDSDRIV + INCOME + SINGLEPARENT +
## MARRIED + MALE + EDUCATIONBachelors + JOBDoctor + JOBManager +
## TRAVTIME + COMMERCIALUSE + BLUEBOOK + TIF + CAR_TYPEMinivan +
## CAR_TYPESports_Car + CAR_TYPESUV + CLM_FREQ + REVOKED + MVR_PTS +
## CAR_AGE + URBANICITY, data = imputed$train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5156 -1662 -765 329 104069
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.742e+01 3.557e+02 0.049 0.960936
## KIDSDRIV 2.868e+02 1.240e+02 2.313 0.020748 *
## INCOME -5.991e-03 1.644e-03 -3.643 0.000272 ***
## SINGLEPARENT 5.742e+02 2.122e+02 2.706 0.006836 **
## MARRIED -5.536e+02 1.423e+02 -3.890 0.000101 ***
## MALE 2.054e+02 1.770e+02 1.161 0.245888
## EDUCATIONBachelors -6.771e+01 1.377e+02 -0.492 0.622964
## JOBDoctor -3.159e+02 3.690e+02 -0.856 0.391960
## JOBManager -7.513e+02 1.972e+02 -3.810 0.000140 ***
## TRAVTIME 1.079e+01 3.863e+00 2.792 0.005257 **
## COMMERCIALUSE 7.829e+02 1.487e+02 5.265 1.46e-07 ***
## BLUEBOOK 1.281e-02 8.695e-03 1.473 0.140797
## TIF -5.044e+01 1.453e+01 -3.471 0.000523 ***
## CAR_TYPEMinivan -4.409e+02 1.739e+02 -2.535 0.011269 *
## CAR_TYPESports_Car 6.693e+02 2.685e+02 2.493 0.012693 *
## CAR_TYPESUV 2.896e+02 2.260e+02 1.281 0.200168
## CLM_FREQ 6.433e+01 5.820e+01 1.105 0.269087
## REVOKED 4.598e+02 1.870e+02 2.459 0.013981 *
## MVR_PTS 1.646e+02 3.135e+01 5.251 1.57e-07 ***
## CAR_AGE -3.292e+01 1.206e+01 -2.729 0.006377 **
## URBANICITY 1.663e+03 1.628e+02 10.218 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4530 on 5691 degrees of freedom
## Multiple R-squared: 0.06561, Adjusted R-squared: 0.06233
## F-statistic: 19.98 on 20 and 5691 DF, p-value: < 2.2e-16
- The adjusted R^2 of Model 2 is .06792, still very small but slightly improved over Model 1
- Of the 20 variables, 16 are significant, while 10 are highly significant at the 5% level
- Interesting Positive Predictors
- COMMERCIALUSE is the highest positive influencer
- SINGLEPARENT is still a high positive influencer
- Interesting Negative Predictors
- JOBManager has overtaken JOBDoctor as the highest negative influencer
Model 3: 1 + Log Transform 20 Variables from Leaps
##
## Call:
## lm(formula = TARGET_AMT ~ KIDSDRIV + INCOME + SINGLEPARENT +
## MARRIED + MALE + EDUCATIONBachelors + JOBDoctor + JOBManager +
## TRAVTIME + COMMERCIALUSE + BLUEBOOK + TIF + CAR_TYPEMinivan +
## CAR_TYPESports_Car + CAR_TYPESUV + CLM_FREQ + REVOKED + MVR_PTS +
## CAR_AGE + URBANICITY, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4712 -2.3427 -0.9312 1.9602 10.4862
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.76424 0.80434 4.680 2.94e-06 ***
## KIDSDRIV 0.91164 0.15647 5.826 5.98e-09 ***
## INCOME -0.08446 0.01548 -5.457 5.05e-08 ***
## SINGLEPARENT 1.13722 0.21950 5.181 2.28e-07 ***
## MARRIED -0.98637 0.14667 -6.725 1.93e-11 ***
## MALE -0.01271 0.18099 -0.070 0.944010
## EDUCATIONBachelors -0.40727 0.14335 -2.841 0.004512 **
## JOBDoctor -1.16308 0.36964 -3.147 0.001661 **
## JOBManager -1.34800 0.20163 -6.686 2.52e-11 ***
## TRAVTIME 0.48839 0.07578 6.444 1.26e-10 ***
## COMMERCIALUSE 1.54952 0.15211 10.187 < 2e-16 ***
## BLUEBOOK -0.38189 0.07702 -4.958 7.33e-07 ***
## TIF -0.45875 0.06074 -7.553 4.94e-14 ***
## CAR_TYPEMinivan -0.63100 0.17859 -3.533 0.000414 ***
## CAR_TYPESports_Car 0.98331 0.27569 3.567 0.000364 ***
## CAR_TYPESUV 0.60596 0.22866 2.650 0.008070 **
## CLM_FREQ 0.46401 0.08774 5.288 1.28e-07 ***
## REVOKED 1.57779 0.19279 8.184 3.36e-16 ***
## MVR_PTS 0.44336 0.06477 6.845 8.45e-12 ***
## CAR_AGE -0.28440 0.05714 -4.977 6.63e-07 ***
## URBANICITY 3.29982 0.16703 19.756 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.233 on 5690 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.2151, Adjusted R-squared: 0.2123
## F-statistic: 77.96 on 20 and 5690 DF, p-value: < 2.2e-16
- R^2 has greatly improved to .2159
- Interesting positive predictors
- URBANICITY is now the highest positive influencer
- COMMERCIALUSE still has a strong positive impact on TARGET_AMT
- Interesting negative predictors
- JOBManager is still the highest negative influencer
Binary Logistic Regression
Model 4: All Variables
##
## Call:
## glm(formula = TARGET_FLAG ~ . - TARGET_AMT, family = "binomial",
## data = imputed$train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2965 -0.7069 -0.3899 0.5604 3.1810
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.962e+00 4.441e-01 -6.669 2.58e-11 ***
## KIDSDRIV 3.667e-01 7.443e-02 4.927 8.36e-07 ***
## AGE -1.994e-04 4.873e-03 -0.041 0.967355
## HOMEKIDS 5.516e-02 4.462e-02 1.236 0.216359
## YOJ -2.694e-02 1.039e-02 -2.594 0.009494 **
## INCOME -3.859e-06 1.405e-06 -2.747 0.006018 **
## SINGLEPARENT 4.166e-01 1.330e-01 3.133 0.001732 **
## HOME_VAL -1.883e-06 4.413e-07 -4.267 1.98e-05 ***
## MARRIED -3.868e-01 1.056e-01 -3.664 0.000248 ***
## MALE 2.912e-02 1.355e-01 0.215 0.829846
## EDUCATIONBachelors -3.869e-01 2.210e-01 -1.750 0.080097 .
## EDUCATIONHigh_School -6.629e-02 2.418e-01 -0.274 0.784000
## EDUCATIONMasters -2.158e-01 1.889e-01 -1.142 0.253304
## EDUCATIONPhD NA NA NA NA
## JOB -9.117e-02 2.653e-01 -0.344 0.731134
## JOBBlue_Collar 3.972e-01 1.588e-01 2.501 0.012379 *
## JOBClerical 3.875e-01 1.603e-01 2.417 0.015637 *
## JOBDoctor -5.591e-01 3.645e-01 -1.534 0.125067
## JOBHome_Maker -5.611e-02 1.824e-01 -0.308 0.758343
## JOBLawyer -3.294e-02 2.540e-01 -0.130 0.896810
## JOBManager -5.134e-01 2.057e-01 -2.496 0.012572 *
## JOBProfessional 8.403e-02 1.885e-01 0.446 0.655811
## JOBStudent NA NA NA NA
## TRAVTIME 1.593e-02 2.300e-03 6.927 4.30e-12 ***
## COMMERCIALUSE 7.421e-01 1.062e-01 6.990 2.75e-12 ***
## BLUEBOOK -1.466e-05 6.325e-06 -2.318 0.020445 *
## TIF -6.569e-02 8.936e-03 -7.351 1.97e-13 ***
## CAR_TYPEMinivan -6.009e-01 1.524e-01 -3.943 8.05e-05 ***
## CAR_TYPEPanel_Truck -1.652e-01 1.797e-01 -0.919 0.357853
## CAR_TYPEPickup -3.519e-02 1.524e-01 -0.231 0.817342
## CAR_TYPESports_Car 4.697e-01 2.082e-01 2.256 0.024078 *
## CAR_TYPESUV 1.893e-01 1.914e-01 0.989 0.322673
## CAR_TYPEVan NA NA NA NA
## RED_CAR 8.422e-03 1.051e-01 0.080 0.936152
## OLDCLAIM -1.100e-05 4.689e-06 -2.345 0.019026 *
## CLM_FREQ 1.758e-01 3.442e-02 5.108 3.26e-07 ***
## REVOKED 8.780e-01 1.129e-01 7.775 7.57e-15 ***
## MVR_PTS 1.100e-01 1.675e-02 6.566 5.17e-11 ***
## CAR_AGE -2.010e-03 9.322e-03 -0.216 0.829324
## URBANICITY 2.467e+00 1.379e-01 17.893 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6527.4 on 5711 degrees of freedom
## Residual deviance: 5023.6 on 5675 degrees of freedom
## AIC: 5097.6
##
## Number of Fisher Scoring iterations: 5
## llh llhNull G2 McFadden r2ML
## -2511.8093700 -3263.7239857 1503.8292313 0.2303855 0.2314698
## r2CU
## 0.3398657
- Using the pscl package, we can get a pseudo-R^2 value of ~.2304
- There are 19 significant variables in Model 4, 16 of which are highly significant at a 5% significance level
- The 3 variables CAR_TYPEVan, JOBStudent, and EDUCATIONPhD are NA
- Interesting Positive Predictors
- SINGLEPARENT has a positive impact on TARGET_AMT. One would expect single parents to be more responsible because they are the only care providers for their children
- REVOKED has the highest positive impact on TARGET_FLAG
- Urban areas as expected are more prone to crashes probably due to population
- High school has a positive impact, as opposed to the negative impact on TARGET_AMT
- Interesting Negative Predictors
- Bachelor degrees have a higher negative impact on TARGET_AMT. This may be due to a lack of people pursuing education past a Bachelors
- Managers have one of the highest negative impacts on TARGET_AMT. They see victims of car crashes
- RED_CAR has a negative impact. It is thought that red cars make people speed more
Model 5: 20 Variables Chosen by leaps (Untransformed)
##
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + INCOME + SINGLEPARENT +
## MARRIED + MALE + EDUCATIONBachelors + JOBDoctor + JOBManager +
## TRAVTIME + COMMERCIALUSE + BLUEBOOK + TIF + CAR_TYPEMinivan +
## CAR_TYPESports_Car + CAR_TYPESUV + CLM_FREQ + REVOKED + MVR_PTS +
## CAR_AGE + URBANICITY, family = "binomial", data = imputed$train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3199 -0.7168 -0.3988 0.5922 3.1077
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.933e+00 2.305e-01 -12.724 < 2e-16 ***
## KIDSDRIV 4.109e-01 6.689e-02 6.144 8.06e-10 ***
## INCOME -7.445e-06 1.014e-06 -7.343 2.09e-13 ***
## SINGLEPARENT 4.884e-01 1.138e-01 4.290 1.78e-05 ***
## MARRIED -6.093e-01 8.252e-02 -7.384 1.54e-13 ***
## MALE -2.468e-02 1.082e-01 -0.228 0.819633
## EDUCATIONBachelors -2.557e-01 8.145e-02 -3.139 0.001694 **
## JOBDoctor -3.911e-01 2.613e-01 -1.497 0.134401
## JOBManager -6.580e-01 1.267e-01 -5.194 2.06e-07 ***
## TRAVTIME 1.571e-02 2.270e-03 6.919 4.55e-12 ***
## COMMERCIALUSE 8.597e-01 8.422e-02 10.208 < 2e-16 ***
## BLUEBOOK -1.911e-05 5.105e-06 -3.743 0.000182 ***
## TIF -6.509e-02 8.873e-03 -7.336 2.20e-13 ***
## CAR_TYPEMinivan -4.856e-01 1.063e-01 -4.569 4.89e-06 ***
## CAR_TYPESports_Car 5.105e-01 1.546e-01 3.301 0.000963 ***
## CAR_TYPESUV 2.384e-01 1.332e-01 1.789 0.073564 .
## CLM_FREQ 1.427e-01 3.052e-02 4.676 2.92e-06 ***
## REVOKED 7.372e-01 9.759e-02 7.554 4.21e-14 ***
## MVR_PTS 1.110e-01 1.653e-02 6.713 1.90e-11 ***
## CAR_AGE -1.924e-02 7.015e-03 -2.743 0.006083 **
## URBANICITY 2.421e+00 1.371e-01 17.658 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6527.4 on 5711 degrees of freedom
## Residual deviance: 5081.2 on 5691 degrees of freedom
## AIC: 5123.2
##
## Number of Fisher Scoring iterations: 5
## llh llhNull G2 McFadden r2ML
## -2540.6105722 -3263.7239857 1446.2268269 0.2215608 0.2236804
## r2CU
## 0.3284285
- Using the pscl package, we can get a pseudo-R^2 value of ~.2216
- Interesting Positive Predictors
- COMMERCIALUSE now has the highest predictor
- Interesting Negative Predictors
- Managers are still the lowest negative predictor. Could be due to age
- Minivans have the lowest impact on crashes amongst the car types
- MALE is not a significant predictor but it has a highest impact
Model 6: 1 + Log Transform 20 Variables Chosen by leaps
##
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + INCOME + SINGLEPARENT +
## MARRIED + MALE + EDUCATIONBachelors + JOBDoctor + JOBManager +
## TRAVTIME + COMMERCIALUSE + BLUEBOOK + TIF + CAR_TYPEMinivan +
## CAR_TYPESports_Car + CAR_TYPESUV + CLM_FREQ + REVOKED + MVR_PTS +
## CAR_AGE + URBANICITY, family = "binomial", data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8040 -0.5819 -0.3530 0.1081 2.3171
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.08027 0.69569 -1.553 0.120471
## KIDSDRIV 0.57531 0.12351 4.658 3.20e-06 ***
## INCOME -0.05020 0.01220 -4.116 3.85e-05 ***
## SINGLEPARENT 0.53146 0.17257 3.080 0.002073 **
## MARRIED -0.69257 0.12896 -5.371 7.85e-08 ***
## MALE 0.02306 0.16966 0.136 0.891900
## EDUCATIONBachelors -0.28741 0.12907 -2.227 0.025965 *
## JOBDoctor -0.96752 0.42015 -2.303 0.021291 *
## JOBManager -0.93577 0.20455 -4.575 4.77e-06 ***
## TRAVTIME 0.35093 0.07057 4.973 6.60e-07 ***
## COMMERCIALUSE 0.96603 0.12901 7.488 6.98e-14 ***
## BLUEBOOK -0.27581 0.06531 -4.223 2.41e-05 ***
## TIF -0.30565 0.05349 -5.715 1.10e-08 ***
## CAR_TYPEMinivan -0.54785 0.16831 -3.255 0.001134 **
## CAR_TYPESports_Car 0.61278 0.23936 2.560 0.010466 *
## CAR_TYPESUV 0.38057 0.20481 1.858 0.063151 .
## CLM_FREQ 0.30578 0.07048 4.338 1.44e-05 ***
## REVOKED 0.81154 0.14547 5.579 2.42e-08 ***
## MVR_PTS 0.23386 0.05383 4.345 1.39e-05 ***
## CAR_AGE -0.17051 0.04874 -3.498 0.000469 ***
## URBANICITY 2.95792 0.22379 13.217 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3548.4 on 5710 degrees of freedom
## Residual deviance: 2696.9 on 5690 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 5448.5
##
## Number of Fisher Scoring iterations: 6
## llh llhNull G2 McFadden r2ML
## -2703.2717769 -3374.3968320 1342.2501104 0.1988874 0.2094520
## r2CU
## 0.3021311
- Using the pscl package, we can get a pseudo-R^2 value of ~.1987
- Interesting Positive Predictors
- URBANICITY now has the highest predictor
- SUV has a positie predictor
- Interesting Negative Predictors
- Managers are still the lowest negative predictor. Could be due to age
- Minivans have the lowest impact on crashes amongst the car types
- MALE is not a significant predictor but it has a highest impact
Select Models
We will use Model 3 and Model 5 based on their R^2 values and the fact that all or most of the variables are significant at the 5% level
Model 3
- The histogram of the residuals do not show a normal distribution
- The qqplot shows a fairly linear relationship with the tails of the plot venturing away from the line
- The residual plot does not display contant variance
Test Model
Model 5
- The histogram of the residuals do not show a normal distribution
- The qqplot shows a fairly linear relationship that ventures wildly away from linear towards the upper tail
- The residual plot shows no evidence of a constant variance
Test Model
## [1] 0.7264282
- The linear model correctly predicted at a 72% rate
Performance
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1124 146
## 1 443 440
##
## Accuracy : 0.7264
## 95% CI : (0.7071, 0.7452)
## No Information Rate : 0.7278
## P-Value [Acc > NIR] : 0.5687
##
## Kappa : 0.404
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7509
## Specificity : 0.7173
## Pos Pred Value : 0.4983
## Neg Pred Value : 0.8850
## Precision : 0.4983
## Recall : 0.7509
## F1 : 0.5990
## Prevalence : 0.2722
## Detection Rate : 0.2044
## Detection Prevalence : 0.4101
## Balanced Accuracy : 0.7341
##
## 'Positive' Class : 1
##
- The accuracy is only ~73%
- Positive prediction was very poor at below 50%
- Negative prediction was decent sitting at ~89%
- Sensitivity is ~75%
- Specificity is ~72%
- The F1 is ~60%
- The AUC is 0.6916703
- Over all, I’d say the model is pretty underwhelming in prediction power
Code Appendix
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(tidy = TRUE)
knitr::opts_chunk$set(warning = FALSE)
loadPkg <- function(x) {
if (!require(x, character.only = T))
install.packages(x, dependencies = T, repos = "http://cran.us.r-project.org")
require(x, character.only = T)
}
libs <- c("knitr", "magrittr", "data.table", "kableExtra", "caret", "pROC",
"missForest", "zoo", "ISLR", "leaps", "fBasics", "reshape2", "tidyverse",
"GGally", "gridExtra", "ROCR", "dummies", "pscl")
lapply(libs, loadPkg)
insTraining <- fread("https://raw.githubusercontent.com/baroncurtin2/data621/master/week4/insurance_training_data.csv") %>%
as.tibble()
insTest <- fread("https://raw.githubusercontent.com/baroncurtin2/data621/master/week4/insurance-evaluation-data.csv") %>%
as.tibble()
data_frame(explanatory_variables = names(insTraining)) %>% filter(!explanatory_variables %in%
c("TARGET_FLAG", "TARGET_AMT", "INDEX")) %>% arrange(explanatory_variables)
glimpse(insTraining)
insTraining %>% sapply(typeof) %>% as.data.frame() %>% rownames_to_column(var = "variable") %>%
rename(vartype = 2) %>% group_by(vartype) %>% summarise(count = n())
summary(insTraining)
# function to convert text monetary values to numeric values
convertText2Num <- function(x) {
x %>% # remove symbols and punctuation
str_replace_all("[\\$[:punct:]]", "") %>% # convert to number
as.numeric() %>% # replace NA with 0
if_else(is.na(.), 0, .)
}
insStats <- insTraining %>% mutate_at(c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM"),
convertText2Num) %>% select_if(is.numeric) %>% basicStats(.) %>% as_tibble() %>%
rownames_to_column() %>% gather(var, value, -rowname) %>% spread(rowname,
value) %>% rename_all(str_to_lower) %>% rename_all(str_trim) %>% rename(variables = "var",
q1 = `1. quartile`, q3 = `3. quartile`, max = maximum, min = minimum, na_vals = nas,
n = nobs, sd = stdev, var = variance) %>% mutate(obs = n - na_vals, range = max -
min, iqr = q3 - q1) %>% select(variables, n, na_vals, obs, mean, min, q1,
median, q3, max, sd, var, range, iqr, skewness, kurtosis)
insStats
insTraining %>% mutate_at(c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM"), convertText2Num) %>%
select_if(is.numeric) %>% cor(use = "na.or.complete") %>% as.data.frame() %>%
rownames_to_column(var = "predictor") %>% as_data_frame() %>% select(predictor,
TARGET_FLAG, TARGET_AMT) %>% filter(!predictor %in% c("INDEX", "TARGET_FLAG",
"TARGET_AMT")) %>% arrange(desc(TARGET_FLAG)) %>% mutate(flag_rank = dense_rank(desc(TARGET_FLAG)),
amt_rank = dense_rank(desc(TARGET_AMT)), rank_equal = flag_rank == amt_rank)
ggcorr(insTraining, palette = "RdBu", label = T, geom = "tile", size = 2)
vis <- insTraining %>% mutate_at(c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM"),
convertText2Num) %>% select_if(is.numeric) %>% melt(id.vars = "INDEX")
ggplot(vis, aes(value)) + geom_density(fill = "skyblue") + facet_wrap(~variable,
scales = "free")
insTraining %>% mutate_at(c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM"), convertText2Num) %>%
mutate(TARGET_FLAG = as.factor(TARGET_FLAG)) %>% keep(is.numeric) %>% gather() %>%
ggplot(aes(value)) + geom_histogram(bins = 25) + facet_wrap(~key, scales = "free")
ggplot(vis, aes(x = variable, y = value)) + geom_boxplot(show.legend = T) +
stat_summary(fun.y = mean, color = "red", geom = "point", shape = 18, size = 3) +
coord_flip()
vis %>% filter(!variable %in% c("HOME_VAL", "INCOME", "OLDCLAIM", "BLUEBOOK",
"TARGET_AMT")) %>% ggplot(aes(x = variable, y = value)) + geom_boxplot(show.legend = T) +
stat_summary(fun.y = mean, color = "red", geom = "point", shape = 18, size = 3) +
coord_flip()
set.seed(777)
smp_size <- floor(0.7 * nrow(insTraining))
train_ind <- sample(seq_len(nrow(insTraining)), size = smp_size)
datasets <- list(train = insTraining[train_ind, ], test = insTest, test2 = insTraining[-train_ind,
])
# helper function for converting text to number
convertText2Num <- function(x) {
x %>% # remove symbols and punctuation
str_replace_all("[\\$[:punct:]]", "") %>% # convert to number
as.numeric()
}
datasets %<>% map(function(df) {
df %<>% # convert text to num
mutate_at(c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM"), convertText2Num)
# return dataframe
return(df)
})
charCols <- datasets$train %>% select_if(is.character) %>% lapply(function(x) unique(x)) %>%
print
datasets %<>% map(function(df) {
df %<>% mutate(EDUCATION = str_replace_all(EDUCATION, "<", ""), RED_CAR = str_to_title(RED_CAR),
URBANICITY = str_replace(URBANICITY, "/ ", "/")) %>% mutate_at(names(charCols),
str_replace_all, pattern = "z_", replacement = "")
# return df
return(df)
})
datasets %<>% map(function(df) {
df %<>% # PARENT1 rename to SINGLEPARENT / conversion
rename(SINGLEPARENT = PARENT1) %>% mutate(SINGLEPARENT = if_else(SINGLEPARENT ==
"Yes", 1, 0)) %>% # MSTATUS rename to MARRIED / conversion
rename(MARRIED = MSTATUS) %>% mutate(MARRIED = if_else(MARRIED == "Yes",
1, 0)) %>% # SEX rename to MALE / conversion
rename(MALE = SEX) %>% mutate(MALE = if_else(MALE == "M", 1, 0)) %>% # CAR_USE rename to COMMERCIALUSE / conversion
rename(COMMERCIALUSE = CAR_USE) %>% mutate(COMMERCIALUSE = if_else(COMMERCIALUSE ==
"Commercial", 1, 0)) %>% # RED_CAR conversion
mutate(RED_CAR = if_else(RED_CAR == "Yes", 1, 0)) %>% # REVOKED conversion
mutate(REVOKED = if_else(REVOKED == "Yes", 1, 0)) %>% # URBANICITY
mutate(URBANICITY = if_else(str_detect(URBANICITY, "Urban"), 1, 0))
})
datasets %<>% map(function(df) {
# create dummy data frame
df <- df %>% as.data.frame() %>% dummy.data.frame(names = c("EDUCATION",
"JOB", "CAR_TYPE"))
# return df
return(df)
})
replace_space <- function(x) str_replace(x, " ", "_")
datasets %<>% map(function(df) {
# rename all with spaces
df %<>% rename_all(replace_space)
# return df
return(df)
})
forests <- datasets %>% map(function(df) {
df %<>% as.data.frame() %>% select_if(is.numeric) %>% select(-INDEX) %>%
missForest()
# return df
return(df)
})
imputed <- forests %>% map(function(x) {
return(x$ximp)
})
mlrs <- regsubsets(TARGET_AMT ~ . - TARGET_FLAG, data = imputed$train, method = "exhaustive",
nvmax = NULL, nbest = 1)
mlrs.summary <- summary(mlrs)
print(mlrs.summary)
# determine best subset
plot(mlrs.summary$cp, xlab = "Number of Variables", ylab = "Cp")
points(which.min(mlrs.summary$cp), mlrs.summary$cp[which.min(mlrs.summary$cp)],
pch = 20, col = "red")
# cp plot par(mfrow=c(1,2))
plot(mlrs, scale = "Cp", main = "Cp")
# r^2 splot
plot(mlrs, scale = "adjr2", main = "Adjusted R^2")
m1 <- lm(TARGET_AMT ~ . - TARGET_FLAG, data = imputed$train)
summary(m1)
m2 <- lm(TARGET_AMT ~ KIDSDRIV + INCOME + SINGLEPARENT + MARRIED + MALE + EDUCATIONBachelors +
JOBDoctor + JOBManager + TRAVTIME + COMMERCIALUSE + BLUEBOOK + TIF + CAR_TYPEMinivan +
CAR_TYPESports_Car + CAR_TYPESUV + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE +
URBANICITY, data = imputed$train)
summary(m2)
m3 <- imputed$train %>% mutate_all(~log(1 + .x)) %>% lm(TARGET_AMT ~ KIDSDRIV +
INCOME + SINGLEPARENT + MARRIED + MALE + EDUCATIONBachelors + JOBDoctor +
JOBManager + TRAVTIME + COMMERCIALUSE + BLUEBOOK + TIF + CAR_TYPEMinivan +
CAR_TYPESports_Car + CAR_TYPESUV + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE +
URBANICITY, data = .)
summary(m3)
m4 <- glm(TARGET_FLAG ~ . - TARGET_AMT, data = imputed$train, family = "binomial")
summary(m4)
pR2(m4)
m5 <- glm(TARGET_FLAG ~ KIDSDRIV + INCOME + SINGLEPARENT + MARRIED + MALE +
EDUCATIONBachelors + JOBDoctor + JOBManager + TRAVTIME + COMMERCIALUSE +
BLUEBOOK + TIF + CAR_TYPEMinivan + CAR_TYPESports_Car + CAR_TYPESUV + CLM_FREQ +
REVOKED + MVR_PTS + CAR_AGE + URBANICITY, data = imputed$train, family = "binomial")
summary(m5)
pR2(m5)
d <- imputed$train %>% mutate_all(~log(1 + .x))
m6 <- glm(TARGET_FLAG ~ KIDSDRIV + INCOME + SINGLEPARENT + MARRIED + MALE +
EDUCATIONBachelors + JOBDoctor + JOBManager + TRAVTIME + COMMERCIALUSE +
BLUEBOOK + TIF + CAR_TYPEMinivan + CAR_TYPESports_Car + CAR_TYPESUV + CLM_FREQ +
REVOKED + MVR_PTS + CAR_AGE + URBANICITY, data = d, family = "binomial")
summary(m6)
pR2(m6)
par(mfrow = c(2, 2))
plot(m3)
hist(m3$residuals)
qqnorm(m3$residuals)
qqline(m3$residuals)
m3_results <- predict(m3, newdata = datasets$test2)
datasets$test <- bind_cols(datasets$test2, data_frame(m3results = m3_results)) %>%
mutate(m3results = if_else(m3results < 0, 0, m3results), m3amt_match = TARGET_AMT ==
m3results)
datasets$test
par(mfrow = c(2, 2))
plot(m5)
hist(m5$residuals)
qqnorm(m5$residuals)
qqline(m5$residuals)
m5_results <- predict(m5, newdata = datasets$test2, type = "response")
datasets$test <- bind_cols(datasets$test2, data_frame(m5results = m5_results)) %>%
mutate(m5results = if_else(m5_results > mean(TARGET_FLAG), 1, 0), m5flag_match = m5results ==
TARGET_FLAG)
datasets$test
mean(datasets$test$m5flag_match, na.rm = T)
cm <- confusionMatrix(as.factor(datasets$test$m5results), as.factor(datasets$test$TARGET_FLAG),
positive = "1", mode = "everything") %>% print
curveRoc <- roc(datasets$test$m5results, datasets$test$TARGET_FLAG)
plot(curveRoc, legacy.axes = T, main = "pROC")