Data 621 - Week4

Cover Page

Data 621 - Week 4 HW

Baron Curtin

CUNY School of Professional Studies

Introduction

The purpose of this assignment is to generate multiple linear regression and binary logistic regression models based on a training dataset to predict the probability that a person will crash their car and the amount of money the crash costs

Data Exploration

Non-Visual Inspection

Variables

Response Variable:
- TARGET_FLAG: binary indicator of whether the car was involved in a crash
- TARGET_AMT: cost of the crash
Explanatory Variables
Identification Variable:
- INDEX: will not be used in analysis

## Observations: 8,161
## Variables: 26
## $ INDEX       <int> 1, 2, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 17, 1...
## $ TARGET_FLAG <int> 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0,...
## $ TARGET_AMT  <dbl> 0.000, 0.000, 0.000, 0.000, 0.000, 2946.000, 0.000...
## $ KIDSDRIV    <int> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ AGE         <int> 60, 43, 35, 51, 50, 34, 54, 37, 34, 50, 53, 43, 55...
## $ HOMEKIDS    <int> 0, 0, 1, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 3, 0,...
## $ YOJ         <int> 11, 11, 10, 14, NA, 12, NA, NA, 10, 7, 14, 5, 11, ...
## $ INCOME      <chr> "$67,349", "$91,449", "$16,039", "", "$114,986", "...
## $ PARENT1     <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "...
## $ HOME_VAL    <chr> "$0", "$257,252", "$124,191", "$306,251", "$243,92...
## $ MSTATUS     <chr> "z_No", "z_No", "Yes", "Yes", "Yes", "z_No", "Yes"...
## $ SEX         <chr> "M", "M", "z_F", "M", "z_F", "z_F", "z_F", "M", "z...
## $ EDUCATION   <chr> "PhD", "z_High School", "z_High School", "<High Sc...
## $ JOB         <chr> "Professional", "z_Blue Collar", "Clerical", "z_Bl...
## $ TRAVTIME    <int> 14, 22, 5, 32, 36, 46, 33, 44, 34, 48, 15, 36, 25,...
## $ CAR_USE     <chr> "Private", "Commercial", "Private", "Private", "Pr...
## $ BLUEBOOK    <chr> "$14,230", "$14,940", "$4,010", "$15,440", "$18,00...
## $ TIF         <int> 11, 1, 4, 7, 1, 1, 1, 1, 1, 7, 1, 7, 7, 6, 1, 6, 6...
## $ CAR_TYPE    <chr> "Minivan", "Minivan", "z_SUV", "Minivan", "z_SUV",...
## $ RED_CAR     <chr> "yes", "yes", "no", "yes", "no", "no", "no", "yes"...
## $ OLDCLAIM    <chr> "$4,461", "$0", "$38,690", "$0", "$19,217", "$0", ...
## $ CLM_FREQ    <int> 2, 0, 2, 0, 2, 0, 0, 1, 0, 0, 0, 0, 2, 0, 0, 0, 0,...
## $ REVOKED     <chr> "No", "No", "No", "No", "Yes", "No", "No", "Yes", ...
## $ MVR_PTS     <int> 3, 0, 3, 0, 3, 0, 0, 10, 0, 1, 0, 0, 3, 3, 3, 0, 0...
## $ CAR_AGE     <int> 18, 1, 10, 6, 17, 7, 1, 7, 1, 17, 11, 1, 9, 10, 5,...
## $ URBANICITY  <chr> "Highly Urban/ Urban", "Highly Urban/ Urban", "Hig...

The glimpse function of dplyr shows that there are 8161 observations and 26 variables * INCOME, HOME_VAL, OLDCLAIM are interestingly character type but contain monetary values + These will have to be converted from char to integer in the data preparation stage * MSTATUS, SEX, EDUCATION, URBANiCITY will have to be modified in the data preparation stage as there are inconsitent values

There are 14 fields of the character data type
1 field of the double data type
11 fields of the interger data type

##      INDEX        TARGET_FLAG       TARGET_AMT        KIDSDRIV     
##  Min.   :    1   Min.   :0.0000   Min.   :     0   Min.   :0.0000  
##  1st Qu.: 2559   1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000  
##  Median : 5133   Median :0.0000   Median :     0   Median :0.0000  
##  Mean   : 5152   Mean   :0.2638   Mean   :  1504   Mean   :0.1711  
##  3rd Qu.: 7745   3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000  
##  Max.   :10302   Max.   :1.0000   Max.   :107586   Max.   :4.0000  
##                                                                    
##       AGE           HOMEKIDS           YOJ          INCOME         
##  Min.   :16.00   Min.   :0.0000   Min.   : 0.0   Length:8161       
##  1st Qu.:39.00   1st Qu.:0.0000   1st Qu.: 9.0   Class :character  
##  Median :45.00   Median :0.0000   Median :11.0   Mode  :character  
##  Mean   :44.79   Mean   :0.7212   Mean   :10.5                     
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:13.0                     
##  Max.   :81.00   Max.   :5.0000   Max.   :23.0                     
##  NA's   :6                        NA's   :454                      
##    PARENT1            HOME_VAL           MSTATUS         
##  Length:8161        Length:8161        Length:8161       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##      SEX             EDUCATION             JOB               TRAVTIME     
##  Length:8161        Length:8161        Length:8161        Min.   :  5.00  
##  Class :character   Class :character   Class :character   1st Qu.: 22.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 33.00  
##                                                           Mean   : 33.49  
##                                                           3rd Qu.: 44.00  
##                                                           Max.   :142.00  
##                                                                           
##    CAR_USE            BLUEBOOK              TIF           CAR_TYPE        
##  Length:8161        Length:8161        Min.   : 1.000   Length:8161       
##  Class :character   Class :character   1st Qu.: 1.000   Class :character  
##  Mode  :character   Mode  :character   Median : 4.000   Mode  :character  
##                                        Mean   : 5.351                     
##                                        3rd Qu.: 7.000                     
##                                        Max.   :25.000                     
##                                                                           
##    RED_CAR            OLDCLAIM            CLM_FREQ        REVOKED         
##  Length:8161        Length:8161        Min.   :0.0000   Length:8161       
##  Class :character   Class :character   1st Qu.:0.0000   Class :character  
##  Mode  :character   Mode  :character   Median :0.0000   Mode  :character  
##                                        Mean   :0.7986                     
##                                        3rd Qu.:2.0000                     
##                                        Max.   :5.0000                     
##                                                                           
##     MVR_PTS          CAR_AGE        URBANICITY       
##  Min.   : 0.000   Min.   :-3.000   Length:8161       
##  1st Qu.: 0.000   1st Qu.: 1.000   Class :character  
##  Median : 1.000   Median : 8.000   Mode  :character  
##  Mean   : 1.696   Mean   : 8.328                     
##  3rd Qu.: 3.000   3rd Qu.:12.000                     
##  Max.   :13.000   Max.   :28.000                     
##                   NA's   :510

YOJ and AGE have missing values, they will need to be imputed
CAR_AGE has a negative value. How is that possible?

Basic Stats

basicStats further confirms that CAR_AGE and YOJ have missing values
HOME_VAL has the highest variance amongst the variables
The largest skew value is ~8.7 in the TARGET_AMT variable

Correlation

MVR_PTS has the highest positive correlation with TARGET_FLAG
- You would typically expect someone with a lot of points on their license to get in more accidents
HOME_VAL has the highest negative correlation with TARGET_FLAG
- I was not expecting HOME_VAL to have the highest negative correlation
HOMEKIDS has a positive correlation
- I would think that parents would be more responsible drivers
Rankings were added to show if some predictors are more highly correlated with TARGET_FLAG vs TARGET_AMT
- BLUEBOOK is more positively correlated with TARGET_AMT than TARGET_FLAG, which is to be expected cause the higher a car’s value, the higher the payout

The correlation matrix above shows that TARGET_FLAG and TARGET_AMT and KIDSDRIV and HOMEKIDS are the two sets of most highly correlated variables
- This makes sense because when kids are at home, they are likely to drive the parents car because they often can’t afford one of their own
- TARGET_FLAG and TARGET_AMT make sense because if you’re in a crash, you are very likely to be paying for the cost of the crash
Outside of the two sets of variables outlined above, MVR_PTS and CLM_FREQ are the most highly correlated variables
- People with points on their record are more likely to file claims more frequently

Visual Inspection

Density Plots

AGE is the only normally distributed variable
- Interesting as this provides some evidence that the data is adequate for inference
All of the other variables are multi-modal and assymetric

Histograms

The histograms further reiniforce the conclusions made in the density plots

Box Plots

HOME_VAL and INCOME appear to have the highest variances
- This confirms what we saw in basicStats

Removing HOME_VAL, INCOME, OLDCLAIM, BLUEBOOK, TARGET_AMT…

The means and medians are pretty close to each other for all of the variables, displaying evidence of very slight skews

Data Preparation

Transforming Monetary Columns to Numeric

Converting the currency columns to numeric will be important when modelling

Transform Inconsitent Observations Within Columns

There was evidence of inconsistent naming across observations. We will transform those variables here. We will first create a list of all the character columns and the values they contain

## $PARENT1
## [1] "No"  "Yes"
## 
## $MSTATUS
## [1] "Yes"  "z_No"
## 
## $SEX
## [1] "z_F" "M"  
## 
## $EDUCATION
## [1] "z_High School" "Bachelors"     "Masters"       "<High School" 
## [5] "PhD"          
## 
## $JOB
## [1] "z_Blue Collar" "Manager"       "Lawyer"        "Student"      
## [5] "Home Maker"    "Doctor"        "Clerical"      ""             
## [9] "Professional" 
## 
## $CAR_USE
## [1] "Private"    "Commercial"
## 
## $CAR_TYPE
## [1] "z_SUV"       "Minivan"     "Van"         "Pickup"      "Panel Truck"
## [6] "Sports Car" 
## 
## $RED_CAR
## [1] "no"  "yes"
## 
## $REVOKED
## [1] "No"  "Yes"
## 
## $URBANICITY
## [1] "Highly Urban/ Urban"   "z_Highly Rural/ Rural"

Many of the fields have values prefixed with “z_” which can be removed
EDUCATION field has an addition “<” that prefixes High School which also can be removed
RED_CHAR will be transformed so that the first letter is capitalized
URBANICITY will be trimmed to remove extra spaces and renamed to URBANCITY

Create Binary Dummy Variable Columns for Categorical Variables

First we will take CAR_USE, PARENT1, MSTATUS, SEX, RED_CAR, REVOKED, URBANCITY and convert them to binary variables

SINGLEPARENT will have value of 1 when TRUE, else 0
MARRIED will have value of 1 when TRUE, else 0
RED_CAR will have value of 1 when TRUE, else 0
MALE will have value of 1 when TRUE, else 0
COMMERCIALUSE will have value of 1 when TRUE, else 0
REVOKED will have value of 1 when TRUE, else 0
URBANICITY will have value of 1 when TRUE, else 0

The rest of the variables will be dummified using the dummies package

Converting all of the character fields to to binary columns will make missForest’s job easier as well as provide a better regression model
The remaining variables are EDUCATION, JOB, CAR_TYPE

The fields with spaces will be renamed to include _

Missing Value Imputation

We can use the package missForest to impute values for the NAs

##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!

Build Models

Leaps Subsetting

We can use the leaps package to subset the explanatory variables to find the best model. All of the character variables have already been converted to factors and subsequently numerical values and all of the missing values have been imputed using missForest

## Reordering variables and trying again:

## Subset selection object
## Call: regsubsets.formula(TARGET_AMT ~ . - TARGET_FLAG, data = imputed$train, 
##     method = "exhaustive", nvmax = NULL, nbest = 1)
## 39 Variables  (and intercept)
##                      Forced in Forced out
## KIDSDRIV                 FALSE      FALSE
## AGE                      FALSE      FALSE
## HOMEKIDS                 FALSE      FALSE
## YOJ                      FALSE      FALSE
## INCOME                   FALSE      FALSE
## SINGLEPARENT             FALSE      FALSE
## HOME_VAL                 FALSE      FALSE
## MARRIED                  FALSE      FALSE
## MALE                     FALSE      FALSE
## EDUCATIONBachelors       FALSE      FALSE
## EDUCATIONHigh_School     FALSE      FALSE
## EDUCATIONMasters         FALSE      FALSE
## JOB                      FALSE      FALSE
## JOBBlue_Collar           FALSE      FALSE
## JOBClerical              FALSE      FALSE
## JOBDoctor                FALSE      FALSE
## JOBHome_Maker            FALSE      FALSE
## JOBLawyer                FALSE      FALSE
## JOBManager               FALSE      FALSE
## JOBProfessional          FALSE      FALSE
## TRAVTIME                 FALSE      FALSE
## COMMERCIALUSE            FALSE      FALSE
## BLUEBOOK                 FALSE      FALSE
## TIF                      FALSE      FALSE
## CAR_TYPEMinivan          FALSE      FALSE
## CAR_TYPEPanel_Truck      FALSE      FALSE
## CAR_TYPEPickup           FALSE      FALSE
## CAR_TYPESports_Car       FALSE      FALSE
## CAR_TYPESUV              FALSE      FALSE
## RED_CAR                  FALSE      FALSE
## OLDCLAIM                 FALSE      FALSE
## CLM_FREQ                 FALSE      FALSE
## REVOKED                  FALSE      FALSE
## MVR_PTS                  FALSE      FALSE
## CAR_AGE                  FALSE      FALSE
## URBANICITY               FALSE      FALSE
## EDUCATIONPhD             FALSE      FALSE
## JOBStudent               FALSE      FALSE
## CAR_TYPEVan              FALSE      FALSE
## 1 subsets of each size up to 36
## Selection Algorithm: exhaustive
##           KIDSDRIV AGE HOMEKIDS YOJ INCOME SINGLEPARENT HOME_VAL MARRIED
## 1  ( 1 )  " "      " " " "      " " " "    " "          " "      " "    
## 2  ( 1 )  " "      " " " "      " " " "    " "          "*"      " "    
## 3  ( 1 )  " "      " " " "      " " " "    " "          "*"      " "    
## 4  ( 1 )  " "      " " " "      " " " "    " "          "*"      " "    
## 5  ( 1 )  " "      " " " "      " " " "    " "          "*"      " "    
## 6  ( 1 )  " "      " " " "      " " " "    " "          "*"      " "    
## 7  ( 1 )  " "      " " " "      " " " "    "*"          "*"      " "    
## 8  ( 1 )  " "      " " " "      " " " "    "*"          "*"      " "    
## 9  ( 1 )  " "      " " " "      " " " "    "*"          "*"      " "    
## 10  ( 1 ) " "      " " " "      " " " "    "*"          "*"      " "    
## 11  ( 1 ) " "      " " " "      " " " "    "*"          "*"      " "    
## 12  ( 1 ) "*"      " " " "      " " " "    "*"          "*"      " "    
## 13  ( 1 ) "*"      " " " "      " " " "    "*"          "*"      " "    
## 14  ( 1 ) "*"      " " " "      " " " "    "*"          "*"      " "    
## 15  ( 1 ) "*"      " " " "      " " " "    "*"          "*"      " "    
## 16  ( 1 ) "*"      " " " "      " " " "    "*"          "*"      "*"    
## 17  ( 1 ) "*"      " " " "      " " "*"    "*"          "*"      "*"    
## 18  ( 1 ) "*"      " " " "      " " "*"    "*"          "*"      "*"    
## 19  ( 1 ) "*"      " " " "      " " "*"    "*"          "*"      "*"    
## 20  ( 1 ) "*"      " " " "      " " "*"    "*"          "*"      "*"    
## 21  ( 1 ) "*"      " " " "      " " "*"    "*"          "*"      "*"    
## 22  ( 1 ) "*"      " " " "      " " "*"    "*"          "*"      "*"    
## 23  ( 1 ) "*"      " " " "      " " "*"    "*"          "*"      "*"    
## 24  ( 1 ) "*"      " " " "      " " "*"    "*"          "*"      "*"    
## 25  ( 1 ) "*"      " " " "      " " "*"    "*"          "*"      "*"    
## 26  ( 1 ) "*"      " " " "      "*" "*"    "*"          "*"      "*"    
## 27  ( 1 ) "*"      " " " "      " " "*"    "*"          "*"      "*"    
## 28  ( 1 ) "*"      " " " "      "*" "*"    "*"          "*"      "*"    
## 29  ( 1 ) "*"      " " "*"      "*" "*"    "*"          "*"      "*"    
## 30  ( 1 ) "*"      "*" "*"      "*" "*"    "*"          "*"      "*"    
## 31  ( 1 ) "*"      "*" "*"      "*" "*"    "*"          "*"      "*"    
## 32  ( 1 ) "*"      "*" "*"      "*" "*"    "*"          "*"      "*"    
## 33  ( 1 ) "*"      "*" "*"      "*" "*"    "*"          "*"      "*"    
## 34  ( 1 ) "*"      "*" "*"      "*" "*"    "*"          "*"      "*"    
## 35  ( 1 ) "*"      "*" "*"      "*" "*"    "*"          "*"      "*"    
## 36  ( 1 ) "*"      "*" "*"      "*" "*"    "*"          "*"      "*"    
##           MALE EDUCATIONBachelors EDUCATIONHigh_School EDUCATIONMasters
## 1  ( 1 )  " "  " "                " "                  " "             
## 2  ( 1 )  " "  " "                " "                  " "             
## 3  ( 1 )  " "  " "                " "                  " "             
## 4  ( 1 )  " "  " "                " "                  " "             
## 5  ( 1 )  " "  " "                " "                  " "             
## 6  ( 1 )  " "  " "                " "                  " "             
## 7  ( 1 )  " "  " "                " "                  " "             
## 8  ( 1 )  " "  " "                " "                  " "             
## 9  ( 1 )  " "  " "                " "                  " "             
## 10  ( 1 ) " "  " "                " "                  " "             
## 11  ( 1 ) " "  " "                " "                  " "             
## 12  ( 1 ) " "  " "                " "                  " "             
## 13  ( 1 ) " "  " "                " "                  " "             
## 14  ( 1 ) " "  " "                " "                  " "             
## 15  ( 1 ) " "  " "                " "                  " "             
## 16  ( 1 ) " "  " "                " "                  " "             
## 17  ( 1 ) " "  " "                " "                  " "             
## 18  ( 1 ) " "  " "                " "                  " "             
## 19  ( 1 ) "*"  " "                " "                  " "             
## 20  ( 1 ) "*"  " "                " "                  " "             
## 21  ( 1 ) "*"  " "                " "                  " "             
## 22  ( 1 ) "*"  " "                " "                  " "             
## 23  ( 1 ) "*"  " "                " "                  " "             
## 24  ( 1 ) "*"  " "                " "                  " "             
## 25  ( 1 ) "*"  " "                " "                  " "             
## 26  ( 1 ) "*"  " "                " "                  " "             
## 27  ( 1 ) "*"  " "                " "                  "*"             
## 28  ( 1 ) "*"  " "                " "                  "*"             
## 29  ( 1 ) "*"  " "                " "                  "*"             
## 30  ( 1 ) "*"  " "                " "                  "*"             
## 31  ( 1 ) "*"  " "                " "                  "*"             
## 32  ( 1 ) "*"  " "                " "                  "*"             
## 33  ( 1 ) "*"  " "                " "                  "*"             
## 34  ( 1 ) "*"  " "                " "                  "*"             
## 35  ( 1 ) "*"  "*"                "*"                  "*"             
## 36  ( 1 ) "*"  "*"                "*"                  "*"             
##           EDUCATIONPhD JOB JOBBlue_Collar JOBClerical JOBDoctor
## 1  ( 1 )  " "          " " " "            " "         " "      
## 2  ( 1 )  " "          " " " "            " "         " "      
## 3  ( 1 )  " "          " " " "            " "         " "      
## 4  ( 1 )  " "          " " " "            " "         " "      
## 5  ( 1 )  " "          " " " "            " "         " "      
## 6  ( 1 )  " "          " " " "            " "         " "      
## 7  ( 1 )  " "          " " " "            " "         " "      
## 8  ( 1 )  " "          " " " "            " "         " "      
## 9  ( 1 )  " "          " " " "            " "         " "      
## 10  ( 1 ) " "          " " " "            " "         " "      
## 11  ( 1 ) " "          " " " "            " "         " "      
## 12  ( 1 ) " "          " " " "            " "         " "      
## 13  ( 1 ) " "          " " " "            " "         " "      
## 14  ( 1 ) " "          " " " "            " "         " "      
## 15  ( 1 ) " "          " " "*"            " "         " "      
## 16  ( 1 ) " "          " " "*"            " "         " "      
## 17  ( 1 ) " "          " " "*"            " "         " "      
## 18  ( 1 ) " "          " " "*"            " "         " "      
## 19  ( 1 ) " "          " " " "            " "         " "      
## 20  ( 1 ) " "          " " " "            " "         " "      
## 21  ( 1 ) " "          " " " "            " "         " "      
## 22  ( 1 ) " "          " " " "            " "         " "      
## 23  ( 1 ) " "          " " " "            " "         "*"      
## 24  ( 1 ) "*"          " " " "            " "         "*"      
## 25  ( 1 ) "*"          " " " "            " "         "*"      
## 26  ( 1 ) "*"          " " " "            " "         "*"      
## 27  ( 1 ) "*"          " " "*"            "*"         "*"      
## 28  ( 1 ) "*"          " " "*"            "*"         "*"      
## 29  ( 1 ) "*"          " " "*"            "*"         "*"      
## 30  ( 1 ) "*"          " " "*"            "*"         "*"      
## 31  ( 1 ) "*"          " " "*"            "*"         "*"      
## 32  ( 1 ) "*"          " " "*"            "*"         "*"      
## 33  ( 1 ) "*"          " " "*"            "*"         "*"      
## 34  ( 1 ) "*"          "*" "*"            "*"         "*"      
## 35  ( 1 ) " "          "*" "*"            "*"         "*"      
## 36  ( 1 ) " "          "*" "*"            "*"         "*"      
##           JOBHome_Maker JOBLawyer JOBManager JOBProfessional JOBStudent
## 1  ( 1 )  " "           " "       " "        " "             " "       
## 2  ( 1 )  " "           " "       " "        " "             " "       
## 3  ( 1 )  " "           " "       " "        " "             " "       
## 4  ( 1 )  " "           " "       " "        " "             " "       
## 5  ( 1 )  " "           " "       " "        " "             " "       
## 6  ( 1 )  " "           " "       "*"        " "             " "       
## 7  ( 1 )  " "           " "       "*"        " "             " "       
## 8  ( 1 )  " "           " "       "*"        " "             " "       
## 9  ( 1 )  " "           " "       "*"        " "             " "       
## 10  ( 1 ) " "           " "       "*"        " "             " "       
## 11  ( 1 ) " "           " "       "*"        " "             " "       
## 12  ( 1 ) " "           " "       "*"        " "             " "       
## 13  ( 1 ) " "           " "       "*"        " "             " "       
## 14  ( 1 ) " "           " "       "*"        " "             " "       
## 15  ( 1 ) " "           " "       "*"        " "             " "       
## 16  ( 1 ) " "           " "       "*"        " "             " "       
## 17  ( 1 ) " "           " "       "*"        " "             " "       
## 18  ( 1 ) " "           " "       "*"        " "             " "       
## 19  ( 1 ) " "           " "       "*"        " "             " "       
## 20  ( 1 ) " "           " "       "*"        " "             "*"       
## 21  ( 1 ) " "           " "       "*"        " "             " "       
## 22  ( 1 ) " "           " "       "*"        " "             "*"       
## 23  ( 1 ) " "           " "       "*"        " "             "*"       
## 24  ( 1 ) " "           " "       "*"        " "             "*"       
## 25  ( 1 ) "*"           " "       "*"        " "             "*"       
## 26  ( 1 ) "*"           " "       "*"        " "             "*"       
## 27  ( 1 ) " "           " "       "*"        "*"             " "       
## 28  ( 1 ) " "           " "       "*"        "*"             " "       
## 29  ( 1 ) " "           " "       "*"        "*"             " "       
## 30  ( 1 ) " "           " "       "*"        "*"             " "       
## 31  ( 1 ) " "           " "       "*"        "*"             " "       
## 32  ( 1 ) " "           "*"       "*"        "*"             " "       
## 33  ( 1 ) " "           "*"       "*"        "*"             " "       
## 34  ( 1 ) " "           "*"       "*"        "*"             " "       
## 35  ( 1 ) " "           "*"       "*"        "*"             " "       
## 36  ( 1 ) "*"           "*"       "*"        "*"             " "       
##           TRAVTIME COMMERCIALUSE BLUEBOOK TIF CAR_TYPEMinivan
## 1  ( 1 )  " "      " "           " "      " " " "            
## 2  ( 1 )  " "      " "           " "      " " " "            
## 3  ( 1 )  " "      "*"           " "      " " " "            
## 4  ( 1 )  " "      "*"           " "      " " " "            
## 5  ( 1 )  " "      "*"           " "      " " "*"            
## 6  ( 1 )  " "      "*"           " "      " " "*"            
## 7  ( 1 )  " "      "*"           " "      " " "*"            
## 8  ( 1 )  " "      "*"           " "      "*" "*"            
## 9  ( 1 )  " "      "*"           " "      "*" "*"            
## 10  ( 1 ) "*"      "*"           " "      "*" "*"            
## 11  ( 1 ) "*"      "*"           " "      "*" "*"            
## 12  ( 1 ) "*"      "*"           " "      "*" "*"            
## 13  ( 1 ) "*"      "*"           " "      "*" "*"            
## 14  ( 1 ) "*"      "*"           " "      "*" "*"            
## 15  ( 1 ) "*"      "*"           " "      "*" "*"            
## 16  ( 1 ) "*"      "*"           " "      "*" "*"            
## 17  ( 1 ) "*"      "*"           " "      "*" "*"            
## 18  ( 1 ) "*"      "*"           "*"      "*" "*"            
## 19  ( 1 ) "*"      "*"           "*"      "*" " "            
## 20  ( 1 ) "*"      "*"           "*"      "*" " "            
## 21  ( 1 ) "*"      "*"           "*"      "*" " "            
## 22  ( 1 ) "*"      "*"           "*"      "*" " "            
## 23  ( 1 ) "*"      "*"           "*"      "*" " "            
## 24  ( 1 ) "*"      "*"           "*"      "*" " "            
## 25  ( 1 ) "*"      "*"           "*"      "*" " "            
## 26  ( 1 ) "*"      "*"           "*"      "*" " "            
## 27  ( 1 ) "*"      "*"           "*"      "*" " "            
## 28  ( 1 ) "*"      "*"           "*"      "*" " "            
## 29  ( 1 ) "*"      "*"           "*"      "*" " "            
## 30  ( 1 ) "*"      "*"           "*"      "*" "*"            
## 31  ( 1 ) "*"      "*"           "*"      "*" "*"            
## 32  ( 1 ) "*"      "*"           "*"      "*" "*"            
## 33  ( 1 ) "*"      "*"           "*"      "*" " "            
## 34  ( 1 ) "*"      "*"           "*"      "*" "*"            
## 35  ( 1 ) "*"      "*"           "*"      "*" " "            
## 36  ( 1 ) "*"      "*"           "*"      "*" "*"            
##           CAR_TYPEPanel_Truck CAR_TYPEPickup CAR_TYPESports_Car
## 1  ( 1 )  " "                 " "            " "               
## 2  ( 1 )  " "                 " "            " "               
## 3  ( 1 )  " "                 " "            " "               
## 4  ( 1 )  " "                 " "            " "               
## 5  ( 1 )  " "                 " "            " "               
## 6  ( 1 )  " "                 " "            " "               
## 7  ( 1 )  " "                 " "            " "               
## 8  ( 1 )  " "                 " "            " "               
## 9  ( 1 )  " "                 " "            " "               
## 10  ( 1 ) " "                 " "            " "               
## 11  ( 1 ) " "                 " "            " "               
## 12  ( 1 ) " "                 " "            " "               
## 13  ( 1 ) " "                 " "            "*"               
## 14  ( 1 ) " "                 " "            "*"               
## 15  ( 1 ) " "                 " "            "*"               
## 16  ( 1 ) " "                 " "            "*"               
## 17  ( 1 ) " "                 " "            "*"               
## 18  ( 1 ) " "                 " "            "*"               
## 19  ( 1 ) " "                 "*"            "*"               
## 20  ( 1 ) " "                 "*"            "*"               
## 21  ( 1 ) " "                 "*"            "*"               
## 22  ( 1 ) " "                 "*"            "*"               
## 23  ( 1 ) " "                 "*"            "*"               
## 24  ( 1 ) " "                 "*"            "*"               
## 25  ( 1 ) " "                 "*"            "*"               
## 26  ( 1 ) " "                 "*"            "*"               
## 27  ( 1 ) " "                 "*"            "*"               
## 28  ( 1 ) " "                 "*"            "*"               
## 29  ( 1 ) " "                 "*"            "*"               
## 30  ( 1 ) "*"                 "*"            "*"               
## 31  ( 1 ) "*"                 "*"            "*"               
## 32  ( 1 ) "*"                 "*"            "*"               
## 33  ( 1 ) "*"                 "*"            "*"               
## 34  ( 1 ) "*"                 "*"            "*"               
## 35  ( 1 ) "*"                 "*"            "*"               
## 36  ( 1 ) "*"                 "*"            "*"               
##           CAR_TYPESUV CAR_TYPEVan RED_CAR OLDCLAIM CLM_FREQ REVOKED
## 1  ( 1 )  " "         " "         " "     " "      " "      " "    
## 2  ( 1 )  " "         " "         " "     " "      " "      " "    
## 3  ( 1 )  " "         " "         " "     " "      " "      " "    
## 4  ( 1 )  " "         " "         " "     " "      " "      " "    
## 5  ( 1 )  " "         " "         " "     " "      " "      " "    
## 6  ( 1 )  " "         " "         " "     " "      " "      " "    
## 7  ( 1 )  " "         " "         " "     " "      " "      " "    
## 8  ( 1 )  " "         " "         " "     " "      " "      " "    
## 9  ( 1 )  " "         " "         " "     " "      " "      " "    
## 10  ( 1 ) " "         " "         " "     " "      " "      " "    
## 11  ( 1 ) " "         " "         " "     " "      " "      "*"    
## 12  ( 1 ) " "         " "         " "     " "      " "      "*"    
## 13  ( 1 ) " "         " "         " "     " "      " "      "*"    
## 14  ( 1 ) " "         "*"         " "     " "      " "      "*"    
## 15  ( 1 ) " "         "*"         " "     " "      " "      "*"    
## 16  ( 1 ) " "         "*"         " "     " "      " "      "*"    
## 17  ( 1 ) " "         "*"         " "     " "      " "      "*"    
## 18  ( 1 ) " "         "*"         " "     " "      " "      "*"    
## 19  ( 1 ) "*"         "*"         " "     " "      " "      "*"    
## 20  ( 1 ) "*"         "*"         " "     " "      " "      "*"    
## 21  ( 1 ) "*"         "*"         " "     "*"      "*"      "*"    
## 22  ( 1 ) "*"         "*"         " "     "*"      "*"      "*"    
## 23  ( 1 ) "*"         "*"         " "     "*"      "*"      "*"    
## 24  ( 1 ) "*"         "*"         " "     "*"      "*"      "*"    
## 25  ( 1 ) "*"         "*"         " "     "*"      "*"      "*"    
## 26  ( 1 ) "*"         "*"         " "     "*"      "*"      "*"    
## 27  ( 1 ) "*"         "*"         " "     "*"      "*"      "*"    
## 28  ( 1 ) "*"         "*"         " "     "*"      "*"      "*"    
## 29  ( 1 ) "*"         "*"         " "     "*"      "*"      "*"    
## 30  ( 1 ) " "         " "         " "     "*"      "*"      "*"    
## 31  ( 1 ) " "         " "         "*"     "*"      "*"      "*"    
## 32  ( 1 ) " "         " "         "*"     "*"      "*"      "*"    
## 33  ( 1 ) "*"         "*"         "*"     "*"      "*"      "*"    
## 34  ( 1 ) " "         "*"         "*"     "*"      "*"      "*"    
## 35  ( 1 ) "*"         "*"         "*"     "*"      "*"      "*"    
## 36  ( 1 ) "*"         " "         "*"     "*"      "*"      "*"    
##           MVR_PTS CAR_AGE URBANICITY
## 1  ( 1 )  "*"     " "     " "       
## 2  ( 1 )  " "     " "     "*"       
## 3  ( 1 )  " "     " "     "*"       
## 4  ( 1 )  "*"     " "     "*"       
## 5  ( 1 )  "*"     " "     "*"       
## 6  ( 1 )  "*"     " "     "*"       
## 7  ( 1 )  "*"     " "     "*"       
## 8  ( 1 )  "*"     " "     "*"       
## 9  ( 1 )  "*"     "*"     "*"       
## 10  ( 1 ) "*"     "*"     "*"       
## 11  ( 1 ) "*"     "*"     "*"       
## 12  ( 1 ) "*"     "*"     "*"       
## 13  ( 1 ) "*"     "*"     "*"       
## 14  ( 1 ) "*"     "*"     "*"       
## 15  ( 1 ) "*"     "*"     "*"       
## 16  ( 1 ) "*"     "*"     "*"       
## 17  ( 1 ) "*"     "*"     "*"       
## 18  ( 1 ) "*"     "*"     "*"       
## 19  ( 1 ) "*"     "*"     "*"       
## 20  ( 1 ) "*"     "*"     "*"       
## 21  ( 1 ) "*"     "*"     "*"       
## 22  ( 1 ) "*"     "*"     "*"       
## 23  ( 1 ) "*"     "*"     "*"       
## 24  ( 1 ) "*"     "*"     "*"       
## 25  ( 1 ) "*"     "*"     "*"       
## 26  ( 1 ) "*"     "*"     "*"       
## 27  ( 1 ) "*"     "*"     "*"       
## 28  ( 1 ) "*"     "*"     "*"       
## 29  ( 1 ) "*"     "*"     "*"       
## 30  ( 1 ) "*"     "*"     "*"       
## 31  ( 1 ) "*"     "*"     "*"       
## 32  ( 1 ) "*"     "*"     "*"       
## 33  ( 1 ) "*"     "*"     "*"       
## 34  ( 1 ) "*"     "*"     "*"       
## 35  ( 1 ) "*"     "*"     "*"       
## 36  ( 1 ) "*"     "*"     "*"

We can determine the best fits by ploting the number of variables in our subsets and using the cp value in combinationation with the adjusted R^2 value

Based on the diagnostic plot, 20 variables appear to fit the data best

Multiple Linear Regression

Model 1: All Variables

We will first create a linear model with all of the predictors. It will most likely overfit the data but provides a good reference point. We also know from leaps, that the best model will use 16 of the predictors

## 
## Call:
## lm(formula = TARGET_AMT ~ . - TARGET_FLAG, data = imputed$train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5368  -1647   -754    353 103594 
## 
## Coefficients: (3 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           3.890e+02  7.412e+02   0.525 0.599712    
## KIDSDRIV              2.351e+02  1.362e+02   1.726 0.084469 .  
## AGE                   5.029e+00  8.436e+00   0.596 0.551070    
## HOMEKIDS              6.054e+01  7.747e+01   0.782 0.434538    
## YOJ                  -2.118e+01  1.799e+01  -1.177 0.239269    
## INCOME               -3.802e-03  2.365e-03  -1.608 0.107969    
## SINGLEPARENT          5.606e+02  2.427e+02   2.310 0.020949 *  
## HOME_VAL             -2.227e-03  7.582e-04  -2.938 0.003319 ** 
## MARRIED              -2.634e+02  1.799e+02  -1.465 0.143062    
## MALE                  1.950e+02  2.203e+02   0.885 0.375966    
## EDUCATIONBachelors   -5.675e+02  3.560e+02  -1.594 0.110941    
## EDUCATIONHigh_School -5.543e+02  3.971e+02  -1.396 0.162806    
## EDUCATIONMasters     -3.303e+02  3.129e+02  -1.056 0.291167    
## EDUCATIONPhD                 NA         NA      NA       NA    
## JOB                   8.360e+01  4.570e+02   0.183 0.854852    
## JOBBlue_Collar        5.025e+02  2.820e+02   1.782 0.074805 .  
## JOBClerical           3.837e+02  2.824e+02   1.359 0.174347    
## JOBDoctor            -4.560e+02  5.605e+02  -0.813 0.415980    
## JOBHome_Maker         2.477e+00  3.213e+02   0.008 0.993850    
## JOBLawyer             1.262e+02  4.232e+02   0.298 0.765569    
## JOBManager           -4.687e+02  3.488e+02  -1.344 0.179119    
## JOBProfessional       4.108e+02  3.266e+02   1.258 0.208459    
## JOBStudent                   NA         NA      NA       NA    
## TRAVTIME              1.068e+01  3.870e+00   2.759 0.005820 ** 
## COMMERCIALUSE         7.664e+02  1.894e+02   4.047 5.27e-05 ***
## BLUEBOOK              1.750e-02  1.035e-02   1.692 0.090741 .  
## TIF                  -5.131e+01  1.456e+01  -3.525 0.000427 ***
## CAR_TYPEMinivan      -7.151e+02  2.547e+02  -2.808 0.005008 ** 
## CAR_TYPEPanel_Truck  -6.346e+02  3.173e+02  -2.000 0.045524 *  
## CAR_TYPEPickup       -2.696e+02  2.657e+02  -1.015 0.310188    
## CAR_TYPESports_Car    4.430e+02  3.532e+02   1.254 0.209790    
## CAR_TYPESUV           6.155e+01  3.195e+02   0.193 0.847251    
## CAR_TYPEVan                  NA         NA      NA       NA    
## RED_CAR               9.042e+01  1.780e+02   0.508 0.611545    
## OLDCLAIM             -1.015e-02  8.764e-03  -1.158 0.246727    
## CLM_FREQ              9.105e+01  6.557e+01   1.389 0.165019    
## REVOKED               5.627e+02  2.100e+02   2.679 0.007401 ** 
## MVR_PTS               1.615e+02  3.151e+01   5.126 3.05e-07 ***
## CAR_AGE              -3.514e+01  1.567e+01  -2.242 0.024995 *  
## URBANICITY            1.672e+03  1.656e+02  10.097  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4528 on 5675 degrees of freedom
## Multiple R-squared:  0.06906,    Adjusted R-squared:  0.06316 
## F-statistic: 11.69 on 36 and 5675 DF,  p-value: < 2.2e-16

There are 16 significant variables in Model 1, 6 of which are highly significant at a 5% significance level
The adjusted R^2 is .06737
The 3 variables CAR_TYPEVan, JOBStudent, and EDUCATIONPhD are NA
Interesting Positive Predictors
- SINGLEPARENT has a positive impact on TARGET_AMT. One would expect single parents to be more responsible because they are the only care providers for their children
- Home makers have the highest positive impact on TARGET_AMT
- COMMERCIALUSE vehicles have a higher positive impact
- Urban areas as expected are more prone to crashes due to population
Interesting Negative Predictors
- Bachelor degrees have a higher negative impact on TARGET_AMT. This may be due to a lack of people pursuing education past a Bachelors
- Doctors have one of the highest negative impacts on TARGET_AMT. They see victims of car crashes
- RED_CAR has a negative impact. It is thought that red cars make people speed more

Model 2: 20 Variables Chosen by leaps

## 
## Call:
## lm(formula = TARGET_AMT ~ KIDSDRIV + INCOME + SINGLEPARENT + 
##     MARRIED + MALE + EDUCATIONBachelors + JOBDoctor + JOBManager + 
##     TRAVTIME + COMMERCIALUSE + BLUEBOOK + TIF + CAR_TYPEMinivan + 
##     CAR_TYPESports_Car + CAR_TYPESUV + CLM_FREQ + REVOKED + MVR_PTS + 
##     CAR_AGE + URBANICITY, data = imputed$train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5156  -1662   -765    329 104069 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.742e+01  3.557e+02   0.049 0.960936    
## KIDSDRIV            2.868e+02  1.240e+02   2.313 0.020748 *  
## INCOME             -5.991e-03  1.644e-03  -3.643 0.000272 ***
## SINGLEPARENT        5.742e+02  2.122e+02   2.706 0.006836 ** 
## MARRIED            -5.536e+02  1.423e+02  -3.890 0.000101 ***
## MALE                2.054e+02  1.770e+02   1.161 0.245888    
## EDUCATIONBachelors -6.771e+01  1.377e+02  -0.492 0.622964    
## JOBDoctor          -3.159e+02  3.690e+02  -0.856 0.391960    
## JOBManager         -7.513e+02  1.972e+02  -3.810 0.000140 ***
## TRAVTIME            1.079e+01  3.863e+00   2.792 0.005257 ** 
## COMMERCIALUSE       7.829e+02  1.487e+02   5.265 1.46e-07 ***
## BLUEBOOK            1.281e-02  8.695e-03   1.473 0.140797    
## TIF                -5.044e+01  1.453e+01  -3.471 0.000523 ***
## CAR_TYPEMinivan    -4.409e+02  1.739e+02  -2.535 0.011269 *  
## CAR_TYPESports_Car  6.693e+02  2.685e+02   2.493 0.012693 *  
## CAR_TYPESUV         2.896e+02  2.260e+02   1.281 0.200168    
## CLM_FREQ            6.433e+01  5.820e+01   1.105 0.269087    
## REVOKED             4.598e+02  1.870e+02   2.459 0.013981 *  
## MVR_PTS             1.646e+02  3.135e+01   5.251 1.57e-07 ***
## CAR_AGE            -3.292e+01  1.206e+01  -2.729 0.006377 ** 
## URBANICITY          1.663e+03  1.628e+02  10.218  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4530 on 5691 degrees of freedom
## Multiple R-squared:  0.06561,    Adjusted R-squared:  0.06233 
## F-statistic: 19.98 on 20 and 5691 DF,  p-value: < 2.2e-16

The adjusted R^2 of Model 2 is .06792, still very small but slightly improved over Model 1
Of the 20 variables, 16 are significant, while 10 are highly significant at the 5% level
Interesting Positive Predictors
- COMMERCIALUSE is the highest positive influencer
- SINGLEPARENT is still a high positive influencer
Interesting Negative Predictors
- JOBManager has overtaken JOBDoctor as the highest negative influencer

Model 3: 1 + Log Transform 20 Variables from Leaps

## 
## Call:
## lm(formula = TARGET_AMT ~ KIDSDRIV + INCOME + SINGLEPARENT + 
##     MARRIED + MALE + EDUCATIONBachelors + JOBDoctor + JOBManager + 
##     TRAVTIME + COMMERCIALUSE + BLUEBOOK + TIF + CAR_TYPEMinivan + 
##     CAR_TYPESports_Car + CAR_TYPESUV + CLM_FREQ + REVOKED + MVR_PTS + 
##     CAR_AGE + URBANICITY, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4712 -2.3427 -0.9312  1.9602 10.4862 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.76424    0.80434   4.680 2.94e-06 ***
## KIDSDRIV            0.91164    0.15647   5.826 5.98e-09 ***
## INCOME             -0.08446    0.01548  -5.457 5.05e-08 ***
## SINGLEPARENT        1.13722    0.21950   5.181 2.28e-07 ***
## MARRIED            -0.98637    0.14667  -6.725 1.93e-11 ***
## MALE               -0.01271    0.18099  -0.070 0.944010    
## EDUCATIONBachelors -0.40727    0.14335  -2.841 0.004512 ** 
## JOBDoctor          -1.16308    0.36964  -3.147 0.001661 ** 
## JOBManager         -1.34800    0.20163  -6.686 2.52e-11 ***
## TRAVTIME            0.48839    0.07578   6.444 1.26e-10 ***
## COMMERCIALUSE       1.54952    0.15211  10.187  < 2e-16 ***
## BLUEBOOK           -0.38189    0.07702  -4.958 7.33e-07 ***
## TIF                -0.45875    0.06074  -7.553 4.94e-14 ***
## CAR_TYPEMinivan    -0.63100    0.17859  -3.533 0.000414 ***
## CAR_TYPESports_Car  0.98331    0.27569   3.567 0.000364 ***
## CAR_TYPESUV         0.60596    0.22866   2.650 0.008070 ** 
## CLM_FREQ            0.46401    0.08774   5.288 1.28e-07 ***
## REVOKED             1.57779    0.19279   8.184 3.36e-16 ***
## MVR_PTS             0.44336    0.06477   6.845 8.45e-12 ***
## CAR_AGE            -0.28440    0.05714  -4.977 6.63e-07 ***
## URBANICITY          3.29982    0.16703  19.756  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.233 on 5690 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.2151, Adjusted R-squared:  0.2123 
## F-statistic: 77.96 on 20 and 5690 DF,  p-value: < 2.2e-16

R^2 has greatly improved to .2159
Interesting positive predictors
- URBANICITY is now the highest positive influencer
- COMMERCIALUSE still has a strong positive impact on TARGET_AMT
Interesting negative predictors
- JOBManager is still the highest negative influencer

Binary Logistic Regression

Model 4: All Variables

## 
## Call:
## glm(formula = TARGET_FLAG ~ . - TARGET_AMT, family = "binomial", 
##     data = imputed$train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2965  -0.7069  -0.3899   0.5604   3.1810  
## 
## Coefficients: (3 not defined because of singularities)
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -2.962e+00  4.441e-01  -6.669 2.58e-11 ***
## KIDSDRIV              3.667e-01  7.443e-02   4.927 8.36e-07 ***
## AGE                  -1.994e-04  4.873e-03  -0.041 0.967355    
## HOMEKIDS              5.516e-02  4.462e-02   1.236 0.216359    
## YOJ                  -2.694e-02  1.039e-02  -2.594 0.009494 ** 
## INCOME               -3.859e-06  1.405e-06  -2.747 0.006018 ** 
## SINGLEPARENT          4.166e-01  1.330e-01   3.133 0.001732 ** 
## HOME_VAL             -1.883e-06  4.413e-07  -4.267 1.98e-05 ***
## MARRIED              -3.868e-01  1.056e-01  -3.664 0.000248 ***
## MALE                  2.912e-02  1.355e-01   0.215 0.829846    
## EDUCATIONBachelors   -3.869e-01  2.210e-01  -1.750 0.080097 .  
## EDUCATIONHigh_School -6.629e-02  2.418e-01  -0.274 0.784000    
## EDUCATIONMasters     -2.158e-01  1.889e-01  -1.142 0.253304    
## EDUCATIONPhD                 NA         NA      NA       NA    
## JOB                  -9.117e-02  2.653e-01  -0.344 0.731134    
## JOBBlue_Collar        3.972e-01  1.588e-01   2.501 0.012379 *  
## JOBClerical           3.875e-01  1.603e-01   2.417 0.015637 *  
## JOBDoctor            -5.591e-01  3.645e-01  -1.534 0.125067    
## JOBHome_Maker        -5.611e-02  1.824e-01  -0.308 0.758343    
## JOBLawyer            -3.294e-02  2.540e-01  -0.130 0.896810    
## JOBManager           -5.134e-01  2.057e-01  -2.496 0.012572 *  
## JOBProfessional       8.403e-02  1.885e-01   0.446 0.655811    
## JOBStudent                   NA         NA      NA       NA    
## TRAVTIME              1.593e-02  2.300e-03   6.927 4.30e-12 ***
## COMMERCIALUSE         7.421e-01  1.062e-01   6.990 2.75e-12 ***
## BLUEBOOK             -1.466e-05  6.325e-06  -2.318 0.020445 *  
## TIF                  -6.569e-02  8.936e-03  -7.351 1.97e-13 ***
## CAR_TYPEMinivan      -6.009e-01  1.524e-01  -3.943 8.05e-05 ***
## CAR_TYPEPanel_Truck  -1.652e-01  1.797e-01  -0.919 0.357853    
## CAR_TYPEPickup       -3.519e-02  1.524e-01  -0.231 0.817342    
## CAR_TYPESports_Car    4.697e-01  2.082e-01   2.256 0.024078 *  
## CAR_TYPESUV           1.893e-01  1.914e-01   0.989 0.322673    
## CAR_TYPEVan                  NA         NA      NA       NA    
## RED_CAR               8.422e-03  1.051e-01   0.080 0.936152    
## OLDCLAIM             -1.100e-05  4.689e-06  -2.345 0.019026 *  
## CLM_FREQ              1.758e-01  3.442e-02   5.108 3.26e-07 ***
## REVOKED               8.780e-01  1.129e-01   7.775 7.57e-15 ***
## MVR_PTS               1.100e-01  1.675e-02   6.566 5.17e-11 ***
## CAR_AGE              -2.010e-03  9.322e-03  -0.216 0.829324    
## URBANICITY            2.467e+00  1.379e-01  17.893  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6527.4  on 5711  degrees of freedom
## Residual deviance: 5023.6  on 5675  degrees of freedom
## AIC: 5097.6
## 
## Number of Fisher Scoring iterations: 5

##           llh       llhNull            G2      McFadden          r2ML 
## -2511.8093700 -3263.7239857  1503.8292313     0.2303855     0.2314698 
##          r2CU 
##     0.3398657

Using the pscl package, we can get a pseudo-R^2 value of ~.2304
There are 19 significant variables in Model 4, 16 of which are highly significant at a 5% significance level
The 3 variables CAR_TYPEVan, JOBStudent, and EDUCATIONPhD are NA
Interesting Positive Predictors
- SINGLEPARENT has a positive impact on TARGET_AMT. One would expect single parents to be more responsible because they are the only care providers for their children
- REVOKED has the highest positive impact on TARGET_FLAG
- Urban areas as expected are more prone to crashes probably due to population
- High school has a positive impact, as opposed to the negative impact on TARGET_AMT
Interesting Negative Predictors
- Bachelor degrees have a higher negative impact on TARGET_AMT. This may be due to a lack of people pursuing education past a Bachelors
- Managers have one of the highest negative impacts on TARGET_AMT. They see victims of car crashes
- RED_CAR has a negative impact. It is thought that red cars make people speed more

Model 5: 20 Variables Chosen by leaps (Untransformed)

## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + INCOME + SINGLEPARENT + 
##     MARRIED + MALE + EDUCATIONBachelors + JOBDoctor + JOBManager + 
##     TRAVTIME + COMMERCIALUSE + BLUEBOOK + TIF + CAR_TYPEMinivan + 
##     CAR_TYPESports_Car + CAR_TYPESUV + CLM_FREQ + REVOKED + MVR_PTS + 
##     CAR_AGE + URBANICITY, family = "binomial", data = imputed$train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3199  -0.7168  -0.3988   0.5922   3.1077  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -2.933e+00  2.305e-01 -12.724  < 2e-16 ***
## KIDSDRIV            4.109e-01  6.689e-02   6.144 8.06e-10 ***
## INCOME             -7.445e-06  1.014e-06  -7.343 2.09e-13 ***
## SINGLEPARENT        4.884e-01  1.138e-01   4.290 1.78e-05 ***
## MARRIED            -6.093e-01  8.252e-02  -7.384 1.54e-13 ***
## MALE               -2.468e-02  1.082e-01  -0.228 0.819633    
## EDUCATIONBachelors -2.557e-01  8.145e-02  -3.139 0.001694 ** 
## JOBDoctor          -3.911e-01  2.613e-01  -1.497 0.134401    
## JOBManager         -6.580e-01  1.267e-01  -5.194 2.06e-07 ***
## TRAVTIME            1.571e-02  2.270e-03   6.919 4.55e-12 ***
## COMMERCIALUSE       8.597e-01  8.422e-02  10.208  < 2e-16 ***
## BLUEBOOK           -1.911e-05  5.105e-06  -3.743 0.000182 ***
## TIF                -6.509e-02  8.873e-03  -7.336 2.20e-13 ***
## CAR_TYPEMinivan    -4.856e-01  1.063e-01  -4.569 4.89e-06 ***
## CAR_TYPESports_Car  5.105e-01  1.546e-01   3.301 0.000963 ***
## CAR_TYPESUV         2.384e-01  1.332e-01   1.789 0.073564 .  
## CLM_FREQ            1.427e-01  3.052e-02   4.676 2.92e-06 ***
## REVOKED             7.372e-01  9.759e-02   7.554 4.21e-14 ***
## MVR_PTS             1.110e-01  1.653e-02   6.713 1.90e-11 ***
## CAR_AGE            -1.924e-02  7.015e-03  -2.743 0.006083 ** 
## URBANICITY          2.421e+00  1.371e-01  17.658  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6527.4  on 5711  degrees of freedom
## Residual deviance: 5081.2  on 5691  degrees of freedom
## AIC: 5123.2
## 
## Number of Fisher Scoring iterations: 5

##           llh       llhNull            G2      McFadden          r2ML 
## -2540.6105722 -3263.7239857  1446.2268269     0.2215608     0.2236804 
##          r2CU 
##     0.3284285

Using the pscl package, we can get a pseudo-R^2 value of ~.2216
Interesting Positive Predictors
- COMMERCIALUSE now has the highest predictor
Interesting Negative Predictors
- Managers are still the lowest negative predictor. Could be due to age
- Minivans have the lowest impact on crashes amongst the car types
MALE is not a significant predictor but it has a highest impact

Model 6: 1 + Log Transform 20 Variables Chosen by leaps

## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + INCOME + SINGLEPARENT + 
##     MARRIED + MALE + EDUCATIONBachelors + JOBDoctor + JOBManager + 
##     TRAVTIME + COMMERCIALUSE + BLUEBOOK + TIF + CAR_TYPEMinivan + 
##     CAR_TYPESports_Car + CAR_TYPESUV + CLM_FREQ + REVOKED + MVR_PTS + 
##     CAR_AGE + URBANICITY, family = "binomial", data = d)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8040  -0.5819  -0.3530   0.1081   2.3171  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.08027    0.69569  -1.553 0.120471    
## KIDSDRIV            0.57531    0.12351   4.658 3.20e-06 ***
## INCOME             -0.05020    0.01220  -4.116 3.85e-05 ***
## SINGLEPARENT        0.53146    0.17257   3.080 0.002073 ** 
## MARRIED            -0.69257    0.12896  -5.371 7.85e-08 ***
## MALE                0.02306    0.16966   0.136 0.891900    
## EDUCATIONBachelors -0.28741    0.12907  -2.227 0.025965 *  
## JOBDoctor          -0.96752    0.42015  -2.303 0.021291 *  
## JOBManager         -0.93577    0.20455  -4.575 4.77e-06 ***
## TRAVTIME            0.35093    0.07057   4.973 6.60e-07 ***
## COMMERCIALUSE       0.96603    0.12901   7.488 6.98e-14 ***
## BLUEBOOK           -0.27581    0.06531  -4.223 2.41e-05 ***
## TIF                -0.30565    0.05349  -5.715 1.10e-08 ***
## CAR_TYPEMinivan    -0.54785    0.16831  -3.255 0.001134 ** 
## CAR_TYPESports_Car  0.61278    0.23936   2.560 0.010466 *  
## CAR_TYPESUV         0.38057    0.20481   1.858 0.063151 .  
## CLM_FREQ            0.30578    0.07048   4.338 1.44e-05 ***
## REVOKED             0.81154    0.14547   5.579 2.42e-08 ***
## MVR_PTS             0.23386    0.05383   4.345 1.39e-05 ***
## CAR_AGE            -0.17051    0.04874  -3.498 0.000469 ***
## URBANICITY          2.95792    0.22379  13.217  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3548.4  on 5710  degrees of freedom
## Residual deviance: 2696.9  on 5690  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 5448.5
## 
## Number of Fisher Scoring iterations: 6

##           llh       llhNull            G2      McFadden          r2ML 
## -2703.2717769 -3374.3968320  1342.2501104     0.1988874     0.2094520 
##          r2CU 
##     0.3021311

Using the pscl package, we can get a pseudo-R^2 value of ~.1987
Interesting Positive Predictors
- URBANICITY now has the highest predictor
- SUV has a positie predictor
Interesting Negative Predictors
- Managers are still the lowest negative predictor. Could be due to age
- Minivans have the lowest impact on crashes amongst the car types
MALE is not a significant predictor but it has a highest impact

Select Models

We will use Model 3 and Model 5 based on their R^2 values and the fact that all or most of the variables are significant at the 5% level

Model 3

The histogram of the residuals do not show a normal distribution
The qqplot shows a fairly linear relationship with the tails of the plot venturing away from the line
The residual plot does not display contant variance

Test Model

Model 5

The histogram of the residuals do not show a normal distribution
The qqplot shows a fairly linear relationship that ventures wildly away from linear towards the upper tail
The residual plot shows no evidence of a constant variance

Test Model

## [1] 0.7264282

The linear model correctly predicted at a 72% rate

Performance

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1124  146
##          1  443  440
##                                           
##                Accuracy : 0.7264          
##                  95% CI : (0.7071, 0.7452)
##     No Information Rate : 0.7278          
##     P-Value [Acc > NIR] : 0.5687          
##                                           
##                   Kappa : 0.404           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7509          
##             Specificity : 0.7173          
##          Pos Pred Value : 0.4983          
##          Neg Pred Value : 0.8850          
##               Precision : 0.4983          
##                  Recall : 0.7509          
##                      F1 : 0.5990          
##              Prevalence : 0.2722          
##          Detection Rate : 0.2044          
##    Detection Prevalence : 0.4101          
##       Balanced Accuracy : 0.7341          
##                                           
##        'Positive' Class : 1               
##

The accuracy is only ~73%
Positive prediction was very poor at below 50%
Negative prediction was decent sitting at ~89%
Sensitivity is ~75%
Specificity is ~72%
The F1 is ~60%
The AUC is 0.6916703
Over all, I’d say the model is pretty underwhelming in prediction power

Code Appendix

knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(tidy = TRUE)
knitr::opts_chunk$set(warning = FALSE)

loadPkg <- function(x) {
    if (!require(x, character.only = T)) 
        install.packages(x, dependencies = T, repos = "http://cran.us.r-project.org")
    require(x, character.only = T)
}

libs <- c("knitr", "magrittr", "data.table", "kableExtra", "caret", "pROC", 
    "missForest", "zoo", "ISLR", "leaps", "fBasics", "reshape2", "tidyverse", 
    "GGally", "gridExtra", "ROCR", "dummies", "pscl")

lapply(libs, loadPkg)
insTraining <- fread("https://raw.githubusercontent.com/baroncurtin2/data621/master/week4/insurance_training_data.csv") %>% 
    as.tibble()
insTest <- fread("https://raw.githubusercontent.com/baroncurtin2/data621/master/week4/insurance-evaluation-data.csv") %>% 
    as.tibble()
data_frame(explanatory_variables = names(insTraining)) %>% filter(!explanatory_variables %in% 
    c("TARGET_FLAG", "TARGET_AMT", "INDEX")) %>% arrange(explanatory_variables)
glimpse(insTraining)
insTraining %>% sapply(typeof) %>% as.data.frame() %>% rownames_to_column(var = "variable") %>% 
    rename(vartype = 2) %>% group_by(vartype) %>% summarise(count = n())
summary(insTraining)
# function to convert text monetary values to numeric values
convertText2Num <- function(x) {
    x %>% # remove symbols and punctuation
    str_replace_all("[\\$[:punct:]]", "") %>% # convert to number
    as.numeric() %>% # replace NA with 0
    if_else(is.na(.), 0, .)
}

insStats <- insTraining %>% mutate_at(c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM"), 
    convertText2Num) %>% select_if(is.numeric) %>% basicStats(.) %>% as_tibble() %>% 
    rownames_to_column() %>% gather(var, value, -rowname) %>% spread(rowname, 
    value) %>% rename_all(str_to_lower) %>% rename_all(str_trim) %>% rename(variables = "var", 
    q1 = `1. quartile`, q3 = `3. quartile`, max = maximum, min = minimum, na_vals = nas, 
    n = nobs, sd = stdev, var = variance) %>% mutate(obs = n - na_vals, range = max - 
    min, iqr = q3 - q1) %>% select(variables, n, na_vals, obs, mean, min, q1, 
    median, q3, max, sd, var, range, iqr, skewness, kurtosis)

insStats
insTraining %>% mutate_at(c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM"), convertText2Num) %>% 
    select_if(is.numeric) %>% cor(use = "na.or.complete") %>% as.data.frame() %>% 
    rownames_to_column(var = "predictor") %>% as_data_frame() %>% select(predictor, 
    TARGET_FLAG, TARGET_AMT) %>% filter(!predictor %in% c("INDEX", "TARGET_FLAG", 
    "TARGET_AMT")) %>% arrange(desc(TARGET_FLAG)) %>% mutate(flag_rank = dense_rank(desc(TARGET_FLAG)), 
    amt_rank = dense_rank(desc(TARGET_AMT)), rank_equal = flag_rank == amt_rank)
ggcorr(insTraining, palette = "RdBu", label = T, geom = "tile", size = 2)
vis <- insTraining %>% mutate_at(c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM"), 
    convertText2Num) %>% select_if(is.numeric) %>% melt(id.vars = "INDEX")

ggplot(vis, aes(value)) + geom_density(fill = "skyblue") + facet_wrap(~variable, 
    scales = "free")
insTraining %>% mutate_at(c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM"), convertText2Num) %>% 
    mutate(TARGET_FLAG = as.factor(TARGET_FLAG)) %>% keep(is.numeric) %>% gather() %>% 
    ggplot(aes(value)) + geom_histogram(bins = 25) + facet_wrap(~key, scales = "free")
ggplot(vis, aes(x = variable, y = value)) + geom_boxplot(show.legend = T) + 
    stat_summary(fun.y = mean, color = "red", geom = "point", shape = 18, size = 3) + 
    coord_flip()
vis %>% filter(!variable %in% c("HOME_VAL", "INCOME", "OLDCLAIM", "BLUEBOOK", 
    "TARGET_AMT")) %>% ggplot(aes(x = variable, y = value)) + geom_boxplot(show.legend = T) + 
    stat_summary(fun.y = mean, color = "red", geom = "point", shape = 18, size = 3) + 
    coord_flip()
set.seed(777)
smp_size <- floor(0.7 * nrow(insTraining))
train_ind <- sample(seq_len(nrow(insTraining)), size = smp_size)

datasets <- list(train = insTraining[train_ind, ], test = insTest, test2 = insTraining[-train_ind, 
    ])

# helper function for converting text to number
convertText2Num <- function(x) {
    x %>% # remove symbols and punctuation
    str_replace_all("[\\$[:punct:]]", "") %>% # convert to number
    as.numeric()
}
datasets %<>% map(function(df) {
    df %<>% # convert text to num
    mutate_at(c("INCOME", "HOME_VAL", "BLUEBOOK", "OLDCLAIM"), convertText2Num)
    
    # return dataframe
    return(df)
})
charCols <- datasets$train %>% select_if(is.character) %>% lapply(function(x) unique(x)) %>% 
    print
datasets %<>% map(function(df) {
    df %<>% mutate(EDUCATION = str_replace_all(EDUCATION, "<", ""), RED_CAR = str_to_title(RED_CAR), 
        URBANICITY = str_replace(URBANICITY, "/ ", "/")) %>% mutate_at(names(charCols), 
        str_replace_all, pattern = "z_", replacement = "")
    
    # return df
    return(df)
})
datasets %<>% map(function(df) {
    df %<>% # PARENT1 rename to SINGLEPARENT / conversion
    rename(SINGLEPARENT = PARENT1) %>% mutate(SINGLEPARENT = if_else(SINGLEPARENT == 
        "Yes", 1, 0)) %>% # MSTATUS rename to MARRIED / conversion
    rename(MARRIED = MSTATUS) %>% mutate(MARRIED = if_else(MARRIED == "Yes", 
        1, 0)) %>% # SEX rename to MALE / conversion
    rename(MALE = SEX) %>% mutate(MALE = if_else(MALE == "M", 1, 0)) %>% # CAR_USE rename to COMMERCIALUSE / conversion
    rename(COMMERCIALUSE = CAR_USE) %>% mutate(COMMERCIALUSE = if_else(COMMERCIALUSE == 
        "Commercial", 1, 0)) %>% # RED_CAR conversion
    mutate(RED_CAR = if_else(RED_CAR == "Yes", 1, 0)) %>% # REVOKED conversion
    mutate(REVOKED = if_else(REVOKED == "Yes", 1, 0)) %>% # URBANICITY
    mutate(URBANICITY = if_else(str_detect(URBANICITY, "Urban"), 1, 0))
})
datasets %<>% map(function(df) {
    # create dummy data frame
    df <- df %>% as.data.frame() %>% dummy.data.frame(names = c("EDUCATION", 
        "JOB", "CAR_TYPE"))
    
    # return df
    return(df)
})
replace_space <- function(x) str_replace(x, " ", "_")

datasets %<>% map(function(df) {
    # rename all with spaces
    df %<>% rename_all(replace_space)
    
    # return df
    return(df)
})
forests <- datasets %>% map(function(df) {
    df %<>% as.data.frame() %>% select_if(is.numeric) %>% select(-INDEX) %>% 
        missForest()
    
    # return df
    return(df)
})
imputed <- forests %>% map(function(x) {
    return(x$ximp)
})
mlrs <- regsubsets(TARGET_AMT ~ . - TARGET_FLAG, data = imputed$train, method = "exhaustive", 
    nvmax = NULL, nbest = 1)
mlrs.summary <- summary(mlrs)
print(mlrs.summary)
# determine best subset
plot(mlrs.summary$cp, xlab = "Number of Variables", ylab = "Cp")
points(which.min(mlrs.summary$cp), mlrs.summary$cp[which.min(mlrs.summary$cp)], 
    pch = 20, col = "red")

# cp plot par(mfrow=c(1,2))
plot(mlrs, scale = "Cp", main = "Cp")

# r^2 splot
plot(mlrs, scale = "adjr2", main = "Adjusted R^2")
m1 <- lm(TARGET_AMT ~ . - TARGET_FLAG, data = imputed$train)
summary(m1)
m2 <- lm(TARGET_AMT ~ KIDSDRIV + INCOME + SINGLEPARENT + MARRIED + MALE + EDUCATIONBachelors + 
    JOBDoctor + JOBManager + TRAVTIME + COMMERCIALUSE + BLUEBOOK + TIF + CAR_TYPEMinivan + 
    CAR_TYPESports_Car + CAR_TYPESUV + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + 
    URBANICITY, data = imputed$train)
summary(m2)
m3 <- imputed$train %>% mutate_all(~log(1 + .x)) %>% lm(TARGET_AMT ~ KIDSDRIV + 
    INCOME + SINGLEPARENT + MARRIED + MALE + EDUCATIONBachelors + JOBDoctor + 
    JOBManager + TRAVTIME + COMMERCIALUSE + BLUEBOOK + TIF + CAR_TYPEMinivan + 
    CAR_TYPESports_Car + CAR_TYPESUV + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + 
    URBANICITY, data = .)

summary(m3)
m4 <- glm(TARGET_FLAG ~ . - TARGET_AMT, data = imputed$train, family = "binomial")
summary(m4)
pR2(m4)
m5 <- glm(TARGET_FLAG ~ KIDSDRIV + INCOME + SINGLEPARENT + MARRIED + MALE + 
    EDUCATIONBachelors + JOBDoctor + JOBManager + TRAVTIME + COMMERCIALUSE + 
    BLUEBOOK + TIF + CAR_TYPEMinivan + CAR_TYPESports_Car + CAR_TYPESUV + CLM_FREQ + 
    REVOKED + MVR_PTS + CAR_AGE + URBANICITY, data = imputed$train, family = "binomial")
summary(m5)
pR2(m5)
d <- imputed$train %>% mutate_all(~log(1 + .x))

m6 <- glm(TARGET_FLAG ~ KIDSDRIV + INCOME + SINGLEPARENT + MARRIED + MALE + 
    EDUCATIONBachelors + JOBDoctor + JOBManager + TRAVTIME + COMMERCIALUSE + 
    BLUEBOOK + TIF + CAR_TYPEMinivan + CAR_TYPESports_Car + CAR_TYPESUV + CLM_FREQ + 
    REVOKED + MVR_PTS + CAR_AGE + URBANICITY, data = d, family = "binomial")
summary(m6)
pR2(m6)
par(mfrow = c(2, 2))

plot(m3)
hist(m3$residuals)
qqnorm(m3$residuals)
qqline(m3$residuals)
m3_results <- predict(m3, newdata = datasets$test2)
datasets$test <- bind_cols(datasets$test2, data_frame(m3results = m3_results)) %>% 
    mutate(m3results = if_else(m3results < 0, 0, m3results), m3amt_match = TARGET_AMT == 
        m3results)
datasets$test
par(mfrow = c(2, 2))

plot(m5)
hist(m5$residuals)
qqnorm(m5$residuals)
qqline(m5$residuals)
m5_results <- predict(m5, newdata = datasets$test2, type = "response")
datasets$test <- bind_cols(datasets$test2, data_frame(m5results = m5_results)) %>% 
    mutate(m5results = if_else(m5_results > mean(TARGET_FLAG), 1, 0), m5flag_match = m5results == 
        TARGET_FLAG)
datasets$test

mean(datasets$test$m5flag_match, na.rm = T)
cm <- confusionMatrix(as.factor(datasets$test$m5results), as.factor(datasets$test$TARGET_FLAG), 
    positive = "1", mode = "everything") %>% print

curveRoc <- roc(datasets$test$m5results, datasets$test$TARGET_FLAG)
plot(curveRoc, legacy.axes = T, main = "pROC")

Data 621 - Week4

Baron Curtin

2018-07-06

Cover Page

Introduction

Data Exploration

Non-Visual Inspection

Variables

Basic Stats

Correlation

Visual Inspection

Density Plots

Histograms

Box Plots

Data Preparation

Transforming Monetary Columns to Numeric

Transform Inconsitent Observations Within Columns

Create Binary Dummy Variable Columns for Categorical Variables

Missing Value Imputation

Build Models

Leaps Subsetting

Multiple Linear Regression

Model 1: All Variables

Model 2: 20 Variables Chosen by leaps

Model 3: 1 + Log Transform 20 Variables from Leaps

Binary Logistic Regression

Model 4: All Variables

Model 5: 20 Variables Chosen by leaps (Untransformed)

Model 6: 1 + Log Transform 20 Variables Chosen by leaps

Select Models

Model 3

Test Model

Model 5

Test Model

Performance

Code Appendix