DATA 621: HW 4

David Quarshie - Group 3

In this assignment we are given a dataset that represents customer records at an auto insurance company. Our goal is to build a logistic regression model to that will predict the probability that a person will crash their car and also estimate the amount of money it will cost if the person is in a crash.

1. Data Exploration

The data has to be cleaned to take away the ‘$’ and ‘,’ characters from the fields that have them.

Basic Statistics

The training data has 8,161 rows and 25 columns or features. Of the 25 columns, 14 are discrete, 11 are continuous, and none are all their missing. We do have 970 missing values out of 204,025 data points.

n mean sd median min max skew kurtosis
TARGET_FLAG* 8161 1.263816e+00 4.407276e-01 1 1 2.0 1.0716614 -0.8516462
TARGET_AMT 8161 1.504325e+03 4.704027e+03 0 0 107586.1 8.7063034 112.2884386
KIDSDRIV 8161 1.710575e-01 5.115341e-01 0 0 4.0 3.3518374 11.7801916
AGE 8155 4.479031e+01 8.627589e+00 45 16 81.0 -0.0289889 -0.0617020
HOMEKIDS 8161 7.212351e-01 1.116323e+00 0 0 5.0 1.3411271 0.6489915
YOJ 7707 1.049929e+01 4.092474e+00 11 0 23.0 -1.2029676 1.1773410
INCOME 7716 6.189809e+04 4.757268e+04 54028 0 367030.0 1.1863166 2.1290163
PARENT1* 8161 1.131969e+00 3.384779e-01 1 1 2.0 2.1743561 2.7281589
HOME_VAL 7697 1.548673e+05 1.291238e+05 161160 0 885282.0 0.4885950 -0.0160838
MSTATUS* 8161 1.400319e+00 4.899929e-01 1 1 2.0 0.4068189 -1.8347231
SEX* 8161 1.536086e+00 4.987266e-01 2 1 2.0 -0.1446959 -1.9793056
EDUCATION* 8161 3.090675e+00 1.444856e+00 3 1 5.0 0.1162654 -1.3799674
JOB* 8161 5.687171e+00 2.681873e+00 6 1 9.0 -0.3067029 -1.2222635
TRAVTIME 8161 3.348572e+01 1.590833e+01 33 5 142.0 0.4468174 0.6643331
CAR_USE* 8161 1.628845e+00 4.831436e-01 2 1 2.0 -0.5332937 -1.7158080
BLUEBOOK 8161 1.570990e+04 8.419734e+03 14440 1500 69740.0 0.7942141 0.7913559
TIF 8161 5.351305e+00 4.146635e+00 4 1 25.0 0.8908120 0.4224940
CAR_TYPE* 8161 3.529714e+00 1.965357e+00 3 1 6.0 -0.0047181 -1.5165329
RED_CAR* 8161 1.291386e+00 4.544287e-01 1 1 2.0 0.9180255 -1.1573709
OLDCLAIM 8161 4.037076e+03 8.777139e+03 0 0 57037.0 3.1190400 9.8606583
CLM_FREQ 8161 7.985541e-01 1.158453e+00 0 0 5.0 1.2087985 0.2842890
REVOKED* 8161 1.122534e+00 3.279216e-01 1 1 2.0 2.3018899 3.2991013
MVR_PTS 8161 1.695503e+00 2.147112e+00 1 0 13.0 1.3478403 1.3754900
CAR_AGE 7651 8.328323e+00 5.700742e+00 8 -3 28.0 0.2819531 -0.7489756
URBANICITY* 8161 1.204509e+00 4.033673e-01 1 1 2.0 1.4649406 0.1460688

Distribution of Target Variable

Let’s look at the target_flag variable in our training data to make sure there no one sided distribution.

Var1 Freq
0 6008
1 2153
## [1] 6008

Histogram of Variables

Boxplot of Variables

## NULL

2. DATA PREPARATION

Adjusting Variables

Looking at the plots we see we have to make a few changes to some variables. We’ll make HOMEKIDS boolean instead of a factor. For the rows where CAR_AGE are labled numbers, we make 0. For blank JOBS we label those as “Unknown”. Finally, change Education to 1 if PhD and Masters.

Missing Data

We have missing data in income, yoj, home_val, and car_age.

##  TARGET_FLAG   TARGET_AMT        KIDSDRIV           AGE       
##  0:6008      Min.   :     0   Min.   :0.0000   Min.   :16.00  
##  1:2153      1st Qu.:     0   1st Qu.:0.0000   1st Qu.:39.00  
##              Median :     0   Median :0.0000   Median :45.00  
##              Mean   :  1504   Mean   :0.1711   Mean   :44.79  
##              3rd Qu.:  1036   3rd Qu.:0.0000   3rd Qu.:51.00  
##              Max.   :107586   Max.   :4.0000   Max.   :81.00  
##                                                NA's   :6      
##     HOMEKIDS           YOJ           INCOME       PARENT1   
##  Min.   :0.0000   Min.   : 0.0   Min.   :     0   No :7084  
##  1st Qu.:0.0000   1st Qu.: 9.0   1st Qu.: 28097   Yes:1077  
##  Median :0.0000   Median :11.0   Median : 54028             
##  Mean   :0.3519   Mean   :10.5   Mean   : 61898             
##  3rd Qu.:1.0000   3rd Qu.:13.0   3rd Qu.: 85986             
##  Max.   :1.0000   Max.   :23.0   Max.   :367030             
##                   NA's   :454    NA's   :445                
##     HOME_VAL      MSTATUS      SEX         EDUCATION     
##  Min.   :     0   Yes :4894   M  :3786   Min.   :0.0000  
##  1st Qu.:     0   z_No:3267   z_F:4375   1st Qu.:0.0000  
##  Median :161160                          Median :1.0000  
##  Mean   :154867                          Mean   :0.7076  
##  3rd Qu.:238724                          3rd Qu.:1.0000  
##  Max.   :885282                          Max.   :1.0000  
##  NA's   :464                                             
##             JOB          TRAVTIME            CAR_USE        BLUEBOOK    
##  z_Blue Collar:1825   Min.   :  5.00   Commercial:3029   Min.   : 1500  
##  Clerical     :1271   1st Qu.: 22.00   Private   :5132   1st Qu.: 9280  
##  Professional :1117   Median : 33.00                     Median :14440  
##  Manager      : 988   Mean   : 33.49                     Mean   :15710  
##  Lawyer       : 835   3rd Qu.: 44.00                     3rd Qu.:20850  
##  Student      : 712   Max.   :142.00                     Max.   :69740  
##  (Other)      :1413                                                     
##       TIF                CAR_TYPE    RED_CAR       OLDCLAIM    
##  Min.   : 1.000   Minivan    :2145   no :5783   Min.   :    0  
##  1st Qu.: 1.000   Panel Truck: 676   yes:2378   1st Qu.:    0  
##  Median : 4.000   Pickup     :1389              Median :    0  
##  Mean   : 5.351   Sports Car : 907              Mean   : 4037  
##  3rd Qu.: 7.000   Van        : 750              3rd Qu.: 4636  
##  Max.   :25.000   z_SUV      :2294              Max.   :57037  
##                                                                
##     CLM_FREQ      REVOKED       MVR_PTS          CAR_AGE      
##  Min.   :0.0000   No :7161   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:0.0000   Yes:1000   1st Qu.: 0.000   1st Qu.: 1.000  
##  Median :0.0000              Median : 1.000   Median : 8.000  
##  Mean   :0.7986              Mean   : 1.696   Mean   : 8.329  
##  3rd Qu.:2.0000              3rd Qu.: 3.000   3rd Qu.:12.000  
##  Max.   :5.0000              Max.   :13.000   Max.   :28.000  
##                                               NA's   :510     
##                  URBANICITY  
##  Highly Urban/ Urban  :6492  
##  z_Highly Rural/ Rural:1669  
##                              
##                              
##                              
##                              
## 

Imputate

Let’s runmice imputation on both the train and test set.

3. Build Models

Model 1

For the first model we will include all the variables. Looking at the output of the model we see that some points are highly colinear and a some variables that may not be necessary.

Model 1 uses the formula:

target ~ .

x
KIDSDRIV 7.702064
AGE 10.360192
HOMEKIDS 17.584519
YOJ 9.577655
INCOME 22.277981
PARENT1Yes 13.595112
HOME_VAL 16.757962
MSTATUSz_No 15.428021
SEXz_F 25.658144
EDUCATION 31.040495
JOBDoctor 18.268248
JOBHome Maker 11.856568
JOBLawyer 24.654979
JOBManager 16.866626
JOBProfessional 13.423685
JOBStudent 11.234607
JOBUnknown 19.132636
JOBz_Blue Collar 16.156666
TRAVTIME 7.248081
CAR_USEPrivate 14.272749
BLUEBOOK 16.154506
TIF 7.532261
CAR_TYPEPanel Truck 15.759939
CAR_TYPEPickup 11.552785
CAR_TYPESports Car 13.568756
CAR_TYPEVan 10.684761
CAR_TYPEz_SUV 20.606817
RED_CARyes 12.730645
OLDCLAIM 9.655093
CLM_FREQ 8.865006
REVOKEDYes 7.320507
MVR_PTS 7.034295
CAR_AGE 12.691818
URBANICITYz_Highly Rural/ Rural 17.144953

Model 2

For the second model we ignore what’s colinear but remove unneccessary variables shown in model 1.

Model 2 uses the formula:

TARGET_FLAG ~ KIDSDRIV + INCOME + PARENT1 + HOME_VAL + MSTATUS + JOB + TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY

x
KIDSDRIV 6.471909
INCOME 21.500907
PARENT1Yes 8.289954
HOME_VAL 16.645614
MSTATUSz_No 13.070655
JOBDoctor 14.887878
JOBHome Maker 9.706264
JOBLawyer 13.219906
JOBManager 14.724221
JOBProfessional 12.772811
JOBStudent 10.175504
JOBUnknown 12.236807
JOBz_Blue Collar 16.051965
TRAVTIME 7.207871
CAR_USEPrivate 14.115983
BLUEBOOK 12.961306
TIF 7.495193
CAR_TYPEPanel Truck 13.665753
CAR_TYPEPickup 11.492782
CAR_TYPESports Car 9.256940
CAR_TYPEVan 9.944616
CAR_TYPEz_SUV 12.283012
OLDCLAIM 9.621306
CLM_FREQ 8.838853
REVOKEDYes 7.276587
MVR_PTS 6.987320
URBANICITYz_Highly Rural/ Rural 17.206106

Model 3

Model 3 removes the variables with the 3 highest VIF values from Model 1.

Model 3 uses the formula:

TARGET_FLAG ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + INCOME + PARENT1 + HOME_VAL + MSTATUS + JOB + TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + RED_CAR + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY

x
KIDSDRIV 7.671453
AGE 10.243422
HOMEKIDS 17.515556
YOJ 9.492256
PARENT1Yes 13.545140
HOME_VAL 13.655667
MSTATUSz_No 14.487994
JOBDoctor 14.769079
JOBHome Maker 11.233131
JOBLawyer 14.938707
JOBManager 14.686300
JOBProfessional 12.551505
JOBStudent 11.113690
JOBUnknown 12.074654
JOBz_Blue Collar 15.714812
TRAVTIME 7.233085
CAR_USEPrivate 14.239721
BLUEBOOK 13.370724
TIF 7.493869
CAR_TYPEPanel Truck 14.153682
CAR_TYPEPickup 11.515311
CAR_TYPESports Car 10.285132
CAR_TYPEVan 10.107125
CAR_TYPEz_SUV 14.479069
RED_CARyes 10.117468
OLDCLAIM 9.658407
CLM_FREQ 8.847909
REVOKEDYes 7.300375
MVR_PTS 7.006550
CAR_AGE 11.081616
URBANICITYz_Highly Rural/ Rural 17.122161

Model 4

Model 4 takes in Model 3 and removes thoe values that are poor predictors.

Model 4 uses the formula:

TARGET_FLAG ~ KIDSDRIV + PARENT1 + HOME_VAL + MSTATUS + JOB + TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY

x
KIDSDRIV 6.451727
PARENT1Yes 8.292518
HOME_VAL 13.566152
MSTATUSz_No 12.313098
JOBDoctor 14.608289
JOBHome Maker 9.697695
JOBLawyer 14.646576
JOBManager 14.501307
JOBProfessional 12.454683
JOBStudent 10.034136
JOBUnknown 12.017685
JOBz_Blue Collar 15.674081
TRAVTIME 7.212779
CAR_USEPrivate 14.196919
BLUEBOOK 12.433073
TIF 7.480926
CAR_TYPEPanel Truck 13.689753
CAR_TYPEPickup 11.498573
CAR_TYPESports Car 9.244599
CAR_TYPEVan 9.963572
CAR_TYPEz_SUV 12.283742
OLDCLAIM 9.654437
CLM_FREQ 8.825693
REVOKEDYes 7.279732
MVR_PTS 6.973144
CAR_AGE 11.048829
URBANICITYz_Highly Rural/ Rural 17.186581
## Regression

Regression Model 1

Regression Model 1 fits includes all the variables.

target ~ .

Regression Model 2

For Regression Model 2 we take out features with less impact.

Regression Model 2’s formula:

TARGET_AMT ~ HOME_VAL + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY

Regression Model 3

Regression Model 3 deals with the issues related to the car value and driver’s legal issues.

Regression Model 3’s formula:

TARGET_AMT ~ BLUEBOOK + REVOKED + MVR_PTS + CAR_AGE

Regression Model Results

model RMSE Rsquared MAE RMSESD RsquaredSD MAESD
Mod1 7600.723 0.0095215 3754.895 1593.166 0.0083830 338.8376
Mod2 7516.968 0.0110225 3689.982 1881.096 0.0113882 464.5595
Mod3 7499.772 0.0186040 3658.981 1795.630 0.0189334 466.5175

4. SELECT MODEL

It’s time for us to pick which model we want to use. To help do this we’ll review each model’s accuracy by making predictions on the 20% we kept and comparing their results. We’ll use fourfold plots, summary statistics, and ROC / AUC plots to determine overall accuarcy.

To aid in model selection for the regression problem, we’ll compare the error in the fit of our models in a table and select from there.

1: Classification

Fourfold Plots

Summary Statistics

Sensitivity Specificity Precision Recall F1
Model1 0.9112098 0.4347826 0.8185444 0.9112098 0.8623950
Model2 0.9145394 0.4285714 0.8174603 0.9145394 0.8632792
Model3 0.9023307 0.4347826 0.8170854 0.9023307 0.8575949
Model4 0.9089900 0.4347826 0.8181818 0.9089900 0.8611987

ROC / AUC

Model Selection - Classification

Our first 2 models have the most information, but they also suffer from colinearity issues as seen by the VIF output. Model 3 performs well, but has some additional variables are poor predictors. So we’ll use Model 4.

Before we make predictions, let’s run this final model over our full dataset, and review some summary diagnostic plots and output.

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5844  -0.7208  -0.4001   0.6454   3.1284  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -1.42296    0.03476 -40.938  < 2e-16 ***
## KIDSDRIV                           0.21112    0.02805   7.527 5.18e-14 ***
## PARENT1Yes                         0.15818    0.03183   4.970 6.69e-07 ***
## HOME_VAL                          -0.24811    0.04056  -6.117 9.56e-10 ***
## MSTATUSz_No                        0.18048    0.03858   4.679 2.89e-06 ***
## JOBDoctor                         -0.17125    0.04001  -4.280 1.87e-05 ***
## `JOBHome Maker`                   -0.03334    0.03442  -0.969 0.332706    
## JOBLawyer                         -0.14305    0.04208  -3.400 0.000674 ***
## JOBManager                        -0.38541    0.04229  -9.114  < 2e-16 ***
## JOBProfessional                   -0.16704    0.03915  -4.266 1.99e-05 ***
## JOBStudent                        -0.03436    0.03509  -0.979 0.327509    
## JOBUnknown                        -0.14275    0.03832  -3.726 0.000195 ***
## `JOBz_Blue Collar`                -0.07207    0.04373  -1.648 0.099379 .  
## TRAVTIME                           0.22459    0.02982   7.530 5.06e-14 ***
## CAR_USEPrivate                    -0.34960    0.04189  -8.346  < 2e-16 ***
## BLUEBOOK                          -0.22306    0.03903  -5.715 1.10e-08 ***
## TIF                               -0.22632    0.03035  -7.458 8.81e-14 ***
## `CAR_TYPEPanel Truck`              0.17693    0.04107   4.308 1.65e-05 ***
## CAR_TYPEPickup                     0.21474    0.03742   5.738 9.56e-09 ***
## `CAR_TYPESports Car`               0.30308    0.03369   8.997  < 2e-16 ***
## CAR_TYPEVan                        0.18735    0.03505   5.346 9.01e-08 ***
## CAR_TYPEz_SUV                      0.31923    0.03855   8.281  < 2e-16 ***
## OLDCLAIM                          -0.12166    0.03425  -3.552 0.000382 ***
## CLM_FREQ                           0.22721    0.03294   6.898 5.29e-12 ***
## REVOKEDYes                         0.29136    0.02982   9.772  < 2e-16 ***
## MVR_PTS                            0.24830    0.02909   8.535  < 2e-16 ***
## CAR_AGE                           -0.11795    0.03659  -3.223 0.001267 ** 
## `URBANICITYz_Highly Rural/ Rural` -0.95632    0.04534 -21.091  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 7329.4  on 8133  degrees of freedom
## AIC: 7385.4
## 
## Number of Fisher Scoring iterations: 5

Make Predictions

Finally we make our final predictions. We’ll create a dataframe with the predictions and the predicted probabilities for our classification.

For our predictive model we’re off on the classification. We’ll make a prediction on the target amount for all observations in the test set, regardless of whether we think they’ll make a claim.

0 1 TARGET_FLAG TARGET_AMT
0.8491764 0.1508236 0 6575.548
0.8058192 0.1941808 0 6553.661
0.8887745 0.1112255 0 4469.628
0.7318214 0.2681786 0 4394.977
0.8515868 0.1484132 0 6397.923
0.7897473 0.2102527 0 6928.606
x
6575.548
6553.661
4469.628
4394.977
6397.923
6928.606

2019-04-30