DATA 621: HW 4
1. Data Exploration
2. DATA PREPARATION
3. Build Models
- Model 1
- Model 2
- Model 3
- Model 4
4. SELECT MODEL

DATA 621: HW 4

David Quarshie - Group 3

In this assignment we are given a dataset that represents customer records at an auto insurance company. Our goal is to build a logistic regression model to that will predict the probability that a person will crash their car and also estimate the amount of money it will cost if the person is in a crash.

1. Data Exploration

The data has to be cleaned to take away the ‘$’ and ‘,’ characters from the fields that have them.

Basic Statistics

The training data has 8,161 rows and 25 columns or features. Of the 25 columns, 14 are discrete, 11 are continuous, and none are all their missing. We do have 970 missing values out of 204,025 data points.

	n	mean	sd	median	min	max	skew	kurtosis
TARGET_FLAG*	8161	1.263816e+00	4.407276e-01	1	1	2.0	1.0716614	-0.8516462
TARGET_AMT	8161	1.504325e+03	4.704027e+03	0	0	107586.1	8.7063034	112.2884386
KIDSDRIV	8161	1.710575e-01	5.115341e-01	0	0	4.0	3.3518374	11.7801916
AGE	8155	4.479031e+01	8.627589e+00	45	16	81.0	-0.0289889	-0.0617020
HOMEKIDS	8161	7.212351e-01	1.116323e+00	0	0	5.0	1.3411271	0.6489915
YOJ	7707	1.049929e+01	4.092474e+00	11	0	23.0	-1.2029676	1.1773410
INCOME	7716	6.189809e+04	4.757268e+04	54028	0	367030.0	1.1863166	2.1290163
PARENT1*	8161	1.131969e+00	3.384779e-01	1	1	2.0	2.1743561	2.7281589
HOME_VAL	7697	1.548673e+05	1.291238e+05	161160	0	885282.0	0.4885950	-0.0160838
MSTATUS*	8161	1.400319e+00	4.899929e-01	1	1	2.0	0.4068189	-1.8347231
SEX*	8161	1.536086e+00	4.987266e-01	2	1	2.0	-0.1446959	-1.9793056
EDUCATION*	8161	3.090675e+00	1.444856e+00	3	1	5.0	0.1162654	-1.3799674
JOB*	8161	5.687171e+00	2.681873e+00	6	1	9.0	-0.3067029	-1.2222635
TRAVTIME	8161	3.348572e+01	1.590833e+01	33	5	142.0	0.4468174	0.6643331
CAR_USE*	8161	1.628845e+00	4.831436e-01	2	1	2.0	-0.5332937	-1.7158080
BLUEBOOK	8161	1.570990e+04	8.419734e+03	14440	1500	69740.0	0.7942141	0.7913559
TIF	8161	5.351305e+00	4.146635e+00	4	1	25.0	0.8908120	0.4224940
CAR_TYPE*	8161	3.529714e+00	1.965357e+00	3	1	6.0	-0.0047181	-1.5165329
RED_CAR*	8161	1.291386e+00	4.544287e-01	1	1	2.0	0.9180255	-1.1573709
OLDCLAIM	8161	4.037076e+03	8.777139e+03	0	0	57037.0	3.1190400	9.8606583
CLM_FREQ	8161	7.985541e-01	1.158453e+00	0	0	5.0	1.2087985	0.2842890
REVOKED*	8161	1.122534e+00	3.279216e-01	1	1	2.0	2.3018899	3.2991013
MVR_PTS	8161	1.695503e+00	2.147112e+00	1	0	13.0	1.3478403	1.3754900
CAR_AGE	7651	8.328323e+00	5.700742e+00	8	-3	28.0	0.2819531	-0.7489756
URBANICITY*	8161	1.204509e+00	4.033673e-01	1	1	2.0	1.4649406	0.1460688

Distribution of Target Variable

Let’s look at the target_flag variable in our training data to make sure there no one sided distribution.

Var1	Freq
0	6008
1	2153

## [1] 6008

Histogram of Variables

Boxplot of Variables

## NULL

2. DATA PREPARATION

Adjusting Variables

Looking at the plots we see we have to make a few changes to some variables. We’ll make HOMEKIDS boolean instead of a factor. For the rows where CAR_AGE are labled numbers, we make 0. For blank JOBS we label those as “Unknown”. Finally, change Education to 1 if PhD and Masters.

Missing Data

We have missing data in income, yoj, home_val, and car_age.

##  TARGET_FLAG   TARGET_AMT        KIDSDRIV           AGE       
##  0:6008      Min.   :     0   Min.   :0.0000   Min.   :16.00  
##  1:2153      1st Qu.:     0   1st Qu.:0.0000   1st Qu.:39.00  
##              Median :     0   Median :0.0000   Median :45.00  
##              Mean   :  1504   Mean   :0.1711   Mean   :44.79  
##              3rd Qu.:  1036   3rd Qu.:0.0000   3rd Qu.:51.00  
##              Max.   :107586   Max.   :4.0000   Max.   :81.00  
##                                                NA's   :6      
##     HOMEKIDS           YOJ           INCOME       PARENT1   
##  Min.   :0.0000   Min.   : 0.0   Min.   :     0   No :7084  
##  1st Qu.:0.0000   1st Qu.: 9.0   1st Qu.: 28097   Yes:1077  
##  Median :0.0000   Median :11.0   Median : 54028             
##  Mean   :0.3519   Mean   :10.5   Mean   : 61898             
##  3rd Qu.:1.0000   3rd Qu.:13.0   3rd Qu.: 85986             
##  Max.   :1.0000   Max.   :23.0   Max.   :367030             
##                   NA's   :454    NA's   :445                
##     HOME_VAL      MSTATUS      SEX         EDUCATION     
##  Min.   :     0   Yes :4894   M  :3786   Min.   :0.0000  
##  1st Qu.:     0   z_No:3267   z_F:4375   1st Qu.:0.0000  
##  Median :161160                          Median :1.0000  
##  Mean   :154867                          Mean   :0.7076  
##  3rd Qu.:238724                          3rd Qu.:1.0000  
##  Max.   :885282                          Max.   :1.0000  
##  NA's   :464                                             
##             JOB          TRAVTIME            CAR_USE        BLUEBOOK    
##  z_Blue Collar:1825   Min.   :  5.00   Commercial:3029   Min.   : 1500  
##  Clerical     :1271   1st Qu.: 22.00   Private   :5132   1st Qu.: 9280  
##  Professional :1117   Median : 33.00                     Median :14440  
##  Manager      : 988   Mean   : 33.49                     Mean   :15710  
##  Lawyer       : 835   3rd Qu.: 44.00                     3rd Qu.:20850  
##  Student      : 712   Max.   :142.00                     Max.   :69740  
##  (Other)      :1413                                                     
##       TIF                CAR_TYPE    RED_CAR       OLDCLAIM    
##  Min.   : 1.000   Minivan    :2145   no :5783   Min.   :    0  
##  1st Qu.: 1.000   Panel Truck: 676   yes:2378   1st Qu.:    0  
##  Median : 4.000   Pickup     :1389              Median :    0  
##  Mean   : 5.351   Sports Car : 907              Mean   : 4037  
##  3rd Qu.: 7.000   Van        : 750              3rd Qu.: 4636  
##  Max.   :25.000   z_SUV      :2294              Max.   :57037  
##                                                                
##     CLM_FREQ      REVOKED       MVR_PTS          CAR_AGE      
##  Min.   :0.0000   No :7161   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:0.0000   Yes:1000   1st Qu.: 0.000   1st Qu.: 1.000  
##  Median :0.0000              Median : 1.000   Median : 8.000  
##  Mean   :0.7986              Mean   : 1.696   Mean   : 8.329  
##  3rd Qu.:2.0000              3rd Qu.: 3.000   3rd Qu.:12.000  
##  Max.   :5.0000              Max.   :13.000   Max.   :28.000  
##                                               NA's   :510     
##                  URBANICITY  
##  Highly Urban/ Urban  :6492  
##  z_Highly Rural/ Rural:1669  
##                              
##                              
##                              
##                              
##

Imputate

Let’s runmice imputation on both the train and test set.

3. Build Models

Model 1

For the first model we will include all the variables. Looking at the output of the model we see that some points are highly colinear and a some variables that may not be necessary.

Model 1 uses the formula:

target ~ .

	x
KIDSDRIV	7.702064
AGE	10.360192
HOMEKIDS	17.584519
YOJ	9.577655
INCOME	22.277981
PARENT1Yes	13.595112
HOME_VAL	16.757962
MSTATUSz_No	15.428021
SEXz_F	25.658144
EDUCATION	31.040495
JOBDoctor	18.268248
`JOBHome Maker`	11.856568
JOBLawyer	24.654979
JOBManager	16.866626
JOBProfessional	13.423685
JOBStudent	11.234607
JOBUnknown	19.132636
`JOBz_Blue Collar`	16.156666
TRAVTIME	7.248081
CAR_USEPrivate	14.272749
BLUEBOOK	16.154506
TIF	7.532261
`CAR_TYPEPanel Truck`	15.759939
CAR_TYPEPickup	11.552785
`CAR_TYPESports Car`	13.568756
CAR_TYPEVan	10.684761
CAR_TYPEz_SUV	20.606817
RED_CARyes	12.730645
OLDCLAIM	9.655093
CLM_FREQ	8.865006
REVOKEDYes	7.320507
MVR_PTS	7.034295
CAR_AGE	12.691818
`URBANICITYz_Highly Rural/ Rural`	17.144953

Model 2

For the second model we ignore what’s colinear but remove unneccessary variables shown in model 1.

Model 2 uses the formula:

TARGET_FLAG ~ KIDSDRIV + INCOME + PARENT1 + HOME_VAL + MSTATUS + JOB + TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY

	x
KIDSDRIV	6.471909
INCOME	21.500907
PARENT1Yes	8.289954
HOME_VAL	16.645614
MSTATUSz_No	13.070655
JOBDoctor	14.887878
`JOBHome Maker`	9.706264
JOBLawyer	13.219906
JOBManager	14.724221
JOBProfessional	12.772811
JOBStudent	10.175504
JOBUnknown	12.236807
`JOBz_Blue Collar`	16.051965
TRAVTIME	7.207871
CAR_USEPrivate	14.115983
BLUEBOOK	12.961306
TIF	7.495193
`CAR_TYPEPanel Truck`	13.665753
CAR_TYPEPickup	11.492782
`CAR_TYPESports Car`	9.256940
CAR_TYPEVan	9.944616
CAR_TYPEz_SUV	12.283012
OLDCLAIM	9.621306
CLM_FREQ	8.838853
REVOKEDYes	7.276587
MVR_PTS	6.987320
`URBANICITYz_Highly Rural/ Rural`	17.206106

Model 3

Model 3 removes the variables with the 3 highest VIF values from Model 1.

Model 3 uses the formula:

TARGET_FLAG ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + INCOME + PARENT1 + HOME_VAL + MSTATUS + JOB + TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + RED_CAR + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY

	x
KIDSDRIV	7.671453
AGE	10.243422
HOMEKIDS	17.515556
YOJ	9.492256
PARENT1Yes	13.545140
HOME_VAL	13.655667
MSTATUSz_No	14.487994
JOBDoctor	14.769079
`JOBHome Maker`	11.233131
JOBLawyer	14.938707
JOBManager	14.686300
JOBProfessional	12.551505
JOBStudent	11.113690
JOBUnknown	12.074654
`JOBz_Blue Collar`	15.714812
TRAVTIME	7.233085
CAR_USEPrivate	14.239721
BLUEBOOK	13.370724
TIF	7.493869
`CAR_TYPEPanel Truck`	14.153682
CAR_TYPEPickup	11.515311
`CAR_TYPESports Car`	10.285132
CAR_TYPEVan	10.107125
CAR_TYPEz_SUV	14.479069
RED_CARyes	10.117468
OLDCLAIM	9.658407
CLM_FREQ	8.847909
REVOKEDYes	7.300375
MVR_PTS	7.006550
CAR_AGE	11.081616
`URBANICITYz_Highly Rural/ Rural`	17.122161

Model 4

Model 4 takes in Model 3 and removes thoe values that are poor predictors.

Model 4 uses the formula:

TARGET_FLAG ~ KIDSDRIV + PARENT1 + HOME_VAL + MSTATUS + JOB + TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY

	x
KIDSDRIV	6.451727
PARENT1Yes	8.292518
HOME_VAL	13.566152
MSTATUSz_No	12.313098
JOBDoctor	14.608289
`JOBHome Maker`	9.697695
JOBLawyer	14.646576
JOBManager	14.501307
JOBProfessional	12.454683
JOBStudent	10.034136
JOBUnknown	12.017685
`JOBz_Blue Collar`	15.674081
TRAVTIME	7.212779
CAR_USEPrivate	14.196919
BLUEBOOK	12.433073
TIF	7.480926
`CAR_TYPEPanel Truck`	13.689753
CAR_TYPEPickup	11.498573
`CAR_TYPESports Car`	9.244599
CAR_TYPEVan	9.963572
CAR_TYPEz_SUV	12.283742
OLDCLAIM	9.654437
CLM_FREQ	8.825693
REVOKEDYes	7.279732
MVR_PTS	6.973144
CAR_AGE	11.048829
`URBANICITYz_Highly Rural/ Rural`	17.186581
## Regression

Regression Model 1

Regression Model 1 fits includes all the variables.

target ~ .

Regression Model 2

For Regression Model 2 we take out features with less impact.

Regression Model 2’s formula:

TARGET_AMT ~ HOME_VAL + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY

Regression Model 3

Regression Model 3 deals with the issues related to the car value and driver’s legal issues.

Regression Model 3’s formula:

TARGET_AMT ~ BLUEBOOK + REVOKED + MVR_PTS + CAR_AGE

Regression Model Results

model	RMSE	Rsquared	MAE	RMSESD	RsquaredSD	MAESD
Mod1	7600.723	0.0095215	3754.895	1593.166	0.0083830	338.8376
Mod2	7516.968	0.0110225	3689.982	1881.096	0.0113882	464.5595
Mod3	7499.772	0.0186040	3658.981	1795.630	0.0189334	466.5175

4. SELECT MODEL

It’s time for us to pick which model we want to use. To help do this we’ll review each model’s accuracy by making predictions on the 20% we kept and comparing their results. We’ll use fourfold plots, summary statistics, and ROC / AUC plots to determine overall accuarcy.

To aid in model selection for the regression problem, we’ll compare the error in the fit of our models in a table and select from there.

1: Classification

Fourfold Plots

Summary Statistics

	Sensitivity	Specificity	Precision	Recall	F1
Model1	0.9112098	0.4347826	0.8185444	0.9112098	0.8623950
Model2	0.9145394	0.4285714	0.8174603	0.9145394	0.8632792
Model3	0.9023307	0.4347826	0.8170854	0.9023307	0.8575949
Model4	0.9089900	0.4347826	0.8181818	0.9089900	0.8611987

ROC / AUC

Model Selection - Classification

Our first 2 models have the most information, but they also suffer from colinearity issues as seen by the VIF output. Model 3 performs well, but has some additional variables are poor predictors. So we’ll use Model 4.

Before we make predictions, let’s run this final model over our full dataset, and review some summary diagnostic plots and output.

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5844  -0.7208  -0.4001   0.6454   3.1284  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       -1.42296    0.03476 -40.938  < 2e-16 ***
## KIDSDRIV                           0.21112    0.02805   7.527 5.18e-14 ***
## PARENT1Yes                         0.15818    0.03183   4.970 6.69e-07 ***
## HOME_VAL                          -0.24811    0.04056  -6.117 9.56e-10 ***
## MSTATUSz_No                        0.18048    0.03858   4.679 2.89e-06 ***
## JOBDoctor                         -0.17125    0.04001  -4.280 1.87e-05 ***
## `JOBHome Maker`                   -0.03334    0.03442  -0.969 0.332706    
## JOBLawyer                         -0.14305    0.04208  -3.400 0.000674 ***
## JOBManager                        -0.38541    0.04229  -9.114  < 2e-16 ***
## JOBProfessional                   -0.16704    0.03915  -4.266 1.99e-05 ***
## JOBStudent                        -0.03436    0.03509  -0.979 0.327509    
## JOBUnknown                        -0.14275    0.03832  -3.726 0.000195 ***
## `JOBz_Blue Collar`                -0.07207    0.04373  -1.648 0.099379 .  
## TRAVTIME                           0.22459    0.02982   7.530 5.06e-14 ***
## CAR_USEPrivate                    -0.34960    0.04189  -8.346  < 2e-16 ***
## BLUEBOOK                          -0.22306    0.03903  -5.715 1.10e-08 ***
## TIF                               -0.22632    0.03035  -7.458 8.81e-14 ***
## `CAR_TYPEPanel Truck`              0.17693    0.04107   4.308 1.65e-05 ***
## CAR_TYPEPickup                     0.21474    0.03742   5.738 9.56e-09 ***
## `CAR_TYPESports Car`               0.30308    0.03369   8.997  < 2e-16 ***
## CAR_TYPEVan                        0.18735    0.03505   5.346 9.01e-08 ***
## CAR_TYPEz_SUV                      0.31923    0.03855   8.281  < 2e-16 ***
## OLDCLAIM                          -0.12166    0.03425  -3.552 0.000382 ***
## CLM_FREQ                           0.22721    0.03294   6.898 5.29e-12 ***
## REVOKEDYes                         0.29136    0.02982   9.772  < 2e-16 ***
## MVR_PTS                            0.24830    0.02909   8.535  < 2e-16 ***
## CAR_AGE                           -0.11795    0.03659  -3.223 0.001267 ** 
## `URBANICITYz_Highly Rural/ Rural` -0.95632    0.04534 -21.091  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 7329.4  on 8133  degrees of freedom
## AIC: 7385.4
## 
## Number of Fisher Scoring iterations: 5

Make Predictions

Finally we make our final predictions. We’ll create a dataframe with the predictions and the predicted probabilities for our classification.

For our predictive model we’re off on the classification. We’ll make a prediction on the target amount for all observations in the test set, regardless of whether we think they’ll make a claim.

0	1	TARGET_AMT
0.8491764	0.1508236	6575.548
0.8058192	0.1941808	6553.661
0.8887745	0.1112255	4469.628
0.7318214	0.2681786	4394.977
0.8515868	0.1484132	6397.923
0.7897473	0.2102527	6928.606

x
6575.548
6553.661
4469.628
4394.977
6397.923
6928.606