DATA 621: HW 4
David Quarshie - Group 3
In this assignment we are given a dataset that represents customer records at an auto insurance company. Our goal is to build a logistic regression model to that will predict the probability that a person will crash their car and also estimate the amount of money it will cost if the person is in a crash.
1. Data Exploration
The data has to be cleaned to take away the ‘$’ and ‘,’ characters from the fields that have them.
Basic Statistics
The training data has 8,161 rows and 25 columns or features. Of the 25 columns, 14 are discrete, 11 are continuous, and none are all their missing. We do have 970 missing values out of 204,025 data points.
| n | mean | sd | median | min | max | skew | kurtosis | |
|---|---|---|---|---|---|---|---|---|
| TARGET_FLAG* | 8161 | 1.263816e+00 | 4.407276e-01 | 1 | 1 | 2.0 | 1.0716614 | -0.8516462 |
| TARGET_AMT | 8161 | 1.504325e+03 | 4.704027e+03 | 0 | 0 | 107586.1 | 8.7063034 | 112.2884386 |
| KIDSDRIV | 8161 | 1.710575e-01 | 5.115341e-01 | 0 | 0 | 4.0 | 3.3518374 | 11.7801916 |
| AGE | 8155 | 4.479031e+01 | 8.627589e+00 | 45 | 16 | 81.0 | -0.0289889 | -0.0617020 |
| HOMEKIDS | 8161 | 7.212351e-01 | 1.116323e+00 | 0 | 0 | 5.0 | 1.3411271 | 0.6489915 |
| YOJ | 7707 | 1.049929e+01 | 4.092474e+00 | 11 | 0 | 23.0 | -1.2029676 | 1.1773410 |
| INCOME | 7716 | 6.189809e+04 | 4.757268e+04 | 54028 | 0 | 367030.0 | 1.1863166 | 2.1290163 |
| PARENT1* | 8161 | 1.131969e+00 | 3.384779e-01 | 1 | 1 | 2.0 | 2.1743561 | 2.7281589 |
| HOME_VAL | 7697 | 1.548673e+05 | 1.291238e+05 | 161160 | 0 | 885282.0 | 0.4885950 | -0.0160838 |
| MSTATUS* | 8161 | 1.400319e+00 | 4.899929e-01 | 1 | 1 | 2.0 | 0.4068189 | -1.8347231 |
| SEX* | 8161 | 1.536086e+00 | 4.987266e-01 | 2 | 1 | 2.0 | -0.1446959 | -1.9793056 |
| EDUCATION* | 8161 | 3.090675e+00 | 1.444856e+00 | 3 | 1 | 5.0 | 0.1162654 | -1.3799674 |
| JOB* | 8161 | 5.687171e+00 | 2.681873e+00 | 6 | 1 | 9.0 | -0.3067029 | -1.2222635 |
| TRAVTIME | 8161 | 3.348572e+01 | 1.590833e+01 | 33 | 5 | 142.0 | 0.4468174 | 0.6643331 |
| CAR_USE* | 8161 | 1.628845e+00 | 4.831436e-01 | 2 | 1 | 2.0 | -0.5332937 | -1.7158080 |
| BLUEBOOK | 8161 | 1.570990e+04 | 8.419734e+03 | 14440 | 1500 | 69740.0 | 0.7942141 | 0.7913559 |
| TIF | 8161 | 5.351305e+00 | 4.146635e+00 | 4 | 1 | 25.0 | 0.8908120 | 0.4224940 |
| CAR_TYPE* | 8161 | 3.529714e+00 | 1.965357e+00 | 3 | 1 | 6.0 | -0.0047181 | -1.5165329 |
| RED_CAR* | 8161 | 1.291386e+00 | 4.544287e-01 | 1 | 1 | 2.0 | 0.9180255 | -1.1573709 |
| OLDCLAIM | 8161 | 4.037076e+03 | 8.777139e+03 | 0 | 0 | 57037.0 | 3.1190400 | 9.8606583 |
| CLM_FREQ | 8161 | 7.985541e-01 | 1.158453e+00 | 0 | 0 | 5.0 | 1.2087985 | 0.2842890 |
| REVOKED* | 8161 | 1.122534e+00 | 3.279216e-01 | 1 | 1 | 2.0 | 2.3018899 | 3.2991013 |
| MVR_PTS | 8161 | 1.695503e+00 | 2.147112e+00 | 1 | 0 | 13.0 | 1.3478403 | 1.3754900 |
| CAR_AGE | 7651 | 8.328323e+00 | 5.700742e+00 | 8 | -3 | 28.0 | 0.2819531 | -0.7489756 |
| URBANICITY* | 8161 | 1.204509e+00 | 4.033673e-01 | 1 | 1 | 2.0 | 1.4649406 | 0.1460688 |
Distribution of Target Variable
Let’s look at the target_flag variable in our training data to make sure there no one sided distribution.
| Var1 | Freq |
|---|---|
| 0 | 6008 |
| 1 | 2153 |
## [1] 6008
Histogram of Variables
Boxplot of Variables
## NULL
2. DATA PREPARATION
Adjusting Variables
Looking at the plots we see we have to make a few changes to some variables. We’ll make HOMEKIDS boolean instead of a factor. For the rows where CAR_AGE are labled numbers, we make 0. For blank JOBS we label those as “Unknown”. Finally, change Education to 1 if PhD and Masters.
Missing Data
We have missing data in income, yoj, home_val, and car_age.
## TARGET_FLAG TARGET_AMT KIDSDRIV AGE
## 0:6008 Min. : 0 Min. :0.0000 Min. :16.00
## 1:2153 1st Qu.: 0 1st Qu.:0.0000 1st Qu.:39.00
## Median : 0 Median :0.0000 Median :45.00
## Mean : 1504 Mean :0.1711 Mean :44.79
## 3rd Qu.: 1036 3rd Qu.:0.0000 3rd Qu.:51.00
## Max. :107586 Max. :4.0000 Max. :81.00
## NA's :6
## HOMEKIDS YOJ INCOME PARENT1
## Min. :0.0000 Min. : 0.0 Min. : 0 No :7084
## 1st Qu.:0.0000 1st Qu.: 9.0 1st Qu.: 28097 Yes:1077
## Median :0.0000 Median :11.0 Median : 54028
## Mean :0.3519 Mean :10.5 Mean : 61898
## 3rd Qu.:1.0000 3rd Qu.:13.0 3rd Qu.: 85986
## Max. :1.0000 Max. :23.0 Max. :367030
## NA's :454 NA's :445
## HOME_VAL MSTATUS SEX EDUCATION
## Min. : 0 Yes :4894 M :3786 Min. :0.0000
## 1st Qu.: 0 z_No:3267 z_F:4375 1st Qu.:0.0000
## Median :161160 Median :1.0000
## Mean :154867 Mean :0.7076
## 3rd Qu.:238724 3rd Qu.:1.0000
## Max. :885282 Max. :1.0000
## NA's :464
## JOB TRAVTIME CAR_USE BLUEBOOK
## z_Blue Collar:1825 Min. : 5.00 Commercial:3029 Min. : 1500
## Clerical :1271 1st Qu.: 22.00 Private :5132 1st Qu.: 9280
## Professional :1117 Median : 33.00 Median :14440
## Manager : 988 Mean : 33.49 Mean :15710
## Lawyer : 835 3rd Qu.: 44.00 3rd Qu.:20850
## Student : 712 Max. :142.00 Max. :69740
## (Other) :1413
## TIF CAR_TYPE RED_CAR OLDCLAIM
## Min. : 1.000 Minivan :2145 no :5783 Min. : 0
## 1st Qu.: 1.000 Panel Truck: 676 yes:2378 1st Qu.: 0
## Median : 4.000 Pickup :1389 Median : 0
## Mean : 5.351 Sports Car : 907 Mean : 4037
## 3rd Qu.: 7.000 Van : 750 3rd Qu.: 4636
## Max. :25.000 z_SUV :2294 Max. :57037
##
## CLM_FREQ REVOKED MVR_PTS CAR_AGE
## Min. :0.0000 No :7161 Min. : 0.000 Min. : 0.000
## 1st Qu.:0.0000 Yes:1000 1st Qu.: 0.000 1st Qu.: 1.000
## Median :0.0000 Median : 1.000 Median : 8.000
## Mean :0.7986 Mean : 1.696 Mean : 8.329
## 3rd Qu.:2.0000 3rd Qu.: 3.000 3rd Qu.:12.000
## Max. :5.0000 Max. :13.000 Max. :28.000
## NA's :510
## URBANICITY
## Highly Urban/ Urban :6492
## z_Highly Rural/ Rural:1669
##
##
##
##
##
Imputate
Let’s runmice imputation on both the train and test set.
3. Build Models
Model 1
For the first model we will include all the variables. Looking at the output of the model we see that some points are highly colinear and a some variables that may not be necessary.
Model 1 uses the formula:
target ~ .
| x | |
|---|---|
| KIDSDRIV | 7.702064 |
| AGE | 10.360192 |
| HOMEKIDS | 17.584519 |
| YOJ | 9.577655 |
| INCOME | 22.277981 |
| PARENT1Yes | 13.595112 |
| HOME_VAL | 16.757962 |
| MSTATUSz_No | 15.428021 |
| SEXz_F | 25.658144 |
| EDUCATION | 31.040495 |
| JOBDoctor | 18.268248 |
JOBHome Maker |
11.856568 |
| JOBLawyer | 24.654979 |
| JOBManager | 16.866626 |
| JOBProfessional | 13.423685 |
| JOBStudent | 11.234607 |
| JOBUnknown | 19.132636 |
JOBz_Blue Collar |
16.156666 |
| TRAVTIME | 7.248081 |
| CAR_USEPrivate | 14.272749 |
| BLUEBOOK | 16.154506 |
| TIF | 7.532261 |
CAR_TYPEPanel Truck |
15.759939 |
| CAR_TYPEPickup | 11.552785 |
CAR_TYPESports Car |
13.568756 |
| CAR_TYPEVan | 10.684761 |
| CAR_TYPEz_SUV | 20.606817 |
| RED_CARyes | 12.730645 |
| OLDCLAIM | 9.655093 |
| CLM_FREQ | 8.865006 |
| REVOKEDYes | 7.320507 |
| MVR_PTS | 7.034295 |
| CAR_AGE | 12.691818 |
URBANICITYz_Highly Rural/ Rural |
17.144953 |
Model 2
For the second model we ignore what’s colinear but remove unneccessary variables shown in model 1.
Model 2 uses the formula:
TARGET_FLAG ~ KIDSDRIV + INCOME + PARENT1 + HOME_VAL + MSTATUS + JOB + TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY
| x | |
|---|---|
| KIDSDRIV | 6.471909 |
| INCOME | 21.500907 |
| PARENT1Yes | 8.289954 |
| HOME_VAL | 16.645614 |
| MSTATUSz_No | 13.070655 |
| JOBDoctor | 14.887878 |
JOBHome Maker |
9.706264 |
| JOBLawyer | 13.219906 |
| JOBManager | 14.724221 |
| JOBProfessional | 12.772811 |
| JOBStudent | 10.175504 |
| JOBUnknown | 12.236807 |
JOBz_Blue Collar |
16.051965 |
| TRAVTIME | 7.207871 |
| CAR_USEPrivate | 14.115983 |
| BLUEBOOK | 12.961306 |
| TIF | 7.495193 |
CAR_TYPEPanel Truck |
13.665753 |
| CAR_TYPEPickup | 11.492782 |
CAR_TYPESports Car |
9.256940 |
| CAR_TYPEVan | 9.944616 |
| CAR_TYPEz_SUV | 12.283012 |
| OLDCLAIM | 9.621306 |
| CLM_FREQ | 8.838853 |
| REVOKEDYes | 7.276587 |
| MVR_PTS | 6.987320 |
URBANICITYz_Highly Rural/ Rural |
17.206106 |
Model 3
Model 3 removes the variables with the 3 highest VIF values from Model 1.
Model 3 uses the formula:
TARGET_FLAG ~ KIDSDRIV + AGE + HOMEKIDS + YOJ + INCOME + PARENT1 + HOME_VAL + MSTATUS + JOB + TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + RED_CAR + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY
| x | |
|---|---|
| KIDSDRIV | 7.671453 |
| AGE | 10.243422 |
| HOMEKIDS | 17.515556 |
| YOJ | 9.492256 |
| PARENT1Yes | 13.545140 |
| HOME_VAL | 13.655667 |
| MSTATUSz_No | 14.487994 |
| JOBDoctor | 14.769079 |
JOBHome Maker |
11.233131 |
| JOBLawyer | 14.938707 |
| JOBManager | 14.686300 |
| JOBProfessional | 12.551505 |
| JOBStudent | 11.113690 |
| JOBUnknown | 12.074654 |
JOBz_Blue Collar |
15.714812 |
| TRAVTIME | 7.233085 |
| CAR_USEPrivate | 14.239721 |
| BLUEBOOK | 13.370724 |
| TIF | 7.493869 |
CAR_TYPEPanel Truck |
14.153682 |
| CAR_TYPEPickup | 11.515311 |
CAR_TYPESports Car |
10.285132 |
| CAR_TYPEVan | 10.107125 |
| CAR_TYPEz_SUV | 14.479069 |
| RED_CARyes | 10.117468 |
| OLDCLAIM | 9.658407 |
| CLM_FREQ | 8.847909 |
| REVOKEDYes | 7.300375 |
| MVR_PTS | 7.006550 |
| CAR_AGE | 11.081616 |
URBANICITYz_Highly Rural/ Rural |
17.122161 |
Model 4
Model 4 takes in Model 3 and removes thoe values that are poor predictors.
Model 4 uses the formula:
TARGET_FLAG ~ KIDSDRIV + PARENT1 + HOME_VAL + MSTATUS + JOB + TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY
| x | |
|---|---|
| KIDSDRIV | 6.451727 |
| PARENT1Yes | 8.292518 |
| HOME_VAL | 13.566152 |
| MSTATUSz_No | 12.313098 |
| JOBDoctor | 14.608289 |
JOBHome Maker |
9.697695 |
| JOBLawyer | 14.646576 |
| JOBManager | 14.501307 |
| JOBProfessional | 12.454683 |
| JOBStudent | 10.034136 |
| JOBUnknown | 12.017685 |
JOBz_Blue Collar |
15.674081 |
| TRAVTIME | 7.212779 |
| CAR_USEPrivate | 14.196919 |
| BLUEBOOK | 12.433073 |
| TIF | 7.480926 |
CAR_TYPEPanel Truck |
13.689753 |
| CAR_TYPEPickup | 11.498573 |
CAR_TYPESports Car |
9.244599 |
| CAR_TYPEVan | 9.963572 |
| CAR_TYPEz_SUV | 12.283742 |
| OLDCLAIM | 9.654437 |
| CLM_FREQ | 8.825693 |
| REVOKEDYes | 7.279732 |
| MVR_PTS | 6.973144 |
| CAR_AGE | 11.048829 |
URBANICITYz_Highly Rural/ Rural |
17.186581 |
| ## Regression |
Regression Model 1
Regression Model 1 fits includes all the variables.
target ~ .
Regression Model 2
For Regression Model 2 we take out features with less impact.
Regression Model 2’s formula:
TARGET_AMT ~ HOME_VAL + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY
Regression Model 3
Regression Model 3 deals with the issues related to the car value and driver’s legal issues.
Regression Model 3’s formula:
TARGET_AMT ~ BLUEBOOK + REVOKED + MVR_PTS + CAR_AGE
Regression Model Results
| model | RMSE | Rsquared | MAE | RMSESD | RsquaredSD | MAESD |
|---|---|---|---|---|---|---|
| Mod1 | 7600.723 | 0.0095215 | 3754.895 | 1593.166 | 0.0083830 | 338.8376 |
| Mod2 | 7516.968 | 0.0110225 | 3689.982 | 1881.096 | 0.0113882 | 464.5595 |
| Mod3 | 7499.772 | 0.0186040 | 3658.981 | 1795.630 | 0.0189334 | 466.5175 |
4. SELECT MODEL
It’s time for us to pick which model we want to use. To help do this we’ll review each model’s accuracy by making predictions on the 20% we kept and comparing their results. We’ll use fourfold plots, summary statistics, and ROC / AUC plots to determine overall accuarcy.
To aid in model selection for the regression problem, we’ll compare the error in the fit of our models in a table and select from there.
1: Classification
Fourfold Plots
Summary Statistics
| Sensitivity | Specificity | Precision | Recall | F1 | |
|---|---|---|---|---|---|
| Model1 | 0.9112098 | 0.4347826 | 0.8185444 | 0.9112098 | 0.8623950 |
| Model2 | 0.9145394 | 0.4285714 | 0.8174603 | 0.9145394 | 0.8632792 |
| Model3 | 0.9023307 | 0.4347826 | 0.8170854 | 0.9023307 | 0.8575949 |
| Model4 | 0.9089900 | 0.4347826 | 0.8181818 | 0.9089900 | 0.8611987 |
ROC / AUC
Model Selection - Classification
Our first 2 models have the most information, but they also suffer from colinearity issues as seen by the VIF output. Model 3 performs well, but has some additional variables are poor predictors. So we’ll use Model 4.
Before we make predictions, let’s run this final model over our full dataset, and review some summary diagnostic plots and output.
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5844 -0.7208 -0.4001 0.6454 3.1284
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.42296 0.03476 -40.938 < 2e-16 ***
## KIDSDRIV 0.21112 0.02805 7.527 5.18e-14 ***
## PARENT1Yes 0.15818 0.03183 4.970 6.69e-07 ***
## HOME_VAL -0.24811 0.04056 -6.117 9.56e-10 ***
## MSTATUSz_No 0.18048 0.03858 4.679 2.89e-06 ***
## JOBDoctor -0.17125 0.04001 -4.280 1.87e-05 ***
## `JOBHome Maker` -0.03334 0.03442 -0.969 0.332706
## JOBLawyer -0.14305 0.04208 -3.400 0.000674 ***
## JOBManager -0.38541 0.04229 -9.114 < 2e-16 ***
## JOBProfessional -0.16704 0.03915 -4.266 1.99e-05 ***
## JOBStudent -0.03436 0.03509 -0.979 0.327509
## JOBUnknown -0.14275 0.03832 -3.726 0.000195 ***
## `JOBz_Blue Collar` -0.07207 0.04373 -1.648 0.099379 .
## TRAVTIME 0.22459 0.02982 7.530 5.06e-14 ***
## CAR_USEPrivate -0.34960 0.04189 -8.346 < 2e-16 ***
## BLUEBOOK -0.22306 0.03903 -5.715 1.10e-08 ***
## TIF -0.22632 0.03035 -7.458 8.81e-14 ***
## `CAR_TYPEPanel Truck` 0.17693 0.04107 4.308 1.65e-05 ***
## CAR_TYPEPickup 0.21474 0.03742 5.738 9.56e-09 ***
## `CAR_TYPESports Car` 0.30308 0.03369 8.997 < 2e-16 ***
## CAR_TYPEVan 0.18735 0.03505 5.346 9.01e-08 ***
## CAR_TYPEz_SUV 0.31923 0.03855 8.281 < 2e-16 ***
## OLDCLAIM -0.12166 0.03425 -3.552 0.000382 ***
## CLM_FREQ 0.22721 0.03294 6.898 5.29e-12 ***
## REVOKEDYes 0.29136 0.02982 9.772 < 2e-16 ***
## MVR_PTS 0.24830 0.02909 8.535 < 2e-16 ***
## CAR_AGE -0.11795 0.03659 -3.223 0.001267 **
## `URBANICITYz_Highly Rural/ Rural` -0.95632 0.04534 -21.091 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7329.4 on 8133 degrees of freedom
## AIC: 7385.4
##
## Number of Fisher Scoring iterations: 5
Make Predictions
Finally we make our final predictions. We’ll create a dataframe with the predictions and the predicted probabilities for our classification.
For our predictive model we’re off on the classification. We’ll make a prediction on the target amount for all observations in the test set, regardless of whether we think they’ll make a claim.
| 0 | 1 | TARGET_FLAG | TARGET_AMT |
|---|---|---|---|
| 0.8491764 | 0.1508236 | 0 | 6575.548 |
| 0.8058192 | 0.1941808 | 0 | 6553.661 |
| 0.8887745 | 0.1112255 | 0 | 4469.628 |
| 0.7318214 | 0.2681786 | 0 | 4394.977 |
| 0.8515868 | 0.1484132 | 0 | 6397.923 |
| 0.7897473 | 0.2102527 | 0 | 6928.606 |
| x |
|---|
| 6575.548 |
| 6553.661 |
| 4469.628 |
| 4394.977 |
| 6397.923 |
| 6928.606 |