In this homework assignment, you will explore, analyze and model a data set containing approximately 8000 records representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A zero means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero.
Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
A write-up submitted in PDF format. Your write-up should have four sections. Each one is described below. You may assume you are addressing me as a fellow data scientist, so do not need to shy away from technical details.
Assigned predictions (probabilities, classifications, cost) for the evaluation data set. Use 0.5 threshold.
Include your R statistical programming code in an Appendix.
Training and testing datasets have been uploaded to my github account and loaded here using the read.csv function. Our training dataset has 8161 observations & 26 variables whereas testing dataset has 2141 observations and 26 variables. Each observation represent a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A zero means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero.
## [1] 8161 26
## [1] 2141 26
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV
## Min. : 1 Min. :0.0000 Min. : 0 Min. :0.0000
## 1st Qu.: 2559 1st Qu.:0.0000 1st Qu.: 0 1st Qu.:0.0000
## Median : 5133 Median :0.0000 Median : 0 Median :0.0000
## Mean : 5152 Mean :0.2638 Mean : 1504 Mean :0.1711
## 3rd Qu.: 7745 3rd Qu.:1.0000 3rd Qu.: 1036 3rd Qu.:0.0000
## Max. :10302 Max. :1.0000 Max. :107586 Max. :4.0000
##
## AGE HOMEKIDS YOJ INCOME
## Min. :16.00 Min. :0.0000 Min. : 0.0 Length:8161
## 1st Qu.:39.00 1st Qu.:0.0000 1st Qu.: 9.0 Class :character
## Median :45.00 Median :0.0000 Median :11.0 Mode :character
## Mean :44.79 Mean :0.7212 Mean :10.5
## 3rd Qu.:51.00 3rd Qu.:1.0000 3rd Qu.:13.0
## Max. :81.00 Max. :5.0000 Max. :23.0
## NA's :6 NA's :454
## PARENT1 HOME_VAL MSTATUS SEX
## Length:8161 Length:8161 Length:8161 Length:8161
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## EDUCATION JOB TRAVTIME CAR_USE
## Length:8161 Length:8161 Min. : 5.00 Length:8161
## Class :character Class :character 1st Qu.: 22.00 Class :character
## Mode :character Mode :character Median : 33.00 Mode :character
## Mean : 33.49
## 3rd Qu.: 44.00
## Max. :142.00
##
## BLUEBOOK TIF CAR_TYPE RED_CAR
## Length:8161 Min. : 1.000 Length:8161 Length:8161
## Class :character 1st Qu.: 1.000 Class :character Class :character
## Mode :character Median : 4.000 Mode :character Mode :character
## Mean : 5.351
## 3rd Qu.: 7.000
## Max. :25.000
##
## OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## Length:8161 Min. :0.0000 Length:8161 Min. : 0.000
## Class :character 1st Qu.:0.0000 Class :character 1st Qu.: 0.000
## Mode :character Median :0.0000 Mode :character Median : 1.000
## Mean :0.7986 Mean : 1.696
## 3rd Qu.:2.0000 3rd Qu.: 3.000
## Max. :5.0000 Max. :13.000
##
## CAR_AGE URBANICITY
## Min. :-3.000 Length:8161
## 1st Qu.: 1.000 Class :character
## Median : 8.000 Mode :character
## Mean : 8.328
## 3rd Qu.:12.000
## Max. :28.000
## NA's :510
We see 0.9% of missing data in the dataset further in the table below we see AGE has 6 missing data, 454 missing data in YOJ, 464 in HOME_VAL 510 in CAR_AGE
| x | |
|---|---|
| TARGET_FLAG | 0 |
| TARGET_AMT | 0 |
| KIDSDRIV | 0 |
| AGE | 6 |
| HOMEKIDS | 0 |
| YOJ | 454 |
| INCOME | 445 |
| PARENT1 | 0 |
| HOME_VAL | 464 |
| MSTATUS | 0 |
| SEX | 0 |
| EDUCATION | 0 |
| JOB | 0 |
| TRAVTIME | 0 |
| CAR_USE | 0 |
| BLUEBOOK | 0 |
| TIF | 0 |
| CAR_TYPE | 0 |
| RED_CAR | 0 |
| OLDCLAIM | 0 |
| CLM_FREQ | 0 |
| REVOKED | 0 |
| MVR_PTS | 0 |
| CAR_AGE | 510 |
| URBANICITY | 0 |
From the histograms below we see the prevalence of kurtosis, specifically right skew in variables - BLUEBOOK,INCOME MVR_PTS,OLDCLAIM,``TARGET_AMT, TRAVTIME and we see almost normal distributions in AGE, CAR_AGE and YOJ. We would need transformations of the variables and we will be doing this in the next section of data preparation.
From the scatter plots below which is plotted by each variable wise against the target variable TARGET_AMT, we see notable trends in the scatterplots below such as our response variable TARGET_AMT is likely to be lower when individuals have more kids at home as indicated by the HOMEKIDS feature, and when they have more teenagers driving the car indicated by the feature KIDSDRIV.
There aren’t any significant amount of correlated features as we see from the correlation plot.
As we saw in our earlier analysis we saw missing data in age, yoj, home_val, car_age and income. We replace the missing values with their respective means.
We see from the below that there aren’t any missing values anymore.
We create a clean df df_clean by removing the INDEX column. We also replace the yes/ no with 1/ 0.
As we saw in our earlier analysis certain variables requires transformations due to the presence of skewing.
## Fitted parameters:
## lambda beta sigmasq
## 0.4436021 291.9569344 8541.2268213
##
## Convergence code returned by optim: 0
## Fitted parameters:
## lambda beta sigmasq
## 0.1016119 24.1545983 2.1701992
##
## Convergence code returned by optim: 0
## Fitted parameters:
## lambda beta sigmasq
## 0.4610754 177.4257712 2217.4825612
##
## Convergence code returned by optim: 0
## Fitted parameters:
## lambda beta sigmasq
## -0.04511237 7.22517933 0.44041250
##
## Convergence code returned by optim: 0
INCOME_TRANSFORMED, HOME_VAL_TRANSFORMED, BLUEBOOK_MOD_TRANSFORMED, OLD_CLAIM_TRANSFORMED are the tranformed variables.
We build our first model Model1 by including all the variables and run the model. Many variables have significant p-value and JOBClerical has the largest p-value.
##
## Call:
## glm(formula = TARGET_FLAG ~ . - INCOME_TRANSFORMED - HOME_VAL_TRANSFORMED -
## BLUEBOOK_MOD_TRANSFORMED - OLD_CLAIM_TRANSFORMED, family = binomial(link = "logit"),
## data = df_clean[, -c(2)])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2562 -0.7497 -0.5050 0.7314 2.6763
##
## Coefficients: (19 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.308e+00 3.015e-01 -4.338 1.44e-05 ***
## KIDSDRIV 3.016e-01 5.746e-02 5.249 1.53e-07 ***
## AGE -3.570e-03 3.843e-03 -0.929 0.352925
## HOMEKIDS 3.792e-02 3.527e-02 1.075 0.282307
## YOJ -6.714e-03 8.111e-03 -0.828 0.407792
## INCOME -3.076e-06 1.051e-06 -2.927 0.003423 **
## PARENT1 3.208e-01 1.039e-01 3.087 0.002022 **
## HOME_VAL -1.312e-06 3.318e-07 -3.952 7.75e-05 ***
## MSTATUS -3.933e-01 7.994e-02 -4.920 8.64e-07 ***
## SEX 9.232e-02 9.697e-02 0.952 0.341098
## EDUCATIONBachelors -2.923e-01 1.091e-01 -2.679 0.007385 **
## EDUCATIONMasters -1.709e-01 1.717e-01 -0.995 0.319626
## EDUCATIONPhD -8.169e-02 2.062e-01 -0.396 0.691982
## EDUCATIONz_High_School 2.895e-02 8.851e-02 0.327 0.743612
## JOBClerical -2.363e-03 1.914e-01 -0.012 0.990150
## JOBDoctor -4.431e-01 2.671e-01 -1.659 0.097109 .
## JOBHome_Maker -1.280e-01 2.037e-01 -0.629 0.529619
## JOBLawyer 3.370e-02 1.672e-01 0.202 0.840275
## JOBManager -5.048e-01 1.703e-01 -2.963 0.003044 **
## JOBProfessional 2.300e-02 1.757e-01 0.131 0.895878
## JOBStudent -1.899e-01 2.080e-01 -0.913 0.361214
## JOBz_Blue_Collar 1.168e-01 1.816e-01 0.643 0.520368
## TRAVTIME 6.586e-03 1.733e-03 3.801 0.000144 ***
## CAR_USE 6.247e-01 8.685e-02 7.192 6.37e-13 ***
## BLUEBOOK -1.755e-05 5.055e-06 -3.473 0.000515 ***
## TIF -5.014e-02 7.046e-03 -7.116 1.11e-12 ***
## CAR_TYPEPanel_Truck 5.257e-01 1.557e-01 3.376 0.000736 ***
## CAR_TYPEPickup 5.150e-01 9.663e-02 5.330 9.84e-08 ***
## CAR_TYPESports_Car 8.987e-01 1.249e-01 7.193 6.36e-13 ***
## CAR_TYPEVan 5.612e-01 1.220e-01 4.598 4.26e-06 ***
## CAR_TYPEz_SUV 6.972e-01 1.073e-01 6.499 8.10e-11 ***
## RED_CAR NA NA NA NA
## OLDCLAIM -1.327e-05 3.789e-06 -3.502 0.000461 ***
## CLM_FREQ 3.127e-01 2.730e-02 11.456 < 2e-16 ***
## REVOKED 9.795e-01 8.742e-02 11.204 < 2e-16 ***
## MVR_PTS 1.354e-01 1.308e-02 10.352 < 2e-16 ***
## CAR_AGE -1.086e-03 7.232e-03 -0.150 0.880688
## URBANICITY NA NA NA NA
## HSDropout NA NA NA NA
## Bachelors NA NA NA NA
## Masters NA NA NA NA
## PhD NA NA NA NA
## Panel_Truck NA NA NA NA
## Pickup NA NA NA NA
## Sports_Car NA NA NA NA
## Van NA NA NA NA
## SUV NA NA NA NA
## Professional NA NA NA NA
## Blue_Collar NA NA NA NA
## Clerical NA NA NA NA
## Doctor NA NA NA NA
## Lawyer NA NA NA NA
## Manager NA NA NA NA
## Home_Maker NA NA NA NA
## Student NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7946.3 on 8125 degrees of freedom
## AIC: 8018.3
##
## Number of Fisher Scoring iterations: 5
In our model 2 we include the original variables as well as the transformed variables. We see that even though some of the p-values are reduced.
##
## Call:
## glm(formula = TARGET_FLAG ~ ., family = binomial(link = "logit"),
## data = df_clean[, -c(2)])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3309 -0.7444 -0.4913 0.7536 2.7137
##
## Coefficients: (19 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.061e-01 4.148e-01 -0.497 0.61924
## KIDSDRIV 3.124e-01 5.794e-02 5.391 7.00e-08 ***
## AGE -4.433e-03 3.873e-03 -1.145 0.25236
## HOMEKIDS 2.162e-02 3.579e-02 0.604 0.54587
## YOJ 7.510e-03 9.406e-03 0.798 0.42463
## INCOME 2.271e-06 2.356e-06 0.964 0.33513
## PARENT1 3.108e-01 1.047e-01 2.968 0.00300 **
## HOME_VAL -5.229e-07 6.301e-07 -0.830 0.40657
## MSTATUS -3.816e-01 8.312e-02 -4.591 4.42e-06 ***
## SEX 1.248e-01 9.772e-02 1.277 0.20159
## EDUCATIONBachelors -2.448e-01 1.110e-01 -2.206 0.02735 *
## EDUCATIONMasters -1.121e-01 1.733e-01 -0.647 0.51771
## EDUCATIONPhD -1.183e-01 2.083e-01 -0.568 0.56994
## EDUCATIONz_High_School 6.719e-02 8.987e-02 0.748 0.45467
## JOBClerical 2.629e-02 1.921e-01 0.137 0.89115
## JOBDoctor -3.884e-01 2.656e-01 -1.462 0.14362
## JOBHome_Maker -3.070e-01 2.158e-01 -1.422 0.15495
## JOBLawyer 4.006e-02 1.677e-01 0.239 0.81121
## JOBManager -4.830e-01 1.702e-01 -2.838 0.00454 **
## JOBProfessional 6.366e-02 1.760e-01 0.362 0.71763
## JOBStudent -4.582e-01 2.242e-01 -2.044 0.04094 *
## JOBz_Blue_Collar 1.861e-01 1.824e-01 1.020 0.30757
## TRAVTIME 7.001e-03 1.747e-03 4.008 6.12e-05 ***
## CAR_USE 6.170e-01 8.748e-02 7.053 1.75e-12 ***
## BLUEBOOK 3.753e-05 1.900e-05 1.975 0.04825 *
## TIF -4.941e-02 7.078e-03 -6.981 2.93e-12 ***
## CAR_TYPEPanel_Truck 3.919e-01 1.601e-01 2.448 0.01436 *
## CAR_TYPEPickup 5.215e-01 9.710e-02 5.371 7.84e-08 ***
## CAR_TYPESports_Car 8.933e-01 1.257e-01 7.109 1.17e-12 ***
## CAR_TYPEVan 5.659e-01 1.224e-01 4.625 3.74e-06 ***
## CAR_TYPEz_SUV 7.178e-01 1.085e-01 6.617 3.66e-11 ***
## RED_CAR NA NA NA NA
## OLDCLAIM -3.373e-05 4.655e-06 -7.245 4.32e-13 ***
## CLM_FREQ 5.348e-02 4.274e-02 1.251 0.21077
## REVOKED 1.087e+00 8.931e-02 12.176 < 2e-16 ***
## MVR_PTS 1.064e-01 1.356e-02 7.847 4.25e-15 ***
## CAR_AGE -7.413e-04 7.264e-03 -0.102 0.91872
## URBANICITY NA NA NA NA
## HSDropout NA NA NA NA
## Bachelors NA NA NA NA
## Masters NA NA NA NA
## PhD NA NA NA NA
## Panel_Truck NA NA NA NA
## Pickup NA NA NA NA
## Sports_Car NA NA NA NA
## Van NA NA NA NA
## SUV NA NA NA NA
## Professional NA NA NA NA
## Blue_Collar NA NA NA NA
## Clerical NA NA NA NA
## Doctor NA NA NA NA
## Lawyer NA NA NA NA
## Manager NA NA NA NA
## Home_Maker NA NA NA NA
## Student NA NA NA NA
## INCOME_TRANSFORMED -6.794e-03 2.218e-03 -3.063 0.00219 **
## HOME_VAL_TRANSFORMED -6.199e-02 4.413e-02 -1.405 0.16004
## BLUEBOOK_MOD_TRANSFORMED -2.052e-02 6.847e-03 -2.996 0.00273 **
## OLD_CLAIM_TRANSFORMED 1.164e-01 1.459e-02 7.978 1.49e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7863.3 on 8121 degrees of freedom
## AIC: 7943.3
##
## Number of Fisher Scoring iterations: 5
In this model we tried including the variables with significant p-values.
##
## Call:
## glm(formula = TARGET_FLAG ~ . - AGE - HOMEKIDS - YOJ - INCOME -
## PARENT1 - HOME_VAL - MSTATUS - SEX - RED_CAR - CLM_FREQ -
## CAR_AGE - HSDropout - Professional - Blue_Collar - Clerical -
## Lawyer - Home_Maker - HOME_VAL_TRANSFORMED - Student - Doctor -
## CAR_USE - REVOKED - URBANICITY - Bachelors - Masters - PhD -
## Panel_Truck - Pickup - Sports_Car - Van - SUV - Manager,
## family = binomial(link = "logit"), data = df_clean[, -c(2)])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9313 -0.7729 -0.5421 0.8900 2.4980
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.193e-01 3.601e-01 -1.164 0.244243
## KIDSDRIV 3.607e-01 4.850e-02 7.438 1.02e-13 ***
## EDUCATIONBachelors -1.028e-02 9.662e-02 -0.106 0.915242
## EDUCATIONMasters 5.759e-02 1.495e-01 0.385 0.700140
## EDUCATIONPhD 1.153e-01 1.827e-01 0.631 0.527973
## EDUCATIONz_High_School 2.454e-01 8.257e-02 2.972 0.002958 **
## JOBClerical -1.242e-01 1.855e-01 -0.669 0.503205
## JOBDoctor -6.858e-01 2.553e-01 -2.686 0.007230 **
## JOBHome_Maker -5.750e-01 2.034e-01 -2.827 0.004697 **
## JOBLawyer -2.453e-01 1.577e-01 -1.556 0.119767
## JOBManager -7.001e-01 1.648e-01 -4.247 2.17e-05 ***
## JOBProfessional -1.546e-01 1.697e-01 -0.911 0.362176
## JOBStudent -2.315e-01 2.096e-01 -1.104 0.269477
## JOBz_Blue_Collar 3.197e-01 1.760e-01 1.816 0.069394 .
## TRAVTIME 6.131e-03 1.695e-03 3.617 0.000299 ***
## BLUEBOOK 3.374e-05 1.782e-05 1.893 0.058347 .
## TIF -4.774e-02 6.858e-03 -6.961 3.39e-12 ***
## CAR_TYPEPanel_Truck 8.404e-01 1.359e-01 6.185 6.19e-10 ***
## CAR_TYPEPickup 7.250e-01 8.877e-02 8.167 3.17e-16 ***
## CAR_TYPESports_Car 7.555e-01 1.006e-01 7.512 5.82e-14 ***
## CAR_TYPEVan 8.006e-01 1.114e-01 7.188 6.59e-13 ***
## CAR_TYPEz_SUV 6.330e-01 8.070e-02 7.844 4.35e-15 ***
## OLDCLAIM -7.858e-06 3.724e-06 -2.110 0.034837 *
## MVR_PTS 1.143e-01 1.314e-02 8.699 < 2e-16 ***
## INCOME_TRANSFORMED -5.370e-03 8.929e-04 -6.014 1.80e-09 ***
## BLUEBOOK_MOD_TRANSFORMED -2.003e-02 6.540e-03 -3.063 0.002189 **
## OLD_CLAIM_TRANSFORMED 1.038e-01 8.744e-03 11.868 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418 on 8160 degrees of freedom
## Residual deviance: 8263 on 8134 degrees of freedom
## AIC: 8317
##
## Number of Fisher Scoring iterations: 4
We see from the below summary table that Model1 and Model2 have similar accuracy of around 77%. Model 2 has higher AUC and lower AIC which is preferable. We choose Model 2 as our best model among the 3.
| Model 1 | Model 2 | Model 3 | |
|---|---|---|---|
| Accuracy | 0.7729445 | 0.7758853 | 0.7565249 |
| Class. Error Rate | 0.2270555 | 0.2241147 | 0.2434751 |
| Sensitivity | 0.3107292 | 0.3311658 | 0.2503484 |
| Specificity | 0.9385819 | 0.9352530 | 0.9379161 |
| Precision | 0.6445087 | 0.6470054 | 0.5910088 |
| F1 | 0.4193043 | 0.4380952 | 0.3517129 |
| AUC | 0.7639174 | 0.7712313 | 0.7363641 |
| AIC | 8018.2934438 | 7943.3348149 | 8317.0128168 |
| Predictors | 55.0000000 | 59.0000000 | 27.0000000 |