Overview

In this homework assignment, you will explore, analyze and model a data set containing approximately 8000 records representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A zero means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero.

Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

Deliverables

 A write-up submitted in PDF format. Your write-up should have four sections. Each one is described below. You may assume you are addressing me as a fellow data scientist, so do not need to shy away from technical details.

 Assigned predictions (probabilities, classifications, cost) for the evaluation data set. Use 0.5 threshold.

 Include your R statistical programming code in an Appendix.

Data exploration

Training and testing datasets have been uploaded to my github account and loaded here using the read.csv function. Our training dataset has 8161 observations & 26 variables whereas testing dataset has 2141 observations and 26 variables. Each observation represent a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A zero means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero.

## [1] 8161   26
## [1] 2141   26
##      INDEX        TARGET_FLAG       TARGET_AMT        KIDSDRIV     
##  Min.   :    1   Min.   :0.0000   Min.   :     0   Min.   :0.0000  
##  1st Qu.: 2559   1st Qu.:0.0000   1st Qu.:     0   1st Qu.:0.0000  
##  Median : 5133   Median :0.0000   Median :     0   Median :0.0000  
##  Mean   : 5152   Mean   :0.2638   Mean   :  1504   Mean   :0.1711  
##  3rd Qu.: 7745   3rd Qu.:1.0000   3rd Qu.:  1036   3rd Qu.:0.0000  
##  Max.   :10302   Max.   :1.0000   Max.   :107586   Max.   :4.0000  
##                                                                    
##       AGE           HOMEKIDS           YOJ          INCOME         
##  Min.   :16.00   Min.   :0.0000   Min.   : 0.0   Length:8161       
##  1st Qu.:39.00   1st Qu.:0.0000   1st Qu.: 9.0   Class :character  
##  Median :45.00   Median :0.0000   Median :11.0   Mode  :character  
##  Mean   :44.79   Mean   :0.7212   Mean   :10.5                     
##  3rd Qu.:51.00   3rd Qu.:1.0000   3rd Qu.:13.0                     
##  Max.   :81.00   Max.   :5.0000   Max.   :23.0                     
##  NA's   :6                        NA's   :454                      
##    PARENT1            HOME_VAL           MSTATUS              SEX           
##  Length:8161        Length:8161        Length:8161        Length:8161       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   EDUCATION             JOB               TRAVTIME        CAR_USE         
##  Length:8161        Length:8161        Min.   :  5.00   Length:8161       
##  Class :character   Class :character   1st Qu.: 22.00   Class :character  
##  Mode  :character   Mode  :character   Median : 33.00   Mode  :character  
##                                        Mean   : 33.49                     
##                                        3rd Qu.: 44.00                     
##                                        Max.   :142.00                     
##                                                                           
##    BLUEBOOK              TIF           CAR_TYPE           RED_CAR         
##  Length:8161        Min.   : 1.000   Length:8161        Length:8161       
##  Class :character   1st Qu.: 1.000   Class :character   Class :character  
##  Mode  :character   Median : 4.000   Mode  :character   Mode  :character  
##                     Mean   : 5.351                                        
##                     3rd Qu.: 7.000                                        
##                     Max.   :25.000                                        
##                                                                           
##    OLDCLAIM            CLM_FREQ        REVOKED             MVR_PTS      
##  Length:8161        Min.   :0.0000   Length:8161        Min.   : 0.000  
##  Class :character   1st Qu.:0.0000   Class :character   1st Qu.: 0.000  
##  Mode  :character   Median :0.0000   Mode  :character   Median : 1.000  
##                     Mean   :0.7986                      Mean   : 1.696  
##                     3rd Qu.:2.0000                      3rd Qu.: 3.000  
##                     Max.   :5.0000                      Max.   :13.000  
##                                                                         
##     CAR_AGE        URBANICITY       
##  Min.   :-3.000   Length:8161       
##  1st Qu.: 1.000   Class :character  
##  Median : 8.000   Mode  :character  
##  Mean   : 8.328                     
##  3rd Qu.:12.000                     
##  Max.   :28.000                     
##  NA's   :510

Data Visualization

We see 0.9% of missing data in the dataset further in the table below we see AGE has 6 missing data, 454 missing data in YOJ, 464 in HOME_VAL 510 in CAR_AGE

x
TARGET_FLAG 0
TARGET_AMT 0
KIDSDRIV 0
AGE 6
HOMEKIDS 0
YOJ 454
INCOME 445
PARENT1 0
HOME_VAL 464
MSTATUS 0
SEX 0
EDUCATION 0
JOB 0
TRAVTIME 0
CAR_USE 0
BLUEBOOK 0
TIF 0
CAR_TYPE 0
RED_CAR 0
OLDCLAIM 0
CLM_FREQ 0
REVOKED 0
MVR_PTS 0
CAR_AGE 510
URBANICITY 0

From the histograms below we see the prevalence of kurtosis, specifically right skew in variables - BLUEBOOK,INCOME MVR_PTS,OLDCLAIM,``TARGET_AMT, TRAVTIME and we see almost normal distributions in AGE, CAR_AGE and YOJ. We would need transformations of the variables and we will be doing this in the next section of data preparation.

From the scatter plots below which is plotted by each variable wise against the target variable TARGET_AMT, we see notable trends in the scatterplots below such as our response variable TARGET_AMT is likely to be lower when individuals have more kids at home as indicated by the HOMEKIDS feature, and when they have more teenagers driving the car indicated by the feature KIDSDRIV.

There aren’t any significant amount of correlated features as we see from the correlation plot.

Data preparation

As we saw in our earlier analysis we saw missing data in age, yoj, home_val, car_age and income. We replace the missing values with their respective means.

We see from the below that there aren’t any missing values anymore.

We create a clean df df_clean by removing the INDEX column. We also replace the yes/ no with 1/ 0.

As we saw in our earlier analysis certain variables requires transformations due to the presence of skewing.

## Fitted parameters:
##       lambda         beta      sigmasq 
##    0.4436021  291.9569344 8541.2268213 
## 
## Convergence code returned by optim: 0
## Fitted parameters:
##     lambda       beta    sigmasq 
##  0.1016119 24.1545983  2.1701992 
## 
## Convergence code returned by optim: 0
## Fitted parameters:
##       lambda         beta      sigmasq 
##    0.4610754  177.4257712 2217.4825612 
## 
## Convergence code returned by optim: 0
## Fitted parameters:
##      lambda        beta     sigmasq 
## -0.04511237  7.22517933  0.44041250 
## 
## Convergence code returned by optim: 0

INCOME_TRANSFORMED, HOME_VAL_TRANSFORMED, BLUEBOOK_MOD_TRANSFORMED, OLD_CLAIM_TRANSFORMED are the tranformed variables.

Building Models

Model 1

We build our first model Model1 by including all the variables and run the model. Many variables have significant p-value and JOBClerical has the largest p-value.

## 
## Call:
## glm(formula = TARGET_FLAG ~ . - INCOME_TRANSFORMED - HOME_VAL_TRANSFORMED - 
##     BLUEBOOK_MOD_TRANSFORMED - OLD_CLAIM_TRANSFORMED, family = binomial(link = "logit"), 
##     data = df_clean[, -c(2)])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2562  -0.7497  -0.5050   0.7314   2.6763  
## 
## Coefficients: (19 not defined because of singularities)
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -1.308e+00  3.015e-01  -4.338 1.44e-05 ***
## KIDSDRIV                3.016e-01  5.746e-02   5.249 1.53e-07 ***
## AGE                    -3.570e-03  3.843e-03  -0.929 0.352925    
## HOMEKIDS                3.792e-02  3.527e-02   1.075 0.282307    
## YOJ                    -6.714e-03  8.111e-03  -0.828 0.407792    
## INCOME                 -3.076e-06  1.051e-06  -2.927 0.003423 ** 
## PARENT1                 3.208e-01  1.039e-01   3.087 0.002022 ** 
## HOME_VAL               -1.312e-06  3.318e-07  -3.952 7.75e-05 ***
## MSTATUS                -3.933e-01  7.994e-02  -4.920 8.64e-07 ***
## SEX                     9.232e-02  9.697e-02   0.952 0.341098    
## EDUCATIONBachelors     -2.923e-01  1.091e-01  -2.679 0.007385 ** 
## EDUCATIONMasters       -1.709e-01  1.717e-01  -0.995 0.319626    
## EDUCATIONPhD           -8.169e-02  2.062e-01  -0.396 0.691982    
## EDUCATIONz_High_School  2.895e-02  8.851e-02   0.327 0.743612    
## JOBClerical            -2.363e-03  1.914e-01  -0.012 0.990150    
## JOBDoctor              -4.431e-01  2.671e-01  -1.659 0.097109 .  
## JOBHome_Maker          -1.280e-01  2.037e-01  -0.629 0.529619    
## JOBLawyer               3.370e-02  1.672e-01   0.202 0.840275    
## JOBManager             -5.048e-01  1.703e-01  -2.963 0.003044 ** 
## JOBProfessional         2.300e-02  1.757e-01   0.131 0.895878    
## JOBStudent             -1.899e-01  2.080e-01  -0.913 0.361214    
## JOBz_Blue_Collar        1.168e-01  1.816e-01   0.643 0.520368    
## TRAVTIME                6.586e-03  1.733e-03   3.801 0.000144 ***
## CAR_USE                 6.247e-01  8.685e-02   7.192 6.37e-13 ***
## BLUEBOOK               -1.755e-05  5.055e-06  -3.473 0.000515 ***
## TIF                    -5.014e-02  7.046e-03  -7.116 1.11e-12 ***
## CAR_TYPEPanel_Truck     5.257e-01  1.557e-01   3.376 0.000736 ***
## CAR_TYPEPickup          5.150e-01  9.663e-02   5.330 9.84e-08 ***
## CAR_TYPESports_Car      8.987e-01  1.249e-01   7.193 6.36e-13 ***
## CAR_TYPEVan             5.612e-01  1.220e-01   4.598 4.26e-06 ***
## CAR_TYPEz_SUV           6.972e-01  1.073e-01   6.499 8.10e-11 ***
## RED_CAR                        NA         NA      NA       NA    
## OLDCLAIM               -1.327e-05  3.789e-06  -3.502 0.000461 ***
## CLM_FREQ                3.127e-01  2.730e-02  11.456  < 2e-16 ***
## REVOKED                 9.795e-01  8.742e-02  11.204  < 2e-16 ***
## MVR_PTS                 1.354e-01  1.308e-02  10.352  < 2e-16 ***
## CAR_AGE                -1.086e-03  7.232e-03  -0.150 0.880688    
## URBANICITY                     NA         NA      NA       NA    
## HSDropout                      NA         NA      NA       NA    
## Bachelors                      NA         NA      NA       NA    
## Masters                        NA         NA      NA       NA    
## PhD                            NA         NA      NA       NA    
## Panel_Truck                    NA         NA      NA       NA    
## Pickup                         NA         NA      NA       NA    
## Sports_Car                     NA         NA      NA       NA    
## Van                            NA         NA      NA       NA    
## SUV                            NA         NA      NA       NA    
## Professional                   NA         NA      NA       NA    
## Blue_Collar                    NA         NA      NA       NA    
## Clerical                       NA         NA      NA       NA    
## Doctor                         NA         NA      NA       NA    
## Lawyer                         NA         NA      NA       NA    
## Manager                        NA         NA      NA       NA    
## Home_Maker                     NA         NA      NA       NA    
## Student                        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 7946.3  on 8125  degrees of freedom
## AIC: 8018.3
## 
## Number of Fisher Scoring iterations: 5

Model 2

In our model 2 we include the original variables as well as the transformed variables. We see that even though some of the p-values are reduced.

## 
## Call:
## glm(formula = TARGET_FLAG ~ ., family = binomial(link = "logit"), 
##     data = df_clean[, -c(2)])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3309  -0.7444  -0.4913   0.7536   2.7137  
## 
## Coefficients: (19 not defined because of singularities)
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -2.061e-01  4.148e-01  -0.497  0.61924    
## KIDSDRIV                  3.124e-01  5.794e-02   5.391 7.00e-08 ***
## AGE                      -4.433e-03  3.873e-03  -1.145  0.25236    
## HOMEKIDS                  2.162e-02  3.579e-02   0.604  0.54587    
## YOJ                       7.510e-03  9.406e-03   0.798  0.42463    
## INCOME                    2.271e-06  2.356e-06   0.964  0.33513    
## PARENT1                   3.108e-01  1.047e-01   2.968  0.00300 ** 
## HOME_VAL                 -5.229e-07  6.301e-07  -0.830  0.40657    
## MSTATUS                  -3.816e-01  8.312e-02  -4.591 4.42e-06 ***
## SEX                       1.248e-01  9.772e-02   1.277  0.20159    
## EDUCATIONBachelors       -2.448e-01  1.110e-01  -2.206  0.02735 *  
## EDUCATIONMasters         -1.121e-01  1.733e-01  -0.647  0.51771    
## EDUCATIONPhD             -1.183e-01  2.083e-01  -0.568  0.56994    
## EDUCATIONz_High_School    6.719e-02  8.987e-02   0.748  0.45467    
## JOBClerical               2.629e-02  1.921e-01   0.137  0.89115    
## JOBDoctor                -3.884e-01  2.656e-01  -1.462  0.14362    
## JOBHome_Maker            -3.070e-01  2.158e-01  -1.422  0.15495    
## JOBLawyer                 4.006e-02  1.677e-01   0.239  0.81121    
## JOBManager               -4.830e-01  1.702e-01  -2.838  0.00454 ** 
## JOBProfessional           6.366e-02  1.760e-01   0.362  0.71763    
## JOBStudent               -4.582e-01  2.242e-01  -2.044  0.04094 *  
## JOBz_Blue_Collar          1.861e-01  1.824e-01   1.020  0.30757    
## TRAVTIME                  7.001e-03  1.747e-03   4.008 6.12e-05 ***
## CAR_USE                   6.170e-01  8.748e-02   7.053 1.75e-12 ***
## BLUEBOOK                  3.753e-05  1.900e-05   1.975  0.04825 *  
## TIF                      -4.941e-02  7.078e-03  -6.981 2.93e-12 ***
## CAR_TYPEPanel_Truck       3.919e-01  1.601e-01   2.448  0.01436 *  
## CAR_TYPEPickup            5.215e-01  9.710e-02   5.371 7.84e-08 ***
## CAR_TYPESports_Car        8.933e-01  1.257e-01   7.109 1.17e-12 ***
## CAR_TYPEVan               5.659e-01  1.224e-01   4.625 3.74e-06 ***
## CAR_TYPEz_SUV             7.178e-01  1.085e-01   6.617 3.66e-11 ***
## RED_CAR                          NA         NA      NA       NA    
## OLDCLAIM                 -3.373e-05  4.655e-06  -7.245 4.32e-13 ***
## CLM_FREQ                  5.348e-02  4.274e-02   1.251  0.21077    
## REVOKED                   1.087e+00  8.931e-02  12.176  < 2e-16 ***
## MVR_PTS                   1.064e-01  1.356e-02   7.847 4.25e-15 ***
## CAR_AGE                  -7.413e-04  7.264e-03  -0.102  0.91872    
## URBANICITY                       NA         NA      NA       NA    
## HSDropout                        NA         NA      NA       NA    
## Bachelors                        NA         NA      NA       NA    
## Masters                          NA         NA      NA       NA    
## PhD                              NA         NA      NA       NA    
## Panel_Truck                      NA         NA      NA       NA    
## Pickup                           NA         NA      NA       NA    
## Sports_Car                       NA         NA      NA       NA    
## Van                              NA         NA      NA       NA    
## SUV                              NA         NA      NA       NA    
## Professional                     NA         NA      NA       NA    
## Blue_Collar                      NA         NA      NA       NA    
## Clerical                         NA         NA      NA       NA    
## Doctor                           NA         NA      NA       NA    
## Lawyer                           NA         NA      NA       NA    
## Manager                          NA         NA      NA       NA    
## Home_Maker                       NA         NA      NA       NA    
## Student                          NA         NA      NA       NA    
## INCOME_TRANSFORMED       -6.794e-03  2.218e-03  -3.063  0.00219 ** 
## HOME_VAL_TRANSFORMED     -6.199e-02  4.413e-02  -1.405  0.16004    
## BLUEBOOK_MOD_TRANSFORMED -2.052e-02  6.847e-03  -2.996  0.00273 ** 
## OLD_CLAIM_TRANSFORMED     1.164e-01  1.459e-02   7.978 1.49e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 7863.3  on 8121  degrees of freedom
## AIC: 7943.3
## 
## Number of Fisher Scoring iterations: 5

Model 3

In this model we tried including the variables with significant p-values.

## 
## Call:
## glm(formula = TARGET_FLAG ~ . - AGE - HOMEKIDS - YOJ - INCOME - 
##     PARENT1 - HOME_VAL - MSTATUS - SEX - RED_CAR - CLM_FREQ - 
##     CAR_AGE - HSDropout - Professional - Blue_Collar - Clerical - 
##     Lawyer - Home_Maker - HOME_VAL_TRANSFORMED - Student - Doctor - 
##     CAR_USE - REVOKED - URBANICITY - Bachelors - Masters - PhD - 
##     Panel_Truck - Pickup - Sports_Car - Van - SUV - Manager, 
##     family = binomial(link = "logit"), data = df_clean[, -c(2)])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9313  -0.7729  -0.5421   0.8900   2.4980  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -4.193e-01  3.601e-01  -1.164 0.244243    
## KIDSDRIV                  3.607e-01  4.850e-02   7.438 1.02e-13 ***
## EDUCATIONBachelors       -1.028e-02  9.662e-02  -0.106 0.915242    
## EDUCATIONMasters          5.759e-02  1.495e-01   0.385 0.700140    
## EDUCATIONPhD              1.153e-01  1.827e-01   0.631 0.527973    
## EDUCATIONz_High_School    2.454e-01  8.257e-02   2.972 0.002958 ** 
## JOBClerical              -1.242e-01  1.855e-01  -0.669 0.503205    
## JOBDoctor                -6.858e-01  2.553e-01  -2.686 0.007230 ** 
## JOBHome_Maker            -5.750e-01  2.034e-01  -2.827 0.004697 ** 
## JOBLawyer                -2.453e-01  1.577e-01  -1.556 0.119767    
## JOBManager               -7.001e-01  1.648e-01  -4.247 2.17e-05 ***
## JOBProfessional          -1.546e-01  1.697e-01  -0.911 0.362176    
## JOBStudent               -2.315e-01  2.096e-01  -1.104 0.269477    
## JOBz_Blue_Collar          3.197e-01  1.760e-01   1.816 0.069394 .  
## TRAVTIME                  6.131e-03  1.695e-03   3.617 0.000299 ***
## BLUEBOOK                  3.374e-05  1.782e-05   1.893 0.058347 .  
## TIF                      -4.774e-02  6.858e-03  -6.961 3.39e-12 ***
## CAR_TYPEPanel_Truck       8.404e-01  1.359e-01   6.185 6.19e-10 ***
## CAR_TYPEPickup            7.250e-01  8.877e-02   8.167 3.17e-16 ***
## CAR_TYPESports_Car        7.555e-01  1.006e-01   7.512 5.82e-14 ***
## CAR_TYPEVan               8.006e-01  1.114e-01   7.188 6.59e-13 ***
## CAR_TYPEz_SUV             6.330e-01  8.070e-02   7.844 4.35e-15 ***
## OLDCLAIM                 -7.858e-06  3.724e-06  -2.110 0.034837 *  
## MVR_PTS                   1.143e-01  1.314e-02   8.699  < 2e-16 ***
## INCOME_TRANSFORMED       -5.370e-03  8.929e-04  -6.014 1.80e-09 ***
## BLUEBOOK_MOD_TRANSFORMED -2.003e-02  6.540e-03  -3.063 0.002189 ** 
## OLD_CLAIM_TRANSFORMED     1.038e-01  8.744e-03  11.868  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418  on 8160  degrees of freedom
## Residual deviance: 8263  on 8134  degrees of freedom
## AIC: 8317
## 
## Number of Fisher Scoring iterations: 4

Choosing the model

We see from the below summary table that Model1 and Model2 have similar accuracy of around 77%. Model 2 has higher AUC and lower AIC which is preferable. We choose Model 2 as our best model among the 3.

Model 1 Model 2 Model 3
Accuracy 0.7729445 0.7758853 0.7565249
Class. Error Rate 0.2270555 0.2241147 0.2434751
Sensitivity 0.3107292 0.3311658 0.2503484
Specificity 0.9385819 0.9352530 0.9379161
Precision 0.6445087 0.6470054 0.5910088
F1 0.4193043 0.4380952 0.3517129
AUC 0.7639174 0.7712313 0.7363641
AIC 8018.2934438 7943.3348149 8317.0128168
Predictors 55.0000000 59.0000000 27.0000000

Link for the .rmd