Overview

In this homework assignment, you will explore, analyze, and model a dataset containing approximately 8,000 records, each representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is 1 or 0: a 1 means that the person was in a car crash, and a 0 means that the person was not. The second response variable, TARGET_AMT, is zero if the person did not crash their car and greater than zero if they did.
Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided).

1. DATA EXPLORATION

## [1] "Training dataset dimensions:   Number of rows: 8161, Number of cols: 26"
##   INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ   INCOME PARENT1
## 1     1           0          0        0  60        0  11  $67,349      No
## 2     2           0          0        0  43        0  11  $91,449      No
## 3     4           0          0        0  35        1  10  $16,039      No
## 4     5           0          0        0  51        0  14               No
## 5     6           0          0        0  50        0  NA $114,986      No
## 6     7           1       2946        0  34        1  12 $125,301     Yes
##   HOME_VAL MSTATUS SEX     EDUCATION           JOB TRAVTIME    CAR_USE BLUEBOOK
## 1       $0    z_No   M           PhD  Professional       14    Private  $14,230
## 2 $257,252    z_No   M z_High School z_Blue Collar       22 Commercial  $14,940
## 3 $124,191     Yes z_F z_High School      Clerical        5    Private   $4,010
## 4 $306,251     Yes   M  <High School z_Blue Collar       32    Private  $15,440
## 5 $243,925     Yes z_F           PhD        Doctor       36    Private  $18,000
## 6       $0    z_No z_F     Bachelors z_Blue Collar       46 Commercial  $17,430
##   TIF   CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS CAR_AGE
## 1  11    Minivan     yes   $4,461        2      No       3      18
## 2   1    Minivan     yes       $0        0      No       0       1
## 3   4      z_SUV      no  $38,690        2      No       3      10
## 4   7    Minivan     yes       $0        0      No       0       6
## 5   1      z_SUV      no  $19,217        2     Yes       3      17
## 6   1 Sports Car      no       $0        0      No       0       7
##            URBANICITY
## 1 Highly Urban/ Urban
## 2 Highly Urban/ Urban
## 3 Highly Urban/ Urban
## 4 Highly Urban/ Urban
## 5 Highly Urban/ Urban
## 6 Highly Urban/ Urban
## 'data.frame':    8161 obs. of  26 variables:
##  $ INDEX      : int  1 2 4 5 6 7 8 11 12 13 ...
##  $ TARGET_FLAG: int  0 0 0 0 0 1 0 1 1 0 ...
##  $ TARGET_AMT : num  0 0 0 0 0 ...
##  $ KIDSDRIV   : int  0 0 0 0 0 0 0 1 0 0 ...
##  $ AGE        : int  60 43 35 51 50 34 54 37 34 50 ...
##  $ HOMEKIDS   : int  0 0 1 0 0 1 0 2 0 0 ...
##  $ YOJ        : int  11 11 10 14 NA 12 NA NA 10 7 ...
##  $ INCOME     : chr  "$67,349" "$91,449" "$16,039" "" ...
##  $ PARENT1    : chr  "No" "No" "No" "No" ...
##  $ HOME_VAL   : chr  "$0" "$257,252" "$124,191" "$306,251" ...
##  $ MSTATUS    : chr  "z_No" "z_No" "Yes" "Yes" ...
##  $ SEX        : chr  "M" "M" "z_F" "M" ...
##  $ EDUCATION  : chr  "PhD" "z_High School" "z_High School" "<High School" ...
##  $ JOB        : chr  "Professional" "z_Blue Collar" "Clerical" "z_Blue Collar" ...
##  $ TRAVTIME   : int  14 22 5 32 36 46 33 44 34 48 ...
##  $ CAR_USE    : chr  "Private" "Commercial" "Private" "Private" ...
##  $ BLUEBOOK   : chr  "$14,230" "$14,940" "$4,010" "$15,440" ...
##  $ TIF        : int  11 1 4 7 1 1 1 1 1 7 ...
##  $ CAR_TYPE   : chr  "Minivan" "Minivan" "z_SUV" "Minivan" ...
##  $ RED_CAR    : chr  "yes" "yes" "no" "yes" ...
##  $ OLDCLAIM   : chr  "$4,461" "$0" "$38,690" "$0" ...
##  $ CLM_FREQ   : int  2 0 2 0 2 0 0 1 0 0 ...
##  $ REVOKED    : chr  "No" "No" "No" "No" ...
##  $ MVR_PTS    : int  3 0 3 0 3 0 0 10 0 1 ...
##  $ CAR_AGE    : int  18 1 10 6 17 7 1 7 1 17 ...
##  $ URBANICITY : chr  "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" "Highly Urban/ Urban" ...

The training dataset consists of 8161 observations of 26 variables.
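
The structure listing above shows INCOME, HOME_VAL, BLUEBOOK and OLDCLAIM stored as character strings such as "$67,349", with an empty string where a value is missing, and several categorical levels carrying a "z_" prefix. The following is a minimal sketch of how such fields could be cleaned before modeling. It is written in Python for illustration only (the report output itself comes from R), and the helper names `parse_currency` and `strip_level_prefix` are hypothetical, not functions used in the report.

```python
from typing import Optional


def parse_currency(text: str) -> Optional[float]:
    """Convert a string such as "$67,349" to 67349.0; blank strings become None."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    if cleaned == "":
        return None  # treat blank fields as missing, like NA in the R output
    return float(cleaned)


def strip_level_prefix(level: str) -> str:
    """Drop the "z_" prefix seen on levels such as "z_No" or "z_SUV"."""
    return level[2:] if level.startswith("z_") else level


# Values copied from the first six rows of the training data shown above.
income = ["$67,349", "$91,449", "$16,039", "", "$114,986", "$125,301"]
print([parse_currency(value) for value in income])
print([strip_level_prefix(level) for level in ["z_No", "Yes", "z_F", "M", "z_SUV"]])
```

Returning None for blanks keeps missing values distinguishable from a genuine "$0", which matters for columns such as HOME_VAL where zero is a legitimate value.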

Correlation with the response variables

Variable      TARGET_FLAG   TARGET_AMT
TARGET_FLAG     1.0000000    1.0000000
TARGET_AMT      0.8334240    0.8334240
MVR_PTS         0.2191323    0.1970216
CLM_FREQ        0.2161961    0.1741927
OLDCLAIM        0.1947302    0.1611626
PARENT1         0.1576222    0.1359305
REVOKED         0.1519391    0.1263285
MSTATUS         0.1351248    0.1214701
HOMEKIDS        0.1156210    0.1008356
KIDSDRIV        0.1036683    0.0877148
CAR_TYPE        0.1023650    0.0797487
JOB             0.0612262    0.0488313
TRAVTIME        0.0492559    0.0401971
EDUCATION       0.0428730    0.0397864
SEX             0.0210786    0.0088270
RED_CAR        -0.0069473    0.0005877
TIF            -0.0823431   -0.0683183
BLUEBOOK       -0.1092768   -0.0709830
CAR_USE        -0.1426737   -0.1287263
URBANICITY     -0.2242509   -0.1904945
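
As a rough illustration of how a ranking like the one above can be produced, the sketch below computes Pearson correlations between a response and a few predictors and sorts them in descending order. It assumes the report used Pearson correlation on numerically encoded variables, which the source does not state. The sample values are the first ten observations from the structure listing, so the results will not match the full-data table.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# First ten observations, copied from the str() output earlier in the report.
target_flag = [0, 0, 0, 0, 0, 1, 0, 1, 1, 0]
predictors = {
    "MVR_PTS": [3, 0, 3, 0, 3, 0, 0, 10, 0, 1],
    "CLM_FREQ": [2, 0, 2, 0, 2, 0, 0, 1, 0, 0],
    "TIF": [11, 1, 4, 7, 1, 1, 1, 1, 1, 7],
}

# Rank predictors by their correlation with TARGET_FLAG, highest first.
ranked = sorted(
    ((name, pearson(values, target_flag)) for name, values in predictors.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, r in ranked:
    print(f"{name:10s} {r: .7f}")
```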

2. DATA PREPARATION

## 
##  iter imp variable
##   1   1  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   1   2  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   1   3  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   1   4  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   1   5  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   2   1  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   2   2  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   2   3  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   2   4  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   2   5  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   3   1  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   3   2  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   3   3  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   3   4  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   3   5  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   4   1  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   4   2  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   4   3  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   4   4  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   4   5  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   5   1  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   5   2  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   5   3  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   5   4  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
##   5   5  AGE  YOJ  INCOME  HOME_VAL  CAR_AGE
## [1] "Missing value after imputation: 0"
Variable      VIF Score
TARGET_AMT     1.183605
KIDSDRIV       1.322411
AGE            1.412150
HOMEKIDS       2.068708
YOJ            1.225576
INCOME         2.738123
PARENT1        1.844249
HOME_VAL       2.454186
MSTATUS        1.924588
SEX            2.265022
EDUCATION      1.043666
JOB            1.153863
TRAVTIME       1.038907
CAR_USE        1.353494
BLUEBOOK       1.378285
TIF            1.009140
CAR_TYPE       1.410025
RED_CAR        1.809130
OLDCLAIM       2.201369
CLM_FREQ       2.131246
REVOKED        1.148729
MVR_PTS        1.249246
CAR_AGE        1.325166
URBANICITY     1.241628
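
A variance inflation factor is defined as VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared obtained by regressing variable j on all the other variables. The sketch below inverts that definition to show how much of each variable is explained by the others, using a few scores from the table above. The function name `implied_r_squared` is hypothetical and chosen for illustration.

```python
def implied_r_squared(vif: float) -> float:
    """Invert VIF = 1 / (1 - R^2) to recover the auxiliary-regression R^2."""
    if vif < 1.0:
        raise ValueError("a VIF cannot be smaller than 1")
    return 1.0 - 1.0 / vif


# Scores copied from the VIF table above (highest three and the lowest).
vif_scores = {
    "INCOME": 2.738123,
    "HOME_VAL": 2.454186,
    "SEX": 2.265022,
    "TIF": 1.009140,
}

for name, vif in vif_scores.items():
    print(f"{name:9s} VIF = {vif:.3f}  implied R^2 = {implied_r_squared(vif):.3f}")
```

Commonly quoted rules of thumb flag a VIF above 5 or 10 as a sign of problematic multicollinearity; every score in the table is below 3.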

3. BUILD MODELS

## 
## Call:
## lm(formula = TARGET_AMT ~ ., data = correlated_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -898.90 -286.25 -134.43   62.85 1927.07 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.018e+02  7.820e+01   5.138 2.86e-07 ***
## KIDSDRIV     4.205e+01  1.318e+01   3.189  0.00143 ** 
## AGE         -2.046e-01  8.070e-01  -0.254  0.79986    
## HOMEKIDS     1.359e+01  7.532e+00   1.804  0.07121 .  
## YOJ         -3.491e-01  1.590e+00  -0.220  0.82625    
## INCOME      -2.270e-02  4.803e-03  -4.727 2.33e-06 ***
## PARENT1      7.414e+01  2.336e+01   3.173  0.00151 ** 
## HOME_VAL    -1.082e-02  5.505e-03  -1.965  0.04946 *  
## MSTATUS      7.301e+01  1.685e+01   4.332 1.50e-05 ***
## SEX         -9.673e+00  1.756e+01  -0.551  0.58178    
## EDUCATION    6.679e+00  4.132e+00   1.616  0.10606    
## JOB         -6.667e-01  2.347e+00  -0.284  0.77637    
## TRAVTIME     2.185e+00  3.772e-01   5.793 7.25e-09 ***
## CAR_USE     -1.472e+02  1.392e+01 -10.571  < 2e-16 ***
## BLUEBOOK    -2.513e-02  9.504e-03  -2.644  0.00821 ** 
## TIF         -7.286e+00  1.416e+00  -5.147 2.73e-07 ***
## CAR_TYPE     1.877e+01  3.517e+00   5.335 9.87e-08 ***
## RED_CAR     -1.768e+01  1.732e+01  -1.021  0.30740    
## OLDCLAIM    -4.355e-03  1.039e-02  -0.419  0.67518    
## CLM_FREQ     2.315e+01  7.358e+00   3.147  0.00166 ** 
## REVOKED      1.281e+02  1.915e+01   6.693 2.37e-11 ***
## MVR_PTS      2.597e+01  3.034e+00   8.560  < 2e-16 ***
## CAR_AGE     -3.795e+00  1.182e+00  -3.210  0.00133 ** 
## URBANICITY  -2.685e+02  1.590e+01 -16.891  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 470.1 on 6424 degrees of freedom
##   (1713 observations deleted due to missingness)
## Multiple R-squared:  0.1564, Adjusted R-squared:  0.1534 
## F-statistic: 51.79 on 23 and 6424 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = TARGET_AMT ~ ., data = vif_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -911.15 -287.66 -135.27   65.26 1922.93 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.515e+02  6.999e+01   5.022 5.21e-07 ***
## KIDSDRIV     5.526e+01  1.173e+01   4.713 2.49e-06 ***
## AGE          5.953e-02  7.197e-01   0.083  0.93409    
## HOMEKIDS     1.391e+01  6.728e+00   2.067  0.03876 *  
## YOJ         -9.904e-01  1.413e+00  -0.701  0.48339    
## INCOME      -2.305e-02  4.244e-03  -5.432 5.73e-08 ***
## PARENT1      6.756e+01  2.094e+01   3.226  0.00126 ** 
## HOME_VAL    -1.116e-02  4.812e-03  -2.319  0.02044 *  
## MSTATUS      8.276e+01  1.476e+01   5.607 2.12e-08 ***
## SEX         -3.660e+00  1.576e+01  -0.232  0.81636    
## EDUCATION    7.924e+00  3.692e+00   2.146  0.03187 *  
## JOB         -4.376e-02  2.092e+00  -0.021  0.98331    
## TRAVTIME     2.078e+00  3.357e-01   6.189 6.34e-10 ***
## CAR_USE     -1.435e+02  1.248e+01 -11.504  < 2e-16 ***
## BLUEBOOK    -2.488e-02  8.508e-03  -2.924  0.00347 ** 
## TIF         -7.542e+00  1.263e+00  -5.970 2.47e-09 ***
## CAR_TYPE     1.669e+01  3.150e+00   5.297 1.21e-07 ***
## RED_CAR     -3.939e+00  1.546e+01  -0.255  0.79887    
## OLDCLAIM    -7.244e-03  9.189e-03  -0.788  0.43052    
## CLM_FREQ     2.147e+01  6.578e+00   3.264  0.00110 ** 
## REVOKED      1.307e+02  1.701e+01   7.684 1.72e-14 ***
## MVR_PTS      2.597e+01  2.705e+00   9.602  < 2e-16 ***
## CAR_AGE     -3.134e+00  1.055e+00  -2.971  0.00297 ** 
## URBANICITY  -2.714e+02  1.411e+01 -19.231  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 471.8 on 8137 degrees of freedom
## Multiple R-squared:  0.1551, Adjusted R-squared:  0.1527 
## F-statistic: 64.96 on 23 and 8137 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = TARGET_AMT ~ KIDSDRIV + HOMEKIDS + INCOME + PARENT1 + 
##     HOME_VAL + MSTATUS + EDUCATION + TRAVTIME + CAR_USE + BLUEBOOK + 
##     TIF + CAR_TYPE + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + 
##     URBANICITY, data = vif_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -916.6 -287.0 -135.0   65.4 1927.4 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.394e+02  4.979e+01   6.817 9.95e-12 ***
## KIDSDRIV     5.568e+01  1.155e+01   4.819 1.47e-06 ***
## HOMEKIDS     1.308e+01  6.167e+00   2.122  0.03388 *  
## INCOME      -2.375e-02  4.098e-03  -5.795 7.07e-09 ***
## PARENT1      6.771e+01  2.083e+01   3.251  0.00115 ** 
## HOME_VAL    -1.110e-02  4.788e-03  -2.317  0.02050 *  
## MSTATUS      8.381e+01  1.468e+01   5.708 1.18e-08 ***
## EDUCATION    7.980e+00  3.656e+00   2.183  0.02908 *  
## TRAVTIME     2.080e+00  3.355e-01   6.202 5.86e-10 ***
## CAR_USE     -1.439e+02  1.136e+01 -12.667  < 2e-16 ***
## BLUEBOOK    -2.499e-02  8.280e-03  -3.018  0.00255 ** 
## TIF         -7.551e+00  1.262e+00  -5.982 2.30e-09 ***
## CAR_TYPE     1.661e+01  2.741e+00   6.059 1.43e-09 ***
## CLM_FREQ     1.817e+01  5.059e+00   3.592  0.00033 ***
## REVOKED      1.263e+02  1.606e+01   7.859 4.35e-15 ***
## MVR_PTS      2.568e+01  2.674e+00   9.605  < 2e-16 ***
## CAR_AGE     -3.065e+00  1.031e+00  -2.972  0.00296 ** 
## URBANICITY  -2.707e+02  1.408e+01 -19.222  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 471.7 on 8143 degrees of freedom
## Multiple R-squared:  0.155,  Adjusted R-squared:  0.1532 
## F-statistic: 87.86 on 17 and 8143 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = TARGET_AMT ~ ., data = boxcoxed_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7391 -0.5452 -0.2238  0.5891  2.3647 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.1119671  0.1100133  10.108  < 2e-16 ***
## KIDSDRIV     0.4309404  0.0744645   5.787 7.42e-09 ***
## AGE         -0.0008369  0.0011832  -0.707 0.479380    
## HOMEKIDS     0.1060352  0.0603551   1.757 0.078980 .  
## YOJ          0.0004341  0.0006386   0.680 0.496655    
## INCOME      -0.0019144  0.0002550  -7.507 6.67e-14 ***
## PARENT1      0.1193992  0.0354643   3.367 0.000764 ***
## HOME_VAL    -0.0049140  0.0013070  -3.760 0.000171 ***
## MSTATUS      0.1354766  0.0248347   5.455 5.04e-08 ***
## SEX          0.0036217  0.0249212   0.145 0.884457    
## EDUCATION    0.0171283  0.0089235   1.919 0.054961 .  
## JOB          0.0009017  0.0033090   0.272 0.785258    
## TRAVTIME     0.0112483  0.0013997   8.036 1.06e-15 ***
## CAR_USE     -0.2703675  0.0197399 -13.696  < 2e-16 ***
## BLUEBOOK    -0.0013081  0.0002076  -6.301 3.11e-10 ***
## TIF         -0.0527999  0.0068477  -7.711 1.40e-14 ***
## CAR_TYPE     0.0523189  0.0077014   6.793 1.17e-11 ***
## RED_CAR     -0.0101470  0.0245368  -0.414 0.679220    
## OLDCLAIM    -0.0063592  0.0104988  -0.606 0.544729    
## CLM_FREQ     0.3706891  0.1402772   2.643 0.008244 ** 
## REVOKED      0.2553268  0.0260135   9.815  < 2e-16 ***
## MVR_PTS      0.1327628  0.0184169   7.209 6.15e-13 ***
## CAR_AGE     -0.0241521  0.0049178  -4.911 9.23e-07 ***
## URBANICITY  -0.5171850  0.0225073 -22.979  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7488 on 8137 degrees of freedom
## Multiple R-squared:  0.2153, Adjusted R-squared:  0.213 
## F-statistic: 97.05 on 23 and 8137 DF,  p-value: < 2.2e-16
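
The adjusted R-squared values reported in the four linear model summaries above can be checked from the multiple R-squared, the number of observations and the number of predictors, using adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - p - 1). The sketch below is a hand check in Python, not the code that produced the report. The observation counts are derived from the residual degrees of freedom (n = residual df + p + 1), and the model labels in the list are descriptive names added here.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_predictors: int) -> float:
    """Penalize R^2 for the number of predictors in the model."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# (label, multiple R^2, residual df, number of predictors) from the summaries above.
models = [
    ("correlated_data, all predictors", 0.1564, 6424, 23),
    ("vif_data, all predictors", 0.1551, 8137, 23),
    ("vif_data, reduced", 0.1550, 8143, 17),
    ("boxcoxed_data, all predictors", 0.2153, 8137, 23),
]

for label, r2, resid_df, p in models:
    n = resid_df + p + 1  # observations actually used in the fit
    print(f"{label:32s} n = {n}  adjusted R^2 = {adjusted_r_squared(r2, n, p):.4f}")
```

Because the inputs are already rounded to four decimals, the recomputed values agree with the summaries only up to rounding.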
## 
## Call:
## glm(formula = TARGET_FLAG ~ ., family = "binomial", data = binomial_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5020  -0.7290  -0.4165   0.6571   3.1081  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.488e-01  3.802e-01   1.180 0.237905    
## KIDSDRIV     3.743e-01  6.035e-02   6.202 5.57e-10 ***
## AGE         -2.848e-03  3.907e-03  -0.729 0.466048    
## HOMEKIDS     5.714e-02  3.653e-02   1.564 0.117764    
## YOJ         -7.003e-03  7.610e-03  -0.920 0.357406    
## INCOME      -1.358e-04  2.262e-05  -6.003 1.93e-09 ***
## PARENT1      3.659e-01  1.082e-01   3.380 0.000724 ***
## HOME_VAL    -9.205e-05  2.595e-05  -3.547 0.000389 ***
## MSTATUS      4.992e-01  8.097e-02   6.166 7.02e-10 ***
## SEX          1.491e-02  8.798e-02   0.169 0.865455    
## EDUCATION    3.428e-02  1.985e-02   1.727 0.084199 .  
## JOB         -7.651e-03  1.130e-02  -0.677 0.498387    
## TRAVTIME     1.529e-02  1.874e-03   8.161 3.32e-16 ***
## CAR_USE     -9.293e-01  6.833e-02 -13.600  < 2e-16 ***
## BLUEBOOK    -2.806e-04  4.703e-05  -5.967 2.42e-09 ***
## TIF         -5.438e-02  7.276e-03  -7.474 7.79e-14 ***
## CAR_TYPE     1.177e-01  1.788e-02   6.583 4.60e-11 ***
## RED_CAR     -2.848e-02  8.544e-02  -0.333 0.738853    
## OLDCLAIM    -4.625e-05  4.495e-05  -1.029 0.303521    
## CLM_FREQ     1.721e-01  3.203e-02   5.372 7.78e-08 ***
## REVOKED      7.673e-01  8.448e-02   9.083  < 2e-16 ***
## MVR_PTS      1.160e-01  1.358e-02   8.539  < 2e-16 ***
## CAR_AGE     -2.195e-02  5.850e-03  -3.753 0.000175 ***
## URBANICITY  -2.310e+00  1.126e-01 -20.515  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 7424.9  on 8137  degrees of freedom
## AIC: 7472.9
## 
## Number of Fisher Scoring iterations: 5
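
Logistic regression coefficients are on the log-odds scale, so exponentiating a coefficient gives the multiplicative change in the odds of a crash for a one-unit increase in that predictor. The sketch below applies this to a few estimates from the summary above. It assumes, based on each categorical variable having a single coefficient, that those variables were encoded numerically; the direction of each encoding is not stated in the source, so only the magnitudes are interpreted here.

```python
import math


def odds_ratio(coefficient: float) -> float:
    """Convert a log-odds coefficient into an odds ratio."""
    return math.exp(coefficient)


# Estimates copied from the full binomial model summary above.
estimates = {
    "KIDSDRIV": 0.3743,
    "REVOKED": 0.7673,
    "MVR_PTS": 0.1160,
    "CAR_USE": -0.9293,
    "URBANICITY": -2.310,
}

for name, beta in estimates.items():
    print(f"{name:11s} coefficient = {beta: .4f}  odds ratio = {odds_ratio(beta):.3f}")
```

An odds ratio above 1 means the odds of a crash rise as the encoded value increases, and a ratio below 1 means they fall.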
## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + INCOME + PARENT1 + 
##     HOME_VAL + MSTATUS + EDUCATION + TRAVTIME + CAR_USE + BLUEBOOK + 
##     TIF + CAR_TYPE + CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + 
##     URBANICITY, family = "binomial", data = binomial_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5092  -0.7278  -0.4182   0.6541   3.0800  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.003e-01  2.672e-01   0.750 0.453393    
## KIDSDRIV     3.681e-01  5.933e-02   6.205 5.48e-10 ***
## HOMEKIDS     6.343e-02  3.351e-02   1.893 0.058424 .  
## INCOME      -1.422e-04  2.175e-05  -6.534 6.39e-11 ***
## PARENT1      3.770e-01  1.075e-01   3.507 0.000454 ***
## HOME_VAL    -9.408e-05  2.585e-05  -3.640 0.000273 ***
## MSTATUS      5.073e-01  8.059e-02   6.295 3.07e-10 ***
## EDUCATION    3.626e-02  1.968e-02   1.843 0.065391 .  
## TRAVTIME     1.526e-02  1.872e-03   8.154 3.53e-16 ***
## CAR_USE     -9.124e-01  6.203e-02 -14.709  < 2e-16 ***
## BLUEBOOK    -2.782e-04  4.594e-05  -6.055 1.40e-09 ***
## TIF         -5.427e-02  7.268e-03  -7.467 8.20e-14 ***
## CAR_TYPE     1.217e-01  1.545e-02   7.878 3.32e-15 ***
## CLM_FREQ     1.513e-01  2.517e-02   6.012 1.84e-09 ***
## REVOKED      7.367e-01  7.929e-02   9.291  < 2e-16 ***
## MVR_PTS      1.146e-01  1.341e-02   8.547  < 2e-16 ***
## CAR_AGE     -2.101e-02  5.695e-03  -3.689 0.000225 ***
## URBANICITY  -2.300e+00  1.123e-01 -20.489  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 7428.3  on 8143  degrees of freedom
## AIC: 7464.3
## 
## Number of Fisher Scoring iterations: 5
## 
## Call:
## glm(formula = TARGET_FLAG ~ ., family = "binomial", data = in_bc_transformed1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3386  -0.7288  -0.4181   0.6763   3.1405  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.5280207  0.3790752   4.031 5.56e-05 ***
## KIDSDRIV     1.4378762  0.2440809   5.891 3.84e-09 ***
## AGE         -0.0022458  0.0040536  -0.554 0.579565    
## HOMEKIDS     0.4265052  0.2133621   1.999 0.045611 *  
## YOJ          0.0018798  0.0022036   0.853 0.393629    
## INCOME      -0.0065045  0.0008632  -7.535 4.87e-14 ***
## PARENT1      0.2612104  0.1181074   2.212 0.026992 *  
## HOME_VAL    -0.0158167  0.0043145  -3.666 0.000246 ***
## MSTATUS      0.5257550  0.0869478   6.047 1.48e-09 ***
## SEX         -0.0083849  0.0876730  -0.096 0.923808    
## EDUCATION    0.0454300  0.0303019   1.499 0.133810    
## JOB         -0.0047592  0.0112436  -0.423 0.672088    
## TRAVTIME     0.0419500  0.0049865   8.413  < 2e-16 ***
## CAR_USE     -0.9163469  0.0681251 -13.451  < 2e-16 ***
## BLUEBOOK    -0.0047117  0.0007119  -6.619 3.62e-11 ***
## TIF         -0.1815658  0.0238069  -7.627 2.41e-14 ***
## CAR_TYPE     0.1988017  0.0278696   7.133 9.80e-13 ***
## RED_CAR     -0.0317191  0.0854514  -0.371 0.710492    
## OLDCLAIM    -0.0213959  0.0316719  -0.676 0.499327    
## CLM_FREQ     1.1610295  0.4227084   2.747 0.006021 ** 
## REVOKED      0.7479378  0.0811108   9.221  < 2e-16 ***
## MVR_PTS      0.4120924  0.0621516   6.630 3.35e-11 ***
## CAR_AGE     -0.0821287  0.0169290  -4.851 1.23e-06 ***
## URBANICITY  -2.2941991  0.1132080 -20.265  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 7413.1  on 8137  degrees of freedom
## AIC: 7461.1
## 
## Number of Fisher Scoring iterations: 5
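
For a binomial model with a 0/1 response, the residual deviance equals -2 times the log-likelihood, so the reported AIC is the residual deviance plus twice the number of estimated coefficients (including the intercept). The sketch below checks the three AIC values above by hand. It is an illustration in Python, not the code used for the report, and the model labels are descriptive names added here.

```python
def aic_from_deviance(residual_deviance: float, n_coefficients: int) -> float:
    """AIC = residual deviance + 2k, valid for a binomial fit to 0/1 responses."""
    return residual_deviance + 2 * n_coefficients


# (label, residual deviance, number of coefficients including the intercept)
logistic_models = [
    ("binomial_data, all predictors", 7424.9, 24),
    ("binomial_data, reduced", 7428.3, 18),
    ("in_bc_transformed1, all predictors", 7413.1, 24),
]

for label, deviance, k in logistic_models:
    print(f"{label:36s} AIC = {aic_from_deviance(deviance, k):.1f}")
```

The reduced model has a slightly higher deviance than the full model on the same data, but its AIC is lower because it estimates six fewer coefficients.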

4. SELECT MODELS

## [1] "1782 not in a car crash and 359 in a car crash"
                                      F-Statistic
Model     MSE            R-Squared    value      numdf   dendf
Model 1   2.201851e+05   0.1564228    51.79085   23      6424
Model 2   2.219448e+05   0.1551232    64.95611   23      8137
Model 3   2.219779e+05   0.1549972    87.86204   17      8143
Model 4   5.590155e-01   0.2152664    97.04888   23      8137

Metric              Model 5     Model 6     Model 7
Accuracy            0.7851979   0.7851979   0.7853204
Class. Error Rate   0.2148021   0.2148021   0.2146796
Sensitivity         0.3943335   0.3929401   0.3915467
Specificity         0.9252663   0.9257656   0.9264314
Precision           0.6540832   0.6547988   0.6560311
F1                  0.4920313   0.4911466   0.4904014
AUC                 0.8048747   0.8046572   0.5782198
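
All of the classification metrics above except AUC follow from the four cells of a confusion matrix. The sketch below shows the formulas. The counts used (849 true positives, 449 false positives, 5559 true negatives, 1304 false negatives) are an assumption: they were back-calculated to be consistent with the Model 5 column and the 8161 training observations, and are not reported in the source.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Derive standard metrics from confusion-matrix counts (crash = positive)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # share of actual crashes that were caught
    specificity = tn / (tn + fp)  # share of non-crashes correctly identified
    precision = tp / (tp + fp)    # share of predicted crashes that were real
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "Accuracy": accuracy,
        "Class. Error Rate": 1.0 - accuracy,
        "Sensitivity": sensitivity,
        "Specificity": specificity,
        "Precision": precision,
        "F1": f1,
    }


# Hypothetical counts, back-calculated from the Model 5 column (assumption).
metrics = classification_metrics(tp=849, fp=449, tn=5559, fn=1304)
for name, value in metrics.items():
    print(f"{name:18s} {value:.7f}")
```

AUC cannot be recovered this way because it depends on the ranking of predicted probabilities across all thresholds rather than on one confusion matrix.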