Data 621 - HW 4

Nicholas Schettini

June 30, 2018

Cover Page

CUNY MSDS HW4- Binary Logistic Regression and Linear Regression

Nicholas Schettini

CUNY School of Professional Studies

Abstract

In this research assignment, we investigated data on customers of an auto insurance company. The data contain two response variables: TARGET_FLAG and TARGET_AMT. TARGET_FLAG is a binary variable indicating whether the car was involved in a crash. TARGET_AMT is the cost of the crash. The explanatory variables in this dataset include age, bluebook, car_age, car_type, clm_freq, car_use, sex, red_car, urbanicity and yoj. The data consist of 8161 observations and 26 variables. The research was organized into four stages: data exploration, data preparation, model building, and selection of the best model. The data were visualized using multiple methods, including histograms and boxplots. The data were prepared by imputing NA values with the mice package in R. Different models were created using different approaches (for example, backward elimination), and finally the best model was selected. The research shows that certain variables in the dataset were better predictors of an accident than others.

Keywords: R, auto insurance, prediction, modeling, binary logistic regression, linear regression

Overview

In this homework assignment, you will explore, analyze and model a data set containing approximately 8000 records, each representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash. A zero means that the person was not in a car crash. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car. But if they did crash their car, this number will be a value greater than zero.

Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

variable vars n mean sd median trimmed mad min max range skew kurtosis se na_count
TARGET_FLAG 1 8161 0.2638157 0.4407276 0 0.2047787 0.0000 0 1.0 1.0 1.0716614 -0.8516462 0.0048786 0
TARGET_AMT 2 8161 1504.3246481 4704.0269298 0 593.7121106 0.0000 0 107586.1 107586.1 8.7063034 112.2884386 52.0712628 0
KIDSDRIV 3 8161 0.1710575 0.5115341 0 0.0252719 0.0000 0 4.0 4.0 3.3518374 11.7801916 0.0056624 0
AGE 4 8155 44.7903127 8.6275895 45 44.8306513 8.8956 16 81.0 65.0 -0.0289889 -0.0617020 0.0955383 6
HOMEKIDS 5 8161 0.7212351 1.1163233 0 0.4971665 0.0000 0 5.0 5.0 1.3411271 0.6489915 0.0123571 0
YOJ 6 7707 10.4992864 4.0924742 11 11.0711853 2.9652 0 23.0 23.0 -1.2029676 1.1773410 0.0466169 454
INCOME 7 8161 2875.5505453 2090.6786785 2817 2816.9534385 2799.1488 1 6613.0 6612.0 0.1094699 -1.2853032 23.1427840 0
PARENT1* 8 8161 1.1319691 0.3384779 1 1.0399755 0.0000 1 2.0 1.0 2.1743561 2.7281589 0.0037468 0
HOME_VAL 9 8161 1684.8931503 1697.3791897 1245 1516.4994639 1842.8718 1 5107.0 5106.0 0.5162324 -1.1810965 18.7891522 0
MSTATUS* 10 8161 1.5996814 0.4899929 2 1.6245979 0.0000 1 2.0 1.0 -0.4068189 -1.8347231 0.0054240 0
SEX* 11 8161 1.4639137 0.4987266 1 1.4548936 0.0000 1 2.0 1.0 0.1446959 -1.9793056 0.0055207 0
EDUCATION* 12 8161 2.8120328 1.1786322 3 2.7785266 1.4826 1 5.0 4.0 0.1543452 -0.8453783 0.0130469 0
JOB* 13 8161 4.8337214 2.6238293 5 4.7636698 4.4478 1 9.0 8.0 0.1300643 -1.4594539 0.0290445 0
TRAVTIME 14 8161 33.4857248 15.9083334 33 32.9954051 16.3086 5 142.0 137.0 0.4468174 0.6643331 0.1760974 0
CAR_USE* 15 8161 1.6288445 0.4831436 2 1.6610507 0.0000 1 2.0 1.0 -0.5332937 -1.7158080 0.0053482 0
BLUEBOOK 16 8161 1283.6185516 893.5117428 1124 1259.5665492 1132.7064 1 2789.0 2788.0 0.2472837 -1.3624655 9.8907352 0
TIF 17 8161 5.3513050 4.1466353 4 4.8402512 4.4478 1 25.0 24.0 0.8908120 0.4224940 0.0459012 0
CAR_TYPE* 18 8161 3.3405220 1.7553381 3 3.3107673 2.9652 1 6.0 5.0 -0.0981926 -1.4298002 0.0194307 0
RED_CAR* 19 8161 1.2913859 0.4544287 1 1.2392403 0.0000 1 2.0 1.0 0.9180255 -1.1573709 0.0050303 0
OLDCLAIM 20 8161 552.2714128 862.2006829 1 380.3196508 0.0000 1 2857.0 2856.0 1.3085876 0.2461666 9.5441372 0
CLM_FREQ 21 8161 0.7985541 1.1584527 0 0.5886047 0.0000 0 5.0 5.0 1.2087985 0.2842890 0.0128235 0
REVOKED* 22 8161 1.1225340 0.3279216 1 1.0281820 0.0000 1 2.0 1.0 2.3018899 3.2991013 0.0036299 0
MVR_PTS 23 8161 1.6955030 2.1471117 1 1.3138306 1.4826 0 13.0 13.0 1.3478403 1.3754900 0.0237675 0
CAR_AGE 24 7651 8.3283231 5.7007424 8 7.9632413 7.4130 -3 28.0 31.0 0.2819531 -0.7489756 0.0651737 510
URBANICITY* 25 8161 1.7954907 0.4033673 2 1.8693521 0.0000 1 2.0 1.0 -1.4649406 0.1460688 0.0044651 0

The data contain two response variables: TARGET_FLAG and TARGET_AMT. TARGET_FLAG is a binary variable indicating whether the car was involved in a crash. TARGET_AMT is the cost of the crash. The explanatory variables in this dataset include age, bluebook, car_age, car_type, clm_freq, car_use, sex, red_car, urbanicity and yoj.

The data consist of 8161 observations and 26 variables. Several variables contain NA values (see the na_count column above), which will have to be handled during the data prep.

Visual Exploration

Boxplots

The boxplots below show all of the variables in the dataset. This visualization shows how the data are spread for each variable. The boxplots show that the spread differs widely from one variable to another.

INCOME and HOME_VAL seem to have the highest variance in the dataset.

Histograms

Histograms are useful to show how the data is distributed in the dataset.

variable TARGET_FLAG TARGET_AMT
TARGET_FLAG 1.0000000 0.5342461
TARGET_AMT 0.5342461 1.0000000
KIDSDRIV 0.1036683 0.0553942
AGE NA NA
HOMEKIDS 0.1156210 0.0619880
YOJ NA NA
INCOME -0.0338365 -0.0084193
HOME_VAL -0.1485715 -0.0768246
TRAVTIME 0.0483683 0.0279870
BLUEBOOK 0.0504453 0.0235955
TIF -0.0823700 -0.0464808
OLDCLAIM 0.1902875 0.0971478
CLM_FREQ 0.2161961 0.1164192
MVR_PTS 0.2191971 0.1378655
CAR_AGE NA NA

Looking at the above table, MVR_PTS (total points on the motor vehicle record) has the highest correlation with TARGET_FLAG (was in a car crash), which makes sense, since you'd expect someone with a lot of points on their record to be in more accidents. HOME_VAL has the strongest negative correlation with TARGET_FLAG, suggesting that those with a higher home value tend to be less likely to be in an accident.

As mentioned earlier, the data contain a substantial number of missing (NA) values; this is why AGE, YOJ and CAR_AGE show NA in the correlation table above.
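The NA entries for AGE, YOJ and CAR_AGE are what `cor()` returns by default when a column contains missing values. The following is a minimal sketch on made-up data (the `toy` data frame and its values are hypothetical, not taken from the insurance data) showing how the `use = "pairwise.complete.obs"` argument computes each correlation from the rows where both values are present:

```r
# Hypothetical toy data; values are illustrative only.
toy <- data.frame(
  TARGET_FLAG = c(0, 1, 0, 1, 1, 0, 0, 1),
  MVR_PTS     = c(0, 4, 1, 6, 3, 0, 2, 5),
  AGE         = c(45, NA, 52, 31, 28, NA, 60, 35)
)

# Default behaviour: any NA in a column propagates to its correlations.
cor_default <- cor(toy)

# Use every pair of rows in which both values are observed.
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

print(round(cor_default[, "TARGET_FLAG"], 3))
print(round(cor_pairwise[, "TARGET_FLAG"], 3))
```

Pairwise deletion is only one option; once the missing values have been imputed, the plain `cor()` call no longer returns NA.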

Data Preparation

Cleaning Data

The data are somewhat unstructured when first loaded into R. For example, some variables were not stored as a sensible type: INCOME was not numeric, so it had to be converted to numeric before it could be used in this analysis.

Some of the values also carried extra characters, such as a ‘z_’ prefix; for example, “z_F” for female. These characters had to be removed from the data. This was done with code such as the following:

SEX = as.factor(str_remove(SEX, "^z_")),

It also makes sense to convert certain categorical variables into dummy variables. For example, for the SEX category, male could be coded as 1 and female as 0 using simple ifelse statements.
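The cleaning steps described in this section can be sketched in base R as follows. The `raw` data frame is hypothetical, and the currency-style INCOME strings are an assumption about why the column was not numeric; `sub()` is used here as a base-R stand-in for `str_remove()`.

```r
# Hypothetical raw values; the "$" and "," formatting of INCOME is assumed.
raw <- data.frame(
  SEX    = c("M", "z_F", "z_F", "M"),
  INCOME = c("$67,349", "$91,449", "", "$16,039"),
  stringsAsFactors = FALSE
)

clean <- raw

# Base-R equivalent of stringr::str_remove(SEX, "^z_").
clean$SEX <- sub("^z_", "", clean$SEX)

# Strip currency symbols and thousands separators, then convert to numeric;
# empty strings become NA.
clean$INCOME <- as.numeric(gsub("[$,]", "", clean$INCOME))

# Dummy coding as described in the text: male = 1, female = 0.
clean$SEX <- ifelse(clean$SEX == "M", 1, 0)

print(clean)
```

Stripping the formatting before calling `as.numeric()` matters: converting a factor or a formatted string directly can silently produce level codes or NA values instead of the intended amounts.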

Imputation of Missing (NA) values

The data exploration revealed multiple variables with numerous NA values. There are several ways to handle NA data: deleting the observations, deleting the variables, imputation with the mean/median/mode, or imputation with a prediction.

Imputation with the mean/median/mode is an easy way to fill in the missing values; however, it reduces the variance in the dataset and shrinks standard errors, which can invalidate hypothesis tests.

In this case, data will be imputed via prediction using the mice (Multivariate Imputation by Chained Equations) library with a random forest prediction method.
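The variance-shrinking effect of mean imputation can be seen in a small simulation. This is a sketch on simulated data; the variable, the seed and the 20% missingness rate are all arbitrary choices for illustration.

```r
set.seed(621)  # arbitrary seed, for reproducibility only

# Hypothetical variable with 20% of its values set to missing at random.
x_full <- rnorm(1000, mean = 10, sd = 4)
x_obs  <- x_full
x_obs[sample(1000, 200)] <- NA

# Mean imputation: every NA is replaced by the observed mean.
x_mean_imp <- ifelse(is.na(x_obs), mean(x_obs, na.rm = TRUE), x_obs)

sd_observed <- sd(x_obs, na.rm = TRUE)
sd_imputed  <- sd(x_mean_imp)

# The imputed series has a smaller spread, and a standard error computed
# with n = 1000 overstates how much information the data really carry.
print(c(sd_observed = sd_observed, sd_imputed = sd_imputed))
```

The imputed values sit exactly at the mean, so they add nothing to the sum of squared deviations while increasing the sample size, which is why the standard deviation always drops.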
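A minimal sketch of random forest imputation with mice is shown here on simulated data. The `toy` data frame, the seed and `m = 5` (the mice default number of imputations) are assumptions for illustration; the settings actually used in this analysis are not shown in the report.

```r
# Hypothetical toy data standing in for the insurance data; values are simulated.
set.seed(621)
n <- 100
toy <- data.frame(
  AGE     = round(rnorm(n, 45, 9)),
  YOJ     = round(pmax(rnorm(n, 10, 4), 0)),
  CAR_AGE = round(pmax(rnorm(n, 8, 5), 0)),
  MVR_PTS = rpois(n, 2)
)
toy$YOJ[sample(n, 10)]     <- NA
toy$CAR_AGE[sample(n, 10)] <- NA

# Which random forest backend method = "rf" uses depends on the mice version,
# so the sketch only runs when mice and both candidate backends are installed.
needed <- c("mice", "ranger", "randomForest")
if (all(vapply(needed, requireNamespace, logical(1), quietly = TRUE))) {
  library(mice)
  imp <- mice(toy, m = 5, method = "rf", seed = 621, printFlag = FALSE)
  completed <- complete(imp, 1)  # first of the m completed data sets
  print(colSums(is.na(completed)))
}
```

Using a single completed data set, as here, is the simplest option; pooling results across all `m` imputations is the more rigorous use of multiple imputation.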

Output - The table below shows the results of the above data manipulation.

The NA values have been ‘filled in’ by the MICE prediction using the random forest method. Variables flagged as collinear by the vif/vifstep functions have been dropped.

variable vars n mean sd median trimmed mad min max range skew kurtosis se
TARGET_FLAG 1 8161 0.2638157 0.4407276 0 0.2047787 0.0000 0 1.0 1.0 1.0716614 -0.8516462 0.0048786
TARGET_AMT 2 8161 1504.3246481 4704.0269298 0 593.7121106 0.0000 0 107586.1 107586.1 8.7063034 112.2884386 52.0712628
KIDSDRIV 3 8161 0.1710575 0.5115341 0 0.0252719 0.0000 0 4.0 4.0 3.3518374 11.7801916 0.0056624
AGE 4 8161 44.7818895 8.6346725 45 44.8240159 8.8956 16 81.0 65.0 -0.0303175 -0.0636768 0.0955816
HOMEKIDS 5 8161 0.7212351 1.1163233 0 0.4971665 0.0000 0 5.0 5.0 1.3411271 0.6489915 0.0123571
YOJ 6 8161 10.5119471 4.0773629 11 11.0879155 2.9652 0 23.0 23.0 -1.2232265 1.2313536 0.0451344
INCOME 7 8161 2875.5505453 2090.6786785 2817 2816.9534385 2799.1488 1 6613.0 6612.0 0.1094699 -1.2853032 23.1427840
PARENT1 8 8161 0.1319691 0.3384779 0 0.0399755 0.0000 0 1.0 1.0 2.1743561 2.7281589 0.0037468
HOME_VAL 9 8161 1684.8931503 1697.3791897 1245 1516.4994639 1842.8718 1 5107.0 5106.0 0.5162324 -1.1810965 18.7891522
MSTATUS 10 8161 0.5996814 0.4899929 1 0.6245979 0.0000 0 1.0 1.0 -0.4068189 -1.8347231 0.0054240
SEX 11 8161 0.4639137 0.4987266 0 0.4548936 0.0000 0 1.0 1.0 0.1446959 -1.9793056 0.0055207
EDUCATION* 12 8161 2.8120328 1.1786322 3 2.7785266 1.4826 1 5.0 4.0 0.1543452 -0.8453783 0.0130469
JOB* 13 8161 4.8337214 2.6238293 5 4.7636698 4.4478 1 9.0 8.0 0.1300643 -1.4594539 0.0290445
TRAVTIME 14 8161 33.4857248 15.9083334 33 32.9954051 16.3086 5 142.0 137.0 0.4468174 0.6643331 0.1760974
CAR_USE 15 8161 0.6288445 0.4831436 1 0.6610507 0.0000 0 1.0 1.0 -0.5332937 -1.7158080 0.0053482
BLUEBOOK 16 8161 1283.6185516 893.5117428 1124 1259.5665492 1132.7064 1 2789.0 2788.0 0.2472837 -1.3624655 9.8907352
TIF 17 8161 5.3513050 4.1466353 4 4.8402512 4.4478 1 25.0 24.0 0.8908120 0.4224940 0.0459012
CAR_TYPE* 18 8161 3.3405220 1.7553381 3 3.3107673 2.9652 1 6.0 5.0 -0.0981926 -1.4298002 0.0194307
RED_CAR 19 8161 0.2913859 0.4544287 0 0.2392403 0.0000 0 1.0 1.0 0.9180255 -1.1573709 0.0050303
OLDCLAIM 20 8161 552.2714128 862.2006829 1 380.3196508 0.0000 1 2857.0 2856.0 1.3085876 0.2461666 9.5441372
CLM_FREQ 21 8161 0.7985541 1.1584527 0 0.5886047 0.0000 0 5.0 5.0 1.2087985 0.2842890 0.0128235
REVOKED 22 8161 0.1225340 0.3279216 0 0.0281820 0.0000 0 1.0 1.0 2.3018899 3.2991013 0.0036299
MVR_PTS 23 8161 1.6955030 2.1471117 1 1.3138306 1.4826 0 13.0 13.0 1.3478403 1.3754900 0.0237675
CAR_AGE 24 8161 8.2220316 5.7255627 8 7.8367284 7.4130 -3 28.0 31.0 0.2950526 -0.7681362 0.0633792
URBANICITY* 25 8161 1.7954907 0.4033673 2 1.8693521 0.0000 1 2.0 1.0 -1.4649406 0.1460688 0.0044651
NOHOMEKIDS 26 8161 0.6480823 0.4775977 1 0.6850973 0.0000 0 1.0 1.0 -0.6200373 -1.6157517 0.0052868
NOKIDSDRIV 27 8161 0.8797941 0.3252220 1 0.9747281 0.0000 0 1.0 1.0 -2.3353129 3.4541097 0.0036000
HASCOLLEGE 28 8161 0.5670874 0.4955092 1 0.5838566 0.0000 0 1.0 1.0 -0.2707483 -1.9269314 0.0054850
ISPROFESSIONAL 29 8161 0.3903933 0.4878684 0 0.3629959 0.0000 0 1.0 1.0 0.4492738 -1.7983734 0.0054005
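The collinearity check mentioned before the table rests on the variance inflation factor, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The following base-R sketch computes it by hand on simulated data; it is not the package routine used in this analysis, and the variable names are hypothetical.

```r
set.seed(621)  # arbitrary seed
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1, so strongly collinear
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j
# on all of the remaining predictors.
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

print(round(vif_manual, 2))
```

A common rule of thumb treats a VIF above about 10 as a sign of problematic collinearity; in this sketch x1 and x3 exceed that level while x2 stays near 1.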

Build Models

Throughout this section, various models will be created to determine which gives the best “fit” for predicting whether someone will be in an accident. Different methods of model creation will be used, as discussed below.

Linear Regression

Model 1: All Variables

This model uses all variables remaining after the data prep. After the data have been manipulated (imputed, etc., as stated above), all of the variables are included to establish a base model. This allows us to see which variables are significant in our dataset and to build other models based on that.

## 
## Call:
## lm(formula = TARGET_AMT ~ ., data = imputed1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5978   -417    -44    206 100891 
## 
## Coefficients: (2 not defined because of singularities)
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    1.558e+02  5.915e+02   0.263  0.79225    
## TARGET_FLAG                    5.686e+03  1.134e+02  50.130  < 2e-16 ***
## KIDSDRIV                      -3.065e+02  2.099e+02  -1.460  0.14429    
## AGE                            9.539e+00  6.306e+00   1.513  0.13036    
## HOMEKIDS                       3.658e+01  8.635e+01   0.424  0.67188    
## YOJ                            8.731e+00  1.321e+01   0.661  0.50878    
## INCOME                         2.138e-03  2.371e-02   0.090  0.92814    
## PARENT1                        1.361e+02  1.898e+02   0.717  0.47342    
## HOME_VAL                      -3.265e-03  3.072e-02  -0.106  0.91537    
## MSTATUS                       -1.281e+02  1.209e+02  -1.059  0.28949    
## SEX                            7.066e+01  1.500e+02   0.471  0.63758    
## EDUCATIONBachelors             3.377e+01  1.781e+02   0.190  0.84968    
## EDUCATIONHigh School          -1.401e+02  1.504e+02  -0.931  0.35173    
## EDUCATIONMasters               1.654e+02  2.600e+02   0.636  0.52479    
## EDUCATIONPhD                   3.546e+02  2.993e+02   1.185  0.23613    
## JOBBlue Collar                 7.733e+01  2.808e+02   0.275  0.78303    
## JOBClerical                    2.553e+01  2.955e+02   0.086  0.93116    
## JOBDoctor                     -2.422e+02  3.572e+02  -0.678  0.49784    
## JOBHome Maker                 -2.765e+01  3.075e+02  -0.090  0.92836    
## JOBLawyer                      1.037e+02  2.580e+02   0.402  0.68767    
## JOBManager                    -9.434e+01  2.519e+02  -0.375  0.70802    
## JOBProfessional                2.074e+02  2.694e+02   0.770  0.44138    
## JOBStudent                    -1.202e+02  3.206e+02  -0.375  0.70765    
## TRAVTIME                       5.192e-01  2.826e+00   0.184  0.85422    
## CAR_USE                       -1.169e+02  1.444e+02  -0.809  0.41831    
## BLUEBOOK                      -1.533e-02  5.217e-02  -0.294  0.76882    
## TIF                           -3.149e+00  1.068e+01  -0.295  0.76824    
## CAR_TYPEPanel Truck            3.567e+02  2.221e+02   1.606  0.10828    
## CAR_TYPEPickup                -8.538e+01  1.516e+02  -0.563  0.57332    
## CAR_TYPESports Car            -2.313e+01  1.809e+02  -0.128  0.89828    
## CAR_TYPESUV                   -5.793e+01  1.455e+02  -0.398  0.69045    
## CAR_TYPEVan                    2.621e+02  1.815e+02   1.444  0.14891    
## RED_CAR                       -4.005e+01  1.303e+02  -0.307  0.75853    
## OLDCLAIM                      -9.852e-02  7.314e-02  -1.347  0.17802    
## CLM_FREQ                       9.030e+00  5.513e+01   0.164  0.86989    
## REVOKED                       -3.099e+02  1.369e+02  -2.264  0.02359 *  
## MVR_PTS                        6.043e+01  2.298e+01   2.630  0.00856 ** 
## CAR_AGE                       -2.110e+01  1.088e+01  -1.939  0.05253 .  
## URBANICITYHighly Urban/ Urban -1.502e+01  1.267e+02  -0.119  0.90567    
## NOHOMEKIDS                    -4.506e+00  2.233e+02  -0.020  0.98390    
## NOKIDSDRIV                    -4.855e+02  3.346e+02  -1.451  0.14685    
## HASCOLLEGE                            NA         NA      NA       NA    
## ISPROFESSIONAL                        NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3973 on 8120 degrees of freedom
## Multiple R-squared:  0.2901, Adjusted R-squared:  0.2866 
## F-statistic: 82.96 on 40 and 8120 DF,  p-value: < 2.2e-16

This model shows an adjusted R2 of 0.2866 and an F-statistic of 82.96 with a small p-value.
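The two NA rows in the output above (HASCOLLEGE and ISPROFESSIONAL, reported as “not defined because of singularities”) are the pattern `lm()` produces when a predictor is an exact linear combination of others, which is likely the case for flags derived from EDUCATION and JOB. A minimal sketch on made-up data, where the level names and the definition of the derived flag are assumptions:

```r
# Hypothetical toy data; level names and values are made up for illustration.
toy <- data.frame(
  y = c(3.1, 4.0, 2.2, 5.3, 4.8, 2.9, 3.7, 5.9, 4.4, 2.5, 3.3, 5.1),
  EDUCATION = factor(rep(c("<High School", "High School", "Bachelors", "Masters"),
                         times = 3))
)

# Derived flag: fully determined by EDUCATION, so it adds no new information.
toy$HASCOLLEGE <- as.numeric(toy$EDUCATION %in% c("Bachelors", "Masters"))

fit <- lm(y ~ EDUCATION + HASCOLLEGE, data = toy)

print(coef(fit))            # HASCOLLEGE is reported as NA
print(alias(fit)$Complete)  # which combination of dummies reproduces it
```

Dropping either the derived flag or the factor it was built from removes the singularity without changing the fitted values.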

Model 2 - Forward Selection

Variables will be added one by one to determine the best-fit model. After each variable is added, the model will be run again until the best output (R2, F-statistic) is produced. Only the final output will be shown.

## 
## Call:
## lm(formula = imputed1$TARGET_AMT ~ +imputed1$AGE + imputed1$EDUCATION + 
##     imputed1$REVOKED + imputed1$MVR_PTS + imputed1$JOB + imputed1$YOJ + 
##     imputed1$CLM_FREQ + imputed1$HOME_VAL + imputed1$URBANICITY + 
##     imputed1$PARENT1 + imputed1$MSTATUS + imputed1$TRAVTIME + 
##     imputed1$BLUEBOOK, data = imputed1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5489  -1689   -766    210 104861 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                            -5.686e+02  5.051e+02  -1.126
## imputed1$AGE                            6.420e+00  6.479e+00   0.991
## imputed1$EDUCATIONBachelors            -2.497e+02  1.846e+02  -1.353
## imputed1$EDUCATIONHigh School           4.803e+01  1.645e+02   0.292
## imputed1$EDUCATIONMasters              -1.665e+02  2.641e+02  -0.630
## imputed1$EDUCATIONPhD                  -1.373e-01  3.145e+02   0.000
## imputed1$REVOKED                        5.172e+02  1.555e+02   3.325
## imputed1$MVR_PTS                        1.899e+02  2.590e+01   7.333
## imputed1$JOBBlue Collar                 4.927e+02  3.095e+02   1.592
## imputed1$JOBClerical                    1.434e+02  3.255e+02   0.440
## imputed1$JOBDoctor                     -1.241e+03  3.868e+02  -3.208
## imputed1$JOBHome Maker                  5.840e+01  3.260e+02   0.179
## imputed1$JOBLawyer                     -4.729e+02  2.663e+02  -1.776
## imputed1$JOBManager                    -9.751e+02  2.739e+02  -3.561
## imputed1$JOBProfessional               -4.053e+01  2.956e+02  -0.137
## imputed1$JOBStudent                     2.414e+02  3.545e+02   0.681
## imputed1$YOJ                           -7.945e+00  1.459e+01  -0.545
## imputed1$CLM_FREQ                       1.433e+02  4.891e+01   2.930
## imputed1$HOME_VAL                      -6.350e-02  3.515e-02  -1.806
## imputed1$URBANICITYHighly Urban/ Urban  1.582e+03  1.399e+02  11.307
## imputed1$PARENT1                        8.578e+02  1.801e+02   4.762
## imputed1$MSTATUS                       -4.256e+02  1.299e+02  -3.277
## imputed1$TRAVTIME                       1.217e+01  3.238e+00   3.757
## imputed1$BLUEBOOK                       7.166e-02  5.753e-02   1.246
##                                        Pr(>|t|)    
## (Intercept)                            0.260359    
## imputed1$AGE                           0.321784    
## imputed1$EDUCATIONBachelors            0.176153    
## imputed1$EDUCATIONHigh School          0.770339    
## imputed1$EDUCATIONMasters              0.528567    
## imputed1$EDUCATIONPhD                  0.999652    
## imputed1$REVOKED                       0.000887 ***
## imputed1$MVR_PTS                       2.46e-13 ***
## imputed1$JOBBlue Collar                0.111381    
## imputed1$JOBClerical                   0.659676    
## imputed1$JOBDoctor                     0.001344 ** 
## imputed1$JOBHome Maker                 0.857814    
## imputed1$JOBLawyer                     0.075753 .  
## imputed1$JOBManager                    0.000372 ***
## imputed1$JOBProfessional               0.890922    
## imputed1$JOBStudent                    0.495847    
## imputed1$YOJ                           0.586023    
## imputed1$CLM_FREQ                      0.003402 ** 
## imputed1$HOME_VAL                      0.070891 .  
## imputed1$URBANICITYHighly Urban/ Urban  < 2e-16 ***
## imputed1$PARENT1                       1.95e-06 ***
## imputed1$MSTATUS                       0.001055 ** 
## imputed1$TRAVTIME                      0.000173 ***
## imputed1$BLUEBOOK                      0.212884    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4571 on 8137 degrees of freedom
## Multiple R-squared:  0.05851,    Adjusted R-squared:  0.05585 
## F-statistic: 21.99 on 23 and 8137 DF,  p-value: < 2.2e-16

This model shows an adjusted R2 of 0.0559 and an F-statistic of 21.99 with a small p-value.
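The forward selection procedure described for Model 2 can be automated with base R's `step()`. Note that `step()` selects by AIC rather than by the R2 and F-statistic criteria mentioned above, so this is a related technique rather than the exact procedure used here. The data and variable names are simulated and hypothetical.

```r
set.seed(621)  # arbitrary seed
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), noise = rnorm(n))
toy$y <- 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n)

# Start from the intercept-only model and let step() add one term at a time.
null_fit <- lm(y ~ 1, data = toy)
forward_fit <- step(
  null_fit,
  scope     = list(lower = ~ 1, upper = ~ x1 + x2 + x3 + noise),
  direction = "forward",
  trace     = 0  # suppress the per-step log
)

print(formula(forward_fit))
```

Writing the formula with bare column names and a `data` argument, as here, also keeps the coefficient labels shorter than the `imputed1$` prefix seen in the output above.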

Model 3 - Leaps Package

The leaps package is a “regression subset selection” tool. The package automatically generates all possible models and is used to find the “best” one.

The leaps package will identify the “best model” using adjusted R2 and Mallows' Cp.

## 
## Call:
## lm(formula = imputed1$TARGET_AMT ~ +REVOKED + MVR_PTS + imputed1$CAR_TYPE + 
##     CAR_AGE + SEX + imputed1$TRAVTIME + imputed1$JOB + imputed1$URBANICITY + 
##     imputed1$MSTATUS + imputed1$CAR_USE, data = imputed1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5342  -1693   -792    288 104059 
## 
## Coefficients:
##                                        Estimate Std. Error t value
## (Intercept)                            -393.964    366.798  -1.074
## REVOKED                                 514.222    155.204   3.313
## MVR_PTS                                 210.279     24.081   8.732
## imputed1$CAR_TYPEPanel Truck            464.730    247.741   1.876
## imputed1$CAR_TYPEPickup                 394.119    168.470   2.339
## imputed1$CAR_TYPESports Car             972.216    203.930   4.767
## imputed1$CAR_TYPESUV                    699.868    164.985   4.242
## imputed1$CAR_TYPEVan                    567.103    207.133   2.738
## CAR_AGE                                 -28.256     10.859  -2.602
## SEX                                     224.400    146.141   1.536
## imputed1$TRAVTIME                        11.811      3.228   3.659
## imputed1$JOBBlue Collar                 578.898    261.861   2.211
## imputed1$JOBClerical                    716.828    280.004   2.560
## imputed1$JOBDoctor                     -439.746    376.224  -1.169
## imputed1$JOBHome Maker                  635.220    309.202   2.054
## imputed1$JOBLawyer                      245.971    286.522   0.858
## imputed1$JOBManager                    -548.828    266.887  -2.056
## imputed1$JOBProfessional                330.257    265.251   1.245
## imputed1$JOBStudent                     676.833    297.923   2.272
## imputed1$URBANICITYHighly Urban/ Urban 1694.854    136.497  12.417
## imputed1$MSTATUS                       -802.042    103.469  -7.752
## imputed1$CAR_USE                       -729.892    157.091  -4.646
##                                        Pr(>|t|)    
## (Intercept)                            0.282827    
## REVOKED                                0.000926 ***
## MVR_PTS                                 < 2e-16 ***
## imputed1$CAR_TYPEPanel Truck           0.060709 .  
## imputed1$CAR_TYPEPickup                0.019339 *  
## imputed1$CAR_TYPESports Car            1.90e-06 ***
## imputed1$CAR_TYPESUV                   2.24e-05 ***
## imputed1$CAR_TYPEVan                   0.006197 ** 
## CAR_AGE                                0.009281 ** 
## SEX                                    0.124699    
## imputed1$TRAVTIME                      0.000254 ***
## imputed1$JOBBlue Collar                0.027084 *  
## imputed1$JOBClerical                   0.010483 *  
## imputed1$JOBDoctor                     0.242502    
## imputed1$JOBHome Maker                 0.039970 *  
## imputed1$JOBLawyer                     0.390659    
## imputed1$JOBManager                    0.039776 *  
## imputed1$JOBProfessional               0.213141    
## imputed1$JOBStudent                    0.023122 *  
## imputed1$URBANICITYHighly Urban/ Urban  < 2e-16 ***
## imputed1$MSTATUS                       1.02e-14 ***
## imputed1$CAR_USE                       3.43e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4562 on 8139 degrees of freedom
## Multiple R-squared:  0.06199,    Adjusted R-squared:  0.05956 
## F-statistic: 25.61 on 21 and 8139 DF,  p-value: < 2.2e-16

This model shows an adjusted R2 of 0.0596 and an F-statistic of 25.61 with a small p-value.
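The idea behind best-subset selection can be illustrated without the leaps package by fitting every subset of a small predictor set and comparing adjusted R2. This base-R sketch on simulated data is a brute-force stand-in, not the leaps algorithm itself, and is only practical for a handful of predictors.

```r
set.seed(621)  # arbitrary seed
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n)
predictors <- c("x1", "x2", "x3", "x4")

# Enumerate every non-empty subset of the predictors.
subsets <- unlist(
  lapply(seq_along(predictors), function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one linear model per subset and record its adjusted R-squared.
adj_r2 <- sapply(subsets, function(vars) {
  summary(lm(reformulate(vars, response = "y"), data = toy))$adj.r.squared
})

best_vars <- subsets[[which.max(adj_r2)]]
print(best_vars)
```

With p predictors there are 2^p - 1 candidate models, which is why dedicated tools use smarter search strategies for larger problems.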

Binary Logistic Regression

Model 1 - Base Model: All variables

All of the variables will be tested to determine the base model they provided. This will allow us to see which variables are significant in our dataset, and allow us to make other models based on that.

## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## 
## Call:
## glm(formula = TARGET_FLAG ~ ., family = "binomial", data = imputed1)
## 
## Deviance Residuals: 
##       Min         1Q     Median         3Q        Max  
## -0.003710   0.000000   0.000000   0.000000   0.005352  
## 
## Coefficients: (2 not defined because of singularities)
##                                 Estimate Std. Error z value Pr(>|z|)
## (Intercept)                   -3.534e+02  2.189e+04  -0.016    0.987
## TARGET_AMT                     3.925e-01  2.744e+00   0.143    0.886
## KIDSDRIV                       1.484e+01  1.500e+03   0.010    0.992
## AGE                           -7.056e-01  2.576e+01  -0.027    0.978
## HOMEKIDS                       2.243e+00  6.711e+02   0.003    0.997
## YOJ                           -2.420e-01  5.761e+01  -0.004    0.997
## INCOME                        -1.731e-04  1.642e-01  -0.001    0.999
## PARENT1                       -3.865e+00  3.642e+03  -0.001    0.999
## HOME_VAL                       1.651e-03  1.304e-01   0.013    0.990
## MSTATUS                        1.874e+00  4.472e+02   0.004    0.997
## SEX                            7.708e+00  1.673e+03   0.005    0.996
## EDUCATIONBachelors            -9.223e+00  1.735e+03  -0.005    0.996
## EDUCATIONHigh School           4.761e+00  4.609e+02   0.010    0.992
## EDUCATIONMasters              -2.636e+00  3.731e+03  -0.001    0.999
## EDUCATIONPhD                   1.090e+01  1.688e+04   0.001    0.999
## JOBBlue Collar                 2.930e+02  2.154e+04   0.014    0.989
## JOBClerical                    2.935e+02  2.154e+04   0.014    0.989
## JOBDoctor                     -5.549e+01  9.203e+05   0.000    1.000
## JOBHome Maker                  2.811e+02  2.582e+04   0.011    0.991
## JOBLawyer                      2.196e+02  2.511e+04   0.009    0.993
## JOBManager                     2.770e+02  2.737e+04   0.010    0.992
## JOBProfessional                2.763e+02  4.388e+04   0.006    0.995
## JOBStudent                     2.913e+02  2.150e+04   0.014    0.989
## TRAVTIME                      -3.357e-02  1.323e+01  -0.003    0.998
## CAR_USE                       -4.805e+00  5.301e+02  -0.009    0.993
## BLUEBOOK                       4.589e-03  2.900e-01   0.016    0.987
## TIF                           -5.620e-01  7.309e+01  -0.008    0.994
## CAR_TYPEPanel Truck           -1.304e+02  4.253e+04  -0.003    0.998
## CAR_TYPEPickup                 2.878e+00  6.093e+02   0.005    0.996
## CAR_TYPESports Car             8.145e+00  2.046e+03   0.004    0.997
## CAR_TYPESUV                   -1.904e+00  1.703e+03  -0.001    0.999
## CAR_TYPEVan                   -1.698e+01  3.361e+03  -0.005    0.996
## RED_CAR                       -7.328e-01  5.167e+02  -0.001    0.999
## OLDCLAIM                      -1.463e-04  2.365e-01  -0.001    1.000
## CLM_FREQ                       3.775e+00  2.146e+02   0.018    0.986
## REVOKED                        9.272e+00  5.920e+02   0.016    0.988
## MVR_PTS                       -1.159e+00  9.130e+01  -0.013    0.990
## CAR_AGE                        3.482e-01  5.490e+01   0.006    0.995
## URBANICITYHighly Urban/ Urban  8.681e+00  6.175e+02   0.014    0.989
## NOHOMEKIDS                     1.583e+01  1.807e+03   0.009    0.993
## NOKIDSDRIV                     2.283e+01  4.079e+03   0.006    0.996
## HASCOLLEGE                            NA         NA      NA       NA
## ISPROFESSIONAL                        NA         NA      NA       NA
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9.4180e+03  on 8160  degrees of freedom
## Residual deviance: 1.2291e-04  on 8120  degrees of freedom
## AIC: 82
## 
## Number of Fisher Scoring iterations: 25
##           llh       llhNull            G2      McFadden          r2ML 
## -6.145302e-05 -4.708981e+03  9.417962e+03  1.000000e+00  6.846337e-01 
##          r2CU 
##  1.000000e+00
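The convergence warnings, the near-zero residual deviance and the McFadden value of 1 in the output above are what one would expect when a predictor perfectly separates the outcome. Here TARGET_AMT is still among the predictors and, by the definitions in the Overview, is greater than zero exactly when TARGET_FLAG is 1. A minimal sketch of the effect on made-up data (variable names and values are hypothetical):

```r
# Hypothetical toy data: amt is positive exactly when flag is 1.
toy <- data.frame(flag = rep(c(0, 1), each = 50))
toy$amt <- c(rep(0, 50), seq(500, 5000, length.out = 50))
toy$x   <- c(rep(1:10, 5), rep(3:12, 5))  # overlaps between groups, no separation

# Collect warnings instead of printing them, so they can be inspected.
warnings_seen <- character(0)
fit_leaky <- withCallingHandlers(
  glm(flag ~ amt + x, family = binomial, data = toy),
  warning = function(w) {
    warnings_seen <<- c(warnings_seen, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

# Same model without the leaking predictor.
fit_clean <- glm(flag ~ x, family = binomial, data = toy)

print(warnings_seen)
print(c(leaky = deviance(fit_leaky), clean = deviance(fit_clean)))
```

Excluding the other response variable from the predictor set avoids the problem.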

Model 2 - Backward Elimination

Variables will be removed one by one to determine the best-fit model. After each variable is removed, the model will be run again until the best output is produced. Only the final output will be shown.

## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + YOJ + PARENT1 + 
##     HOME_VAL + MSTATUS + SEX + TRAVTIME + CAR_USE + TIF + OLDCLAIM + 
##     CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY, family = "binomial", 
##     data = dplyr::select(imputed1, -TARGET_AMT))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4154  -0.7469  -0.4494   0.7210   2.9518  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -2.015e+00  1.609e-01 -12.521  < 2e-16 ***
## KIDSDRIV                       3.110e-01  5.827e-02   5.338 9.42e-08 ***
## HOMEKIDS                       1.273e-01  3.305e-02   3.851 0.000117 ***
## YOJ                           -4.336e-02  7.021e-03  -6.175 6.60e-10 ***
## PARENT1                        3.421e-01  1.057e-01   3.237 0.001206 ** 
## HOME_VAL                      -1.394e-04  1.900e-05  -7.335 2.21e-13 ***
## MSTATUS                       -3.601e-01  7.466e-02  -4.824 1.41e-06 ***
## SEX                           -2.879e-01  6.010e-02  -4.790 1.66e-06 ***
## TRAVTIME                       1.474e-02  1.829e-03   8.062 7.48e-16 ***
## CAR_USE                       -7.682e-01  6.026e-02 -12.748  < 2e-16 ***
## TIF                           -5.104e-02  7.132e-03  -7.156 8.30e-13 ***
## OLDCLAIM                       9.888e-05  4.149e-05   2.383 0.017171 *  
## CLM_FREQ                       1.225e-01  3.129e-02   3.913 9.10e-05 ***
## REVOKED                        7.620e-01  7.790e-02   9.782  < 2e-16 ***
## MVR_PTS                        1.135e-01  1.336e-02   8.502  < 2e-16 ***
## CAR_AGE                       -4.178e-02  5.196e-03  -8.040 8.98e-16 ***
## URBANICITYHighly Urban/ Urban  2.111e+00  1.105e-01  19.106  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9418.0  on 8160  degrees of freedom
## Residual deviance: 7658.3  on 8144  degrees of freedom
## AIC: 7692.3
## 
## Number of Fisher Scoring iterations: 5
##           llh       llhNull            G2      McFadden          r2ML 
## -3829.1421401 -4708.9811460  1759.6780119     0.1868428     0.1939588 
##          r2CU 
##     0.2833030
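Backward elimination for a logistic model can be sketched with base R's `step()`, which removes terms by AIC; this may differ from the manual procedure used here. The second half of the sketch recomputes the pseudo-R2 values reported above from the two log-likelihoods, using the standard McFadden, maximum likelihood (Cox-Snell) and Cragg-Uhler (Nagelkerke) formulas. The simulated data and variable names are hypothetical.

```r
set.seed(621)  # arbitrary seed
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$flag <- rbinom(n, 1, plogis(-1 + 1.2 * toy$x1 - 0.8 * toy$x2))

# Backward elimination: start from the full model and drop terms by AIC.
full_fit <- glm(flag ~ x1 + x2 + x3 + x4, family = binomial, data = toy)
back_fit <- step(full_fit, direction = "backward", trace = 0)
null_fit <- glm(flag ~ 1, family = binomial, data = toy)

# McFadden pseudo-R2 for the toy model: 1 - logLik(model) / logLik(null).
mcfadden_toy <- 1 - as.numeric(logLik(back_fit)) / as.numeric(logLik(null_fit))

# Recompute the statistics reported above from llh and llhNull.
llh      <- -3829.1421401
llh_null <- -4708.9811460
n_obs    <- 8161  # null deviance df (8160) + 1

g2       <- -2 * (llh_null - llh)                    # likelihood-ratio statistic
mcfadden <- 1 - llh / llh_null
r2_ml    <- 1 - exp(-g2 / n_obs)                     # maximum likelihood R2
r2_cu    <- r2_ml / (1 - exp(2 * llh_null / n_obs))  # Cragg-Uhler R2

print(round(c(G2 = g2, McFadden = mcfadden, r2ML = r2_ml, r2CU = r2_cu), 4))
```

The recomputed values agree with the G2, McFadden, r2ML and r2CU figures in the output above.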

Model 3 - glmulti Package

The glmulti package is an “automated model selection and model averaging” tool. The package automatically generates all possible models “with the specified response and explanatory variables”. The tool is basically used to find the “best” model.

The exhaustive approach was not practical for this dataset, as it continued to run after 2 hours.

After running the package, I entered the “best” model manually in R so as not to have to rerun the search (10+ mins) each time.

glmulti - all data including transformations

## 
## Call:
## glm(formula = imputed1$TARGET_FLAG ~ 1 + PARENT1 + MSTATUS + 
##     SEX + EDUCATION + JOB + CAR_USE + CAR_TYPE + REVOKED + URBANICITY + 
##     KIDSDRIV + AGE + YOJ + HOME_VAL + TRAVTIME + TIF + OLDCLAIM + 
##     CLM_FREQ + MVR_PTS, data = imputed1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.9825  -0.2814  -0.1104   0.2880   1.2269  
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   -9.978e-04  4.639e-02  -0.022 0.982839    
## PARENT1                        8.504e-02  1.577e-02   5.391 7.18e-08 ***
## MSTATUS                       -6.891e-02  1.117e-02  -6.167 7.29e-10 ***
## SEX                            3.759e-02  1.254e-02   2.997 0.002736 ** 
## EDUCATIONBachelors            -7.200e-02  1.632e-02  -4.412 1.04e-05 ***
## EDUCATIONHigh School           2.204e-05  1.461e-02   0.002 0.998797    
## EDUCATIONMasters              -5.591e-02  2.288e-02  -2.444 0.014562 *  
## EDUCATIONPhD                  -6.488e-02  2.720e-02  -2.385 0.017086 *  
## JOBBlue Collar                 8.642e-02  2.745e-02   3.149 0.001645 ** 
## JOBClerical                    1.103e-01  2.884e-02   3.825 0.000132 ***
## JOBDoctor                     -3.505e-02  3.495e-02  -1.003 0.315944    
## JOBHome Maker                  1.177e-01  2.986e-02   3.942 8.15e-05 ***
## JOBLawyer                      3.552e-02  2.523e-02   1.408 0.159243    
## JOBManager                    -5.822e-02  2.463e-02  -2.363 0.018140 *  
## JOBProfessional                5.529e-02  2.634e-02   2.099 0.035840 *  
## JOBStudent                     1.077e-01  3.108e-02   3.467 0.000529 ***
## CAR_USE                       -1.198e-01  1.406e-02  -8.517  < 2e-16 ***
## CAR_TYPEPanel Truck            1.076e-02  2.136e-02   0.504 0.614312    
## CAR_TYPEPickup                 7.856e-02  1.449e-02   5.422 6.07e-08 ***
## CAR_TYPESports Car             1.693e-01  1.748e-02   9.684  < 2e-16 ***
## CAR_TYPESUV                    1.295e-01  1.411e-02   9.174  < 2e-16 ***
## CAR_TYPEVan                    5.330e-02  1.774e-02   3.005 0.002667 ** 
## REVOKED                        1.314e-01  1.331e-02   9.875  < 2e-16 ***
## URBANICITYHighly Urban/ Urban  2.952e-01  1.196e-02  24.689  < 2e-16 ***
## KIDSDRIV                       6.636e-02  8.742e-03   7.591 3.54e-14 ***
## AGE                           -7.909e-04  5.531e-04  -1.430 0.152788    
## YOJ                           -3.153e-03  1.243e-03  -2.537 0.011208 *  
## HOME_VAL                      -1.143e-05  2.999e-06  -3.811 0.000139 ***
## TRAVTIME                       2.020e-03  2.755e-04   7.332 2.49e-13 ***
## TIF                           -7.827e-03  1.042e-03  -7.515 6.29e-14 ***
## OLDCLAIM                       1.264e-05  7.153e-06   1.766 0.077376 .  
## CLM_FREQ                       1.936e-02  5.388e-03   3.592 0.000330 ***
## MVR_PTS                        2.033e-02  2.235e-03   9.094  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1511599)
## 
##     Null deviance: 1585.0  on 8160  degrees of freedom
## Residual deviance: 1228.6  on 8128  degrees of freedom
## AIC: 7775.3
## 
## Number of Fisher Scoring iterations: 2
##           llh       llhNull            G2      McFadden          r2ML 
## -3853.6578451 -4892.9184843  2078.5212784     0.2124010     0.2248429 
##          r2CU 
##     0.3218783

Model Selection.

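Based on the above models, I decided to use the model provided by the glmulti package. Its residual deviance (1228.6) and pseudo-R² values (McFadden 0.212) were the best of the candidates. Because this model was fit with the gaussian family, those values are not on the same scale as the binomial models, and its AIC (7775.3) is slightly higher than that of the previous logistic model (7692.3).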
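As a rough check, the AIC and McFadden values printed in the two summaries above can be reproduced from the reported log-likelihoods. This is only a sketch: the variable names are hypothetical, and the parameter counts are read off the degrees of freedom in the output (16 slopes plus an intercept for the binomial model; 32 slopes, an intercept, and the residual variance for the gaussian model).

```r
# Log-likelihoods copied from the pR2() output of the two models above.
llh_binomial      <- -3829.1421401   # preceding logistic model
llh_null_binomial <- -4708.9811460
k_binomial        <- 17              # 16 slopes + intercept

llh_glmulti       <- -3853.6578451   # glmulti model (gaussian family)
llh_null_glmulti  <- -4892.9184843
k_glmulti         <- 34              # 32 slopes + intercept + residual variance

# AIC = 2k - 2 * log-likelihood; McFadden R^2 = 1 - llh / llhNull
aic_from_llh <- function(llh, k) 2 * k - 2 * llh
mcfadden     <- function(llh, llh_null) 1 - llh / llh_null

aic_binomial <- aic_from_llh(llh_binomial, k_binomial)
aic_glmulti  <- aic_from_llh(llh_glmulti, k_glmulti)
r2_binomial  <- mcfadden(llh_binomial, llh_null_binomial)
r2_glmulti   <- mcfadden(llh_glmulti, llh_null_glmulti)

round(c(aic_binomial = aic_binomial, aic_glmulti = aic_glmulti), 1)
round(c(r2_binomial = r2_binomial, r2_glmulti = r2_glmulti), 7)
```

The results agree with the printed AIC values (7692.3 and 7775.3) and McFadden values (0.1868428 and 0.2124010). The two log-likelihoods come from different families, so the comparison between them is only indicative.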

Evaluating the model

Before I ran the evaluation data through the model, I decided to split the training data 80/20. This lets me check the accuracy of the model, because the held-out rows have a known ‘target’ value that can be compared with the model’s prediction.
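A minimal base R sketch of such an 80/20 split is shown here, assuming a data frame with a binary TARGET_FLAG column. The data frame `toy` is a simulated stand-in, not the insurance data, and sampling within each class is one possible design choice rather than necessarily what was done for this report.

```r
set.seed(621)
# Hypothetical stand-in for the imputed training data.
toy <- data.frame(TARGET_FLAG = rbinom(1000, 1, 0.26), x = rnorm(1000))

# Sample 80% of the row indices within each class so that both parts
# keep roughly the same crash rate.
rows_by_class <- split(seq_len(nrow(toy)), toy$TARGET_FLAG)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

training <- toy[train_idx, ]
testing  <- toy[-train_idx, ]

# Crash rate in each part
c(train = mean(training$TARGET_FLAG), test = mean(testing$TARGET_FLAG))
```

Splitting on row indices guarantees that every row lands in exactly one of the two parts.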

After splitting the data into two new data sets (training and testing), I created an ROC graph to help determine which threshold to use in my model.

Looking at the graph, the 0.4 threshold seems to be the best choice for my testing. A threshold of 0.4 gives a TP rate of about 0.6 with an FP rate of only ~0.2. Thresholds of 0.3 and 0.2 give slightly higher TP rates but also a high FP rate, which, in my opinion, isn’t worth the slight increase in TP. A threshold of 0.5 gives a slightly lower FP rate but a significantly lower TP rate.
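The trade-off read off the ROC graph can also be tabulated directly. The sketch below uses simulated scores rather than the insurance model, and the helper `rates_at` is a hypothetical name; it only illustrates how the TP and FP rates move together as the threshold changes.

```r
set.seed(621)
n <- 2000
x <- rnorm(n)
actual <- rbinom(n, 1, plogis(-1 + 1.5 * x))   # simulated 0/1 outcome
fit <- glm(actual ~ x, family = binomial)
score <- fitted(fit)                           # predicted probabilities

# True and false positive rates when predicting 1 for score > threshold.
rates_at <- function(threshold) {
  predicted <- score > threshold
  c(threshold = threshold,
    tpr = sum(predicted & actual == 1) / sum(actual == 1),
    fpr = sum(predicted & actual == 0) / sum(actual == 0))
}

roc_table <- as.data.frame(t(sapply(c(0.2, 0.3, 0.4, 0.5), rates_at)))
roc_table
```

Raising the threshold can only lower both rates, so the choice comes down to how much TP rate one is willing to give up for a lower FP rate.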

##            PredictedValue
## ActualValue FALSE TRUE
##           0  4026  703
##           1   741 1003
## [1] 0.774

After testing the model, the accuracy is around 0.774, which seems like a decent fit.

The misclassification rate is about 0.225.

The true positive rate is about 0.573.

The false positive rate is about 0.151.

The specificity is about 0.849.

The precision is about 0.577.
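For reference, each of these metrics can be derived from the four cells of a confusion matrix. The sketch below plugs in the counts from the table printed above; the object names are hypothetical. The results need not match the quoted figures to the last digit if those came from a different run of the random split.

```r
# Counts copied from the confusion matrix above (rows = actual, cols = predicted).
cm <- matrix(c(4026,  703,
                741, 1003),
             nrow = 2, byrow = TRUE,
             dimnames = list(ActualValue = c("0", "1"),
                             PredictedValue = c("FALSE", "TRUE")))

tn <- cm["0", "FALSE"]; fp <- cm["0", "TRUE"]
fn <- cm["1", "FALSE"]; tp <- cm["1", "TRUE"]

metrics <- c(
  accuracy          = (tp + tn) / sum(cm),
  misclassification = (fp + fn) / sum(cm),
  tpr               = tp / (tp + fn),   # sensitivity / recall
  fpr               = fp / (fp + tn),
  specificity       = tn / (tn + fp),   # equals 1 - fpr
  precision         = tp / (tp + fp)
)
round(metrics, 3)
```

Accuracy and the misclassification rate always sum to 1, as do the false positive rate and the specificity, which gives a quick sanity check on reported values.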

Testing with the glmulti model

## predict12
##    0    1 
## 1550  591
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2808  0.1296  0.2822  0.2776  0.4121  0.9482

This model predicts that 591 insurance customers in the evaluation data will have an auto accident, while 1550 will not.
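The summary above shows a minimum predicted value below zero (-0.2808). That can happen because the glmulti model output reports the gaussian family, which is what glm() uses when no family is given. The sketch below uses simulated data with hypothetical names to show the difference between the two families; it is not a refit of the insurance model.

```r
set.seed(621)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1.5 + 2 * x))        # simulated 0/1 outcome

fit_gaussian <- glm(y ~ x)                     # no family given: gaussian (linear model)
fit_binomial <- glm(y ~ x, family = binomial)  # logistic regression

p_gaussian <- predict(fit_gaussian, type = "response")
p_binomial <- predict(fit_binomial, type = "response")

range(p_gaussian)   # can fall outside [0, 1]
range(p_binomial)   # always strictly between 0 and 1
```

Thresholding still works with the linear fit, but its predictions are not probabilities, which is worth keeping in mind when reading the 0/1 counts above.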

References

All subset regression with leaps, bestglm, glmulti, and meifly. (n.d.). Retrieved from https://rstudio-pubs-static.s3.amazonaws.com/2897_9220b21cfc0c43a396ff9abf122bb351.html

Model selection and multimodel inference made easy. (n.d.). Retrieved from https://cran.r-project.org/web/packages/glmulti/glmulti.pdf

Best subset model selection with R. (n.d.). Retrieved from http://jadianes.me/best-subset-model-selection-with-R

Appendix

# Packages used by the code in this appendix
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(reshape2)
library(psych)
library(knitr)
library(kableExtra)
library(corrgram)
library(mice)
library(leaps)
library(pscl)
library(caTools)
library(ROCR)

# Training data: drop INDEX, convert the money columns to numeric,
# and strip the "z_" prefix from the factor levels.
# Note: as.numeric() does not parse currency text; values containing "$" or ","
# would need those characters removed first.
train_insurance <- read.csv("https://raw.githubusercontent.com/nschettini/CUNY-MSDS-DATA-621/master/insurance_training_data.csv") %>%
  dplyr::select(-INDEX) %>%
  mutate(
    INCOME = as.numeric(INCOME),
    HOME_VAL = as.numeric(HOME_VAL),
    BLUEBOOK = as.numeric(BLUEBOOK),
    OLDCLAIM = as.numeric(OLDCLAIM),
    MSTATUS = as.factor(str_remove(MSTATUS, "z_")),
    SEX = as.factor(str_remove(SEX, "z_")),
    EDUCATION = as.factor(str_remove(EDUCATION, "z_")),
    JOB = as.factor(str_remove(JOB, "z_")),
    CAR_TYPE = as.factor(str_remove(CAR_TYPE, "z_")),
    URBANICITY = as.factor(str_remove(URBANICITY, "z_")))

# Evaluation data: same cleaning steps
eval_data <- read.csv("https://raw.githubusercontent.com/nschettini/CUNY-MSDS-DATA-621/master/insurance-evaluation-data.csv") %>%
  dplyr::select(-INDEX) %>%
  mutate(
    INCOME = as.numeric(INCOME),
    HOME_VAL = as.numeric(HOME_VAL),
    BLUEBOOK = as.numeric(BLUEBOOK),
    OLDCLAIM = as.numeric(OLDCLAIM),
    MSTATUS = as.factor(str_remove(MSTATUS, "z_")),
    SEX = as.factor(str_remove(SEX, "z_")),
    EDUCATION = as.factor(str_remove(EDUCATION, "z_")),
    JOB = as.factor(str_remove(JOB, "z_")),
    CAR_TYPE = as.factor(str_remove(CAR_TYPE, "z_")),
    URBANICITY = as.factor(str_remove(URBANICITY, "z_")))

# Summary statistics plus a count of missing values per variable
insurance_desc <- describe(train_insurance)
insurance_desc$na_count <- sapply(train_insurance, function(y) sum(is.na(y)))

kable(insurance_desc, "html", escape = FALSE) %>%
  kable_styling("striped", full_width = TRUE) %>%
  column_spec(1, bold = TRUE) %>%
  scroll_box(width = "100%", height = "700px")

# Boxplots, one panel per variable
ggplot(melt(train_insurance), aes(x = factor(variable), y = value)) +
  facet_wrap(~variable, scales = "free") +
  geom_boxplot()

# Boxplots on a common axis without the two target columns (blue = mean, red = median)
ggplot1 <- train_insurance[, -c(1, 2)]

ggplot(melt(ggplot1), aes(x = factor(variable), y = value)) +
  geom_boxplot() +
  stat_summary(fun = mean, color = "blue", geom = "point") +
  stat_summary(fun = median, color = "red", geom = "point") +
  coord_flip() +
  theme_bw()

# Histograms, one panel per variable
ggplot(melt(train_insurance), aes(x = value)) +
  facet_wrap(~variable, scales = "free") +
  geom_histogram(bins = 50)

# Correlation of each numeric variable with the two targets
num.cols <- sapply(train_insurance, is.numeric)
cor.data <- cor(train_insurance[, num.cols])

kable(cor.data[, 1:2], "html", escape = FALSE) %>%
  kable_styling("striped", full_width = FALSE) %>%
  column_spec(1, bold = TRUE) %>%
  scroll_box(height = "500px")

corrgram(drop_na(train_insurance), order = TRUE, upper.panel = panel.cor, main = "Insurance")

# Map of missing values
library(Amelia)
missmap(train_insurance, main = "Missing values vs observed")

train_ins <- train_insurance

# Dry run (maxit = 0) to obtain the default methods and predictor matrix
init <- mice(train_ins, maxit = 0)
meth <- init$method
predM <- init$predictorMatrix

# Do not use TARGET_FLAG as a predictor when imputing
predM[, c("TARGET_FLAG")] <- 0

# Random-forest imputation with 5 imputed data sets
imputed <- mice(train_ins, method = "rf", predictorMatrix = predM, m = 5)

# mice::complete() is called explicitly because tidyr also exports complete()
imputed1 <- mice::complete(imputed)

# Recode the binary factors as 0/1 indicators
imputed1 <- imputed1 %>%
  mutate(
    PARENT1 = if_else(PARENT1 == "Yes", 1, 0),
    MSTATUS = if_else(MSTATUS == "Yes", 1, 0),
    SEX = if_else(SEX == "M", 1, 0),
    CAR_USE = if_else(CAR_USE == "Private", 1, 0),
    RED_CAR = if_else(RED_CAR == "yes", 1, 0),
    REVOKED = if_else(REVOKED == "Yes", 1, 0))

# Derived indicator variables
imputed1 <- imputed1 %>%
  mutate(
    NOHOMEKIDS = as.integer(HOMEKIDS == 0),
    NOKIDSDRIV = as.integer(KIDSDRIV == 0),
    HASCOLLEGE = as.integer(EDUCATION %in% c("Bachelors", "Masters", "PhD")),
    ISPROFESSIONAL = as.integer(JOB %in% c("Doctor", "Lawyer", "Manager", "Professional")))

# Same imputation and recoding steps for the evaluation data
eval_ins <- eval_data

init <- mice(eval_ins, maxit = 0)
meth <- init$method
predM <- init$predictorMatrix
predM[, c("TARGET_FLAG")] <- 0

imputed2 <- mice(eval_ins, method = "rf", predictorMatrix = predM, m = 5)
imputed3 <- mice::complete(imputed2)

imputed3 <- imputed3 %>%
  mutate(
    PARENT1 = if_else(PARENT1 == "Yes", 1, 0),
    MSTATUS = if_else(MSTATUS == "Yes", 1, 0),
    SEX = if_else(SEX == "M", 1, 0),
    CAR_USE = if_else(CAR_USE == "Private", 1, 0),
    RED_CAR = if_else(RED_CAR == "yes", 1, 0),
    REVOKED = if_else(REVOKED == "Yes", 1, 0))

imputed3 <- imputed3 %>%
  mutate(
    NOHOMEKIDS = as.integer(HOMEKIDS == 0),
    NOKIDSDRIV = as.integer(KIDSDRIV == 0),
    HASCOLLEGE = as.integer(EDUCATION %in% c("Bachelors", "Masters", "PhD")),
    ISPROFESSIONAL = as.integer(JOB %in% c("Doctor", "Lawyer", "Manager", "Professional")))

# Restore the target columns from the original evaluation data
imputed3$TARGET_FLAG <- as.numeric(eval_data$TARGET_FLAG)
imputed3$TARGET_AMT <- as.numeric(eval_data$TARGET_AMT)

imputedtable <- describe(imputed1)

kable(imputedtable, "html", escape = FALSE) %>%
  kable_styling("striped", full_width = TRUE) %>%
  column_spec(1, bold = TRUE) %>%
  scroll_box(width = "100%", height = "700px")

# Linear model with all variables
model1 <- lm(TARGET_AMT ~ ., imputed1)

(summary(model1))
summodel1 <- summary(model1)

# Linear model with a reduced set of predictors
model2 <- lm(TARGET_AMT ~ AGE + EDUCATION + REVOKED + MVR_PTS + JOB + YOJ + CLM_FREQ +
               HOME_VAL + URBANICITY + PARENT1 + MSTATUS + TRAVTIME + BLUEBOOK,
             data = imputed1)
summary(model2)

# Best-subset search limited to 5 predictors. The response is named directly
# so that "." does not include TARGET_AMT among the predictors.
best.subset <- regsubsets(TARGET_AMT ~ ., imputed1, nvmax = 5)
best.subset.summary <- summary(best.subset)
best.subset.summary$outmat

best.subset.by.adjr2 <- which.max(best.subset.summary$adjr2)
best.subset.by.adjr2

best.subset.by.cp <- which.min(best.subset.summary$cp)
best.subset.by.cp

model3 <- lm(TARGET_AMT ~ REVOKED + MVR_PTS + CAR_TYPE + CAR_AGE + SEX + TRAVTIME +
               JOB + URBANICITY + MSTATUS + CAR_USE,
             data = imputed1)
summary(model3)

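# Logistic regression with all variables
modellog1 <- glm(TARGET_FLAG ~ ., family = "binomial", data = imputed1)
summary(modellog1)
pR2(modellog1)

# Logistic regression with a reduced set of predictors
model2 <- glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + YOJ + PARENT1 + HOME_VAL +
                MSTATUS + SEX + TRAVTIME + CAR_USE + TIF + OLDCLAIM + CLM_FREQ +
                REVOKED + MVR_PTS + CAR_AGE + URBANICITY,
              family = "binomial", data = dplyr::select(imputed1, -TARGET_AMT))
summary(model2)
pR2(model2)

library(rJava)
library(glmulti)

# Genetic-algorithm search (method = "g") over main effects (level = 1), ranked by AIC.
# The search runs on the completed data frame imputed1, with TARGET_AMT excluded
# so that it cannot be selected as a predictor.
glmulti.lm.out <- glmulti(TARGET_FLAG ~ ., data = dplyr::select(imputed1, -TARGET_AMT),
                          level = 1,
                          method = "g",
                          crit = "aic",
                          confsetsize = 5,
                          plotty = FALSE, report = FALSE,
                          fitfunction = "lm")

# Best model reported by glmulti, entered by hand to avoid rerunning the search.
# The formula matches the Call shown in the model summary; no family is given,
# so glm() fits the default gaussian family.
modelglmulti <- glm(imputed1$TARGET_FLAG ~ 1 + PARENT1 + MSTATUS + SEX + EDUCATION + JOB +
                      CAR_USE + CAR_TYPE + REVOKED + URBANICITY + KIDSDRIV + AGE + YOJ +
                      HOME_VAL + TRAVTIME + TIF + OLDCLAIM + CLM_FREQ + MVR_PTS,
                    data = imputed1)

summary(modelglmulti)
pR2(modelglmulti)

modelglmulti1 <- modelglmulti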

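imputed1$TARGET_FLAG <- as.factor(imputed1$TARGET_FLAG)
splitdata <- imputed1

# 80/20 split on the outcome vector (sample.split expects a vector of labels,
# not a data frame)
split <- sample.split(splitdata$TARGET_FLAG, SplitRatio = 0.8)
training <- subset(splitdata, split == TRUE)
testing <- subset(splitdata, split == FALSE)

# ROC curve on the training part, with cutoffs printed every 0.1
res <- predict(modelglmulti, newdata = training, type = "response")

ROCRPred <- prediction(res, training$TARGET_FLAG)
ROCRPref <- performance(ROCRPred, "tpr", "fpr")

plot(ROCRPref, colorize = TRUE, print.cutoffs.at = seq(0.1, 1, by = 0.1))

# Confusion matrix and accuracy at the 0.4 threshold chosen from the ROC graph.
# Accuracy is computed from the table instead of hard-coded counts.
threshold <- 0.4
(cm_train <- table(ActualValue = training$TARGET_FLAG, PredictedValue = res > threshold))
round(sum(diag(cm_train)) / sum(cm_train), 3)

# Same check on the held-out testing part
res <- predict(modelglmulti, newdata = testing, type = "response")
(cm_test <- table(ActualValue = testing$TARGET_FLAG, PredictedValue = res > threshold))
round(sum(diag(cm_test)) / sum(cm_test), 3)

imputed4 <- imputed3[, -1]

# Predictions for the evaluation data
predict1 <- predict(modelglmulti, newdata = imputed3, type = "response")
predict12 <- ifelse(predict1 > threshold, 1, 0)
table(predict12)
summary(predict1)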