Cover Page
CUNY MSDS HW4 - Binary Logistic Regression and Linear Regression
Nicholas Schettini
CUNY School of Professional Studies
Abstract
In this research assignment, we investigated data on customers at an auto insurance company. The data contain two response variables: TARGET_FLAG and TARGET_AMT. TARGET_FLAG is a binary variable indicating whether the car was involved in a crash; TARGET_AMT is the cost of the crash. The explanatory variables in this dataset include age, bluebook, car_age, car_type, clm_freq, car_use, sex, red_car, urbanicity, and yoj. The data consist of 8,161 observations and 26 variables. The research included four overall stages: data exploration, data preparation, model creation, and model selection. The data were visualized using multiple methods, including histograms and boxplots, and prepared by imputing NA values with the mice package in R. Different models were created using different approaches (for example, backwards elimination), and finally the best model was selected. The research shows that certain variables in the dataset were better predictors of an accident than others.
Keywords: R, auto insurance, prediction, modeling, logistic binary regression, linear regression
Overview
In this homework assignment, you will explore, analyze and model a data set containing approximately 8,000 records, each representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash; a zero means that the person was not. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car; if they did, this number will be a value greater than zero.
Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car, and also the amount of money it will cost if they do. You can only use the variables given to you (or variables that you derive from them). Below is a statistical summary of the variables in the data set:
variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | na_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TARGET_FLAG | 1 | 8161 | 0.2638157 | 0.4407276 | 0 | 0.2047787 | 0.0000 | 0 | 1.0 | 1.0 | 1.0716614 | -0.8516462 | 0.0048786 | 0 |
TARGET_AMT | 2 | 8161 | 1504.3246481 | 4704.0269298 | 0 | 593.7121106 | 0.0000 | 0 | 107586.1 | 107586.1 | 8.7063034 | 112.2884386 | 52.0712628 | 0 |
KIDSDRIV | 3 | 8161 | 0.1710575 | 0.5115341 | 0 | 0.0252719 | 0.0000 | 0 | 4.0 | 4.0 | 3.3518374 | 11.7801916 | 0.0056624 | 0 |
AGE | 4 | 8155 | 44.7903127 | 8.6275895 | 45 | 44.8306513 | 8.8956 | 16 | 81.0 | 65.0 | -0.0289889 | -0.0617020 | 0.0955383 | 6 |
HOMEKIDS | 5 | 8161 | 0.7212351 | 1.1163233 | 0 | 0.4971665 | 0.0000 | 0 | 5.0 | 5.0 | 1.3411271 | 0.6489915 | 0.0123571 | 0 |
YOJ | 6 | 7707 | 10.4992864 | 4.0924742 | 11 | 11.0711853 | 2.9652 | 0 | 23.0 | 23.0 | -1.2029676 | 1.1773410 | 0.0466169 | 454 |
INCOME | 7 | 8161 | 2875.5505453 | 2090.6786785 | 2817 | 2816.9534385 | 2799.1488 | 1 | 6613.0 | 6612.0 | 0.1094699 | -1.2853032 | 23.1427840 | 0 |
PARENT1* | 8 | 8161 | 1.1319691 | 0.3384779 | 1 | 1.0399755 | 0.0000 | 1 | 2.0 | 1.0 | 2.1743561 | 2.7281589 | 0.0037468 | 0 |
HOME_VAL | 9 | 8161 | 1684.8931503 | 1697.3791897 | 1245 | 1516.4994639 | 1842.8718 | 1 | 5107.0 | 5106.0 | 0.5162324 | -1.1810965 | 18.7891522 | 0 |
MSTATUS* | 10 | 8161 | 1.5996814 | 0.4899929 | 2 | 1.6245979 | 0.0000 | 1 | 2.0 | 1.0 | -0.4068189 | -1.8347231 | 0.0054240 | 0 |
SEX* | 11 | 8161 | 1.4639137 | 0.4987266 | 1 | 1.4548936 | 0.0000 | 1 | 2.0 | 1.0 | 0.1446959 | -1.9793056 | 0.0055207 | 0 |
EDUCATION* | 12 | 8161 | 2.8120328 | 1.1786322 | 3 | 2.7785266 | 1.4826 | 1 | 5.0 | 4.0 | 0.1543452 | -0.8453783 | 0.0130469 | 0 |
JOB* | 13 | 8161 | 4.8337214 | 2.6238293 | 5 | 4.7636698 | 4.4478 | 1 | 9.0 | 8.0 | 0.1300643 | -1.4594539 | 0.0290445 | 0 |
TRAVTIME | 14 | 8161 | 33.4857248 | 15.9083334 | 33 | 32.9954051 | 16.3086 | 5 | 142.0 | 137.0 | 0.4468174 | 0.6643331 | 0.1760974 | 0 |
CAR_USE* | 15 | 8161 | 1.6288445 | 0.4831436 | 2 | 1.6610507 | 0.0000 | 1 | 2.0 | 1.0 | -0.5332937 | -1.7158080 | 0.0053482 | 0 |
BLUEBOOK | 16 | 8161 | 1283.6185516 | 893.5117428 | 1124 | 1259.5665492 | 1132.7064 | 1 | 2789.0 | 2788.0 | 0.2472837 | -1.3624655 | 9.8907352 | 0 |
TIF | 17 | 8161 | 5.3513050 | 4.1466353 | 4 | 4.8402512 | 4.4478 | 1 | 25.0 | 24.0 | 0.8908120 | 0.4224940 | 0.0459012 | 0 |
CAR_TYPE* | 18 | 8161 | 3.3405220 | 1.7553381 | 3 | 3.3107673 | 2.9652 | 1 | 6.0 | 5.0 | -0.0981926 | -1.4298002 | 0.0194307 | 0 |
RED_CAR* | 19 | 8161 | 1.2913859 | 0.4544287 | 1 | 1.2392403 | 0.0000 | 1 | 2.0 | 1.0 | 0.9180255 | -1.1573709 | 0.0050303 | 0 |
OLDCLAIM | 20 | 8161 | 552.2714128 | 862.2006829 | 1 | 380.3196508 | 0.0000 | 1 | 2857.0 | 2856.0 | 1.3085876 | 0.2461666 | 9.5441372 | 0 |
CLM_FREQ | 21 | 8161 | 0.7985541 | 1.1584527 | 0 | 0.5886047 | 0.0000 | 0 | 5.0 | 5.0 | 1.2087985 | 0.2842890 | 0.0128235 | 0 |
REVOKED* | 22 | 8161 | 1.1225340 | 0.3279216 | 1 | 1.0281820 | 0.0000 | 1 | 2.0 | 1.0 | 2.3018899 | 3.2991013 | 0.0036299 | 0 |
MVR_PTS | 23 | 8161 | 1.6955030 | 2.1471117 | 1 | 1.3138306 | 1.4826 | 0 | 13.0 | 13.0 | 1.3478403 | 1.3754900 | 0.0237675 | 0 |
CAR_AGE | 24 | 7651 | 8.3283231 | 5.7007424 | 8 | 7.9632413 | 7.4130 | -3 | 28.0 | 31.0 | 0.2819531 | -0.7489756 | 0.0651737 | 510 |
URBANICITY* | 25 | 8161 | 1.7954907 | 0.4033673 | 2 | 1.8693521 | 0.0000 | 1 | 2.0 | 1.0 | -1.4649406 | 0.1460688 | 0.0044651 | 0 |
The data consist of two response variables: TARGET_FLAG and TARGET_AMT. TARGET_FLAG is a binary variable indicating whether the car was involved in a crash; TARGET_AMT is the cost of the crash. The explanatory variables in this dataset include age, bluebook, car_age, car_type, clm_freq, car_use, sex, red_car, urbanicity and yoj.
The data consist of 8,161 observations and 26 variables, with multiple NA values that will have to be handled during data preparation.
Visual Exploration
Boxplots
The boxplots below show all of the variables in the dataset. This visualization helps show how the data are spread for each variable; some variables show far more spread than others.
INCOME and HOME_VAL appear to have the highest variance in the dataset.
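The boxplot panel was produced along these lines (condensed from the appendix; reshape2 and ggplot2 loaded here for self-containment):
library(reshape2)
library(ggplot2)
# melt() turns each numeric column into (variable, value) pairs,
# then facet_wrap draws one free-scaled boxplot per variable
ggplot(melt(train_insurance), aes(x = factor(variable), y = value)) +
  facet_wrap(~variable, scale = "free") +
  geom_boxplot()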
Histograms
Histograms are useful for showing how each variable in the dataset is distributed.
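The histogram panel follows the same pattern (again condensed from the appendix code):
# one free-scaled histogram per numeric variable
ggplot(melt(train_insurance), aes(x = value)) +
  facet_wrap(~variable, scale = "free") +
  geom_histogram(bins = 50)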
The table below shows the correlation of each numeric variable with the two response variables:
variable | TARGET_FLAG | TARGET_AMT |
---|---|---|
TARGET_FLAG | 1.0000000 | 0.5342461 |
TARGET_AMT | 0.5342461 | 1.0000000 |
KIDSDRIV | 0.1036683 | 0.0553942 |
AGE | NA | NA |
HOMEKIDS | 0.1156210 | 0.0619880 |
YOJ | NA | NA |
INCOME | -0.0338365 | -0.0084193 |
HOME_VAL | -0.1485715 | -0.0768246 |
TRAVTIME | 0.0483683 | 0.0279870 |
BLUEBOOK | 0.0504453 | 0.0235955 |
TIF | -0.0823700 | -0.0464808 |
OLDCLAIM | 0.1902875 | 0.0971478 |
CLM_FREQ | 0.2161961 | 0.1164192 |
MVR_PTS | 0.2191971 | 0.1378655 |
CAR_AGE | NA | NA |
Looking at the table above, MVR_PTS (total points on motor vehicle record) has the highest correlation with TARGET_FLAG (was in a car crash), which makes sense: you would expect someone with many points on their record to be in more accidents. HOME_VAL has the strongest negative correlation with TARGET_FLAG, meaning that those with higher home values are less likely to be in an accident.
As mentioned earlier, the data contain a substantial number of missing (NA) values.
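The NA rows in the correlation table (AGE, YOJ, CAR_AGE) are a symptom of this: cor() propagates missing values by default. A minimal sketch of a workaround, computing pairwise-complete correlations before any imputation:
num.cols <- sapply(train_insurance, is.numeric)
# use only pairwise-complete observations so variables with NAs still get a correlation
cor.data <- cor(train_insurance[, num.cols], use = "pairwise.complete.obs")
round(cor.data[, c("TARGET_FLAG", "TARGET_AMT")], 4)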
Data Preparation
Cleaning Data
The data were somewhat unstructured when loaded into R. For example, some variables were not classified sensibly: INCOME was not read as a numeric variable, so it had to be converted to numeric for this analysis.
Some of the data also carried extra characters, such as a ‘z_’ prefix before a value (for example, “z_F” for female). These prefixes were cleaned from the data with code such as:
SEX = as.factor(str_remove(SEX, "^z_")),
It also makes sense to convert certain categorical variables into dummy variables. For example, for the SEX category, male can be coded as 1 and female as 0 using simple if_else() statements, as sketched below.
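A minimal sketch of that 0/1 recoding, mirroring the appendix (where it is applied to the imputed data, imputed1, after the imputation step described next):
library(dplyr)
# each two-level factor becomes a numeric 0/1 indicator
imputed1 <- imputed1 %>%
  mutate(SEX     = if_else(SEX == "M", 1, 0),
         MSTATUS = if_else(MSTATUS == "Yes", 1, 0),
         REVOKED = if_else(REVOKED == "Yes", 1, 0))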
Imputation of Missing (NA) values
The data exploration revealed multiple variables with numerous NA values. There are several ways to handle missing data: deleting the observations, deleting the variables, imputing with the mean/median/mode, or imputing with a prediction.
Imputing with the mean/median/mode is an easy way to fill in the missing NAs; however, it reduces the variance in the dataset and shrinks standard errors, which can invalidate hypothesis tests.
In this case, missing data will be imputed via prediction using the mice (Multivariate Imputation by Chained Equations) package with its random forest method, as sketched below.
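The imputation step, as done in the appendix (train_ins is a copy of the training data):
library(mice)
init  <- mice(train_ins, maxit = 0)   # dry run to obtain the default predictor matrix
predM <- init$predictorMatrix
predM[, "TARGET_FLAG"] <- 0           # keep the response out of the imputation model
imputed  <- mice(train_ins, method = "rf", predictorMatrix = predM, m = 5)
imputed1 <- complete(imputed)         # extract a single completed dataset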
Output: the table below shows the results of the above data manipulation.
The NA values have been ‘filled in’ using mice’s random forest predictions. Variables with collinearity, as established by a stepwise VIF screen, have been dropped; a sketch of that screen follows.
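A sketch of the VIF screen, assuming the usdm package’s vifstep() was the tool used (the threshold of 10 is illustrative):
library(usdm)
# vifstep() iteratively drops the variable with the largest variance inflation
# factor until all remaining VIFs fall below the threshold
num_vars <- imputed1[, sapply(imputed1, is.numeric)]
vifstep(num_vars, th = 10)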
variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TARGET_FLAG | 1 | 8161 | 0.2638157 | 0.4407276 | 0 | 0.2047787 | 0.0000 | 0 | 1.0 | 1.0 | 1.0716614 | -0.8516462 | 0.0048786 |
TARGET_AMT | 2 | 8161 | 1504.3246481 | 4704.0269298 | 0 | 593.7121106 | 0.0000 | 0 | 107586.1 | 107586.1 | 8.7063034 | 112.2884386 | 52.0712628 |
KIDSDRIV | 3 | 8161 | 0.1710575 | 0.5115341 | 0 | 0.0252719 | 0.0000 | 0 | 4.0 | 4.0 | 3.3518374 | 11.7801916 | 0.0056624 |
AGE | 4 | 8161 | 44.7818895 | 8.6346725 | 45 | 44.8240159 | 8.8956 | 16 | 81.0 | 65.0 | -0.0303175 | -0.0636768 | 0.0955816 |
HOMEKIDS | 5 | 8161 | 0.7212351 | 1.1163233 | 0 | 0.4971665 | 0.0000 | 0 | 5.0 | 5.0 | 1.3411271 | 0.6489915 | 0.0123571 |
YOJ | 6 | 8161 | 10.5119471 | 4.0773629 | 11 | 11.0879155 | 2.9652 | 0 | 23.0 | 23.0 | -1.2232265 | 1.2313536 | 0.0451344 |
INCOME | 7 | 8161 | 2875.5505453 | 2090.6786785 | 2817 | 2816.9534385 | 2799.1488 | 1 | 6613.0 | 6612.0 | 0.1094699 | -1.2853032 | 23.1427840 |
PARENT1 | 8 | 8161 | 0.1319691 | 0.3384779 | 0 | 0.0399755 | 0.0000 | 0 | 1.0 | 1.0 | 2.1743561 | 2.7281589 | 0.0037468 |
HOME_VAL | 9 | 8161 | 1684.8931503 | 1697.3791897 | 1245 | 1516.4994639 | 1842.8718 | 1 | 5107.0 | 5106.0 | 0.5162324 | -1.1810965 | 18.7891522 |
MSTATUS | 10 | 8161 | 0.5996814 | 0.4899929 | 1 | 0.6245979 | 0.0000 | 0 | 1.0 | 1.0 | -0.4068189 | -1.8347231 | 0.0054240 |
SEX | 11 | 8161 | 0.4639137 | 0.4987266 | 0 | 0.4548936 | 0.0000 | 0 | 1.0 | 1.0 | 0.1446959 | -1.9793056 | 0.0055207 |
EDUCATION* | 12 | 8161 | 2.8120328 | 1.1786322 | 3 | 2.7785266 | 1.4826 | 1 | 5.0 | 4.0 | 0.1543452 | -0.8453783 | 0.0130469 |
JOB* | 13 | 8161 | 4.8337214 | 2.6238293 | 5 | 4.7636698 | 4.4478 | 1 | 9.0 | 8.0 | 0.1300643 | -1.4594539 | 0.0290445 |
TRAVTIME | 14 | 8161 | 33.4857248 | 15.9083334 | 33 | 32.9954051 | 16.3086 | 5 | 142.0 | 137.0 | 0.4468174 | 0.6643331 | 0.1760974 |
CAR_USE | 15 | 8161 | 0.6288445 | 0.4831436 | 1 | 0.6610507 | 0.0000 | 0 | 1.0 | 1.0 | -0.5332937 | -1.7158080 | 0.0053482 |
BLUEBOOK | 16 | 8161 | 1283.6185516 | 893.5117428 | 1124 | 1259.5665492 | 1132.7064 | 1 | 2789.0 | 2788.0 | 0.2472837 | -1.3624655 | 9.8907352 |
TIF | 17 | 8161 | 5.3513050 | 4.1466353 | 4 | 4.8402512 | 4.4478 | 1 | 25.0 | 24.0 | 0.8908120 | 0.4224940 | 0.0459012 |
CAR_TYPE* | 18 | 8161 | 3.3405220 | 1.7553381 | 3 | 3.3107673 | 2.9652 | 1 | 6.0 | 5.0 | -0.0981926 | -1.4298002 | 0.0194307 |
RED_CAR | 19 | 8161 | 0.2913859 | 0.4544287 | 0 | 0.2392403 | 0.0000 | 0 | 1.0 | 1.0 | 0.9180255 | -1.1573709 | 0.0050303 |
OLDCLAIM | 20 | 8161 | 552.2714128 | 862.2006829 | 1 | 380.3196508 | 0.0000 | 1 | 2857.0 | 2856.0 | 1.3085876 | 0.2461666 | 9.5441372 |
CLM_FREQ | 21 | 8161 | 0.7985541 | 1.1584527 | 0 | 0.5886047 | 0.0000 | 0 | 5.0 | 5.0 | 1.2087985 | 0.2842890 | 0.0128235 |
REVOKED | 22 | 8161 | 0.1225340 | 0.3279216 | 0 | 0.0281820 | 0.0000 | 0 | 1.0 | 1.0 | 2.3018899 | 3.2991013 | 0.0036299 |
MVR_PTS | 23 | 8161 | 1.6955030 | 2.1471117 | 1 | 1.3138306 | 1.4826 | 0 | 13.0 | 13.0 | 1.3478403 | 1.3754900 | 0.0237675 |
CAR_AGE | 24 | 8161 | 8.2220316 | 5.7255627 | 8 | 7.8367284 | 7.4130 | -3 | 28.0 | 31.0 | 0.2950526 | -0.7681362 | 0.0633792 |
URBANICITY* | 25 | 8161 | 1.7954907 | 0.4033673 | 2 | 1.8693521 | 0.0000 | 1 | 2.0 | 1.0 | -1.4649406 | 0.1460688 | 0.0044651 |
NOHOMEKIDS | 26 | 8161 | 0.6480823 | 0.4775977 | 1 | 0.6850973 | 0.0000 | 0 | 1.0 | 1.0 | -0.6200373 | -1.6157517 | 0.0052868 |
NOKIDSDRIV | 27 | 8161 | 0.8797941 | 0.3252220 | 1 | 0.9747281 | 0.0000 | 0 | 1.0 | 1.0 | -2.3353129 | 3.4541097 | 0.0036000 |
HASCOLLEGE | 28 | 8161 | 0.5670874 | 0.4955092 | 1 | 0.5838566 | 0.0000 | 0 | 1.0 | 1.0 | -0.2707483 | -1.9269314 | 0.0054850 |
ISPROFESSIONAL | 29 | 8161 | 0.3903933 | 0.4878684 | 0 | 0.3629959 | 0.0000 | 0 | 1.0 | 1.0 | 0.4492738 | -1.7983734 | 0.0054005 |
Build Models
Throughout this section, various models will be created to determine which gives the best fit for predicting whether someone will be in an accident. Different methods of model creation will be used, as discussed below.
Linear Regression
Model 1: All Variables
This model uses all variables remaining after data preparation. Once the data have been manipulated (imputed, etc., as described above), all of the variables are tested to establish a base model. This shows which variables are significant in the dataset and informs the other models built from it.
##
## Call:
## lm(formula = TARGET_AMT ~ ., data = imputed1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5978 -417 -44 206 100891
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.558e+02 5.915e+02 0.263 0.79225
## TARGET_FLAG 5.686e+03 1.134e+02 50.130 < 2e-16 ***
## KIDSDRIV -3.065e+02 2.099e+02 -1.460 0.14429
## AGE 9.539e+00 6.306e+00 1.513 0.13036
## HOMEKIDS 3.658e+01 8.635e+01 0.424 0.67188
## YOJ 8.731e+00 1.321e+01 0.661 0.50878
## INCOME 2.138e-03 2.371e-02 0.090 0.92814
## PARENT1 1.361e+02 1.898e+02 0.717 0.47342
## HOME_VAL -3.265e-03 3.072e-02 -0.106 0.91537
## MSTATUS -1.281e+02 1.209e+02 -1.059 0.28949
## SEX 7.066e+01 1.500e+02 0.471 0.63758
## EDUCATIONBachelors 3.377e+01 1.781e+02 0.190 0.84968
## EDUCATIONHigh School -1.401e+02 1.504e+02 -0.931 0.35173
## EDUCATIONMasters 1.654e+02 2.600e+02 0.636 0.52479
## EDUCATIONPhD 3.546e+02 2.993e+02 1.185 0.23613
## JOBBlue Collar 7.733e+01 2.808e+02 0.275 0.78303
## JOBClerical 2.553e+01 2.955e+02 0.086 0.93116
## JOBDoctor -2.422e+02 3.572e+02 -0.678 0.49784
## JOBHome Maker -2.765e+01 3.075e+02 -0.090 0.92836
## JOBLawyer 1.037e+02 2.580e+02 0.402 0.68767
## JOBManager -9.434e+01 2.519e+02 -0.375 0.70802
## JOBProfessional 2.074e+02 2.694e+02 0.770 0.44138
## JOBStudent -1.202e+02 3.206e+02 -0.375 0.70765
## TRAVTIME 5.192e-01 2.826e+00 0.184 0.85422
## CAR_USE -1.169e+02 1.444e+02 -0.809 0.41831
## BLUEBOOK -1.533e-02 5.217e-02 -0.294 0.76882
## TIF -3.149e+00 1.068e+01 -0.295 0.76824
## CAR_TYPEPanel Truck 3.567e+02 2.221e+02 1.606 0.10828
## CAR_TYPEPickup -8.538e+01 1.516e+02 -0.563 0.57332
## CAR_TYPESports Car -2.313e+01 1.809e+02 -0.128 0.89828
## CAR_TYPESUV -5.793e+01 1.455e+02 -0.398 0.69045
## CAR_TYPEVan 2.621e+02 1.815e+02 1.444 0.14891
## RED_CAR -4.005e+01 1.303e+02 -0.307 0.75853
## OLDCLAIM -9.852e-02 7.314e-02 -1.347 0.17802
## CLM_FREQ 9.030e+00 5.513e+01 0.164 0.86989
## REVOKED -3.099e+02 1.369e+02 -2.264 0.02359 *
## MVR_PTS 6.043e+01 2.298e+01 2.630 0.00856 **
## CAR_AGE -2.110e+01 1.088e+01 -1.939 0.05253 .
## URBANICITYHighly Urban/ Urban -1.502e+01 1.267e+02 -0.119 0.90567
## NOHOMEKIDS -4.506e+00 2.233e+02 -0.020 0.98390
## NOKIDSDRIV -4.855e+02 3.346e+02 -1.451 0.14685
## HASCOLLEGE NA NA NA NA
## ISPROFESSIONAL NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3973 on 8120 degrees of freedom
## Multiple R-squared: 0.2901, Adjusted R-squared: 0.2866
## F-statistic: 82.96 on 40 and 8120 DF, p-value: < 2.2e-16
This model shows an adjusted R2 of 0.287 and an F-statistic of 82.96 with a very small p-value.
Model 2 - Forward Selection
Variables are added one by one to find the best-fitting model. After each variable is added, the model is re-run until the most optimal output (R2, F-statistic) is produced. Only the final output is shown.
##
## Call:
## lm(formula = imputed1$TARGET_AMT ~ +imputed1$AGE + imputed1$EDUCATION +
## imputed1$REVOKED + imputed1$MVR_PTS + imputed1$JOB + imputed1$YOJ +
## imputed1$CLM_FREQ + imputed1$HOME_VAL + imputed1$URBANICITY +
## imputed1$PARENT1 + imputed1$MSTATUS + imputed1$TRAVTIME +
## imputed1$BLUEBOOK, data = imputed1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5489 -1689 -766 210 104861
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -5.686e+02 5.051e+02 -1.126
## imputed1$AGE 6.420e+00 6.479e+00 0.991
## imputed1$EDUCATIONBachelors -2.497e+02 1.846e+02 -1.353
## imputed1$EDUCATIONHigh School 4.803e+01 1.645e+02 0.292
## imputed1$EDUCATIONMasters -1.665e+02 2.641e+02 -0.630
## imputed1$EDUCATIONPhD -1.373e-01 3.145e+02 0.000
## imputed1$REVOKED 5.172e+02 1.555e+02 3.325
## imputed1$MVR_PTS 1.899e+02 2.590e+01 7.333
## imputed1$JOBBlue Collar 4.927e+02 3.095e+02 1.592
## imputed1$JOBClerical 1.434e+02 3.255e+02 0.440
## imputed1$JOBDoctor -1.241e+03 3.868e+02 -3.208
## imputed1$JOBHome Maker 5.840e+01 3.260e+02 0.179
## imputed1$JOBLawyer -4.729e+02 2.663e+02 -1.776
## imputed1$JOBManager -9.751e+02 2.739e+02 -3.561
## imputed1$JOBProfessional -4.053e+01 2.956e+02 -0.137
## imputed1$JOBStudent 2.414e+02 3.545e+02 0.681
## imputed1$YOJ -7.945e+00 1.459e+01 -0.545
## imputed1$CLM_FREQ 1.433e+02 4.891e+01 2.930
## imputed1$HOME_VAL -6.350e-02 3.515e-02 -1.806
## imputed1$URBANICITYHighly Urban/ Urban 1.582e+03 1.399e+02 11.307
## imputed1$PARENT1 8.578e+02 1.801e+02 4.762
## imputed1$MSTATUS -4.256e+02 1.299e+02 -3.277
## imputed1$TRAVTIME 1.217e+01 3.238e+00 3.757
## imputed1$BLUEBOOK 7.166e-02 5.753e-02 1.246
## Pr(>|t|)
## (Intercept) 0.260359
## imputed1$AGE 0.321784
## imputed1$EDUCATIONBachelors 0.176153
## imputed1$EDUCATIONHigh School 0.770339
## imputed1$EDUCATIONMasters 0.528567
## imputed1$EDUCATIONPhD 0.999652
## imputed1$REVOKED 0.000887 ***
## imputed1$MVR_PTS 2.46e-13 ***
## imputed1$JOBBlue Collar 0.111381
## imputed1$JOBClerical 0.659676
## imputed1$JOBDoctor 0.001344 **
## imputed1$JOBHome Maker 0.857814
## imputed1$JOBLawyer 0.075753 .
## imputed1$JOBManager 0.000372 ***
## imputed1$JOBProfessional 0.890922
## imputed1$JOBStudent 0.495847
## imputed1$YOJ 0.586023
## imputed1$CLM_FREQ 0.003402 **
## imputed1$HOME_VAL 0.070891 .
## imputed1$URBANICITYHighly Urban/ Urban < 2e-16 ***
## imputed1$PARENT1 1.95e-06 ***
## imputed1$MSTATUS 0.001055 **
## imputed1$TRAVTIME 0.000173 ***
## imputed1$BLUEBOOK 0.212884
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4571 on 8137 degrees of freedom
## Multiple R-squared: 0.05851, Adjusted R-squared: 0.05585
## F-statistic: 21.99 on 23 and 8137 DF, p-value: < 2.2e-16
This model shows an adjusted R2 of 0.056 and an F-statistic of 21.99 with a very small p-value.
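For reference, the manual forward search can be approximated with AIC-based forward selection via step(); this is a sketch of an automated alternative, not the exact procedure used above. The derived flags HASCOLLEGE and ISPROFESSIONAL are excluded because they are aliased with EDUCATION and JOB:
# start from the intercept-only model and let step() add terms by AIC
null_model <- lm(TARGET_AMT ~ 1, data = imputed1)
full_model <- lm(TARGET_AMT ~ . - HASCOLLEGE - ISPROFESSIONAL, data = imputed1)
forward_fit <- step(null_model, scope = formula(full_model),
                    direction = "forward", trace = FALSE)
summary(forward_fit)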
Model 3 - Leaps Package
The leaps package is a “regression subset selection” tool: it automatically generates all possible models and is used to find the “best” one.
Here, leaps evaluates the “best” model using adjusted R2 and Mallows’ Cp; the search is sketched below.
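The subset search itself, condensed from the appendix code:
library(leaps)
best.subset <- regsubsets(TARGET_AMT ~ ., data = imputed1, nvmax = 5)
bss <- summary(best.subset)
bss$outmat             # which variables enter each model size
which.max(bss$adjr2)   # best size by adjusted R2
which.min(bss$cp)      # best size by Mallows' Cp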
##
## Call:
## lm(formula = imputed1$TARGET_AMT ~ +REVOKED + MVR_PTS + imputed1$CAR_TYPE +
## CAR_AGE + SEX + imputed1$TRAVTIME + imputed1$JOB + imputed1$URBANICITY +
## imputed1$MSTATUS + imputed1$CAR_USE, data = imputed1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5342 -1693 -792 288 104059
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -393.964 366.798 -1.074
## REVOKED 514.222 155.204 3.313
## MVR_PTS 210.279 24.081 8.732
## imputed1$CAR_TYPEPanel Truck 464.730 247.741 1.876
## imputed1$CAR_TYPEPickup 394.119 168.470 2.339
## imputed1$CAR_TYPESports Car 972.216 203.930 4.767
## imputed1$CAR_TYPESUV 699.868 164.985 4.242
## imputed1$CAR_TYPEVan 567.103 207.133 2.738
## CAR_AGE -28.256 10.859 -2.602
## SEX 224.400 146.141 1.536
## imputed1$TRAVTIME 11.811 3.228 3.659
## imputed1$JOBBlue Collar 578.898 261.861 2.211
## imputed1$JOBClerical 716.828 280.004 2.560
## imputed1$JOBDoctor -439.746 376.224 -1.169
## imputed1$JOBHome Maker 635.220 309.202 2.054
## imputed1$JOBLawyer 245.971 286.522 0.858
## imputed1$JOBManager -548.828 266.887 -2.056
## imputed1$JOBProfessional 330.257 265.251 1.245
## imputed1$JOBStudent 676.833 297.923 2.272
## imputed1$URBANICITYHighly Urban/ Urban 1694.854 136.497 12.417
## imputed1$MSTATUS -802.042 103.469 -7.752
## imputed1$CAR_USE -729.892 157.091 -4.646
## Pr(>|t|)
## (Intercept) 0.282827
## REVOKED 0.000926 ***
## MVR_PTS < 2e-16 ***
## imputed1$CAR_TYPEPanel Truck 0.060709 .
## imputed1$CAR_TYPEPickup 0.019339 *
## imputed1$CAR_TYPESports Car 1.90e-06 ***
## imputed1$CAR_TYPESUV 2.24e-05 ***
## imputed1$CAR_TYPEVan 0.006197 **
## CAR_AGE 0.009281 **
## SEX 0.124699
## imputed1$TRAVTIME 0.000254 ***
## imputed1$JOBBlue Collar 0.027084 *
## imputed1$JOBClerical 0.010483 *
## imputed1$JOBDoctor 0.242502
## imputed1$JOBHome Maker 0.039970 *
## imputed1$JOBLawyer 0.390659
## imputed1$JOBManager 0.039776 *
## imputed1$JOBProfessional 0.213141
## imputed1$JOBStudent 0.023122 *
## imputed1$URBANICITYHighly Urban/ Urban < 2e-16 ***
## imputed1$MSTATUS 1.02e-14 ***
## imputed1$CAR_USE 3.43e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4562 on 8139 degrees of freedom
## Multiple R-squared: 0.06199, Adjusted R-squared: 0.05956
## F-statistic: 25.61 on 21 and 8139 DF, p-value: < 2.2e-16
This model shows an adjusted R2 of 0.060 and an F-statistic of 25.61 with a very small p-value.
Binary Logistic Regression
Model 1 - Base Model: All variables
All of the variables will be tested to establish a base model, showing which variables are significant and informing the models that follow. Note that this base model includes TARGET_AMT as a predictor, and TARGET_AMT is greater than zero exactly when TARGET_FLAG is 1, so the two classes are perfectly separated; this explains the convergence warnings, the near-zero residual deviance, and the McFadden R2 of 1 in the output below.
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Call:
## glm(formula = TARGET_FLAG ~ ., family = "binomial", data = imputed1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.003710 0.000000 0.000000 0.000000 0.005352
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.534e+02 2.189e+04 -0.016 0.987
## TARGET_AMT 3.925e-01 2.744e+00 0.143 0.886
## KIDSDRIV 1.484e+01 1.500e+03 0.010 0.992
## AGE -7.056e-01 2.576e+01 -0.027 0.978
## HOMEKIDS 2.243e+00 6.711e+02 0.003 0.997
## YOJ -2.420e-01 5.761e+01 -0.004 0.997
## INCOME -1.731e-04 1.642e-01 -0.001 0.999
## PARENT1 -3.865e+00 3.642e+03 -0.001 0.999
## HOME_VAL 1.651e-03 1.304e-01 0.013 0.990
## MSTATUS 1.874e+00 4.472e+02 0.004 0.997
## SEX 7.708e+00 1.673e+03 0.005 0.996
## EDUCATIONBachelors -9.223e+00 1.735e+03 -0.005 0.996
## EDUCATIONHigh School 4.761e+00 4.609e+02 0.010 0.992
## EDUCATIONMasters -2.636e+00 3.731e+03 -0.001 0.999
## EDUCATIONPhD 1.090e+01 1.688e+04 0.001 0.999
## JOBBlue Collar 2.930e+02 2.154e+04 0.014 0.989
## JOBClerical 2.935e+02 2.154e+04 0.014 0.989
## JOBDoctor -5.549e+01 9.203e+05 0.000 1.000
## JOBHome Maker 2.811e+02 2.582e+04 0.011 0.991
## JOBLawyer 2.196e+02 2.511e+04 0.009 0.993
## JOBManager 2.770e+02 2.737e+04 0.010 0.992
## JOBProfessional 2.763e+02 4.388e+04 0.006 0.995
## JOBStudent 2.913e+02 2.150e+04 0.014 0.989
## TRAVTIME -3.357e-02 1.323e+01 -0.003 0.998
## CAR_USE -4.805e+00 5.301e+02 -0.009 0.993
## BLUEBOOK 4.589e-03 2.900e-01 0.016 0.987
## TIF -5.620e-01 7.309e+01 -0.008 0.994
## CAR_TYPEPanel Truck -1.304e+02 4.253e+04 -0.003 0.998
## CAR_TYPEPickup 2.878e+00 6.093e+02 0.005 0.996
## CAR_TYPESports Car 8.145e+00 2.046e+03 0.004 0.997
## CAR_TYPESUV -1.904e+00 1.703e+03 -0.001 0.999
## CAR_TYPEVan -1.698e+01 3.361e+03 -0.005 0.996
## RED_CAR -7.328e-01 5.167e+02 -0.001 0.999
## OLDCLAIM -1.463e-04 2.365e-01 -0.001 1.000
## CLM_FREQ 3.775e+00 2.146e+02 0.018 0.986
## REVOKED 9.272e+00 5.920e+02 0.016 0.988
## MVR_PTS -1.159e+00 9.130e+01 -0.013 0.990
## CAR_AGE 3.482e-01 5.490e+01 0.006 0.995
## URBANICITYHighly Urban/ Urban 8.681e+00 6.175e+02 0.014 0.989
## NOHOMEKIDS 1.583e+01 1.807e+03 0.009 0.993
## NOKIDSDRIV 2.283e+01 4.079e+03 0.006 0.996
## HASCOLLEGE NA NA NA NA
## ISPROFESSIONAL NA NA NA NA
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9.4180e+03 on 8160 degrees of freedom
## Residual deviance: 1.2291e-04 on 8120 degrees of freedom
## AIC: 82
##
## Number of Fisher Scoring iterations: 25
## llh llhNull G2 McFadden r2ML
## -6.145302e-05 -4.708981e+03 9.417962e+03 1.000000e+00 6.846337e-01
## r2CU
## 1.000000e+00
Model 2 - Backwards Elimination
Variables are removed one by one to find the best-fitting model. After each variable is removed, the model is re-run until the most optimal output is produced. An automated sketch of the same idea appears below, followed by the final output.
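A sketch of an automated analogue using AIC-based step(); this is not the exact manual procedure used here. TARGET_AMT is dropped to avoid the separation problem noted above, and the aliased derived flags are excluded:
full_logit <- glm(TARGET_FLAG ~ . - TARGET_AMT - HASCOLLEGE - ISPROFESSIONAL,
                  family = "binomial", data = imputed1)
# step() repeatedly drops the term whose removal most improves AIC
backward_fit <- step(full_logit, direction = "backward", trace = FALSE)
summary(backward_fit)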
##
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + YOJ + PARENT1 +
## HOME_VAL + MSTATUS + SEX + TRAVTIME + CAR_USE + TIF + OLDCLAIM +
## CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY, family = "binomial",
## data = dplyr::select(imputed1, -TARGET_AMT))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4154 -0.7469 -0.4494 0.7210 2.9518
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.015e+00 1.609e-01 -12.521 < 2e-16 ***
## KIDSDRIV 3.110e-01 5.827e-02 5.338 9.42e-08 ***
## HOMEKIDS 1.273e-01 3.305e-02 3.851 0.000117 ***
## YOJ -4.336e-02 7.021e-03 -6.175 6.60e-10 ***
## PARENT1 3.421e-01 1.057e-01 3.237 0.001206 **
## HOME_VAL -1.394e-04 1.900e-05 -7.335 2.21e-13 ***
## MSTATUS -3.601e-01 7.466e-02 -4.824 1.41e-06 ***
## SEX -2.879e-01 6.010e-02 -4.790 1.66e-06 ***
## TRAVTIME 1.474e-02 1.829e-03 8.062 7.48e-16 ***
## CAR_USE -7.682e-01 6.026e-02 -12.748 < 2e-16 ***
## TIF -5.104e-02 7.132e-03 -7.156 8.30e-13 ***
## OLDCLAIM 9.888e-05 4.149e-05 2.383 0.017171 *
## CLM_FREQ 1.225e-01 3.129e-02 3.913 9.10e-05 ***
## REVOKED 7.620e-01 7.790e-02 9.782 < 2e-16 ***
## MVR_PTS 1.135e-01 1.336e-02 8.502 < 2e-16 ***
## CAR_AGE -4.178e-02 5.196e-03 -8.040 8.98e-16 ***
## URBANICITYHighly Urban/ Urban 2.111e+00 1.105e-01 19.106 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7658.3 on 8144 degrees of freedom
## AIC: 7692.3
##
## Number of Fisher Scoring iterations: 5
## llh llhNull G2 McFadden r2ML
## -3829.1421401 -4708.9811460 1759.6780119 0.1868428 0.1939588
## r2CU
## 0.2833030
Model 3 - glmulti Package
The glmulti package is an “automated model selection and model averaging” tool. The package automatically generates all possible models “with the specified response and explanatory variables” and is used to find the “best” one.
The exhaustive approach was not practical for this dataset, as it was still running after two hours, so the genetic algorithm method was used instead (method = “g” in the appendix code).
After running the package, I entered the “best” model manually in R so as not to re-run the search (10+ minutes) each time.
glmulti - all data including transformations
##
## Call:
## glm(formula = imputed1$TARGET_FLAG ~ 1 + PARENT1 + MSTATUS +
## SEX + EDUCATION + JOB + CAR_USE + CAR_TYPE + REVOKED + URBANICITY +
## KIDSDRIV + AGE + YOJ + HOME_VAL + TRAVTIME + TIF + OLDCLAIM +
## CLM_FREQ + MVR_PTS, data = imputed1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9825 -0.2814 -0.1104 0.2880 1.2269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.978e-04 4.639e-02 -0.022 0.982839
## PARENT1 8.504e-02 1.577e-02 5.391 7.18e-08 ***
## MSTATUS -6.891e-02 1.117e-02 -6.167 7.29e-10 ***
## SEX 3.759e-02 1.254e-02 2.997 0.002736 **
## EDUCATIONBachelors -7.200e-02 1.632e-02 -4.412 1.04e-05 ***
## EDUCATIONHigh School 2.204e-05 1.461e-02 0.002 0.998797
## EDUCATIONMasters -5.591e-02 2.288e-02 -2.444 0.014562 *
## EDUCATIONPhD -6.488e-02 2.720e-02 -2.385 0.017086 *
## JOBBlue Collar 8.642e-02 2.745e-02 3.149 0.001645 **
## JOBClerical 1.103e-01 2.884e-02 3.825 0.000132 ***
## JOBDoctor -3.505e-02 3.495e-02 -1.003 0.315944
## JOBHome Maker 1.177e-01 2.986e-02 3.942 8.15e-05 ***
## JOBLawyer 3.552e-02 2.523e-02 1.408 0.159243
## JOBManager -5.822e-02 2.463e-02 -2.363 0.018140 *
## JOBProfessional 5.529e-02 2.634e-02 2.099 0.035840 *
## JOBStudent 1.077e-01 3.108e-02 3.467 0.000529 ***
## CAR_USE -1.198e-01 1.406e-02 -8.517 < 2e-16 ***
## CAR_TYPEPanel Truck 1.076e-02 2.136e-02 0.504 0.614312
## CAR_TYPEPickup 7.856e-02 1.449e-02 5.422 6.07e-08 ***
## CAR_TYPESports Car 1.693e-01 1.748e-02 9.684 < 2e-16 ***
## CAR_TYPESUV 1.295e-01 1.411e-02 9.174 < 2e-16 ***
## CAR_TYPEVan 5.330e-02 1.774e-02 3.005 0.002667 **
## REVOKED 1.314e-01 1.331e-02 9.875 < 2e-16 ***
## URBANICITYHighly Urban/ Urban 2.952e-01 1.196e-02 24.689 < 2e-16 ***
## KIDSDRIV 6.636e-02 8.742e-03 7.591 3.54e-14 ***
## AGE -7.909e-04 5.531e-04 -1.430 0.152788
## YOJ -3.153e-03 1.243e-03 -2.537 0.011208 *
## HOME_VAL -1.143e-05 2.999e-06 -3.811 0.000139 ***
## TRAVTIME 2.020e-03 2.755e-04 7.332 2.49e-13 ***
## TIF -7.827e-03 1.042e-03 -7.515 6.29e-14 ***
## OLDCLAIM 1.264e-05 7.153e-06 1.766 0.077376 .
## CLM_FREQ 1.936e-02 5.388e-03 3.592 0.000330 ***
## MVR_PTS 2.033e-02 2.235e-03 9.094 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1511599)
##
## Null deviance: 1585.0 on 8160 degrees of freedom
## Residual deviance: 1228.6 on 8128 degrees of freedom
## AIC: 7775.3
##
## Number of Fisher Scoring iterations: 2
## llh llhNull G2 McFadden r2ML
## -3853.6578451 -4892.9184843 2078.5212784 0.2124010 0.2248429
## r2CU
## 0.3218783
Model Selection
Based on the above models, I decided to use the model provided by the glmulti package; its AIC and residual deviance gave the best values for prediction. One caveat: the refit shown above was made with glm’s default gaussian family (note the dispersion line in its output), so it is effectively a linear probability model whose fitted values are treated as probabilities; this is why some predictions later fall slightly outside the 0 to 1 range.
Evaluating the Model
Before running the evaluation data through the model, I split the training data 80/20. This allows a better check of the model’s accuracy, since the held-out ‘target’ values can be compared against the model’s predictions.
After splitting the data into two new sets (training and testing), I created an ROC graph to help determine the threshold to use in the model.
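The curve was generated along these lines (ROCR package, condensed from the appendix):
library(ROCR)
res <- predict(modelglmulti, newdata = training, type = "response")
ROCRPred <- prediction(res, training$TARGET_FLAG)
ROCRPerf <- performance(ROCRPred, "tpr", "fpr")
# color the curve by cutoff and print candidate thresholds along it
plot(ROCRPerf, colorize = TRUE, print.cutoffs.at = seq(0.1, 1, by = 0.1))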
Looking at the graph, the 0.4 threshold seems to be the most suitable choice for my testing: it gives roughly a 0.6 true positive rate with only a ~0.2 false positive rate. Thresholds of 0.3 and 0.2 give slightly higher TP rates but also much higher FP rates, which, in my opinion, is not worth the small TP gain; 0.5 gives a slightly lower FP rate but a significantly lower TP rate.
## PredictedValue
## ActualValue FALSE TRUE
## 0 4026 703
## 1 741 1003
## [1] 0.774
After testing the model, the accuracy is around 0.774, which seems like a decent fit.
The misclassification rate is about 0.2246.
The true positive rate is about 0.5728.
The false positive rate is about 0.1514.
The specificity is about 0.8488.
The precision is about 0.5775.
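For reference, a sketch of how these rates fall out of the confusion matrix at the chosen cutoff (object names follow the appendix; the 0.4 cutoff follows the discussion above):
res_test <- predict(modelglmulti, newdata = testing, type = "response")
# actual classes in rows, predicted classes in columns
cm <- table(ActualValue = testing$TARGET_FLAG, PredictedValue = res_test > 0.4)
accuracy    <- sum(diag(cm)) / sum(cm)          # (TN + TP) / total
misclass    <- 1 - accuracy
tpr         <- cm["1", "TRUE"] / sum(cm["1", ]) # sensitivity
fpr         <- cm["0", "TRUE"] / sum(cm["0", ])
specificity <- 1 - fpr
precision   <- cm["1", "TRUE"] / sum(cm[, "TRUE"])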
Testing the Evaluation Data with the glmulti Model
## predict12
## 0 1
## 1550 591
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2808 0.1296 0.2822 0.2776 0.4121 0.9482
This model predicts that 591 insurance customers will have an auto accident, while 1,550 will not.
References
All subset regression with leaps, bestglm, glmulti, and meifly. (n.d.). Retrieved from https://rstudio-pubs-static.s3.amazonaws.com/2897_9220b21cfc0c43a396ff9abf122bb351.html
Model selection and multimodel inference made easy. (n.d.). Retrieved from https://cran.r-project.org/web/packages/glmulti/glmulti.pdf
Best subset model selection with R. (n.d.). Retrieved from http://jadianes.me/best-subset-model-selection-with-R
Appendix
train_insurance <- read.csv("https://raw.githubusercontent.com/nschettini/CUNY-MSDS-DATA-621/master/insurance_training_data.csv") %>%
  dplyr::select(-INDEX) %>%
  mutate(INCOME = as.numeric(INCOME),
         HOME_VAL = as.numeric(HOME_VAL),
         BLUEBOOK = as.numeric(BLUEBOOK),
         OLDCLAIM = as.numeric(OLDCLAIM),
         MSTATUS = as.factor(str_remove(MSTATUS, "z_")),
         SEX = as.factor(str_remove(SEX, "z_")),
         EDUCATION = as.factor(str_remove(EDUCATION, "z_")),
         JOB = as.factor(str_remove(JOB, "z_")),
         CAR_TYPE = as.factor(str_remove(CAR_TYPE, "z_")),
         URBANICITY = as.factor(str_remove(URBANICITY, "z_")))

eval_data <- read.csv("https://raw.githubusercontent.com/nschettini/CUNY-MSDS-DATA-621/master/insurance-evaluation-data.csv") %>%
  dplyr::select(-INDEX) %>%
  mutate(INCOME = as.numeric(INCOME),
         HOME_VAL = as.numeric(HOME_VAL),
         BLUEBOOK = as.numeric(BLUEBOOK),
         OLDCLAIM = as.numeric(OLDCLAIM),
         MSTATUS = as.factor(str_remove(MSTATUS, "z_")),
         SEX = as.factor(str_remove(SEX, "z_")),
         EDUCATION = as.factor(str_remove(EDUCATION, "z_")),
         JOB = as.factor(str_remove(JOB, "z_")),
         CAR_TYPE = as.factor(str_remove(CAR_TYPE, "z_")),
         URBANICITY = as.factor(str_remove(URBANICITY, "z_")))
insurance_desc <- describe(train_insurance)
insurance_desc$na_count <- sapply(train_insurance, function(y) sum(is.na(y)))

kable(insurance_desc, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")
ggplot(melt(train_insurance), aes(x = factor(variable), y = value)) +
  facet_wrap(~variable, scale = "free") +
  geom_boxplot()

ggplot1 <- train_insurance[, -c(1, 2)]

ggplot(melt(ggplot1), aes(x = factor(variable), y = value)) +
  geom_boxplot() +
  stat_summary(fun.y = mean, color = "blue", geom = "point") +
  stat_summary(fun.y = median, color = "red", geom = "point") +
  coord_flip() +
  theme_bw()

ggplot(melt(train_insurance), aes(x = value)) +
  facet_wrap(~variable, scale = "free") +
  geom_histogram(bins = 50)

num.cols <- sapply(train_insurance, is.numeric)
cor.data <- cor(train_insurance[, num.cols])

kable(cor.data[, 1:2], "html", escape = F) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1, bold = T) %>%
  scroll_box(height = "500px")

corrgram(drop_na(train_insurance), order = TRUE, upper.panel = panel.cor, main = "Insurance")

library(Amelia)
missmap(train_insurance, main = "Missing values vs observed")
train_ins <- train_insurance

init <- mice(train_ins, maxit = 0)
meth <- init$method
predM <- init$predictorMatrix

predM[, c("TARGET_FLAG")] <- 0

imputed <- mice(train_ins, method = "rf", predictorMatrix = predM, m = 5)

imputed1 <- complete(imputed)

imputed1 <- imputed1 %>%
  mutate(PARENT1 = if_else(PARENT1 == "Yes", 1, 0)) %>%
  mutate(MSTATUS = if_else(MSTATUS == "Yes", 1, 0)) %>%
  mutate(SEX = if_else(SEX == "M", 1, 0)) %>%
  mutate(CAR_USE = if_else(CAR_USE == "Private", 1, 0)) %>%
  mutate(RED_CAR = if_else(RED_CAR == "yes", 1, 0)) %>%
  mutate(REVOKED = if_else(REVOKED == "Yes", 1, 0))

imputed1 <- imputed1 %>%
  mutate(NOHOMEKIDS = as.integer(HOMEKIDS == 0),
         NOKIDSDRIV = as.integer(KIDSDRIV == 0),
         HASCOLLEGE = as.integer(EDUCATION %in% c("Bachelors", "Masters", "PhD")),
         ISPROFESSIONAL = as.integer(JOB %in% c("Doctor", "Lawyer", "Manager", "Professional")))
eval_ins <- eval_data

init <- mice(eval_ins, maxit = 0)
meth <- init$method
predM <- init$predictorMatrix
predM[, c("TARGET_FLAG")] <- 0

imputed2 <- mice(eval_ins, method = "rf", predictorMatrix = predM, m = 5)
imputed3 <- complete(imputed2)

imputed3 <- imputed3 %>%
  mutate(PARENT1 = if_else(PARENT1 == "Yes", 1, 0)) %>%
  mutate(MSTATUS = if_else(MSTATUS == "Yes", 1, 0)) %>%
  mutate(SEX = if_else(SEX == "M", 1, 0)) %>%
  mutate(CAR_USE = if_else(CAR_USE == "Private", 1, 0)) %>%
  mutate(RED_CAR = if_else(RED_CAR == "yes", 1, 0)) %>%
  mutate(REVOKED = if_else(REVOKED == "Yes", 1, 0))

imputed3 <- imputed3 %>%
  mutate(NOHOMEKIDS = as.integer(HOMEKIDS == 0),
         NOKIDSDRIV = as.integer(KIDSDRIV == 0),
         HASCOLLEGE = as.integer(EDUCATION %in% c("Bachelors", "Masters", "PhD")),
         ISPROFESSIONAL = as.integer(JOB %in% c("Doctor", "Lawyer", "Manager", "Professional")))

imputed3$TARGET_FLAG <- as.numeric(eval_data$TARGET_FLAG)
imputed3$TARGET_AMT <- as.numeric(eval_data$TARGET_AMT)
imputedtable <- describe(imputed1)

kable(imputedtable, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")
model1 <- lm(TARGET_AMT ~ ., imputed1)

summary(model1)
summodel1 <- summary(model1)

model2 <- lm(imputed1$TARGET_AMT ~ imputed1$AGE + imputed1$EDUCATION + imputed1$REVOKED +
               imputed1$MVR_PTS + imputed1$JOB + imputed1$YOJ + imputed1$CLM_FREQ +
               imputed1$HOME_VAL + imputed1$URBANICITY + imputed1$PARENT1 + imputed1$MSTATUS +
               imputed1$TRAVTIME + imputed1$BLUEBOOK, data = imputed1)

summary(model2)

best.subset <- regsubsets(imputed1$TARGET_AMT ~ ., imputed1, nvmax = 5)
best.subset.summary <- summary(best.subset)
best.subset.summary$outmat

best.subset.by.adjr2 <- which.max(best.subset.summary$adjr2)
best.subset.by.adjr2

best.subset.by.cp <- which.min(best.subset.summary$cp)
best.subset.by.cp

model3 <- lm(imputed1$TARGET_AMT ~ REVOKED + MVR_PTS + imputed1$CAR_TYPE + CAR_AGE + SEX +
               imputed1$TRAVTIME + imputed1$JOB + imputed1$URBANICITY + imputed1$MSTATUS +
               imputed1$CAR_USE, data = imputed1)

summary(model3)

modellog1 <- glm(TARGET_FLAG ~ ., family = "binomial", data = imputed1)
summary(modellog1)
pR2(modellog1)

model2 <- glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + YOJ + PARENT1 + HOME_VAL +
                MSTATUS + SEX + TRAVTIME + CAR_USE + TIF + OLDCLAIM + CLM_FREQ + REVOKED +
                MVR_PTS + CAR_AGE + URBANICITY,
              family = "binomial", data = dplyr::select(imputed1, -TARGET_AMT))
summary(model2)
pR2(model2)
library(rJava)
library(glmulti)

glmulti.lm.out <- glmulti(imputed$TARGET_FLAG ~ ., data = imputed, level = 1,
                          method = "g",
                          crit = "aic",
                          confsetsize = 5,
                          plotty = F, report = F,
                          fitfunction = "lm")

modelglmulti <- glm(imputed1$TARGET_FLAG ~ 1 + PARENT1 + MSTATUS + SEX + EDUCATION + JOB +
                      CAR_USE + CAR_TYPE + REVOKED + URBANICITY + TARGET_AMT + KIDSDRIV +
                      AGE + YOJ + HOME_VAL + TRAVTIME + TIF + OLDCLAIM + CLM_FREQ + MVR_PTS,
                    data = imputed1)

summary(modelglmulti)
pR2(modelglmulti)

modelglmulti1 <- modelglmulti
imputed1$TARGET_FLAG <- as.factor(imputed1$TARGET_FLAG)
splitdata <- imputed1

split <- sample.split(splitdata, SplitRatio = 0.8)
split
training <- subset(splitdata, split == "TRUE")
testing <- subset(splitdata, split == "FALSE")

res <- predict(modelglmulti, newdata = training, type = "response")

ROCRPred <- prediction(res, training$TARGET_FLAG)
ROCRPref <- performance(ROCRPred, "tpr", "fpr")

plot(ROCRPref, colorize = TRUE, print.cutoffs.at = seq(0.1, 1, by = 0.1))

table(ActualValue = training$TARGET_FLAG, PredictedValue = res > 0.3)
round((4005 + 1532) / (4005 + 1532 + 755 + 181), 3)

PredictedValue <- res > 0.3
res <- predict(modelglmulti, newdata = testing, type = "response")
table(ActualValue = testing$TARGET_FLAG, PredictedValue = res > 0.3)
round((1037 + 393) / (1037 + 393 + 47 + 211), 3)

imputed4 <- imputed3[, -1]

predict1 <- predict(modelglmulti, newdata = imputed3, type = "response")
predict12 <- ifelse(predict1 > 0.3, 1, 0)
table(predict12)
summary(predict1)