Cover Page
CUNY MSDS HW4 - Binary Logistic Regression and Linear Regression
Nicholas Schettini
CUNY School of Professional Studies
Abstract
In this research assignment, we investigated data on customers at an auto insurance company. The data contain two response variables: TARGET_FLAG and TARGET_AMT. TARGET_FLAG is a binary variable indicating whether the car was involved in a crash; TARGET_AMT is the cost of the crash. The explanatory variables in this dataset include age, bluebook, car_age, car_type, clm_freq, car_use, sex, red_car, urbanicity, and yoj. The data consist of 8,161 observations and 26 variables. The research included four overall stages: data exploration, data preparation, model creation, and model selection. The data were visualized using multiple methods, including histograms and boxplots, and prepared by imputing NA values with the mice package in R. Different models were created using different approaches (for example, backwards elimination), and finally the best model was selected. The research shows that certain variables in the dataset were better predictors of an accident than others.
Keywords: R, auto insurance, prediction, modeling, logistic binary regression, linear regression
Overview
In this homework assignment, you will explore, analyze and model a data set containing approximately 8,000 records, each representing a customer at an auto insurance company. Each record has two response variables. The first response variable, TARGET_FLAG, is a 1 or a 0. A “1” means that the person was in a car crash; a zero means that the person was not. The second response variable is TARGET_AMT. This value is zero if the person did not crash their car; if they did, this number will be a value greater than zero.
Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car, and also the amount of money it will cost if they do. You can only use the variables given to you (or variables that you derive from them). Below is a statistical summary of the variables in the data set:
variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | na_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TARGET_FLAG | 1 | 8161 | 0.2638157 | 0.4407276 | 0 | 0.2047787 | 0.0000 | 0 | 1.0 | 1.0 | 1.0716614 | -0.8516462 | 0.0048786 | 0 |
TARGET_AMT | 2 | 8161 | 1504.3246481 | 4704.0269298 | 0 | 593.7121106 | 0.0000 | 0 | 107586.1 | 107586.1 | 8.7063034 | 112.2884386 | 52.0712628 | 0 |
KIDSDRIV | 3 | 8161 | 0.1710575 | 0.5115341 | 0 | 0.0252719 | 0.0000 | 0 | 4.0 | 4.0 | 3.3518374 | 11.7801916 | 0.0056624 | 0 |
AGE | 4 | 8155 | 44.7903127 | 8.6275895 | 45 | 44.8306513 | 8.8956 | 16 | 81.0 | 65.0 | -0.0289889 | -0.0617020 | 0.0955383 | 6 |
HOMEKIDS | 5 | 8161 | 0.7212351 | 1.1163233 | 0 | 0.4971665 | 0.0000 | 0 | 5.0 | 5.0 | 1.3411271 | 0.6489915 | 0.0123571 | 0 |
YOJ | 6 | 7707 | 10.4992864 | 4.0924742 | 11 | 11.0711853 | 2.9652 | 0 | 23.0 | 23.0 | -1.2029676 | 1.1773410 | 0.0466169 | 454 |
INCOME | 7 | 8161 | 2875.5505453 | 2090.6786785 | 2817 | 2816.9534385 | 2799.1488 | 1 | 6613.0 | 6612.0 | 0.1094699 | -1.2853032 | 23.1427840 | 0 |
PARENT1* | 8 | 8161 | 1.1319691 | 0.3384779 | 1 | 1.0399755 | 0.0000 | 1 | 2.0 | 1.0 | 2.1743561 | 2.7281589 | 0.0037468 | 0 |
HOME_VAL | 9 | 8161 | 1684.8931503 | 1697.3791897 | 1245 | 1516.4994639 | 1842.8718 | 1 | 5107.0 | 5106.0 | 0.5162324 | -1.1810965 | 18.7891522 | 0 |
MSTATUS* | 10 | 8161 | 1.5996814 | 0.4899929 | 2 | 1.6245979 | 0.0000 | 1 | 2.0 | 1.0 | -0.4068189 | -1.8347231 | 0.0054240 | 0 |
SEX* | 11 | 8161 | 1.4639137 | 0.4987266 | 1 | 1.4548936 | 0.0000 | 1 | 2.0 | 1.0 | 0.1446959 | -1.9793056 | 0.0055207 | 0 |
EDUCATION* | 12 | 8161 | 2.8120328 | 1.1786322 | 3 | 2.7785266 | 1.4826 | 1 | 5.0 | 4.0 | 0.1543452 | -0.8453783 | 0.0130469 | 0 |
JOB* | 13 | 8161 | 4.8337214 | 2.6238293 | 5 | 4.7636698 | 4.4478 | 1 | 9.0 | 8.0 | 0.1300643 | -1.4594539 | 0.0290445 | 0 |
TRAVTIME | 14 | 8161 | 33.4857248 | 15.9083334 | 33 | 32.9954051 | 16.3086 | 5 | 142.0 | 137.0 | 0.4468174 | 0.6643331 | 0.1760974 | 0 |
CAR_USE* | 15 | 8161 | 1.6288445 | 0.4831436 | 2 | 1.6610507 | 0.0000 | 1 | 2.0 | 1.0 | -0.5332937 | -1.7158080 | 0.0053482 | 0 |
BLUEBOOK | 16 | 8161 | 1283.6185516 | 893.5117428 | 1124 | 1259.5665492 | 1132.7064 | 1 | 2789.0 | 2788.0 | 0.2472837 | -1.3624655 | 9.8907352 | 0 |
TIF | 17 | 8161 | 5.3513050 | 4.1466353 | 4 | 4.8402512 | 4.4478 | 1 | 25.0 | 24.0 | 0.8908120 | 0.4224940 | 0.0459012 | 0 |
CAR_TYPE* | 18 | 8161 | 3.3405220 | 1.7553381 | 3 | 3.3107673 | 2.9652 | 1 | 6.0 | 5.0 | -0.0981926 | -1.4298002 | 0.0194307 | 0 |
RED_CAR* | 19 | 8161 | 1.2913859 | 0.4544287 | 1 | 1.2392403 | 0.0000 | 1 | 2.0 | 1.0 | 0.9180255 | -1.1573709 | 0.0050303 | 0 |
OLDCLAIM | 20 | 8161 | 552.2714128 | 862.2006829 | 1 | 380.3196508 | 0.0000 | 1 | 2857.0 | 2856.0 | 1.3085876 | 0.2461666 | 9.5441372 | 0 |
CLM_FREQ | 21 | 8161 | 0.7985541 | 1.1584527 | 0 | 0.5886047 | 0.0000 | 0 | 5.0 | 5.0 | 1.2087985 | 0.2842890 | 0.0128235 | 0 |
REVOKED* | 22 | 8161 | 1.1225340 | 0.3279216 | 1 | 1.0281820 | 0.0000 | 1 | 2.0 | 1.0 | 2.3018899 | 3.2991013 | 0.0036299 | 0 |
MVR_PTS | 23 | 8161 | 1.6955030 | 2.1471117 | 1 | 1.3138306 | 1.4826 | 0 | 13.0 | 13.0 | 1.3478403 | 1.3754900 | 0.0237675 | 0 |
CAR_AGE | 24 | 7651 | 8.3283231 | 5.7007424 | 8 | 7.9632413 | 7.4130 | -3 | 28.0 | 31.0 | 0.2819531 | -0.7489756 | 0.0651737 | 510 |
URBANICITY* | 25 | 8161 | 1.7954907 | 0.4033673 | 2 | 1.8693521 | 0.0000 | 1 | 2.0 | 1.0 | -1.4649406 | 0.1460688 | 0.0044651 | 0 |
The data consist of two response variables: TARGET_FLAG and TARGET_AMT. TARGET_FLAG is a binary variable indicating whether the car was involved in a crash; TARGET_AMT is the cost of the crash. The explanatory variables in this dataset include age, bluebook, car_age, car_type, clm_freq, car_use, sex, red_car, urbanicity and yoj.
The data consist of 8,161 observations and 26 variables, with multiple NA values that will have to be handled during data preparation.
Visual Exploration
Boxplots
The boxplots below show all of the variables in the dataset. This visualization helps show how the data are spread for each variable; some variables show far more spread than others.
INCOME and HOME_VAL appear to have the highest variance in the dataset.
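The boxplot panel was produced along these lines (condensed from the appendix; reshape2 and ggplot2 loaded here for self-containment):
library(reshape2)
library(ggplot2)
# melt() turns each numeric column into (variable, value) pairs,
# then facet_wrap draws one free-scaled boxplot per variable
ggplot(melt(train_insurance), aes(x = factor(variable), y = value)) +
  facet_wrap(~variable, scale = "free") +
  geom_boxplot()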
Histograms
Histograms are useful for showing how each variable in the dataset is distributed.
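The histogram panel follows the same pattern (again condensed from the appendix code):
# one free-scaled histogram per numeric variable
ggplot(melt(train_insurance), aes(x = value)) +
  facet_wrap(~variable, scale = "free") +
  geom_histogram(bins = 50)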
The table below shows the correlation of each numeric variable with the two response variables:
variable | TARGET_FLAG | TARGET_AMT |
---|---|---|
TARGET_FLAG | 1.0000000 | 0.5342461 |
TARGET_AMT | 0.5342461 | 1.0000000 |
KIDSDRIV | 0.1036683 | 0.0553942 |
AGE | NA | NA |
HOMEKIDS | 0.1156210 | 0.0619880 |
YOJ | NA | NA |
INCOME | -0.0338365 | -0.0084193 |
HOME_VAL | -0.1485715 | -0.0768246 |
TRAVTIME | 0.0483683 | 0.0279870 |
BLUEBOOK | 0.0504453 | 0.0235955 |
TIF | -0.0823700 | -0.0464808 |
OLDCLAIM | 0.1902875 | 0.0971478 |
CLM_FREQ | 0.2161961 | 0.1164192 |
MVR_PTS | 0.2191971 | 0.1378655 |
CAR_AGE | NA | NA |
Looking at the table above, MVR_PTS (total points on motor vehicle record) has the highest correlation with TARGET_FLAG (was in a car crash), which makes sense: you would expect someone with many points on their record to be in more accidents. HOME_VAL has the strongest negative correlation with TARGET_FLAG, meaning that those with higher home values are less likely to be in an accident.
As mentioned earlier, the data contain a substantial number of missing (NA) values.
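The NA rows in the correlation table (AGE, YOJ, CAR_AGE) are a symptom of this: cor() propagates missing values by default. A minimal sketch of a workaround, computing pairwise-complete correlations before any imputation:
num.cols <- sapply(train_insurance, is.numeric)
# use only pairwise-complete observations so variables with NAs still get a correlation
cor.data <- cor(train_insurance[, num.cols], use = "pairwise.complete.obs")
round(cor.data[, c("TARGET_FLAG", "TARGET_AMT")], 4)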
Data Preparation
Cleaning Data
The data were somewhat unstructured when loaded into R. For example, some variables were not classified sensibly: INCOME was not read as a numeric variable, so it had to be converted to numeric for this analysis.
Some of the data also carried extra characters, such as a ‘z_’ prefix before a value (for example, “z_F” for female). These prefixes were cleaned from the data with code such as:
SEX = as.factor(str_remove(SEX, "^z_")),
It also makes sense to convert certain categorical variables into dummy variables. For example, for the SEX category, male can be coded as 1 and female as 0 using simple if_else() statements, as sketched below.
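A minimal sketch of that 0/1 recoding, mirroring the appendix (where it is applied to the imputed data, imputed1, after the imputation step described next):
library(dplyr)
# each two-level factor becomes a numeric 0/1 indicator
imputed1 <- imputed1 %>%
  mutate(SEX     = if_else(SEX == "M", 1, 0),
         MSTATUS = if_else(MSTATUS == "Yes", 1, 0),
         REVOKED = if_else(REVOKED == "Yes", 1, 0))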
Imputation of Missing (NA) values
The data exploration revealed multiple variables with numerous NA values. There are several ways to handle missing data: deleting the observations, deleting the variables, imputing with the mean/median/mode, or imputing with a prediction.
Imputing with the mean/median/mode is an easy way to fill in the missing NAs; however, it reduces the variance in the dataset and shrinks standard errors, which can invalidate hypothesis tests.
In this case, missing data will be imputed via prediction using the mice (Multivariate Imputation by Chained Equations) package with its random forest method, as sketched below.
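The imputation step, as done in the appendix (train_ins is a copy of the training data):
library(mice)
init  <- mice(train_ins, maxit = 0)   # dry run to obtain the default predictor matrix
predM <- init$predictorMatrix
predM[, "TARGET_FLAG"] <- 0           # keep the response out of the imputation model
imputed  <- mice(train_ins, method = "rf", predictorMatrix = predM, m = 5)
imputed1 <- complete(imputed)         # extract a single completed dataset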
Output: the table below shows the results of the above data manipulation.
The NA values have been ‘filled in’ using mice’s random forest predictions. Variables with collinearity, as established by a stepwise VIF screen, have been dropped; a sketch of that screen follows.
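A sketch of the VIF screen, assuming the usdm package’s vifstep() was the tool used (the threshold of 10 is illustrative):
library(usdm)
# vifstep() iteratively drops the variable with the largest variance inflation
# factor until all remaining VIFs fall below the threshold
num_vars <- imputed1[, sapply(imputed1, is.numeric)]
vifstep(num_vars, th = 10)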
variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TARGET_FLAG | 1 | 8161 | 0.2638157 | 0.4407276 | 0 | 0.2047787 | 0.0000 | 0 | 1.0 | 1.0 | 1.0716614 | -0.8516462 | 0.0048786 |
TARGET_AMT | 2 | 8161 | 1504.3246481 | 4704.0269298 | 0 | 593.7121106 | 0.0000 | 0 | 107586.1 | 107586.1 | 8.7063034 | 112.2884386 | 52.0712628 |
KIDSDRIV | 3 | 8161 | 0.1710575 | 0.5115341 | 0 | 0.0252719 | 0.0000 | 0 | 4.0 | 4.0 | 3.3518374 | 11.7801916 | 0.0056624 |
AGE | 4 | 8161 | 44.7818895 | 8.6346725 | 45 | 44.8240159 | 8.8956 | 16 | 81.0 | 65.0 | -0.0303175 | -0.0636768 | 0.0955816 |
HOMEKIDS | 5 | 8161 | 0.7212351 | 1.1163233 | 0 | 0.4971665 | 0.0000 | 0 | 5.0 | 5.0 | 1.3411271 | 0.6489915 | 0.0123571 |
YOJ | 6 | 8161 | 10.5119471 | 4.0773629 | 11 | 11.0879155 | 2.9652 | 0 | 23.0 | 23.0 | -1.2232265 | 1.2313536 | 0.0451344 |
INCOME | 7 | 8161 | 2875.5505453 | 2090.6786785 | 2817 | 2816.9534385 | 2799.1488 | 1 | 6613.0 | 6612.0 | 0.1094699 | -1.2853032 | 23.1427840 |
PARENT1 | 8 | 8161 | 0.1319691 | 0.3384779 | 0 | 0.0399755 | 0.0000 | 0 | 1.0 | 1.0 | 2.1743561 | 2.7281589 | 0.0037468 |
HOME_VAL | 9 | 8161 | 1684.8931503 | 1697.3791897 | 1245 | 1516.4994639 | 1842.8718 | 1 | 5107.0 | 5106.0 | 0.5162324 | -1.1810965 | 18.7891522 |
MSTATUS | 10 | 8161 | 0.5996814 | 0.4899929 | 1 | 0.6245979 | 0.0000 | 0 | 1.0 | 1.0 | -0.4068189 | -1.8347231 | 0.0054240 |
SEX | 11 | 8161 | 0.4639137 | 0.4987266 | 0 | 0.4548936 | 0.0000 | 0 | 1.0 | 1.0 | 0.1446959 | -1.9793056 | 0.0055207 |
EDUCATION* | 12 | 8161 | 2.8120328 | 1.1786322 | 3 | 2.7785266 | 1.4826 | 1 | 5.0 | 4.0 | 0.1543452 | -0.8453783 | 0.0130469 |
JOB* | 13 | 8161 | 4.8337214 | 2.6238293 | 5 | 4.7636698 | 4.4478 | 1 | 9.0 | 8.0 | 0.1300643 | -1.4594539 | 0.0290445 |
TRAVTIME | 14 | 8161 | 33.4857248 | 15.9083334 | 33 | 32.9954051 | 16.3086 | 5 | 142.0 | 137.0 | 0.4468174 | 0.6643331 | 0.1760974 |
CAR_USE | 15 | 8161 | 0.6288445 | 0.4831436 | 1 | 0.6610507 | 0.0000 | 0 | 1.0 | 1.0 | -0.5332937 | -1.7158080 | 0.0053482 |
BLUEBOOK | 16 | 8161 | 1283.6185516 | 893.5117428 | 1124 | 1259.5665492 | 1132.7064 | 1 | 2789.0 | 2788.0 | 0.2472837 | -1.3624655 | 9.8907352 |
TIF | 17 | 8161 | 5.3513050 | 4.1466353 | 4 | 4.8402512 | 4.4478 | 1 | 25.0 | 24.0 | 0.8908120 | 0.4224940 | 0.0459012 |
CAR_TYPE* | 18 | 8161 | 3.3405220 | 1.7553381 | 3 | 3.3107673 | 2.9652 | 1 | 6.0 | 5.0 | -0.0981926 | -1.4298002 | 0.0194307 |
RED_CAR | 19 | 8161 | 0.2913859 | 0.4544287 | 0 | 0.2392403 | 0.0000 | 0 | 1.0 | 1.0 | 0.9180255 | -1.1573709 | 0.0050303 |
OLDCLAIM | 20 | 8161 | 552.2714128 | 862.2006829 | 1 | 380.3196508 | 0.0000 | 1 | 2857.0 | 2856.0 | 1.3085876 | 0.2461666 | 9.5441372 |
CLM_FREQ | 21 | 8161 | 0.7985541 | 1.1584527 | 0 | 0.5886047 | 0.0000 | 0 | 5.0 | 5.0 | 1.2087985 | 0.2842890 | 0.0128235 |
REVOKED | 22 | 8161 | 0.1225340 | 0.3279216 | 0 | 0.0281820 | 0.0000 | 0 | 1.0 | 1.0 | 2.3018899 | 3.2991013 | 0.0036299 |
MVR_PTS | 23 | 8161 | 1.6955030 | 2.1471117 | 1 | 1.3138306 | 1.4826 | 0 | 13.0 | 13.0 | 1.3478403 | 1.3754900 | 0.0237675 |
CAR_AGE | 24 | 8161 | 8.2220316 | 5.7255627 | 8 | 7.8367284 | 7.4130 | -3 | 28.0 | 31.0 | 0.2950526 | -0.7681362 | 0.0633792 |
URBANICITY* | 25 | 8161 | 1.7954907 | 0.4033673 | 2 | 1.8693521 | 0.0000 | 1 | 2.0 | 1.0 | -1.4649406 | 0.1460688 | 0.0044651 |
NOHOMEKIDS | 26 | 8161 | 0.6480823 | 0.4775977 | 1 | 0.6850973 | 0.0000 | 0 | 1.0 | 1.0 | -0.6200373 | -1.6157517 | 0.0052868 |
NOKIDSDRIV | 27 | 8161 | 0.8797941 | 0.3252220 | 1 | 0.9747281 | 0.0000 | 0 | 1.0 | 1.0 | -2.3353129 | 3.4541097 | 0.0036000 |
HASCOLLEGE | 28 | 8161 | 0.5670874 | 0.4955092 | 1 | 0.5838566 | 0.0000 | 0 | 1.0 | 1.0 | -0.2707483 | -1.9269314 | 0.0054850 |
ISPROFESSIONAL | 29 | 8161 | 0.3903933 | 0.4878684 | 0 | 0.3629959 | 0.0000 | 0 | 1.0 | 1.0 | 0.4492738 | -1.7983734 | 0.0054005 |
Build Models
Throughout this section, various models will be created to determine which gives the best fit for predicting whether someone will be in an accident. Different methods of model creation will be used, as discussed below.
Linear Regression
Model 1: All Variables
This model uses all variables remaining after data preparation. Once the data have been manipulated (imputed, etc., as described above), all of the variables are tested to establish a base model. This shows which variables are significant in the dataset and informs the other models built from it.
##
## Call:
## lm(formula = TARGET_AMT ~ ., data = imputed1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5978 -417 -44 206 100891
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.558e+02 5.915e+02 0.263 0.79225
## TARGET_FLAG 5.686e+03 1.134e+02 50.130 < 2e-16 ***
## KIDSDRIV -3.065e+02 2.099e+02 -1.460 0.14429
## AGE 9.539e+00 6.306e+00 1.513 0.13036
## HOMEKIDS 3.658e+01 8.635e+01 0.424 0.67188
## YOJ 8.731e+00 1.321e+01 0.661 0.50878
## INCOME 2.138e-03 2.371e-02 0.090 0.92814
## PARENT1 1.361e+02 1.898e+02 0.717 0.47342
## HOME_VAL -3.265e-03 3.072e-02 -0.106 0.91537
## MSTATUS -1.281e+02 1.209e+02 -1.059 0.28949
## SEX 7.066e+01 1.500e+02 0.471 0.63758
## EDUCATIONBachelors 3.377e+01 1.781e+02 0.190 0.84968
## EDUCATIONHigh School -1.401e+02 1.504e+02 -0.931 0.35173
## EDUCATIONMasters 1.654e+02 2.600e+02 0.636 0.52479
## EDUCATIONPhD 3.546e+02 2.993e+02 1.185 0.23613
## JOBBlue Collar 7.733e+01 2.808e+02 0.275 0.78303
## JOBClerical 2.553e+01 2.955e+02 0.086 0.93116
## JOBDoctor -2.422e+02 3.572e+02 -0.678 0.49784
## JOBHome Maker -2.765e+01 3.075e+02 -0.090 0.92836
## JOBLawyer 1.037e+02 2.580e+02 0.402 0.68767
## JOBManager -9.434e+01 2.519e+02 -0.375 0.70802
## JOBProfessional 2.074e+02 2.694e+02 0.770 0.44138
## JOBStudent -1.202e+02 3.206e+02 -0.375 0.70765
## TRAVTIME 5.192e-01 2.826e+00 0.184 0.85422
## CAR_USE -1.169e+02 1.444e+02 -0.809 0.41831
## BLUEBOOK -1.533e-02 5.217e-02 -0.294 0.76882
## TIF -3.149e+00 1.068e+01 -0.295 0.76824
## CAR_TYPEPanel Truck 3.567e+02 2.221e+02 1.606 0.10828
## CAR_TYPEPickup -8.538e+01 1.516e+02 -0.563 0.57332
## CAR_TYPESports Car -2.313e+01 1.809e+02 -0.128 0.89828
## CAR_TYPESUV -5.793e+01 1.455e+02 -0.398 0.69045
## CAR_TYPEVan 2.621e+02 1.815e+02 1.444 0.14891
## RED_CAR -4.005e+01 1.303e+02 -0.307 0.75853
## OLDCLAIM -9.852e-02 7.314e-02 -1.347 0.17802
## CLM_FREQ 9.030e+00 5.513e+01 0.164 0.86989
## REVOKED -3.099e+02 1.369e+02 -2.264 0.02359 *
## MVR_PTS 6.043e+01 2.298e+01 2.630 0.00856 **
## CAR_AGE -2.110e+01 1.088e+01 -1.939 0.05253 .
## URBANICITYHighly Urban/ Urban -1.502e+01 1.267e+02 -0.119 0.90567
## NOHOMEKIDS -4.506e+00 2.233e+02 -0.020 0.98390
## NOKIDSDRIV -4.855e+02 3.346e+02 -1.451 0.14685
## HASCOLLEGE NA NA NA NA
## ISPROFESSIONAL NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3973 on 8120 degrees of freedom
## Multiple R-squared: 0.2901, Adjusted R-squared: 0.2866
## F-statistic: 82.96 on 40 and 8120 DF, p-value: < 2.2e-16
This model shows an adjusted R2 of 0.287 and an F-statistic of 82.96 with a very small p-value.
Model 2 - Forward Selection
Variables are added one by one to find the best-fitting model. After each variable is added, the model is re-run until the most optimal output (R2, F-statistic) is produced. Only the final output is shown.
##
## Call:
## lm(formula = imputed1$TARGET_AMT ~ +imputed1$AGE + imputed1$EDUCATION +
## imputed1$REVOKED + imputed1$MVR_PTS + imputed1$JOB + imputed1$YOJ +
## imputed1$CLM_FREQ + imputed1$HOME_VAL + imputed1$URBANICITY +
## imputed1$PARENT1 + imputed1$MSTATUS + imputed1$TRAVTIME +
## imputed1$BLUEBOOK, data = imputed1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5489 -1689 -766 210 104861
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -5.686e+02 5.051e+02 -1.126
## imputed1$AGE 6.420e+00 6.479e+00 0.991
## imputed1$EDUCATIONBachelors -2.497e+02 1.846e+02 -1.353
## imputed1$EDUCATIONHigh School 4.803e+01 1.645e+02 0.292
## imputed1$EDUCATIONMasters -1.665e+02 2.641e+02 -0.630
## imputed1$EDUCATIONPhD -1.373e-01 3.145e+02 0.000
## imputed1$REVOKED 5.172e+02 1.555e+02 3.325
## imputed1$MVR_PTS 1.899e+02 2.590e+01 7.333
## imputed1$JOBBlue Collar 4.927e+02 3.095e+02 1.592
## imputed1$JOBClerical 1.434e+02 3.255e+02 0.440
## imputed1$JOBDoctor -1.241e+03 3.868e+02 -3.208
## imputed1$JOBHome Maker 5.840e+01 3.260e+02 0.179
## imputed1$JOBLawyer -4.729e+02 2.663e+02 -1.776
## imputed1$JOBManager -9.751e+02 2.739e+02 -3.561
## imputed1$JOBProfessional -4.053e+01 2.956e+02 -0.137
## imputed1$JOBStudent 2.414e+02 3.545e+02 0.681
## imputed1$YOJ -7.945e+00 1.459e+01 -0.545
## imputed1$CLM_FREQ 1.433e+02 4.891e+01 2.930
## imputed1$HOME_VAL -6.350e-02 3.515e-02 -1.806
## imputed1$URBANICITYHighly Urban/ Urban 1.582e+03 1.399e+02 11.307
## imputed1$PARENT1 8.578e+02 1.801e+02 4.762
## imputed1$MSTATUS -4.256e+02 1.299e+02 -3.277
## imputed1$TRAVTIME 1.217e+01 3.238e+00 3.757
## imputed1$BLUEBOOK 7.166e-02 5.753e-02 1.246
## Pr(>|t|)
## (Intercept) 0.260359
## imputed1$AGE 0.321784
## imputed1$EDUCATIONBachelors 0.176153
## imputed1$EDUCATIONHigh School 0.770339
## imputed1$EDUCATIONMasters 0.528567
## imputed1$EDUCATIONPhD 0.999652
## imputed1$REVOKED 0.000887 ***
## imputed1$MVR_PTS 2.46e-13 ***
## imputed1$JOBBlue Collar 0.111381
## imputed1$JOBClerical 0.659676
## imputed1$JOBDoctor 0.001344 **
## imputed1$JOBHome Maker 0.857814
## imputed1$JOBLawyer 0.075753 .
## imputed1$JOBManager 0.000372 ***
## imputed1$JOBProfessional 0.890922
## imputed1$JOBStudent 0.495847
## imputed1$YOJ 0.586023
## imputed1$CLM_FREQ 0.003402 **
## imputed1$HOME_VAL 0.070891 .
## imputed1$URBANICITYHighly Urban/ Urban < 2e-16 ***
## imputed1$PARENT1 1.95e-06 ***
## imputed1$MSTATUS 0.001055 **
## imputed1$TRAVTIME 0.000173 ***
## imputed1$BLUEBOOK 0.212884
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4571 on 8137 degrees of freedom
## Multiple R-squared: 0.05851, Adjusted R-squared: 0.05585
## F-statistic: 21.99 on 23 and 8137 DF, p-value: < 2.2e-16
This model shows an adjusted R2 of 0.056 and an F-statistic of 21.99 with a very small p-value.
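For reference, the manual forward search can be approximated with AIC-based forward selection via step(); this is a sketch of an automated alternative, not the exact procedure used above. The derived flags HASCOLLEGE and ISPROFESSIONAL are excluded because they are aliased with EDUCATION and JOB:
# start from the intercept-only model and let step() add terms by AIC
null_model <- lm(TARGET_AMT ~ 1, data = imputed1)
full_model <- lm(TARGET_AMT ~ . - HASCOLLEGE - ISPROFESSIONAL, data = imputed1)
forward_fit <- step(null_model, scope = formula(full_model),
                    direction = "forward", trace = FALSE)
summary(forward_fit)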
Model 3 - Leaps Package
The leaps package is a “regression subset selection” tool: it automatically generates all possible models and is used to find the “best” one.
Here, leaps evaluates the “best” model using adjusted R2 and Mallows’ Cp; the search is sketched below.
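The subset search itself, condensed from the appendix code:
library(leaps)
best.subset <- regsubsets(TARGET_AMT ~ ., data = imputed1, nvmax = 5)
bss <- summary(best.subset)
bss$outmat             # which variables enter each model size
which.max(bss$adjr2)   # best size by adjusted R2
which.min(bss$cp)      # best size by Mallows' Cp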
##
## Call:
## lm(formula = imputed1$TARGET_AMT ~ +REVOKED + MVR_PTS + imputed1$CAR_TYPE +
## CAR_AGE + SEX + imputed1$TRAVTIME + imputed1$JOB + imputed1$URBANICITY +
## imputed1$MSTATUS + imputed1$CAR_USE, data = imputed1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5342 -1693 -792 288 104059
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -393.964 366.798 -1.074
## REVOKED 514.222 155.204 3.313
## MVR_PTS 210.279 24.081 8.732
## imputed1$CAR_TYPEPanel Truck 464.730 247.741 1.876
## imputed1$CAR_TYPEPickup 394.119 168.470 2.339
## imputed1$CAR_TYPESports Car 972.216 203.930 4.767
## imputed1$CAR_TYPESUV 699.868 164.985 4.242
## imputed1$CAR_TYPEVan 567.103 207.133 2.738
## CAR_AGE -28.256 10.859 -2.602
## SEX 224.400 146.141 1.536
## imputed1$TRAVTIME 11.811 3.228 3.659
## imputed1$JOBBlue Collar 578.898 261.861 2.211
## imputed1$JOBClerical 716.828 280.004 2.560
## imputed1$JOBDoctor -439.746 376.224 -1.169
## imputed1$JOBHome Maker 635.220 309.202 2.054
## imputed1$JOBLawyer 245.971 286.522 0.858
## imputed1$JOBManager -548.828 266.887 -2.056
## imputed1$JOBProfessional 330.257 265.251 1.245
## imputed1$JOBStudent 676.833 297.923 2.272
## imputed1$URBANICITYHighly Urban/ Urban 1694.854 136.497 12.417
## imputed1$MSTATUS -802.042 103.469 -7.752
## imputed1$CAR_USE -729.892 157.091 -4.646
## Pr(>|t|)
## (Intercept) 0.282827
## REVOKED 0.000926 ***
## MVR_PTS < 2e-16 ***
## imputed1$CAR_TYPEPanel Truck 0.060709 .
## imputed1$CAR_TYPEPickup 0.019339 *
## imputed1$CAR_TYPESports Car 1.90e-06 ***
## imputed1$CAR_TYPESUV 2.24e-05 ***
## imputed1$CAR_TYPEVan 0.006197 **
## CAR_AGE 0.009281 **
## SEX 0.124699
## imputed1$TRAVTIME 0.000254 ***
## imputed1$JOBBlue Collar 0.027084 *
## imputed1$JOBClerical 0.010483 *
## imputed1$JOBDoctor 0.242502
## imputed1$JOBHome Maker 0.039970 *
## imputed1$JOBLawyer 0.390659
## imputed1$JOBManager 0.039776 *
## imputed1$JOBProfessional 0.213141
## imputed1$JOBStudent 0.023122 *
## imputed1$URBANICITYHighly Urban/ Urban < 2e-16 ***
## imputed1$MSTATUS 1.02e-14 ***
## imputed1$CAR_USE 3.43e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4562 on 8139 degrees of freedom
## Multiple R-squared: 0.06199, Adjusted R-squared: 0.05956
## F-statistic: 25.61 on 21 and 8139 DF, p-value: < 2.2e-16
This model shows an adjusted R2 of 0.060 and an F-statistic of 25.61 with a very small p-value.
Binary Logistic Regression
Model 1 - Base Model: All variables
All of the variables will be tested to establish a base model, showing which variables are significant and informing the models that follow. Note that this base model includes TARGET_AMT as a predictor, and TARGET_AMT is greater than zero exactly when TARGET_FLAG is 1, so the two classes are perfectly separated; this explains the convergence warnings, the near-zero residual deviance, and the McFadden R2 of 1 in the output below.
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Call:
## glm(formula = TARGET_FLAG ~ ., family = "binomial", data = imputed1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.003710 0.000000 0.000000 0.000000 0.005352
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.534e+02 2.189e+04 -0.016 0.987
## TARGET_AMT 3.925e-01 2.744e+00 0.143 0.886
## KIDSDRIV 1.484e+01 1.500e+03 0.010 0.992
## AGE -7.056e-01 2.576e+01 -0.027 0.978
## HOMEKIDS 2.243e+00 6.711e+02 0.003 0.997
## YOJ -2.420e-01 5.761e+01 -0.004 0.997
## INCOME -1.731e-04 1.642e-01 -0.001 0.999
## PARENT1 -3.865e+00 3.642e+03 -0.001 0.999
## HOME_VAL 1.651e-03 1.304e-01 0.013 0.990
## MSTATUS 1.874e+00 4.472e+02 0.004 0.997
## SEX 7.708e+00 1.673e+03 0.005 0.996
## EDUCATIONBachelors -9.223e+00 1.735e+03 -0.005 0.996
## EDUCATIONHigh School 4.761e+00 4.609e+02 0.010 0.992
## EDUCATIONMasters -2.636e+00 3.731e+03 -0.001 0.999
## EDUCATIONPhD 1.090e+01 1.688e+04 0.001 0.999
## JOBBlue Collar 2.930e+02 2.154e+04 0.014 0.989
## JOBClerical 2.935e+02 2.154e+04 0.014 0.989
## JOBDoctor -5.549e+01 9.203e+05 0.000 1.000
## JOBHome Maker 2.811e+02 2.582e+04 0.011 0.991
## JOBLawyer 2.196e+02 2.511e+04 0.009 0.993
## JOBManager 2.770e+02 2.737e+04 0.010 0.992
## JOBProfessional 2.763e+02 4.388e+04 0.006 0.995
## JOBStudent 2.913e+02 2.150e+04 0.014 0.989
## TRAVTIME -3.357e-02 1.323e+01 -0.003 0.998
## CAR_USE -4.805e+00 5.301e+02 -0.009 0.993
## BLUEBOOK 4.589e-03 2.900e-01 0.016 0.987
## TIF -5.620e-01 7.309e+01 -0.008 0.994
## CAR_TYPEPanel Truck -1.304e+02 4.253e+04 -0.003 0.998
## CAR_TYPEPickup 2.878e+00 6.093e+02 0.005 0.996
## CAR_TYPESports Car 8.145e+00 2.046e+03 0.004 0.997
## CAR_TYPESUV -1.904e+00 1.703e+03 -0.001 0.999
## CAR_TYPEVan -1.698e+01 3.361e+03 -0.005 0.996
## RED_CAR -7.328e-01 5.167e+02 -0.001 0.999
## OLDCLAIM -1.463e-04 2.365e-01 -0.001 1.000
## CLM_FREQ 3.775e+00 2.146e+02 0.018 0.986
## REVOKED 9.272e+00 5.920e+02 0.016 0.988
## MVR_PTS -1.159e+00 9.130e+01 -0.013 0.990
## CAR_AGE 3.482e-01 5.490e+01 0.006 0.995
## URBANICITYHighly Urban/ Urban 8.681e+00 6.175e+02 0.014 0.989
## NOHOMEKIDS 1.583e+01 1.807e+03 0.009 0.993
## NOKIDSDRIV 2.283e+01 4.079e+03 0.006 0.996
## HASCOLLEGE NA NA NA NA
## ISPROFESSIONAL NA NA NA NA
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9.4180e+03 on 8160 degrees of freedom
## Residual deviance: 1.2291e-04 on 8120 degrees of freedom
## AIC: 82
##
## Number of Fisher Scoring iterations: 25
## llh llhNull G2 McFadden r2ML
## -6.145302e-05 -4.708981e+03 9.417962e+03 1.000000e+00 6.846337e-01
## r2CU
## 1.000000e+00
Model 2 - Backwards Elimination
Variables are removed one by one to find the best-fitting model. After each variable is removed, the model is re-run until the most optimal output is produced. An automated sketch of the same idea appears below, followed by the final output.
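A sketch of an automated analogue using AIC-based step(); this is not the exact manual procedure used here. TARGET_AMT is dropped to avoid the separation problem noted above, and the aliased derived flags are excluded:
full_logit <- glm(TARGET_FLAG ~ . - TARGET_AMT - HASCOLLEGE - ISPROFESSIONAL,
                  family = "binomial", data = imputed1)
# step() repeatedly drops the term whose removal most improves AIC
backward_fit <- step(full_logit, direction = "backward", trace = FALSE)
summary(backward_fit)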
##
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + YOJ + PARENT1 +
## HOME_VAL + MSTATUS + SEX + TRAVTIME + CAR_USE + TIF + OLDCLAIM +
## CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY, family = "binomial",
## data = dplyr::select(imputed1, -TARGET_AMT))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4154 -0.7469 -0.4494 0.7210 2.9518
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.015e+00 1.609e-01 -12.521 < 2e-16 ***
## KIDSDRIV 3.110e-01 5.827e-02 5.338 9.42e-08 ***
## HOMEKIDS 1.273e-01 3.305e-02 3.851 0.000117 ***
## YOJ -4.336e-02 7.021e-03 -6.175 6.60e-10 ***
## PARENT1 3.421e-01 1.057e-01 3.237 0.001206 **
## HOME_VAL -1.394e-04 1.900e-05 -7.335 2.21e-13 ***
## MSTATUS -3.601e-01 7.466e-02 -4.824 1.41e-06 ***
## SEX -2.879e-01 6.010e-02 -4.790 1.66e-06 ***
## TRAVTIME 1.474e-02 1.829e-03 8.062 7.48e-16 ***
## CAR_USE -7.682e-01 6.026e-02 -12.748 < 2e-16 ***
## TIF -5.104e-02 7.132e-03 -7.156 8.30e-13 ***
## OLDCLAIM 9.888e-05 4.149e-05 2.383 0.017171 *
## CLM_FREQ 1.225e-01 3.129e-02 3.913 9.10e-05 ***
## REVOKED 7.620e-01 7.790e-02 9.782 < 2e-16 ***
## MVR_PTS 1.135e-01 1.336e-02 8.502 < 2e-16 ***
## CAR_AGE -4.178e-02 5.196e-03 -8.040 8.98e-16 ***
## URBANICITYHighly Urban/ Urban 2.111e+00 1.105e-01 19.106 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7658.3 on 8144 degrees of freedom
## AIC: 7692.3
##
## Number of Fisher Scoring iterations: 5
## llh llhNull G2 McFadden r2ML
## -3829.1421401 -4708.9811460 1759.6780119 0.1868428 0.1939588
## r2CU
## 0.2833030
Model 3 - glmulti Package
The glmulti package is an “automated model selection and model averaging” tool. The package automatically generates all possible models “with the specified response and explanatory variables” and is used to find the “best” one.
The exhaustive approach was not practical for this dataset, as it was still running after two hours, so the genetic algorithm method was used instead (method = “g” in the appendix code).
After running the package, I entered the “best” model manually in R so as not to re-run the search (10+ minutes) each time.
glmulti - all data including transformations
##
## Call:
## glm(formula = imputed1$TARGET_FLAG ~ 1 + PARENT1 + MSTATUS +
## SEX + EDUCATION + JOB + CAR_USE + CAR_TYPE + REVOKED + URBANICITY +
## KIDSDRIV + AGE + YOJ + HOME_VAL + TRAVTIME + TIF + OLDCLAIM +
## CLM_FREQ + MVR_PTS, data = imputed1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9825 -0.2814 -0.1104 0.2880 1.2269
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.978e-04 4.639e-02 -0.022 0.982839
## PARENT1 8.504e-02 1.577e-02 5.391 7.18e-08 ***
## MSTATUS -6.891e-02 1.117e-02 -6.167 7.29e-10 ***
## SEX 3.759e-02 1.254e-02 2.997 0.002736 **
## EDUCATIONBachelors -7.200e-02 1.632e-02 -4.412 1.04e-05 ***
## EDUCATIONHigh School 2.204e-05 1.461e-02 0.002 0.998797
## EDUCATIONMasters -5.591e-02 2.288e-02 -2.444 0.014562 *
## EDUCATIONPhD -6.488e-02 2.720e-02 -2.385 0.017086 *
## JOBBlue Collar 8.642e-02 2.745e-02 3.149 0.001645 **
## JOBClerical 1.103e-01 2.884e-02 3.825 0.000132 ***
## JOBDoctor -3.505e-02 3.495e-02 -1.003 0.315944
## JOBHome Maker 1.177e-01 2.986e-02 3.942 8.15e-05 ***
## JOBLawyer 3.552e-02 2.523e-02 1.408 0.159243
## JOBManager -5.822e-02 2.463e-02 -2.363 0.018140 *
## JOBProfessional 5.529e-02 2.634e-02 2.099 0.035840 *
## JOBStudent 1.077e-01 3.108e-02 3.467 0.000529 ***
## CAR_USE -1.198e-01 1.406e-02 -8.517 < 2e-16 ***
## CAR_TYPEPanel Truck 1.076e-02 2.136e-02 0.504 0.614312
## CAR_TYPEPickup 7.856e-02 1.449e-02 5.422 6.07e-08 ***
## CAR_TYPESports Car 1.693e-01 1.748e-02 9.684 < 2e-16 ***
## CAR_TYPESUV 1.295e-01 1.411e-02 9.174 < 2e-16 ***
## CAR_TYPEVan 5.330e-02 1.774e-02 3.005 0.002667 **
## REVOKED 1.314e-01 1.331e-02 9.875 < 2e-16 ***
## URBANICITYHighly Urban/ Urban 2.952e-01 1.196e-02 24.689 < 2e-16 ***
## KIDSDRIV 6.636e-02 8.742e-03 7.591 3.54e-14 ***
## AGE -7.909e-04 5.531e-04 -1.430 0.152788
## YOJ -3.153e-03 1.243e-03 -2.537 0.011208 *
## HOME_VAL -1.143e-05 2.999e-06 -3.811 0.000139 ***
## TRAVTIME 2.020e-03 2.755e-04 7.332 2.49e-13 ***
## TIF -7.827e-03 1.042e-03 -7.515 6.29e-14 ***
## OLDCLAIM 1.264e-05 7.153e-06 1.766 0.077376 .
## CLM_FREQ 1.936e-02 5.388e-03 3.592 0.000330 ***
## MVR_PTS 2.033e-02 2.235e-03 9.094 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1511599)
##
## Null deviance: 1585.0 on 8160 degrees of freedom
## Residual deviance: 1228.6 on 8128 degrees of freedom
## AIC: 7775.3
##
## Number of Fisher Scoring iterations: 2
## llh llhNull G2 McFadden r2ML
## -3853.6578451 -4892.9184843 2078.5212784 0.2124010 0.2248429
## r2CU
## 0.3218783
Model Selection
Based on the above models, I decided to use the model provided by the glmulti package; its AIC and residual deviance gave the best values for prediction. One caveat: the refit shown above was made with glm’s default gaussian family (note the dispersion line in its output), so it is effectively a linear probability model whose fitted values are treated as probabilities; this is why some predictions later fall slightly outside the 0 to 1 range.
Evaluating the Model
Before running the evaluation data through the model, I split the training data 80/20. This allows a better check of the model’s accuracy, since the held-out ‘target’ values can be compared against the model’s predictions.
After splitting the data into two new sets (training and testing), I created an ROC graph to help determine the threshold to use in the model.
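The curve was generated along these lines (ROCR package, condensed from the appendix):
library(ROCR)
res <- predict(modelglmulti, newdata = training, type = "response")
ROCRPred <- prediction(res, training$TARGET_FLAG)
ROCRPerf <- performance(ROCRPred, "tpr", "fpr")
# color the curve by cutoff and print candidate thresholds along it
plot(ROCRPerf, colorize = TRUE, print.cutoffs.at = seq(0.1, 1, by = 0.1))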
Looking at the graph, the 0.4 threshold seems to be the most suitable choice for my testing: it gives roughly a 0.6 true positive rate with only a ~0.2 false positive rate. Thresholds of 0.3 and 0.2 give slightly higher TP rates but also much higher FP rates, which, in my opinion, is not worth the small TP gain; 0.5 gives a slightly lower FP rate but a significantly lower TP rate.
## PredictedValue
## ActualValue FALSE TRUE
## 0 4026 703
## 1 741 1003
## [1] 0.774
After testing the model, the accuracy is around 0.774, which seems like a decent fit.
The misclassification rate is about 0.2246.
The true positive rate is about 0.5728.
The false positive rate is about 0.1514.
The specificity is about 0.8488.
The precision is about 0.5775.
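For reference, a sketch of how these rates fall out of the confusion matrix at the chosen cutoff (object names follow the appendix; the 0.4 cutoff follows the discussion above):
res_test <- predict(modelglmulti, newdata = testing, type = "response")
# actual classes in rows, predicted classes in columns
cm <- table(ActualValue = testing$TARGET_FLAG, PredictedValue = res_test > 0.4)
accuracy    <- sum(diag(cm)) / sum(cm)          # (TN + TP) / total
misclass    <- 1 - accuracy
tpr         <- cm["1", "TRUE"] / sum(cm["1", ]) # sensitivity
fpr         <- cm["0", "TRUE"] / sum(cm["0", ])
specificity <- 1 - fpr
precision   <- cm["1", "TRUE"] / sum(cm[, "TRUE"])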
Testing the Evaluation Data with the glmulti Model
## predict12
## 0 1
## 1550 591
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2808 0.1296 0.2822 0.2776 0.4121 0.9482
This model predicts that 591 insurance customers will have an auto accident, while 1,550 will not.
References
All subset regression with leaps, bestglm, glmulti, and meifly. (n.d.). Retrieved from https://rstudio-pubs-static.s3.amazonaws.com/2897_9220b21cfc0c43a396ff9abf122bb351.html
Model selection and multimodel inference made easy. (n.d.). Retrieved from https://cran.r-project.org/web/packages/glmulti/glmulti.pdf
Best subset model selection with R. (n.d.). Retrieved from http://jadianes.me/best-subset-model-selection-with-R
Appendix
train_insurance <- read.csv("https://raw.githubusercontent.com/nschettini/CUNY-MSDS-DATA-621/master/insurance_training_data.csv") %>%
  dplyr::select(-INDEX) %>%
  mutate(INCOME = as.numeric(INCOME),
         HOME_VAL = as.numeric(HOME_VAL),
         BLUEBOOK = as.numeric(BLUEBOOK),
         OLDCLAIM = as.numeric(OLDCLAIM),
         MSTATUS = as.factor(str_remove(MSTATUS, "z_")),
         SEX = as.factor(str_remove(SEX, "z_")),
         EDUCATION = as.factor(str_remove(EDUCATION, "z_")),
         JOB = as.factor(str_remove(JOB, "z_")),
         CAR_TYPE = as.factor(str_remove(CAR_TYPE, "z_")),
         URBANICITY = as.factor(str_remove(URBANICITY, "z_")))

eval_data <- read.csv("https://raw.githubusercontent.com/nschettini/CUNY-MSDS-DATA-621/master/insurance-evaluation-data.csv") %>%
  dplyr::select(-INDEX) %>%
  mutate(INCOME = as.numeric(INCOME),
         HOME_VAL = as.numeric(HOME_VAL),
         BLUEBOOK = as.numeric(BLUEBOOK),
         OLDCLAIM = as.numeric(OLDCLAIM),
         MSTATUS = as.factor(str_remove(MSTATUS, "z_")),
         SEX = as.factor(str_remove(SEX, "z_")),
         EDUCATION = as.factor(str_remove(EDUCATION, "z_")),
         JOB = as.factor(str_remove(JOB, "z_")),
         CAR_TYPE = as.factor(str_remove(CAR_TYPE, "z_")),
         URBANICITY = as.factor(str_remove(URBANICITY, "z_")))
insurance_desc <- describe(train_insurance)
insurance_desc$na_count <- sapply(train_insurance, function(y) sum(is.na(y)))

kable(insurance_desc, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")
ggplot(melt(train_insurance), aes(x = factor(variable), y = value)) +
  facet_wrap(~variable, scale = "free") +
  geom_boxplot()

ggplot1 <- train_insurance[, -c(1, 2)]

ggplot(melt(ggplot1), aes(x = factor(variable), y = value)) +
  geom_boxplot() +
  stat_summary(fun.y = mean, color = "blue", geom = "point") +
  stat_summary(fun.y = median, color = "red", geom = "point") +
  coord_flip() +
  theme_bw()

ggplot(melt(train_insurance), aes(x = value)) +
  facet_wrap(~variable, scale = "free") +
  geom_histogram(bins = 50)

num.cols <- sapply(train_insurance, is.numeric)
cor.data <- cor(train_insurance[, num.cols])

kable(cor.data[, 1:2], "html", escape = F) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1, bold = T) %>%
  scroll_box(height = "500px")

corrgram(drop_na(train_insurance), order = TRUE, upper.panel = panel.cor, main = "Insurance")

library(Amelia)
missmap(train_insurance, main = "Missing values vs observed")
train_ins <- train_insurance

init <- mice(train_ins, maxit = 0)
meth <- init$method
predM <- init$predictorMatrix

predM[, c("TARGET_FLAG")] <- 0

imputed <- mice(train_ins, method = "rf", predictorMatrix = predM, m = 5)

imputed1 <- complete(imputed)

imputed1 <- imputed1 %>%
  mutate(PARENT1 = if_else(PARENT1 == "Yes", 1, 0)) %>%
  mutate(MSTATUS = if_else(MSTATUS == "Yes", 1, 0)) %>%
  mutate(SEX = if_else(SEX == "M", 1, 0)) %>%
  mutate(CAR_USE = if_else(CAR_USE == "Private", 1, 0)) %>%
  mutate(RED_CAR = if_else(RED_CAR == "yes", 1, 0)) %>%
  mutate(REVOKED = if_else(REVOKED == "Yes", 1, 0))

imputed1 <- imputed1 %>%
  mutate(NOHOMEKIDS = as.integer(HOMEKIDS == 0),
         NOKIDSDRIV = as.integer(KIDSDRIV == 0),
         HASCOLLEGE = as.integer(EDUCATION %in% c("Bachelors", "Masters", "PhD")),
         ISPROFESSIONAL = as.integer(JOB %in% c("Doctor", "Lawyer", "Manager", "Professional")))
eval_ins <- eval_data

init <- mice(eval_ins, maxit = 0)
meth <- init$method
predM <- init$predictorMatrix
predM[, c("TARGET_FLAG")] <- 0

imputed2 <- mice(eval_ins, method = "rf", predictorMatrix = predM, m = 5)
imputed3 <- complete(imputed2)

imputed3 <- imputed3 %>%
  mutate(PARENT1 = if_else(PARENT1 == "Yes", 1, 0)) %>%
  mutate(MSTATUS = if_else(MSTATUS == "Yes", 1, 0)) %>%
  mutate(SEX = if_else(SEX == "M", 1, 0)) %>%
  mutate(CAR_USE = if_else(CAR_USE == "Private", 1, 0)) %>%
  mutate(RED_CAR = if_else(RED_CAR == "yes", 1, 0)) %>%
  mutate(REVOKED = if_else(REVOKED == "Yes", 1, 0))

imputed3 <- imputed3 %>%
  mutate(NOHOMEKIDS = as.integer(HOMEKIDS == 0),
         NOKIDSDRIV = as.integer(KIDSDRIV == 0),
         HASCOLLEGE = as.integer(EDUCATION %in% c("Bachelors", "Masters", "PhD")),
         ISPROFESSIONAL = as.integer(JOB %in% c("Doctor", "Lawyer", "Manager", "Professional")))

imputed3$TARGET_FLAG <- as.numeric(eval_data$TARGET_FLAG)
imputed3$TARGET_AMT <- as.numeric(eval_data$TARGET_AMT)
imputedtable <- describe(imputed1)

kable(imputedtable, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")
model1 <- lm(TARGET_AMT ~ ., imputed1)

summary(model1)
summodel1 <- summary(model1)

model2 <- lm(imputed1$TARGET_AMT ~ imputed1$AGE + imputed1$EDUCATION + imputed1$REVOKED +
               imputed1$MVR_PTS + imputed1$JOB + imputed1$YOJ + imputed1$CLM_FREQ +
               imputed1$HOME_VAL + imputed1$URBANICITY + imputed1$PARENT1 + imputed1$MSTATUS +
               imputed1$TRAVTIME + imputed1$BLUEBOOK, data = imputed1)

summary(model2)

best.subset <- regsubsets(imputed1$TARGET_AMT ~ ., imputed1, nvmax = 5)
best.subset.summary <- summary(best.subset)
best.subset.summary$outmat

best.subset.by.adjr2 <- which.max(best.subset.summary$adjr2)
best.subset.by.adjr2

best.subset.by.cp <- which.min(best.subset.summary$cp)
best.subset.by.cp

model3 <- lm(imputed1$TARGET_AMT ~ REVOKED + MVR_PTS + imputed1$CAR_TYPE + CAR_AGE + SEX +
               imputed1$TRAVTIME + imputed1$JOB + imputed1$URBANICITY + imputed1$MSTATUS +
               imputed1$CAR_USE, data = imputed1)

summary(model3)

modellog1 <- glm(TARGET_FLAG ~ ., family = "binomial", data = imputed1)
summary(modellog1)
pR2(modellog1)

model2 <- glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + YOJ + PARENT1 + HOME_VAL +
                MSTATUS + SEX + TRAVTIME + CAR_USE + TIF + OLDCLAIM + CLM_FREQ + REVOKED +
                MVR_PTS + CAR_AGE + URBANICITY,
              family = "binomial", data = dplyr::select(imputed1, -TARGET_AMT))
summary(model2)
pR2(model2)
library(rJava)
library(glmulti)

glmulti.lm.out <- glmulti(imputed$TARGET_FLAG ~ ., data = imputed, level = 1,
                          method = "g",
                          crit = "aic",
                          confsetsize = 5,
                          plotty = F, report = F,
                          fitfunction = "lm")

modelglmulti <- glm(imputed1$TARGET_FLAG ~ 1 + PARENT1 + MSTATUS + SEX + EDUCATION + JOB +
                      CAR_USE + CAR_TYPE + REVOKED + URBANICITY + TARGET_AMT + KIDSDRIV +
                      AGE + YOJ + HOME_VAL + TRAVTIME + TIF + OLDCLAIM + CLM_FREQ + MVR_PTS,
                    data = imputed1)

summary(modelglmulti)
pR2(modelglmulti)

modelglmulti1 <- modelglmulti
imputed1$TARGET_FLAG <- as.factor(imputed1$TARGET_FLAG)
splitdata <- imputed1

split <- sample.split(splitdata, SplitRatio = 0.8)
split
training <- subset(splitdata, split == "TRUE")
testing <- subset(splitdata, split == "FALSE")

res <- predict(modelglmulti, newdata = training, type = "response")

ROCRPred <- prediction(res, training$TARGET_FLAG)
ROCRPref <- performance(ROCRPred, "tpr", "fpr")

plot(ROCRPref, colorize = TRUE, print.cutoffs.at = seq(0.1, 1, by = 0.1))

table(ActualValue = training$TARGET_FLAG, PredictedValue = res > 0.3)
round((4005 + 1532) / (4005 + 1532 + 755 + 181), 3)

PredictedValue <- res > 0.3
res <- predict(modelglmulti, newdata = testing, type = "response")
table(ActualValue = testing$TARGET_FLAG, PredictedValue = res > 0.3)
round((1037 + 393) / (1037 + 393 + 47 + 211), 3)

imputed4 <- imputed3[, -1]

predict1 <- predict(modelglmulti, newdata = imputed3, type = "response")
predict12 <- ifelse(predict1 > 0.3, 1, 0)
table(predict12)
summary(predict1)