Abstract

The main purpose of this homework is building two models, (a binary regression model to predict whether a vehicle will be involved in a crash and a linear regression model to predict possible payout).

Binary model generated fairly good results. Due to the accuracy of aproximately 80% the binary model has generated a better result instead of gussing the outcome. We woould like to make a note here that the accuracy level may not be high in the real world due to the complexity of car insurance business as the reasons for claim are diverse .But we belive that the results of this project are reasonable to achive the required model.

Based on our observation , linear model have problems due to the significant predictor of depending on the Blue Book Value for payment purposes. However, with \(R^2\) value of 0.005, this model barely explains any variance in the outcome variable. Adding other variables that should influence the payout amount, such as CAR_AGE, did not have a significant effect.

The model was created using only onservations for vehicles involved in a crash (TARGET_FLAG==1). It is possible to create a linear regression model on all data, without this limitation. \(R^2\) is significantly improved. In some tests, \(R^2\) was about 0.27. From the obervation we can undertsnad that the $0 payout (vehicles not involved in a crash) does not help fit the actual payouts.

Looking at the relationship between Blue Book value and payout amount, there is a high amount of variability. There are many observations with low value, but high payout which might have been caused by missing of some critical data which would have enabled to accurately predict payouts . We should not also ignore the importance of variables which will dramatically influence the payout amount . This varivales are : the intesity of the crash, the number of vehicles involved on th crash and the assesment of whethere the vehicle was totaled or not . ## Data Exploration

The data set includes 8,161 observations with 24 variables.

Summary of Variables

Variable Type Description Comments
KIDSDRIV Integer No of children driving. Ranges from 0 to 4.
AGE Integer Age of driver. Ranges from 16 to 81. Contains 6 NAs.
HOMEKIDS Integer No of children at home. Ranges from 0 to 5.
YOJ Integer Years on the job. Ranges from 0 to 23. Contains 454 NAs (about 5.56% of all observations).
INCOME Numeric Income. Ranges from $0 to $367,000. Contains 445 NAs (about 5.45% of all observations). Was converted to numeric by removing dollar signs and commas.
PARENT1 Factor Single parent flag. Values: No, Yes.
HOME_VAL Numeric Home value. Ranges from $0 to $885,000. Contains 464 NAs (about 5.69% of all observations). Was converted to numeric by removing dollar signs and commas.
MSTATUS Factor Married flag. Values: Yes, No.
SEX Factor Gender. Values: M, F.
EDUCATION Factor Maximum education level. Values: <High School, High School, Bachelors, Masters, PhD.
JOB Factor Job category. Values: [Blank], Clerical, Doctor, Home Maker, Lawyer, Manager, Professional, Student, Blue Collar.
TRAVTIME Integer Distance to work. Ranges from 5 to 142.
CAR_USE Factor Vehicle use. Values: Commercial, Private.
BLUEBOOK Numeric Vehicle value. Ranges from $1,500 to $69,740. Was converted to numeric by removing dollar signs and commas.
TIF Integer Time in force. Ranges from 1 to 25.
CAR_TYPE Factor Vehicle type. Values: Minivan, Panel Truck, Pickup, Sports Car, Van, SUV.
RED_CAR Factor Red car flag. Values: No, Yes
OLDCLAIM Numeric Total payout of claims. Ranges from $0 to $57,040. Was converted to numeric by removing dollar signs and commas.
CLM_FREQ Integer No of claims (past 5 years). Ranges from 0 to 5.
REVOKED Factor Revoked license flag. Values: No, Yes.
MVR_PTS Integer Motor vehicle record points. Ranges from 0 to 13.
CAR_AGE Integer Vehicle age. Ranges from -3 to 28. Contains 510 NAs (about 6.25% of all observations).
URBANICITY Factor Home/work area. Values: Urban, Rural.

Observation of summary of numeric variables in a table .

Variable Min Median Mean SD Max Num of NAs Num of Zeros
KIDSDRIV 0 0 0.1711 0.5115 4 0 7180
AGE 16 45 44.79 8.628 81 6 0
HOMEKIDS 0 0 0.7212 1.116 5 0 5289
YOJ 0 11 10.5 4.092 23 454 625
INCOME 0 54028 61898 47573 367030 445 615
HOME_VAL 0 161160 154867 129124 885282 464 2294
TRAVTIME 5 33 33.49 15.91 142 0 0
BLUEBOOK 1500 14440 15710 8420 69740 0 0
TIF 1 4 5.351 4.147 25 0 0
OLDCLAIM 0 0 4037 8777 57037 0 5009
CLM_FREQ 0 0 0.7986 1.158 5 0 5009
MVR_PTS 0 1 1.696 2.147 13 0 3712
CAR_AGE -3 8 8.328 5.701 28 510 3

As we have obsrved from the above table, there are significant number of observations with value of 0 and there is a logical explanation that these observations are valid:

  • KIDSDRIV and HOMEKIDS: households without children
  • YOJ and INCOME: unemployed individuals
  • HOME_VAL: renters
  • OLDCLAIM, CLM_FREQ and MVR_PTS: safe drivers with no claims or DMV points

Hangling NAs

Several variables - AGE, YOJ, INCOME, HOME_VAL, CAR_AGE - contained some NA values.

By removing incomplete observation we have 6,041 observations which are suffient for the sake of our analysis .

Additional Details

  • Boxplots and histograms were inspected for all variable in order to see any anomalies.
  • Multiple values were prefixed with text ‘z_’, which was removed from all observations. Affected variables are MSTATUS, EDUCATION, JOB, CAR_TYPE and URBANICITY.
  • Index column present in the data set has been removed.
  • Levels for EDUCATION have been re-ordered to follow the most common order: <High School, High School, Bachelors, Masters, and PhD.
  • Levels for JOB have been re-ordered to follow general order from low-paying to high-paying occupations: Student, Blue Collar, Home Maker, Clerical, Professional, Manager, Lawyer, and Doctor.
  • Counts of observations for possible values of PARENT1:
No Yes
5219 822
  • Counts of observations for possible values of MSTATUS:
Yes No
3594 2447
  • Counts of observations for possible values of SEX:
M F
2683 3358
  • Counts of observations for possible values of EDUCATION:
<High School High School Bachelors Masters PhD
955 1871 1739 1060 416
  • Counts of observations for possible values of JOB:
Student Blue Collar Home Maker Clerical Professional Manager Lawyer Doctor
537 1476 484 1030 867 778 670 199
  • Counts of observations for possible values of CAR_USE:
Commercial Private
2040 4001
  • Counts of observations for possible values of CAR_TYPE:
Minivan Panel Truck Pickup Sports Car Van SUV
1699 347 1012 714 488 1781
  • Counts of observations for possible values of RED_CAR:
No Yes
4350 1691
  • Counts of observations for possible values of REVOKED:
No Yes
5297 744
  • Counts of observations for possible values of URBANICITY:
Urban Rural
4742 1299
  • Counts of observations for possible values of KIDSDRIV:
0 1 2 3 4
5309 473 207 50 2
  • Counts of observations for possible values of HOMEKIDS
0 1 2 3 4 5
3872 667 837 525 129 11
  • Counts of observations for possible values of CLM_FREQ
0 1 2 3 4 5
3758 715 848 568 141 11

Target/Dependent Variable TARGET_FLAG

To identify whether a vehicle was involved in a crash we have used target variable for the binary regression model representing a flag . There are 4,440 observation with TARGET_FLAG value of 0 and 1,601 observations with TARGET_FLAG value of 1 making it about 75/25 split, or more precisely there are 73.5% of 0s and 26.5% of 1s.

Target/Dependent Variable TARGET_AMT

Target variable for the linear regression model represents the cost if a vehicle was involved in a crash. The value is presented only for observations with TARGET_FLAG of 1. Distribution of TARGET_AMT has a long right tail. There are no missing values.

Min Median Mean SD Max Num_NAs Num_Zeros
30.28 4136 5586 7440 85524 0 0

Modelling: Generalized Linear Model

The aim of this model is predicting whether a vehicle will be involved in a crash. The dependent variable, TARGET_FLAG, is binary. For this project it is assumed that observations are independent of each other as there is no reason to believe otherwise.

The main data set is split into training and testing data sets for the purpose of testing the accuracy of the model. The training set includes 75% of randomly chosen observations (4,531) while the testing set includes remaining 25% (1,510).

The starting point is a model that includes all independent variables. It has AIC value of 4085.5 and accuracy of 79.47%. Summary is below.

## 
## Call:
## glm(formula = TARGET_FLAG ~ . - TARGET_AMT, family = binomial(link = "logit"), 
##     data = insTRAIN)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2532  -0.7047  -0.3874   0.6212   2.9097  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -3.057e-01  3.474e-01  -0.880 0.378824    
## KIDSDRIV              3.254e-01  8.162e-02   3.987 6.70e-05 ***
## AGE                  -6.524e-03  5.457e-03  -1.195 0.231914    
## HOMEKIDS              1.095e-02  4.955e-02   0.221 0.825173    
## YOJ                  -4.818e-03  1.137e-02  -0.424 0.671742    
## INCOME               -3.061e-06  1.632e-06  -1.876 0.060660 .  
## PARENT1Yes            3.940e-01  1.454e-01   2.709 0.006749 ** 
## HOME_VAL             -1.383e-06  4.914e-07  -2.815 0.004879 ** 
## MSTATUSNo             4.710e-01  1.165e-01   4.043 5.28e-05 ***
## SEXF                 -1.745e-01  1.504e-01  -1.160 0.245995    
## EDUCATIONHigh School -4.663e-02  1.224e-01  -0.381 0.703169    
## EDUCATIONBachelors   -5.190e-01  1.524e-01  -3.405 0.000662 ***
## EDUCATIONMasters     -6.353e-01  2.499e-01  -2.542 0.011006 *  
## EDUCATIONPhD         -6.920e-02  3.049e-01  -0.227 0.820482    
## JOBBlue Collar        9.702e-02  1.762e-01   0.551 0.581935    
## JOBHome Maker        -8.012e-02  2.029e-01  -0.395 0.692862    
## JOBClerical           2.432e-01  1.762e-01   1.380 0.167584    
## JOBProfessional      -9.445e-03  2.109e-01  -0.045 0.964277    
## JOBManager           -6.364e-01  2.306e-01  -2.760 0.005782 ** 
## JOBLawyer             2.617e-01  2.815e-01   0.930 0.352593    
## JOBDoctor            -5.717e-01  4.068e-01  -1.405 0.159912    
## TRAVTIME              1.420e-02  2.536e-03   5.599 2.15e-08 ***
## CAR_USEPrivate       -8.163e-01  1.224e-01  -6.669 2.57e-11 ***
## BLUEBOOK             -2.233e-05  7.034e-06  -3.174 0.001503 ** 
## TIF                  -5.337e-02  9.914e-03  -5.383 7.32e-08 ***
## CAR_TYPEPanel Truck   6.422e-01  2.271e-01   2.827 0.004694 ** 
## CAR_TYPEPickup        5.283e-01  1.322e-01   3.995 6.47e-05 ***
## CAR_TYPESports Car    1.104e+00  1.696e-01   6.512 7.42e-11 ***
## CAR_TYPEVan           4.579e-01  1.759e-01   2.603 0.009232 ** 
## CAR_TYPESUV           7.862e-01  1.456e-01   5.398 6.74e-08 ***
## RED_CARYes           -1.989e-01  1.199e-01  -1.658 0.097220 .  
## OLDCLAIM             -1.796e-05  5.177e-06  -3.468 0.000524 ***
## CLM_FREQ              2.075e-01  3.832e-02   5.416 6.10e-08 ***
## REVOKEDYes            8.732e-01  1.218e-01   7.167 7.67e-13 ***
## MVR_PTS               1.147e-01  1.851e-02   6.197 5.75e-10 ***
## CAR_AGE               7.534e-04  1.030e-02   0.073 0.941712    
## URBANICITYRural      -2.360e+00  1.463e-01 -16.129  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5240.4  on 4530  degrees of freedom
## Residual deviance: 4011.5  on 4494  degrees of freedom
## AIC: 4085.5
## 
## Number of Fisher Scoring iterations: 5

Running this model through the stepwise algorithm (stepAIC from the MASS package), removed AGE, HOMEKIDS, YOJ, SEX, RED_CAR and CAR_AGE. AIC is reduced slightly to 4078.2 and accuracy is improved very slightly to 79.54%.

INCOME and HOME_VAL are very right-skewed. To make results more normal, they are log-transformed (adding 1 to make sure that log-transformation is possible for 0 values). The new model again has slightly lower AIC of 4073.2 and slightly higher accuracy at 79.73%.

## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + log(INCOME + 1) + PARENT1 + 
##     log(HOME_VAL + 1) + MSTATUS + EDUCATION + JOB + TRAVTIME + 
##     CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + 
##     REVOKED + MVR_PTS + URBANICITY, family = binomial(link = "logit"), 
##     data = insTRAIN)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3062  -0.7025  -0.3932   0.6137   2.9590  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -3.484e-01  2.494e-01  -1.397 0.162383    
## KIDSDRIV              3.248e-01  7.318e-02   4.439 9.05e-06 ***
## log(INCOME + 1)      -6.371e-02  1.901e-02  -3.351 0.000805 ***
## PARENT1Yes            4.522e-01  1.255e-01   3.604 0.000313 ***
## log(HOME_VAL + 1)    -3.312e-02  9.474e-03  -3.496 0.000473 ***
## MSTATUSNo             3.924e-01  1.145e-01   3.426 0.000613 ***
## EDUCATIONHigh School -5.933e-02  1.221e-01  -0.486 0.626878    
## EDUCATIONBachelors   -5.946e-01  1.387e-01  -4.288 1.80e-05 ***
## EDUCATIONMasters     -7.595e-01  2.179e-01  -3.485 0.000492 ***
## EDUCATIONPhD         -3.272e-01  2.696e-01  -1.214 0.224811    
## JOBBlue Collar        3.822e-01  2.018e-01   1.894 0.058233 .  
## JOBHome Maker         6.055e-02  2.130e-01   0.284 0.776192    
## JOBClerical           6.048e-01  2.050e-01   2.950 0.003175 ** 
## JOBProfessional       2.447e-01  2.315e-01   1.057 0.290386    
## JOBManager           -4.020e-01  2.480e-01  -1.621 0.105038    
## JOBLawyer             5.227e-01  2.983e-01   1.752 0.079716 .  
## JOBDoctor            -3.706e-01  4.133e-01  -0.897 0.369957    
## TRAVTIME              1.422e-02  2.540e-03   5.599 2.16e-08 ***
## CAR_USEPrivate       -8.122e-01  1.225e-01  -6.632 3.32e-11 ***
## BLUEBOOK             -2.595e-05  6.304e-06  -4.116 3.85e-05 ***
## TIF                  -5.211e-02  9.911e-03  -5.258 1.45e-07 ***
## CAR_TYPEPanel Truck   6.821e-01  2.146e-01   3.179 0.001478 ** 
## CAR_TYPEPickup        5.412e-01  1.321e-01   4.097 4.18e-05 ***
## CAR_TYPESports Car    1.057e+00  1.392e-01   7.594 3.10e-14 ***
## CAR_TYPEVan           4.567e-01  1.709e-01   2.673 0.007516 ** 
## CAR_TYPESUV           7.558e-01  1.120e-01   6.750 1.48e-11 ***
## OLDCLAIM             -1.792e-05  5.186e-06  -3.455 0.000551 ***
## CLM_FREQ              2.081e-01  3.827e-02   5.439 5.37e-08 ***
## REVOKEDYes            8.647e-01  1.219e-01   7.092 1.32e-12 ***
## MVR_PTS               1.148e-01  1.853e-02   6.196 5.79e-10 ***
## URBANICITYRural      -2.391e+00  1.471e-01 -16.254  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5240.4  on 4530  degrees of freedom
## Residual deviance: 4011.2  on 4500  degrees of freedom
## AIC: 4073.2
## 
## Number of Fisher Scoring iterations: 5

It is good to note that a potential issue was discovered with variance-inflation (using vif in the car package) by anlyzing the model using various evaluation methods.

##                        GVIF Df GVIF^(1/(2*Df))
## KIDSDRIV           1.088467  1        1.043296
## log(INCOME + 1)    2.530758  1        1.590836
## PARENT1            1.427433  1        1.194752
## log(HOME_VAL + 1)  1.965195  1        1.401854
## MSTATUS            2.118875  1        1.455636
## EDUCATION          7.054761  4        1.276616
## JOB               29.531678  7        1.273563
## TRAVTIME           1.034921  1        1.017311
## CAR_USE            2.336829  1        1.528669
## BLUEBOOK           1.575884  1        1.255342
## TIF                1.010050  1        1.005012
## CAR_TYPE           2.221595  5        1.083095
## OLDCLAIM           1.679788  1        1.296066
## CLM_FREQ           1.449881  1        1.204110
## REVOKED            1.341382  1        1.158181
## MVR_PTS            1.174200  1        1.083605
## URBANICITY         1.142851  1        1.069042

TThere is a possible correlation between JOB and EDUCATION

Final Model

We would like to provide the formula for the final generalized linear model to be TARGET_FLAG ~ KIDSDRIV + log(INCOME+1) + PARENT1 + log(HOME_VAL+1) + MSTATUS + EDUCATION + TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY. AIC is 4107.

## 
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + log(INCOME + 1) + PARENT1 + 
##     log(HOME_VAL + 1) + MSTATUS + EDUCATION + TRAVTIME + CAR_USE + 
##     BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED + 
##     MVR_PTS + URBANICITY, family = binomial(link = "logit"), 
##     data = insTRAIN)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3080  -0.7092  -0.4102   0.6195   2.9691  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -2.012e-01  2.367e-01  -0.850 0.395436    
## KIDSDRIV              3.163e-01  7.239e-02   4.369 1.25e-05 ***
## log(INCOME + 1)      -4.046e-02  1.317e-02  -3.073 0.002119 ** 
## PARENT1Yes            4.630e-01  1.241e-01   3.730 0.000192 ***
## log(HOME_VAL + 1)    -2.605e-02  8.541e-03  -3.050 0.002291 ** 
## MSTATUSNo             4.317e-01  1.092e-01   3.952 7.75e-05 ***
## EDUCATIONHigh School -1.621e-01  1.183e-01  -1.371 0.170482    
## EDUCATIONBachelors   -8.241e-01  1.235e-01  -6.673 2.50e-11 ***
## EDUCATIONMasters     -8.845e-01  1.426e-01  -6.202 5.59e-10 ***
## EDUCATIONPhD         -8.916e-01  1.931e-01  -4.617 3.89e-06 ***
## TRAVTIME              1.438e-02  2.515e-03   5.717 1.09e-08 ***
## CAR_USEPrivate       -8.525e-01  9.843e-02  -8.661  < 2e-16 ***
## BLUEBOOK             -2.722e-05  6.234e-06  -4.367 1.26e-05 ***
## TIF                  -5.137e-02  9.859e-03  -5.211 1.88e-07 ***
## CAR_TYPEPanel Truck   5.620e-01  2.039e-01   2.757 0.005841 ** 
## CAR_TYPEPickup        4.841e-01  1.287e-01   3.761 0.000169 ***
## CAR_TYPESports Car    1.006e+00  1.371e-01   7.335 2.21e-13 ***
## CAR_TYPEVan           4.125e-01  1.692e-01   2.438 0.014771 *  
## CAR_TYPESUV           7.359e-01  1.106e-01   6.655 2.83e-11 ***
## OLDCLAIM             -1.776e-05  5.131e-06  -3.461 0.000538 ***
## CLM_FREQ              2.059e-01  3.794e-02   5.428 5.71e-08 ***
## REVOKEDYes            8.704e-01  1.209e-01   7.199 6.07e-13 ***
## MVR_PTS               1.218e-01  1.840e-02   6.618 3.64e-11 ***
## URBANICITYRural      -2.332e+00  1.471e-01 -15.855  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5240.4  on 4530  degrees of freedom
## Residual deviance: 4059.0  on 4507  degrees of freedom
## AIC: 4107
## 
## Number of Fisher Scoring iterations: 5
  0 1
0 1046 64
1 241 159
llh llhNull G2 McFadden r2ML r2CU
-2029 -2620 1181 0.2254 0.2295 0.3349

The accuracy level is 79.80%. McFadden \(R^2\) is 0.2254. As we can see on the plot below the area under the curve is 0.8022.

Please see below the results for K-fold cross validation using 10 iterations. T

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 1046   64
##          1  241  159
##                                          
##                Accuracy : 0.798          
##                  95% CI : (0.7769, 0.818)
##     No Information Rate : 0.8523         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.3959         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.8127         
##             Specificity : 0.7130         
##          Pos Pred Value : 0.9423         
##          Neg Pred Value : 0.3975         
##              Prevalence : 0.8523         
##          Detection Rate : 0.6927         
##    Detection Prevalence : 0.7351         
##       Balanced Accuracy : 0.7629         
##                                          
##        'Positive' Class : 0              
## 

By utlizing anovawe can analyze the deviance and observe that all terms are significant although significance for MSTATUS and TRAVTIME are less significant than all other terms.

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: TARGET_FLAG
## 
## Terms added sequentially (first to last)
## 
## 
##                   Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                               4530     5240.4              
## KIDSDRIV           1    31.64      4529     5208.8 1.860e-08 ***
## log(INCOME + 1)    1    45.45      4528     5163.3 1.570e-11 ***
## PARENT1            1    99.30      4527     5064.1 < 2.2e-16 ***
## log(HOME_VAL + 1)  1    45.03      4526     5019.0 1.938e-11 ***
## MSTATUS            1     3.56      4525     5015.5   0.05907 .  
## EDUCATION          4    85.47      4521     4930.0 < 2.2e-16 ***
## TRAVTIME           1     5.70      4520     4924.3   0.01697 *  
## CAR_USE            1    81.67      4519     4842.6 < 2.2e-16 ***
## BLUEBOOK           1    38.64      4518     4804.0 5.089e-10 ***
## TIF                1    27.91      4517     4776.1 1.272e-07 ***
## CAR_TYPE           5    67.15      4512     4708.9 4.014e-13 ***
## OLDCLAIM           1    49.73      4511     4659.2 1.766e-12 ***
## CLM_FREQ           1   109.12      4510     4550.1 < 2.2e-16 ***
## REVOKED            1    61.82      4509     4488.2 3.762e-15 ***
## MVR_PTS            1    69.84      4508     4418.4 < 2.2e-16 ***
## URBANICITY         1   359.42      4507     4059.0 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Variance inflation factors are low for all variables, so correlation is not high.

##                       GVIF Df GVIF^(1/(2*Df))
## KIDSDRIV          1.082636  1        1.040498
## log(INCOME + 1)   1.214555  1        1.102068
## PARENT1           1.418362  1        1.190950
## log(HOME_VAL + 1) 1.622557  1        1.273796
## MSTATUS           1.954360  1        1.397984
## EDUCATION         1.345798  4        1.037821
## TRAVTIME          1.033747  1        1.016734
## CAR_USE           1.527886  1        1.236077
## BLUEBOOK          1.565014  1        1.251005
## TIF               1.008322  1        1.004152
## CAR_TYPE          1.935421  5        1.068261
## OLDCLAIM          1.677986  1        1.295371
## CLM_FREQ          1.449134  1        1.203800
## REVOKED           1.339632  1        1.157425
## MVR_PTS           1.171422  1        1.082322
## URBANICITY        1.135802  1        1.065740

Coefficient Analysis

Crash is more likely to occur due to the following varibales : - Higher number of DMV points - Higher number of children driving - Single parents and unmarried individuals - Higher commuting distance - License revocation within the past 7 years - Higher number of claims within the past 5 years - Commercial vehicle use is more likely to result in a crash than private. - Rural environment is less likely to result in a crash than urban. - Individuals who have been customers longer are less likely to have a crash. - Higher Blue Book value makes it less likely to result in a crash. - Education makes it less likely to result in a crash. - Higher income and home value make it less likely to result in a crash. - Car type is hard to evaluate, but generally various types make it more likely to have a crash (to different degrees). - Interestingly, larger previous payout make it less likely to result in a crash.

Modelling: Linear Model

Linear modelling is used to predict the amount of the payout in case of a crash (TARGET_AMT). Only observations where a crash has occurred (TARGET_FLAG==1) are used in training the linear model. If there is no crash, payout will not be needed. As such if observations without a crash are included in the model, they may skew the results.Isolating all observations with a reported crash, leaves 1,601 observations.

As mentioned above the data is divided into a training set (75%; 1,200 observations) and a testing set (25%; 401 observations).

We have managed to build 5 models and compare them :

Model Adjusted R^2 Root-Mean Square Error
Model 1 0.01808 8416
Model 2 0.02218 8367
Model 3 0.01366 8344
Model 4 0.01351 8373
Model 5 0.01202 8336

Based on our observation we have decided to use model 5 due to the best RMSE value and its simplicity eventhough it had the worst adjusted \(R^2\). In the next section we will be focusing on model 5 for furher analysis.

## 
## Call:
## lm(formula = TARGET_AMT ~ BLUEBOOK, data = insLMtrain)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -7307  -2864  -1349    483  79170 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.041e+03  4.023e+02  10.044  < 2e-16 ***
## BLUEBOOK    1.013e-01  2.566e-02   3.948 8.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7063 on 1198 degrees of freedom
## Multiple R-squared:  0.01284,    Adjusted R-squared:  0.01202 
## F-statistic: 15.58 on 1 and 1198 DF,  p-value: 8.349e-05

…

Consider Box-Cox transformation (plot below is generated using boxcox from the MASS package).

if lambda is picked to be 0, then target variable should be log-transformed.

## 
## Call:
## lm(formula = log(TARGET_AMT) ~ BLUEBOOK, data = insLMtrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7770 -0.3877  0.0599  0.4177  3.0424 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 8.128e+00  4.592e-02 176.993  < 2e-16 ***
## BLUEBOOK    8.166e-06  2.929e-06   2.788  0.00539 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8062 on 1198 degrees of freedom
## Multiple R-squared:  0.006446,   Adjusted R-squared:  0.005617 
## F-statistic: 7.772 on 1 and 1198 DF,  p-value: 0.005389

This drastically reduced already low adjusted \(R^2\). However, Q-Q plot is improved. I

Robust/Quantile Regression

Looking at the scatterplot of BLUEBOOK vs log(TARGET_AMT), there is a lot of variance and a lot of outliers. It is possible that some points are leverage points that interfere with the model. Two additional model were built in an effort to account for that.

The first model was created using robust regression (rlm in the MASS package). The second model was created using quantile regression (rq in the quantreg package).

Original model has RMSE of 0.7815. RLM model very slightly improves it to 0.7811. Finally, RQ model also very slightly improves it to 0.7771. Fits for all three models are presented in the scatterplot below.

APPENDIX A: Evaluation Data Set

For the sake of clarfication we will be using the first 100 entries of the data; insurance-evaluation-data.csv file.

Evalution data is missing some INCOME and HOME_EVAL values that are used in the binary regression model. Rather than trying to impute those values by replacing them with 0 or mean or media values or by building a model to predict them, these values are left as they are and the model cannot be used to predict the outcome for corresponding observations.

Index TARGET_FLAG Prob. TARGET_FLAG TARGET_AMT
3 0.1816 0 NA
9 0.3604 0 NA
10 0.1493 0 NA
18 0.2319 0 NA
21 0.2756 0 NA
30 NA NA NA
31 0.4432 0 NA
37 0.4418 0 NA
39 0.0289 0 NA
47 0.1904 0 NA
60 0.0279 0 NA
62 0.5549 1 3812
63 NA NA NA
64 0.0792 0 NA
68 0.0261 0 NA
75 NA NA NA
76 0.671 1 3429
83 0.1779 0 NA
87 0.5449 1 3651
92 0.3123 0 NA
98 0.1521 0 NA
106 0.395 0 NA
107 0.0851 0 NA
113 0.3603 0 NA
120 0.2745 0 NA
123 0.4411 0 NA
125 0.4386 0 NA
126 0.3961 0 NA
128 0.1665 0 NA
129 0.1554 0 NA
131 0.221 0 NA
135 0.3263 0 NA
141 0.079 0 NA
147 0.1926 0 NA
148 0.1651 0 NA
151 0.0361 0 NA
156 0.1378 0 NA
157 0.1199 0 NA
174 0.0691 0 NA
186 0.5531 1 3668
193 0.2295 0 NA
195 0.5546 1 3831
212 0.0136 0 NA
213 0.5128 1 3920
217 0.0041 0 NA
223 0.2552 0 NA
226 0.1033 0 NA
228 0.2615 0 NA
230 0.0138 0 NA
241 0.5685 1 3610
243 0.1487 0 NA
249 0.3335 0 NA
281 0.7619 1 3691
288 0.1072 0 NA
294 NA NA NA
295 0.2027 0 NA
300 0.4442 0 NA
302 0.3606 0 NA
303 0.1077 0 NA
308 0.5506 1 3594
319 0.0118 0 NA
320 0.0944 0 NA
324 0.3479 0 NA
331 0.2167 0 NA
343 0.0511 0 NA
347 0.3246 0 NA
348 0.8159 1 3531
350 0.568 1 3967
357 0.1465 0 NA
358 0.058 0 NA
360 NA NA NA
366 0.1974 0 NA
367 0.7401 1 3580
368 0.2771 0 NA
376 0.7287 1 3645
380 0.3893 0 NA
388 0.3897 0 NA
396 0.252 0 NA
398 0.1245 0 NA
403 0.0448 0 NA
410 0.5654 1 3598
412 0.34 0 NA
420 0.2981 0 NA
434 0.0345 0 NA
440 0.4766 0 NA
450 0.5612 1 3887
453 0.2757 0 NA
464 0.2841 0 NA
465 0.0522 0 NA
466 NA NA NA
473 0.0847 0 NA
476 0.0903 0 NA
478 NA NA NA
479 0.1898 0 NA
493 0.0347 0 NA
497 0.2245 0 NA
503 0.0068 0 NA
504 0.3887 0 NA
505 0.3087 0 NA
507 0.2996 0 NA

APPENDIX B: R Script based on the requirements of the project

# Required libraries
library(knitr)
library(kableExtra)
library(gridExtra)
library(ggplot2)
library(dplyr)
library(caTools)
library(pscl)
library(ROCR)
library(MASS)
library(caret)
library(car)
library(Metrics)
library(quantreg)

# Import data
ins <- read.csv(url(paste0("https://raw.githubusercontent.com/",
                           "ilyakats/CUNY-DATA621/master/hw4/",
                           "insurance_training_data.csv")),
                na.strings=c("","NA"))

# Basic statistic
nrow(ins); ncol(ins)
summary(ins)

# TARGET_FLAG - 6008 are 0, 2153 are 1
table(ins$TARGET_FLAG)
class(ins$TARGET_FLAG) 
# Integer (0/1)

# TARGET_AMT
summary(ins[ins$TARGET_FLAG==0, 'TARGET_AMT'])
summary(ins[ins$TARGET_FLAG==1, 'TARGET_AMT'])
class(ins$TARGET_AMT) 
# Only available for TARGET_FLAG=1
# Numeric: Ranges from 30.28 to 107600

# KIDSDRIV - No of Driving Children
table(ins$KIDSDRIV)
class(ins$KIDSDRIV)
# Integer - Ranges from 0 to 4

# AGE
summary(ins$AGE)
class(ins$AGE)
# Integer - Ranges from 16 to 81
# 6 NAs

# HOMEKIDS
table(ins$HOMEKIDS)
class(ins$HOMEKIDS)
# Integer - Ranges from 0 to 5

# YOJ - Years on Job
summary(ins$YOJ)
class(ins$YOJ)
# Integer - Ranges from 0 to 23
# 454 NAs - 5.56% of observations

# INCOME
class(ins$INCOME)
summary(ins$INCOME)
# Convert to Numeric - Ranges from $0 to $367,000
# 445 NAs - 5.45% of observations

# PARENT1 - Single Parent?
table(ins$PARENT1)
class(ins$PARENT1); levels(ins$PARENT1)
# Factor - No, Yes

# HOME_VAL - Home Value
class(ins$HOME_VAL)
summary(ins$HOME_VAL)
# Converted to Numeric - Ranges from $0 to $885,300
# 464 NAs - 5.69% of observations

# MSTATUS
table(ins$MSTATUS)
class(ins$MSTATUS); levels(ins$MSTATUS)
# Factor - Yes, No

# SEX
table(ins$SEX)
class(ins$SEX); levels(ins$SEX)
# Factor - M, F

# EDUCATION
table(ins$EDUCATION)
class(ins$EDUCATION); levels(ins$EDUCATION)
# Factor - <HS, HS, BA, MA, PhD

# JOB
table(ins$JOB)
class(ins$JOB); levels(ins$JOB)
# Factor - [Blank], Clerical, Doctor, Home Maker, Lawyer, Manager, 
# Professional, Student, Blue Collar

# TRAVTIME - Distance to work
summary(ins$TRAVTIME)
class(ins$TRAVTIME)
# Integer - Ranges from 5 to 142

# CAR_USE
class(ins$CAR_USE); levels(ins$CAR_USE)
table(ins$CAR_USE)
# Factor - Commercial, Private

# BLUEBOOK
class(ins$BLUEBOOK)
summary(ins$BLUEBOOK)
# Numeric - Ranges from $1,500 to $69,740

# TIF - Time in Force
class(ins$TIF)
summary(ins$TIF)
# Integer - Ranges from 1 to 25

# CAR_TYPE
class(ins$CAR_TYPE); levels(ins$CAR_TYPE)
table(ins$CAR_TYPE)
# FActor - Minivan, Panel Truck, Pickup, Sports Car, Van, SUV

# RED_CAR
class(ins$RED_CAR); levels(ins$RED_CAR)
table(ins$RED_CAR)
# Factor - No, Yes

# OLDCLAIM
class(ins$OLDCLAIM)
summary(ins$OLDCLAIM)
# Numeric - Ranges from $0 to $57,040

# CLM_FREQ
class(ins$CLM_FREQ)
summary(ins$CLM_FREQ)
# Integer - Ranges from 0 to 5

# REVOKED
class(ins$REVOKED); levels(ins$REVOKED)
table(ins$REVOKED)
# Factor - No, Yes

# MVR_PTS
class(ins$MVR_PTS)
summary(ins$MVR_PTS)
# Integer - Ranges from 0 to 13

# CAR_AGE
class(ins$CAR_AGE)
summary(ins$CAR_AGE)
nrow(ins[ins$CAR_AGE<0 & !is.na(ins$CAR_AGE), ])
nrow(ins[ins$CAR_AGE==0 & !is.na(ins$CAR_AGE), ])
nrow(ins[ins$CAR_AGE==1 & !is.na(ins$CAR_AGE), ])
# Integer - Ranges from -3 to 28
# 1 observation of -3 - invalid
# 3 observations of 0 - likely invalid
# 1,934 observations of 1 - reasonable (new car)
# 510 NAs - 6.25% of observations

# URBANICITY
class(ins$URBANICITY); levels(ins$URBANICITY)
table(ins$URBANICITY)
# Factor - Urban, Rural

ins$INCOME <- as.numeric(gsub('[$,]', '', ins$INCOME))
ins$HOME_VAL <- as.numeric(gsub('[$,]', '', ins$HOME_VAL))
levels(ins$MSTATUS)[match("z_No",levels(ins$MSTATUS))] <- "No"
levels(ins$SEX)[match("z_F",levels(ins$SEX))] <- "F"
levels(ins$EDUCATION)[match("z_High School",
                            levels(ins$EDUCATION))] <- "High School"
ins$EDUCATION <- factor(ins$EDUCATION,levels(ins$EDUCATION)[c(1,5,2:4)])
levels(ins$JOB)[match("z_Blue Collar",levels(ins$JOB))] <- "Blue Collar"
ins$BLUEBOOK <- as.numeric(gsub('[$,]', '', ins$BLUEBOOK))
levels(ins$CAR_TYPE)[match("z_SUV",levels(ins$CAR_TYPE))] <- "SUV"
levels(ins$RED_CAR)[match("no",levels(ins$RED_CAR))] <- "No"
levels(ins$RED_CAR)[match("yes",levels(ins$RED_CAR))] <- "Yes"
ins$OLDCLAIM <- as.numeric(gsub('[$,]', '', ins$OLDCLAIM))
levels(ins$URBANICITY)[match("Highly Urban/ Urban",
                             levels(ins$URBANICITY))] <- "Urban"
levels(ins$URBANICITY)[match("z_Highly Rural/ Rural",
                             levels(ins$URBANICITY))] <- "Rural"
ins$JOB <- factor(ins$JOB,levels(ins$JOB)[c(7, 8, 3, 1, 6, 5, 4, 2)])

# Drop index column
ins <- ins[-c(1)]

insFull <- ins

ins <- ins[complete.cases(ins), ]
ins[ins$CAR_AGE<1,'CAR_AGE'] <- NA
ins <- ins[complete.cases(ins), ]
# Cuts down from 8,161 to 6,045

# Get only complete cases
nrow(ins[complete.cases(ins), ])
nrow(ins)

insBackup <- ins

# Summary table
sumIns = data.frame(Variable = character(),
                    Min = integer(),
                    Median = integer(),
                    Mean = double(),
                    SD = double(),
                    Max = integer(),
                    Num_NAs = integer(),
                    Num_Zeros = integer())
for (i in c(3:7,9,14,16,17,20,21,23,24)) {
  sumIns <- rbind(sumIns, data.frame(Variable = colnames(ins)[i],
                                     Min = min(ins[,i], na.rm=TRUE),
                                     Median = median(ins[,i], na.rm=TRUE),
                                     Mean = mean(ins[,i], na.rm=TRUE),
                                     SD = sd(ins[,i], na.rm=TRUE),
                                     Max = max(ins[,i], na.rm=TRUE),
                                     Num_NAs = sum(is.na(ins[,i])),
                                     Num_Zeros = length(which(ins[,i]==0)))
  )
}
colnames(sumIns) <- c("", "Min", "Median", "Mean", "SD", "Max", "Num of NAs", 
                      "Num of Zeros")
sumIns

# Proportion of target variable
table(ins$TARGET_FLAG)
table(ins$TARGET_FLAG)/sum(table(ins$TARGET_FLAG))

# Exploratory plots (repeated for each variable)
# Get descriptive plots:
# Variables: 
# INDEX, TARGET_FLAG, TARGET_AMT, KIDSDRIV, AGE, HOMEKIDS, YOJ, INCOME, 
# PARENT1, HOME_VAL, MSTATUS, SEX, EDUCATION, JOB, TRAVTIME, CAR_USE, 
# BLUEBOOK, TIF, CAR_TYPE, RED_CAR, OLDCLAIM, CLM_FREQ, REVOKED, MVR_PTS, 
# CAR_AGE, URBANICITY, 
v <- "TARGET_AMT" # Variable to view
pd <- as.data.frame(cbind(ins[, v], ins$TARGET_FLAG))
colnames(pd) <- c("X", "Y")

# Boxplot
bp <- ggplot(pd, aes(x = 1, y = X)) + 
  stat_boxplot(geom ='errorbar') + geom_boxplot() + 
  xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(), 
                                     axis.ticks.x=element_blank())

# Density plot
hp <- ggplot(pd, aes(x = X)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white") +
  geom_density(alpha=.2, fill="#FF6666") + 
  ylab("") + xlab("Density Plot with Mean") +
  geom_vline(aes(xintercept=mean(X, na.rm=TRUE)), color="red", 
             linetype="dashed", size=1)

# Scatterplot
sp <- ggplot(pd, aes(x=X, y=Y)) + 
  geom_point() +   
  stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE) +
  xlab("Scatterplot with Logistic Regression Line")

grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))

# Correlation matrix
cm <- cor(ins, use="pairwise.complete.obs")
cm <- round(cm, 2)
cmout <- as.data.frame(cm) %>% mutate_all(function(x) {
  cell_spec(x, "html", color = ifelse(x>0.5 | x<(-0.5),"blue","black"))
  })
rownames(cmout) <- colnames(cmout)
cmout %>%
  kable("html", escape = F, align = "c", row.names = TRUE) %>%
  kable_styling("striped", full_width = F)

pairs(ins)

# Split into train and validation sets
set.seed(88)
split <- sample.split(ins$TARGET_FLAG, SplitRatio = 0.75)
insTRAIN <- subset(ins, split == TRUE)
insTEST <- subset(ins, split == FALSE)

# BINARY REGRESSION MODEL

# Modelling - Basic model
model <- glm (TARGET_FLAG ~ .-TARGET_AMT, data = insTRAIN, 
              family = binomial(link="logit"))
summary(model)
pred <- predict(model, newdata=subset(insTEST, select=c(1:25)), 
                type='response')
cm <- confusionMatrix(as.factor(insTEST$TARGET_FLAG), 
                      as.factor(ifelse(pred > 0.5,1,0)))
cm$table
cm$overall['Accuracy']
pR2(model) # McFadden R^2

# Stepwise approach
model <- stepAIC(model, trace=FALSE, direction='both')

# Model tweaking
model <- glm(formula = TARGET_FLAG ~ KIDSDRIV + log(INCOME+1) + PARENT1 + 
               log(HOME_VAL+1) + MSTATUS + EDUCATION + JOB + TRAVTIME + 
               CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + 
               REVOKED + MVR_PTS + URBANICITY, 
             family = binomial(link = "logit"), data = insTRAIN)

# ROC
pr <- prediction(pred, insTEST$TARGET_FLAG)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, colorize = TRUE, text.adj = c(-0.2,1.7))
auc <- performance(pr, measure = "auc")
(auc <- auc@y.values[[1]])

# K-Fold cross validation
ctrl <- trainControl(method = "repeatedcv", number = 10, 
                     savePredictions = TRUE)
model_fit <- train(TARGET_FLAG ~ KIDSDRIV + log(INCOME+1) + PARENT1 + 
                     log(HOME_VAL+1) + MSTATUS + EDUCATION + TRAVTIME + 
                     CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + 
                     CLM_FREQ + REVOKED + MVR_PTS + URBANICITY,  
                   data=insTRAIN, method="glm", family="binomial",
                   trControl = ctrl, tuneLength = 5)
pred <- predict(model_fit, newdata=insTEST)
confusionMatrix(as.factor(insTEST$TARGET_FLAG),
                as.factor(ifelse(pred > 0.5,1,0)))

# Deviance residuals
anova(model, test="Chisq")

# VIF
vif(model)
# Take out JOB 
ggplot(data = ins, aes(JOB, EDUCATION)) +
  geom_jitter()
model <- glm(formula = TARGET_FLAG ~ KIDSDRIV + log(INCOME+1) + PARENT1 + 
               log(HOME_VAL+1) + MSTATUS + EDUCATION + TRAVTIME + CAR_USE + 
               BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED + 
               MVR_PTS + URBANICITY, 
             family = binomial(link = "logit"), data = insTRAIN)

# LINEAR MODEL

insLM <- ins
insLM <- ins[ins$TARGET_FLAG==1,]

# Split into training and testing sets
split <- sample.split(insLM$TARGET_AMT, SplitRatio = 0.75)
insLMtrain <- subset(insLM, split == TRUE)
insLMtest <- subset(insLM, split == FALSE)

# Initial models
lmModel <- lm(TARGET_AMT ~ .-TARGET_FLAG,data = insLMtrain)
summary(lmModel)
lmModel <- stepAIC(lmModel, trace=FALSE, direction='both')
summary(lmModel)
lmModel <- lm(TARGET_AMT ~ PARENT1 + MSTATUS + BLUEBOOK + OLDCLAIM + 
                CLM_FREQ + REVOKED + JOB,data = insLMtrain)
summary(lmModel)
lmModel <- lm(TARGET_AMT ~ BLUEBOOK + OLDCLAIM + CLM_FREQ + REVOKED,
              data = insLMtrain)
summary(lmModel)
lmModel <- lm(TARGET_AMT ~ BLUEBOOK + CAR_AGE + CAR_TYPE,data = insLMtrain)
summary(lmModel)
lmModel <- lm(TARGET_AMT ~ BLUEBOOK,data = insLMtrain)
summary(lmModel)

# Calculate RMSE
pred <- predict(lmModel, newdata=insLMtest)
rmse(insLMtest$TARGET_AMT, pred)

# Model plots
plot(lmModel$residuals, ylab="Residuals")
abline(h=0)

plot(lmModel$fitted.values, lmModel$residuals, 
     xlab="Fitted Values", ylab="Residuals")
abline(h=0)

qqnorm(lmModel$residuals)
qqline(lmModel$residuals)

boxcox(lmModel)

lmModel <- lm(log(TARGET_AMT) ~ BLUEBOOK, data = insLMtrain)
summary(lmModel)
pred <- predict(lmModel, newdata=insLMtest)
rmse(log(insLMtest$TARGET_AMT), pred)
lmModel2 <- rlm(log(TARGET_AMT) ~ BLUEBOOK, data = insLMtrain)
summary(lmModel2)
pred <- predict(lmModel2, newdata=insLMtest)
rmse(log(insLMtest$TARGET_AMT), pred)
lmModel3 <- rq(log(TARGET_AMT) ~ BLUEBOOK, data = insLMtrain)
summary(lmModel3)
pred <- predict(lmModel3, newdata=insLMtest)
rmse(log(insLMtest$TARGET_AMT), pred)


plot(log(TARGET_AMT) ~ BLUEBOOK, data = insLMtrain)
abline(lmModel)
abline(lmModel2, col="red")
abline(lmModel3, col="blue")
legend("topright", inset=0.05, bty="n",
       legend=c("lm fit", "rlm fit", "rq fit"),
       lty=c(1,1,1),
       col=c("black", "red", "blue"))

# Prediction
eval <- read.csv(url(paste0("https://raw.githubusercontent.com/",
                            "ilyakats/CUNY-DATA621/master/hw4/",
                            "insurance-evaluation-data.csv")),
                 na.strings=c("","NA"))
results <- eval[,1]
eval$INCOME <- as.numeric(gsub('[$,]', '', eval$INCOME))
eval$HOME_VAL <- as.numeric(gsub('[$,]', '', eval$HOME_VAL))
levels(eval$MSTATUS)[match("z_No",levels(eval$MSTATUS))] <- "No"
levels(eval$SEX)[match("z_F",levels(eval$SEX))] <- "F"
levels(eval$EDUCATION)[match("z_High School",
                             levels(eval$EDUCATION))] <- "High School"
eval$EDUCATION <- factor(eval$EDUCATION,levels(eval$EDUCATION)[c(1,5,2:4)])
levels(eval$JOB)[match("z_Blue Collar",levels(eval$JOB))] <- "Blue Collar"
eval$BLUEBOOK <- as.numeric(gsub('[$,]', '', eval$BLUEBOOK))
levels(eval$CAR_TYPE)[match("z_SUV",levels(eval$CAR_TYPE))] <- "SUV"
levels(eval$RED_CAR)[match("no",levels(eval$RED_CAR))] <- "No"
levels(eval$RED_CAR)[match("yes",levels(eval$RED_CAR))] <- "Yes"
eval$OLDCLAIM <- as.numeric(gsub('[$,]', '', eval$OLDCLAIM))
levels(eval$URBANICITY)[match("Highly Urban/ Urban",
                              levels(eval$URBANICITY))] <- "Urban"
levels(eval$URBANICITY)[match("z_Highly Rural/ Rural",
                              levels(eval$URBANICITY))] <- "Rural"
eval$JOB <- factor(eval$JOB,levels(eval$JOB)[c(7, 8, 3, 1, 6, 5, 4, 2)])
eval <- eval[-c(1)]

pred <- predict(model, newdata=eval, type="response")
results <- cbind(results, prob=round(pred,4))
results <- cbind(results, predict=round(pred,0))

pred <- predict(lmModel, newdata=eval, type="response")
results <- cbind(results, exp(pred))
results <- as.data.frame(results)

results[results$predict==0 & !is.na(results$predict),'V4'] <- NA
results[is.na(results$predict),'V4'] <- NA
colnames(results) <- c("Index", "TARGET_FLAG Prob.", 
                       "TARGET_FLAG", "TARGET_AMT")
pander(head(results, 100))