The main purpose of this homework is to build two models: a binary regression model to predict whether a vehicle will be involved in a crash, and a linear regression model to predict the likely payout.
The binary model generated fairly good results. With an accuracy of approximately 80%, it clearly outperforms simply guessing the outcome. We note that this accuracy level may not hold in the real world given the complexity of the car insurance business and the diversity of reasons for claims, but we believe the results of this project are reasonable for the required model.
Based on our observations, the linear model has problems. Its significant predictor is the Blue Book value, which payouts depend on; however, with an \(R^2\) value of 0.005, the model barely explains any variance in the outcome variable. Adding other variables that should influence the payout amount, such as CAR_AGE, did not have a significant effect.
The model was created using only observations for vehicles involved in a crash (TARGET_FLAG==1). It is possible to fit a linear regression model on all data, without this limitation; \(R^2\) then improves significantly (about 0.27 in some tests). From this we understand that the $0 payouts (vehicles not involved in a crash) do not help fit the actual payouts.
Looking at the relationship between Blue Book value and payout amount, there is a high amount of variability. There are many observations with low vehicle value but high payout, which suggests that some critical data needed to accurately predict payouts is missing. We should also not ignore variables that would dramatically influence the payout amount: the severity of the crash, the number of vehicles involved, and whether the vehicle was totaled.

## Data Exploration
The data set includes 8,161 observations with 24 variables.
| Variable | Type | Description | Comments |
|---|---|---|---|
| KIDSDRIV | Integer | No of children driving. | Ranges from 0 to 4. |
| AGE | Integer | Age of driver. | Ranges from 16 to 81. Contains 6 NAs. |
| HOMEKIDS | Integer | No of children at home. | Ranges from 0 to 5. |
| YOJ | Integer | Years on the job. | Ranges from 0 to 23. Contains 454 NAs (about 5.56% of all observations). |
| INCOME | Numeric | Income. | Ranges from $0 to $367,000. Contains 445 NAs (about 5.45% of all observations). Was converted to numeric by removing dollar signs and commas. |
| PARENT1 | Factor | Single parent flag. | Values: No, Yes. |
| HOME_VAL | Numeric | Home value. | Ranges from $0 to $885,000. Contains 464 NAs (about 5.69% of all observations). Was converted to numeric by removing dollar signs and commas. |
| MSTATUS | Factor | Married flag. | Values: Yes, No. |
| SEX | Factor | Gender. | Values: M, F. |
| EDUCATION | Factor | Maximum education level. | Values: <High School, High School, Bachelors, Masters, PhD. |
| JOB | Factor | Job category. | Values: [Blank], Clerical, Doctor, Home Maker, Lawyer, Manager, Professional, Student, Blue Collar. |
| TRAVTIME | Integer | Distance to work. | Ranges from 5 to 142. |
| CAR_USE | Factor | Vehicle use. | Values: Commercial, Private. |
| BLUEBOOK | Numeric | Vehicle value. | Ranges from $1,500 to $69,740. Was converted to numeric by removing dollar signs and commas. |
| TIF | Integer | Time in force. | Ranges from 1 to 25. |
| CAR_TYPE | Factor | Vehicle type. | Values: Minivan, Panel Truck, Pickup, Sports Car, Van, SUV. |
| RED_CAR | Factor | Red car flag. | Values: No, Yes |
| OLDCLAIM | Numeric | Total payout of claims. | Ranges from $0 to $57,040. Was converted to numeric by removing dollar signs and commas. |
| CLM_FREQ | Integer | No of claims (past 5 years). | Ranges from 0 to 5. |
| REVOKED | Factor | Revoked license flag. | Values: No, Yes. |
| MVR_PTS | Integer | Motor vehicle record points. | Ranges from 0 to 13. |
| CAR_AGE | Integer | Vehicle age. | Ranges from -3 to 28. Contains 510 NAs (about 6.25% of all observations). |
| URBANICITY | Factor | Home/work area. | Values: Urban, Rural. |
The table below summarizes the numeric variables.
| Variable | Min | Median | Mean | SD | Max | Num of NAs | Num of Zeros |
|---|---|---|---|---|---|---|---|
| KIDSDRIV | 0 | 0 | 0.1711 | 0.5115 | 4 | 0 | 7180 |
| AGE | 16 | 45 | 44.79 | 8.628 | 81 | 6 | 0 |
| HOMEKIDS | 0 | 0 | 0.7212 | 1.116 | 5 | 0 | 5289 |
| YOJ | 0 | 11 | 10.5 | 4.092 | 23 | 454 | 625 |
| INCOME | 0 | 54028 | 61898 | 47573 | 367030 | 445 | 615 |
| HOME_VAL | 0 | 161160 | 154867 | 129124 | 885282 | 464 | 2294 |
| TRAVTIME | 5 | 33 | 33.49 | 15.91 | 142 | 0 | 0 |
| BLUEBOOK | 1500 | 14440 | 15710 | 8420 | 69740 | 0 | 0 |
| TIF | 1 | 4 | 5.351 | 4.147 | 25 | 0 | 0 |
| OLDCLAIM | 0 | 0 | 4037 | 8777 | 57037 | 0 | 5009 |
| CLM_FREQ | 0 | 0 | 0.7986 | 1.158 | 5 | 0 | 5009 |
| MVR_PTS | 0 | 1 | 1.696 | 2.147 | 13 | 0 | 3712 |
| CAR_AGE | -3 | 8 | 8.328 | 5.701 | 28 | 510 | 3 |
As the table above shows, a significant number of observations have a value of 0, and there is a logical explanation that these observations are valid (see the list below):
- KIDSDRIV and HOMEKIDS: households without children
- YOJ and INCOME: unemployed individuals
- HOME_VAL: renters
- OLDCLAIM, CLM_FREQ and MVR_PTS: safe drivers with no claims or DMV points

Several variables (AGE, YOJ, INCOME, HOME_VAL, CAR_AGE) contained some NA values.
After removing incomplete observations, 6,041 observations remain, which is sufficient for our analysis.
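The filtering step, as a minimal sketch mirroring the appendix code (invalid CAR_AGE values below 1, including the -3, are also treated as missing):

```r
# Keep complete cases, flag invalid vehicle ages as missing, then
# drop the rows that the CAR_AGE fix invalidates
ins <- ins[complete.cases(ins), ]
ins[ins$CAR_AGE < 1, "CAR_AGE"] <- NA
ins <- ins[complete.cases(ins), ]
nrow(ins)  # 6,041 observations remain
```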
Some factor levels were cleaned up (for example, removing the z_ prefix used in the raw data) for MSTATUS, EDUCATION, JOB, CAR_TYPE and URBANICITY. EDUCATION has been re-ordered to follow the most common order: <High School, High School, Bachelors, Masters, and PhD. JOB has been re-ordered to follow the general order from low-paying to high-paying occupations: Student, Blue Collar, Home Maker, Clerical, Professional, Manager, Lawyer, and Doctor. Counts for the factor and count variables are below.

PARENT1:

| No | Yes |
|---|---|
| 5219 | 822 |
MSTATUS:

| Yes | No |
|---|---|
| 3594 | 2447 |
SEX:

| M | F |
|---|---|
| 2683 | 3358 |
EDUCATION:

| <High School | High School | Bachelors | Masters | PhD |
|---|---|---|---|---|
| 955 | 1871 | 1739 | 1060 | 416 |
JOB:

| Student | Blue Collar | Home Maker | Clerical | Professional | Manager | Lawyer | Doctor |
|---|---|---|---|---|---|---|---|
| 537 | 1476 | 484 | 1030 | 867 | 778 | 670 | 199 |
CAR_USE:

| Commercial | Private |
|---|---|
| 2040 | 4001 |
CAR_TYPE:

| Minivan | Panel Truck | Pickup | Sports Car | Van | SUV |
|---|---|---|---|---|---|
| 1699 | 347 | 1012 | 714 | 488 | 1781 |
RED_CAR:

| No | Yes |
|---|---|
| 4350 | 1691 |
REVOKED:

| No | Yes |
|---|---|
| 5297 | 744 |
URBANICITY:

| Urban | Rural |
|---|---|
| 4742 | 1299 |
KIDSDRIV:

| 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 5309 | 473 | 207 | 50 | 2 |
HOMEKIDS:

| 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 3872 | 667 | 837 | 525 | 129 | 11 |
CLM_FREQ:

| 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 3758 | 715 | 848 | 568 | 141 | 11 |
TARGET_FLAG: the target variable for the binary regression model, a flag identifying whether a vehicle was involved in a crash. There are 4,440 observations with a TARGET_FLAG of 0 and 1,601 observations with a TARGET_FLAG of 1, making it roughly a 75/25 split (more precisely, 73.5% 0s and 26.5% 1s).
TARGET_AMT: the target variable for the linear regression model, representing the cost if a vehicle was involved in a crash. The value is present only for observations with a TARGET_FLAG of 1. The distribution of TARGET_AMT has a long right tail. There are no missing values.
| Min | Median | Mean | SD | Max | Num_NAs | Num_Zeros |
|---|---|---|---|---|---|---|
| 30.28 | 4136 | 5586 | 7440 | 85524 | 0 | 0 |
## Binary Regression Model

The aim of this model is to predict whether a vehicle will be involved in a crash. The dependent variable, TARGET_FLAG, is binary. For this project it is assumed that observations are independent of each other, as there is no reason to believe otherwise.
The main data set is split into training and testing sets to test the accuracy of the model. The training set includes 75% of randomly chosen observations (4,531), while the testing set includes the remaining 25% (1,510).
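A sketch of the split, mirroring the appendix code (same seed and ratio):

```r
library(caTools)

# Stratified 75/25 split on the target flag
set.seed(88)
split    <- sample.split(ins$TARGET_FLAG, SplitRatio = 0.75)
insTRAIN <- subset(ins, split == TRUE)
insTEST  <- subset(ins, split == FALSE)
```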
The starting point is a model that includes all independent variables. It has an AIC of 4085.5 and an accuracy of 79.47%. Its summary is below.
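The fit producing this summary, sketched from the appendix (insTRAIN comes from the split above):

```r
# Baseline logistic regression on all predictors except the payout target
model <- glm(TARGET_FLAG ~ . - TARGET_AMT, data = insTRAIN,
             family = binomial(link = "logit"))
summary(model)
```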
##
## Call:
## glm(formula = TARGET_FLAG ~ . - TARGET_AMT, family = binomial(link = "logit"),
## data = insTRAIN)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2532 -0.7047 -0.3874 0.6212 2.9097
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.057e-01 3.474e-01 -0.880 0.378824
## KIDSDRIV 3.254e-01 8.162e-02 3.987 6.70e-05 ***
## AGE -6.524e-03 5.457e-03 -1.195 0.231914
## HOMEKIDS 1.095e-02 4.955e-02 0.221 0.825173
## YOJ -4.818e-03 1.137e-02 -0.424 0.671742
## INCOME -3.061e-06 1.632e-06 -1.876 0.060660 .
## PARENT1Yes 3.940e-01 1.454e-01 2.709 0.006749 **
## HOME_VAL -1.383e-06 4.914e-07 -2.815 0.004879 **
## MSTATUSNo 4.710e-01 1.165e-01 4.043 5.28e-05 ***
## SEXF -1.745e-01 1.504e-01 -1.160 0.245995
## EDUCATIONHigh School -4.663e-02 1.224e-01 -0.381 0.703169
## EDUCATIONBachelors -5.190e-01 1.524e-01 -3.405 0.000662 ***
## EDUCATIONMasters -6.353e-01 2.499e-01 -2.542 0.011006 *
## EDUCATIONPhD -6.920e-02 3.049e-01 -0.227 0.820482
## JOBBlue Collar 9.702e-02 1.762e-01 0.551 0.581935
## JOBHome Maker -8.012e-02 2.029e-01 -0.395 0.692862
## JOBClerical 2.432e-01 1.762e-01 1.380 0.167584
## JOBProfessional -9.445e-03 2.109e-01 -0.045 0.964277
## JOBManager -6.364e-01 2.306e-01 -2.760 0.005782 **
## JOBLawyer 2.617e-01 2.815e-01 0.930 0.352593
## JOBDoctor -5.717e-01 4.068e-01 -1.405 0.159912
## TRAVTIME 1.420e-02 2.536e-03 5.599 2.15e-08 ***
## CAR_USEPrivate -8.163e-01 1.224e-01 -6.669 2.57e-11 ***
## BLUEBOOK -2.233e-05 7.034e-06 -3.174 0.001503 **
## TIF -5.337e-02 9.914e-03 -5.383 7.32e-08 ***
## CAR_TYPEPanel Truck 6.422e-01 2.271e-01 2.827 0.004694 **
## CAR_TYPEPickup 5.283e-01 1.322e-01 3.995 6.47e-05 ***
## CAR_TYPESports Car 1.104e+00 1.696e-01 6.512 7.42e-11 ***
## CAR_TYPEVan 4.579e-01 1.759e-01 2.603 0.009232 **
## CAR_TYPESUV 7.862e-01 1.456e-01 5.398 6.74e-08 ***
## RED_CARYes -1.989e-01 1.199e-01 -1.658 0.097220 .
## OLDCLAIM -1.796e-05 5.177e-06 -3.468 0.000524 ***
## CLM_FREQ 2.075e-01 3.832e-02 5.416 6.10e-08 ***
## REVOKEDYes 8.732e-01 1.218e-01 7.167 7.67e-13 ***
## MVR_PTS 1.147e-01 1.851e-02 6.197 5.75e-10 ***
## CAR_AGE 7.534e-04 1.030e-02 0.073 0.941712
## URBANICITYRural -2.360e+00 1.463e-01 -16.129 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5240.4 on 4530 degrees of freedom
## Residual deviance: 4011.5 on 4494 degrees of freedom
## AIC: 4085.5
##
## Number of Fisher Scoring iterations: 5
Running this model through the stepwise algorithm (stepAIC from the MASS package) removed AGE, HOMEKIDS, YOJ, SEX, RED_CAR and CAR_AGE. AIC is reduced slightly to 4078.2, and accuracy improves very slightly to 79.54%.
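The selection step, as in the appendix:

```r
library(MASS)

# Stepwise selection in both directions; keeps the lowest-AIC model
model <- stepAIC(model, direction = "both", trace = FALSE)
```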
INCOME and HOME_VAL are heavily right-skewed. To make their distributions more normal, they are log-transformed (adding 1 so that the transformation is defined for 0 values). The new model again has a slightly lower AIC of 4073.2 and slightly higher accuracy at 79.73%.
##
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + log(INCOME + 1) + PARENT1 +
## log(HOME_VAL + 1) + MSTATUS + EDUCATION + JOB + TRAVTIME +
## CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ +
## REVOKED + MVR_PTS + URBANICITY, family = binomial(link = "logit"),
## data = insTRAIN)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3062 -0.7025 -0.3932 0.6137 2.9590
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.484e-01 2.494e-01 -1.397 0.162383
## KIDSDRIV 3.248e-01 7.318e-02 4.439 9.05e-06 ***
## log(INCOME + 1) -6.371e-02 1.901e-02 -3.351 0.000805 ***
## PARENT1Yes 4.522e-01 1.255e-01 3.604 0.000313 ***
## log(HOME_VAL + 1) -3.312e-02 9.474e-03 -3.496 0.000473 ***
## MSTATUSNo 3.924e-01 1.145e-01 3.426 0.000613 ***
## EDUCATIONHigh School -5.933e-02 1.221e-01 -0.486 0.626878
## EDUCATIONBachelors -5.946e-01 1.387e-01 -4.288 1.80e-05 ***
## EDUCATIONMasters -7.595e-01 2.179e-01 -3.485 0.000492 ***
## EDUCATIONPhD -3.272e-01 2.696e-01 -1.214 0.224811
## JOBBlue Collar 3.822e-01 2.018e-01 1.894 0.058233 .
## JOBHome Maker 6.055e-02 2.130e-01 0.284 0.776192
## JOBClerical 6.048e-01 2.050e-01 2.950 0.003175 **
## JOBProfessional 2.447e-01 2.315e-01 1.057 0.290386
## JOBManager -4.020e-01 2.480e-01 -1.621 0.105038
## JOBLawyer 5.227e-01 2.983e-01 1.752 0.079716 .
## JOBDoctor -3.706e-01 4.133e-01 -0.897 0.369957
## TRAVTIME 1.422e-02 2.540e-03 5.599 2.16e-08 ***
## CAR_USEPrivate -8.122e-01 1.225e-01 -6.632 3.32e-11 ***
## BLUEBOOK -2.595e-05 6.304e-06 -4.116 3.85e-05 ***
## TIF -5.211e-02 9.911e-03 -5.258 1.45e-07 ***
## CAR_TYPEPanel Truck 6.821e-01 2.146e-01 3.179 0.001478 **
## CAR_TYPEPickup 5.412e-01 1.321e-01 4.097 4.18e-05 ***
## CAR_TYPESports Car 1.057e+00 1.392e-01 7.594 3.10e-14 ***
## CAR_TYPEVan 4.567e-01 1.709e-01 2.673 0.007516 **
## CAR_TYPESUV 7.558e-01 1.120e-01 6.750 1.48e-11 ***
## OLDCLAIM -1.792e-05 5.186e-06 -3.455 0.000551 ***
## CLM_FREQ 2.081e-01 3.827e-02 5.439 5.37e-08 ***
## REVOKEDYes 8.647e-01 1.219e-01 7.092 1.32e-12 ***
## MVR_PTS 1.148e-01 1.853e-02 6.196 5.79e-10 ***
## URBANICITYRural -2.391e+00 1.471e-01 -16.254 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5240.4 on 4530 degrees of freedom
## Residual deviance: 4011.2 on 4500 degrees of freedom
## AIC: 4073.2
##
## Number of Fisher Scoring iterations: 5
Analyzing the model with various evaluation methods revealed a potential issue with variance inflation (using vif from the car package).
## GVIF Df GVIF^(1/(2*Df))
## KIDSDRIV 1.088467 1 1.043296
## log(INCOME + 1) 2.530758 1 1.590836
## PARENT1 1.427433 1 1.194752
## log(HOME_VAL + 1) 1.965195 1 1.401854
## MSTATUS 2.118875 1 1.455636
## EDUCATION 7.054761 4 1.276616
## JOB 29.531678 7 1.273563
## TRAVTIME 1.034921 1 1.017311
## CAR_USE 2.336829 1 1.528669
## BLUEBOOK 1.575884 1 1.255342
## TIF 1.010050 1 1.005012
## CAR_TYPE 2.221595 5 1.083095
## OLDCLAIM 1.679788 1 1.296066
## CLM_FREQ 1.449881 1 1.204110
## REVOKED 1.341382 1 1.158181
## MVR_PTS 1.174200 1 1.083605
## URBANICITY 1.142851 1 1.069042
There is a possible correlation between JOB and EDUCATION.
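A quick visual check of that relationship, as done in the appendix:

```r
library(ggplot2)

# Jittered scatterplot of job category against education level
ggplot(data = ins, aes(JOB, EDUCATION)) +
  geom_jitter()
```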
Removing JOB gives the final generalized linear model: TARGET_FLAG ~ KIDSDRIV + log(INCOME+1) + PARENT1 + log(HOME_VAL+1) + MSTATUS + EDUCATION + TRAVTIME + CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY. Its AIC is 4107.
##
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + log(INCOME + 1) + PARENT1 +
## log(HOME_VAL + 1) + MSTATUS + EDUCATION + TRAVTIME + CAR_USE +
## BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED +
## MVR_PTS + URBANICITY, family = binomial(link = "logit"),
## data = insTRAIN)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3080 -0.7092 -0.4102 0.6195 2.9691
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.012e-01 2.367e-01 -0.850 0.395436
## KIDSDRIV 3.163e-01 7.239e-02 4.369 1.25e-05 ***
## log(INCOME + 1) -4.046e-02 1.317e-02 -3.073 0.002119 **
## PARENT1Yes 4.630e-01 1.241e-01 3.730 0.000192 ***
## log(HOME_VAL + 1) -2.605e-02 8.541e-03 -3.050 0.002291 **
## MSTATUSNo 4.317e-01 1.092e-01 3.952 7.75e-05 ***
## EDUCATIONHigh School -1.621e-01 1.183e-01 -1.371 0.170482
## EDUCATIONBachelors -8.241e-01 1.235e-01 -6.673 2.50e-11 ***
## EDUCATIONMasters -8.845e-01 1.426e-01 -6.202 5.59e-10 ***
## EDUCATIONPhD -8.916e-01 1.931e-01 -4.617 3.89e-06 ***
## TRAVTIME 1.438e-02 2.515e-03 5.717 1.09e-08 ***
## CAR_USEPrivate -8.525e-01 9.843e-02 -8.661 < 2e-16 ***
## BLUEBOOK -2.722e-05 6.234e-06 -4.367 1.26e-05 ***
## TIF -5.137e-02 9.859e-03 -5.211 1.88e-07 ***
## CAR_TYPEPanel Truck 5.620e-01 2.039e-01 2.757 0.005841 **
## CAR_TYPEPickup 4.841e-01 1.287e-01 3.761 0.000169 ***
## CAR_TYPESports Car 1.006e+00 1.371e-01 7.335 2.21e-13 ***
## CAR_TYPEVan 4.125e-01 1.692e-01 2.438 0.014771 *
## CAR_TYPESUV 7.359e-01 1.106e-01 6.655 2.83e-11 ***
## OLDCLAIM -1.776e-05 5.131e-06 -3.461 0.000538 ***
## CLM_FREQ 2.059e-01 3.794e-02 5.428 5.71e-08 ***
## REVOKEDYes 8.704e-01 1.209e-01 7.199 6.07e-13 ***
## MVR_PTS 1.218e-01 1.840e-02 6.618 3.64e-11 ***
## URBANICITYRural -2.332e+00 1.471e-01 -15.855 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5240.4 on 4530 degrees of freedom
## Residual deviance: 4059.0 on 4507 degrees of freedom
## AIC: 4107
##
## Number of Fisher Scoring iterations: 5
| | 0 | 1 |
|---|---|---|
| 0 | 1046 | 64 |
| 1 | 241 | 159 |
| llh | llhNull | G2 | McFadden | r2ML | r2CU |
|---|---|---|---|---|---|
| -2029 | -2620 | 1181 | 0.2254 | 0.2295 | 0.3349 |
The accuracy level is 79.80%, and McFadden's \(R^2\) is 0.2254. As the plot below shows, the area under the ROC curve is 0.8022.
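The curve and AUC are computed with ROCR; a sketch from the appendix, where pred holds the predicted probabilities on insTEST:

```r
library(ROCR)

# ROC curve for the test-set probabilities
pr  <- prediction(pred, insTEST$TARGET_FLAG)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, colorize = TRUE)

# Area under the curve (about 0.8022 in our run)
auc <- performance(pr, measure = "auc")@y.values[[1]]
```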
Below are the results of k-fold cross-validation with 10 folds; the setup is sketched first, followed by the resulting confusion matrix.
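The cross-validation setup, sketched from the appendix using caret:

```r
library(caret)

# 10-fold cross-validation of the final formula
ctrl <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE)
model_fit <- train(TARGET_FLAG ~ KIDSDRIV + log(INCOME+1) + PARENT1 +
                     log(HOME_VAL+1) + MSTATUS + EDUCATION + TRAVTIME +
                     CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM +
                     CLM_FREQ + REVOKED + MVR_PTS + URBANICITY,
                   data = insTRAIN, method = "glm", family = "binomial",
                   trControl = ctrl)
```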
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1046 64
## 1 241 159
##
## Accuracy : 0.798
## 95% CI : (0.7769, 0.818)
## No Information Rate : 0.8523
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3959
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.8127
## Specificity : 0.7130
## Pos Pred Value : 0.9423
## Neg Pred Value : 0.3975
## Prevalence : 0.8523
## Detection Rate : 0.6927
## Detection Prevalence : 0.7351
## Balanced Accuracy : 0.7629
##
## 'Positive' Class : 0
##
Using anova, we can analyze the deviance and observe that all terms are significant, although MSTATUS and TRAVTIME are less significant than the other terms.
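The table below is a sequential chi-squared analysis of deviance:

```r
# Each row tests the contribution of a term added to the model in order
anova(model, test = "Chisq")
```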
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: TARGET_FLAG
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 4530 5240.4
## KIDSDRIV 1 31.64 4529 5208.8 1.860e-08 ***
## log(INCOME + 1) 1 45.45 4528 5163.3 1.570e-11 ***
## PARENT1 1 99.30 4527 5064.1 < 2.2e-16 ***
## log(HOME_VAL + 1) 1 45.03 4526 5019.0 1.938e-11 ***
## MSTATUS 1 3.56 4525 5015.5 0.05907 .
## EDUCATION 4 85.47 4521 4930.0 < 2.2e-16 ***
## TRAVTIME 1 5.70 4520 4924.3 0.01697 *
## CAR_USE 1 81.67 4519 4842.6 < 2.2e-16 ***
## BLUEBOOK 1 38.64 4518 4804.0 5.089e-10 ***
## TIF 1 27.91 4517 4776.1 1.272e-07 ***
## CAR_TYPE 5 67.15 4512 4708.9 4.014e-13 ***
## OLDCLAIM 1 49.73 4511 4659.2 1.766e-12 ***
## CLM_FREQ 1 109.12 4510 4550.1 < 2.2e-16 ***
## REVOKED 1 61.82 4509 4488.2 3.762e-15 ***
## MVR_PTS 1 69.84 4508 4418.4 < 2.2e-16 ***
## URBANICITY 1 359.42 4507 4059.0 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Variance inflation factors are now low for all variables, so multicollinearity is no longer a concern.
## GVIF Df GVIF^(1/(2*Df))
## KIDSDRIV 1.082636 1 1.040498
## log(INCOME + 1) 1.214555 1 1.102068
## PARENT1 1.418362 1 1.190950
## log(HOME_VAL + 1) 1.622557 1 1.273796
## MSTATUS 1.954360 1 1.397984
## EDUCATION 1.345798 4 1.037821
## TRAVTIME 1.033747 1 1.016734
## CAR_USE 1.527886 1 1.236077
## BLUEBOOK 1.565014 1 1.251005
## TIF 1.008322 1 1.004152
## CAR_TYPE 1.935421 5 1.068261
## OLDCLAIM 1.677986 1 1.295371
## CLM_FREQ 1.449134 1 1.203800
## REVOKED 1.339632 1 1.157425
## MVR_PTS 1.171422 1 1.082322
## URBANICITY 1.135802 1 1.065740
Based on the final model, a crash is more likely to occur given:

- a higher number of DMV points;
- a higher number of children driving;
- single parenthood and being unmarried;
- a longer commuting distance;
- a license revocation within the past 7 years;
- a higher number of claims within the past 5 years.

Other effects:

- Commercial vehicle use is more likely to result in a crash than private use.
- A rural environment is less likely to result in a crash than an urban one.
- Individuals who have been customers longer are less likely to have a crash.
- A higher Blue Book value makes a crash less likely.
- More education makes a crash less likely.
- Higher income and home value make a crash less likely.
- Car type is hard to evaluate, but various types generally make a crash more likely (to different degrees).
- Interestingly, a larger previous payout makes a crash less likely.
## Linear Regression Model

A linear model is used to predict the amount of the payout in case of a crash (TARGET_AMT). Only observations where a crash occurred (TARGET_FLAG==1) are used in training the linear model: if there is no crash, no payout is needed, so including observations without a crash may skew the results. Isolating all observations with a reported crash leaves 1,601 observations.
As before, the data is divided into a training set (75%; 1,200 observations) and a testing set (25%; 401 observations).
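The subsetting and split, sketched from the appendix:

```r
# Keep only crash observations, then split 75/25 on the payout amount
insLM      <- ins[ins$TARGET_FLAG == 1, ]
split      <- sample.split(insLM$TARGET_AMT, SplitRatio = 0.75)
insLMtrain <- subset(insLM, split == TRUE)
insLMtest  <- subset(insLM, split == FALSE)
```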
We built five models and compared them:

- Model 1: all independent variables.
- Model 2: PARENT1, MSTATUS, BLUEBOOK, OLDCLAIM, CLM_FREQ, REVOKED and JOB.
- Model 3: BLUEBOOK, OLDCLAIM, CLM_FREQ and REVOKED.
- Model 4: BLUEBOOK, CAR_AGE and CAR_TYPE.
- Model 5: BLUEBOOK only. This variable seems the most significant.

| Model | Adjusted R^2 | Root-Mean-Square Error |
|---|---|---|
| Model 1 | 0.01808 | 8416 |
| Model 2 | 0.02218 | 8367 |
| Model 3 | 0.01366 | 8344 |
| Model 4 | 0.01351 | 8373 |
| Model 5 | 0.01202 | 8336 |
Based on these results, we decided to use Model 5 due to its best RMSE and its simplicity, even though it has the worst adjusted \(R^2\). The next section focuses on Model 5 for further analysis.
##
## Call:
## lm(formula = TARGET_AMT ~ BLUEBOOK, data = insLMtrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7307 -2864 -1349 483 79170
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.041e+03 4.023e+02 10.044 < 2e-16 ***
## BLUEBOOK 1.013e-01 2.566e-02 3.948 8.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7063 on 1198 degrees of freedom
## Multiple R-squared: 0.01284, Adjusted R-squared: 0.01202
## F-statistic: 15.58 on 1 and 1198 DF, p-value: 8.349e-05
…
Consider a Box-Cox transformation (the plot below is generated using boxcox from the MASS package). If lambda is picked to be 0, then the target variable should be log-transformed.
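The diagnostic and the resulting refit, sketched from the appendix:

```r
library(MASS)

# Box-Cox profile for the BLUEBOOK-only model; lambda near 0 suggests a log
boxcox(lmModel)

# Refit with the log-transformed target
lmModel <- lm(log(TARGET_AMT) ~ BLUEBOOK, data = insLMtrain)
```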
##
## Call:
## lm(formula = log(TARGET_AMT) ~ BLUEBOOK, data = insLMtrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7770 -0.3877 0.0599 0.4177 3.0424
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.128e+00 4.592e-02 176.993 < 2e-16 ***
## BLUEBOOK 8.166e-06 2.929e-06 2.788 0.00539 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8062 on 1198 degrees of freedom
## Multiple R-squared: 0.006446, Adjusted R-squared: 0.005617
## F-statistic: 7.772 on 1 and 1198 DF, p-value: 0.005389
This drastically reduced the already low adjusted \(R^2\). However, the Q-Q plot is improved.
Looking at the scatterplot of BLUEBOOK vs log(TARGET_AMT), there is a lot of variance and there are many outliers. It is possible that some points are leverage points that interfere with the model. Two additional models were built in an effort to account for that.
The first model was created using robust regression (rlm in the MASS package). The second model was created using quantile regression (rq in the quantreg package).
The original model has an RMSE of 0.7815 (on the log scale). The RLM model improves this very slightly to 0.7811, and the RQ model improves it slightly further to 0.7771. Fits for all three models are presented in the scatterplot below.
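The two alternative fits and their test-set RMSEs, sketched from the appendix (rq defaults to the median, tau = 0.5):

```r
library(MASS)      # rlm
library(quantreg)  # rq
library(Metrics)   # rmse

# Robust and median (quantile) regression on the same formula
lmModel2 <- rlm(log(TARGET_AMT) ~ BLUEBOOK, data = insLMtrain)
lmModel3 <- rq(log(TARGET_AMT) ~ BLUEBOOK, data = insLMtrain)

rmse(log(insLMtest$TARGET_AMT), predict(lmModel2, newdata = insLMtest))
rmse(log(insLMtest$TARGET_AMT), predict(lmModel3, newdata = insLMtest))
```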
## Prediction

For clarity, we present results for the first 100 entries of the evaluation data (the insurance-evaluation-data.csv file).
The evaluation data is missing some INCOME and HOME_VAL values that are used in the binary regression model. Rather than imputing those values (replacing them with 0, the mean, or the median, or building a model to predict them), they are left as-is, and the model cannot predict the outcome for the corresponding observations.
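The combined prediction step, sketched from the appendix (model is the final binary model, lmModel the log-payout model, and eval the cleaned evaluation data):

```r
# Crash probability, 0/1 flag, and payout only where a crash is predicted;
# rows with missing INCOME/HOME_VAL propagate NA through every column
prob <- predict(model, newdata = eval, type = "response")
flag <- round(prob)
amt  <- ifelse(flag == 1, exp(predict(lmModel, newdata = eval)), NA)
```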
| Index | TARGET_FLAG Prob. | TARGET_FLAG | TARGET_AMT |
|---|---|---|---|
| 3 | 0.1816 | 0 | NA |
| 9 | 0.3604 | 0 | NA |
| 10 | 0.1493 | 0 | NA |
| 18 | 0.2319 | 0 | NA |
| 21 | 0.2756 | 0 | NA |
| 30 | NA | NA | NA |
| 31 | 0.4432 | 0 | NA |
| 37 | 0.4418 | 0 | NA |
| 39 | 0.0289 | 0 | NA |
| 47 | 0.1904 | 0 | NA |
| 60 | 0.0279 | 0 | NA |
| 62 | 0.5549 | 1 | 3812 |
| 63 | NA | NA | NA |
| 64 | 0.0792 | 0 | NA |
| 68 | 0.0261 | 0 | NA |
| 75 | NA | NA | NA |
| 76 | 0.671 | 1 | 3429 |
| 83 | 0.1779 | 0 | NA |
| 87 | 0.5449 | 1 | 3651 |
| 92 | 0.3123 | 0 | NA |
| 98 | 0.1521 | 0 | NA |
| 106 | 0.395 | 0 | NA |
| 107 | 0.0851 | 0 | NA |
| 113 | 0.3603 | 0 | NA |
| 120 | 0.2745 | 0 | NA |
| 123 | 0.4411 | 0 | NA |
| 125 | 0.4386 | 0 | NA |
| 126 | 0.3961 | 0 | NA |
| 128 | 0.1665 | 0 | NA |
| 129 | 0.1554 | 0 | NA |
| 131 | 0.221 | 0 | NA |
| 135 | 0.3263 | 0 | NA |
| 141 | 0.079 | 0 | NA |
| 147 | 0.1926 | 0 | NA |
| 148 | 0.1651 | 0 | NA |
| 151 | 0.0361 | 0 | NA |
| 156 | 0.1378 | 0 | NA |
| 157 | 0.1199 | 0 | NA |
| 174 | 0.0691 | 0 | NA |
| 186 | 0.5531 | 1 | 3668 |
| 193 | 0.2295 | 0 | NA |
| 195 | 0.5546 | 1 | 3831 |
| 212 | 0.0136 | 0 | NA |
| 213 | 0.5128 | 1 | 3920 |
| 217 | 0.0041 | 0 | NA |
| 223 | 0.2552 | 0 | NA |
| 226 | 0.1033 | 0 | NA |
| 228 | 0.2615 | 0 | NA |
| 230 | 0.0138 | 0 | NA |
| 241 | 0.5685 | 1 | 3610 |
| 243 | 0.1487 | 0 | NA |
| 249 | 0.3335 | 0 | NA |
| 281 | 0.7619 | 1 | 3691 |
| 288 | 0.1072 | 0 | NA |
| 294 | NA | NA | NA |
| 295 | 0.2027 | 0 | NA |
| 300 | 0.4442 | 0 | NA |
| 302 | 0.3606 | 0 | NA |
| 303 | 0.1077 | 0 | NA |
| 308 | 0.5506 | 1 | 3594 |
| 319 | 0.0118 | 0 | NA |
| 320 | 0.0944 | 0 | NA |
| 324 | 0.3479 | 0 | NA |
| 331 | 0.2167 | 0 | NA |
| 343 | 0.0511 | 0 | NA |
| 347 | 0.3246 | 0 | NA |
| 348 | 0.8159 | 1 | 3531 |
| 350 | 0.568 | 1 | 3967 |
| 357 | 0.1465 | 0 | NA |
| 358 | 0.058 | 0 | NA |
| 360 | NA | NA | NA |
| 366 | 0.1974 | 0 | NA |
| 367 | 0.7401 | 1 | 3580 |
| 368 | 0.2771 | 0 | NA |
| 376 | 0.7287 | 1 | 3645 |
| 380 | 0.3893 | 0 | NA |
| 388 | 0.3897 | 0 | NA |
| 396 | 0.252 | 0 | NA |
| 398 | 0.1245 | 0 | NA |
| 403 | 0.0448 | 0 | NA |
| 410 | 0.5654 | 1 | 3598 |
| 412 | 0.34 | 0 | NA |
| 420 | 0.2981 | 0 | NA |
| 434 | 0.0345 | 0 | NA |
| 440 | 0.4766 | 0 | NA |
| 450 | 0.5612 | 1 | 3887 |
| 453 | 0.2757 | 0 | NA |
| 464 | 0.2841 | 0 | NA |
| 465 | 0.0522 | 0 | NA |
| 466 | NA | NA | NA |
| 473 | 0.0847 | 0 | NA |
| 476 | 0.0903 | 0 | NA |
| 478 | NA | NA | NA |
| 479 | 0.1898 | 0 | NA |
| 493 | 0.0347 | 0 | NA |
| 497 | 0.2245 | 0 | NA |
| 503 | 0.0068 | 0 | NA |
| 504 | 0.3887 | 0 | NA |
| 505 | 0.3087 | 0 | NA |
| 507 | 0.2996 | 0 | NA |
## Appendix: R Code

# Required libraries
library(knitr)
library(kableExtra)
library(gridExtra)
library(ggplot2)
library(dplyr)
library(caTools)
library(pscl)
library(ROCR)
library(MASS)
library(caret)
library(car)
library(Metrics)
library(quantreg)
library(pander)   # needed for the prediction table at the end
# Import data
ins <- read.csv(url(paste0("https://raw.githubusercontent.com/",
"ilyakats/CUNY-DATA621/master/hw4/",
"insurance_training_data.csv")),
na.strings=c("","NA"))
# Basic statistics
nrow(ins); ncol(ins)
summary(ins)
# TARGET_FLAG - 6008 are 0, 2153 are 1
table(ins$TARGET_FLAG)
class(ins$TARGET_FLAG)
# Integer (0/1)
# TARGET_AMT
summary(ins[ins$TARGET_FLAG==0, 'TARGET_AMT'])
summary(ins[ins$TARGET_FLAG==1, 'TARGET_AMT'])
class(ins$TARGET_AMT)
# Only available for TARGET_FLAG=1
# Numeric: Ranges from 30.28 to 107600
# KIDSDRIV - No of Driving Children
table(ins$KIDSDRIV)
class(ins$KIDSDRIV)
# Integer - Ranges from 0 to 4
# AGE
summary(ins$AGE)
class(ins$AGE)
# Integer - Ranges from 16 to 81
# 6 NAs
# HOMEKIDS
table(ins$HOMEKIDS)
class(ins$HOMEKIDS)
# Integer - Ranges from 0 to 5
# YOJ - Years on Job
summary(ins$YOJ)
class(ins$YOJ)
# Integer - Ranges from 0 to 23
# 454 NAs - 5.56% of observations
# INCOME
class(ins$INCOME)
summary(ins$INCOME)
# Convert to Numeric - Ranges from $0 to $367,000
# 445 NAs - 5.45% of observations
# PARENT1 - Single Parent?
table(ins$PARENT1)
class(ins$PARENT1); levels(ins$PARENT1)
# Factor - No, Yes
# HOME_VAL - Home Value
class(ins$HOME_VAL)
summary(ins$HOME_VAL)
# Converted to Numeric - Ranges from $0 to $885,300
# 464 NAs - 5.69% of observations
# MSTATUS
table(ins$MSTATUS)
class(ins$MSTATUS); levels(ins$MSTATUS)
# Factor - Yes, No
# SEX
table(ins$SEX)
class(ins$SEX); levels(ins$SEX)
# Factor - M, F
# EDUCATION
table(ins$EDUCATION)
class(ins$EDUCATION); levels(ins$EDUCATION)
# Factor - <HS, HS, BA, MA, PhD
# JOB
table(ins$JOB)
class(ins$JOB); levels(ins$JOB)
# Factor - [Blank], Clerical, Doctor, Home Maker, Lawyer, Manager,
# Professional, Student, Blue Collar
# TRAVTIME - Distance to work
summary(ins$TRAVTIME)
class(ins$TRAVTIME)
# Integer - Ranges from 5 to 142
# CAR_USE
class(ins$CAR_USE); levels(ins$CAR_USE)
table(ins$CAR_USE)
# Factor - Commercial, Private
# BLUEBOOK
class(ins$BLUEBOOK)
summary(ins$BLUEBOOK)
# Numeric - Ranges from $1,500 to $69,740
# TIF - Time in Force
class(ins$TIF)
summary(ins$TIF)
# Integer - Ranges from 1 to 25
# CAR_TYPE
class(ins$CAR_TYPE); levels(ins$CAR_TYPE)
table(ins$CAR_TYPE)
# Factor - Minivan, Panel Truck, Pickup, Sports Car, Van, SUV
# RED_CAR
class(ins$RED_CAR); levels(ins$RED_CAR)
table(ins$RED_CAR)
# Factor - No, Yes
# OLDCLAIM
class(ins$OLDCLAIM)
summary(ins$OLDCLAIM)
# Numeric - Ranges from $0 to $57,040
# CLM_FREQ
class(ins$CLM_FREQ)
summary(ins$CLM_FREQ)
# Integer - Ranges from 0 to 5
# REVOKED
class(ins$REVOKED); levels(ins$REVOKED)
table(ins$REVOKED)
# Factor - No, Yes
# MVR_PTS
class(ins$MVR_PTS)
summary(ins$MVR_PTS)
# Integer - Ranges from 0 to 13
# CAR_AGE
class(ins$CAR_AGE)
summary(ins$CAR_AGE)
nrow(ins[ins$CAR_AGE<0 & !is.na(ins$CAR_AGE), ])
nrow(ins[ins$CAR_AGE==0 & !is.na(ins$CAR_AGE), ])
nrow(ins[ins$CAR_AGE==1 & !is.na(ins$CAR_AGE), ])
# Integer - Ranges from -3 to 28
# 1 observation of -3 - invalid
# 3 observations of 0 - likely invalid
# 1,934 observations of 1 - reasonable (new car)
# 510 NAs - 6.25% of observations
# URBANICITY
class(ins$URBANICITY); levels(ins$URBANICITY)
table(ins$URBANICITY)
# Factor - Urban, Rural
ins$INCOME <- as.numeric(gsub('[$,]', '', ins$INCOME))
ins$HOME_VAL <- as.numeric(gsub('[$,]', '', ins$HOME_VAL))
levels(ins$MSTATUS)[match("z_No",levels(ins$MSTATUS))] <- "No"
levels(ins$SEX)[match("z_F",levels(ins$SEX))] <- "F"
levels(ins$EDUCATION)[match("z_High School",
levels(ins$EDUCATION))] <- "High School"
ins$EDUCATION <- factor(ins$EDUCATION,levels(ins$EDUCATION)[c(1,5,2:4)])
levels(ins$JOB)[match("z_Blue Collar",levels(ins$JOB))] <- "Blue Collar"
ins$BLUEBOOK <- as.numeric(gsub('[$,]', '', ins$BLUEBOOK))
levels(ins$CAR_TYPE)[match("z_SUV",levels(ins$CAR_TYPE))] <- "SUV"
levels(ins$RED_CAR)[match("no",levels(ins$RED_CAR))] <- "No"
levels(ins$RED_CAR)[match("yes",levels(ins$RED_CAR))] <- "Yes"
ins$OLDCLAIM <- as.numeric(gsub('[$,]', '', ins$OLDCLAIM))
levels(ins$URBANICITY)[match("Highly Urban/ Urban",
levels(ins$URBANICITY))] <- "Urban"
levels(ins$URBANICITY)[match("z_Highly Rural/ Rural",
levels(ins$URBANICITY))] <- "Rural"
ins$JOB <- factor(ins$JOB,levels(ins$JOB)[c(7, 8, 3, 1, 6, 5, 4, 2)])
# Drop index column
ins <- ins[-c(1)]
insFull <- ins
ins <- ins[complete.cases(ins), ]
ins[ins$CAR_AGE<1,'CAR_AGE'] <- NA
ins <- ins[complete.cases(ins), ]
# Cuts down from 8,161 to 6,041 complete observations
# Get only complete cases
nrow(ins[complete.cases(ins), ])
nrow(ins)
insBackup <- ins
# Summary table
sumIns = data.frame(Variable = character(),
Min = integer(),
Median = integer(),
Mean = double(),
SD = double(),
Max = integer(),
Num_NAs = integer(),
Num_Zeros = integer())
for (i in c(3:7,9,14,16,17,20,21,23,24)) {
sumIns <- rbind(sumIns, data.frame(Variable = colnames(ins)[i],
Min = min(ins[,i], na.rm=TRUE),
Median = median(ins[,i], na.rm=TRUE),
Mean = mean(ins[,i], na.rm=TRUE),
SD = sd(ins[,i], na.rm=TRUE),
Max = max(ins[,i], na.rm=TRUE),
Num_NAs = sum(is.na(ins[,i])),
Num_Zeros = length(which(ins[,i]==0)))
)
}
colnames(sumIns) <- c("", "Min", "Median", "Mean", "SD", "Max", "Num of NAs",
"Num of Zeros")
sumIns
# Proportion of target variable
table(ins$TARGET_FLAG)
table(ins$TARGET_FLAG)/sum(table(ins$TARGET_FLAG))
# Exploratory plots (repeated for each variable)
# Get descriptive plots:
# Variables:
# INDEX, TARGET_FLAG, TARGET_AMT, KIDSDRIV, AGE, HOMEKIDS, YOJ, INCOME,
# PARENT1, HOME_VAL, MSTATUS, SEX, EDUCATION, JOB, TRAVTIME, CAR_USE,
# BLUEBOOK, TIF, CAR_TYPE, RED_CAR, OLDCLAIM, CLM_FREQ, REVOKED, MVR_PTS,
# CAR_AGE, URBANICITY,
v <- "TARGET_AMT" # Variable to view
pd <- as.data.frame(cbind(ins[, v], ins$TARGET_FLAG))
colnames(pd) <- c("X", "Y")
# Boxplot
bp <- ggplot(pd, aes(x = 1, y = X)) +
stat_boxplot(geom ='errorbar') + geom_boxplot() +
xlab("Boxplot") + ylab("") + theme(axis.text.x=element_blank(),
axis.ticks.x=element_blank())
# Density plot
hp <- ggplot(pd, aes(x = X)) +
geom_histogram(aes(y=..density..), colour="black", fill="white") +
geom_density(alpha=.2, fill="#FF6666") +
ylab("") + xlab("Density Plot with Mean") +
geom_vline(aes(xintercept=mean(X, na.rm=TRUE)), color="red",
linetype="dashed", size=1)
# Scatterplot
sp <- ggplot(pd, aes(x=X, y=Y)) +
geom_point() +
stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE) +
xlab("Scatterplot with Logistic Regression Line")
grid.arrange(bp, hp, sp, layout_matrix=rbind(c(1,2,2),c(1,3,3)))
# Correlation matrix
cm <- cor(ins[, sapply(ins, is.numeric)], use="pairwise.complete.obs") # numeric columns only
cm <- round(cm, 2)
cmout <- as.data.frame(cm) %>% mutate_all(function(x) {
cell_spec(x, "html", color = ifelse(x>0.5 | x<(-0.5),"blue","black"))
})
rownames(cmout) <- colnames(cmout)
cmout %>%
kable("html", escape = F, align = "c", row.names = TRUE) %>%
kable_styling("striped", full_width = F)
pairs(ins)
# Split into train and validation sets
set.seed(88)
split <- sample.split(ins$TARGET_FLAG, SplitRatio = 0.75)
insTRAIN <- subset(ins, split == TRUE)
insTEST <- subset(ins, split == FALSE)
# BINARY REGRESSION MODEL
# Modelling - Basic model
model <- glm (TARGET_FLAG ~ .-TARGET_AMT, data = insTRAIN,
family = binomial(link="logit"))
summary(model)
pred <- predict(model, newdata=subset(insTEST, select=c(1:25)),
type='response')
cm <- confusionMatrix(as.factor(insTEST$TARGET_FLAG),
as.factor(ifelse(pred > 0.5,1,0)))
cm$table
cm$overall['Accuracy']
pR2(model) # McFadden R^2
# Stepwise approach
model <- stepAIC(model, trace=FALSE, direction='both')
# Model tweaking
model <- glm(formula = TARGET_FLAG ~ KIDSDRIV + log(INCOME+1) + PARENT1 +
log(HOME_VAL+1) + MSTATUS + EDUCATION + JOB + TRAVTIME +
CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ +
REVOKED + MVR_PTS + URBANICITY,
family = binomial(link = "logit"), data = insTRAIN)
# ROC
pr <- prediction(pred, insTEST$TARGET_FLAG)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf, colorize = TRUE, text.adj = c(-0.2,1.7))
auc <- performance(pr, measure = "auc")
(auc <- auc@y.values[[1]])
# K-Fold cross validation
ctrl <- trainControl(method = "repeatedcv", number = 10,
savePredictions = TRUE)
model_fit <- train(TARGET_FLAG ~ KIDSDRIV + log(INCOME+1) + PARENT1 +
log(HOME_VAL+1) + MSTATUS + EDUCATION + TRAVTIME +
CAR_USE + BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM +
CLM_FREQ + REVOKED + MVR_PTS + URBANICITY,
data=insTRAIN, method="glm", family="binomial",
trControl = ctrl, tuneLength = 5)
pred <- predict(model_fit, newdata=insTEST)
confusionMatrix(as.factor(insTEST$TARGET_FLAG),
as.factor(ifelse(pred > 0.5,1,0)))
# Deviance residuals
anova(model, test="Chisq")
# VIF
vif(model)
# Take out JOB
ggplot(data = ins, aes(JOB, EDUCATION)) +
geom_jitter()
model <- glm(formula = TARGET_FLAG ~ KIDSDRIV + log(INCOME+1) + PARENT1 +
log(HOME_VAL+1) + MSTATUS + EDUCATION + TRAVTIME + CAR_USE +
BLUEBOOK + TIF + CAR_TYPE + OLDCLAIM + CLM_FREQ + REVOKED +
MVR_PTS + URBANICITY,
family = binomial(link = "logit"), data = insTRAIN)
# LINEAR MODEL
insLM <- ins
insLM <- ins[ins$TARGET_FLAG==1,]
# Split into training and testing sets
split <- sample.split(insLM$TARGET_AMT, SplitRatio = 0.75)
insLMtrain <- subset(insLM, split == TRUE)
insLMtest <- subset(insLM, split == FALSE)
# Initial models
lmModel <- lm(TARGET_AMT ~ .-TARGET_FLAG,data = insLMtrain)
summary(lmModel)
lmModel <- stepAIC(lmModel, trace=FALSE, direction='both')
summary(lmModel)
lmModel <- lm(TARGET_AMT ~ PARENT1 + MSTATUS + BLUEBOOK + OLDCLAIM +
CLM_FREQ + REVOKED + JOB,data = insLMtrain)
summary(lmModel)
lmModel <- lm(TARGET_AMT ~ BLUEBOOK + OLDCLAIM + CLM_FREQ + REVOKED,
data = insLMtrain)
summary(lmModel)
lmModel <- lm(TARGET_AMT ~ BLUEBOOK + CAR_AGE + CAR_TYPE,data = insLMtrain)
summary(lmModel)
lmModel <- lm(TARGET_AMT ~ BLUEBOOK,data = insLMtrain)
summary(lmModel)
# Calculate RMSE
pred <- predict(lmModel, newdata=insLMtest)
rmse(insLMtest$TARGET_AMT, pred)
# Model plots
plot(lmModel$residuals, ylab="Residuals")
abline(h=0)
plot(lmModel$fitted.values, lmModel$residuals,
xlab="Fitted Values", ylab="Residuals")
abline(h=0)
qqnorm(lmModel$residuals)
qqline(lmModel$residuals)
boxcox(lmModel)
lmModel <- lm(log(TARGET_AMT) ~ BLUEBOOK, data = insLMtrain)
summary(lmModel)
pred <- predict(lmModel, newdata=insLMtest)
rmse(log(insLMtest$TARGET_AMT), pred)
lmModel2 <- rlm(log(TARGET_AMT) ~ BLUEBOOK, data = insLMtrain)
summary(lmModel2)
pred <- predict(lmModel2, newdata=insLMtest)
rmse(log(insLMtest$TARGET_AMT), pred)
lmModel3 <- rq(log(TARGET_AMT) ~ BLUEBOOK, data = insLMtrain)
summary(lmModel3)
pred <- predict(lmModel3, newdata=insLMtest)
rmse(log(insLMtest$TARGET_AMT), pred)
plot(log(TARGET_AMT) ~ BLUEBOOK, data = insLMtrain)
abline(lmModel)
abline(lmModel2, col="red")
abline(lmModel3, col="blue")
legend("topright", inset=0.05, bty="n",
legend=c("lm fit", "rlm fit", "rq fit"),
lty=c(1,1,1),
col=c("black", "red", "blue"))
# Prediction
eval <- read.csv(url(paste0("https://raw.githubusercontent.com/",
"ilyakats/CUNY-DATA621/master/hw4/",
"insurance-evaluation-data.csv")),
na.strings=c("","NA"))
results <- eval[,1]
eval$INCOME <- as.numeric(gsub('[$,]', '', eval$INCOME))
eval$HOME_VAL <- as.numeric(gsub('[$,]', '', eval$HOME_VAL))
levels(eval$MSTATUS)[match("z_No",levels(eval$MSTATUS))] <- "No"
levels(eval$SEX)[match("z_F",levels(eval$SEX))] <- "F"
levels(eval$EDUCATION)[match("z_High School",
levels(eval$EDUCATION))] <- "High School"
eval$EDUCATION <- factor(eval$EDUCATION,levels(eval$EDUCATION)[c(1,5,2:4)])
levels(eval$JOB)[match("z_Blue Collar",levels(eval$JOB))] <- "Blue Collar"
eval$BLUEBOOK <- as.numeric(gsub('[$,]', '', eval$BLUEBOOK))
levels(eval$CAR_TYPE)[match("z_SUV",levels(eval$CAR_TYPE))] <- "SUV"
levels(eval$RED_CAR)[match("no",levels(eval$RED_CAR))] <- "No"
levels(eval$RED_CAR)[match("yes",levels(eval$RED_CAR))] <- "Yes"
eval$OLDCLAIM <- as.numeric(gsub('[$,]', '', eval$OLDCLAIM))
levels(eval$URBANICITY)[match("Highly Urban/ Urban",
levels(eval$URBANICITY))] <- "Urban"
levels(eval$URBANICITY)[match("z_Highly Rural/ Rural",
levels(eval$URBANICITY))] <- "Rural"
eval$JOB <- factor(eval$JOB,levels(eval$JOB)[c(7, 8, 3, 1, 6, 5, 4, 2)])
eval <- eval[-c(1)]
pred <- predict(model, newdata=eval, type="response")
results <- cbind(results, prob=round(pred,4))
results <- cbind(results, predict=round(pred,0))
pred <- predict(lmModel, newdata=eval, type="response")
results <- cbind(results, exp(pred))
results <- as.data.frame(results)
results[results$predict==0 & !is.na(results$predict),'V4'] <- NA
results[is.na(results$predict),'V4'] <- NA
colnames(results) <- c("Index", "TARGET_FLAG Prob.",
"TARGET_FLAG", "TARGET_AMT")
pander(head(results, 100))