Overview
Your objective is to build multiple linear regression and binary logistic regression models on the training data to predict the probability that a person will crash their car and also the amount of money it will cost if the person does crash their car. You can only use the variables given to you (or variables that you derive from the variables provided).
1. DATA EXPLORATION
Examine the cases & variables
Data Dictionary
In this homework assignment, you will explore, analyze and model a data set containing 8161 records, each representing a customer at an auto insurance company. As the KIDSDRIV variable implies, each record represents a customer that could have multiple drivers on the account.
After excluding the INDEX variable, there are a total of 25 variables, 2 of which are response variables:
- TARGET_FLAG is a binary variable where a “1” means that the person was in a car crash and a “0” means that the person was not in a car crash.
- TARGET_AMT is the cost of the accident. It is zero if the person did not crash their car.
Statistical Summary
Among the numerical variables, we notice the following:
- CAR_AGE, YOJ & AGE appear to be missing values.
- Additionally, CAR_AGE has a minimum value of -3. Since a negative car age is impossible, we should consider imputing a value for this observation.
- The cost of the accident, TARGET_AMT, has large skew and kurtosis values, which means that it may be a candidate for transformation.
- The small values for INCOME, HOME_VAL & BLUEBOOK indicate that this data set is a few decades old.
Visualizations
Categorical & Discrete variables
Frequencies
10 of the variables are categorical, and 5 are discrete numerical variables with fewer than 20 values.
- In our response variable TARGET_FLAG, more than 70% of the cars have not been in an accident.
- Many of the variables have one value that accounts for more than 60% of observations. Hence, we could reasonably assert that the company’s typical customer is married with no kids at home or driving, lives in an urban area, has no claims in the last 5 years and has not had his or her license revoked.
- We notice that the JOB variable may be missing some values.
Side-by-Side Boxplots
The box plots show the square root of the TARGET_AMT distribution for each level of the categorical and discrete variables. We can see several noteworthy differences in distributions that will likely inform some variable transformations.
The following characteristics significantly decrease the center of the accident-cost distribution compared to the other variable values. We would expect these coefficients to be negative in our models. (A minimal plotting sketch follows the list below.)
- Having no kids at home (HOMEKIDS) or no kids driving (KIDSDRIV)
- Not being a single parent (PARENT1)
- Being married (MSTATUS)
- Having a Bachelors, Masters or PhD in EDUCATION
- Being a doctor, lawyer, manager or professional in JOB
- Not using the car for commercial purposes (CAR_USE)
- Driving a minivan (CAR_TYPE) & living in a rural area (URBANICITY)
- Having no claims in the last five years (CLM_FREQ) or not having had your licence revoked (REVOKED)
- Having 1 or fewer motor vehicle points (MVR_PTS)
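As a rough sketch of how these box plots can be produced — assuming the train_data object loaded in the Code Appendix and showing only a handful of the categorical variables for brevity — the square-root y-scale is what tames the heavy right skew of TARGET_AMT:

```r
library(tidyverse)
library(reshape2)

# reshape a few categorical variables into long format against TARGET_AMT
boxplot_sketch <- train_data %>%
  dplyr::select(TARGET_AMT, PARENT1, MSTATUS, CAR_USE, REVOKED, URBANICITY) %>%
  mutate_if(is.factor, as.character) %>%   # avoid factor-level clashes when melting
  reshape2::melt(id.vars = "TARGET_AMT")

ggplot(boxplot_sketch, aes(x = value, y = TARGET_AMT)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free") +
  scale_y_sqrt() +                          # square-root scale for the skewed response
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```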
Scatterplots, Histograms & Density Plots
Let’s take a look at the relationships between our continuous variables. I have to give credit to classmate Jaan Bernberg for showing me the following GGally plot.
- First, the side-by-side box plots along the top row show surprisingly little difference in the variance and distribution between customers without an accident (in red) and those with an accident (in light blue). The only variable with moderately different distributions is OLDCLAIM: customers without accidents have smaller claim payouts in the last 5 years. Those who have not been in an accident also have a higher range of home values (HOME_VAL).
- In the histograms along the diagonal, other than OLDCLAIM, there are not severely different distributions between the 2 types of customers represented by TARGET_FLAG. The variables BLUEBOOK and CAR_AGE are bimodal & may benefit from transformations.
- The side-by-side density plots in the 1st column reveal that INCOME appears to be higher for customers who have had an accident. In our models, let’s see if this variable defies conventional wisdom and is positively correlated with accidents.
- Next, let’s take a look at the scatterplots in the 2nd column from the left, where our continuous response variable TARGET_AMT is plotted against the other predictor variables. We don’t see any pronounced positive or negative relationships. However, we see evidence that we may be able to make these relationships more linear by transforming TARGET_AMT.
- Lastly, we notice that the correlation values in the upper triangular section are very small. We will confirm this in the next section.
Correlations
As we noted above, none of the variable pairs appear to have strong linear correlations. YOJ and INCOME are the most correlated at .31. We will likely not have collinearity among our original variables.
|   | Var1 | Var2 | Correlation | Rsquared |
|---|---|---|---|---|
| 1 | YOJ | INCOME | 0.31 | 0.10 |
| 2 | HOME_VAL | CAR_AGE | 0.19 | 0.04 |
| 3 | AGE | CAR_AGE | 0.18 | 0.03 |
| 4 | AGE | HOME_VAL | 0.15 | 0.02 |
| 5 | AGE | YOJ | 0.14 | 0.02 |
The correlation coefficients with the response variable TARGET_AMT are even weaker. OLDCLAIM, the strongest, is only correlated with the response variable at .11.
|   | Var1 | Var2 | Correlation | Rsquared |
|---|---|---|---|---|
| 1 | TARGET_AMT | OLDCLAIM | 0.11 | 0.01 |
| 2 | TARGET_AMT | HOME_VAL | -0.08 | 0.01 |
| 3 | TARGET_AMT | CAR_AGE | -0.06 | 0.00 |
| 4 | TARGET_AMT | AGE | -0.05 | 0.00 |
| 5 | TARGET_AMT | TIF | -0.05 | 0.00 |
Missing Values
Let’s see how our missing values are distributed.
- The variables for the age of the car (CAR_AGE) and years on the job (YOJ) are missing about 6% and 5% of their values, respectively. The customer age variable (AGE) is missing only 6 values. Otherwise, 88% of the cases are complete.
- Upon closer inspection, the JOB variable does not actually contain NAs, but it does have a “blank” level. It’s unclear whether these observations are truly missing. We will consider either replacing the blank with “Unknown” or imputing it.
- We should be able to use the other demographic variables to make reasonable imputations for these missing values.
2. DATA PREPARATION
Imputing the Missing Values
Let’s use the missForest package to do nonparametric missing-value imputation using Random Forest on both the training and evaluation sets. We will set the impossible -3 CAR_AGE value to NA as well.
When we use the out-of-bag error to calculate the normalized root-mean-square error (NRMSE), we see values closer to zero than to one, which indicates that the imputations fit reasonably well.
| Variable | Count | MSE | NRMSE |
|---|---|---|---|
| CAR_AGE | 510 | 16.20 | 0.14 |
| YOJ | 454 | 6.79 | 0.11 |
| AGE | 6 | 46.26 | 0.10 |
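As a minimal sketch of how this table is derived — mirroring the Code Appendix and assuming train_data has its categorical columns stored as factors — missForest() with variablewise = TRUE returns one out-of-bag MSE per imputed column, which we normalize by each variable’s range:

```r
library(missForest)

train_data$CAR_AGE[train_data$CAR_AGE == -3] <- NA      # treat the impossible value as missing
imputed_train <- missForest(train_data, variablewise = TRUE)

# NRMSE = sqrt(OOB MSE) / range, computed for the numeric columns only
num_cols <- sapply(imputed_train$ximp, is.numeric)
ranges   <- sapply(imputed_train$ximp[num_cols], function(x) diff(range(x)))
nrmse    <- sqrt(as.numeric(imputed_train$OOBerror)[num_cols]) / ranges
round(sort(nrmse, decreasing = TRUE), 2)
```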
Variable Transformations
We will use the distribution differences seen in the side-by-side box plots to create the following new dummy variables. In order to avoid singularity issues, these new variables will replace their corresponding many-level categorical variables. We will run models using both the original variable set and the new variable data set to see which performs better.
NOHOMEKIDS, NOKIDSDRIV, HASCOLLEGE, ISPROFESSIONAL, ISMINIVAN
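A minimal sketch of this step, mirroring the Code Appendix (the blank JOB level is first relabeled “Unknown”, and each flag then replaces its many-level source variable):

```r
library(dplyr)

levels(imputed_train$ximp$JOB)[1] <- "Unknown"   # relabel the blank JOB level

train_transformed <- imputed_train$ximp %>%
  mutate(
    NOHOMEKIDS     = as.integer(HOMEKIDS == 0),
    NOKIDSDRIV     = as.integer(KIDSDRIV == 0),
    HASCOLLEGE     = as.integer(EDUCATION %in% c("Bachelors", "Masters", "PhD")),
    ISPROFESSIONAL = as.integer(JOB %in% c("Doctor", "Lawyer", "Manager", "Professional")),
    ISMINIVAN      = as.integer(CAR_TYPE == "Minivan")
  ) %>%
  dplyr::select(-c(HOMEKIDS, KIDSDRIV, EDUCATION, JOB, CAR_TYPE))
```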
3. BUILD MODELS
Linear Regression
LM #1: Original Variables with BIC Selection
In our first model, we will use BIC forward and backward selection on the original variables with the missing values imputed. Twenty-six of the model variables were removed, leaving a model with eleven statistically significant variables. In this model, living in an urban area (URBANICITYHighly Urban/ Urban) increases the predicted cost of accidents the most, followed by being a single parent (PARENT1Yes), while using the insured vehicle only for private, non-commercial purposes (CAR_USEPrivate) decreases the response variable the most. Despite having the highest correlation with the response variable, OLDCLAIM was not statistically significant and was removed. However, the model’s adjusted R-squared is incredibly small at .06. This model does not account for very much of the variation in the cost of the accident TARGET_AMT.
##
## Call:
## lm(formula = TARGET_AMT ~ KIDSDRIV + PARENT1 + HOME_VAL + MSTATUS +
## TRAVTIME + CAR_USE + TIF + REVOKED + MVR_PTS + CAR_AGE +
## URBANICITY, data = dplyr::select(imputed_train$ximp, -TARGET_FLAG))
##
## Residuals:
## Min 1Q Median 3Q Max
## -5836 -1664 -827 275 103964
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 941.13962 222.44217 4.231 2.35e-05 ***
## KIDSDRIV 377.38211 102.28462 3.690 0.000226 ***
## PARENT1Yes 703.34692 176.07951 3.994 6.54e-05 ***
## HOME_VAL -0.10412 0.03304 -3.151 0.001634 **
## MSTATUSYes -438.26773 126.71698 -3.459 0.000546 ***
## TRAVTIME 13.09238 3.22434 4.060 4.94e-05 ***
## CAR_USEPrivate -848.17685 105.11429 -8.069 8.09e-16 ***
## TIF -46.86535 12.19943 -3.842 0.000123 ***
## REVOKEDYes 495.89795 155.26068 3.194 0.001409 **
## MVR_PTS 212.13760 24.06572 8.815 < 2e-16 ***
## CAR_AGE -51.03092 9.42700 -5.413 6.36e-08 ***
## URBANICITYHighly Urban/ Urban 1502.70221 131.29805 11.445 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4563 on 8149 degrees of freedom
## Multiple R-squared: 0.06052, Adjusted R-squared: 0.05926
## F-statistic: 47.73 on 11 and 8149 DF, p-value: < 2.2e-16
## [1] "removed variable(s): 26"
## [1] "AGE" "HOMEKIDS" "YOJ"
## [4] "INCOME" "SEXM" "EDUCATIONBachelors"
## [7] "EDUCATIONHigh School" "EDUCATIONMasters" "EDUCATIONPhD"
## [10] "JOBBlue Collar" "JOBClerical" "JOBDoctor"
## [13] "JOBHome Maker" "JOBLawyer" "JOBManager"
## [16] "JOBProfessional" "JOBStudent" "BLUEBOOK"
## [19] "CAR_TYPEPanel Truck" "CAR_TYPEPickup" "CAR_TYPESports Car"
## [22] "CAR_TYPESUV" "CAR_TYPEVan" "RED_CARyes"
## [25] "OLDCLAIM" "CLM_FREQ"
In the model’s diagnostic plots, the Residuals vs Fitted plot shows we have some nonconstant variance. The Normal Q-Q plot shows that the right side of the standardized residuals deviates from normality. The Leverage plot shows that we do not have any influential points.
LM #2: Transformed Variables with BIC Selection
Our second model applies the same BIC selection process to the transformed variable set with the five new dummy variables. This version of the BIC model contains one more variable than the previous model, and it found three of the new variables to be statistically significant. However, the adjusted R-squared is only slightly higher than the first model’s and still remains tiny.
##
## Call:
## lm(formula = TARGET_AMT ~ PARENT1 + MSTATUS + TRAVTIME + CAR_USE +
## TIF + REVOKED + MVR_PTS + CAR_AGE + URBANICITY + NOKIDSDRIV +
## ISPROFESSIONAL + ISMINIVAN, data = train_transformed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5759 -1672 -790 272 104345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1604.767 274.723 5.841 5.38e-09 ***
## PARENT1Yes 605.479 177.581 3.410 0.000654 ***
## MSTATUSYes -619.283 119.384 -5.187 2.18e-07 ***
## TRAVTIME 12.778 3.219 3.969 7.27e-05 ***
## CAR_USEPrivate -592.297 113.393 -5.223 1.80e-07 ***
## TIF -47.497 12.175 -3.901 9.64e-05 ***
## REVOKEDYes 487.327 154.874 3.147 0.001658 **
## MVR_PTS 208.074 24.002 8.669 < 2e-16 ***
## CAR_AGE -42.060 9.901 -4.248 2.18e-05 ***
## URBANICITYHighly Urban/ Urban 1599.193 133.373 11.990 < 2e-16 ***
## NOKIDSDRIV -690.387 161.994 -4.262 2.05e-05 ***
## ISPROFESSIONAL -498.163 122.858 -4.055 5.06e-05 ***
## ISMINIVAN -539.684 117.309 -4.601 4.28e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4553 on 8148 degrees of freedom
## Multiple R-squared: 0.06451, Adjusted R-squared: 0.06313
## F-statistic: 46.82 on 12 and 8148 DF, p-value: < 2.2e-16
The diagnostic plots for the second linear model retain the nonconstant variance and normality issues. The leverage plot does not show any influential points.
LM #3: BIC Selection with Response Variable Transformation
Given the nonnormality and kurtosis we saw earlier, let’s see what the car package’s powerTransform function recommends using its Box-Cox-with-negatives (bcnPower) family for TARGET_AMT. This two-parameter transformation can accommodate all of the zero values that the variable has. For the sake of simplicity, though, let’s instead add one to TARGET_AMT and take the natural log.
## Skew Power transformation to Normality
##
## Estimated power, lambda
## Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## Y1 -0.2441 -0.244 -0.2511 -0.2371
##
## Estimated location, gamma
## Est gamma Std Err. Wald Lower Bound Wald Upper Bound
## Y1 0.0101 NaN NaN NaN
##
## Likelihood ratio tests about transformation parameters
## LRT df pval
## LR test, lambda = (0) 5942.814 1 0
## LR test, lambda = (1) 143950.207 1 0
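A minimal sketch of this step, as done in the Code Appendix: powerTransform() is fit on the LM #2 object (BIC_transf_var_lm) to estimate the bcnPower parameters shown above, and the response is then approximated with a simpler log(y + 1) transform:

```r
library(car)
library(dplyr)

PT <- car::powerTransform(BIC_transf_var_lm, family = "bcnPower")
summary(PT)                                   # lambda is estimated near -0.24

train_transformed_BCN <- train_transformed %>%
  mutate(TARGET_AMT = log(TARGET_AMT + 1))    # simpler stand-in for the bcnPower transform
```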
Our third model will use the log-transformed TARGET_AMT with the newly created dummy variables. This model has four more statistically significant variables than the previous model, including all five of the dummy variables we created. Moreover, adjusted R-squared is much higher at .21. This model explains much more of the variance in the response variable TARGET_AMT.
##
## Call:
## lm(formula = TARGET_AMT ~ YOJ + PARENT1 + HOME_VAL + MSTATUS +
## TRAVTIME + CAR_USE + TIF + CLM_FREQ + REVOKED + MVR_PTS +
## URBANICITY + NOHOMEKIDS + NOKIDSDRIV + HASCOLLEGE + ISPROFESSIONAL +
## ISMINIVAN, data = train_transformed_BCN)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2088 -2.3507 -0.9222 2.2972 11.0897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.510e+00 2.224e-01 11.283 < 2e-16 ***
## YOJ -4.119e-02 9.304e-03 -4.427 9.69e-06 ***
## PARENT1Yes 5.178e-01 1.556e-01 3.328 0.000878 ***
## HOME_VAL -1.166e-04 2.432e-05 -4.793 1.67e-06 ***
## MSTATUSYes -6.164e-01 9.777e-02 -6.304 3.05e-10 ***
## TRAVTIME 1.761e-02 2.309e-03 7.629 2.64e-14 ***
## CAR_USEPrivate -6.164e-01 8.272e-02 -7.451 1.02e-13 ***
## TIF -6.591e-02 8.727e-03 -7.552 4.76e-14 ***
## CLM_FREQ 2.099e-01 3.497e-02 6.002 2.04e-09 ***
## REVOKEDYes 1.105e+00 1.111e-01 9.950 < 2e-16 ***
## MVR_PTS 1.908e-01 1.848e-02 10.323 < 2e-16 ***
## URBANICITYHighly Urban/ Urban 2.292e+00 9.803e-02 23.380 < 2e-16 ***
## NOHOMEKIDS -3.371e-01 1.080e-01 -3.121 0.001811 **
## NOKIDSDRIV -7.080e-01 1.269e-01 -5.578 2.51e-08 ***
## HASCOLLEGE -7.479e-01 8.953e-02 -8.355 < 2e-16 ***
## ISPROFESSIONAL -5.841e-01 9.799e-02 -5.961 2.61e-09 ***
## ISMINIVAN -7.681e-01 8.430e-02 -9.112 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.263 on 8144 degrees of freedom
## Multiple R-squared: 0.2118, Adjusted R-squared: 0.2102
## F-statistic: 136.8 on 16 and 8144 DF, p-value: < 2.2e-16
After attempting to invert the transformation on the model coefficients, the resulting values are still not within the range we would expect given the distribution of the variables. This may require further research.
| var | transformed_coef_value | inversed_coef_value |
|---|---|---|
| (Intercept) | 2.51 | 11.3 |
| URBANICITYHighly Urban/ Urban | 2.29 | 8.89 |
| REVOKEDYes | 1.11 | 2.02 |
| PARENT1Yes | 0.52 | 0.68 |
| CLM_FREQ | 0.21 | 0.23 |
| MVR_PTS | 0.19 | 0.21 |
| TRAVTIME | 0.02 | 0.02 |
| HOME_VAL | 0 | 0 |
| YOJ | -0.04 | -0.04 |
| TIF | -0.07 | -0.06 |
| NOHOMEKIDS | -0.34 | -0.29 |
| ISPROFESSIONAL | -0.58 | -0.44 |
| CAR_USEPrivate | -0.62 | -0.46 |
| MSTATUSYes | -0.62 | -0.46 |
| NOKIDSDRIV | -0.71 | -0.51 |
| HASCOLLEGE | -0.75 | -0.53 |
| ISMINIVAN | -0.77 | -0.54 |
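One hedged way to read these coefficients is on the multiplicative scale: because the response is modeled as log(TARGET_AMT + 1), exponentiating a coefficient gives the approximate multiplicative effect on (TARGET_AMT + 1) for a one-unit change in the predictor, not a dollar amount, which may be why the inversed values above do not land on the scale we expected. A small sketch, assuming the fitted object BIC_BCN_transf_var_lm from the Code Appendix:

```r
# Approximate percentage change in (TARGET_AMT + 1) per one-unit change in each predictor,
# holding the other predictors fixed
pct_change <- 100 * (exp(coef(BIC_BCN_transf_var_lm)) - 1)
round(sort(pct_change, decreasing = TRUE), 1)
```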
In the diagnostic plots for model #3, the log transformation did improve the normality of our residuals. However, it made the nonconstant variance much worse.
Binary Regression Models
Binary Regression Model #1: Backward Elimination with Original Variables
For our first model, let’s use all of the original variables with a backward elimination process that removes the predictor with the highest p-value until all of the remaining p-values are statistically significant. This process removed seven predictors, leaving the model with 16 variables. URBANICITY has the largest coefficient: holding the other variables constant, living in a highly urban area increases the log-odds of the customer having had a car accident by about 2.1. At roughly -0.76, CAR_USEPrivate decreases the log-odds of a customer car accident the most. OLDCLAIM, by contrast, is statistically but not practically significant: each additional dollar of prior claims increases the log-odds by only about 9.9e-05.
##
##
## removed_vars removed_pvalues
## ------------- ----------------
## RED_CAR 0.888
## EDUCATION 0.854
## BLUEBOOK 0.618
## INCOME 0.342
## JOB 0.341
## CAR_TYPE 0.733
## AGE 0.224
##
## Call:
## glm(formula = TARGET_FLAG ~ KIDSDRIV + HOMEKIDS + YOJ + PARENT1 +
## HOME_VAL + MSTATUS + SEX + TRAVTIME + CAR_USE + TIF + OLDCLAIM +
## CLM_FREQ + REVOKED + MVR_PTS + CAR_AGE + URBANICITY, family = "binomial",
## data = dplyr::select(imputed_train$ximp, -TARGET_AMT))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4323 -0.7444 -0.4460 0.7174 2.9576
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.969e+00 1.613e-01 -12.208 < 2e-16 ***
## KIDSDRIV 3.140e-01 5.827e-02 5.389 7.08e-08 ***
## HOMEKIDS 1.249e-01 3.310e-02 3.773 0.000161 ***
## YOJ -4.433e-02 7.108e-03 -6.237 4.45e-10 ***
## PARENT1Yes 3.401e-01 1.058e-01 3.216 0.001300 **
## HOME_VAL -1.355e-04 1.903e-05 -7.124 1.05e-12 ***
## MSTATUSYes -3.727e-01 7.481e-02 -4.982 6.28e-07 ***
## SEXM -2.836e-01 6.017e-02 -4.714 2.43e-06 ***
## TRAVTIME 1.470e-02 1.831e-03 8.028 9.94e-16 ***
## CAR_USEPrivate -7.647e-01 6.033e-02 -12.676 < 2e-16 ***
## TIF -5.124e-02 7.143e-03 -7.174 7.28e-13 ***
## OLDCLAIM 9.902e-05 4.152e-05 2.385 0.017079 *
## CLM_FREQ 1.235e-01 3.130e-02 3.946 7.94e-05 ***
## REVOKEDYes 7.666e-01 7.803e-02 9.824 < 2e-16 ***
## MVR_PTS 1.129e-01 1.337e-02 8.440 < 2e-16 ***
## CAR_AGE -4.725e-02 5.347e-03 -8.837 < 2e-16 ***
## URBANICITYHighly Urban/ Urban 2.126e+00 1.107e-01 19.209 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7642.5 on 8144 degrees of freedom
## AIC: 7676.5
##
## Number of Fisher Scoring iterations: 5
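Since logistic coefficients are on the log-odds scale, exponentiating them yields odds ratios, which can be easier to interpret. A minimal sketch, assuming the fitted object bk_elim_orig_vars from the Code Appendix:

```r
# Odds ratios with Wald confidence intervals for the backward-elimination model
odds_ratios <- exp(cbind(OR = coef(bk_elim_orig_vars),
                         confint.default(bk_elim_orig_vars)))
round(odds_ratios, 3)
```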
The initial model has a cross-validation accuracy of .78 and an area under the receiver operating characteristic curve (AUC) of .79. With no variance inflation factor greater than four (VIF_gt_4), this model shows no substantial collinearity, and the large p-value for the Hosmer-Lemeshow goodness-of-fit test means that we fail to reject the null hypothesis that the model fits well.
| model_name | n_vars | model_pvalue | residual_deviance | H_L_pvalue | VIF_gt_4 | CV_accuracy | AUC |
|---|---|---|---|---|---|---|---|
| bk_elim_orig_vars | 16 | 0.00e+00 | 7642.47 | 0.841 | 0 | 0.778 | 0.788 |
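The statistics in this table are produced roughly as follows — a condensed sketch of the glm_performance helper in the Code Appendix, using a 0.5 classification cutoff for the cross-validation cost function:

```r
library(boot); library(pROC); library(ResourceSelection); library(car)

model <- bk_elim_orig_vars
cost  <- function(r, pi = 0) mean(abs(r - pi) > 0.5)    # misclassification rate at a 0.5 cutoff

cv_accuracy <- 1 - boot::cv.glm(model$model, model, cost = cost, K = 100)$delta[1]
auc         <- as.numeric(pROC::roc(model$y, fitted(model))$auc)
hl_pvalue   <- ResourceSelection::hoslem.test(model$y, fitted(model))$p.value
vif_gt_4    <- sum(car::vif(model) > 4)
```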
Binary Regression Model #2: Backward Elimination with Transformed Variables
Next, let’s apply the same backward elimination process to the transformed set of variables. This process removed six variables, leaving the model with 17 statistically significant variables. All five of the newly created dummy variables were included. URBANICITY again has by far the largest positive coefficient, while ISMINIVAN, followed by HASCOLLEGE, decreases the log-odds of a customer car accident the most.
##
##
## removed_vars removed_pvalues
## ------------- ----------------
## BLUEBOOK 0.994
## AGE 0.978
## RED_CAR 0.927
## INCOME 0.904
## SEX 0.265
## CAR_AGE 0.176
##
## Call:
## glm(formula = TARGET_FLAG ~ YOJ + PARENT1 + HOME_VAL + MSTATUS +
## TRAVTIME + CAR_USE + TIF + OLDCLAIM + CLM_FREQ + REVOKED +
## MVR_PTS + URBANICITY + NOHOMEKIDS + NOKIDSDRIV + HASCOLLEGE +
## ISPROFESSIONAL + ISMINIVAN, family = "binomial", data = dplyr::select(train_transformed,
## -TARGET_AMT))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3382 -0.7274 -0.4138 0.6820 3.2234
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.411e+00 1.868e-01 -7.553 4.26e-14 ***
## YOJ -3.300e-02 7.226e-03 -4.568 4.93e-06 ***
## PARENT1Yes 2.363e-01 1.188e-01 1.988 0.046816 *
## HOME_VAL -9.406e-05 1.948e-05 -4.827 1.38e-06 ***
## MSTATUSYes -5.270e-01 7.914e-02 -6.658 2.77e-11 ***
## TRAVTIME 1.510e-02 1.863e-03 8.106 5.25e-16 ***
## CAR_USEPrivate -4.446e-01 6.295e-02 -7.063 1.63e-12 ***
## TIF -5.430e-02 7.237e-03 -7.503 6.23e-14 ***
## OLDCLAIM 8.934e-05 4.204e-05 2.125 0.033556 *
## CLM_FREQ 1.172e-01 3.181e-02 3.684 0.000229 ***
## REVOKEDYes 7.742e-01 7.946e-02 9.743 < 2e-16 ***
## MVR_PTS 1.123e-01 1.357e-02 8.279 < 2e-16 ***
## URBANICITYHighly Urban/ Urban 2.259e+00 1.110e-01 20.345 < 2e-16 ***
## NOHOMEKIDS -3.159e-01 8.741e-02 -3.614 0.000301 ***
## NOKIDSDRIV -5.277e-01 9.427e-02 -5.598 2.17e-08 ***
## HASCOLLEGE -5.808e-01 6.894e-02 -8.426 < 2e-16 ***
## ISPROFESSIONAL -4.582e-01 7.636e-02 -6.001 1.96e-09 ***
## ISMINIVAN -7.028e-01 7.338e-02 -9.578 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7444.6 on 8143 degrees of freedom
## AIC: 7480.6
##
## Number of Fisher Scoring iterations: 5
In the second model, both the cross-validation accuracy and the area under the receiver operating characteristic curve have increased from the first model. The blue ROC curve sits closer to the upper-left corner, indicating that this model is better at identifying true positives and true negatives. The model has no substantial collinearity, and the large p-value from the Hosmer-Lemeshow goodness-of-fit test means that we fail to reject the null hypothesis that the model fits well.
| model_name | n_vars | model_pvalue | residual_deviance | H_L_pvalue | VIF_gt_4 | CV_accuracy | AUC |
|---|---|---|---|---|---|---|---|
| bk_elim_orig_vars | 16 | 0.00e+00 | 7642.470 | 0.841 | 0 | 0.778 | 0.788 |
| bk_elim_transf_vars | 17 | 0.00e+00 | 7444.605 | 0.766 | 0 | 0.782 | 0.803 |
Binary Regression Model #3: Transformed Variables with BIC
For our third model, let’s apply the BIC selection process to the transformed data set. This model has two fewer variables than the previous one, leaving 15 statistically significant variables. All five of the newly created dummy variables were included.
##
## Call:
## glm(formula = TARGET_FLAG ~ YOJ + HOME_VAL + MSTATUS + TRAVTIME +
## CAR_USE + TIF + CLM_FREQ + REVOKED + MVR_PTS + URBANICITY +
## NOHOMEKIDS + NOKIDSDRIV + HASCOLLEGE + ISPROFESSIONAL + ISMINIVAN,
## family = "binomial", data = dplyr::select(train_transformed,
## -TARGET_AMT))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2641 -0.7257 -0.4158 0.6860 3.2336
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.247e+00 1.688e-01 -7.385 1.53e-13 ***
## YOJ -3.313e-02 7.219e-03 -4.589 4.46e-06 ***
## HOME_VAL -9.366e-05 1.944e-05 -4.817 1.46e-06 ***
## MSTATUSYes -6.203e-01 6.395e-02 -9.700 < 2e-16 ***
## TRAVTIME 1.497e-02 1.861e-03 8.046 8.58e-16 ***
## CAR_USEPrivate -4.428e-01 6.290e-02 -7.039 1.94e-12 ***
## TIF -5.438e-02 7.233e-03 -7.518 5.55e-14 ***
## CLM_FREQ 1.572e-01 2.519e-02 6.242 4.31e-10 ***
## REVOKEDYes 7.578e-01 7.911e-02 9.580 < 2e-16 ***
## MVR_PTS 1.171e-01 1.339e-02 8.748 < 2e-16 ***
## URBANICITYHighly Urban/ Urban 2.262e+00 1.105e-01 20.462 < 2e-16 ***
## NOHOMEKIDS -4.253e-01 6.833e-02 -6.224 4.84e-10 ***
## NOKIDSDRIV -5.182e-01 9.385e-02 -5.521 3.37e-08 ***
## HASCOLLEGE -5.811e-01 6.892e-02 -8.432 < 2e-16 ***
## ISPROFESSIONAL -4.568e-01 7.630e-02 -5.987 2.13e-09 ***
## ISMINIVAN -7.042e-01 7.328e-02 -9.610 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9418.0 on 8160 degrees of freedom
## Residual deviance: 7453.2 on 8145 degrees of freedom
## AIC: 7485.2
##
## Number of Fisher Scoring iterations: 5
The third model essentially matches the cross-validation accuracy and the area under the receiver operating characteristic curve of the second model. The orange ROC curve closely overlaps the blue curve from the previous model. The model has no substantial collinearity, and its large p-value for the Hosmer-Lemeshow goodness-of-fit test means that we fail to reject the null hypothesis that the model fits well.
| model_name | n_vars | model_pvalue | residual_deviance | H_L_pvalue | VIF_gt_4 | CV_accuracy | AUC |
|---|---|---|---|---|---|---|---|
| bk_elim_orig_vars | 16 | 0.00e+00 | 7642.470 | 0.841 | 0 | 0.778 | 0.788 |
| bk_elim_transf_vars | 17 | 0.00e+00 | 7444.605 | 0.766 | 0 | 0.782 | 0.803 |
| BIC_transf_var_glm | 15 | 0.00e+00 | 7453.214 | 0.693 | 0 | 0.782 | 0.803 |
4. SELECT MODELS
Evaluate Linear Models
Even though we still need to confirm how to interpret the coefficients, the third linear model performed the best of the three. Not only is the adjusted R-squared much higher, but the predicted R-squared is nearly as high, which indicates that we have not overfit the model.
| model_name | n_predictors | numdf | fstat | p.value | adj.r.squared | pre.r.squared |
|---|---|---|---|---|---|---|
| BIC_orig_var_lm | 11 | 11 | 47.726 | 3.76e-102 | 0.059 | 0.058 |
| BIC_transf_var_lm | 12 | 12 | 46.821 | 1.11e-108 | 0.063 | 0.061 |
| BIC_BCN_transf_var_lm | 16 | 16 | 136.758 | 0.00e+00 | 0.210 | 0.208 |
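Predicted R-squared here is computed from the PRESS statistic — the sum of squared leave-one-out prediction errors — following the helper functions in the Code Appendix. A minimal sketch:

```r
# PRESS residuals: ordinary residuals inflated by leverage, equivalent to leave-one-out errors
press <- function(lmod) {
  pr <- residuals(lmod) / (1 - lm.influence(lmod)$hat)
  sum(pr^2)
}

pred_r_squared <- function(lmod) {
  tss <- sum(anova(lmod)[["Sum Sq"]])    # total sum of squares
  1 - press(lmod) / tss
}

pred_r_squared(BIC_BCN_transf_var_lm)    # compare to the adjusted R-squared above
```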
The table below contains the diagnostic statistics for the models, listed in the same order as the table above. None of the three has collinearity issues, since none of their variables’ variance inflation factors exceeds 4. Additionally, the moderate p-values for the Durbin-Watson test of independence indicate that we fail to reject the null hypothesis of no autocorrelation. However, the small p-values for the nonconstant-variance test from the car package and the Anderson-Darling test indicate that we would reject the null hypotheses of homoscedasticity and normality of the residuals.
| DW.test | NCV.test | AD.test | VIF_gt_4 |
|---|---|---|---|
| 0.400 | 0 | 3.70e-24 | 0 |
| 0.432 | 0 | 3.70e-24 | 0 |
| 0.544 | 0 | 3.70e-24 | 0 |
Evaluate Binary Regression Models
The third, BIC-selected model’s cross-validation accuracy and AUC performed as well as the second model’s, but it was more parsimonious with two fewer variables.
| model_name | n_vars | model_pvalue | residual_deviance | H_L_pvalue | VIF_gt_4 | CV_accuracy | AUC |
|---|---|---|---|---|---|---|---|
| bk_elim_orig_vars | 16 | 0.00e+00 | 7642.470 | 0.841 | 0 | 0.778 | 0.788 |
| bk_elim_transf_vars | 17 | 0.00e+00 | 7444.605 | 0.766 | 0 | 0.782 | 0.803 |
| BIC_transf_var_glm | 15 | 0.00e+00 | 7453.214 | 0.693 | 0 | 0.782 | 0.803 |
Confusion Matrix Metrics
The second and third models performed nearly the same across the spectrum of classification metrics. We would only consider choosing the first model, with its higher specificity, if the insurance company were most concerned about avoiding false positives.
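A minimal sketch of how these classification metrics are computed at a 0.5 probability cutoff, assuming the fitted glm objects from the Code Appendix:

```r
library(caret)

predicted <- factor(as.integer(fitted(BIC_transf_var_glm) > 0.5))
actual    <- factor(BIC_transf_var_glm$y)

cm <- caret::confusionMatrix(predicted, actual, positive = "1")
cm$overall["Accuracy"]
cm$byClass[c("Sensitivity", "Specificity", "Precision", "F1")]
```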
The strengths of the BIC model include the following:
With only 15 variables, it is the most parsimonious.
It passes the Hosmer-Lemeshow goodness-of-fit test.
It doesn’t have multicollinearity issues.
Diagnostics
Influence Leverage Values
However, a plot of the standardized deviance residuals against the leverage values shows several observations with leverage greater than twice the average leverage value. This indicates that there may be other possible variable transformations that we have not considered.
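The cutoff referenced above is twice the average leverage, (p + 1)/n. A minimal sketch following the Code Appendix, assuming the fitted object BIC_transf_var_glm:

```r
# Flag observations whose hat values exceed twice the average leverage
hvalues      <- influence(BIC_transf_var_glm)$hat
n_predictors <- length(names(BIC_transf_var_glm$model)) - 1
avg_leverage <- (n_predictors + 1) / nrow(BIC_transf_var_glm$model)
sum(hvalues > 2 * avg_leverage)          # count of potentially influential observations
```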
Marginal Model Plots
The marginal model plots of the response variable versus the predictors and the fitted response values show that the model fits reasonably well - although there is some deviation on the right sides of TRAVTIME and MVR_PTS.
Finally, when we apply the BIC model to the evaluation data, it predicts that 205 insurance customers would have an auto accident and 1936 would not.
##
## 0 1
## 1936 205
Code Appendix
knitr::opts_chunk$set(
error = F
, message = T
#,tidy = T
, cache = T
, warning = F
, echo = F
)
# prettydoc::html_pretty:
# theme: cayman
# highlight: github
#Install & load packages
load_install <- function(pkg){
# Load packages. Install them if needed.
# CODE SOURCE: https://gist.github.com/stevenworthington/3178163
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
if (length(new.pkg)) install.packages(new.pkg, dependencies = TRUE)
sapply(pkg, require, character.only = TRUE, quietly = TRUE, warn.conflicts = FALSE)
}
# required packages
packages <- c("prettydoc","tidyverse", "caret", "pROC", "DT", "knitr", "ggthemes", "Hmisc", "psych", "corrplot", "reshape2", "car", "MASS", "ResourceSelection", "boot", "tinytex", "devtools", "VIM", "GGally", "missForest", "DMwR", "nortest")
#install_version("rmarkdown",version=1.8)
#table(load_install(packages))
data.frame(load_install(packages))
set.seed(5)
# 1. DATA EXPLORATION
train_data <- read.csv("https://raw.githubusercontent.com/kylegilde/D621-Data-Mining/master/HW4%20Auto%20Accident%20Prediction/insurance_training_data.csv") %>%
dplyr::select(-INDEX) %>%
mutate(
INCOME = as.numeric(INCOME),
HOME_VAL = as.numeric(HOME_VAL),
BLUEBOOK = as.numeric(BLUEBOOK),
OLDCLAIM = as.numeric(OLDCLAIM),
MSTATUS = as.factor(str_remove(MSTATUS, "^z_")),
SEX = as.factor(str_remove(SEX, "^z_")),
EDUCATION = as.factor(str_remove(EDUCATION, "^z_")),
JOB = as.factor(str_remove(JOB, "^z_")),
CAR_TYPE = as.factor(str_remove(CAR_TYPE, "^z_")),
URBANICITY = as.factor(str_remove(URBANICITY, "^z_"))
)
eval_data <- read.csv("https://raw.githubusercontent.com/kylegilde/D621-Data-Mining/master/HW4%20Auto%20Accident%20Prediction/insurance-evaluation-data.csv") %>%
dplyr::select(-INDEX) %>%
mutate(
INCOME = as.numeric(INCOME),
HOME_VAL = as.numeric(HOME_VAL),
BLUEBOOK = as.numeric(BLUEBOOK),
OLDCLAIM = as.numeric(OLDCLAIM),
MSTATUS = as.factor(str_remove(MSTATUS, "^z_")),
SEX = as.factor(str_remove(SEX, "^z_")),
EDUCATION = as.factor(str_remove(EDUCATION, "^z_")),
JOB = as.factor(str_remove(JOB, "^z_")),
CAR_TYPE = as.factor(str_remove(CAR_TYPE, "^z_")),
URBANICITY = as.factor(str_remove(URBANICITY, "^z_"))
)
summary_metrics <- function(df){
###Creates summary metrics table
metrics_only <- df[, sapply(df, is.numeric)]
df_metrics <- psych::describe(metrics_only, quant = c(.25,.75))
df_metrics$unique_values = rapply(metrics_only, function(x) length(unique(x)))
df_metrics <-
dplyr::select(df_metrics, n, unique_values, min, Q.1st = Q0.25, median, mean, Q.3rd = Q0.75,
max, range, sd, skew, kurtosis
)
return(df_metrics)
}
metrics_df <- summary_metrics(train_data)
datatable(round(metrics_df, 2), options = list(searching = F, paging = F))
#kable(metrics_df, digits = 1, format.args = list(big.mark = ',', scientific = F, drop0trailing = T))
###Categorical & Discrete variables Frequencies
cat_discrete_vars <- train_data %>%
mutate(
TARGET_FLAG = as.factor(TARGET_FLAG),
KIDSDRIV = as.factor(KIDSDRIV),
HOMEKIDS = as.factor(HOMEKIDS),
CLM_FREQ = as.factor(CLM_FREQ),
MVR_PTS = as.factor(MVR_PTS)
)
cat_discrete_freq <-
cat_discrete_vars[, sapply(cat_discrete_vars, is.factor)] %>%
gather("var", "value") %>%
group_by(var) %>%
count(var, value) %>%
mutate(prop = prop.table(n))
ggplot(data = cat_discrete_freq,
aes(x = reorder(value, prop),
y = prop)) +
geom_bar(stat = "identity") +
facet_wrap(~var, scales = "free") +
coord_flip() +
ggthemes::theme_fivethirtyeight()
# https://stackoverflow.com/questions/34860535/how-to-use-dplyr-to-generate-a-frequency-table?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
####Side-by-Side Boxplots
boxplot_data <- cat_discrete_vars %>%
select_if(is.factor) %>%
mutate(TARGET_AMT = cat_discrete_vars$TARGET_AMT) %>%
dplyr::select(-TARGET_FLAG) %>%
reshape2::melt(id.vars = "TARGET_AMT")
### Side-by-Side Boxplots
ggplot(data = boxplot_data, aes(x = value, y = TARGET_AMT)) +
geom_boxplot() +
facet_wrap( ~ variable, scales = "free") +
stat_summary(fun.y=mean, geom="point", size=1, color = "red") +
scale_y_sqrt(breaks = c(1000, 5000, 10000, 20000 * c(1:4))) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))# +
#Reference: https://stackoverflow.com/questions/14604439/plot-multiple-boxplot-in-one-graph?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
### Scatterplots, Histograms & Density Plots
continuous_vars <-
cat_discrete_vars %>%
mutate(TARGET_FLAG = as.numeric(TARGET_FLAG) - 1) %>%
select_if(is.numeric) %>%
mutate(TARGET_FLAG = as.factor(TARGET_FLAG))
memory.limit(size = 20000)
if (!exists("binary_plot")){
binary_plot <- GGally::ggpairs(
na.omit(continuous_vars),
mapping = ggplot2::aes(color = TARGET_FLAG),
lower = list(continuous = wrap('points', size = .9, alpha = .2),
combo = wrap('facetdensity', alpha = 1)),
upper = list(continuous = wrap("cor", size = 3, alpha = 1),
combo = 'box'),
diag = list(continuous = wrap('barDiag', alpha = .9, bins = 15 ))
) +
theme(panel.background = element_rect(fill = 'grey92', color = NA),
panel.spacing = unit(3, "pt"),
panel.grid = element_line(color = 'white'),
strip.background = element_rect(fill = "grey85", color = NA),
#plot.margin = margin(.1, .1, .1, .1, "cm"),
panel.border = element_rect(color = "grey85", fill=NA, size= unit(.5, 'pt')))
}
binary_plot
#http://ggobi.github.io/ggally/#columns_and_mapping
# train_melted <- continuous_vars %>%
# mutate(TARGET_FLAG = as.integer(TARGET_FLAG) - 1) %>%
# dplyr::select(-TARGET_AMT) %>%
# reshape::melt(id.vars = "TARGET_FLAG")
#
#
# na.omit(train_melted) %>%
# ggplot(aes(x = value, y = TARGET_FLAG)) +
# geom_point(position = position_jitter(height=.2, width=.2),
# alpha = .1,
# aes(color = TARGET_FLAG)
# ) +
# geom_smooth(method = "glm", method.args = list(family = "binomial")) +
# facet_wrap(~ variable, scales = "free")
##CORRELATIONS
cormatrix <-
continuous_vars %>%
dplyr::select(-TARGET_FLAG) %>%
cor(use = "complete.obs")
#plot
#corrplot(cormatrix, method = "square", type = "upper")
#find the top correlations
correlations <- c(cormatrix[upper.tri(cormatrix)])
cor_df <-
data.frame(Var1 = rownames(cormatrix)[row(cormatrix)[upper.tri(cormatrix)]],
Var2 = colnames(cormatrix)[col(cormatrix)[upper.tri(cormatrix)]],
Correlation = correlations,
Rsquared = correlations^2
) %>%
arrange(-Rsquared)
#Reference: https://stackoverflow.com/questions/28035001/transform-correlation-matrix-into-dataframe-with-records-for-each-row-column-pai
kable(head(cor_df, 5), digits = 2, row.names = T, caption = "Top Correlated Variable Pairs")
#Corrrelations with TARGET_AMT
TARGET_AMT_corr <- subset(cor_df, Var2 == "TARGET_AMT" | Var1 == "TARGET_AMT")
rownames(TARGET_AMT_corr) <- 1:nrow(TARGET_AMT_corr)
kable(head(TARGET_AMT_corr, 5), digits = 2, row.names = T, caption = "Top Corrrelations with the Response Variable")
## Missing Values
missing_plot <- VIM::aggr(train_data, numbers=TRUE, sortVars=TRUE,
labels=names(train_data),
ylab=c("Missing Value Counts","Pattern"))
sum(train_data$JOB == "")
summary(missing_plot)
# 2. DATA PREPARATION
## Imputing the Missing Values
train_data$CAR_AGE[train_data$CAR_AGE == -3] <- NA
eval_data <- dplyr::select(eval_data, -c(TARGET_FLAG, TARGET_AMT))
## Imputing the Missing Values
if (!exists("imputed_train")){
imputed_train <- missForest(train_data, variablewise = T)
imputed_eval <- missForest(eval_data, variablewise = T)
}
#impute_results
impute_df <-
summary(missing_plot)$missings %>%
mutate(
MSE = as.numeric(imputed_train$OOBerror),
Max = sapply(imputed_train$ximp, function(x) tryCatch(max(x), error=function(err) NA)),
Min = sapply(imputed_train$ximp, function(x) tryCatch(min(x), error=function(err) NA)),
Range = Max - Min,
NRMSE = sqrt(MSE)/Range
) %>%
filter(MSE > 0) %>%
dplyr::select(-c(Max, Min, Range)) %>%
arrange(-NRMSE)
kable(impute_df, digits = 2)
#http://rcompanion.org/handbook/G_14.html
#https://stackoverflow.com/questions/14668972/catch-an-error-by-producing-na?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
## Variable Transformations
levels(imputed_train$ximp$JOB)[1] <- "Unknown"
train_transformed <-
imputed_train$ximp %>%
mutate(
NOHOMEKIDS = as.integer(HOMEKIDS == 0),
NOKIDSDRIV = as.integer(KIDSDRIV == 0),
HASCOLLEGE = as.integer(EDUCATION %in% c("Bachelors", "Masters", "PhD")),
ISPROFESSIONAL = as.integer(JOB %in% c("Doctor", "Lawyer", "Manager", "Professional")),
ISMINIVAN = as.integer(CAR_TYPE == "Minivan")#,
#sqrt.TARGET_AMT = sqrt(TARGET_AMT)
) %>%
dplyr::select(-c(HOMEKIDS, KIDSDRIV, EDUCATION, JOB, CAR_TYPE))
# 3. BUILD MODELS
## Linear Regression
### LM #1: Original Variables with BIC
orig_var_lm <- lm(TARGET_AMT ~ ., data = dplyr::select(imputed_train$ximp, -TARGET_FLAG))
n <- nrow(orig_var_lm$model)
BIC_orig_var_lm <- step(orig_var_lm, k = log(n), trace = 0)
summary(BIC_orig_var_lm)
removed_variables <- function(larger_mod, smaller_mod){
removed <- names(coef(larger_mod))[!names(coef(larger_mod)) %in%
names(coef(smaller_mod))]
print(paste("removed variable(s):", length(removed)))
print(removed)
}
removed_variables(orig_var_lm, BIC_orig_var_lm)
PRESS <- function(linear.model) {
#source: https://gist.github.com/tomhopper/8c204d978c4a0cbcb8c0#file-press-r
#' calculate the predictive residuals
pr <- residuals(linear.model)/(1 - lm.influence(linear.model)$hat)
#' calculate the PRESS
PRESS <- sum(pr^2)
return(PRESS)
}
pred_r_squared <- function(linear.model) {
#source: https://gist.github.com/tomhopper/8c204d978c4a0cbcb8c0#file-pred_r_squared-r
#' Use anova() to get the sum of squares for the linear model
lm.anova <- anova(linear.model)
#' Calculate the total sum of squares
tss <- sum(lm.anova$'Sum Sq')
# Calculate the predictive R^2
pred.r.squared <- 1 - PRESS(linear.model)/(tss)
return(pred.r.squared)
}
lm_evaluation <- function(lmod) {
### Summarizes the model's key statistics in one row
### https://gist.github.com/stephenturner/722049#file-pvalue-from-lm-object-r-L5
lm_summary <- summary(lmod)
f <- as.numeric(lm_summary$fstatistic)
df_summary <-
data.frame(
model_name = deparse(substitute(lmod)),
n_predictors = ncol(lmod$model) - 1,
numdf = f[2],
fstat = f[1],
p.value = formatC(pf(f[1], f[2], f[3], lower.tail = F), format = "e", digits = 2),
adj.r.squared = lm_summary$adj.r.squared,
pre.r.squared = pred_r_squared(lmod)#,
#rmse = as.numeric(DMwR::regr.eval(lmod$model[1], fitted(lmod), stats = c("rmse")))
)
return(df_summary)
}
lm_diagnotics <- function(lmod){
diag_df <- data.frame(
DW.test = car::durbinWatsonTest(lmod)$p,
NCV.test = car::ncvTest(lmod)$p,
AD.test = formatC(nortest::ad.test(lmod$residuals)$p.value, format = "e", digits = 2),
VIF_gt_4 = sum(car::vif(lmod) > 4)
)
return(diag_df)
}
#evaluate performance & diagnostics
lm_results <- lm_evaluation(BIC_orig_var_lm)
lm_results_diagnostics <- lm_diagnotics(BIC_orig_var_lm)
par(mfrow=c(2,2))
plot(BIC_orig_var_lm)
#http://data.library.virginia.edu/diagnostic-plots/
#http://analyticspro.org/2016/03/07/r-tutorial-how-to-use-diagnostic-plots-for-regression-models/
### LM #2: Transformed Variables with BIC Selection
transf_var_lmod <- lm(TARGET_AMT ~ .-TARGET_FLAG, data = train_transformed)
n <- nrow(transf_var_lmod$model)
BIC_transf_var_lm <- step(transf_var_lmod, k = log(n), trace = 0)
summary(BIC_transf_var_lm)
#evaluate performance & diagnostics
model_eval <- lm_evaluation(BIC_transf_var_lm)
model_diag <- lm_diagnotics(BIC_transf_var_lm)
lm_results <- rbind(lm_results, model_eval)
lm_results_diagnostics <- rbind(lm_results_diagnostics, model_diag)
par(mfrow=c(2,2))
plot(BIC_transf_var_lm)
### LM #3: BIC Selection with Negative Box-Cox Transformation & Transformed Variables
PT <- car::powerTransform(BIC_transf_var_lm, family = "bcnPower")
#PT <- car::powerTransform(BIC_transf_var_lm, family = "yjPower")
summary(PT)
#load_install("VGAM")
train_transformed_BCN <-
train_transformed %>%
mutate(
#TARGET_AMT = car::bcnPower(TARGET_AMT, lambda = PT$lambda, gamma = PT$gamma)
#TARGET_AMT = VGAM::yeo.johnson(TARGET_AMT, PT$lambda)
TARGET_AMT = log(TARGET_AMT + 1)
#TARGET_AMT = (TARGET_AMT+PT$gamma)^PT$lambda
#TARGET_AMT = TARGET_AMT^PT$lambda
)
# Our third model will use the 2-parameter Box-Cox transformation of `TARGET_AMT` with the transformed variables. This model has four more statistically significant variables than the previous model, including all five of the transformed dummy variables we created. Moreover, adjusted R-squared is much higher at .21. This model explains much more of the variance in the response variable `TARGET_AMT`.
BCN_lmod <- lm(TARGET_AMT~.-TARGET_FLAG, data = train_transformed_BCN)
BIC_BCN_transf_var_lm <- step(BCN_lmod, k = log(n), trace = 0)
summary(BIC_BCN_transf_var_lm)
model_eval <- lm_evaluation(BIC_BCN_transf_var_lm)
model_diag <- lm_diagnotics(BIC_BCN_transf_var_lm)
lm_results <- rbind(lm_results, model_eval)
lm_results_diagnostics <- rbind(lm_results_diagnostics, model_diag)
# Interpreting the coefficients after performing 2-parameter Box-Cox transformation can be challenging. After attempting to take the inverse of the transformation, the model coefficients do not seem to make sense given the distribution of the `TARGET_AMT`. Further research is needed.
# Inverse the transformed coefficients
# BCN_inverse <- function(U, lambda, gamma){
# #Calculates the inverse of the negative Box-Cox transformation
# return((U * lambda + 1)^(1/lambda) - gamma)
# #return((U)^(1/lambda) - gamma)
# #return((U)^(1/lambda))
# }
#https://stats.stackexchange.com/questions/233611/reverse-boxcox-transformation-with-negative-values?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
#https://stackoverflow.com/questions/18464491/transforming-data-to-normality-what-is-the-best-function-for-a-given-case?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
# inversed_coef_df <-
# data.frame(var = names(coef(BIC_BCN_transf_var_lm))) %>%
# mutate(
# transformed_coef_value = coef(BIC_BCN_transf_var_lm),
# inversed_coef_value = BCN_inverse(transformed_coef_value, PT$lambda, PT$gamma)
# ) %>%
# arrange(-transformed_coef_value)
inversed_coef_df <-
data.frame(var = names(coef(BIC_BCN_transf_var_lm))) %>%
mutate(
transformed_coef_value = coef(BIC_BCN_transf_var_lm),
inversed_coef_value = exp(transformed_coef_value) - 1
#inversed_coef_value = VGAM::yeo.johnson(transformed_coef_value, PT$lambda, inverse = T)
) %>%
arrange(-transformed_coef_value)
kable(inversed_coef_df, digits = 2, format.args = list(big.mark = ',', scientific = F, drop0trailing = T))
par(mfrow=c(2,2))
plot(BIC_BCN_transf_var_lm)
backward_elimination <- function(lmod){
#performs backward elimination model selection
#removes variables until all remaining ones are stat-sig
removed_vars <- c()
removed_pvalues <- c()
#handles category dummy variables
cat_levels <- unlist(lmod$xlevels)
cat_vars <- str_sub(names(cat_levels), 1, nchar(names(cat_levels)) - 1)
cat_var_df <- data.frame(cat_vars,
dummy_vars = str_c(cat_vars, cat_levels),
stringsAsFactors = F)
# checks for p-values > .05 except for the intercept
while (max(summary(lmod)$coefficients[2:length(summary(lmod)$coefficients[, 4]), 4]) > .05){
# find insignificant pvalue
pvalues <- summary(lmod)$coefficients[2:length(summary(lmod)$coefficients[, 4]), 4]
max_pvalue <- max(pvalues)
remove <- names(which.max(pvalues))
#if categorical dummy variable, remove the variable
dummy_var <- dplyr::filter(cat_var_df, dummy_vars == remove)
remove <- ifelse(nrow(dummy_var) > 0, dummy_var[, 1], remove)
#record the removed variables
removed_vars <- c(removed_vars, remove)
removed_pvalues <- c(removed_pvalues, max_pvalue)
# update model
lmod <- update(lmod, as.formula(paste0(".~.-`", remove, "`")))
}
print(kable(data.frame(removed_vars, removed_pvalues), digits = 3))
return(lmod)
}
orig_var_glm <- glm(TARGET_FLAG ~ . , family = "binomial", data = dplyr::select(imputed_train$ximp, -TARGET_AMT))
summary(bk_elim_orig_vars <- backward_elimination(orig_var_glm))
glm_performance <- function(model) {
### Summarizes the model's key statistics
### References: https://www.r-bloggers.com/predicting-creditability-using-logistic-regression-in-r-cross-validating-the-classifier-part-2-2/
### https://www.rdocumentation.org/packages/boot/versions/1.3-20/topics/cv.glm
### https://www.rdocumentation.org/packages/ResourceSelection/versions/0.3-1/topics/hoslem.test
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
df_summary <- data.frame(
# model = bk_elim_orig_vars
model_name = deparse(substitute(model)),
n_vars = length(coef(model)) - 1,
model_pvalue = formatC(pchisq(model$null.deviance - model$deviance, 1, lower=FALSE), format = "e", digits = 2),
residual_deviance = model$deviance,
H_L_pvalue = ResourceSelection::hoslem.test(model$y, fitted(model))$p.value,
VIF_gt_4 = sum(car::vif(model) > 4),
CV_accuracy = 1 - boot::cv.glm(model$model, model, cost = cost, K = 100)$delta[1],
AUC = as.numeric(pROC::roc(model$y, fitted(model))$auc)
)
return(df_summary)
}
kable(glm_models <- glm_performance(bk_elim_orig_vars), digits = 3)
bk_elim_orig_vars_roc <- roc(bk_elim_orig_vars$y, fitted(bk_elim_orig_vars))
plot(bk_elim_orig_vars_roc, legacy.axes = T, main = "Model ROCs", col = "gray", xaxs = "i", yaxs = "i")
#https://web.archive.org/web/20160407221300/http://metaoptimize.com:80/qa/questions/988/simple-explanation-of-area-under-the-roc-curve
transf_var_glm <- glm(TARGET_FLAG ~ . , family = "binomial", data = dplyr::select(train_transformed, -TARGET_AMT))
summary(bk_elim_transf_vars <- backward_elimination(transf_var_glm))
glm_mod <- glm_performance(bk_elim_transf_vars)
kable(glm_models <- rbind(glm_models, glm_mod), digits = 3)
bk_elim_transf_vars_roc <- roc(bk_elim_transf_vars$y, fitted(bk_elim_transf_vars))
plot(bk_elim_orig_vars_roc, legacy.axes = T, main = "Model ROCs", col = "gray", xaxs = "i", yaxs = "i")
plot(bk_elim_transf_vars_roc, add = T, col = "blue", lty = 3)
### Binary Regression Model #3: Transformed Variables with BIC
BIC_transf_var_glm <- step(transf_var_glm, k = log(n), trace = 0)
summary(BIC_transf_var_glm)
glm_mod <- glm_performance(BIC_transf_var_glm)
kable(glm_models <- rbind(glm_models, glm_mod), digits = 3)
BIC_transf_var_glm_roc <- roc(BIC_transf_var_glm$y, fitted(BIC_transf_var_glm))
plot(bk_elim_orig_vars_roc, legacy.axes = T, main = "Model ROCs", col = "gray", xaxs = "i", yaxs = "i")
plot(bk_elim_transf_vars_roc, add = T, col = "blue", lty = 3)
plot(BIC_transf_var_glm_roc, add = T, col = "orange", lty = 3)
kable(lm_results, digits = 3, caption = "Model Summary Statistics")
kable(lm_results_diagnostics, digits = 3, caption = "Model Summary Statistics")
kable(glm_models, digits = 3, caption = "Model Summary Statistics")
# 4. SELECT MODELS
all_models <- as.character(glm_models$model_name)
confusion_metrics <- data.frame(metric = c("Accuracy", "Class. Error Rate", "Sensitivity", "Specificity", "Precision", "F1", "AUC"))
for (i in 1:length(all_models)){
model <- get(all_models[i])
model_name <- all_models[i]
predicted_values <- as.factor(as.integer(fitted(model) > .5))
CM <- confusionMatrix(predicted_values, as.factor(model$y), positive = "1")
caret_metrics <- c(CM$overall[1],
1 - as.numeric(CM$overall[1]),
CM$byClass[c(1, 2, 5, 7)],
get(paste0(model_name, "_roc"))$auc)
confusion_metrics[, model_name] <- caret_metrics
}
confusion_metrics_melted <- confusion_metrics %>%
reshape::melt(id.vars = "metric") %>%
dplyr::rename(model = variable)
ggplot(data = confusion_metrics_melted, aes(x = model, y = value)) +
geom_bar(aes(fill = model), stat='identity') +
theme(axis.ticks.x=element_blank(),
axis.text.x=element_blank(),
axis.title.x=element_blank()) +
facet_grid(~metric)
#https://stackoverflow.com/questions/18624394/ggplot-bar-plot-with-facet-dependent-order-of-categories/18625739?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
#https://stackoverflow.com/questions/18158461/grouped-bar-plot-in-ggplot?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa
#kable(confusion_metrics, digits = 3)
# influential leverage values
# MARR p291
hvalues <- influence(BIC_transf_var_glm)$hat
stanresDeviance <- residuals(BIC_transf_var_glm) / sqrt(1 - hvalues)
n_predictors <- length(names(BIC_transf_var_glm$model)) - 1
average_leverage <- (n_predictors + 1) / nrow(BIC_transf_var_glm$model)
plot(hvalues, stanresDeviance,
ylab = "Standardized Deviance Residuals",
xlab = "Leverage Values",
ylim = c(-3, 3),
xlim = c(0, 0.015))
abline(v = 2 * average_leverage, lty = 2)
#Reference: http://www.stat.tamu.edu/~sheather/book/docs/rcode/Chapter8.R
if(!exists("marginal_model_plots")){marginal_model_plots <- car::mmps(BIC_transf_var_glm)}
#http://support.sas.com/documentation/cdl/en/statug/68162/HTML/default/viewer.htm#statug_templt_sect027.htm
levels(imputed_eval$ximp$JOB)[1] <- "Unknown"
eval_transformed <-
imputed_eval$ximp %>%
mutate(
NOHOMEKIDS = as.integer(HOMEKIDS == 0),
NOKIDSDRIV = as.integer(KIDSDRIV == 0),
HASCOLLEGE = as.integer(EDUCATION %in% c("Bachelors", "Masters", "PhD")),
ISPROFESSIONAL = as.integer(JOB %in% c("Doctor", "Lawyer", "Manager", "Professional")),
ISMINIVAN = as.integer(CAR_TYPE == "Minivan")#,
#sqrt.TARGET_AMT = sqrt(TARGET_AMT)
) %>%
dplyr::select(-c(HOMEKIDS, KIDSDRIV, EDUCATION, JOB, CAR_TYPE))
eval_results <- predict(BIC_transf_var_glm, newdata = eval_transformed)
# NOTE: predict.glm() returns the linear predictor (log-odds) by default, so the cutoff
# below is applied on the log-odds scale; a 0.5 probability cutoff would correspond to
# thresholding at 0 here, or to using predict(..., type = "response") with 0.5.
table(as.integer(eval_results > .5))